A Survey of Evolutionary Algorithms for Clustering

Eduardo R. Hruschka, Member, IEEE, Ricardo J. G. B. Campello, Member, IEEE, Alex A. Freitas, Member, IEEE, and André C. P. L. F. de Carvalho, Member, IEEE

E. R. Hruschka, R. J. G. B. Campello, and A. C. P. L. F. de Carvalho are with the Department of Computer Sciences of the University of São Paulo (USP) at São Carlos, SP, Brazil. E-mails: {erh;campello;andre}@icmc.usp.br. A. A. Freitas is with the Computer Science Department of the University of Kent at Canterbury, Kent, UK. E-mail: A.A.Freitas@kent.ac.uk. The authors acknowledge the Brazilian research agencies CNPq and FAPESP for their financial support of this work.

To appear in IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews
Abstract — This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing more on those subjects that have been given more importance in the literature. In this context, most of the paper is devoted to partitional algorithms that look for hard clusterings of data, though overlapping (i.e., soft and fuzzy) approaches are also covered in the manuscript. The paper is original in what concerns two main aspects. First, it provides an up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and comprises advanced topics, like multi-objective and ensemble-based evolutionary clustering. Second, it provides a taxonomy that highlights some very important aspects in the context of evolutionary data clustering, namely, fixed or variable number of clusters, cluster-oriented or non-oriented operators, context-sensitive or context-insensitive operators, guided or unguided operators, binary, integer, or real encodings, and centroid-based, medoid-based, label-based, tree-based, or graph-based representations, among others. A number of references are provided that describe applications of evolutionary algorithms for clustering in different domains, such as image processing, computer security, and bioinformatics. The paper ends by addressing some important issues and open questions that can be the subject of future research.

Index Terms — evolutionary algorithms, clustering, applications.
I. INTRODUCTION
Clustering is a task whose goal is to determine a finite set of categories (clusters) to describe a data set according to similarities among its objects [75][40]. The applicability of clustering is manifold, ranging from market segmentation [17] and image processing [72] to document categorization and web mining [102]. An application field that has proven particularly promising for clustering techniques is bioinformatics [7][13][129]. Indeed, the importance of clustering gene-expression data measured with the aid of microarrays and other related technologies has grown rapidly and steadily in recent years [74][60].
Clustering techniques can be broadly divided into three main types [72]: overlapping (so-called non-exclusive), partitional, and hierarchical. The last two are related to each other in that a hierarchical clustering is a nested sequence of partitional clusterings, each of which represents a hard partition of the data set into a different number of mutually disjoint subsets. A hard partition of a data set X = {x_1, x_2, ..., x_N}, where x_j (j = 1, ..., N) stands for an n-dimensional feature or attribute vector, is a collection C = {C_1, C_2, ..., C_k} of k non-overlapping data subsets C_i ≠ ∅ (non-null clusters) such that C_1 ∪ C_2 ∪ ... ∪ C_k = X and C_i ∩ C_j = ∅ for i ≠ j. If the condition of mutual disjunction (C_i ∩ C_j = ∅ for i ≠ j) is relaxed, then the corresponding data partitions are said to be of overlapping type. Overlapping algorithms produce data partitions that can be soft (each object fully belongs to one or more clusters) [40] or fuzzy (each object belongs to one or more clusters to different degrees) [118][64].
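To make the definition concrete, here is a minimal Python sketch (our illustration, not from the paper) that checks the three conditions of a hard partition: non-null clusters, mutual disjunction, and coverage of X. Objects are identified by their indices 1, ..., N for brevity.

    def is_hard_partition(X, clusters):
        # Non-null clusters: C_i != the empty set.
        if any(len(c) == 0 for c in clusters):
            return False
        union = set()
        for c in clusters:
            if union & set(c):       # mutual disjunction: C_i ∩ C_j = ∅
                return False
            union |= set(c)
        return union == set(X)       # coverage: C_1 ∪ ... ∪ C_k = X

    # A valid three-cluster hard partition of ten objects:
    X = range(1, 11)
    print(is_hard_partition(X, [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}]))  # True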
Regardless of the type of algorithm (partitional, hierarchical, or overlapping), the main goal of clustering is to maximize both the homogeneity within each cluster and the heterogeneity among different clusters [72][3]. In other words, objects that belong to the same cluster should be more similar to each other than objects that belong to different clusters. The problem of measuring similarity is usually tackled indirectly, i.e., distance measures are used for quantifying the degree of dissimilarity among objects, in such a way that more similar objects have lower dissimilarity values [73]. Several dissimilarity measures can be employed for clustering tasks [72][132]. Each measure has its own bias and comes with its own advantages and drawbacks. Therefore, each one may be more or less suitable for a given analysis or application scenario. Indeed, it is well known that some measures are more suitable for gene clustering in bioinformatics [74], whereas other measures are more appropriate for text clustering and document categorization [114], for instance.
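As a small illustration of measure bias (ours, not the paper's), the sketch below contrasts Euclidean distance, which is sensitive to magnitude, with a Pearson-correlation-based dissimilarity (1 - r), which captures only the shape of a profile and is therefore often favored for gene-expression data:

    import numpy as np

    def euclidean(u, v):
        # Magnitude-sensitive dissimilarity.
        return np.linalg.norm(u - v)

    def pearson_dissimilarity(u, v):
        # 1 - Pearson correlation: near zero for profiles of the same
        # shape, regardless of scale or offset.
        return 1.0 - np.corrcoef(u, v)[0, 1]

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 10 * x  # same shape as x, ten times the magnitude
    print(euclidean(x, y))              # large (~49.3)
    print(pearson_dissimilarity(x, y))  # ~0.0 (identical shape)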
Clustering is deemed one of the most difficult and challenging problems in machine learning, particularly due to its unsupervised nature. The unsupervised nature of the problem implies that its structural characteristics are not known, unless some sort of domain knowledge is available in advance. Specifically, the spatial distribution of the data in terms of the number, volumes, densities, shapes, and orientations of clusters (if any) is unknown [47]. These difficulties may be further compounded by the need to deal with data objects described by attributes of distinct natures (binary, discrete, continuous, and categorical), conditions (complete and partially missing), and scales (ordinal and nominal) [72][73].
From an optimization perspective, clustering can be
formally considered as a particular kind of NP-hard grouping
problem [43]. This has stimulated the search for efficient
approximation algorithms, including not only the use of ad hoc
heuristics for particular classes or instances of problems, but
also the use of general-purpose metaheuristics (e.g., see [116]). In particular, evolutionary algorithms are metaheuristics widely believed to be effective on NP-hard problems, being able to provide near-optimal solutions to such problems in reasonable time. Under this assumption, a large number of evolutionary
algorithms for solving clustering problems have been proposed
in the literature. These algorithms are based on the
optimization of some objective function (i.e., the so-called
fitness function) that guides the evolutionary search.
This paper presents a survey of evolutionary algorithms
designed for clustering tasks. It tries to reflect the profile of
this area by focusing more on those subjects that have been
given more importance in the literature. In this context, most
of the paper is devoted to partitional algorithms that look for
hard data clusterings, though overlapping approaches are also
covered in the manuscript. It is important to stress that
comprehensive surveys on clustering have been previously
published, such as the outstanding papers by Jain et al. [73],
Jiang et al. [74], and Xu and Wunsch II [132], just to mention
a few. Nevertheless, to the best of the authors’ knowledge,
none has been fully devoted to evolutionary approaches. Reviews on related subjects have been published before, however, some by the authors themselves. For instance, in [109] the authors provide an overview of Genetic Algorithms (GAs) for clustering, but only a small subset of the existing evolutionary approaches (namely, GAs) is discussed in that reference. In [50], in turn, the author provides an
extensive review of evolutionary algorithms for data mining
applications, but the work focuses on specific evolutionary
approaches (GAs and Genetic Programming) and is mainly
intended for classification tasks, with clustering only briefly touched upon in a peripheral section. Three previous monographs [23][43][119] have also partially approached some of the issues raised in the present manuscript. In particular, Cole [23] reviewed a number of genetic algorithms for clustering published until 1997, whereas [119] provided a more recent, yet much more concise, review of evolutionary algorithms for clustering. In contrast, Falkenauer [43] describes in detail a high-level paradigm (metaheuristic) that can be adapted to deal with grouping problems broadly defined, and is thus useful for several applications, e.g., bin packing, economies of scale, conceptual clustering, and equal piles. However, data partitioning problems like those examined in the present paper are not the primary focus of Falkenauer's book [43], which was published in 1998.
Bearing the previous remarks in mind, it can be stated that
the present paper is original in the following two main aspects:
(i) It provides an up-to-date overview that is fully devoted to
evolutionary algorithms for clustering, is not limited to any
particular kind of evolutionary approach, and comprises
advanced topics, like multi-objective and ensemble-based
evolutionary clustering; and (ii) It provides a taxonomy that
allows the reader to identify every work surveyed with respect
to some very important aspects in the context of evolutionary
data clustering, such as:
- Fixed or variable number of clusters;
- Cluster-oriented or non-oriented operators;
- Context-sensitive or context-insensitive operators;
- Guided or unguided operators;
- Binary, integer, or real encodings;
- Centroid-based, medoid-based, label-based, tree-based, or graph-based representations.
By cluster-oriented operators we mean operators that are task-dependent, such as operators that copy, split, merge, and eliminate clusters of data objects, in contrast to conventional evolutionary operators that just exchange or switch bits without any regard to their task-dependent meaning. Guided operators are operators that are guided by some kind of information about the quality of individual clusters, about the quality of the overall data partition, or about their performance in previous applications, such as operators that are more likely to be applied to poor-quality clusters and operators whose probability of application is proportional to their success (or failure) in previous generations. Finally, context-sensitivity will hereafter refer to the original concept as defined by Falkenauer [43], which is limited to crossover operators. In brief, a crossover operator is context-sensitive if: (i) it is cluster-oriented; and (ii) two (possibly different) chromosomes encoding the same clustering solution do not generate a different offspring solution when they are crossed over. As a consequence, when the number of clusters, k, is fixed in advance, it can be asserted that two chromosomes encoding different clustering solutions with the same k must not produce solutions with a number of clusters other than k as a result of crossover. Of course, context-sensitivity is a more stringent property than cluster-orientation.
The remainder of this paper is organized as follows. Section
II presents a survey of evolutionary algorithms for hard
partitional clustering, whereas Section III presents a review of
evolutionary algorithms for overlapping clustering. Section IV
discusses evolutionary algorithms for multi-objective
clustering and clustering ensembles. A number of references that describe applications of evolutionary algorithms for clustering in different domains are provided in Section V.
Finally, the material presented throughout the paper is
summarized in Section VI, which also addresses some
important issues for future research.
II. HARD PARTITIONAL CLUSTERING
As mentioned in the introduction, a hard partition of a data set
X is a collection of k non-overlapping clusters of these data.
The number of clusters, k, usually must be provided in
advance by the user. In some cases, however, it can be
estimated automatically by the clustering algorithm. Section
II.A describes evolutionary algorithms for which k is assumed
to be fixed a priori, whereas Section II.B addresses algorithms
capable of estimating k during the evolutionary search.
A. Algorithms with Fixed Number of Clusters
Several papers address evolutionary algorithms to solve clustering problems for which the number of clusters (k) is known or set a priori (e.g., Bandyopadhyay and Maulik
[10]; Estivill-Castro and Murray [39]; Fränti et al. [48];
Kivijärvi et al. [79]; Krishna and Murty [83]; Krovi [84];
Bezdek et al. [14]; Kuncheva and Bezdek [85]; Lu et al.
[95][94]; Lucasius et al. [96]; Maulik and Bandyopadhyay
[100]; Merz and Zell [103]; Murthy and Chowdhury [107];
Scheunders [121]; Sheng and Liu [122]). Cole [23] reviews
and empirically assesses a number of such genetic algorithms
for clustering published up to 1997.
It is intuitive to think of algorithms that assume a fixed
number of clusters (k) as being particularly suitable for
applications in which there is information regarding the value
of k. For instance, domain knowledge may be available that
suggests a reasonable value or a small interval of values
for k. Having such information in hand, algorithms described
in this section can be potentially applied for tackling the
corresponding clustering problem. Alternatively, the reader
may think about using conventional clustering algorithms for
fixed k, such as the k-means [101][72], EM (Expectation Maximization) [34][61], and SOM (Self-Organizing Maps) [17][62] algorithms. However, these prototype-based algorithms are quite sensitive to the initialization of prototypes¹ and may get stuck at sub-optimal solutions. This is a well-known problem, which becomes more evident for more complex data sets². A common approach to alleviate this problem involves running the algorithm repeatedly for several different prototype initializations. Nevertheless, note that one can only guarantee that the best clustering solution for a fixed value of k would be found if all possible initial configurations of prototypes were evaluated. Of course, this approach is not computationally feasible in practice, especially for large data sets and large k. Running the algorithm only for a limited set of initial prototypes, in turn, may be either inefficient or not computationally attractive, depending on the number of prototype initializations to be performed.

¹ We here define a prototype as a particular feature vector that represents a given cluster. For instance, prototypes can be centroids, medoids, or any other vector computed from the data partition that represents a cluster (as in the case of typical fuzzy clustering algorithms).

² Complexity here refers to the number of different local minima and the variance of their objective function values, which are usually strongly related to the number n of data attributes and the number k of clusters.
prototype initializations to be performed.
For this reason, other approaches have been investigated.
Among them, evolutionary algorithms have shown to be
promising alternatives. Evolutionary algorithms essentially
evolve clustering solutions through operators that use
probabilistic rules to process data partitions sampled from the
search space [43]. Roughly speaking, more fitted partitions
have higher probabilities of being sampled. Thus, the
evolutionary search is biased towards more promising
clustering solutions and tends to perform a more
computationally efficient exploration of the search space than
traditional randomized approaches (e.g., multiple runs of k-
means). Besides, traditional randomized approaches do not
make use of the information on the quality of previously
assessed partitions to generate potentially better partitions. For
this reason, these algorithms tend to be less efficient (in a
probabilistic sense) than an evolutionary search.
Beyond the theoretical advantages (in terms of computational efficiency) of evolving clustering solutions, much effort has also been devoted to showing that evolutionary algorithms can provide partitions of better quality than those found by traditional algorithms. In fact, this may be possible because the parallel nature of evolutionary algorithms allows them to handle multiple solutions, possibly guided by different distance measures and different fitness evaluation functions.
This section reviews a significant part of the literature on
evolutionary algorithms for partitioning a data set into k
clusters. Potential advantages and drawbacks of each algorithm are analyzed in light of their corresponding encoding schemes, operators, fitness functions, and initialization procedures.
1) Encoding Schemes: Several encoding schemes have been
proposed in the literature. In order to explain them, let us
consider a simple pedagogical data set (Table I) formed by 10 objects x_i (i = 1, 2, ..., 10) with two attributes each (n = 2), denoted a_1 and a_2. These objects have been arbitrarily grouped into three clusters (C_1, C_2, and C_3). The clusters are depicted in Fig. 1 and are used to illustrate how partitions can be encoded to be processed by an evolutionary search. Aiming at summarizing common encodings found in the literature, we first categorize them into three types: binary, integer, and real.
TABLE I. PEDAGOGICAL DATA SET.

Object (x_i)   a_1   a_2   Cluster (C_j)
x_1              1     1   Cluster 1 (C_1)
x_2              1     2   Cluster 1 (C_1)
x_3              2     1   Cluster 1 (C_1)
x_4              2     2   Cluster 1 (C_1)
x_5             10     1   Cluster 2 (C_2)
x_6             10     2   Cluster 2 (C_2)
x_7             11     1   Cluster 2 (C_2)
x_8             11     2   Cluster 2 (C_2)
x_9              5     5   Cluster 3 (C_3)
x_10             5     6   Cluster 3 (C_3)
[Fig. 1. Pedagogical data set (see Table I): the ten objects plotted in the (a_1, a_2) plane, with Clusters 1, 2, and 3 indicated.]

a) Binary encoding: In a binary encoding, each clustering
solution (partition) is usually represented as a binary
string of length N, where N is the number of data set
objects. Each position of the binary string corresponds
to a particular object, i.e., the ith position (gene)
represents the ith object. The value of the ith gene is 1
if the ith object is a prototype and zero otherwise. For
example, the partition depicted in Fig. 1 can be
encoded by means of the string [1000100010], in which
objects 1, 5, and 9 are cluster prototypes. Clearly, such
an encoding scheme inexorably leads to a medoid-
based representation, i.e., a prototype-based
representation in which the cluster prototypes
necessarily coincide with objects from the data set. The
partition encoded into a given genotype³ can be derived by the nearest prototype rule, taking into account the proximities between objects and prototypes, in such a way that the ith object is assigned to the cluster represented by the closest (i.e., the most similar) prototype. Kuncheva and Bezdek [85] make use of this
encoding approach, which allows the evolutionary
search to be performed by means of those classical GA
operators originally developed to manipulate binary
genotypes [54][105]. However, the use of such classical
operators usually suffers from serious drawbacks in the
specific context of evolutionary clustering, as will be
further discussed in Section II.A.2.a.
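As a concrete illustration (ours, not code from the paper), the sketch below decodes the binary genotype [1000100010] for the data of Table I: the positions holding a 1 identify the medoids, and the nearest prototype rule then assigns every object to a cluster.

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def decode_binary_genotype(genotype, points):
        # Objects flagged with 1 are the cluster prototypes (medoids).
        medoids = [i for i, gene in enumerate(genotype) if gene == 1]
        # Nearest prototype rule: assign each object to its closest medoid.
        return [medoids.index(min(medoids, key=lambda m: dist2(p, points[m]))) + 1
                for p in points]

    points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 1), (10, 2),
              (11, 1), (11, 2), (5, 5), (5, 6)]
    print(decode_binary_genotype([1, 0, 0, 0, 1, 0, 0, 0, 1, 0], points))
    # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3], i.e., the partition of Fig. 1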
There is an alternative way to represent a given data
partition using a binary encoding. It is the use of a k ×
N matrix in which the rows represent clusters and the
columns represent objects. In this case, if the jth object
belongs to the ith cluster, then 1 is assigned to the ith
element of the jth column of the genotype, whereas the
other elements of the same column receive 0. For
example, using this representation, the partition
depicted in Fig. 1 would be encoded as [14]:
1 1 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0
0 0 0 0 0 0 0 0 1 1
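A small sketch (ours) of the correspondence between this matrix-based encoding and a label vector; each column holds exactly one 1, in the row of the cluster to which the object belongs:

    def labels_to_matrix(labels, k):
        # Build the k x N binary membership matrix (rows = clusters).
        return [[1 if lab == i + 1 else 0 for lab in labels] for i in range(k)]

    def matrix_to_labels(matrix):
        # Recover the label vector: the row index of the 1 in each column.
        return [next(i + 1 for i, row in enumerate(matrix) if row[j] == 1)
                for j in range(len(matrix[0]))]

    m = labels_to_matrix([1, 1, 1, 1, 2, 2, 2, 2, 3, 3], k=3)
    assert matrix_to_labels(m) == [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]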
This matrix-based binary encoding scheme has the
clear disadvantage of requiring O(kN) memory space,
against O(N) space of the usual string-based binary
encoding scheme formerly described. On the other
hand, the time it requires to recover the data partition
from a given genotype is O(kN) both in the average
and worst cases against O(knN) for the string-based
scheme (due to the nearest prototype rule
computations)⁴. This computational saving is relevant only for data sets with many attributes. When the number of attributes n is not large, the advantage of the matrix-based scheme reduces to the possibility of extending it to handle soft partitions, by allowing multiple elements of a given column to be non-null. Soft partitional clustering is discussed in Section III.

³ The terms genotype, chromosome, and individual usually have the same meaning in the literature on evolutionary algorithms and will be freely interchanged in this paper.

⁴ Actually, the nearest neighbor search can be performed in asymptotic logarithmic time by exploiting the Delaunay triangulation [81], which is the dual of the Voronoi diagram (e.g., see [98]). However, to the best of our knowledge this idea has not been explored in the context of evolutionary algorithms for clustering.
b) Integer encoding: There are two ways of representing
clustering solutions by means of integer encoding. In
the first one, a genotype is an integer vector of N
positions, where N is the number of data set objects.
Each position corresponds to a particular object, i.e.,
the ith position (gene) represents the ith data set object.
Provided that a genotype represents a partition formed
by k clusters, each gene has a value over the alphabet
{1, 2, 3, …, k}. These values define the cluster labels,
thus leading to a label-based representation. For
example, the integer vector [1111222233] represents
the clusters depicted in Fig. 1. This encoding scheme is
adopted in [84][107][83][95][94]. In particular, only
partitions formed by two clusters are addressed in [84],
thus allowing the use of a binary representation for
which each gene has a value over the alphabet {0, 1}.
This integer encoding scheme is naturally redundant, i.e., the mapping from clustering solutions to genotypes is one-to-many: there are k! different genotypes that represent the same solution.
For example, there are 3! different genotypes that
correspond to the same clustering solution represented
in Fig. 1, namely: [1111222233], [1111333322],
[2222111133], [2222333311], [3333111122], and
[3333222211]. Thus, the size of the search space to be
explored by the genetic algorithm is much larger than
the original space of solutions. Depending on the
employed operators, this augmented space may reduce
the efficiency of the genetic algorithm. One alternative for solving this problem is the use of a renumbering procedure [43], as sketched below.
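A renumbering procedure can simply relabel clusters in order of first appearance along the genotype, so that all k! redundant permutations collapse to a single canonical genotype. This is our own minimal sketch of the idea, not code from [43]:

    def renumber(genotype):
        # Map each label to a canonical label (1, 2, ...) in the order
        # in which the labels first appear along the genotype.
        canonical = {}
        for label in genotype:
            if label not in canonical:
                canonical[label] = len(canonical) + 1
        return [canonical[label] for label in genotype]

    # Two redundant encodings of the partition in Fig. 1 collapse to one:
    print(renumber([3, 3, 3, 3, 1, 1, 1, 1, 2, 2]))  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
    print(renumber([2, 2, 2, 2, 3, 3, 3, 3, 1, 1]))  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]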
Another way of representing a partition by means of an
integer encoding scheme involves using an array of k
elements to provide a medoid-based representation of
the data set. In this case, each array element represents
the index of the object x_i, i = 1, 2, ..., N (with respect to the order the objects appear in the data set) corresponding to the prototype of a given cluster. As an
example, the array [1 5 9] can represent a partition in
which objects 1, 5, and 9 are the cluster prototypes
(medoids) of the data given in Table I. Taking into
account these prototypes and assuming a nearest
prototype rule for assigning objects to clusters, the
partition depicted in Fig. 1 can be recovered. Lucasius
et al. [96], for instance, make use of this approach. This
representation scheme is also adopted, for instance, in
[39] and [122].
Conceptually speaking, representing medoids by means
of an integer array of k elements, as previously
discussed, is usually more computationally efficient
than using the string-based binary encoding scheme
described in Section II.A.1.a. However, it must be noted that such an integer encoding scheme may be
redundant if unordered genotypes are allowed, in which
case the solutions [1 5 9], [1 9 5], [5 1 9], [5 9 1], [9 1
5], and [9 5 1] encode the same partition depicted in
Fig. 1. In such a case, a renumbering procedure should
be used in order to avoid potential redundancy
problems.
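For medoid-based integer arrays, the renumbering can be as simple as keeping the genotype sorted (again our illustration, not code from the surveyed works):

    def canonical_medoids(genotype):
        # Sort the medoid indices so that every ordering of the same
        # medoid set collapses to a single representative.
        return sorted(genotype)

    assert canonical_medoids([9, 1, 5]) == canonical_medoids([5, 9, 1]) == [1, 5, 9]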
When comparing the two different integer encoding
schemes discussed in this section, one has to take into
account some different aspects that may be of interest.
Considering space complexity issues, the integer
encoding is O(N) when a label-based representation is
used, whereas it is O(k) when a medoid-based
representation is adopted. Thus, in principle, one may
conclude that the latter is more advantageous than the
former (since k is typically much lower than N).
However, this is not necessarily true. Actually, the
suitability of each of the aforementioned encoding
schemes is highly dependent upon the fitness function
used to guide the evolutionary search, as well as upon
the evolutionary operators that manipulate the
clustering solutions being evolved, as will become evident in the following sections. In brief, the label-based encoding does not require any additional processing to make available the information on the membership of each object in its corresponding cluster. Such information may be necessary for computing cluster statistics, which, in turn, can be needed for computing the fitness function and/or for guiding the application of evolutionary operators. It is easy to see that, in contrast to the label-based encoding, the medoid-based encoding requires further processing in order to recover the clusters encoded into the genotype.
Consequently, depending on the computational cost
involved in cluster recovering, a particular encoding
may become more (or less) suitable for a given
clustering problem.
c) Real encoding: In real encoding the genotypes are
made up of real numbers that represent the coordinates
of the cluster prototypes. This means that, unlike the
integer encoding scheme discussed in Section II.A.1.b,
real encoding is necessarily associated with a
prototype-based representation of partitions. However,
unlike the string-based binary encoding scheme
discussed in Section II.A.1.a, real encoding does not necessarily lead to a medoid-based representation. Instead, it may also be (and in fact usually is) associated with a centroid-based representation of the partitions, as discussed in the sequel.
If a genotype encodes k clusters in an n-dimensional space ℝⁿ, then its length is nk. Thus, the first n
positions represent the n coordinates of the first cluster
prototype, the next n positions represent the coordinates
of the second cluster prototype, and so forth. To
illustrate this, the genotype [1.5 1.5 10.5 1.5 5.0 5.5]
encodes the prototypes (1.5, 1.5), (10.5, 1.5), and (5.0,
5.5) of clusters C_1, C_2, and C_3 in Table I, respectively.
Given the genotype, the corresponding clusters can be
recovered by the nearest prototype rule, in such a way
that the ith object is assigned to the cluster represented
by the most similar prototype.
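For illustration (ours, not the paper's), decoding the centroid-based genotype from the example above amounts to reshaping the nk real values into k prototypes of n coordinates each and applying the nearest prototype rule:

    def decode_real_genotype(genotype, points, n):
        # Split the flat genotype of length n*k into k prototypes.
        prototypes = [tuple(genotype[i:i + n]) for i in range(0, len(genotype), n)]
        dist2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
        # Nearest prototype rule.
        return [min(range(len(prototypes)), key=lambda c: dist2(p, prototypes[c])) + 1
                for p in points]

    points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 1), (10, 2),
              (11, 1), (11, 2), (5, 5), (5, 6)]
    print(decode_real_genotype([1.5, 1.5, 10.5, 1.5, 5.0, 5.5], points, n=2))
    # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3], matching Fig. 1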
The genotype representation adopted in references
[121][100][103][10] follows a real encoding scheme in
which the prototype locations are not restricted to the
positions of the objects. This representation, named
centroid-based representation, is also adopted by
Fränti et al. [48] and Kivijärvi et al. [79]. These
authors, however, additionally encode into the genotype
a partitioning table that describes, for each object, the
index of the cluster to which the object belongs.
Alternatively, one could encode the real-valued
coordinates of a set of k medoids. In order to do so, it is
only necessary to enforce the constraint that the
prototype locations coincide with positions of objects in
the data set. In the pedagogical example of Table I, the
coordinates of a set of objects, e.g., {x_1, x_5, x_9}, can be represented by the genotype [1 1 10 1 5 5]. These medoids allow recovering the clusters depicted in Fig. 1 by using the nearest prototype rule as well.
The potential advantages and drawbacks of the real
encoding schemes are fundamentally the same as the
integer medoid-based encoding scheme discussed in
Section II.A.1.b, with the caveat that the former
demands O(nk) memory space in order to represent a
given genotype, whereas the latter demands only O(k)
space. Possibly for this reason, the use of a real
medoid-based encoding scheme has not been reported
in any work surveyed in the present paper.
2) Operators: A number of crossover and mutation
operators for clustering problems have been investigated. In
order to highlight common features shared by
these operators, we address them according to the encoding
schemes for which they have been designed.
a) Crossover: Falkenauer [43] addresses several
drawbacks of traditional genetic algorithms when they
are applied to tackle grouping tasks. As far as crossover
operators are concerned, an important problem to be
considered involves the context-insensitivity concept.
Formally, context-insensitivity means that [43] “the
schemata defined on the genes of the simple
chromosome do not convey useful information that
could be exploited by the implicit sampling process
carried out by a clustering genetic algorithm”. In the
following we illustrate, by means of pedagogical
examples, the context-insensitivity problem. Then,
having such examples in mind, we analyze crossover
operators frequently described in the literature.
Let us assume that genotypes [1111222233] and
[1111333322] encoded under the label-based integer
encoding discussed in Section II.A.1.b are
recombined under the standard one-point crossover, as
depicted in Fig. 2 (bold type refers to the exchanged
genetic information).
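The following sketch (our illustration of Falkenauer's argument, assuming the crossover point falls after the fifth gene) shows why this crossover is context-insensitive: although both parents encode exactly the same partition of Fig. 1, merely with permuted labels, their offspring encode a different partition, so recombining two equivalent solutions fails to preserve the solution:

    def one_point_crossover(parent_a, parent_b, point):
        # Standard one-point crossover: swap the tails after the cut point.
        return (parent_a[:point] + parent_b[point:],
                parent_b[:point] + parent_a[point:])

    # Both parents encode the SAME partition ({1..4}, {5..8}, {9, 10}),
    # only with permuted cluster labels (cf. the redundancy discussion
    # in Section II.A.1.b).
    a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
    b = [1, 1, 1, 1, 3, 3, 3, 3, 2, 2]
    print(one_point_crossover(a, b, point=5))
    # ([1, 1, 1, 1, 2, 3, 3, 3, 2, 2], [1, 1, 1, 1, 3, 2, 2, 2, 3, 3])
    # Each child now groups objects 5, 9, and 10 together, a partition
    # that neither parent represented.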
