A Survey of Evolutionary Algorithms for Clustering

Eduardo R. Hruschka, Member, IEEE, Ricardo J. G. B. Campello, Member, IEEE, Alex A. Freitas, Member, IEEE, and André C. P. L. F. de Carvalho, Member, IEEE

E. R. Hruschka, R. J. G. B. Campello, and A. C. P. L. F. de Carvalho are with the Department of Computer Sciences of the University of São Paulo (USP) at São Carlos, SP, Brazil. E-mails: {erh;campello;andre}@icmc.usp.br. A. A. Freitas is with the Computer Science Department of the University of Kent at Canterbury, Kent, UK. E-mail: A.A.Freitas@kent.ac.uk. The authors acknowledge the Brazilian research agencies CNPq and FAPESP for their financial support of this work.

To appear in IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews
Abstract — This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing more on those subjects that have been given more importance in the literature. In this context, most of the paper is devoted to partitional algorithms that look for hard clusterings of data, though overlapping (i.e., soft and fuzzy) approaches are also covered in the manuscript. The paper is original in what concerns two main aspects. First, it provides an up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and comprises advanced topics, like multi-objective and ensemble-based evolutionary clustering. Second, it provides a taxonomy that highlights some very important aspects in the context of evolutionary data clustering, namely, fixed or variable number of clusters, cluster-oriented or non-oriented operators, context-sensitive or context-insensitive operators, guided or unguided operators, binary, integer, or real encodings, and centroid-based, medoid-based, label-based, tree-based, or graph-based representations, among others. A number of references are provided that describe applications of evolutionary algorithms for clustering in different domains, such as image processing, computer security, and bioinformatics. The paper ends by addressing some important issues and open questions that can be the subject of future research.

Index Terms — evolutionary algorithms, clustering, applications.
I. INTRODUCTION
Clustering is a task whose goal is to determine a finite set of categories (clusters) to describe a data set according to similarities among its objects [75][40]. The applicability of clustering is manifold, ranging from market segmentation [17] and image processing [72] to document categorization and web mining [102]. An application field that has proven particularly promising for clustering techniques is bioinformatics [7][13][129]. Indeed, the importance of clustering gene-expression data measured with the aid of microarrays and other related technologies has grown rapidly and steadily in recent years [74][60].
Clustering techniques can be broadly divided into three main types [72]: overlapping (so-called non-exclusive), partitional, and hierarchical. The last two are related to each other in that a hierarchical clustering is a nested sequence of partitional clusterings, each of which represents a hard partition of the data set into a different number of mutually disjoint subsets. A hard partition of a data set X = {x_1, x_2, ..., x_N}, where x_j (j = 1, ..., N) stands for an n-dimensional feature or attribute vector, is a collection C = {C_1, C_2, ..., C_k} of k non-overlapping data subsets C_i ≠ ∅ (non-null clusters) such that C_1 ∪ C_2 ∪ ... ∪ C_k = X and C_i ∩ C_j = ∅ for i ≠ j. If the condition of mutual disjunction (C_i ∩ C_j = ∅ for i ≠ j) is relaxed, then the corresponding data partitions are said to be of overlapping type. Overlapping algorithms produce data partitions that can be soft (each object fully belongs to one or more clusters) [40] or fuzzy (each object belongs to one or more clusters to different degrees) [118][64].
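To make the definition concrete, here is a minimal Python sketch (our illustration, not from the paper) that checks the three conditions of a hard partition: non-null clusters, mutual disjunction, and coverage of X. Objects are identified by their indices 1, ..., N for brevity.

    def is_hard_partition(X, clusters):
        # Non-null clusters: C_i != the empty set.
        if any(len(c) == 0 for c in clusters):
            return False
        union = set()
        for c in clusters:
            if union & set(c):       # mutual disjunction: C_i ∩ C_j = ∅
                return False
            union |= set(c)
        return union == set(X)       # coverage: C_1 ∪ ... ∪ C_k = X

    # A valid three-cluster hard partition of ten objects:
    X = range(1, 11)
    print(is_hard_partition(X, [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}]))  # True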
Regardless of the type of algorithm (partitional, hierarchical, or overlapping), the main goal of clustering is to maximize both the homogeneity within each cluster and the heterogeneity among different clusters [72][3]. In other words, objects that belong to the same cluster should be more similar to each other than objects that belong to different clusters. The problem of measuring similarity is usually tackled indirectly, i.e., distance measures are used for quantifying the degree of dissimilarity among objects, in such a way that more similar objects have lower dissimilarity values [73]. Several dissimilarity measures can be employed for clustering tasks [72][132]. Each measure has its own bias and comes with its own advantages and drawbacks. Therefore, each one may be more or less suitable for a given analysis or application scenario. Indeed, it is well known that some measures are more suitable for gene clustering in bioinformatics [74], whereas other measures are more appropriate for text clustering and document categorization [114], for instance.
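As a small illustration of measure bias (ours, not the paper's), the sketch below contrasts Euclidean distance, which is sensitive to magnitude, with a Pearson-correlation-based dissimilarity (1 - r), which captures only the shape of a profile and is therefore often favored for gene-expression data:

    import numpy as np

    def euclidean(u, v):
        # Magnitude-sensitive dissimilarity.
        return np.linalg.norm(u - v)

    def pearson_dissimilarity(u, v):
        # 1 - Pearson correlation: near zero for profiles of the same
        # shape, regardless of scale or offset.
        return 1.0 - np.corrcoef(u, v)[0, 1]

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 10 * x  # same shape as x, ten times the magnitude
    print(euclidean(x, y))              # large (~49.3)
    print(pearson_dissimilarity(x, y))  # ~0.0 (identical shape)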
Clustering is deemed one of the most difficult and challenging problems in machine learning, particularly due to its unsupervised nature. The unsupervised nature of the problem implies that its structural characteristics are not known, unless some sort of domain knowledge is available in advance. Specifically, the spatial distribution of the data in terms of the number, volumes, densities, shapes, and orientations of clusters (if any) is unknown [47]. These difficulties may be further compounded by the need to deal with data objects described by attributes of distinct natures (binary, discrete, continuous, and categorical), conditions (complete and partially missing), and scales (ordinal and nominal) [72][73].
From an optimization perspective, clustering can be
formally considered as a particular kind of NP-hard grouping
problem [43]. This has stimulated the search for efficient
approximation algorithms, including not only the use of ad hoc
heuristics for particular classes or instances of problems, but
also the use of general-purpose metaheuristics (e.g., see [116]). In particular, evolutionary algorithms are metaheuristics widely believed to be effective on NP-hard problems, being able to provide near-optimal solutions to such problems in reasonable time. Under this assumption, a large number of evolutionary
algorithms for solving clustering problems have been proposed
in the literature. These algorithms are based on the
optimization of some objective function (i.e., the so-called
fitness function) that guides the evolutionary search.
This paper presents a survey of evolutionary algorithms
designed for clustering tasks. It tries to reflect the profile of
this area by focusing more on those subjects that have been
given more importance in the literature. In this context, most
of the paper is devoted to partitional algorithms that look for
hard data clusterings, though overlapping approaches are also
covered in the manuscript. It is important to stress that
comprehensive surveys on clustering have been previously
published, such as the outstanding papers by Jain et al. [73],
Jiang et al. [74], and Xu and Wunsch II [132], just to mention
a few. Nevertheless, to the best of the authors’ knowledge,
none has been fully devoted to evolutionary approaches. Reviews on related subjects have been published before, however, some by the authors themselves. For instance, in [109] the authors provide an overview of Genetic Algorithms (GAs) for clustering, but only a small subset of the existing evolutionary approaches (namely, GAs) is discussed in that reference. In [50], in turn, the author provides an
extensive review of evolutionary algorithms for data mining
applications, but the work focuses on specific evolutionary
approaches (GAs and Genetic Programming) and is mainly
intended for classification tasks, with clustering only briefly touched upon in a peripheral section. Three previous monographs [23][43][119] have also partially approached some of the issues raised in the present manuscript. In particular, Cole [23] reviewed a number of genetic algorithms for clustering published until 1997, whereas [119] provided a more recent, yet much more concise, review of evolutionary algorithms for clustering. In contrast, Falkenauer [43] describes in detail a high-level paradigm (metaheuristic) that can be adapted to deal with grouping problems broadly defined, and is thus useful for several applications, e.g., bin packing, economies of scale, conceptual clustering, and equal piles. However, data partitioning problems like those examined in the present paper are not the primary focus of Falkenauer's book [43], which was published in 1998.
Bearing the previous remarks in mind, it can be stated that
the present paper is original in the following two main aspects:
(i) It provides an up-to-date overview that is fully devoted to
evolutionary algorithms for clustering, is not limited to any
particular kind of evolutionary approach, and comprises
advanced topics, like multi-objective and ensemble-based
evolutionary clustering; and (ii) It provides a taxonomy that
allows the reader to identify every work surveyed with respect
to some very important aspects in the context of evolutionary
data clustering, such as:
- Fixed or variable number of clusters;
- Cluster-oriented or non-oriented operators;
- Context-sensitive or context-insensitive operators;
- Guided or unguided operators;
- Binary, integer, or real encodings;
- Centroid-based, medoid-based, label-based, tree-based, or graph-based representations.
By cluster-oriented operators we mean operators that are task-dependent, such as operators that copy, split, merge, and eliminate clusters of data objects, in contrast to conventional evolutionary operators that just exchange or switch bits without any regard to their task-dependent meaning. Guided operators are operators that are guided by some kind of information about the quality of individual clusters, about the quality of the overall data partition, or about their performance in previous applications, such as operators that are more likely to be applied to poor-quality clusters and operators whose probability of application is proportional to their success (or failure) in previous generations. Finally, context-sensitivity will hereafter refer to the original concept as defined by Falkenauer [43], which is limited to crossover operators. In brief, a crossover operator is context-sensitive if: (i) it is cluster-oriented; and (ii) two (possibly different) chromosomes encoding the same clustering solution do not generate a different offspring solution when they are crossed over. As a consequence, when the number of clusters, k, is fixed in advance, it can be asserted that two chromosomes encoding different clustering solutions with the same k must not produce solutions with a number of clusters other than k as a result of crossover. Of course, context-sensitivity is a more stringent property than cluster-orientation.
The remainder of this paper is organized as follows. Section
II presents a survey of evolutionary algorithms for hard
partitional clustering, whereas Section III presents a review of
evolutionary algorithms for overlapping clustering. Section IV
discusses evolutionary algorithms for multi-objective
clustering and clustering ensembles. A number of references that describe applications of evolutionary algorithms for clustering in different domains are provided in Section V.
Finally, the material presented throughout the paper is
summarized in Section VI, which also addresses some
important issues for future research.
II. HARD PARTITIONAL CLUSTERING
As mentioned in the introduction, a hard partition of a data set
X is a collection of k non-overlapping clusters of these data.
The number of clusters, k, usually must be provided in
advance by the user. In some cases, however, it can be
estimated automatically by the clustering algorithm. Section
II.A describes evolutionary algorithms for which k is assumed
to be fixed a priori, whereas Section II.B addresses algorithms
capable of estimating k during the evolutionary search.
A. Algorithms with Fixed Number of Clusters
Several papers address evolutionary algorithms to solve clustering problems for which the number of clusters (k) is known or set a priori (e.g., Bandyopadhyay and Maulik
[10]; Estivill-Castro and Murray [39]; Fränti et al. [48];
Kivijärvi et al. [79]; Krishna and Murty [83]; Krovi [84];
Bezdek et al. [14]; Kuncheva and Bezdek [85]; Lu et al.
[95][94]; Lucasius et al. [96]; Maulik and Bandyopadhyay
[100]; Merz and Zell [103]; Murthy and Chowdhury [107];
Scheunders [121]; Sheng and Liu [122]). Cole [23] reviews
and empirically assesses a number of such genetic algorithms
for clustering published up to 1997.
It is intuitive to think of algorithms that assume a fixed
number of clusters (k) as being particularly suitable for
applications in which there is information regarding the value
of k. For instance, domain knowledge may be available that
suggests a reasonable value or a small interval of values
for k. Having such information in hand, algorithms described
in this section can be potentially applied for tackling the
corresponding clustering problem. Alternatively, the reader
may think about using conventional clustering algorithms for
fixed k, such as the k-means [101][72], EM (Expectation Maximization) [34][61], and SOM (Self-Organizing Maps) [17][62] algorithms. However, these prototype-based algorithms are quite sensitive to the initialization of prototypes¹ and may get stuck at sub-optimal solutions. This is a well-known problem, which becomes more evident for more complex data sets². A common approach to alleviate this problem involves running the algorithm repeatedly for several different prototype initializations. Nevertheless, note that one can only guarantee that the best clustering solution for a fixed value of k would be found if all possible initial configurations of prototypes were evaluated. Of course, this approach is not computationally feasible in practice, especially for large data sets and large k. Running the algorithm only for a limited set of initial prototypes, in turn, may be either inefficient or not computationally attractive, depending on the number of prototype initializations to be performed.

¹ We here define a prototype as a particular feature vector that represents a given cluster. For instance, prototypes can be centroids, medoids, or any other vector computed from the data partition that represents a cluster (as in the case of typical fuzzy clustering algorithms).

² Complexity here refers to the number of different local minima and the variance of their objective function values, which are usually strongly related to the number n of data attributes and the number k of clusters.
prototype initializations to be performed.
For this reason, other approaches have been investigated.
Among them, evolutionary algorithms have shown to be
promising alternatives. Evolutionary algorithms essentially
evolve clustering solutions through operators that use
probabilistic rules to process data partitions sampled from the
search space [43]. Roughly speaking, more fitted partitions
have higher probabilities of being sampled. Thus, the
evolutionary search is biased towards more promising
clustering solutions and tends to perform a more
computationally efficient exploration of the search space than
traditional randomized approaches (e.g., multiple runs of k-
means). Besides, traditional randomized approaches do not
make use of the information on the quality of previously
assessed partitions to generate potentially better partitions. For
this reason, these algorithms tend to be less efficient (in a
probabilistic sense) than an evolutionary search.
Beyond the theoretical advantages (in terms of computational efficiency) of evolving clustering solutions, much effort has also been devoted to showing that evolutionary algorithms can provide partitions of better quality than those found by traditional algorithms. In fact, this may be possible because the parallel nature of evolutionary algorithms allows them to handle multiple solutions, possibly guided by different distance measures and different fitness evaluation functions.
This section reviews a significant part of the literature on
evolutionary algorithms for partitioning a data set into k
clusters. Potential advantages and drawbacks of each algorithm are analyzed in light of their corresponding encoding schemes, operators, fitness functions, and initialization procedures.
1) Encoding Schemes: Several encoding schemes have been
proposed in the literature. In order to explain them, let us
consider a simple pedagogical data set (Table I) formed by 10 objects x_i (i = 1, 2, ..., 10) with two attributes each (n = 2), denoted a_1 and a_2. These objects have been arbitrarily grouped into three clusters (C_1, C_2, and C_3). The clusters are depicted in Fig. 1 and are used to illustrate how partitions can be encoded to be processed by an evolutionary search. Aiming at summarizing common encodings found in the literature, we first categorize them into three types: binary, integer, and real.
TABLE I. PEDAGOGICAL DATA SET.

Object (x_i)   a_1   a_2   Cluster (C_j)
x_1              1     1   Cluster 1 (C_1)
x_2              1     2   Cluster 1 (C_1)
x_3              2     1   Cluster 1 (C_1)
x_4              2     2   Cluster 1 (C_1)
x_5             10     1   Cluster 2 (C_2)
x_6             10     2   Cluster 2 (C_2)
x_7             11     1   Cluster 2 (C_2)
x_8             11     2   Cluster 2 (C_2)
x_9              5     5   Cluster 3 (C_3)
x_10             5     6   Cluster 3 (C_3)
[Fig. 1. Pedagogical data set (see Table I): the ten objects plotted in the (a_1, a_2) plane, with Clusters 1, 2, and 3 indicated.]

a) Binary encoding: In a binary encoding, each clustering
solution (partition) is usually represented as a binary
string of length N, where N is the number of data set
objects. Each position of the binary string corresponds
to a particular object, i.e., the ith position (gene)
represents the ith object. The value of the ith gene is 1
if the ith object is a prototype and zero otherwise. For
example, the partition depicted in Fig. 1 can be
encoded by means of the string [1000100010], in which
objects 1, 5, and 9 are cluster prototypes. Clearly, such
an encoding scheme inexorably leads to a medoid-
based representation, i.e., a prototype-based
representation in which the cluster prototypes
necessarily coincide with objects from the data set. The
partition encoded into a given genotype³ can be derived by the nearest prototype rule, taking into account the proximities between objects and prototypes, in such a way that the ith object is assigned to the cluster represented by the closest (i.e., the most similar) prototype. Kuncheva and Bezdek [85] make use of this
encoding approach, which allows the evolutionary
search to be performed by means of those classical GA
operators originally developed to manipulate binary
genotypes [54][105]. However, the use of such classical
operators usually suffers from serious drawbacks in the
specific context of evolutionary clustering, as will be
further discussed in Section II.A.2.a.
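As a concrete illustration (ours, not code from the paper), the sketch below decodes the binary genotype [1000100010] for the data of Table I: the positions holding a 1 identify the medoids, and the nearest prototype rule then assigns every object to a cluster.

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def decode_binary_genotype(genotype, points):
        # Objects flagged with 1 are the cluster prototypes (medoids).
        medoids = [i for i, gene in enumerate(genotype) if gene == 1]
        # Nearest prototype rule: assign each object to its closest medoid.
        return [medoids.index(min(medoids, key=lambda m: dist2(p, points[m]))) + 1
                for p in points]

    points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 1), (10, 2),
              (11, 1), (11, 2), (5, 5), (5, 6)]
    print(decode_binary_genotype([1, 0, 0, 0, 1, 0, 0, 0, 1, 0], points))
    # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3], i.e., the partition of Fig. 1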
There is an alternative way to represent a given data
partition using a binary encoding. It is the use of a k ×
N matrix in which the rows represent clusters and the
columns represent objects. In this case, if the jth object
belongs to the ith cluster, then 1 is assigned to the ith
element of the jth column of the genotype, whereas the
other elements of the same column receive 0. For
example, using this representation, the partition
depicted in Fig. 1 would be encoded as [14]:
1 1 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0
0 0 0 0 0 0 0 0 1 1
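A small sketch (ours) of the correspondence between this matrix-based encoding and a label vector; each column holds exactly one 1, in the row of the cluster to which the object belongs:

    def labels_to_matrix(labels, k):
        # Build the k x N binary membership matrix (rows = clusters).
        return [[1 if lab == i + 1 else 0 for lab in labels] for i in range(k)]

    def matrix_to_labels(matrix):
        # Recover the label vector: the row index of the 1 in each column.
        return [next(i + 1 for i, row in enumerate(matrix) if row[j] == 1)
                for j in range(len(matrix[0]))]

    m = labels_to_matrix([1, 1, 1, 1, 2, 2, 2, 2, 3, 3], k=3)
    assert matrix_to_labels(m) == [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]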
This matrix-based binary encoding scheme has the
clear disadvantage of requiring O(kN) memory space,
against O(N) space of the usual string-based binary
encoding scheme formerly described. On the other
hand, the time it requires to recover the data partition
from a given genotype is O(kN) both in the average
and worst cases against O(knN) for the string-based
scheme (due to the nearest prototype rule
computations)⁴. This computational saving is relevant only for data sets with many attributes. When the number of attributes n is not large, the advantage of the matrix-based scheme reduces to the possibility of extending it to handle soft partitions, by allowing multiple elements of a given column to be non-null. Soft partitional clustering is discussed in Section III.

³ The terms genotype, chromosome, and individual usually have the same meaning in the literature on evolutionary algorithms and will be freely interchanged in this paper.

⁴ Actually, the nearest neighbor search can be performed in asymptotic logarithmic time by exploiting the Delaunay triangulation [81], which is the dual of the Voronoi diagram (e.g., see [98]). However, to the best of our knowledge this idea has not been explored in the context of evolutionary algorithms for clustering.
b) Integer encoding: There are two ways of representing
clustering solutions by means of integer encoding. In
the first one, a genotype is an integer vector of N
positions, where N is the number of data set objects.
Each position corresponds to a particular object, i.e.,
the ith position (gene) represents the ith data set object.
Provided that a genotype represents a partition formed
by k clusters, each gene has a value over the alphabet
{1, 2, 3, …, k}. These values define the cluster labels,
thus leading to a label-based representation. For
example, the integer vector [1111222233] represents
the clusters depicted in Fig. 1. This encoding scheme is
adopted in [84][107][83][95][94]. In particular, only
partitions formed by two clusters are addressed in [84],
thus allowing the use of a binary representation for
which each gene has a value over the alphabet {0, 1}.
This integer encoding scheme is naturally redundant, i.e., the mapping from clustering solutions to genotypes is one-to-many: there are k! different genotypes that represent the same solution.
For example, there are 3! different genotypes that
correspond to the same clustering solution represented
in Fig. 1, namely: [1111222233], [1111333322],
[2222111133], [2222333311], [3333111122], and
[3333222211]. Thus, the size of the search space to be
explored by the genetic algorithm is much larger than
the original space of solutions. Depending on the
employed operators, this augmented space may reduce
the efficiency of the genetic algorithm. One alternative for solving this problem is the use of a renumbering procedure [43], as sketched below.
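A renumbering procedure can simply relabel clusters in order of first appearance along the genotype, so that all k! redundant permutations collapse to a single canonical genotype. This is our own minimal sketch of the idea, not code from [43]:

    def renumber(genotype):
        # Map each label to a canonical label (1, 2, ...) in the order
        # in which the labels first appear along the genotype.
        canonical = {}
        for label in genotype:
            if label not in canonical:
                canonical[label] = len(canonical) + 1
        return [canonical[label] for label in genotype]

    # Two redundant encodings of the partition in Fig. 1 collapse to one:
    print(renumber([3, 3, 3, 3, 1, 1, 1, 1, 2, 2]))  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
    print(renumber([2, 2, 2, 2, 3, 3, 3, 3, 1, 1]))  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]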
Another way of representing a partition by means of an
integer encoding scheme involves using an array of k
elements to provide a medoid-based representation of
the data set. In this case, each array element represents
the index of the object x_i, i = 1, 2, ..., N (with respect to the order the objects appear in the data set) corresponding to the prototype of a given cluster. As an
example, the array [1 5 9] can represent a partition in
which objects 1, 5, and 9 are the cluster prototypes
(medoids) of the data given in Table I. Taking into
account these prototypes and assuming a nearest
prototype rule for assigning objects to clusters, the
partition depicted in Fig. 1 can be recovered. Lucasius
et al. [96], for instance, make use of this approach. This
representation scheme is also adopted, for instance, in
[39] and [122].
Conceptually speaking, representing medoids by means
of an integer array of k elements, as previously
discussed, is usually more computationally efficient
than using the string-based binary encoding scheme
described in Section II.A.1.a. However, it must be noted that such an integer encoding scheme may be
redundant if unordered genotypes are allowed, in which
case the solutions [1 5 9], [1 9 5], [5 1 9], [5 9 1], [9 1
5], and [9 5 1] encode the same partition depicted in
Fig. 1. In such a case, a renumbering procedure should
be used in order to avoid potential redundancy
problems.
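For medoid-based integer arrays, the renumbering can be as simple as keeping the genotype sorted (again our illustration, not code from the surveyed works):

    def canonical_medoids(genotype):
        # Sort the medoid indices so that every ordering of the same
        # medoid set collapses to a single representative.
        return sorted(genotype)

    assert canonical_medoids([9, 1, 5]) == canonical_medoids([5, 9, 1]) == [1, 5, 9]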
When comparing the two different integer encoding
schemes discussed in this section, one has to take into
account some different aspects that may be of interest.
Considering space complexity issues, the integer
encoding is O(N) when a label-based representation is
used, whereas it is O(k) when a medoid-based
representation is adopted. Thus, in principle, one may
conclude that the latter is more advantageous than the
former (since k is typically much lower than N).
However, this is not necessarily true. Actually, the
suitability of each of the aforementioned encoding
schemes is highly dependent upon the fitness function
used to guide the evolutionary search, as well as upon
the evolutionary operators that manipulate the
clustering solutions being evolved, as will become evident in the following sections. In brief, the label-based encoding does not require any additional processing to make available the information on the membership of each object in its corresponding cluster. Such information may be necessary for computing cluster statistics, which, in turn, can be needed for computing the fitness function and/or for guiding the application of evolutionary operators. It is easy to see that, in contrast to the label-based encoding, the medoid-based encoding requires further processing in order to recover the clusters encoded into the genotype.
Consequently, depending on the computational cost
involved in cluster recovering, a particular encoding
may become more (or less) suitable for a given
clustering problem.
c) Real encoding: In real encoding the genotypes are
made up of real numbers that represent the coordinates
of the cluster prototypes. This means that, unlike the
integer encoding scheme discussed in Section II.A.1.b,
real encoding is necessarily associated with a
prototype-based representation of partitions. However,
unlike the string-based binary encoding scheme
discussed in Section II.A.1.a, real encoding does not necessarily lead to a medoid-based representation. Instead, it may also be (and in fact usually is) associated with a centroid-based representation of the partitions, as discussed in the sequel.
If a genotype encodes k clusters in an n-dimensional space ℝⁿ, then its length is nk. Thus, the first n
positions represent the n coordinates of the first cluster
prototype, the next n positions represent the coordinates
of the second cluster prototype, and so forth. To
illustrate this, the genotype [1.5 1.5 10.5 1.5 5.0 5.5]
encodes the prototypes (1.5, 1.5), (10.5, 1.5), and (5.0,
5.5) of clusters C_1, C_2, and C_3 in Table I, respectively.
Given the genotype, the corresponding clusters can be
recovered by the nearest prototype rule, in such a way
that the ith object is assigned to the cluster represented
by the most similar prototype.
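For illustration (ours, not the paper's), decoding the centroid-based genotype from the example above amounts to reshaping the nk real values into k prototypes of n coordinates each and applying the nearest prototype rule:

    def decode_real_genotype(genotype, points, n):
        # Split the flat genotype of length n*k into k prototypes.
        prototypes = [tuple(genotype[i:i + n]) for i in range(0, len(genotype), n)]
        dist2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
        # Nearest prototype rule.
        return [min(range(len(prototypes)), key=lambda c: dist2(p, prototypes[c])) + 1
                for p in points]

    points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 1), (10, 2),
              (11, 1), (11, 2), (5, 5), (5, 6)]
    print(decode_real_genotype([1.5, 1.5, 10.5, 1.5, 5.0, 5.5], points, n=2))
    # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3], matching Fig. 1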
The genotype representation adopted in references
[121][100][103][10] follows a real encoding scheme in
which the prototype locations are not restricted to the
positions of the objects. This representation, named
centroid-based representation, is also adopted by
Fränti et al. [48] and Kivijärvi et al. [79]. These
authors, however, additionally encode into the genotype
a partitioning table that describes, for each object, the
index of the cluster to which the object belongs.
Alternatively, one could encode the real-valued
coordinates of a set of k medoids. In order to do so, it is
only necessary to enforce the constraint that the
prototype locations coincide with positions of objects in
the data set. In the pedagogical example of Table I, the
coordinates of a set of objects, e.g., {x_1, x_5, x_9}, can be represented by the genotype [1 1 10 1 5 5]. These medoids allow recovering the clusters depicted in Fig. 1 by using the nearest prototype rule as well.
The potential advantages and drawbacks of the real
encoding schemes are fundamentally the same as the
integer medoid-based encoding scheme discussed in
Section II.A.1.b, with the caveat that the former
demands O(nk) memory space in order to represent a
given genotype, whereas the latter demands only O(k)
space. Possibly for this reason, the use of a real
medoid-based encoding scheme has not been reported
in any work surveyed in the present paper.
2) Operators: A number of crossover and mutation
operators for clustering problems have been investigated. In
order to highlight common features shared by
these operators, we address them according to the encoding
schemes for which they have been designed.
a) Crossover: Falkenauer [43] addresses several
drawbacks of traditional genetic algorithms when they
are applied to tackle grouping tasks. As far as crossover
operators are concerned, an important problem to be
considered involves the context-insensitivity concept.
Formally, context-insensitivity means that [43] “the
schemata defined on the genes of the simple
chromosome do not convey useful information that
could be exploited by the implicit sampling process
carried out by a clustering genetic algorithm”. In the
following we illustrate, by means of pedagogical
examples, the context-insensitivity problem. Then,
having such examples in mind, we analyze crossover
operators frequently described in the literature.
Let us assume that genotypes [1111222233] and
[1111333322] encoded under the label-based integer
encoding discussed in Section II.A.1.b are
recombined under the standard one-point crossover, as
depicted in Fig. 2 (bold type refers to the exchanged
genetic information).
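The following sketch (our illustration of Falkenauer's argument, assuming the crossover point falls after the fifth gene) shows why this crossover is context-insensitive: although both parents encode exactly the same partition of Fig. 1, merely with permuted labels, their offspring encode a different partition, so recombining two equivalent solutions fails to preserve the solution:

    def one_point_crossover(parent_a, parent_b, point):
        # Standard one-point crossover: swap the tails after the cut point.
        return (parent_a[:point] + parent_b[point:],
                parent_b[:point] + parent_a[point:])

    # Both parents encode the SAME partition ({1..4}, {5..8}, {9, 10}),
    # only with permuted cluster labels (cf. the redundancy discussion
    # in Section II.A.1.b).
    a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
    b = [1, 1, 1, 1, 3, 3, 3, 3, 2, 2]
    print(one_point_crossover(a, b, point=5))
    # ([1, 1, 1, 1, 2, 3, 3, 3, 2, 2], [1, 1, 1, 1, 3, 2, 2, 2, 3, 3])
    # Each child now groups objects 5, 9, and 10 together, a partition
    # that neither parent represented.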
