IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 3, MAY 2005 645
Survey of Clustering Algorithms
Rui Xu, Student Member, IEEE and Donald Wunsch II, Fellow, IEEE
Abstract—Data analysis plays an indispensable role for un-
derstanding various phenomena. Cluster analysis, primitive
exploration with little or no prior knowledge, consists of research
developed across a wide variety of communities. The diversity,
on one hand, equips us with many tools. On the other hand,
the profusion of options causes confusion. We survey clustering
algorithms for data sets appearing in statistics, computer science,
and machine learning, and illustrate their applications in some
benchmark data sets, the traveling salesman problem, and bioin-
formatics, a new field attracting intensive efforts. Several tightly
related topics, proximity measure, and cluster validation, are also
discussed.
Index Terms—Adaptive resonance theory (ART), clustering,
clustering algorithm, cluster validation, neural networks, prox-
imity, self-organizing feature map (SOFM).
I. INTRODUCTION
WE ARE living in a world full of data. Every day, people
encounter a large amount of information and store or
represent it as data, for further analysis and management. One
of the vital means in dealing with these data is to classify or
group them into a set of categories or clusters. Actually, as one
of the most primitive activities of human beings [14], classi-
fication plays an important and indispensable role in the long
history of human development. In order to learn a new object
or understand a new phenomenon, people always try to seek
the features that can describe it, and further compare it with
other known objects or phenomena, based on the similarity or
dissimilarity, generalized as proximity, according to cer-
tain standards or rules. “Basically, classification systems are ei-
ther supervised or unsupervised, depending on whether they as-
sign new inputs to one of a finite number of discrete supervised
classes or unsupervised categories, respectively [38], [60], [75].
Manuscript received March 31, 2003; revised September 28, 2004. This work
was supported in part by the National Science Foundation and in part by the
M. K. Finley Missouri Endowment.
The authors are with the Department of Electrical and Computer Engineering,
University of Missouri-Rolla, Rolla, MO 65409 USA (e-mail: rxu@umr.edu;
dwunsch@ece.umr.edu).
Digital Object Identifier 10.1109/TNN.2005.845141

In supervised classification, the mapping from a set of input data
vectors (x ∈ R^d, where d is the input space dimensionality) to
a finite set of discrete class labels (y ∈ {1, ..., C}, where C is
the total number of class types) is modeled in terms of some
mathematical function y = y(x, w), where w is a vector of
adjustable parameters. The values of these parameters are determined
(optimized) by an inductive learning algorithm (also termed inducer),
whose aim is to minimize an empirical risk functional (related to an
inductive principle) on a finite data set of input–output examples,
{(x_i, y_i), i = 1, ..., N}, where N is the finite cardinality of the
available representative data set [38], [60], [167]. When the inducer
reaches convergence or terminates, an induced classifier is generated [167].
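As a brief illustration of this supervised framing (our own sketch, not part of the paper; the inducer, loss function, and data below are hypothetical), the adjustable parameters w of a simple linear classifier y(x, w) can be determined by gradient descent on an empirical risk computed over the N labeled examples:

```python
import numpy as np

def fit_inducer(X, y, lr=0.1, epochs=500):
    """Minimize an empirical risk (average logistic loss) over labeled pairs (x_i, y_i)."""
    N, d = X.shape
    w = np.zeros(d + 1)                      # adjustable parameters (weights + bias)
    Xb = np.hstack([X, np.ones((N, 1))])     # append a constant feature for the bias
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))    # predicted class-1 probabilities
        grad = Xb.T @ (p - y) / N            # gradient of the empirical risk
        w -= lr * grad
    return w

def classify(X, w):
    """The induced classifier x -> y(x, w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)

# Hypothetical two-class data set of input-output examples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = fit_inducer(X, y)
print((classify(X, w) == y).mean())          # training accuracy
```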
In unsupervised classification, called clustering or ex-
ploratory data analysis, no labeled data are available [88],
[150]. The goal of clustering is to separate a finite unlabeled
data set into a finite and discrete set of “natural,” hidden data
structures, rather than provide an accurate characterization
of unobserved samples generated from the same probability
distribution [23], [60]. This can make the task of clustering fall
outside of the framework of unsupervised predictive learning
problems, such as vector quantization [60] (see Section II-C),
probability density function estimation [38] (see Section II-D),
[60], and entropy maximization [99]. It is noteworthy that
clustering differs from multidimensional scaling (perceptual
maps), whose goal is to depict all the evaluated objects in a
way that minimizes the topographical distortion while using as
few dimensions as possible. Also note that, in practice, many
(predictive) vector quantizers are also used for (nonpredictive)
clustering analysis [60].
Nonpredictive clustering is a subjective process in nature,
which precludes an absolute judgment as to the relative effi-
cacy of all clustering techniques [23], [152]. As pointed out by
Backer and Jain [17], “in cluster analysis a group of objects is
split up into a number of more or less homogeneous subgroups
on the basis of an often subjectively chosen measure of sim-
ilarity (i.e., chosen subjectively based on its ability to create
“interesting” clusters), such that the similarity between objects
within a subgroup is larger than the similarity between objects
belonging to different subgroups””¹.
Clustering algorithms partition data into a certain number
of clusters (groups, subsets, or categories). There is no univer-
sally agreed upon definition [88]. Most researchers describe a
cluster by considering the internal homogeneity and the external
separation [111], [124], [150], i.e., patterns in the same cluster
should be similar to each other, while patterns in different clus-
ters should not. Both the similarity and the dissimilarity should
be examinable in a clear and meaningful way. Here, we give
some simple mathematical descriptions of several types of clus-
tering, based on the descriptions in [124].
Given a set of input patterns X = {x_1, ..., x_j, ..., x_N},
where x_j = (x_{j1}, x_{j2}, ..., x_{jd})^T ∈ R^d and each measure x_{ji}
is said to be a feature (attribute, dimension, or variable).
(Hard) partitional clustering attempts to seek a K-partition
of X, C = {C_1, ..., C_K} (K ≤ N), such that
1) C_i ≠ ∅, i = 1, ..., K;
2) ∪_{i=1}^{K} C_i = X;
3) C_i ∩ C_j = ∅ for i, j = 1, ..., K and i ≠ j.
¹The preceding quote is taken verbatim from verbiage suggested by the
anonymous associate editor, a suggestion which we gratefully acknowledge.
Fig. 1. Clustering procedure. The typical cluster analysis consists of four steps with a feedback pathway. These steps are closely related to each other and affect
the derived clusters.
Hierarchical clustering attempts to construct a tree-like, nested
structure partition of X, H = {H_1, ..., H_Q} (Q ≤ N), such that
C_i ∈ H_m, C_j ∈ H_l, and m > l imply C_i ⊂ C_j or C_i ∩ C_j = ∅
for all i, j ≠ i, m, l = 1, ..., Q.
For hard partitional clustering, each pattern only belongs to
one cluster. However, a pattern may also be allowed to belong
to all K clusters with a degree of membership u_{i,j} ∈ [0, 1], which
represents the membership coefficient of the jth object in the
ith cluster and satisfies the following two constraints:

∑_{i=1}^{K} u_{i,j} = 1 for all j, and ∑_{j=1}^{N} u_{i,j} < N for all i,

as introduced in fuzzy set theory [293]. This is known as fuzzy
clustering, reviewed in Section II-G.
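To make the two membership notions concrete, here is a minimal sketch (ours, not part of the paper; the matrices and function names are hypothetical) that checks the hard-partition conditions 1)–3) and the fuzzy constraints above for small K × N membership matrices:

```python
import numpy as np

def is_valid_hard_partition(U):
    """U is a K x N indicator matrix: U[i, j] = 1 iff object j belongs to cluster i."""
    binary = np.isin(U, (0, 1)).all()               # memberships are 0/1
    one_cluster_each = (U.sum(axis=0) == 1).all()   # cover + disjointness: each object in exactly one cluster
    no_empty_cluster = (U.sum(axis=1) > 0).all()    # C_i is nonempty for every i
    return binary and one_cluster_each and no_empty_cluster

def is_valid_fuzzy_partition(U):
    """U is a K x N matrix of membership coefficients u_{i,j} in [0, 1]."""
    in_range = ((U >= 0) & (U <= 1)).all()
    columns_sum_to_one = np.allclose(U.sum(axis=0), 1.0)   # sum_i u_{i,j} = 1 for all j
    rows_below_N = (U.sum(axis=1) < U.shape[1]).all()      # sum_j u_{i,j} < N for all i
    return in_range and columns_sum_to_one and rows_below_N

# Three objects, two clusters: a hard partition and a fuzzy one.
U_hard = np.array([[1, 1, 0],
                   [0, 0, 1]])
U_fuzzy = np.array([[0.9, 0.6, 0.2],
                    [0.1, 0.4, 0.8]])
print(is_valid_hard_partition(U_hard), is_valid_fuzzy_partition(U_fuzzy))  # True True
```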
Fig. 1 depicts the procedure of cluster analysis with four basic
steps.
1) Feature selection or extraction. As pointed out by Jain
et al. [151], [152] and Bishop [38], feature selection
chooses distinguishing features from a set of candidates,
while feature extraction utilizes some transformations
to generate useful and novel features from the original
ones. Both are very crucial to the effectiveness of clus-
tering applications. Elegant selection of features can
greatly decrease the workload and simplify the subse-
quent design process. Generally, ideal features should be
of use in distinguishing patterns belonging to different
clusters, immune to noise, easy to extract and interpret.
We elaborate the discussion on feature extraction in
Section II-L, in the context of data visualization and
dimensionality reduction. More information on feature
selection can be found in [38], [151], and [250].
2) Clustering algorithm design or selection. The step is
usually combined with the selection of a corresponding
proximity measure and the construction of a criterion
function. Patterns are grouped according to whether
they resemble each other. Obviously, the proximity
measure directly affects the formation of the resulting
clusters. Almost all clustering algorithms are explicitly
or implicitly connected to some definition of proximity
measure. Some algorithms even work directly on the
proximity matrix, as defined in Section II-A. Once
a proximity measure is chosen, the construction of a
clustering criterion function makes the partition of
clusters an optimization problem, which is well defined
mathematically and has rich solutions in the literature.
Clustering is ubiquitous, and a wealth of clustering algorithms
has been developed to solve different problems in
specific fields. However, there is no clustering algorithm
that can be universally used to solve all problems. “It has
been very difficult to develop a unified framework for
reasoning about it (clustering) at a technical level, and
profoundly diverse approaches to clustering” [166], as
proved through an impossibility theorem. Therefore, it
is important to carefully investigate the characteristics
of the problem at hand, in order to select or design an
appropriate clustering strategy.
3) Cluster validation. Given a data set, each clustering
algorithm can always generate a division, no matter
whether the structure exists or not. Moreover, different
approaches usually lead to different clusters; and even
for the same algorithm, parameter identification or
the presentation order of input patterns may affect the
final results. Therefore, effective evaluation standards
and criteria are important to provide the users with a
degree of confidence for the clustering results derived
from the used algorithms. These assessments should
be objective and have no preferences to any algorithm.
Also, they should be useful for answering questions
like how many clusters are hidden in the data, whether
the clusters obtained are meaningful or just an artifact
of the algorithms, or why we choose some algorithm
instead of another. Generally, there are three categories
of testing criteria: external indices, internal indices,
and relative indices. These are defined on three types
of clustering structures, known as partitional clus-
tering, hierarchical clustering, and individual clusters
[150]. Tests for the situation where no clustering
structure exists in the data are also considered [110],
but seldom used, since users are confident of the presence
of clusters. External indices are based on some
prespecified structure, which is the reflection of prior
information on the data, and used as a standard to
validate the clustering solutions. Internal tests are not
dependent on external information (prior knowledge).
On the contrary, they examine the clustering structure
directly from the original data. Relative criteria place
the emphasis on the comparison of different clustering
structures, in order to provide a reference, to decide
which one may best reveal the characteristics of the
objects. We will not survey the topic in depth and refer
interested readers to [74], [110], and [150]. However,
we will cover more details on how to determine the
number of clusters in Section II-M. Some more recent
discussion can be found in [22], [37], [121], [180],
and [181]. Approaches for fuzzy clustering validity
are reported in [71], [104], [123], and [220].
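(A brief illustrative sketch of one external index is given after this list.)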
4) Results interpretation. The ultimate goal of clustering
is to provide users with meaningful insights from the
original data, so that they can effectively solve the
problems encountered. Experts in the relevant fields
interpret the data partition. Further analyses, even exper-
iments, may be required to guarantee the reliability of
extracted knowledge.
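As a concrete illustration of the external-index idea from step 3 (our own sketch; the paper does not prescribe a particular index here, and the label vectors below are hypothetical), the Rand index compares a clustering result against a prespecified reference labeling by counting object pairs on which the two groupings agree:

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """External validation: fraction of object pairs on which the two labelings agree."""
    assert len(labels_pred) == len(labels_true)
    agreements, pairs = 0, 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]   # grouped together by the algorithm?
        same_true = labels_true[i] == labels_true[j]   # grouped together in the reference?
        agreements += (same_pred == same_true)
        pairs += 1
    return agreements / pairs

# Hypothetical example: a clustering result vs. reference labels.
print(rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))  # 0.8
```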
Note that the flow chart also includes a feedback pathway.
Cluster analysis is not a one-shot process. In many circumstances,
it needs a series of trials and repetitions. Moreover, there are no
universal and effective criteria to guide the selection of features
and clustering schemes. Validation criteria provide some insights
on the quality of clustering solutions, but even how to choose the
appropriate criterion is still a problem requiring more effort.
Clustering has been applied in a wide variety of fields,
ranging from engineering (machine learning, artificial intelli-
gence, pattern recognition, mechanical engineering, electrical
engineering), computer sciences (web mining, spatial database
analysis, textual document collection, image segmentation),
life and medical sciences (genetics, biology, microbiology,
paleontology, psychiatry, clinic, pathology), to earth sciences
(geography, geology, remote sensing), social sciences (soci-
ology, psychology, archeology, education), and economics
(marketing, business) [88], [127]. Accordingly, clustering is
also known as numerical taxonomy, learning without a teacher
(or unsupervised learning), typological analysis and partition.
The diversity reflects the important position of clustering in
scientific research. On the other hand, it causes confusion, due
to the differing terminologies and goals. Clustering algorithms
developed to solve a particular problem, in a specialized field,
usually make assumptions in favor of the application of interest.
These biases inevitably affect performance in other problems
that do not satisfy these premises. For example, the K-means
algorithm is based on the Euclidean measure and, hence, tends
to generate hyperspherical clusters. But if the real clusters are
in other geometric forms, K-means may no longer be effective,
and we need to resort to other schemes. This situation also
holds true for mixture-model clustering, in which a model is fit
to data in advance.
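To make this bias concrete, here is a minimal K-means sketch (ours, not from the paper; the toy data are hypothetical): because each point is assigned to the nearest centroid under the Euclidean distance, the induced clusters tend toward compact, roughly hyperspherical shapes and can cut across elongated structures.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: Euclidean assignment step + centroid update step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two elongated (non-spherical) clusters; K-means may split them poorly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], [5.0, 0.5], (100, 2)),
               rng.normal([0, 3], [5.0, 0.5], (100, 2))])
labels, centroids = kmeans(X, K=2)
```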
Clustering has a long history, with lineage dating back to Aris-
totle [124]. General references on clustering techniques include
[14], [75], [77], [88], [111], [127], [150], [161], [259]. Important
survey papers on clustering techniques also exist in the literature.
Starting from a statistical pattern recognition viewpoint, Jain,
Murty, and Flynn reviewed the clustering algorithms and other im-
portant issues related to cluster analysis [152], while Hansen and
Jaumard described the clustering problems under a mathematical
programming scheme [124]. Kolatch and He investigated appli-
cations of clustering algorithms for spatial database systems [171]
and information retrieval [133], respectively. Berkhin further ex-
panded the topic to the whole eld of data mining [33]. Murtagh
reported the advances in hierarchical clustering algorithms [210]
and Baraldi surveyed several models for fuzzy and neural network
clustering [24]. Some more survey papers can also be found in
[25], [40], [74], [89], and [151]. In addition to the review papers,
comparative research on clustering algorithms is also significant.
Rauber, Paralic, and Pampalk presented empirical results for five
typical clustering algorithms [231]. Wei, Lee, and Hsu placed the
emphasis on the comparison of fast algorithms for large databases
[280]. Scheunders compared several clustering techniques for
color image quantization, with emphasis on computational time
and the possibility of obtaining global optima [239]. Applications
and evaluations of different clustering algorithms for the analysis
of gene expression data from DNA microarray experiments were
described in [153], [192], [246], and [271]. Experimental evaluations
of document clustering techniques, based on hierarchical and
K-means clustering algorithms, were summarized by Steinbach,
Karypis, and Kumar [261].
In contrast to the above, the purpose of this paper is to pro-
vide a comprehensive and systematic description of the influ-
ential and important clustering algorithms rooted in statistics,
computer science, and machine learning, with emphasis on new
advances in recent years.
The remainder of the paper is organized as follows. In Sec-
tion II, we review clustering algorithms, based on the nature of
the generated clusters and the techniques and theories behind them.
Furthermore, we discuss approaches for clustering sequential
data, large data sets, data visualization, and high-dimensional
data through dimension reduction. Two important issues on
cluster analysis, including proximity measure and how to
choose the number of clusters, are also summarized in the
section. This is the longest section of the paper, so, for conve-
nience, we give an outline of Section II in bullet form here:
II. Clustering Algorithms
A. Distance and Similarity Measures
(See also Table I)
B. Hierarchical
Agglomerative
Single linkage, complete linkage, group average
linkage, median linkage, centroid linkage, Ward's
method, balanced iterative reducing and clustering
using hierarchies (BIRCH), clustering using rep-
resentatives (CURE), robust clustering using links
(ROCK)
Divisive
divisive analysis (DIANA), monothetic analysis
(MONA)
C. Squared Error-Based (Vector Quantization)
K-means, iterative self-organizing data analysis
technique (ISODATA), genetic K-means algorithm
(GKA), partitioning around medoids (PAM)
D. pdf Estimation via Mixture Densities
Gaussian mixture density decomposition (GMDD),
AutoClass
E. Graph Theory-Based
Chameleon, Delaunay triangulation graph (DTG),
highly connected subgraphs (HCS), clustering identification
via connectivity kernels (CLICK), cluster
affinity search technique (CAST)

[TABLE I: Similarity and Dissimilarity Measures for Quantitative Features]
F. Combinatorial Search Techniques-Based
Genetically guided algorithm (GGA), TS clustering,
SA clustering
G. Fuzzy
Fuzzy c-means (FCM), mountain method (MM), pos-
sibilistic c-means clustering algorithm (PCM), fuzzy
c-shells (FCS)
H. Neural Networks-Based
Learning vector quantization (LVQ), self-organizing
feature map (SOFM), ART, simplified ART (SART),
hyperellipsoidal clustering network (HEC), self-split-
ting competitive learning network (SPLL)
I. Kernel-Based
Kernel K-means, support vector clustering (SVC)
J. Sequential Data
Sequence Similarity
Indirect sequence clustering
Statistical sequence clustering
K. Large-Scale Data Sets (See also Table II)
CLARA, CURE, CLARANS, BIRCH, DBSCAN,
DENCLUE, WaveCluster, FC, ART
L. Data Visualization and High-Dimensional Data
PCA, ICA, Projection pursuit, Isomap, LLE,
CLIQUE, OptiGrid, ORCLUS
M. How Many Clusters?
Applications in two benchmark data sets, the traveling
salesman problem, and bioinformatics are illustrated in Sec-
tion III. We conclude the paper in Section IV.
II. CLUSTERING ALGORITHMS
Different starting points and criteria usually lead to different
taxonomies of clustering algorithms [33], [88], [124], [150],
[152], [171]. A rough but widely agreed frame is to classify
clustering techniques as hierarchical clustering and parti-
tional clustering, based on the properties of clusters generated
[88], [152]. Hierarchical clustering groups data objects with
a sequence of partitions, either from singleton clusters to a
cluster including all individuals or vice versa, while partitional
clustering directly divides data objects into some prespecified
number of clusters without the hierarchical structure. We
follow this frame in surveying the clustering algorithms in the
literature. Beginning with the discussion on proximity measure,
which is the basis for most clustering algorithms, we focus on
hierarchical clustering and classical partitional clustering algo-
rithms in Section II-B–D. Starting from part E, we introduce
and analyze clustering algorithms based on a wide variety of
theories and techniques, including graph theory, combinato-
rial search techniques, fuzzy set theory, neural networks, and
kernel techniques. Compared with graph theory and fuzzy set
theory, which had already been widely used in cluster analysis
before the 1980s, the other techniques have been finding their
applications in clustering only in recent decades. In spite of
the short history, much progress has been achieved. Note that
these techniques can be used for both hierarchical and parti-
tional clustering. Considering the more frequent requirement of
tackling sequential, large-scale, and high-dimensional
data sets in many current applications, we review clustering
algorithms for them in the following three parts. We focus
particular attention on clustering algorithms applied in bioin-
formatics. We offer more detailed discussion on how to identify
the appropriate number of clusters, which is particularly important
in cluster validity, in the last part of the section.

[TABLE II: Computational Complexity of Clustering Algorithms]
A. Distance and Similarity Measures
It is natural to ask what kind of standards we should use to
determine the closeness, or how to measure the distance (dis-
similarity) or similarity between a pair of objects, an object and
a cluster, or a pair of clusters. In the next section on hierarchical
clustering, we will illustrate linkage metrics for measuring prox-
imity between clusters. Usually, a prototype is used to represent
a cluster so that it can be further processed like other objects.
Here, we focus on reviewing approaches for measuring proximity
between individuals, due to the previous consideration.
A data object is described by a set of features, usually repre-
sented as a multidimensional vector. The features can be quan-
titative or qualitative, continuous or binary, nominal or ordinal,
which determine the corresponding measure mechanisms.
A distance or dissimilarity function D on a data set X is defined
to satisfy the following conditions:
1) Symmetry: D(x_i, x_j) = D(x_j, x_i);
2) Positivity: D(x_i, x_j) ≥ 0 for all x_i and x_j.
If conditions
3) Triangle inequality: D(x_i, x_j) ≤ D(x_i, x_k) + D(x_k, x_j) for all x_i, x_j, and x_k; and
4) Reflexivity: D(x_i, x_j) = 0 iff x_i = x_j
also hold, it is called a metric.
Likewise, a similarity function S is defined to satisfy the
following conditions:
1) Symmetry: S(x_i, x_j) = S(x_j, x_i);
2) Positivity: 0 ≤ S(x_i, x_j) ≤ 1 for all x_i and x_j.
If it also satisfies conditions
3) S(x_i, x_j)S(x_j, x_k) ≤ [S(x_i, x_j) + S(x_j, x_k)]S(x_i, x_k) for all x_i, x_j, and x_k; and
4) S(x_i, x_j) = 1 iff x_i = x_j,
it is called a similarity metric.
For a data set with N input patterns, we can define an N × N
symmetric matrix, called the proximity matrix, whose (i, j)th
element represents the similarity or dissimilarity measure for
the ith and jth patterns (i, j = 1, ..., N).
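As a small illustration (our own sketch, not part of the paper; the data points are hypothetical), the following builds such a proximity matrix under the Euclidean distance, one of the dissimilarity measures summarized in Table I:

```python
import numpy as np

def proximity_matrix(X):
    """N x N symmetric matrix of pairwise Euclidean distances D(x_i, x_j)."""
    diff = X[:, None, :] - X[None, :, :]          # shape (N, N, d)
    return np.sqrt((diff ** 2).sum(axis=2))       # D[i, j] = ||x_i - x_j||

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = proximity_matrix(X)
print(np.allclose(D, D.T), D[0, 1], D[0, 2])      # True 5.0 10.0
```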
Typically, distance functions are used to measure continuous
features, while similarity measures are more important for qual-
itative variables. We summarize some typical measures for con-
tinuous features in Table I. The selection of different measures
is problem dependent. For binary features, a similarity measure
is commonly used (dissimilarity measures can be obtained by
simply using D = 1 − S). Suppose we use two binary subscripts
to count features in two objects: n_{00} and n_{11} represent
the number of simultaneous absences or presences of features in
the two objects, and n_{10} and n_{01} count the features present in
only one object. Then two types of commonly used similarity measures
for data points x_i and x_j are illustrated in the following.

Simple matching coefficient: S(x_i, x_j) = (n_{11} + n_{00}) / (n_{11} + n_{00} + n_{10} + n_{01}).
Rogers and Tanimoto measure: S(x_i, x_j) = (n_{11} + n_{00}) / (n_{11} + n_{00} + 2(n_{10} + n_{01})).
Gower and Legendre measure: S(x_i, x_j) = (n_{11} + n_{00}) / (n_{11} + n_{00} + (n_{10} + n_{01})/2).

These measures compute the match between two objects
directly. Unmatched pairs are weighted based on their
contribution to the similarity.

Jaccard coefficient: S(x_i, x_j) = n_{11} / (n_{11} + n_{10} + n_{01}).
Sokal and Sneath measure: S(x_i, x_j) = n_{11} / (n_{11} + 2(n_{10} + n_{01})).
Gower and Legendre measure: S(x_i, x_j) = n_{11} / (n_{11} + (n_{10} + n_{01})/2).
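For concreteness, a small sketch (ours, not from the paper; the binary feature vectors are hypothetical) that computes several of these coefficients from the counts n_{11}, n_{00}, n_{10}, and n_{01}:

```python
import numpy as np

def binary_similarities(x, y):
    """Similarity coefficients for two binary feature vectors x and y."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11 = np.sum(x & y)          # features present in both objects
    n00 = np.sum(~x & ~y)        # features absent from both objects
    n10 = np.sum(x & ~y)         # present only in the first object
    n01 = np.sum(~x & y)         # present only in the second object
    return {
        "simple matching": (n11 + n00) / (n11 + n00 + n10 + n01),
        "Rogers-Tanimoto": (n11 + n00) / (n11 + n00 + 2 * (n10 + n01)),
        "Jaccard": n11 / (n11 + n10 + n01),
        "Sokal-Sneath": n11 / (n11 + 2 * (n10 + n01)),
    }

print(binary_similarities([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```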
