IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 3, MAY 2005 645
Survey of Clustering Algorithms
Rui Xu, Student Member, IEEE and Donald Wunsch II, Fellow, IEEE
Abstract—Data analysis plays an indispensable role for un-
derstanding various phenomena. Cluster analysis, primitive
exploration with little or no prior knowledge, consists of research
developed across a wide variety of communities. The diversity,
on one hand, equips us with many tools. On the other hand,
the profusion of options causes confusion. We survey clustering
algorithms for data sets appearing in statistics, computer science,
and machine learning, and illustrate their applications in some
benchmark data sets, the traveling salesman problem, and bioin-
formatics, a new field attracting intensive efforts. Several tightly
related topics, proximity measure, and cluster validation, are also
discussed.
Index Terms—Adaptive resonance theory (ART), clustering,
clustering algorithm, cluster validation, neural networks, prox-
imity, self-organizing feature map (SOFM).
I. INTRODUCTION
WE ARE living in a world full of data. Every day, people
encounter a large amount of information and store or
represent it as data, for further analysis and management. One
of the vital means in dealing with these data is to classify or
group them into a set of categories or clusters. Actually, as one
of the most primitive activities of human beings [14], classi-
fication plays an important and indispensable role in the long
history of human development. In order to learn a new object
or understand a new phenomenon, people always try to seek
the features that can describe it, and further compare it with
other known objects or phenomena, based on the similarity or
dissimilarity, generalized as proximity, according to cer-
tain standards or rules. “Basically, classification systems are ei-
ther supervised or unsupervised, depending on whether they as-
sign new inputs to one of a finite number of discrete supervised
classes or unsupervised categories, respectively [38], [60], [75].
Manuscript received March 31, 2003; revised September 28, 2004. This work
was supported in part by the National Science Foundation and in part by the
M. K. Finley Missouri Endowment.
The authors are with the Department of Electrical and Computer Engineering,
University of Missouri-Rolla, Rolla, MO 65409 USA (e-mail: rxu@umr.edu;
dwunsch@ece.umr.edu).
Digital Object Identifier 10.1109/TNN.2005.845141

In supervised classification, the mapping from a set of input data
vectors (x ∈ R^d, where d is the input space dimensionality) to
a finite set of discrete class labels (y ∈ {1, ..., C}, where C is
the total number of class types) is modeled in terms of some
mathematical function y = y(x, w), where w is a vector of
adjustable parameters. The values of these parameters are determined
(optimized) by an inductive learning algorithm (also termed inducer),
whose aim is to minimize an empirical risk functional (related to an
inductive principle) on a finite data set of input–output examples,
{(x_i, y_i), i = 1, ..., N}, where N is the finite cardinality of the
available representative data set [38], [60], [167]. When the inducer
reaches convergence or terminates, an induced classifier is generated [167].
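As a brief illustration of this supervised framing (our own sketch, not part of the paper; the inducer, loss function, and data below are hypothetical), the adjustable parameters w of a simple linear classifier y(x, w) can be determined by gradient descent on an empirical risk computed over the N labeled examples:

```python
import numpy as np

def fit_inducer(X, y, lr=0.1, epochs=500):
    """Minimize an empirical risk (average logistic loss) over labeled pairs (x_i, y_i)."""
    N, d = X.shape
    w = np.zeros(d + 1)                      # adjustable parameters (weights + bias)
    Xb = np.hstack([X, np.ones((N, 1))])     # append a constant feature for the bias
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))    # predicted class-1 probabilities
        grad = Xb.T @ (p - y) / N            # gradient of the empirical risk
        w -= lr * grad
    return w

def classify(X, w):
    """The induced classifier x -> y(x, w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)

# Hypothetical two-class data set of input-output examples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = fit_inducer(X, y)
print((classify(X, w) == y).mean())          # training accuracy
```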
In unsupervised classification, called clustering or ex-
ploratory data analysis, no labeled data are available [88],
[150]. The goal of clustering is to separate a finite unlabeled
data set into a finite and discrete set of “natural,” hidden data
structures, rather than provide an accurate characterization
of unobserved samples generated from the same probability
distribution [23], [60]. This can make the task of clustering fall
outside of the framework of unsupervised predictive learning
problems, such as vector quantization [60] (see Section II-C),
probability density function estimation [38] (see Section II-D),
[60], and entropy maximization [99]. It is noteworthy that
clustering differs from multidimensional scaling (perceptual
maps), whose goal is to depict all the evaluated objects in a
way that minimizes the topographical distortion while using as
few dimensions as possible. Also note that, in practice, many
(predictive) vector quantizers are also used for (nonpredictive)
clustering analysis [60].
Nonpredictive clustering is a subjective process in nature,
which precludes an absolute judgment as to the relative effi-
cacy of all clustering techniques [23], [152]. As pointed out by
Backer and Jain [17], “in cluster analysis a group of objects is
split up into a number of more or less homogeneous subgroups
on the basis of an often subjectively chosen measure of sim-
ilarity (i.e., chosen subjectively based on its ability to create
“interesting” clusters), such that the similarity between objects
within a subgroup is larger than the similarity between objects
belonging to different subgroups””¹.
Clustering algorithms partition data into a certain number
of clusters (groups, subsets, or categories). There is no univer-
sally agreed upon definition [88]. Most researchers describe a
cluster by considering the internal homogeneity and the external
separation [111], [124], [150], i.e., patterns in the same cluster
should be similar to each other, while patterns in different clus-
ters should not. Both the similarity and the dissimilarity should
be examinable in a clear and meaningful way. Here, we give
some simple mathematical descriptions of several types of clus-
tering, based on the descriptions in [124].
Given a set of input patterns X = {x_1, ..., x_j, ..., x_N},
where x_j = (x_{j1}, x_{j2}, ..., x_{jd})^T ∈ R^d and each measure x_{ji}
is said to be a feature (attribute, dimension, or variable).
(Hard) partitional clustering attempts to seek a K-partition
of X, C = {C_1, ..., C_K} (K ≤ N), such that
1) C_i ≠ ∅, i = 1, ..., K;
2) ∪_{i=1}^{K} C_i = X;
3) C_i ∩ C_j = ∅ for i, j = 1, ..., K and i ≠ j.
¹The preceding quote is taken verbatim from verbiage suggested by the
anonymous associate editor, a suggestion which we gratefully acknowledge.
Fig. 1. Clustering procedure. The typical cluster analysis consists of four steps with a feedback pathway. These steps are closely related to each other and affect
the derived clusters.
Hierarchical clustering attempts to construct a tree-like, nested
structure partition of X, H = {H_1, ..., H_Q} (Q ≤ N), such that
C_i ∈ H_m, C_j ∈ H_l, and m > l imply C_i ⊂ C_j or C_i ∩ C_j = ∅
for all i, j ≠ i, m, l = 1, ..., Q.
For hard partitional clustering, each pattern only belongs to
one cluster. However, a pattern may also be allowed to belong
to all K clusters with a degree of membership u_{i,j} ∈ [0, 1], which
represents the membership coefficient of the jth object in the
ith cluster and satisfies the following two constraints:

∑_{i=1}^{K} u_{i,j} = 1 for all j, and ∑_{j=1}^{N} u_{i,j} < N for all i,

as introduced in fuzzy set theory [293]. This is known as fuzzy
clustering, reviewed in Section II-G.
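To make the two membership notions concrete, here is a minimal sketch (ours, not part of the paper; the matrices and function names are hypothetical) that checks the hard-partition conditions 1)–3) and the fuzzy constraints above for small K × N membership matrices:

```python
import numpy as np

def is_valid_hard_partition(U):
    """U is a K x N indicator matrix: U[i, j] = 1 iff object j belongs to cluster i."""
    binary = np.isin(U, (0, 1)).all()               # memberships are 0/1
    one_cluster_each = (U.sum(axis=0) == 1).all()   # cover + disjointness: each object in exactly one cluster
    no_empty_cluster = (U.sum(axis=1) > 0).all()    # C_i is nonempty for every i
    return binary and one_cluster_each and no_empty_cluster

def is_valid_fuzzy_partition(U):
    """U is a K x N matrix of membership coefficients u_{i,j} in [0, 1]."""
    in_range = ((U >= 0) & (U <= 1)).all()
    columns_sum_to_one = np.allclose(U.sum(axis=0), 1.0)   # sum_i u_{i,j} = 1 for all j
    rows_below_N = (U.sum(axis=1) < U.shape[1]).all()      # sum_j u_{i,j} < N for all i
    return in_range and columns_sum_to_one and rows_below_N

# Three objects, two clusters: a hard partition and a fuzzy one.
U_hard = np.array([[1, 1, 0],
                   [0, 0, 1]])
U_fuzzy = np.array([[0.9, 0.6, 0.2],
                    [0.1, 0.4, 0.8]])
print(is_valid_hard_partition(U_hard), is_valid_fuzzy_partition(U_fuzzy))  # True True
```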
Fig. 1 depicts the procedure of cluster analysis with four basic
steps.
1) Feature selection or extraction. As pointed out by Jain
et al. [151], [152] and Bishop [38], feature selection
chooses distinguishing features from a set of candidates,
while feature extraction utilizes some transformations
to generate useful and novel features from the original
ones. Both are very crucial to the effectiveness of clus-
tering applications. Elegant selection of features can
greatly decrease the workload and simplify the subse-
quent design process. Generally, ideal features should be
of use in distinguishing patterns belonging to different
clusters, immune to noise, easy to extract and interpret.
We elaborate the discussion on feature extraction in
Section II-L, in the context of data visualization and
dimensionality reduction. More information on feature
selection can be found in [38], [151], and [250].
2) Clustering algorithm design or selection. The step is
usually combined with the selection of a corresponding
proximity measure and the construction of a criterion
function. Patterns are grouped according to whether
they resemble each other. Obviously, the proximity
measure directly affects the formation of the resulting
clusters. Almost all clustering algorithms are explicitly
or implicitly connected to some definition of proximity
measure. Some algorithms even work directly on the
proximity matrix, as defined in Section II-A. Once
a proximity measure is chosen, the construction of a
clustering criterion function makes the partition of
clusters an optimization problem, which is well defined
mathematically and has rich solutions in the literature.
Clustering is ubiquitous, and a wealth of clustering algorithms
has been developed to solve different problems in
specific fields. However, there is no clustering algorithm
that can be universally used to solve all problems. “It has
been very difficult to develop a unified framework for
reasoning about it (clustering) at a technical level, and
profoundly diverse approaches to clustering” [166], as
proved through an impossibility theorem. Therefore, it
is important to carefully investigate the characteristics
of the problem at hand, in order to select or design an
appropriate clustering strategy.
3) Cluster validation. Given a data set, each clustering
algorithm can always generate a division, no matter
whether the structure exists or not. Moreover, different
approaches usually lead to different clusters; and even
for the same algorithm, parameter identification or
the presentation order of input patterns may affect the
final results. Therefore, effective evaluation standards
and criteria are important to provide the users with a
degree of confidence for the clustering results derived
from the used algorithms. These assessments should
be objective and have no preferences to any algorithm.
Also, they should be useful for answering questions
like how many clusters are hidden in the data, whether
the clusters obtained are meaningful or just an artifact
of the algorithms, or why we choose some algorithm
instead of another. Generally, there are three categories
of testing criteria: external indices, internal indices,
and relative indices. These are defined on three types
of clustering structures, known as partitional clus-
tering, hierarchical clustering, and individual clusters
[150]. Tests for the situation where no clustering
structure exists in the data are also considered [110],
but seldom used, since users are confident of the presence
of clusters. External indices are based on some
prespecified structure, which is the reflection of prior
information on the data, and used as a standard to
validate the clustering solutions. Internal tests are not
dependent on external information (prior knowledge).
On the contrary, they examine the clustering structure
directly from the original data. Relative criteria place
the emphasis on the comparison of different clustering
structures, in order to provide a reference, to decide
which one may best reveal the characteristics of the
objects. We will not survey the topic in depth and refer
interested readers to [74], [110], and [150]. However,
we will cover more details on how to determine the
number of clusters in Section II-M. Some more recent
discussion can be found in [22], [37], [121], [180],
and [181]. Approaches for fuzzy clustering validity
are reported in [71], [104], [123], and [220].
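(A brief illustrative sketch of one external index is given after this list.)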
4) Results interpretation. The ultimate goal of clustering
is to provide users with meaningful insights from the
original data, so that they can effectively solve the
problems encountered. Experts in the relevant fields
interpret the data partition. Further analyses, even exper-
iments, may be required to guarantee the reliability of
extracted knowledge.
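As a concrete illustration of the external-index idea from step 3 (our own sketch; the paper does not prescribe a particular index here, and the label vectors below are hypothetical), the Rand index compares a clustering result against a prespecified reference labeling by counting object pairs on which the two groupings agree:

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """External validation: fraction of object pairs on which the two labelings agree."""
    assert len(labels_pred) == len(labels_true)
    agreements, pairs = 0, 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]   # grouped together by the algorithm?
        same_true = labels_true[i] == labels_true[j]   # grouped together in the reference?
        agreements += (same_pred == same_true)
        pairs += 1
    return agreements / pairs

# Hypothetical example: a clustering result vs. reference labels.
print(rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))  # 0.8
```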
Note that the flow chart also includes a feedback pathway.
Cluster analysis is not a one-shot process. In many circumstances,
it needs a series of trials and repetitions. Moreover, there are no
universal and effective criteria to guide the selection of features
and clustering schemes. Validation criteria provide some insights
on the quality of clustering solutions, but even how to choose the
appropriate criterion is still a problem requiring more effort.
Clustering has been applied in a wide variety of fields,
ranging from engineering (machine learning, artificial intelli-
gence, pattern recognition, mechanical engineering, electrical
engineering), computer sciences (web mining, spatial database
analysis, textual document collection, image segmentation),
life and medical sciences (genetics, biology, microbiology,
paleontology, psychiatry, clinic, pathology), to earth sciences
(geography, geology, remote sensing), social sciences (soci-
ology, psychology, archeology, education), and economics
(marketing, business) [88], [127]. Accordingly, clustering is
also known as numerical taxonomy, learning without a teacher
(or unsupervised learning), typological analysis and partition.
The diversity reflects the important position of clustering in
scientific research. On the other hand, it causes confusion, due
to the differing terminologies and goals. Clustering algorithms
developed to solve a particular problem, in a specialized field,
usually make assumptions in favor of the application of interest.
These biases inevitably affect performance in other problems
that do not satisfy these premises. For example, the K-means
algorithm is based on the Euclidean measure and, hence, tends
to generate hyperspherical clusters. But if the real clusters are
in other geometric forms, K-means may no longer be effective,
and we need to resort to other schemes. This situation also
holds true for mixture-model clustering, in which a model is fit
to data in advance.
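To make this bias concrete, here is a minimal K-means sketch (ours, not from the paper; the toy data are hypothetical): because each point is assigned to the nearest centroid under the Euclidean distance, the induced clusters tend toward compact, roughly hyperspherical shapes and can cut across elongated structures.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: Euclidean assignment step + centroid update step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two elongated (non-spherical) clusters; K-means may split them poorly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], [5.0, 0.5], (100, 2)),
               rng.normal([0, 3], [5.0, 0.5], (100, 2))])
labels, centroids = kmeans(X, K=2)
```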
Clustering has a long history, with lineage dating back to Aris-
totle [124]. General references on clustering techniques include
[14], [75], [77], [88], [111], [127], [150], [161], [259]. Important
survey papers on clustering techniques also exist in the literature.
Starting from a statistical pattern recognition viewpoint, Jain,
Murty, and Flynn reviewed the clustering algorithms and other im-
portant issues related to cluster analysis [152], while Hansen and
Jaumard described the clustering problems under a mathematical
programming scheme [124]. Kolatch and He investigated appli-
cations of clustering algorithms for spatial database systems [171]
and information retrieval [133], respectively. Berkhin further ex-
panded the topic to the whole eld of data mining [33]. Murtagh
reported the advances in hierarchical clustering algorithms [210]
and Baraldi surveyed several models for fuzzy and neural network
clustering [24]. Some more survey papers can also be found in
[25], [40], [74], [89], and [151]. In addition to the review papers,
comparative research on clustering algorithms is also significant.
Rauber, Paralic, and Pampalk presented empirical results for five
typical clustering algorithms [231]. Wei, Lee, and Hsu placed the
emphasis on the comparison of fast algorithms for large databases
[280]. Scheunders compared several clustering techniques for
color image quantization, with emphasis on computational time
and the possibility of obtaining global optima [239]. Applications
and evaluations of different clustering algorithms for the analysis
of gene expression data from DNA microarray experiments were
described in [153], [192], [246], and [271]. Experimental evaluations
of document clustering techniques, based on hierarchical and
K-means clustering algorithms, were summarized by Steinbach,
Karypis, and Kumar [261].
In contrast to the above, the purpose of this paper is to pro-
vide a comprehensive and systematic description of the influ-
ential and important clustering algorithms rooted in statistics,
computer science, and machine learning, with emphasis on new
advances in recent years.
The remainder of the paper is organized as follows. In Sec-
tion II, we review clustering algorithms, based on the nature of
the generated clusters and the techniques and theories behind them.
Furthermore, we discuss approaches for clustering sequential
data, large data sets, data visualization, and high-dimensional
data through dimension reduction. Two important issues on
cluster analysis, including proximity measure and how to
choose the number of clusters, are also summarized in the
section. This is the longest section of the paper, so, for conve-
nience, we give an outline of Section II in bullet form here:
II. Clustering Algorithms
A. Distance and Similarity Measures
(See also Table I)
B. Hierarchical
Agglomerative
Single linkage, complete linkage, group average
linkage, median linkage, centroid linkage, Ward's
method, balanced iterative reducing and clustering
using hierarchies (BIRCH), clustering using rep-
resentatives (CURE), robust clustering using links
(ROCK)
Divisive
divisive analysis (DIANA), monothetic analysis
(MONA)
C. Squared Error-Based (Vector Quantization)
K-means, iterative self-organizing data analysis
technique (ISODATA), genetic K-means algorithm
(GKA), partitioning around medoids (PAM)
D. pdf Estimation via Mixture Densities
Gaussian mixture density decomposition (GMDD),
AutoClass
E. Graph Theory-Based
Chameleon, Delaunay triangulation graph (DTG),
highly connected subgraphs (HCS), clustering identification
via connectivity kernels (CLICK), cluster
affinity search technique (CAST)

[TABLE I: Similarity and Dissimilarity Measures for Quantitative Features]
F. Combinatorial Search Techniques-Based
Genetically guided algorithm (GGA), TS clustering,
SA clustering
G. Fuzzy
Fuzzy c-means (FCM), mountain method (MM), pos-
sibilistic c-means clustering algorithm (PCM), fuzzy
c-shells (FCS)
H. Neural Networks-Based
Learning vector quantization (LVQ), self-organizing
feature map (SOFM), ART, simplified ART (SART),
hyperellipsoidal clustering network (HEC), self-split-
ting competitive learning network (SPLL)
I. Kernel-Based
Kernel K-means, support vector clustering (SVC)
J. Sequential Data
Sequence Similarity
Indirect sequence clustering
Statistical sequence clustering
K. Large-Scale Data Sets (See also Table II)
CLARA, CURE, CLARANS, BIRCH, DBSCAN,
DENCLUE, WaveCluster, FC, ART
L. Data Visualization and High-Dimensional Data
PCA, ICA, Projection pursuit, Isomap, LLE,
CLIQUE, OptiGrid, ORCLUS
M. How Many Clusters?
Applications in two benchmark data sets, the traveling
salesman problem, and bioinformatics are illustrated in Sec-
tion III. We conclude the paper in Section IV.
II. CLUSTERING ALGORITHMS
Different starting points and criteria usually lead to different
taxonomies of clustering algorithms [33], [88], [124], [150],
[152], [171]. A rough but widely agreed frame is to classify
clustering techniques as hierarchical clustering and parti-
tional clustering, based on the properties of clusters generated
[88], [152]. Hierarchical clustering groups data objects with
a sequence of partitions, either from singleton clusters to a
cluster including all individuals or vice versa, while partitional
clustering directly divides data objects into some prespecified
number of clusters without the hierarchical structure. We
follow this frame in surveying the clustering algorithms in the
literature. Beginning with the discussion on proximity measure,
which is the basis for most clustering algorithms, we focus on
hierarchical clustering and classical partitional clustering algo-
rithms in Section II-B–D. Starting from part E, we introduce
and analyze clustering algorithms based on a wide variety of
theories and techniques, including graph theory, combinato-
rial search techniques, fuzzy set theory, neural networks, and
kernel techniques. Compared with graph theory and fuzzy set
theory, which had already been widely used in cluster analysis
before the 1980s, the other techniques have been finding their
applications in clustering only in recent decades. In spite of
the short history, much progress has been achieved. Note that
these techniques can be used for both hierarchical and parti-
tional clustering. Considering the more frequent requirement of
tackling sequential, large-scale, and high-dimensional
data sets in many current applications, we review clustering
algorithms for them in the following three parts. We focus
particular attention on clustering algorithms applied in bioin-
formatics. We offer more detailed discussion on how to identify
the appropriate number of clusters, which is particularly important
in cluster validity, in the last part of the section.

[TABLE II: Computational Complexity of Clustering Algorithms]
A. Distance and Similarity Measures
It is natural to ask what kind of standards we should use to
determine the closeness, or how to measure the distance (dis-
similarity) or similarity between a pair of objects, an object and
a cluster, or a pair of clusters. In the next section on hierarchical
clustering, we will illustrate linkage metrics for measuring prox-
imity between clusters. Usually, a prototype is used to represent
a cluster so that it can be further processed like other objects.
Here, we focus on reviewing approaches for measuring proximity
between individuals, due to the previous consideration.
A data object is described by a set of features, usually repre-
sented as a multidimensional vector. The features can be quan-
titative or qualitative, continuous or binary, nominal or ordinal,
which determine the corresponding measure mechanisms.
A distance or dissimilarity function D on a data set X is defined
to satisfy the following conditions:
1) Symmetry: D(x_i, x_j) = D(x_j, x_i);
2) Positivity: D(x_i, x_j) ≥ 0 for all x_i and x_j.
If conditions
3) Triangle inequality: D(x_i, x_j) ≤ D(x_i, x_k) + D(x_k, x_j) for all x_i, x_j, and x_k; and
4) Reflexivity: D(x_i, x_j) = 0 iff x_i = x_j
also hold, it is called a metric.
Likewise, a similarity function S is defined to satisfy the
following conditions:
1) Symmetry: S(x_i, x_j) = S(x_j, x_i);
2) Positivity: 0 ≤ S(x_i, x_j) ≤ 1 for all x_i and x_j.
If it also satisfies conditions
3) S(x_i, x_j)S(x_j, x_k) ≤ [S(x_i, x_j) + S(x_j, x_k)]S(x_i, x_k) for all x_i, x_j, and x_k; and
4) S(x_i, x_j) = 1 iff x_i = x_j,
it is called a similarity metric.
For a data set with N input patterns, we can define an N × N
symmetric matrix, called the proximity matrix, whose (i, j)th
element represents the similarity or dissimilarity measure for
the ith and jth patterns (i, j = 1, ..., N).
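As a small illustration (our own sketch, not part of the paper; the data points are hypothetical), the following builds such a proximity matrix under the Euclidean distance, one of the dissimilarity measures summarized in Table I:

```python
import numpy as np

def proximity_matrix(X):
    """N x N symmetric matrix of pairwise Euclidean distances D(x_i, x_j)."""
    diff = X[:, None, :] - X[None, :, :]          # shape (N, N, d)
    return np.sqrt((diff ** 2).sum(axis=2))       # D[i, j] = ||x_i - x_j||

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
D = proximity_matrix(X)
print(np.allclose(D, D.T), D[0, 1], D[0, 2])      # True 5.0 10.0
```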
Typically, distance functions are used to measure continuous
features, while similarity measures are more important for qual-
itative variables. We summarize some typical measures for con-
tinuous features in Table I. The selection of different measures
is problem dependent. For binary features, a similarity measure
is commonly used (dissimilarity measures can be obtained by
simply using D = 1 − S). Suppose we use two binary subscripts
to count features in two objects: n_{00} and n_{11} represent
the number of simultaneous absences or presences of features in
the two objects, and n_{10} and n_{01} count the features present in
only one object. Then two types of commonly used similarity measures
for data points x_i and x_j are illustrated in the following.

Simple matching coefficient: S(x_i, x_j) = (n_{11} + n_{00}) / (n_{11} + n_{00} + n_{10} + n_{01}).
Rogers and Tanimoto measure: S(x_i, x_j) = (n_{11} + n_{00}) / (n_{11} + n_{00} + 2(n_{10} + n_{01})).
Gower and Legendre measure: S(x_i, x_j) = (n_{11} + n_{00}) / (n_{11} + n_{00} + (n_{10} + n_{01})/2).

These measures compute the match between two objects
directly. Unmatched pairs are weighted based on their
contribution to the similarity.

Jaccard coefficient: S(x_i, x_j) = n_{11} / (n_{11} + n_{10} + n_{01}).
Sokal and Sneath measure: S(x_i, x_j) = n_{11} / (n_{11} + 2(n_{10} + n_{01})).
Gower and Legendre measure: S(x_i, x_j) = n_{11} / (n_{11} + (n_{10} + n_{01})/2).
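For concreteness, a small sketch (ours, not from the paper; the binary feature vectors are hypothetical) that computes several of these coefficients from the counts n_{11}, n_{00}, n_{10}, and n_{01}:

```python
import numpy as np

def binary_similarities(x, y):
    """Similarity coefficients for two binary feature vectors x and y."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11 = np.sum(x & y)          # features present in both objects
    n00 = np.sum(~x & ~y)        # features absent from both objects
    n10 = np.sum(x & ~y)         # present only in the first object
    n01 = np.sum(~x & y)         # present only in the second object
    return {
        "simple matching": (n11 + n00) / (n11 + n00 + n10 + n01),
        "Rogers-Tanimoto": (n11 + n00) / (n11 + n00 + 2 * (n10 + n01)),
        "Jaccard": n11 / (n11 + n10 + n01),
        "Sokal-Sneath": n11 / (n11 + 2 * (n10 + n01)),
    }

print(binary_similarities([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```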
