An examination of procedures for determining the number of clusters in a data set

doi:10.1007/BF02294245

Journal Article•DOI•

Glenn W. Milligan¹, Martha C. Cooper¹

Ohio State University¹

01 Jun 1985-Psychometrika (Springer)-Vol. 50, Iss: 2, pp 159-179

TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets containing 2, 3, 4, or 5 distinct nonoverlapping clusters; the data sets were analyzed by four hierarchical clustering methods to provide a variety of clustering solutions.

Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.
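
To make the idea of a stopping rule concrete, here is a minimal sketch in the spirit of the study, not a reproduction of it. It generates artificial data with a known number of well-separated clusters, builds an average-linkage hierarchy, and picks the hierarchy level that maximizes a variance-ratio (Calinski-Harabasz style) criterion. The choice of this particular criterion, the linkage method, the sample sizes, and all function names are assumptions made for illustration; the abstract does not name the individual rules.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(pts) for col in zip(*pts)]

def average_linkage_levels(points):
    """Agglomerate with average linkage.

    Returns {k: clusters}, where each cluster is a list of point indices
    and k is the number of clusters at that hierarchy level.
    """
    clusters = [[i] for i in range(len(points))]
    levels = {len(clusters): [c[:] for c in clusters]}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(sq_dist(points[a], points[b]) ** 0.5
                            for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
        levels[len(clusters)] = [c[:] for c in clusters]
    return levels

def variance_ratio(points, clusters):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n, k = len(points), len(clusters)
    grand = centroid(points)
    between = within = 0.0
    for members in clusters:
        pts = [points[i] for i in members]
        m = centroid(pts)
        between += len(pts) * sq_dist(m, grand)
        within += sum(sq_dist(p, m) for p in pts)
    return (between / (k - 1)) / (within / (n - k))

# Artificial data: three distinct, nonoverlapping clusters (hypothetical settings).
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in true_centers for _ in range(20)]

levels = average_linkage_levels(points)
scores = {k: variance_ratio(points, levels[k]) for k in range(2, 7)}
best_k = max(scores, key=scores.get)  # level chosen by the stopping rule
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

Repeating this over many generated data sets and counting how often `best_k` matches the true number of clusters is the kind of tally a Monte Carlo comparison of stopping rules would report.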

Citations

Algorithms for clustering data

Anil K. Jain¹, Richard C. Dubes¹

Michigan State University¹

01 Jan 1988

9,439 citations

Book•

Algorithms for clustering data

Anil K. Jain¹, Richard C. Dubes¹

Michigan State University¹

01 Jan 1988

8,586 citations

Journal Article•DOI•

Species assemblages and indicator species: the need for a flexible asymmetrical approach

Marc Dufrêne¹, Pierre Legendre²

Université catholique de Louvain¹, Université de Montréal²

01 Aug 1997-Ecological Monographs

TL;DR: A new and simple method to find indicator species and species assemblages characterizing groups of sites, and a new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed.

Abstract: This paper presents a new and simple method to find indicator species and species assemblages characterizing groups of sites. The novelty of our approach lies in the way we combine a species' relative abundance with its relative frequency of occurrence in the various groups of sites. This index is maximum when all individuals of a species are found in a single group of sites and when the species occurs in all sites of that group; it is a symmetric indicator. The statistical significance of the species indicator values is evaluated using a randomization procedure. Contrary to TWINSPAN, our indicator index for a given species is independent of the other species' relative abundances, and there is no need to use pseudospecies. The new method identifies indicator species for typologies of species relevés obtained by any hierarchical or nonhierarchical classification procedure; its use is independent of the classification method. Because indicator species give ecological meaning to groups of sites, this method provides criteria to compare typologies, to identify where to stop dividing clusters into subsets, and to point out the main levels in a hierarchical classification of sites. Species can be grouped on the basis of their indicator values for each clustering level, the heterogeneous nature of species assemblages observed in any one site being well preserved. Such assemblages are usually a mixture of eurytopic (higher level) and stenotopic species (characteristic of lower level clusters). The species assemblage approach demonstrates the importance of the "sampled patch size," i.e., the diversity of sampled ecological combinations, when we compare the frequencies of core and satellite species. A new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed. A large data set of carabid beetle distributions in open habitats of Belgium is used as a case study to illustrate the new method.
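
The combination of relative abundance and relative frequency described in this abstract can be sketched as follows. The sketch assumes the product form commonly given for this index (specificity times fidelity, expressed in percent); the abstract itself only states that the two quantities are combined, so the exact formula, the function name `indicator_values`, and the toy data are assumptions for illustration. The randomization test for significance is not shown.

```python
def indicator_values(abundances, groups):
    """Indicator value of one species for each group of sites, in percent.

    abundances: abundance of the species at each site.
    groups:     group label of each site (same length as abundances).
    """
    labels = sorted(set(groups))
    mean_abundance = {}
    frequency = {}
    for g in labels:
        values = [a for a, lab in zip(abundances, groups) if lab == g]
        # Mean abundance of the species over the sites of group g.
        mean_abundance[g] = sum(values) / len(values)
        # Share of the sites of group g where the species occurs.
        frequency[g] = sum(1 for v in values if v > 0) / len(values)
    total = sum(mean_abundance.values())
    result = {}
    for g in labels:
        # Relative abundance: share of the summed group means falling in g.
        specificity = mean_abundance[g] / total if total else 0.0
        result[g] = 100.0 * specificity * frequency[g]
    return result

site_groups = ["A", "A", "A", "B", "B", "B"]

# Species found only in group A, and at every site of A: the maximal case.
perfect = indicator_values([5, 3, 4, 0, 0, 0], site_groups)

# Species spread over both groups and absent from some sites of each.
mixed = indicator_values([4, 0, 0, 2, 2, 0], site_groups)

print(perfect, mixed)
```

In the first toy species the index reaches its maximum for group A, matching the condition stated in the abstract; the second species scores low in both groups because neither condition holds.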

7,449 citations

Cites background from "An examination of procedures for de..."

...This is a common question in cluster analysis because no single objective criterion receives general support (Milligan and Cooper 1985)....

Journal Article•DOI•

Survey of clustering algorithms

Rui Xu¹, Donald C. Wunsch¹

Missouri University of Science and Technology¹

01 May 2005-IEEE Transactions on Neural Networks

TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications to some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts, are illustrated.

Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.

5,744 citations

Journal Article•DOI•

Enterotypes of the human gut microbiome

Manimozhiyan Arumugam, Jeroen Raes, Eric Pelletier¹, Denis Le Paslier¹, Takuji Yamada, Daniel R. Mende, Gabriel Fernandes, Julien Tap, Thomas Brüls¹, Jean-Michel Batto², Marcelo Bertalan³, Natalia Borruel, Francesc Casellas, Leyden Fernández⁴, Laurent Gautier³, Torben Hansen⁵, Masahira Hattori⁶, Tetsuya Hayashi⁷, Michiel Kleerebezem⁸, Ken Kurokawa⁹, Marion Leclerc², Florence Levenez², Chaysavanh Manichanh, H. Bjørn Nielsen³, Trine Nielsen⁵, Nicolas Pons², Julie Poulain¹⁰, Junjie Qin, Thomas Sicheritz-Pontén³, Sebastian Tims⁸, David Torrents⁴, Edgardo Ugarte, Erwin G. Zoetendal⁸, Jun Wang, Francisco Guarner, Oluf Pedersen⁵, Willem M. de Vos, Søren Brunak³, Joël Doré², Jean Weissenbach¹, S. Dusko Ehrlich², Peer Bork

University of Évry Val d'Essonne¹, Institut national de la recherche agronomique², Technical University of Denmark³, Barcelona Supercomputing Center⁴, University of Copenhagen⁵, University of Tokyo⁶, University of Miyazaki⁷, Wageningen University and Research Centre⁸, Tokyo Institute of Technology⁹, French Alternative Energies and Atomic Energy Commission¹⁰

12 May 2011-Nature

TL;DR: Three robust clusters (referred to as enterotypes hereafter) that are not nation or continent specific are identified and confirmed in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous.

Abstract: Our knowledge of species and functional composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about variation across the world. By combining 22 newly sequenced faecal metagenomes of individuals from four countries with previously published data sets, here we identify three robust clusters (referred to as enterotypes hereafter) that are not nation or continent specific. We also confirmed the enterotypes in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous. This indicates further the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis to understand microbial communities. Although individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.

5,566 citations

References

Journal Article•DOI•

Pattern Classification and Scene Analysis.

Ulf Grenander, Richard O. Duda, Peter E. Hart

01 Sep 1974-Journal of the American Statistical Association

14,948 citations

Book•

Pattern classification and scene analysis

Richard O. Duda, Peter E. Hart

01 Jan 1973

TL;DR: A unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, covering Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

13,647 citations

Book•

Cluster Analysis

Brian Everitt, Sabine Landau, Morven Leese

01 Jan 1974

TL;DR: This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.

Abstract: Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organising multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques are applicable in a wide range of areas such as medicine, psychology and market research. This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering. Real life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. The book is comprehensive yet relatively non-mathematical, focusing on the practical aspects of cluster analysis.

9,857 citations

Journal Article•DOI•

A Cluster Separation Measure

David L. Davies¹, Donald W. Bouldin¹

University of Tennessee¹

01 Feb 1979-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A measure is presented which indicates the similarity of clusters that are assumed to have a data density which is a decreasing function of distance from a vector characteristic of the cluster; the measure can be used to infer the appropriateness of data partitions.

Abstract: A measure is presented which indicates the similarity of clusters which are assumed to have a data density which is a decreasing function of distance from a vector characteristic of the cluster. The measure can be used to infer the appropriateness of data partitions and can therefore be used to compare relative appropriateness of various divisions of the data. The measure does not depend on either the number of clusters analyzed or the method of partitioning of the data and can be used to guide a cluster seeking algorithm.
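
A cluster similarity measure of this kind can be sketched as below, under stated assumptions: the cluster centroid is taken as the "vector characteristic of the cluster", scatter is the mean Euclidean distance of members to that centroid, and the overall value is the average, over clusters, of each cluster's worst-case similarity to any other cluster. These choices follow the form usually called the Davies-Bouldin index; the abstract does not spell them out, and the function names and toy partitions are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_separation(clusters):
    """Average worst-case similarity between clusters; lower means better separated.

    clusters: list of clusters, each a non-empty list of coordinate tuples.
    Requires at least two clusters with distinct centroids.
    """
    centroids = [[sum(col) / len(pts) for col in zip(*pts)] for pts in clusters]
    # Scatter: mean distance of each member to its cluster centroid.
    scatter = [sum(euclidean(p, m) for p in pts) / len(pts)
               for pts, m in zip(clusters, centroids)]
    k = len(clusters)
    worst = []
    for i in range(k):
        # Similarity of cluster i to j: combined scatter over centroid distance.
        worst.append(max((scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
                         for j in range(k) if j != i))
    return sum(worst) / k

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
far = [(x + 10, y + 10) for x, y in square]    # same shape, far away
near = [(x + 2, y + 2) for x, y in square]     # same shape, close by

score_far = cluster_separation([square, far])
score_near = cluster_separation([square, near])
print(score_far, score_near)
```

The partition with the distant second cluster gets the smaller value, which is how such a measure can be used to compare alternative divisions of the same data.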

6,757 citations