Journal ArticleDOI

How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

01 Jan 1998-The Computer Journal (Oxford University Press)-Vol. 41, Iss: 8, pp 578-588
TL;DR: The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model, and the EM result provides a measure of uncertainty about the associated classification of each data point.
Abstract: We consider the problem of determining the structure of clustered data, without prior knowledge of the number of clusters or any other information about their composition. Data are represented by a mixture model in which each component corresponds to a different cluster. Models with varying geometric properties are obtained through Gaussian components with different parametrizations and cross-cluster constraints. Noise and outliers can be modelled by adding a Poisson process component. Partitions are determined by the expectation-maximization (EM) algorithm for maximum likelihood, with initial values from agglomerative hierarchical clustering. Models are compared using an approximation to the Bayes factor based on the Bayesian information criterion (BIC); unlike significance tests, this allows comparison of more than two models at the same time, and removes the restriction that the models compared be nested. The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model. Moreover, the EM result provides a measure of uncertainty about the associated classification of each data point. Examples are given, showing that this approach can give performance that is much better than standard procedures, which often fail to identify groups that are either overlapping or of varying sizes and shapes.
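
The same recipe can be approximated with standard tools. The sketch below uses scikit-learn's GaussianMixture rather than the authors' own software, so it is only an analogue: it fits mixtures over a grid of component counts and covariance parametrizations (the geometric constraints), keeps the model with the best BIC, and reports a per-point classification uncertainty. Note that scikit-learn's bic() is "smaller is better" while the paper's BIC is "larger is better", and scikit-learn initializes EM from k-means rather than from agglomerative hierarchical clustering; the data file name in the usage comment is hypothetical.

import numpy as np
from sklearn.mixture import GaussianMixture

def select_mixture_model(X, max_components=9):
    """Grid-search covariance structure and number of components by BIC."""
    best = None
    for cov in ("spherical", "diag", "tied", "full"):   # geometric constraints
        for k in range(1, max_components + 1):          # candidate numbers of clusters
            gm = GaussianMixture(n_components=k, covariance_type=cov,
                                 n_init=5, random_state=0).fit(X)
            bic = gm.bic(X)                             # smaller is better here
            if best is None or bic < best[0]:
                best = (bic, k, cov, gm)
    return best  # (bic, n_clusters, covariance_type, fitted model)

# Usage (hypothetical data file):
# X = np.loadtxt("data.txt")
# bic, k, cov, gm = select_mixture_model(X)
# labels = gm.predict(X)                                # cluster assignments
# uncertainty = 1.0 - gm.predict_proba(X).max(axis=1)   # per-point classification uncertainty
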
Citations
Journal ArticleDOI
TL;DR: It was found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection; the results highlight the value of GLM in combination with penalised methods and threshold-based pre-selection when omitted variables are considered in the final interpretation.
Abstract: Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time, and predicted to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors, threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating their performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under change in collinearity structure, and the 'folklore' threshold for correlation coefficients between predictor variables of |r| > 0.7 was an appropriate indicator for when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
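
For illustration, here is a small sketch of one threshold-based pre-selection variant, assuming a plain NumPy feature matrix: greedily drop one variable from every pair whose absolute Pearson correlation exceeds the |r| > 0.7 rule of thumb discussed above. The function name and the greedy "keep the first variable" rule are illustrative choices, not the paper's exact procedure.

import numpy as np

def drop_collinear(X, names, threshold=0.7):
    """Greedy pre-selection: remove one variable from each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    for i in range(corr.shape[0]):
        if i in dropped:
            continue
        for j in range(i + 1, corr.shape[1]):
            if j not in dropped and corr[i, j] > threshold:
                dropped.add(j)               # keep the first variable of the pair
    keep = [i for i in range(X.shape[1]) if i not in dropped]
    return X[:, keep], [names[i] for i in keep]
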

6,199 citations

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a method called the "gap statistic" for estimating the number of clusters (groups) in a set of data, which uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution.
Abstract: We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
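
A rough Python sketch of the procedure, assuming scikit-learn's KMeans as the clustering algorithm and a uniform reference distribution over the bounding box of the data (the original paper also proposes a PCA-aligned reference, omitted here); the estimate is the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.

import numpy as np
from sklearn.cluster import KMeans

def log_within_dispersion(X, k, seed=0):
    """log of the pooled within-cluster sum of squares W_k."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, k_max=10, n_refs=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        log_wk = log_within_dispersion(X, k, seed)
        # reference null: uniform draws over the bounding box of the data
        ref = np.array([log_within_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
                        for _ in range(n_refs)])
        gaps.append(ref.mean() - log_wk)
        sks.append(ref.std() * np.sqrt(1.0 + 1.0 / n_refs))
    for k in range(1, k_max):                 # smallest k satisfying the gap rule
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k, gaps
    return k_max, gaps
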

4,283 citations

Journal ArticleDOI
TL;DR: This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled.
Abstract: Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent development...
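
As a hedged illustration of the "other problems" mentioned in the abstract, the sketch below reuses BIC-selected Gaussian mixtures for class-conditional density estimation and a simple mixture-based discriminant rule (one mixture per labelled class, new points assigned to the class with the largest posterior). It mimics the spirit of the approach with scikit-learn rather than reproducing the authors' implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_density(X, max_components=5):
    """Fit a Gaussian mixture to one class, choosing the number of components by BIC."""
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(1, max_components + 1)]
    return min(fits, key=lambda gm: gm.bic(X))

def mixture_discriminant(X_train, y_train, X_new):
    """Classify new points by prior times mixture density, one mixture per class."""
    classes = np.unique(y_train)
    densities = {c: fit_class_density(X_train[y_train == c]) for c in classes}
    log_priors = {c: np.log(np.mean(y_train == c)) for c in classes}
    scores = np.column_stack([log_priors[c] + densities[c].score_samples(X_new)
                              for c in classes])
    return classes[scores.argmax(axis=1)]
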

4,123 citations


Cites background or methods from "How Many Clusters? Which Clustering..."

  • ...The single-link clustering method seems to be not related to a statistical model and does not perform well in instances where clusters are not well separated (e.g., Fraley and Raftery 1998)....


  • ...Finally, in a range of applications of model-based clustering, model choice based on BIC has given good results (Campbell, Fraley, Murtagh, and Raftery 1997; Campbell, Fraley, Stanford, Murtagh, and Raftery 1999; DasGupta and Raftery 1998; Fraley and Raftery 1998; Stanford and Raftery 2000)....


  • ...Sections 2–5 include a review of material covered in earlier work (Fraley and Raftery 1998) and elsewhere....


  • ...A non-Gaussian component can often be approximated by several Gaussian ones (e.g., Dasgupta and Raftery 1998; Fraley and Raftery 1998)....


Book
01 Jan 2000
TL;DR: The gap statistic is proposed for estimating the number of clusters (groups) in a set of data by comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution.
Abstract: We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.

3,860 citations

Journal ArticleDOI
TL;DR: The Discriminant Analysis of Principal Components (DAPC) is introduced: a multivariate method designed to identify and describe clusters of genetically related individuals, which generally performs better than STRUCTURE at characterizing population subdivision.
Abstract: The dramatic progress in sequencing technologies offers unprecedented prospects for deciphering the organization of natural populations in space and time. However, the size of the datasets generated also poses some daunting challenges. In particular, Bayesian clustering algorithms based on pre-defined population genetics models such as the STRUCTURE or BAPS software may not be able to cope with this unprecedented amount of data. Thus, there is a need for less computer-intensive approaches. Multivariate analyses seem particularly appealing as they are specifically devoted to extracting information from large datasets. Unfortunately, currently available multivariate methods still lack some essential features needed to study the genetic structure of natural populations. We introduce the Discriminant Analysis of Principal Components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. Our approach allows rich information to be extracted from genetic data, providing assignment of individuals to groups, a visual assessment of between-population differentiation, and the contribution of individual alleles to population structuring. We evaluate the performance of our method using simulated data, which were also analyzed using STRUCTURE as a benchmark. Additionally, we illustrate the method by analyzing microsatellite polymorphism in worldwide human populations and hemagglutinin gene sequence variation in seasonal influenza. Analysis of simulated data revealed that our approach performs generally better than STRUCTURE at characterizing population subdivision. The tools implemented in DAPC for the identification of clusters and graphical representation of between-group structures make it possible to unravel complex population structures. Our approach is also faster than Bayesian clustering algorithms by several orders of magnitude, and may be applicable to a wider range of datasets.
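
A loose Python analogue of the workflow (the method itself is implemented in the R package adegenet): reduce the data with PCA, infer the number of groups on the retained components, here using GaussianMixture BIC as a stand-in for the paper's sequential K-means plus BIC, then run a discriminant analysis on those components to obtain group assignments and membership probabilities. The function name, parameter names and defaults are illustrative only.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dapc_like(X, n_pcs=20, max_k=10):
    """PCA -> cluster inference by BIC -> discriminant analysis on the components."""
    # n_pcs must not exceed min(n_samples, n_features)
    pcs = PCA(n_components=n_pcs).fit_transform(X)
    fits = [GaussianMixture(n_components=k, random_state=0).fit(pcs)
            for k in range(2, max_k + 1)]
    best = min(fits, key=lambda gm: gm.bic(pcs))
    groups = best.predict(pcs)
    lda = LinearDiscriminantAnalysis().fit(pcs, groups)
    return groups, lda.predict_proba(pcs)    # assignments and membership probabilities
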

3,770 citations


Cites methods from "How Many Clusters? Which Clustering..."

  • ...As advocated in previous studies [5,22], we use Bayesian Information Criterion (BIC) to assess the best supported model, and therefore the number and nature of clusters....


References
Journal ArticleDOI
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.
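
Written out, the criterion takes its familiar "smaller is better" form for a model with maximized likelihood $\hat{L}$, $m$ free parameters and $n$ observations; the article "How Many Clusters?..." uses the sign-flipped, "larger is better" convention, as in the excerpt quoted further down.

\[
\mathrm{BIC} \;=\; -2 \log \hat{L} \;+\; m \log n
\]
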

38,681 citations

01 Jan 2005
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.

36,760 citations


"How Many Clusters? Which Clustering..." refers methods in this paper

  • ...When EM is used to find the maximum mixture likelihood, a more reliable approximation to twice the log Bayes factor, called the BIC (Schwarz [32]), is applicable: $2 \log p(x \mid \mathcal{M}) + \text{constant} \approx 2\, l_{\mathcal{M}}(x, \hat{\theta}) - m_{\mathcal{M}} \log(n) \equiv \text{BIC}$, where $p(x \mid \mathcal{M})$ is the (integrated) likelihood of the data for the model $\mathcal{M}$, $l_{\mathcal{M}}(x, \hat{\theta})$ is the maximized mixture log-likelihood for the model, and $m_{\mathcal{M}}$ is the number of independent parameters to be estimated in the model....


Book
01 Jan 1983

34,729 citations


"How Many Clusters? Which Clustering..." refers background in this paper

  • ...This latter quantity falls in the range [0, 1], and values near zero imply ill-conditioning [34]....


01 Jan 1967
TL;DR: The k-means procedure partitions an N-dimensional population into k sets on the basis of a sample; the k-means concept is a generalization of the ordinary sample mean, and the procedure is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if $p$ is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E^N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then $W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special
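
A minimal sketch of the within-class variance $W^2(S)$ above, computed for a partition produced by scikit-learn's KMeans (Lloyd-style batch updates rather than MacQueen's original sequential updates); the value equals the estimator's inertia_ attribute up to floating-point rounding.

import numpy as np
from sklearn.cluster import KMeans

def within_class_variance(X, k):
    """Pooled squared distance of each point to its cluster mean."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_[km.labels_]
    return float(np.sum((X - centers) ** 2))   # equals km.inertia_ up to rounding
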

24,320 citations


"How Many Clusters? Which Clustering..." refers methods in this paper

  • ...The most common relocation method— k-means (MacQueen [14], Hartigan and Wong [15])—reduces the within-group sums of squares....
