Journal ArticleDOI

A Cluster Separation Measure

TL;DR: A measure of cluster similarity is presented for clusters assumed to have a data density that decreases with distance from a vector characteristic of the cluster; the measure can be used to infer the appropriateness of data partitions.
Abstract: A measure is presented which indicates the similarity of clusters that are assumed to have a data density which is a decreasing function of distance from a vector characteristic of the cluster. The measure can be used to infer the appropriateness of data partitions and can therefore be used to compare the relative appropriateness of various divisions of the data. The measure depends on neither the number of clusters analyzed nor the method of partitioning of the data, and can be used to guide a cluster-seeking algorithm.
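
The measure is now widely known as the Davies–Bouldin index. As a concrete reading of the abstract, the sketch below is a minimal NumPy implementation of the most common instantiation, in which the dispersion s_i is the mean Euclidean distance of a cluster's members to its centroid and d_ij is the distance between centroids; the paper itself defines a parametric family of dispersion and distance measures, so these particular choices are illustrative assumptions.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: the mean, over clusters, of the worst-case
    ratio (s_i + s_j) / d_ij of within-cluster dispersion to
    between-centroid distance. Lower values indicate a better partition."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # s_i: mean Euclidean distance of cluster members to their centroid.
    s = np.array([np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
                  for k, c in enumerate(clusters)])
    n = len(clusters)
    worst = np.zeros(n)
    for i in range(n):
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(n) if j != i]
        worst[i] = max(ratios)  # R_i: similarity to the most confusable cluster
    return worst.mean()
```

For everyday use, scikit-learn exposes the same quantity as sklearn.metrics.davies_bouldin_score.
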
Citations
Journal ArticleDOI
TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets containing 2, 3, 4, or 5 distinct nonoverlapping clusters, analyzed by four hierarchical clustering methods to provide a variety of clustering solutions.
Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.
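
The experimental design described above is straightforward to reproduce in miniature. The sketch below is a hedged Python analog using scikit-learn (not the paper's original 30 procedures); it uses only the Davies-Bouldin rule as one representative stopping rule, with data parameters chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

rng = np.random.RandomState(0)
hits, trials = 0, 20
for t in range(trials):
    true_k = rng.choice([2, 3, 4, 5])          # 2-5 distinct clusters
    X, _ = make_blobs(n_samples=200, centers=true_k,
                      cluster_std=0.5, random_state=t)
    # Score each candidate partition; the stopping rule picks the k
    # that minimizes the Davies-Bouldin index.
    scores = {k: davies_bouldin_score(
                  X, AgglomerativeClustering(n_clusters=k).fit_predict(X))
              for k in range(2, 8)}
    hits += (min(scores, key=scores.get) == true_k)
print(f"correct k recovered in {hits}/{trials} trials")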

3,551 citations


Cites background from "A Cluster Separation Measure"

  • ...Davies and Bouldin (1979) provided a general framework for measures of cluster separation....

Journal ArticleDOI
02 Dec 2001
TL;DR: The fundamental concepts of clustering are introduced, the widely known clustering algorithms are surveyed in a comparative way, and the issues that remain under-addressed by recent algorithms are illustrated.
Abstract: Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business, and the social sciences. Especially in recent years, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper introduces the fundamental concepts of clustering while surveying the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of the clustering process: the quality assessment of the clustering results. This is also related to the inherent features of the data set under consideration. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by recent algorithms and gives the trends in the clustering process.

2,643 citations


Cites background from "A Cluster Separation Measure"

  • ...A simple choice for R_ij that satisfies the above conditions is (Davies and Bouldin, 1979): R_ij = (s_i + s_j)/d_ij....

  • ...The R_ij index is defined to satisfy the following conditions (Davies and Bouldin, 1979)... (restated in full after this list).

  • ...Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster c_i, are given in Davies and Bouldin (1979)....

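The conditions referenced in the excerpts above are usually stated as follows; this is a LaTeX paraphrase of the standard presentation of the 1979 paper, not a verbatim quote of the citing survey.

```latex
% Conditions on a cluster similarity measure R_{ij}, with s_i the
% dispersion of cluster i and d_{ij} the distance between clusters i and j:
%   (1) R_{ij} >= 0
%   (2) R_{ij} = R_{ji}
%   (3) R_{ij} = 0 iff s_i = s_j = 0
%   (4) if s_j = s_k and d_{ij} < d_{ik}, then R_{ij} > R_{ik}
%   (5) if d_{ij} = d_{ik} and s_j > s_k, then R_{ij} > R_{ik}
% A simple choice satisfying all five:
\[
  R_{ij} = \frac{s_i + s_j}{d_{ij}}
\]
```
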
Book
01 Jan 1984
TL;DR: Cluster analysis is a multivariate procedure for detecting natural groupings in data; it resembles discriminant analysis in one respect: the researcher seeks to classify a set of objects into subgroups although neither the number nor the members of the subgroups are known.
Abstract: SYSTAT provides a variety of cluster analysis methods on rectangular or symmetric data matrices. Cluster analysis is a multivariate procedure for detecting natural groupings in data. It resembles discriminant analysis in one respect: the researcher seeks to classify a set of objects into subgroups although neither the number nor the members of the subgroups are known.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering. Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. SYSTAT further provides five indices, viz., statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided. In the K-Clustering procedure SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning, and nine methods for selecting initial seeds for both.

The objects in these groups may be:

  • Cases (observations or rows of a rectangular data file). For example, suppose health indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.) are recorded for countries (cases); then developed nations may form a cluster separate from developing countries.
  • Variables (characteristics or columns of the data). For example, suppose causes of death (cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded for each U.S. state (case); the results show that accidents are relatively independent of the illnesses.
  • Cases and variables (individual entries in the data matrix). For example, certain wines are associated with good years of production, while other wines have other years that are better.

Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the same object to appear in more than one …
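
SYSTAT itself is proprietary, so as a neutral illustration of the hierarchical procedure described above (build a linkage tree, then cut it at a chosen number of clusters), here is a minimal SciPy sketch; the data, linkage method, and cut level are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy "health indicator" matrix: rows are countries, columns indicators.
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 4)),   # one group of cases
               rng.normal(4.0, 1.0, size=(10, 4))])  # a second, separated group

Z = linkage(X, method="ward")                    # one of the linkage methods
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)  # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
```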

2,533 citations


Cites background from "A Cluster Separation Measure"

  • ...Define s_i as the measure of dispersion of cluster c_i and d_ij as the dissimilarity measure between clusters c_i and c_j. Then the DB (Davies and Bouldin, 1979) index is defined as the average, over all clusters, of the largest ratio (s_i + s_j)/d_ij taken over j ≠ i (written out as a formula after this list)....

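In the standard notation (s_i the dispersion of cluster c_i, d_ij the dissimilarity between clusters c_i and c_j, n the number of clusters), the DB index referenced in the excerpt reads:

```latex
\[
  \mathrm{DB} \;=\; \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i}
  \frac{s_i + s_j}{d_{ij}}
\]
```
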
Journal ArticleDOI
TL;DR: The two-stage procedure, first using the SOM to produce the prototypes that are then clustered in the second stage, is found to perform well when compared with direct clustering of the data, and to reduce the computation time.
Abstract: The self-organizing map (SOM) is an excellent tool in the exploratory phase of data mining. It projects the input space onto prototypes of a low-dimensional regular grid that can be effectively utilized to visualize and explore properties of the data. When the number of SOM units is large, similar units need to be grouped, i.e., clustered, to facilitate quantitative analysis of the map and the data. In this paper, different approaches to clustering of the SOM are considered. In particular, the use of hierarchical agglomerative clustering and partitive clustering using K-means are investigated. The two-stage procedure, first using the SOM to produce the prototypes that are then clustered in the second stage, is found to perform well when compared with direct clustering of the data, and to reduce the computation time.
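
As a sketch of the two-stage procedure, the snippet below uses the third-party MiniSom package and scikit-learn; the grid size, iteration count, and number of clusters are illustrative assumptions, and the paper also investigates hierarchical agglomeration for the second stage.

```python
import numpy as np
from minisom import MiniSom            # third-party: pip install minisom
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# Stage 1: train a SOM; its units become a small set of prototypes.
som = MiniSom(8, 8, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X, 2000)
prototypes = som.get_weights().reshape(-1, X.shape[1])   # 64 prototypes

# Stage 2: cluster the prototypes instead of the full data set.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(prototypes)

# Map each data point to the cluster of its best-matching unit (BMU).
bmu = np.array([np.ravel_multi_index(som.winner(x), (8, 8)) for x in X])
labels = km.labels_[bmu]
```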

2,387 citations


Cites methods from "A Cluster Separation Measure"

  • ...In our simulations, we used the Davies–Bouldin index [13], which combines a within-cluster distance with a between-clusters distance....

Journal ArticleDOI
TL;DR: The R package NbClust provides 30 indices which determine the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results.
Abstract: Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than objects in different groups. Most of the clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation as regards its validity. The evaluation procedure has to tackle difficult problems such as the quality of the clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, programs are unavailable to test these indices and compare them. The R package NbClust has been developed for that purpose. It provides 30 indices which determine the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determine the most appropriate number of clusters for the data set of interest.
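
NbClust is an R package; as a rough Python analog of its core idea (evaluate several validity indices over candidate cluster counts and let them vote), here is a sketch with three indices from scikit-learn. The index set, data, and k range are illustrative assumptions; NbClust itself implements 30 indices and several clustering methods.

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
ks = range(2, 8)
labelings = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
             for k in ks}

# Each index casts one vote for its preferred k.
votes = [
    max(ks, key=lambda k: silhouette_score(X, labelings[k])),        # maximize
    min(ks, key=lambda k: davies_bouldin_score(X, labelings[k])),    # minimize
    max(ks, key=lambda k: calinski_harabasz_score(X, labelings[k])), # maximize
]
best_k, _ = Counter(votes).most_common(1)[0]
print(f"index votes: {votes} -> majority choice: k={best_k}")
```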

1,912 citations


Cites background from "A Cluster Separation Measure"

  • ...The Davies and Bouldin (1979) index is a function of the sum of ratios of within-cluster scatter to between-cluster separation....

  • ..."db" (Davies and Bouldin 1979) Minimum value of the index 11....

  • [Table 1: Indices implemented in SAS and R packages. The recoverable rows list Silhouette (Rousseeuw 1987), Hartigan (Hartigan 1975), Cindex (Hubert and Levin 1976), DB (Davies and Bouldin 1979), Ratkowsky (Ratkowsky and Lance 1978), Scott (Scott and Symons 1971), Marriot (Marriot 1971), Ball (Ball and Hall 1965), Trcovw (Milligan and Cooper 1985), Tracew (Milligan and Cooper 1985), Friedman (Friedman and Rubin 1967), Rubin (Friedman and Rubin 1967), and Dunn (Dunn 1974), each marked with the packages that implement it.]

  • ...The value of q minimizing DB(q) is regarded as specifying the number of clusters (Milligan and Cooper 1985; Davies and Bouldin 1979)....

References
Book
01 Jan 1974
TL;DR: The present work gives an account of basic principles and available techniques for the analysis and design of pattern processing and recognition systems.
Abstract: The present work gives an account of basic principles and available techniques for the analysis and design of pattern processing and recognition systems. Areas covered include decision functions, pattern classification by distance functions, pattern classification by likelihood functions, the perceptron and the potential function approaches to trainable pattern classifiers, statistical approach to trainable classifiers, pattern preprocessing and feature selection, and syntactic pattern recognition.

3,237 citations

Journal ArticleDOI
TL;DR: I was sitting before my TV set, a while back, watching Captain Video and pondering the organizational problems of psychologists, psychometricians, psychodiagnosticians, psycho-somatists, psychosomnabulists, and psychoceramics, and decided to enlist Captain Video's help to bring me from the Black Planet that superogalactian hypermetrician, Dr. Idnozs HcahscrorTenib, cosmos-famous discoverer of Serutan.
Abstract: I was sitting before my TV set, a while back, watching Captain Video and pondering the organizational problems of psychologists, psychometricians, psychodiagnosticians, psycho-somatists, psychosomnabulists, and psychoceramics (crack-pots to you). Wondering what I might do, in my small way, to help out, I decided to enlist Captain Video's help to bring me from the Black Planet that superogalactian hypermetrician, Dr. Idnozs HcahscrorTenib, cosmos-famous discoverer of Serutan. Why delay? The Galaxy was on its way, and in half a light year Dr. Tenib was at my side prepared to devote his gargantuan talents to the task. Seeing no point in confusing the good doctor by trying to describe to him the present administrative hodgepodge, I said, "Doctor, let's start from scratch. I want you to find out for me how these good people who are present at the annual meeting of the APA structure themselves? What families are represented? How many, or better, how few? And who belongs to each?" "We proceed," said the Doctor. "Bring sample of population; I measure." So we set out to design a sample. The problem presented some interesting theoretical aspects, but the final solution was relatively simple. We stationed representatives at each of the three state beverage stores and followed every third badge-wearing individual who came out of a store. We selected only outgoing patrons for obvious reasons. After assisting each respondent to unburden himself, we brought him to Dr. Idnozs (as we came to call him among ourselves) for study. "Now," murmured the Doctor, "we give tests. First is 'Draw-a-Psychiatrist Test.' " "We score this," he confided, "by if it gives horns." Presently we started on the physiological test battery. "We draw off saliva drop by drop," explained our idiot savant, "and see does he drool when we bring in Skinner Box." Later came the Peculiar Preference Blank. "Forced-choice, you know," whispered the Doctor. "Would you rather make mud pies or kiss gorgeous blonde?"

1,279 citations

Journal ArticleDOI
H. P. Friedman, J. Rubin
TL;DR: This paper attacks the problem of exploring the structure of multivariate data in search of “clusters” by using a computer procedure to obtain the “best” partition of n objects into g groups.
Abstract: This paper deals with methods of “cluster analysis”. In particular we attack the problem of exploring the structure of multivariate data in search of “clusters”. The approach taken is to use a computer procedure to obtain the “best” partition of n objects into g groups. A number of mathematical criteria for “best” are discussed and related to statistical theory. A procedure for optimizing the criteria is outlined. Some of the criteria are compared with respect to their behavior on actual data. Results of data analysis are presented and discussed.
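
Two of the criteria usually associated with this paper (cited as "Friedman" and "Rubin" in the NbClust table above) can be written in scatter-matrix notation; this is the standard textbook statement rather than a quote, with W the pooled within-group scatter matrix, B the between-group scatter matrix, and T = W + B the total scatter:

```latex
\[
  \max_{\text{partitions}} \; \operatorname{tr}\!\big(W^{-1} B\big)
  \qquad \text{and} \qquad
  \max_{\text{partitions}} \; \frac{\lvert T \rvert}{\lvert W \rvert},
  \qquad T = W + B
\]
```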

586 citations