Journal ArticleDOI

A Cluster Separation Measure

TL;DR: A measure of cluster similarity is presented for clusters assumed to have a data density that decreases with distance from a vector characteristic of the cluster; the measure can be used to infer the appropriateness of data partitions.
Abstract: A measure is presented which indicates the similarity of clusters that are assumed to have a data density which is a decreasing function of distance from a vector characteristic of the cluster. The measure can be used to infer the appropriateness of data partitions and can therefore be used to compare the relative appropriateness of various divisions of the data. The measure depends on neither the number of clusters analyzed nor the method of partitioning of the data, and can be used to guide a cluster-seeking algorithm.
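
The measure is now widely known as the Davies–Bouldin index. As a concrete reading of the abstract, the sketch below is a minimal NumPy implementation of the most common instantiation, in which the dispersion s_i is the mean Euclidean distance of a cluster's members to its centroid and d_ij is the distance between centroids; the paper itself defines a parametric family of dispersion and distance measures, so these particular choices are illustrative assumptions.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: the mean, over clusters, of the worst-case
    ratio (s_i + s_j) / d_ij of within-cluster dispersion to
    between-centroid distance. Lower values indicate a better partition."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # s_i: mean Euclidean distance of cluster members to their centroid.
    s = np.array([np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
                  for k, c in enumerate(clusters)])
    n = len(clusters)
    worst = np.zeros(n)
    for i in range(n):
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(n) if j != i]
        worst[i] = max(ratios)  # R_i: similarity to the most confusable cluster
    return worst.mean()
```

For everyday use, scikit-learn exposes the same quantity as sklearn.metrics.davies_bouldin_score.
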
Citations
Journal ArticleDOI
TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets containing 2, 3, 4, or 5 distinct nonoverlapping clusters, analyzed by four hierarchical clustering methods to provide a variety of clustering solutions.
Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.
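
The experimental design described above is straightforward to reproduce in miniature. The sketch below is a hedged Python analog using scikit-learn (not the paper's original 30 procedures); it uses only the Davies-Bouldin rule as one representative stopping rule, with data parameters chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

rng = np.random.RandomState(0)
hits, trials = 0, 20
for t in range(trials):
    true_k = rng.choice([2, 3, 4, 5])          # 2-5 distinct clusters
    X, _ = make_blobs(n_samples=200, centers=true_k,
                      cluster_std=0.5, random_state=t)
    # Score each candidate partition; the stopping rule picks the k
    # that minimizes the Davies-Bouldin index.
    scores = {k: davies_bouldin_score(
                  X, AgglomerativeClustering(n_clusters=k).fit_predict(X))
              for k in range(2, 8)}
    hits += (min(scores, key=scores.get) == true_k)
print(f"correct k recovered in {hits}/{trials} trials")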

3,551 citations


Cites background from "A Cluster Separation Measure"

  • ...Davies and Bouldin (1979) provided a general framework for measures of cluster separation....

Journal ArticleDOI
02 Dec 2001
TL;DR: The fundamental concepts of clustering are introduced, the widely known clustering algorithms are surveyed in a comparative way, and the issues that remain under-addressed by recent algorithms are illustrated.
Abstract: Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business, and the social sciences. Especially in recent years, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper introduces the fundamental concepts of clustering while surveying the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of the clustering process: the quality assessment of the clustering results. This is also related to the inherent features of the data set under consideration. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by recent algorithms and gives the trends in the clustering process.

2,643 citations


Cites background from "A Cluster Separation Measure"

  • ...A simple choice for R_ij that satisfies the above conditions is (Davies and Bouldin, 1979): R_ij = (s_i + s_j)/d_ij....

  • ...The R_ij index is defined to satisfy the following conditions (Davies and Bouldin, 1979)... (restated in full after this list).

  • ...Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster c_i, are given in Davies and Bouldin (1979)....

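The conditions referenced in the excerpts above are usually stated as follows; this is a LaTeX paraphrase of the standard presentation of the 1979 paper, not a verbatim quote of the citing survey.

```latex
% Conditions on a cluster similarity measure R_{ij}, with s_i the
% dispersion of cluster i and d_{ij} the distance between clusters i and j:
%   (1) R_{ij} >= 0
%   (2) R_{ij} = R_{ji}
%   (3) R_{ij} = 0 iff s_i = s_j = 0
%   (4) if s_j = s_k and d_{ij} < d_{ik}, then R_{ij} > R_{ik}
%   (5) if d_{ij} = d_{ik} and s_j > s_k, then R_{ij} > R_{ik}
% A simple choice satisfying all five:
\[
  R_{ij} = \frac{s_i + s_j}{d_{ij}}
\]
```
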
Book
01 Jan 1984
TL;DR: Cluster analysis is a multivariate procedure for detecting natural groupings in data; it resembles discriminant analysis in one respect: the researcher seeks to classify a set of objects into subgroups although neither the number nor the members of the subgroups are known.
Abstract: SYSTAT provides a variety of cluster analysis methods on rectangular or symmetric data matrices. Cluster analysis is a multivariate procedure for detecting natural groupings in data. It resembles discriminant analysis in one respect: the researcher seeks to classify a set of objects into subgroups although neither the number nor the members of the subgroups are known.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering. Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. SYSTAT further provides five indices, viz., statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided. In the K-Clustering procedure SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning, and nine methods for selecting initial seeds for both.

The objects in these groups may be:

  • Cases (observations or rows of a rectangular data file). For example, suppose health indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.) are recorded for countries (cases); then developed nations may form a cluster separate from developing countries.
  • Variables (characteristics or columns of the data). For example, suppose causes of death (cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded for each U.S. state (case); the results show that accidents are relatively independent of the illnesses.
  • Cases and variables (individual entries in the data matrix). For example, certain wines are associated with good years of production, while other wines have other years that are better.

Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the same object to appear in more than one …
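
SYSTAT itself is proprietary, so as a neutral illustration of the hierarchical procedure described above (build a linkage tree, then cut it at a chosen number of clusters), here is a minimal SciPy sketch; the data, linkage method, and cut level are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy "health indicator" matrix: rows are countries, columns indicators.
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 4)),   # one group of cases
               rng.normal(4.0, 1.0, size=(10, 4))])  # a second, separated group

Z = linkage(X, method="ward")                    # one of the linkage methods
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)  # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
```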

2,533 citations


Cites background from "A Cluster Separation Measure"

  • ...Define s_i as the measure of dispersion of cluster c_i and d_ij as the dissimilarity measure between clusters c_i and c_j. Then the DB (Davies and Bouldin, 1979) index is defined as the average, over all clusters, of the largest ratio (s_i + s_j)/d_ij taken over j ≠ i (written out as a formula after this list)....

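In the standard notation (s_i the dispersion of cluster c_i, d_ij the dissimilarity between clusters c_i and c_j, n the number of clusters), the DB index referenced in the excerpt reads:

```latex
\[
  \mathrm{DB} \;=\; \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i}
  \frac{s_i + s_j}{d_{ij}}
\]
```
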
Journal ArticleDOI
TL;DR: The two-stage procedure, first using the SOM to produce the prototypes that are then clustered in the second stage, is found to perform well when compared with direct clustering of the data, and to reduce the computation time.
Abstract: The self-organizing map (SOM) is an excellent tool in the exploratory phase of data mining. It projects the input space onto prototypes of a low-dimensional regular grid that can be effectively utilized to visualize and explore properties of the data. When the number of SOM units is large, similar units need to be grouped, i.e., clustered, to facilitate quantitative analysis of the map and the data. In this paper, different approaches to clustering of the SOM are considered. In particular, the use of hierarchical agglomerative clustering and partitive clustering using K-means are investigated. The two-stage procedure, first using the SOM to produce the prototypes that are then clustered in the second stage, is found to perform well when compared with direct clustering of the data, and to reduce the computation time.
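
As a sketch of the two-stage procedure, the snippet below uses the third-party MiniSom package and scikit-learn; the grid size, iteration count, and number of clusters are illustrative assumptions, and the paper also investigates hierarchical agglomeration for the second stage.

```python
import numpy as np
from minisom import MiniSom            # third-party: pip install minisom
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# Stage 1: train a SOM; its units become a small set of prototypes.
som = MiniSom(8, 8, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X, 2000)
prototypes = som.get_weights().reshape(-1, X.shape[1])   # 64 prototypes

# Stage 2: cluster the prototypes instead of the full data set.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(prototypes)

# Map each data point to the cluster of its best-matching unit (BMU).
bmu = np.array([np.ravel_multi_index(som.winner(x), (8, 8)) for x in X])
labels = km.labels_[bmu]
```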

2,387 citations


Cites methods from "A Cluster Separation Measure"

  • ...In our simulations, we used the Davies–Bouldin index [13], which combines a within-cluster distance with a between-clusters distance....

Journal ArticleDOI
TL;DR: The R package NbClust provides 30 indices which determine the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results.
Abstract: Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than objects in different groups. Most of the clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation as regards its validity. The evaluation procedure has to tackle difficult problems such as the quality of the clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, programs are unavailable to test these indices and compare them. The R package NbClust has been developed for that purpose. It provides 30 indices which determine the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determine the most appropriate number of clusters for the data set of interest.
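
NbClust is an R package; as a rough Python analog of its core idea (evaluate several validity indices over candidate cluster counts and let them vote), here is a sketch with three indices from scikit-learn. The index set, data, and k range are illustrative assumptions; NbClust itself implements 30 indices and several clustering methods.

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
ks = range(2, 8)
labelings = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
             for k in ks}

# Each index casts one vote for its preferred k.
votes = [
    max(ks, key=lambda k: silhouette_score(X, labelings[k])),        # maximize
    min(ks, key=lambda k: davies_bouldin_score(X, labelings[k])),    # minimize
    max(ks, key=lambda k: calinski_harabasz_score(X, labelings[k])), # maximize
]
best_k, _ = Counter(votes).most_common(1)[0]
print(f"index votes: {votes} -> majority choice: k={best_k}")
```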

1,912 citations


Cites background from "A Cluster Separation Measure"

  • ...The Davies and Bouldin (1979) index is a function of the sum of ratios of within-cluster scatter to between-cluster separation....

  • ..."db" (Davies and Bouldin 1979) Minimum value of the index 11....

  • [Table 1: Indices implemented in SAS and R packages. The recoverable rows list Silhouette (Rousseeuw 1987), Hartigan (Hartigan 1975), Cindex (Hubert and Levin 1976), DB (Davies and Bouldin 1979), Ratkowsky (Ratkowsky and Lance 1978), Scott (Scott and Symons 1971), Marriot (Marriot 1971), Ball (Ball and Hall 1965), Trcovw (Milligan and Cooper 1985), Tracew (Milligan and Cooper 1985), Friedman (Friedman and Rubin 1967), Rubin (Friedman and Rubin 1967), and Dunn (Dunn 1974), each marked with the packages that implement it.]

  • ...The value of q minimizing DB(q) is regarded as specifying the number of clusters (Milligan and Cooper 1985; Davies and Bouldin 1979)....

References
Book
01 Jan 1974
TL;DR: The present work gives an account of basic principles and available techniques for the analysis and design of pattern processing and recognition systems.
Abstract: The present work gives an account of basic principles and available techniques for the analysis and design of pattern processing and recognition systems. Areas covered include decision functions, pattern classification by distance functions, pattern classification by likelihood functions, the perceptron and the potential function approaches to trainable pattern classifiers, statistical approach to trainable classifiers, pattern preprocessing and feature selection, and syntactic pattern recognition.

3,237 citations

Journal ArticleDOI
TL;DR: I was sitting before my TV set, a while back, watching Captain Video and pondering the organizational problems of psychologists, psychometricians, psychodiagnosticians, psycho-somatists, psychosomnabulists, and psychoceramics, and decided to enlist Captain Video's help to bring me from the Black Planet that superogalactian hypermetrician, Dr. Idnozs HcahscrorTenib, cosmos-famous discoverer of Serutan.
Abstract: I was sitting before my TV set, a while back, watching Captain Video and pondering the organizational problems of psychologists, psychometricians, psychodiagnosticians, psycho-somatists, psychosomnabulists, and psychoceramics (crack-pots to you). Wondering what I might do, in my small way, to help out, I decided to enlist Captain Video's help to bring me from the Black Planet that superogalactian hypermetrician, Dr. Idnozs HcahscrorTenib, cosmos-famous discoverer of Serutan. Why delay? The Galaxy was on its way, and in half a light year Dr. Tenib was at my side prepared to devote his gargantuan talents to the task. Seeing no point in confusing the good doctor by trying to describe to him the present administrative hodgepodge, I said, "Doctor, let's start from scratch. I want you to find out for me how these good people who are present at the annual meeting of the APA structure themselves? What families are represented? How many, or better, how few? And who belongs to each?" "We proceed," said the Doctor. "Bring sample of population; I measure." So we set out to design a sample. The problem presented some interesting theoretical aspects, but the final solution was relatively simple. We stationed representatives at each of the three state beverage stores and followed every third badge-wearing individual who came out of a store. We selected only outgoing patrons for obvious reasons. After assisting each respondent to unburden himself, we brought him to Dr. Idnozs (as we came to call him among ourselves) for study. "Now," murmured the Doctor, "we give tests. First is 'Draw-a-Psychiatrist Test.' " "We score this," he confided, "by if it gives horns." Presently we started on the physiological test battery. "We draw off saliva drop by drop," explained our idiot savant, "and see does he drool when we bring in Skinner Box." Later came the Peculiar Preference Blank. "Forced-choice, you know," whispered the Doctor. "Would you rather make mud pies or kiss gorgeous blonde?"

1,279 citations

Journal ArticleDOI
H. P. Friedman, J. Rubin
TL;DR: This paper attacks the problem of exploring the structure of multivariate data in search of “clusters” by using a computer procedure to obtain the “best” partition of n objects into g groups.
Abstract: This paper deals with methods of “cluster analysis”. In particular we attack the problem of exploring the structure of multivariate data in search of “clusters”. The approach taken is to use a computer procedure to obtain the “best” partition of n objects into g groups. A number of mathematical criteria for “best” are discussed and related to statistical theory. A procedure for optimizing the criteria is outlined. Some of the criteria are compared with respect to their behavior on actual data. Results of data analysis are presented and discussed.
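
Two of the criteria usually associated with this paper (cited as "Friedman" and "Rubin" in the NbClust table above) can be written in scatter-matrix notation; this is the standard textbook statement rather than a quote, with W the pooled within-group scatter matrix, B the between-group scatter matrix, and T = W + B the total scatter:

```latex
\[
  \max_{\text{partitions}} \; \operatorname{tr}\!\big(W^{-1} B\big)
  \qquad \text{and} \qquad
  \max_{\text{partitions}} \; \frac{\lvert T \rvert}{\lvert W \rvert},
  \qquad T = W + B
\]
```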

586 citations