scispace - formally typeset
Search or ask a question
Journal ArticleDOI

An examination of procedures for determining the number of clusters in a data set

01 Jun 1985-Psychometrika (Springer)-Vol. 50, Iss: 2, pp 159-179
TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters to provide a variety of clustering solutions.
Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.
Citations
More filters
01 Jan 1988

9,439 citations

Book
01 Jan 1988

8,586 citations

Journal ArticleDOI
TL;DR: A new and simple method to find indicator species and species assemblages characterizing groups of sites, and a new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed.
Abstract: This paper presents a new and simple method to find indicator species and species assemblages characterizing groups of sites The novelty of our approach lies in the way we combine a species relative abundance with its relative frequency of occurrence in the various groups of sites This index is maximum when all individuals of a species are found in a single group of sites and when the species occurs in all sites of that group; it is a symmetric indicator The statistical significance of the species indicator values is evaluated using a randomization procedure Contrary to TWINSPAN, our indicator index for a given species is independent of the other species relative abundances, and there is no need to use pseudospecies The new method identifies indicator species for typologies of species releves obtained by any hierarchical or nonhierarchical classification procedure; its use is independent of the classification method Because indicator species give ecological meaning to groups of sites, this method provides criteria to compare typologies, to identify where to stop dividing clusters into subsets, and to point out the main levels in a hierarchical classification of sites Species can be grouped on the basis of their indicator values for each clustering level, the heterogeneous nature of species assemblages observed in any one site being well preserved Such assemblages are usually a mixture of eurytopic (higher level) and stenotopic species (characteristic of lower level clusters) The species assemblage approach demonstrates the importance of the ''sampled patch size,'' ie, the diversity of sampled ecological combinations, when we compare the frequencies of core and satellite species A new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed A large data set of carabid beetle distributions in open habitats of Belgium is used as a case study to illustrate the new method

7,449 citations


Cites background from "An examination of procedures for de..."

  • ...This is a common question in cluster analysis because no single objective criterion receives general support (Milligan and Cooper 1985)....

    [...]

Journal ArticleDOI
TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts are illustrated.
Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.

5,744 citations

Journal ArticleDOI
12 May 2011-Nature
TL;DR: Three robust clusters (referred to as enterotypes hereafter) are identified that are not nation or continent specific and confirmed in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous.
Abstract: Our knowledge of species and functional composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about variation across the world. By combining 22 newly sequenced faecal metagenomes of individuals from four countries with previously published data sets, here we identify three robust clusters (referred to as enterotypes hereafter) that are not nation or continent specific. We also confirmed the enterotypes in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous. This indicates further the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis to understand microbial communities. Although individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.

5,566 citations

References
More filters
Book
01 Jan 1973
TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprosessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprosessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

13,647 citations

Book
01 Jan 1974
TL;DR: This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.
Abstract: Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organising multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques are applicable in a wide range of areas such as medicine, psychology and market research. This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering. Real life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. The book is comprehensive yet relatively non-mathematical, focusing on the practical aspects of cluster analysis.

9,857 citations

Journal ArticleDOI
TL;DR: A measure is presented which indicates the similarity of clusters which are assumed to have a data density which is a decreasing function of distance from a vector characteristic of the cluster which can be used to infer the appropriateness of data partitions.
Abstract: A measure is presented which indicates the similarity of clusters which are assumed to have a data density which is a decreasing function of distance from a vector characteristic of the cluster. The measure can be used to infer the appropriateness of data partitions and can therefore be used to compare relative appropriateness of various divisions of the data. The measure does not depend on either the number of clusters analyzed nor the method of partitioning of the data and can be used to guide a cluster seeking algorithm.

6,757 citations


"An examination of procedures for de..." refers background in this paper

  • ...Davies and Bouldin (1979) provided a general framework for measures of cluster separation....

    [...]

Book
01 Feb 1975

6,068 citations