
Some methods for classification and analysis of multivariate observations

01 Jan 1967-Vol. 1, pp 281-297
TL;DR: The k-means procedure partitions an N-dimensional population into k sets on the basis of a sample; the k-means concept generalizes the ordinary sample mean, and the procedure is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, S = {S_1, S_2, ..., S_k} is a partition of E^N, and u_i, i = 1, 2, ..., k, is the conditional mean of p over the set S_i, then $W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special
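The procedure described in the abstract can be made concrete with a short sketch. The code below is a minimal Python/NumPy illustration, with the caveat that it implements the familiar batch ("Lloyd") iteration rather than MacQueen's original sequential update of the means; the function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch: partition `points` (n x d array) into k
    sets so that the within-class variance W^2(S) tends to be low."""
    rng = np.random.default_rng(seed)
    # Initialize the k means with randomly chosen sample points.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest mean (forming the sets S_i).
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean as the conditional mean over its set,
        # keeping the old mean if a set happens to be empty.
        new_means = np.array([points[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means
```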


Citations
Journal ArticleDOI
TL;DR: An algorithm based on an enhanced sparse representation in the transform domain, completed by a specially developed collaborative Wiener filtering step, achieves state-of-the-art denoising performance in terms of both peak signal-to-noise ratio and subjective visual quality.
Abstract: We propose a novel image denoising strategy based on an enhanced sparse representation in transform domain. The enhancement of the sparsity is achieved by grouping similar 2D image fragments (e.g., blocks) into 3D data arrays which we call "groups." Collaborative filtering is a special procedure developed to deal with these 3D groups. We realize it using three successive steps: 3D transformation of a group, shrinkage of the transform spectrum, and inverse 3D transformation. The result is a 3D estimate that consists of the jointly filtered grouped image blocks. By attenuating the noise, the collaborative filtering reveals even the finest details shared by grouped blocks and, at the same time, it preserves the essential unique features of each individual block. The filtered blocks are then returned to their original positions. Because these blocks are overlapping, for each pixel, we obtain many different estimates which need to be combined. Aggregation is a particular averaging procedure which is exploited to take advantage of this redundancy. A significant improvement is obtained by a specially developed collaborative Wiener filtering. An algorithm based on this novel denoising strategy and its efficient implementation are presented in full detail; an extension to color-image denoising is also developed. The experimental results demonstrate that this computationally scalable algorithm achieves state-of-the-art denoising performance in terms of both peak signal-to-noise ratio and subjective visual quality.
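To make the three collaborative-filtering steps concrete, here is a hedged Python/NumPy sketch. It is not BM3D itself: the paper uses carefully chosen separable 3D transforms and thresholds, whereas this sketch uses a plain 3D FFT and a single heuristic threshold; the function name and the threshold constant are illustrative assumptions.

```python
import numpy as np

def collaborative_filter(group, noise_sigma, thresh_factor=2.7):
    """Sketch of the three-step collaborative filtering on one group of
    similar blocks. `group` has shape (n_blocks, b, b)."""
    # Step 1: 3D transform of the group (an FFT stands in here for the
    # separable transform used in the paper).
    spectrum = np.fft.fftn(group)
    # Step 2: shrinkage of the transform spectrum by hard-thresholding
    # small coefficients; the sqrt(size) factor compensates for the
    # unnormalized FFT and is a heuristic choice.
    threshold = thresh_factor * noise_sigma * np.sqrt(group.size)
    spectrum[np.abs(spectrum) < threshold] = 0.0
    # Step 3: inverse 3D transform yields the jointly filtered estimates
    # of all blocks in the group.
    return np.real(np.fft.ifftn(spectrum))
```

In the full algorithm these filtered blocks would then be returned to their positions and aggregated by weighted averaging over the overlapping estimates.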

7,912 citations

Journal ArticleDOI
TL;DR: A new and simple method to find indicator species and species assemblages characterizing groups of sites, and a new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed.
Abstract: This paper presents a new and simple method to find indicator species and species assemblages characterizing groups of sites. The novelty of our approach lies in the way we combine a species relative abundance with its relative frequency of occurrence in the various groups of sites. This index is maximum when all individuals of a species are found in a single group of sites and when the species occurs in all sites of that group; it is a symmetric indicator. The statistical significance of the species indicator values is evaluated using a randomization procedure. Contrary to TWINSPAN, our indicator index for a given species is independent of the other species relative abundances, and there is no need to use pseudospecies. The new method identifies indicator species for typologies of species relevés obtained by any hierarchical or nonhierarchical classification procedure; its use is independent of the classification method. Because indicator species give ecological meaning to groups of sites, this method provides criteria to compare typologies, to identify where to stop dividing clusters into subsets, and to point out the main levels in a hierarchical classification of sites. Species can be grouped on the basis of their indicator values for each clustering level, the heterogeneous nature of species assemblages observed in any one site being well preserved. Such assemblages are usually a mixture of eurytopic (higher level) and stenotopic species (characteristic of lower level clusters). The species assemblage approach demonstrates the importance of the "sampled patch size," i.e., the diversity of sampled ecological combinations, when we compare the frequencies of core and satellite species. A new way to present species-site tables, accounting for the hierarchical relationships among species, is proposed. A large data set of carabid beetle distributions in open habitats of Belgium is used as a case study to illustrate the new method.
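The index combines two ingredients: specificity (a species' mean abundance in one group relative to the sum of its mean abundances across all groups) and fidelity (the fraction of the group's sites where the species occurs). A minimal Python sketch of this computation follows, with illustrative names and the conventional 0-100 scaling; the randomization test for significance is omitted.

```python
import numpy as np

def indicator_values(abundance, groups):
    """Hedged sketch of the indicator value index: IndVal = specificity
    * fidelity * 100, maximal (100) when all individuals of a species
    are found in a single group and the species occurs in all its sites.
    abundance: (n_sites, n_species) counts; groups: length-n_sites labels."""
    abundance = np.asarray(abundance, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    # Mean abundance of each species within each group of sites.
    mean_abund = np.array([abundance[groups == g].mean(axis=0) for g in labels])
    # Specificity A: group mean relative to the sum of group means
    # (guarding against species absent from every site).
    totals = mean_abund.sum(axis=0, keepdims=True)
    specificity = np.divide(mean_abund, totals,
                            out=np.zeros_like(mean_abund), where=totals > 0)
    # Fidelity B: fraction of the group's sites where the species occurs.
    fidelity = np.array([(abundance[groups == g] > 0).mean(axis=0) for g in labels])
    return 100.0 * specificity * fidelity  # shape (n_groups, n_species)
```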

7,449 citations

Journal ArticleDOI
01 Jun 2010
TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.
Abstract: Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

6,601 citations

Journal ArticleDOI
16 Feb 2007-Science
TL;DR: A method called "affinity propagation" takes as input measures of similarity between pairs of data points; it found clusters with much lower error than other methods, and did so in less than one-hundredth the amount of time.
Abstract: Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such "exemplars" can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called "affinity propagation," which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
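The message passing alternates two updates: responsibilities r(i,k), how well point k suits as the exemplar for i relative to competing candidates, and availabilities a(i,k), the accumulated evidence that k should be an exemplar. Below is a hedged NumPy sketch of the standard vectorized form of these updates; the damping factor and iteration count are illustrative, and the paper's convergence checks are omitted.

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Sketch of affinity propagation. S: (n, n) similarity matrix;
    the diagonal S[k, k] ("preferences") influences how many exemplars
    emerge. Returns the exemplar index chosen for each point."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf          # exclude the winner
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k,k) as-is
        A_new = np.minimum(Rp.sum(axis=0, keepdims=True) - Rp, 0)
        np.fill_diagonal(A_new, Rp.sum(axis=0) - R.diagonal())
        A = damping * A + (1 - damping) * A_new
    # Each point's exemplar maximizes a(i,k) + r(i,k).
    return (A + R).argmax(axis=1)
```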

6,429 citations

Journal ArticleDOI
TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts are illustrated.
Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.

5,744 citations


Cites background from "Some methods for classification and..."

  • ...The effect of different types of kernel functions, which are rich in the literature, is also an interesting topic for further exploration....


References
Journal ArticleDOI
TL;DR: In this paper, a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical.
Abstract: A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
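A minimal sketch of this agglomerative procedure follows, using Ward's within-group error sum of squares as the objective function (the paper leaves the criterion to the investigator). It recomputes the objective from scratch at each step, so it is illustrative rather than efficient, and all names are assumptions.

```python
import numpy as np
from itertools import combinations

def hierarchical_grouping(points):
    """Reduce n singleton sets to one by repeatedly taking the union of
    the pair that minimizes the growth of the objective function (here
    the within-group error sum of squares). points: (n, d) array.
    Returns the list of merges, recording the full hierarchy."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    def ess(members):
        # Error sum of squares of one group around its mean.
        pts = points[members]
        return ((pts - pts.mean(axis=0)) ** 2).sum()
    while len(clusters) > 1:
        # Consider all n(n-1)/2 possible unions and select the one whose
        # objective-function increase is smallest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ess(clusters[ij[0]] + clusters[ij[1]])
                                  - ess(clusters[ij[0]]) - ess(clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```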

17,405 citations

Book
01 Jan 1953

10,512 citations

Book
01 Jan 1963
TL;DR: The authors continued the story of psychology with added research and enhanced content from the most dynamic areas of the field, such as cognition, gender and diversity studies, neuroscience and more, while at the same time using the most effective teaching approaches and learning tools.
Abstract: This new edition continues the story of psychology with added research and enhanced content from the most dynamic areas of the field--cognition, gender and diversity studies, neuroscience and more, while at the same time using the most effective teaching approaches and learning tools.

3,332 citations

01 Jan 1951
TL;DR: In this paper, it is shown that if experiment α is a sufficient statistic for a procedure equivalent to experiment β (α ≻ β), then α is more informative than β (α ⊃ β); the converse is proved for dichotomies, and the ordering is preserved under combination with an independent experiment γ.
Abstract: 1. Summary. Bohnenblust, Shapley, and Sherman [2] have introduced a method of comparing two sampling procedures or experiments; essentially their concept is that one experiment α is more informative than a second experiment β, α ⊃ β, if, for every possible risk function, any risk attainable with β is also attainable with α. If α is a sufficient statistic for a procedure equivalent to β, α ≻ β, it is shown that α ⊃ β. In the case of dichotomies, the converse is proved. Whether ≻ and ⊃ are equivalent in general is not known. Various properties of ≻ and ⊃ are obtained, such as the following: if α ⊃ β and γ is independent of both, then the combination (α, γ) ⊃ (β, γ). An application to a problem in 2 × 2 tables is discussed.
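Since the scanned notation is easy to misread, the two orderings can be restated in a short LaTeX block; the symbols ⊃ and ≻ follow the reconstructed abstract above and should be treated as an assumption about the original typography.

```latex
% \alpha \supset \beta : "more informative" -- every risk attainable
%                        with \beta is also attainable with \alpha.
% \alpha \succ \beta   : \alpha is a sufficient statistic for a
%                        procedure equivalent to \beta.
\[
  \alpha \succ \beta \;\Longrightarrow\; \alpha \supset \beta,
  \qquad
  \alpha \supset \beta,\ \gamma \text{ independent of both}
  \;\Longrightarrow\; (\alpha,\gamma) \supset (\beta,\gamma).
\]
```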

985 citations


"Some methods for classification and..." refers background in this paper

  • ...cases of the problem of optimal information structures as formulated by Marschak [11], [12]....
