
Some methods for classification and analysis of multivariate observations

01 Jan 1967-Vol. 1, pp 281-297
TL;DR: The paper describes the k-means process, which partitions an N-dimensional population into k sets on the basis of a sample; the k-means concept generalizes the ordinary sample mean, and the procedure is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E_N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of p over the set $S_i$, then $w^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special
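
To make the process concrete, here is a minimal Python sketch of the sequential k-means procedure the abstract describes: the first k samples seed the means, and each subsequent sample is assigned to the nearest mean, which is then updated to the running average of its assigned samples. The function names and toy data are illustrative assumptions, not from the paper.

```python
import numpy as np

def sequential_kmeans(samples, k):
    """Sequential k-means sketch: seed the k means with the first k
    samples, then assign each later sample to its nearest mean and
    update that mean incrementally."""
    means = [np.asarray(x, dtype=float).copy() for x in samples[:k]]
    counts = [1] * k
    for x in samples[k:]:
        x = np.asarray(x, dtype=float)
        i = min(range(k), key=lambda j: np.sum((x - means[j]) ** 2))
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]  # running-average update
    return np.array(means)

def within_class_variance(samples, means):
    """Empirical analogue of w^2(S): mean squared distance from each
    sample to its nearest mean."""
    d2 = ((samples[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two well-separated Gaussian clusters in the plane (toy data)
    data = np.vstack([rng.normal(c, 0.3, size=(200, 2)) for c in (0.0, 3.0)])
    rng.shuffle(data)
    means = sequential_kmeans(data, k=2)
    print(means)
    print(within_class_variance(data, means))
```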


Citations
Posted Content
01 Jan 2001
TL;DR: This paper gives a lightning overview of data mining and its relation to statistics, with particular emphasis on tools for the detection of adverse drug reactions.
Abstract: The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics. The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local "memory-based" models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.

3,765 citations

Journal ArticleDOI
01 Sep 1990
TL;DR: Regularization networks, a class of three-layer networks mathematically related to radial basis functions (which are mainly used for strict interpolation tasks), are discussed; two extensions of the regularization approach are presented, along with the approach's connections to splines, regularization, Bayes formulation, and clustering.
Abstract: The problem of the approximation of nonlinear mappings (especially continuous mappings) is considered. Regularization theory and a theoretical framework for approximation (based on regularization techniques) that leads to a class of three-layer networks called regularization networks are discussed. Regularization networks are mathematically related to radial basis functions, which are mainly used for strict interpolation tasks. Learning as approximation and learning as hypersurface reconstruction are discussed. Two extensions of the regularization approach are presented, along with the approach's connections to splines, regularization, Bayes formulation, and clustering. The theory of regularization networks is generalized to a formulation that includes task-dependent clustering and dimensionality reduction. Applications of regularization networks are discussed.
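
As a concrete companion to this abstract, the sketch below implements strict radial-basis-function interpolation, the task the abstract says regularization networks are mathematically related to. The Gaussian basis, the width parameter, and the function names are assumptions for illustration; adding a ridge term to the Gram matrix (see the comment) gives a regularized variant rather than strict interpolation.

```python
import numpy as np

def rbf_interpolate(X, y, width=1.0):
    """Strict RBF interpolation sketch: one Gaussian basis function is
    centered on each training point, and the weights solve G w = y."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * width ** 2))  # Gram matrix of the basis
    # For a regularized (non-interpolating) variant, solve with
    # G + lam * np.eye(len(X)) instead of G.
    w = np.linalg.solve(G, y)

    def predict(Xq):
        d2q = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2q / (2.0 * width ** 2)) @ w

    return predict

if __name__ == "__main__":
    X = np.array([[0.0], [1.0], [2.0]])
    y = np.array([0.0, 1.0, 0.0])
    f = rbf_interpolate(X, y, width=0.7)
    print(f(X))  # reproduces y up to numerical error (strict interpolation)
```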

3,595 citations

Journal ArticleDOI
TL;DR: This chapter presents the basic schemes of variable neighborhood search (VNS) and some of its extensions, and describes five families of applications in which VNS has proven to be very successful.

3,572 citations

Journal ArticleDOI
TL;DR: From basic techniques to the state of the art, this paper attempts to present a comprehensive survey of CF techniques, which can serve as a roadmap for research and practice in this area.
Abstract: As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, privacy protection, etc., and their possible solutions. We then present three main categories of CF techniques: memory-based, model-based, and hybrid CF algorithms (which combine CF with other recommendation techniques), with examples of representative algorithms in each category and an analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state of the art, we attempt to present a comprehensive survey of CF techniques, which can serve as a roadmap for research and practice in this area.
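
As an illustration of the memory-based category the abstract mentions, here is a minimal user-based CF sketch: cosine similarity over co-rated items, followed by a similarity-weighted average of the k most similar users' ratings. The dense rating-matrix layout and the function name are assumptions for illustration, not from the survey.

```python
import numpy as np

def user_based_cf(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated
    the item; R is a dense user x item matrix with 0 meaning unrated."""
    candidates = np.where((R[:, item] > 0) & (np.arange(len(R)) != user))[0]
    sims = []
    for v in candidates:
        mask = (R[user] > 0) & (R[v] > 0)  # items rated by both users
        if not mask.any():
            continue
        a, b = R[user, mask], R[v, mask]
        sims.append((a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), v))
    sims.sort(reverse=True)
    top = sims[:k]
    if not top:
        return 0.0  # no usable neighbors
    return sum(s * R[v, item] for s, v in top) / sum(s for s, _ in top)

if __name__ == "__main__":
    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [1, 0, 4, 4]])
    print(user_based_cf(R, user=0, item=2))  # predicted from neighbor ratings
```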

3,406 citations


Cites methods from "Some methods for classification and..."

  • ...A commonly-used partitioning method is k-means, proposed by MacQueen [78], which has two main advantages: relative efficiency and easy implementation....


Posted Content
TL;DR: In this article, the authors present evidence that firms' patents, profits and market value are systematically related to the technological position of firms' research programs, and that firms are seen to "move" in technology space in response to the pattern of contemporaneous profits at different positions.
Abstract: This paper presents evidence that firms' patents, profits and market value are systematically related to the "technological position" of firms' research programs. Further, firms are seen to "move" in technology space in response to the pattern of contemporaneous profits at different positions. These movements tend to erode excess returns. "Spillovers" of R&D are modelled by examining whether the R&D of neighboring firms in technology space has an observable impact on the firm's R&D success. Firms whose neighbors do much R&D produce more patents per dollar of their own R&D, with a positive interaction that gives high R&D firms the largest benefit from spillovers. In terms of profit and market value, however, there are both positive and negative effects of nearby firms' R&D. The net effect is positive for high R&D firms, but firms with R&D about one standard deviation below the mean are made worse off overall by the R&D of others.

3,313 citations

References
Journal ArticleDOI
TL;DR: In this paper, a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical.
Abstract: A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
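
The procedure is straightforward to express in code. The sketch below instantiates the investigator-chosen objective function as Ward's within-group sum of squares (one common choice); the function names and toy points are illustrative assumptions.

```python
import numpy as np

def ess(group):
    """Within-group (error) sum of squares for a list of points."""
    g = np.asarray(group, dtype=float)
    return ((g - g.mean(axis=0)) ** 2).sum()

def merge_once(groups):
    """Reduce n groups to n - 1: among all n(n-1)/2 pairwise unions,
    merge the pair whose union least increases the total ESS."""
    best = None
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            cost = ess(groups[i] + groups[j]) - ess(groups[i]) - ess(groups[j])
            if best is None or cost < best[0]:
                best = (cost, i, j)
    _, i, j = best
    merged = groups[i] + groups[j]
    return [g for t, g in enumerate(groups) if t not in (i, j)] + [merged]

def hierarchy(points):
    """Repeat the merge until one group remains, recording each stage."""
    groups = [[list(p)] for p in points]
    stages = [groups]
    while len(groups) > 1:
        groups = merge_once(groups)
        stages.append(groups)
    return stages

if __name__ == "__main__":
    for stage in hierarchy([(0, 0), (0, 1), (5, 5), (5, 6)]):
        print(stage)
```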

17,405 citations

Book
01 Jan 1953

10,512 citations

Book
01 Jan 1963
TL;DR: The authors continue the story of psychology with added research and enhanced content from the most dynamic areas of the field, such as cognition, gender and diversity studies, and neuroscience, while using the most effective teaching approaches and learning tools.
Abstract: This new edition continues the story of psychology with added research and enhanced content from the most dynamic areas of the field (cognition, gender and diversity studies, neuroscience and more), while at the same time using the most effective teaching approaches and learning tools.

3,332 citations

01 Jan 1951
TL;DR: It is shown that if one experiment α is a sufficient statistic for a procedure equivalent to a second experiment β (α > β), then α is more informative than β (α ⊇ β); the converse is proved for dichotomies, and if α ⊇ β and γ is independent of both, then the combination (α, γ) ⊇ (β, γ).
Abstract: 1. Summary. Bohnenblust, Shapley, and Sherman [2] have introduced a method of comparing two sampling procedures or experiments; essentially their concept is that one experiment α is more informative than a second experiment β, α ⊇ β, if, for every possible risk function, any risk attainable with β is also attainable with α. If α is a sufficient statistic for a procedure equivalent to β, α > β, it is shown that α ⊇ β. In the case of dichotomies, the converse is proved. Whether > and ⊇ are equivalent in general is not known. Various properties of > and ⊇ are obtained, such as the following: if α ⊇ β and γ is independent of both, then the combination (α, γ) ⊇ (β, γ). An application to a problem in 2 × 2 tables is discussed.
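
In modern notation, the ordering described in this abstract can be stated roughly as follows; this is a hedged formalization, and the risk notation R(θ, δ) is assumed rather than quoted from the paper.

```latex
% "\alpha is more informative than \beta" (written \alpha \supseteq \beta):
% every risk attainable with experiment \beta is also attainable with \alpha.
\[
\alpha \supseteq \beta
\quad\Longleftrightarrow\quad
\text{for every loss and every procedure } \delta_\beta \text{ based on } \beta,
\ \exists\,\delta_\alpha \text{ based on } \alpha \text{ such that }
R(\theta,\delta_\alpha) \le R(\theta,\delta_\beta) \ \text{for all } \theta .
\]
```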

985 citations


"Some methods for classification and..." refers background in this paper

  • ...cases of the problem of optimal information structures as formulated by Marschak [11], [12]....
