Showing papers in "Journal of Classification in 1985"


Journal Article
J. A. Hartigan
TL;DR: A number of statistical models for forming and evaluating clusters are reviewed, including hierarchical algorithms, mixture methods, and the DIP test for the number of modes; the failure of likelihood tests for the number of mixture components is noted.
Abstract: A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don't do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation. Mixture methods are examined, related to k-means, and the failure of likelihood tests for the number of components is noted. The DIP test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.

353 citations
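
The abstract's remark that single linkage is related to the minimum spanning tree can be illustrated with a short sketch. The function name `single_linkage_clusters`, the use of squared Euclidean distance, and the choice to stop at a fixed number of clusters `k` are assumptions made for this illustration and are not taken from the paper. The sketch runs Kruskal's algorithm and stops when `k` components remain, which gives the same partition as cutting the single-linkage dendrogram into `k` clusters.

```python
from itertools import combinations


def single_linkage_clusters(points, k):
    """Partition points into k single-linkage clusters via Kruskal's MST algorithm."""
    n = len(points)
    parent = list(range(n))  # union-find forest, one root per component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def dist2(a, b):
        # Squared Euclidean distance; it orders pairs the same way as the distance.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    edges = sorted(
        (dist2(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    components = n
    for _, i, j in edges:
        if components <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge belongs to the minimum spanning tree
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Sorting all pairs costs O(n^2 log n) time, which is fine for a sketch but not for large data sets.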


Journal Article
TL;DR: Algorithms are described that exploit a special representation of the clusters of any tree T in R_n, one that permits testing in constant time whether a given cluster exists in T, and enable well-known indices of consensus between two trees to be computed in O(n) time.
Abstract: Let R_n denote the set of rooted trees with n leaves in which: the leaves are labeled by the integers in {1, ..., n}; and among interior vertices only the root may have degree two. Associated with each interior vertex v in such a tree is the subset, or cluster, of leaf labels in the subtree rooted at v. Cluster {1, ..., n} is called trivial. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any k trees in R_n, the strict consensus tree C(T_1, ..., T_k) is that tree in R_n containing exactly those clusters common to every one of the k trees. Similarity between trees T_1 and T_2 in R_n is measured by the number S(T_1, T_2) of nontrivial clusters in both T_1 and T_2; dissimilarity, by the number D(T_1, T_2) of clusters in T_1 or T_2 but not in both. Algorithms are known to compute C(T_1, ..., T_k) in O(kn^2) time, and S(T_1, T_2) and D(T_1, T_2) in O(n^2) time. I propose a special representation of the clusters of any tree T in R_n, one that permits testing in constant time whether a given cluster exists in T. I describe algorithms that exploit this representation to compute C(T_1, ..., T_k) in O(kn) time, and S(T_1, T_2) and D(T_1, T_2) in O(n) time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in O(n) time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves.

256 citations
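
The cluster-based measures defined in the abstract can be made concrete with a naive sketch. This is not the constant-time cluster representation the paper proposes: it simply collects every cluster as a `frozenset` and relies on hashed set operations. The nested-tuple tree encoding and all function names are assumptions made for this illustration.

```python
def clusters(tree):
    """Return the clusters (leaf-label sets of interior vertices) of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, int):  # a leaf carries a single integer label
            return frozenset([node])
        labels = frozenset().union(*(leaves(child) for child in node))
        found.add(labels)  # one cluster per interior vertex
        return labels

    leaves(tree)
    return found


def similarity(t1, t2):
    """S(T1, T2): number of nontrivial clusters present in both trees."""
    c1, c2 = clusters(t1), clusters(t2)
    trivial = max(c1, key=len)  # the root cluster {1, ..., n}
    return len((c1 & c2) - {trivial})


def dissimilarity(t1, t2):
    """D(T1, T2): number of clusters in one tree but not in both."""
    return len(clusters(t1) ^ clusters(t2))


def strict_consensus_clusters(*trees):
    """Clusters common to every tree; these define the strict consensus tree."""
    return set.intersection(*(clusters(t) for t in trees))


t1 = ((1, 2), (3, 4))
t2 = (((1, 2), 3), 4)
print(similarity(t1, t2), dissimilarity(t1, t2))  # prints "1 2"
```

Here the two trees share only the nontrivial cluster {1, 2}, while {3, 4} and {1, 2, 3} each occur in just one tree.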


Journal Article
TL;DR: The tree obtained by regrafting branches on to a largest common pruned tree is shown to contain all the classes present in the strict consensus tree.
Abstract: Given two or more dendrograms (rooted tree diagrams) based on the same set of objects, ways are presented of defining and obtaining common pruned trees. Bounds on the size of a largest common pruned tree are introduced, as is a categorization of objects according to whether they belong to all, some, or no largest common pruned trees. Also described is a procedure for regrafting pruned branches, yielding trees for which one can assess the reliability of the depicted relationships. The tree obtained by regrafting branches on to a largest common pruned tree is shown to contain all the classes present in the strict consensus tree. The theory is illustrated by application to two classifications of a set of forty-nine stratigraphical pollen spectra.

221 citations


Journal Article
TL;DR: In this paper, the properties of several significance tests for distinguishing between the hypothesis H of a homogeneous population and an alternative A involving heterogeneity, with emphasis on the case of multidimensional observations, are investigated.
Abstract: We investigate the properties of several significance tests for distinguishing between the hypothesis H of a “homogeneous” population and an alternative A involving “clustering” or “heterogeneity,” with emphasis on the case of multidimensional observations x_1, ..., x_n ∈ ℝ^p. Four types of test statistics are considered: the (s-th) largest gap between observations, their mean distance (or similarity), the minimum within-cluster sum of squares resulting from a k-means algorithm, and the resulting maximum F statistic. The asymptotic distributions under H are given for n → ∞, and the asymptotic power of the tests is derived for neighboring alternatives.

151 citations
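
Two of the four statistics named in the abstract are simple enough to sketch directly. The function names are hypothetical, the gap statistic is shown only for univariate data (the abstract does not say how gaps are defined in higher dimensions), and Euclidean distance is assumed for the mean distance. The sketch computes the statistics only; the null distributions and critical values come from the paper and are not reproduced here.

```python
import math
from itertools import combinations


def sth_largest_gap(xs, s=1):
    """s-th largest gap between consecutive ordered univariate observations."""
    ordered = sorted(xs)
    gaps = sorted((b - a for a, b in zip(ordered, ordered[1:])), reverse=True)
    return gaps[s - 1]


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of p-dimensional observations."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

A large gap, or a mean distance far from its value under homogeneity, is what such tests would treat as evidence of clustering.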


Journal Article
TL;DR: In this article, an approach to numerical classification is described, which treats the assignment of objects to types as a continuous variable, called an assignment measure, which allows one not only to determine the types of objects, but also to see relationships among the objects of the same type and among the types themselves.
Abstract: An approach to numerical classification is described, which treats the assignment of objects to types as a continuous variable, called an assignment measure. Describing a classification by an assignment measure allows one not only to determine the types of objects, but also to see relationships among the objects of the same type and among the types themselves.

129 citations


Journal Article
TL;DR: It is shown that by appropriate specification of the underlying model, the mixture maximum likelihood approach to clustering can be applied in the context of a three-way table and is illustrated using a soybean data set which consists of multiattribute measurements on a number of genotypes each grown in several environments.
Abstract: Clustering or classifying individuals into groups such that there is relative homogeneity within the groups and heterogeneity between the groups is a problem which has been considered for many years. Most available clustering techniques are applicable only to a two-way data set, where one of the modes is to be partitioned into groups on the basis of the other mode. Suppose, however, that the data set is three-way. Then what is needed is a multivariate technique which will cluster one of the modes on the basis of both of the other modes simultaneously. It is shown that by appropriate specification of the underlying model, the mixture maximum likelihood approach to clustering can be applied in the context of a three-way table. It is illustrated using a soybean data set which consists of multiattribute measurements on a number of genotypes each grown in several environments. Although the problem is set in the framework of clustering genotypes, the technique is applicable to other types of three-way data sets.

86 citations


Journal Article
TL;DR: A new methodology which simultaneously estimates in a least-squares fashion both an ultrametric tree and respective variable weightings for profile data that have been converted into (weighted) Euclidean distances is presented.
Abstract: This paper presents the development of a new methodology which simultaneously estimates in a least-squares fashion both an ultrametric tree and respective variable weightings for profile data that have been converted into (weighted) Euclidean distances. We first review the relevant classification literature on this topic. The new methodology is presented including the alternating least-squares algorithm used to estimate the parameters. The method is applied to a synthetic data set with known structure as a test of its operation. An application of this new methodology to ethnic group rating data is also discussed. Finally, extensions of the procedure to model additive, multiple, and three-way trees are mentioned.

57 citations
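
Distances read off an ultrametric tree satisfy the three-point condition d(i, j) <= max(d(i, k), d(k, j)) for every triple of objects. The sketch below only checks that condition on a given distance matrix; it is not the alternating least-squares procedure of the paper, and the function name and tolerance parameter are assumptions made for this illustration.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-9):
    """Check the ultrametric inequality on a square distance matrix d."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[k][j]) + tol
        for i, j, k in permutations(range(n), 3)
    )
```

A fitted tree could be checked this way by passing in the matrix of its path-height distances.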


Journal Article
TL;DR: The observable effects of evolution are strong enough to be detected in classifications constructed before the acceptance of evolutionary theory; and traditional classifications can contain substantial scientific information despite their reliance on incompletely understood processes of judgment.
Abstract: Relative frequency of genera as a function of number of species per genus is plotted for six eighteenth-century classifications: Linnaeus' classifications of animals, plants, minerals, and diseases, and Sauvages' classifications of plants and diseases. The distributions for animals and plants form positively skewed hollow curves similar but not identical to those found in modern biological classifications and predicted by mathematical models of evolution. The distributions for minerals and diseases, however, are more nearly symmetric and convex. The difference between the eighteenth-century and modern classifications of animals and plants probably reflects psychological properties of the taxonomists' judgments; but the difference between the classifications of animals and plants and those of minerals and diseases reflects evolutionary properties of the materials classified, since all six classifications were constructed by the same taxonomists using the same methods. Consequently, the observable effects of evolution are strong enough to be detected in classifications constructed before the acceptance of evolutionary theory; and traditional classifications can contain substantial scientific information despite their reliance on incompletely understood processes of judgment.

16 citations


Journal Article
TL;DR: To enable PLL methods to be used when the number n of objects being clustered is large, this work describes an efficient PLL algorithm that operates in O(n^2 log n) time and O(n^2) space.
Abstract: Proportional link linkage (PLL) clustering methods are a parametric family of monotone invariant agglomerative hierarchical clustering methods. This family includes the single, minimedian, and complete linkage clustering methods as special cases; its members are used in psychological and ecological applications. Since the literature on clustering space distortion is oriented to quantitative input data, we adapt its basic concepts to input data with only ordinal significance and analyze the space distortion properties of PLL methods. To enable PLL methods to be used when the number n of objects being clustered is large, we describe an efficient PLL algorithm that operates in O(n^2 log n) time and O(n^2) space.

15 citations


Journal Article
TL;DR: Clustering of dyad distributions is proposed as a method for finding appropriate models for dyad independence and illustrated by analyzing how cooperative learning methods affect friendship data for school children.
Abstract: Existing statistical models for network data that are easy to estimate and fit are based on the assumption of dyad independence or conditional dyad independence if the individuals are categorized into subgroups. We discuss how such models might be overparameterized and argue that there is a need for subgrouping methods to find appropriate models. We propose clustering of dyad distributions as such a method and illustrate it by analyzing how cooperative learning methods affect friendship data for school children.

13 citations


Journal Article
TL;DR: This paper considers other ways to measure buyer similarity in conjoint analysis and compares the resulting measures to the traditional use of part-worth commonalities.
Abstract: In the commercial application of conjoint analysis, it is not unusual to compute pairwise similarity measures for buyers based on their commonality across part-worth utilities. The resulting similarity matrix may then be processed by various clustering techniques. This paper considers other ways to measure buyer similarity in conjoint analysis and compares the resulting measures to the traditional use of part-worth commonalities. An empirical example is used to illustrate the approaches and compare their results.