
Showing papers on "Cluster analysis published in 1971"


Journal ArticleDOI
TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Abstract: Many intuitively appealing methods have been suggested for clustering data; however, interpretation of their results has been hindered by the lack of objective criteria. This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data. These criteria depend on a measure of similarity between two different clusterings of the same set of data; the measure essentially considers how each pair of data points is assigned in each clustering.

6,179 citations
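The pair-counting similarity measure this abstract describes is what later became known as the Rand index. A minimal Python sketch (the function name and toy labelings are my own, not the paper's):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree:
    both put the pair in the same cluster, or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Identical partitions (up to label renaming) agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Because the measure compares pair assignments rather than labels, it is unaffected by how clusters happen to be numbered in either clustering.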


Journal ArticleDOI
TL;DR: A family of graph-theoretical algorithms based on the minimal spanning tree are capable of detecting several kinds of cluster structure in arbitrary point sets; description of the detected clusters is possible in some cases by extensions of the method.
Abstract: A family of graph-theoretical algorithms based on the minimal spanning tree are capable of detecting several kinds of cluster structure in arbitrary point sets; description of the detected clusters is possible in some cases by extensions of the method. Development of these clustering algorithms was based on examples from two-dimensional space because we wanted to copy the human perception of gestalts or point groupings. On the other hand, all the methods considered apply to higher dimensional spaces and even to general metric spaces. Advantages of these methods include determinacy, easy interpretation of the resulting clusters, conformity to gestalt principles of perceptual organization, and invariance of results under monotone transformations of interpoint distance. Brief discussion is made of the application of cluster detection to taxonomy and the selection of good feature spaces for pattern recognition. Detailed analyses of several planar cluster detection problems are illustrated by text and figures. The well-known Fisher iris data, in four-dimensional space, have been analyzed by these methods also. PL/1 programs to implement the minimal spanning tree methods have been fully debugged.

1,832 citations
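A compact sketch of the minimal-spanning-tree approach: grow the MST, delete edges much longer than the average tree edge (a crude stand-in for Zahn-style "inconsistent edge" tests; the threshold factor is an assumption), and read clusters off the surviving connected components:

```python
import math

def mst_clusters(points, factor=2.0):
    """Build the minimal spanning tree with Prim's algorithm, then
    delete every edge longer than `factor` times the mean MST edge
    length; the connected components that remain are the clusters."""
    n = len(points)
    dist = lambda i, j: math.dist(points[i], points[j])
    in_tree, edges = {0}, []
    while len(in_tree) < n:                       # grow the MST
        i, j = min(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: dist(*e))
        in_tree.add(j)
        edges.append((i, j))
    mean_len = sum(dist(i, j) for i, j in edges) / len(edges)
    kept = [e for e in edges if dist(*e) <= factor * mean_len]
    parent = list(range(n))                       # union-find on survivors
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

labels = mst_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)])
print(len(set(labels)))  # -> 2
```

Since the MST depends only on the ordering of interpoint distances, the result is invariant under monotone transformations of distance, as the abstract notes.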




Journal ArticleDOI
TL;DR: In this paper, the authors proposed an approach motivated by decision theory to restrict their attention to admissible decision rules, which eliminates bad rules, though additional information is necessary to select the best rule from among the admissible rules.
Abstract: SUMMARY Since it is usually impossible to determine a 'best' clustering procedure, admissible clustering procedures are suggested. Let A denote some property which should be satisfied by any reasonable procedure either in general or when used in a special application. Any procedure which satisfies A is called A-admissible. Nine admissibility conditions are defined and several standard clustering methods are compared with them. This paper attacks the perplexing problem of choosing a clustering procedure from among the myriad of procedures now proposed. Almost always in practice, not enough is known about the a priori conditions, the losses, etc., to determine a 'best' procedure. How, then, should one proceed? We propose an approach motivated by decision theory. Decision theory tells us to restrict our attention to admissible decision rules. This eliminates obviously bad rules, though additional information is necessary to select the best rule from among the admissible rules. Our suggestion is therefore to formulate properties which any reasonable procedure should satisfy and call a procedure satisfying them admissible. Requiring admissibility again eliminates obviously bad clustering algorithms but does not attempt to specify the best method. We will specify several types of admissibility, one for each desirable property we define. Given some condition A, any clustering procedure satisfying A will be called A-admissible. Table 1 lists possible admissibility conditions versus some common clustering procedures. It specifies which conditions are satisfied by the various procedures and which are not. Given a specific problem, and a more complete table, a user could decide on a set of admissibility conditions and then look to the table for his admissible techniques. Two papers which take a somewhat similar approach but use different nomenclature are those by Rubin (1967) and by Jardine & Sibson (1968). The latter list other papers which compare clustering methods. Some additions to this list may be found in the references below.

199 citations




Journal ArticleDOI
01 Dec 1971-Nature
TL;DR: Automatic clustering methods are shown to provide a basis for adaptive pattern recognition when neither the number nor the specifications of classes are known in advance.
Abstract: Automatic clustering methods are shown to provide a basis for adaptive pattern recognition when neither the number nor the specifications of classes are known in advance.

96 citations




Journal ArticleDOI
M. R. Hoare, P. Pal
01 Mar 1971-Nature
TL;DR: In this article, the authors describe the geometrical features of the clustering of small numbers of interacting particles, and present an approach that is quite different from others which consider the problem as one of finding dense packings of spheres.
Abstract: This article describes the chief geometrical features of the clustering of small numbers of interacting particles. The approach is quite different from others which consider the problem as one of finding dense packings of spheres.

76 citations


Journal ArticleDOI
TL;DR: In this paper, the observational evidence for (and against) hierarchical clustering of galaxies and clusters on various characteristic length scales, from 0.1 Mpc to at least 100 Mpc, is reviewed.
Abstract: The nonrandom clustering of galaxies and clusters of galaxies has attracted increasing attention in the past 40 years. The mathematical tests and criteria used to distinguish between an observed distribution and a Poisson, i.e., random, statistically uniform distribution, are outlined. The observational evidence for (and against) hierarchical clustering of galaxies and clusters on various characteristic length scales λ, from 0.1 Mpc to at least 100 Mpc, is reviewed. Some cosmological implications and possible origins of the clustering spectrum are discussed.
Key words: galaxies - clusters of galaxies
Contents: 1. Introduction; 2. Subclustering: pairs and multiplets (λ ≈ 0.1 Mpc); 3. Small-scale clustering: groups and small clusters (λ ≈ 1 Mpc); 4. Statistical principles; 5. Index of clumpiness and dispersion-subdivision curves; 6. Correlation techniques and power spectrum analysis; 7. Galactic absorption effects (A. Average effects; B. Fluctuation effects); 8. Medium-scale clustering: large clusters, cloud complexes, and superclusters (λ ≈ 10 Mpc); 9. The distribution of rich clusters (A. The Zwicky Catalog; B. The Abell Catalog); 10. Large-scale clustering and density gradients (λ ≈ 100 Mpc); 11. Galaxy clustering and the density-radius relation; 12. The spectrum of clustering and its origin


Journal ArticleDOI
TL;DR: In this paper, the authors extended Mantel's approach for testing space-time clustering of a single set of points to test for clustering between two such sets of points.
Abstract: SUMMARY Mantel's approach for testing space-time clustering of a single set of points is extended to test for clustering between two such sets of points. Randomization tests are proposed for two situations: (a) one set is considered fixed and the other random and (b) both sets of points are considered random. For each situation an empirical randomization test and its normal approximation are given. In either case a variety of space and time distance measures may be used. The adequacy of the normal approximation for a contrived example was determined empirically using two favored measures, the 0, 1 indicator function, and the reciprocal of distance (time) plus a constant. As examples for actual data, the reciprocal measure is used to test for space-time clustering between a set of dog and a set of cat lymphoma cases, and the indicator measure is used for underground nuclear tests and earthquakes.
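A hedged sketch of randomization situation (a) with the reciprocal measure: hold one set fixed, repeatedly scramble the space-time pairing within the other, and count how often the scrambled statistic matches or beats the observed one. The exact statistic, the one-dimensional "space," and the constants are illustrative, not the paper's formulation:

```python
import random

def closeness(d, c=1.0):
    """The reciprocal measure: 1 / (distance + constant)."""
    return 1.0 / (d + c)

def cross_statistic(set_a, set_b):
    """Sum, over all pairs with one case from each set, of the product
    of spatial and temporal closeness (large when the two sets
    cluster together in space and time)."""
    return sum(closeness(abs(xa - xb)) * closeness(abs(ta - tb))
               for xa, ta in set_a for xb, tb in set_b)

def randomization_p(set_a, set_b, n_perm=999, seed=0):
    """Situation (a): set_a fixed, the space-time pairing within
    set_b scrambled; returns the randomization p-value."""
    rng = random.Random(seed)
    observed = cross_statistic(set_a, set_b)
    times = [t for _, t in set_b]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(times)
        shuffled = [(x, t) for (x, _), t in zip(set_b, times)]
        hits += cross_statistic(set_a, shuffled) >= observed
    return (hits + 1) / (n_perm + 1)
```

For two sets whose cases sit at matching positions and times, the observed statistic exceeds almost every scramble and the p-value is small.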

Journal ArticleDOI
TL;DR: Mathematical evaluation measures to characterize the effect of known erroneous performance by stemming routines are presented, and an expanded probabilistic model is introduced to handle a more general case in which any element need not belong unambiguously to a single cluster.
Abstract: This paper presents mathematical evaluation measures to characterize the effect of known erroneous performance by stemming routines, and generalizes these procedures to other types of nonstatistical clustering algorithms. When clusters, or groups of intrinsically related elements, are split into smaller groups (by under-matching the elements), there is a loss in recall in information retrieval; larger groups (caused by over-matching) induce a loss in precision or relevance. The magnitude of error is taken to be a function of frequencies of cluster elements. When these are words in a subject-term index generated by a stemming algorithm, retrieval capability is also affected by the strength of the algorithm, the size and content of the stemmed index, and the number of words in a query. The present Project Intrex stemming algorithm has estimated stemming-error losses of 4% in recall and 1% in relevance on one-word queries; the former could be reduced to almost zero by straightforward corrections of known errors in the algorithm. An expanded probabilistic model is introduced to handle a more general case in which any element need not belong unambiguously to a single cluster. Error evaluation in document classification and thesauri also is discussed in broad terms.



Journal ArticleDOI
TL;DR: An approach to clustering and decision making is presented where a priori problem knowledge is inserted interactively in the form of subcategory mean vectors and covariance matrices and in the expert's confidence that these means and covariances accurately characterize the category.
Abstract: An approach to clustering and decision making is presented where a priori problem knowledge is inserted interactively. The problem knowledge inserted is in the form of subcategory mean vectors and covariance matrices and in the expert's confidence that these means and covariances accurately characterize the category. Then observations of patterns from the category are used to update these a priori supplied means and covariances. The extent to which new observations update the a priori values depends upon the expert's a priori confidence.
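The updating scheme can be illustrated in one dimension: treat the expert's confidence as a pseudo-sample-size weighting the prior mean against the observed mean. The scalar form below is an assumption for brevity; the paper works with mean vectors and covariance matrices:

```python
from statistics import fmean

def update_mean(prior_mean, confidence, observations):
    """Blend an expert-supplied prior mean with observed data.
    `confidence` acts as a pseudo-sample-size: the number of
    observations the expert's opinion is worth (an illustrative
    scalar stand-in for the paper's vector/matrix updates)."""
    n = len(observations)
    return (confidence * prior_mean + n * fmean(observations)) / (confidence + n)

# Low confidence: one observation pulls the estimate halfway.
print(update_mean(0.0, confidence=1, observations=[10.0]))    # -> 5.0
# High confidence: the same observation barely moves it.
print(update_mean(0.0, confidence=100, observations=[10.0]))  # -> ~0.099
```

As the abstract says, the extent to which new observations move the a priori values depends entirely on the stated confidence.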

Journal ArticleDOI
TL;DR: This paper describes a model and associated computer program for carrying out clustering of elements in a space of a predetermined number of dimensions and may be used for multi‐dimensional scale analysis and for construction of sociograms.
Abstract: This paper describes a model and associated computer program for carrying out clustering of elements in a space of a predetermined number of dimensions. In addition to clustering of elements, the model may be used for multi‐dimensional scale analysis and for construction of sociograms. The model takes as input data a set of ‘affinities’ between elements, the inverse of which may be considered as psychological or sociological distances. It then moves elements in a Euclidean n‐space toward the point at which the geometric distance is equal to the given psychological or social ‘distance,’ as if a set of attractive and repulsive forces was acting upon each element from other elements. The second major portion of the paper consists of ten examples, which are analyzed by means of the accompanying computer program.
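A rough sketch of the force model: iteratively move each element along the spring force implied by the gap between its current geometric distance and the target "psychological distance" 1/affinity. The step size, iteration count, and gradient form are my assumptions; the paper's program differs in detail:

```python
import random

def layout(affinities, dims=2, steps=200, rate=0.05, seed=0):
    """Move elements in Euclidean space so their mutual distances
    approach 1/affinity, as if attractive and repulsive spring
    forces acted between every pair of elements."""
    rng = random.Random(seed)
    n = len(affinities)
    pos = [[rng.random() for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for j in range(n):
                if i == j or affinities[i][j] == 0:
                    continue
                target = 1.0 / affinities[i][j]
                delta = [pos[j][d] - pos[i][d] for d in range(dims)]
                cur = sum(x * x for x in delta) ** 0.5 or 1e-9
                # Attract i toward j when too far, repel when too close.
                for d in range(dims):
                    pos[i][d] += rate * (cur - target) * delta[d] / cur
    return pos
```

With three elements whose target distances form a feasible triangle, the iteration settles near a configuration where geometric distance matches 1/affinity for each pair.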


Journal ArticleDOI
TL;DR: Applying some measures of clustering in recall to computer-generated sequences revealed that some are likely to give a spurious picture of differences in the tendency for items to cluster in sequences that also differ in length.
Abstract: Properties of recently proposed measures of clustering in recall are considered. Applying some measures to computer-generated sequences revealed that some of them are likely to give a spurious picture of differences in the tendency for items to cluster in sequences that also differ in length.

Journal ArticleDOI
TL;DR: In this paper, a method is presented for analysis of clustering differences among categories in the same list, which allows category characteristics to be varied within Ss by using mixed lists, whereas previously only homogeneous lists could be employed.
Abstract: A method is presented for analysis of clustering differences among categories in the same list. This allows category characteristics to be varied within Ss by using mixed lists, whereas previously only homogeneous lists could be employed.

Journal ArticleDOI
TL;DR: A unified approach to designing a data analyzer that performs cluster-seeking, feature selection, and categorizer design under a weighted least-square performance criterion that can be used as a fast procedure to evaluate the discriminatory capability of sensors and/or preprocessors.
Abstract: This paper gives a unified approach to designing a data analyzer that performs cluster-seeking, feature selection, and categorizer design under a weighted least-square performance criterion. The cost of misrecognitions is preserved throughout the process. It can be used as a fast procedure to evaluate the discriminatory capability of sensors and/or preprocessors.

Proceedings Article
01 Sep 1971
TL;DR: A general algorithm for finding the optimum classification with respect to a given criterion is derived and for a particular case, the algorithm reduces to a repeated application of a straightforward decision rule which behaves as a valley-seeking technique.
Abstract: The problem of clustering multivariate observations is viewed as the replacement of a set of vectors with a set of labels and representative vectors. A general criterion for clustering is derived as a measure of representation error. Some special cases are derived by simplifying the general criterion. A general algorithm for finding the optimum classification with respect to a given criterion is derived. For a particular case, the algorithm reduces to a repeated application of a straightforward decision rule which behaves as a valley-seeking technique. Asymptotic properties of the procedure are developed. Numerical examples are presented for the finite sample case.
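For the squared-error representation criterion, the alternation the abstract describes reduces to the familiar two-step decision rule: assign each vector to its nearest representative, then recompute each representative as the mean of its assigned vectors. A sketch under that special case (function name, initialization, and stopping rule are my assumptions):

```python
import math, random

def cluster(vectors, k, iters=20, seed=0):
    """Minimize squared representation error by alternating:
    (1) give each vector the label of its nearest representative,
    (2) recompute each representative as the mean of its members."""
    rng = random.Random(seed)
    reps = rng.sample(vectors, k)        # initial representatives
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(v, reps[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                  # keep old rep if cluster emptied
                reps[c] = tuple(sum(xs) / len(members)
                                for xs in zip(*members))
    return labels, reps

labels, reps = cluster([(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)], k=2)
```

Each step can only lower the total representation error, so the procedure converges to a local optimum of the criterion rather than a guaranteed global one.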

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a clustering method based on a single theoretic concept, where the measures may or may not be highly correlated, but it is assumed that one measure is of primary importance while the other measures are of secondary importance.
Abstract: pertains to a single theoretic concept, this concept has been made operational by several different measurements. The measures may or may not be highly correlated, but it is assumed that one measure is of primary importance while the other measures are of secondary importance. Several other clustering methods are discussed in relation to the proposed problem. An application of the proposed method is given which demonstrates that the procedure works well with large samples. It also shows that the procedure works well when, although the set of measurements contains only one theoretic concept, it contains more than one empirical dimension.

Journal ArticleDOI
TL;DR: In this paper, the states with triangular and linear chain configurations of three α-clusters in C12 are treated by the reaction matrix theory on the basis of a realistic nuclear force.
Abstract: The states with triangular and linear chain configurations of three α-clusters in C12 are treated by the reaction matrix theory on the basis of a realistic nuclear force. Clustering dependence of the G-matrix is found to be essential for clusterization. It also brings on a change of effective interaction between the two configurations, which has a role in reducing the excitation energy of the linear chain state. Its origin is mainly in the triplet even-state tensor force.

Journal ArticleDOI
TL;DR: This is a description of the first stage of an attempt to improve a thesaurus by providing it with new terms derived by computer analysis of semantic proximity between concepts from a large file of 20,000 documents.
Abstract: This is a description of the first stage of an attempt to improve a thesaurus by providing it with new terms derived by computer analysis of semantic proximity between concepts from a large file of 20,000 documents. At the first stage the semantic proximity between concept and core words was established on the level of a set of higher complexity. This set is defined when making the qualitative and quantitative choice of the concepts capable of being grouped into classes. The second stage will be the classification of terms belonging to that coherent set by clustering.

Journal ArticleDOI
TL;DR: In this paper, two groups of Ss free-recalled a list of active or passive voice sentences, and the nouns in the sentences were classified into three categories: actee nouns, passive nouns and active nouns.
Abstract: 2 groups of Ss free-recalled a list of active or passive voice sentences. The “actee” nouns in the sentences were classified into 3 categories. Ss reliably clustered their recall by conceptual category and clustering increased over 5 trials. No significant group differences were obtained on either organization or total recall. On a posttest of organization, 76.9% of the passive Ss and 60% of the active Ss organized the sentences into E-provided categories. The results suggested that category clustering occurs in the free recall of sentences.


Journal ArticleDOI
TL;DR: This paper found that there was considerable pairwise clustering in free recall following verbal-discrimination learning for two possible classifications: pairs vs right-wrong functions, which may indicate the effect of an intrapair association on the subsequent free recall organization.
Abstract: Clustering in free recall following verbal-discrimination learning was assessed for two possible classifications: pairs vs right-wrong functions. There was considerable pairwise clustering in free recall in two different studies. This outcome does not follow directly from frequency theory or other explanations of verbal-discrimination learning which assume acquired equivalence by function; it may, however, indicate the effect of an intrapair association on the subsequent free recall organization.