
Showing papers on "Cluster analysis" published in 1977


Book
01 Jan 1977
TL;DR: This book covers reduction of dimensionality, the development and study of multivariate dependencies, multidimensional classification and clustering, and the assessment of specific aspects of multivariate statistical models.
Abstract: Reduction of Dimensionality. Development and Study of Multivariate Dependencies. Multidimensional Classification and Clustering. Assessment of Specific Aspects of Multivariate Statistical Models. Summarization and Exposure. References. Appendix. Indexes.

1,059 citations


Journal ArticleDOI
TL;DR: A computer program, ADDTREE, for the construction of additive trees is described and applied to several sets of data, and some empirical and theoretical advantages of tree representations over spatial representations of proximity data are illustrated.
Abstract: Similarity data can be represented by additive trees. In this model, objects are represented by the external nodes of a tree, and the dissimilarity between objects is the length of the path joining them. The additive tree is less restrictive than the ultrametric tree, commonly known as the hierarchical clustering scheme. The two representations are characterized and compared. A computer program, ADDTREE, for the construction of additive trees is described and applied to several sets of data. A comparison of these results to the results of multidimensional scaling illustrates some empirical and theoretical advantages of tree representations over spatial representations of proximity data.

594 citations
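
The distinction drawn in the abstract above between additive and ultrametric trees can be checked numerically on a dissimilarity matrix. The following is a minimal sketch, assuming the standard characterizations (the ultrametric inequality for hierarchical clustering schemes and the four-point condition for additive trees). It is not the ADDTREE program, and the function names and tolerance are illustrative.

```python
from itertools import combinations, permutations


def is_ultrametric(d, tol=1e-9):
    """Ultrametric inequality: d(x, z) <= max(d(x, y), d(y, z)) for every triple."""
    n = len(d)
    for x, y, z in permutations(range(n), 3):
        if d[x][z] > max(d[x][y], d[y][z]) + tol:
            return False
    return True


def is_additive(d, tol=1e-9):
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path lengths in a small tree: leaves A, B hang off one internal node
# (branch lengths 1 and 2), leaves C, D off another (3 and 1), and the
# two internal nodes are joined by an edge of length 2.
tree_distances = [
    [0, 3, 6, 4],
    [3, 0, 7, 5],
    [6, 7, 0, 4],
    [4, 5, 4, 0],
]
print(is_additive(tree_distances), is_ultrametric(tree_distances))
```

The example matrix satisfies the four-point condition but violates the ultrametric inequality, which illustrates why the additive tree is the less restrictive model.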


Journal ArticleDOI
TL;DR: Results indicate that physical clustering of logically adjacent items is a critical performance parameter for relational query evaluation; in the absence of such clustering, methods that depend on sorting the records themselves seem to be the methods of choice.
Abstract: A model of storage and access to a relational data base is presented. Using this model, four techniques for evaluating a general relational query that involves the operations of projection, restriction, and join are compared on the basis of cost of accessing secondary storage. The techniques are compared numerically and analytically for various values of important parameters. Results indicate that physical clustering of logically adjacent items is a critical performance parameter. In the absence of such clustering, methods that depend on sorting the records themselves seem to be the algorithm of choice.

259 citations


Book ChapterDOI
Joseph B. Kruskal
01 Jan 1977
TL;DR: This chapter describes the relationship between clustering and multidimensional scaling, and describes some applications of clustering to astronomy that are not well known in the clustering field.
Abstract: Publisher Summary This chapter describes the relationship between clustering and multidimensional scaling. Clustering and multidimensional scaling are both methods for analyzing data. To some extent, they are in competition with one another. However, they also stand in a strongly complementary relationship. They can be used together in several ways, and these joint uses are often desirable. The chapter describes some applications of clustering to astronomy that are not well known in the field of clustering. Many of the basic concepts of clustering belong to the biological inheritance of humans and many other animals. The concept of similarity is built into the human nervous system. There are three main types of data used in clustering: (1) multivariate data, (2) proximity data, and (3) clustering data. Multivariate data give the values of several variables for several individuals. Proximity data consist of proximities among objects of the same kind: proximities among individuals, among variables, among stimuli, or among objects of any single cohesive type.

125 citations


Book ChapterDOI
J. A. Hartigan
01 Jan 1977
TL;DR: This chapter discusses the statistical problem of deciding which of the many clusters presented by algorithms are real, and notes that a reasonable significance testing procedure requires the asymptotic theory to be validated by Monte Carlo experiments.
Abstract: Publisher Summary The very large growth in clustering techniques and applications is not yet supported by development of statistical theory by which the clustering results are evaluated. A number of branches of statistics are relevant to clustering, namely, discriminant analysis, eigenvector analysis, analysis of variance, multiple comparisons, density estimation, contingency tables, piecewise fitting, and regression. These are all areas where the techniques are used in evaluating clusters or where clustering operations occur. This chapter discusses a statistical problem that is encountered in deciding which of the many clusters presented by algorithms are real. There is no easy generally applicable definition of real. A data cluster is real if it corresponds to one of the population clusters. The mixture techniques, k-means, single linkage, complete linkage and other common algorithms are examined to give measures of the reality of their clusters. A reasonable significance testing procedure requires the asymptotic theory to be validated by Monte Carlo experiments.

104 citations


Journal ArticleDOI
01 Aug 1977
TL;DR: It is demonstrated that there exist classes of global optimization problems for which the probability of obtaining a solution is greater for the proposed model than for multiple local optimizations.
Abstract: A model for finding the local optima of a multimodal function defined in a region A ⊂ R^n is proposed. The method uses a local optimizer which is started from a number of points sampled in A. In order to reduce the number of function evaluations needed to reach the local optima, the parallel local search processes are stopped repeatedly, the working points clustered, and a reduced number of processes from each cluster resumed. A direct nonhierarchical cluster analysis technique is presented. The dissimilarity measure used is the Euclidean distance between points. Clusters are grown from seed points. The number of required distance evaluations is less than or equal to c(n-1), where n is the number of points and c is the number of clusters arrived at. Thresholds are determined by the point density in a body which in turn is determined by the given points. The covariance matrix is diagonalized, and a decision on the dimensionality of the space containing the points can be made. The volume of the body is proportional to the square root of the product of the corresponding eigenvalues. The performance of the clustering analysis technique is illustrated. It is demonstrated that there exist classes of global optimization problems for which the probability of obtaining a solution is greater for the proposed model than for multiple local optimizations. Some experiences gained from using the model are reported.

98 citations


Journal ArticleDOI
TL;DR: In this article, the authors demonstrate a technique for clustering leisure activities which takes into consideration individual differences in the perceived needs that the activities satisfy.
Abstract: The present study demonstrates a technique for clustering leisure activities which takes into consideration individual differences in the perceived needs that the activities satisfy. This extends p...

92 citations


Book ChapterDOI
01 Jan 1977
TL;DR: The most desirable cluster analysis models for substantive applications should have the input proximity data expressible in a manner faithfully representing only the reliable information content of the empirically measured data.
Abstract: Publisher Summary The output of a cluster analysis method is a collection of subsets of the object set termed clusters characterized in some manner by relative internal coherence and/or external isolation, along with a natural stratification of these identified clusters by levels of cohesive intensity. In formalizing a model of the cluster analysis methods, it is essential to consider the nature and inherent reliability of the proximity data that constitutes the input in substantive clustering applications. The proximity value scales are dichotomous. It is the practice of most authors of cluster methods to assume that the proximity values are available in the form of a real symmetric matrix, where any unjustified structure implicit in the real values is either to be ignored or axiomatically disallowed. The most desirable cluster analysis models for substantive applications should have the input proximity data expressible in a manner faithfully representing only the reliable information content of the empirically measured data.

85 citations


Journal ArticleDOI
01 Oct 1977
TL;DR: A distance between syntactic patterns is defined in terms of error transformations and is used both in a nearest neighbor recognition rule and as the similarity measure in a clustering procedure for syntactic patterns; a character recognition experiment is given as an illustrative example.
Abstract: A distance between two syntactic patterns is defined in terms of error transformations. This definition is extended to the case of distance measures between one syntactic pattern and a group of syntactic patterns. A nearest neighbor recognition rule for syntactic patterns using the proposed distance is then given. Using the proposed distance as a similarity measure, a clustering procedure for syntactic patterns is described. A character recognition experiment is given as an illustrative example.

82 citations
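
A distance defined through error transformations, as in the abstract above, can be illustrated on plain strings. The following is a minimal sketch, assuming the error transformations are single-symbol substitution, insertion, and deletion with unit costs (a Levenshtein-style distance). The cost parameters, function names, and sample patterns are illustrative assumptions, not the paper's definitions.

```python
def edit_distance(a, b, sub=1, ins=1, dele=1):
    """Minimum total cost of substitutions, insertions, and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * dele]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + dele,                        # delete ca
                cur[j - 1] + ins,                      # insert cb
                prev[j - 1] + (0 if ca == cb else sub) # match or substitute
            ))
        prev = cur
    return prev[-1]


def nearest_neighbor(x, labeled_patterns):
    """Return the label of the stored pattern closest to x under edit_distance."""
    pattern, label = min(labeled_patterns, key=lambda item: edit_distance(x, item[0]))
    return label


# Hypothetical labeled patterns written as symbol strings.
samples = [("abab", "A"), ("cccc", "B")]
print(nearest_neighbor("abba", samples))
```

The same distance function could serve as the dissimilarity input to any clustering procedure that works from pairwise distances.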


Journal ArticleDOI
TL;DR: A topological design aspect of the access problem is considered; the problem is formulated as the locating of generic access facilities (GAF's) to obtain an economic connection of nodes (users) to a resource connection point (RESCOP).
Abstract: In any network where a large number of widely dispersed "users" share a limited number of "resources," the strategy for access will play a large part in determining the cost and performance of the network. In this paper we consider a topological design aspect of the access problem. In particular, we consider the problem of locating "access facilities," or concentration points, to obtain an economic connection of users to resources. The problem is formulated as the locating of generic access facilities (GAF's) to obtain an economic connection of nodes (users) to a resource connection point (RESCOP). The nodes may be connected through multipoint lines, but with a constraint on the number of nodes which may share a single line. The GAF's are constrained in capacity, expressed as the number of nodes they can support, and have a cost associated with them. The basic solution technique presented is a heuristic algorithm characterized by the following four steps. 1) Simplify the problem to a point-to-point problem by replacing clusters of nodes by single "center-of-mass" (COM) nodes. 2) Partition the reduced set of COM nodes by applying an Add algorithm, resulting in one of the COM nodes selected as a GAF site. 3) Select one of the original nodes as a real GAF site in each partition by examining the original nodes closest to the COM node selected in the Add algorithm, and selecting the best. 4) Apply a line-layout algorithm to each partition, with its selected GAF site serving as the central node.

79 citations


Journal ArticleDOI
TL;DR: By visually mapping anodically decorated transistors, the authors found that in highly defective sites, emitter-collector shorts (pipes) tend to collect in clusters of totally defective areas.
Abstract: This paper examines a model of LSI device failure and the departure from Poisson statistics that it necessitates. By visually mapping anodically decorated transistors, the authors found that in highly defective sites, emitter-collector shorts (pipes) tend to collect in clusters of totally defective areas. Less defective sites have a nearly random distribution of defects, though some limited clustering may still exist. In general, a slightly curved relationship is obtained when the logarithm of actual yield is plotted versus area. However, for a small enough area, such as a single chip, one can make a linear approximation and use it to estimate the fraction of the area that is totally defective, and the defect density. The paper describes an analytical method of modeling device failures, and of projecting yields for areas larger than the data base from which the parameters of the yield equation were estimated.

Journal ArticleDOI
TL;DR: A comparison of clustering times with other methods shows that large files can be clustered by single-link in a time at least comparable to various heuristic algorithms which theoretically require fewer operations.
Abstract: A method for clustering large files of documents using a clustering algorithm which takes O(n^2) operations (single-link) is proposed. This method is tested on a file of 11,613 documents derived from an operational system. One property of the generated cluster hierarchy (hierarchy connection percentage) is examined and it indicates that the hierarchy is similar to those from other test collections. A comparison of clustering times with other methods shows that large files can be clustered by single-link in a time at least comparable to various heuristic algorithms which theoretically require fewer operations.
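
One common way to obtain a single-link hierarchy in a quadratic number of operations is to build a minimum spanning tree and read the merge levels off its edges. The following is a minimal sketch of that idea using Prim's algorithm on a full dissimilarity matrix. It is an illustration under that assumption, not the method of the paper above, and the function names are hypothetical.

```python
def single_link_merges(dist):
    """Return the n-1 minimum spanning tree edges as (level, i, j), sorted by level.

    Each edge level is the dissimilarity at which single-link joins two clusters.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return sorted(edges)


def clusters_at(dist, threshold):
    """Cut the hierarchy: connected components of tree edges with level <= threshold."""
    n = len(dist)
    label = list(range(n))

    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for level, i, j in single_link_merges(dist):
        if level <= threshold:
            label[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Five points on a line, used as a toy "document" set.
points = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
print(clusters_at(dist, 1.5))
```

Both loops in `single_link_merges` run over all points for each added point, which gives the quadratic operation count.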

Journal ArticleDOI
TL;DR: An algorithm is described which accomplishes journal classification using the single-link clustering technique and a novel application of the method of bibliographic coupling, which consists in the use of two-step bibliographical coupling linkages, rather than the usual one-step linkages.
Abstract: The classification of journal titles into fields or specialties is a problem of practical importance in library and information science. An algorithm is described which accomplishes such a classification using the single-link clustering technique and a novel application of the method of bibliographic coupling. The novelty consists in the use of two-step bibliographic coupling linkages, rather than the usual one-step linkages. This modification of the similarity measure leads to a marked improvement in the performance of single-link clustering in the formation of field or specialty clusters of journals. Results of an experiment using this algorithm are reported which grouped 890 journals into 168 clusters. This scope is an improvement of nearly an order of magnitude over previous journal clustering experiments. The results are evaluated by comparison with an independently derived manual classification of the same journal set. The generally good agreement indicates that this method of journal clustering will have significant practical utility for journal classification.

Journal ArticleDOI
J.H. Liou, S.B. Yao
TL;DR: The costs of retrieval, update and storage space for this data base structure are mathematically formulated and an example illustrates that this new database structure can be superior to the classical combination of indexed sequential and file inversion techniques.

Journal ArticleDOI
TL;DR: A 4-dimensional histogram is computed to reduce the large LANDSAT pixel data to the much smaller number of distinct vectors and their frequency of occurrence in the scene, using the histogram count as a probability density estimate.
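
The reduction described above, from many pixels to a smaller set of distinct vectors with frequencies, can be sketched in a few lines. This is a minimal illustration, assuming each pixel is a tuple of four band values and that relative frequency serves as the density estimate. The function name and sample values are hypothetical.

```python
from collections import Counter


def vector_histogram(pixels):
    """Count distinct 4-band vectors and convert the counts to relative frequencies."""
    counts = Counter(tuple(p) for p in pixels)
    total = sum(counts.values())
    density = {vector: count / total for vector, count in counts.items()}
    return counts, density


# Hypothetical 4-band pixel values.
pixels = [(1, 2, 3, 4), (1, 2, 3, 4), (5, 6, 7, 8), (1, 2, 3, 4)]
counts, density = vector_histogram(pixels)
print(len(counts), density[(5, 6, 7, 8)])
```

Any later clustering step then only needs to visit the distinct vectors, weighting each by its count.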

Journal ArticleDOI
TL;DR: In this article, the authors extended the work of Strauss (1975) on clustering in the two-colour case and compared it with the more general methods of Besag (1974).
Abstract: This paper is concerned with nearest-neighbour systems on the coloured lattice (unordered state space). It extends the paper of Strauss (1975) on clustering in the two-colour case. Comparison is made with the more general methods of Besag (1974). Some tests are developed, and illustrated with an example. Keywords: nearest-neighbour system; Markov random field; clustering; qualitative data


Journal ArticleDOI
TL;DR: In this article, a complete-link hierarchical clustering technique is discussed in relation to the role and function of the rehabilitation counselor.

Journal ArticleDOI
TL;DR: A distinction is made between implicit and explicit group overlap in sociological data, and literature is briefly reviewed in terms of this distinction, and the conclusion is drawn that for implicit overlap, the method of data analysis should use continuous input, while yielding output in a discrete form of subsets.
Abstract: A distinction is made between implicit and explicit group overlap in sociological data, and literature is briefly reviewed in terms of this distinction. The conclusion is drawn that for implicit overlap, the method of data analysis should use continuous input, while yielding output in a discrete form of (possibly overlapping) subsets. Such a method of clustering (ADCLUS) is presented briefly and is applied to the communication structure of a biomedical area of specialization.

01 Dec 1977
TL;DR: The Cluster Compression Algorithm (CCA), which was developed to reduce costs associated with transmitting, storing, distributing, and interpreting LANDSAT multispectral image data, is described, and experimental results are presented to show trade-offs and characteristics of the various implementations.
Abstract: The Cluster Compression Algorithm (CCA), which was developed to reduce costs associated with transmitting, storing, distributing, and interpreting LANDSAT multispectral image data, is described. The CCA is a preprocessing algorithm that uses feature extraction and data compression to more efficiently represent the information in the image data. The format of the preprocessed data enables simple look-up table decoding and direct use of the extracted features to reduce user computation for either image reconstruction or computer interpretation of the image data. Basically, the CCA uses spatially local clustering to extract features from the image data to describe spectral characteristics of the data set. In addition, the features may be used to form a sequence of scalar numbers that define each picture element in terms of the cluster features. This sequence, called the feature map, is then efficiently represented by using source encoding concepts. Various forms of the CCA are defined and experimental results are presented to show trade-offs and characteristics of the various implementations. Examples are provided that demonstrate the application of the cluster compression concept to multispectral images from LANDSAT and other sources.

01 Aug 1977
TL;DR: A simple clustering transformation is combined with the Thompson, Thames, and Mastin (TTM) method of generating computational grids to produce controlled mesh spacings to create a hybrid scheme for airfoil problems.
Abstract: A simple clustering transformation is combined with the Thompson, Thames, and Mastin (TTM) method of generating computational grids to produce controlled mesh spacings. For various practical grids, the resulting hybrid scheme is easier to apply than the inhomogeneous clustering terms included in the TTM method for this purpose. The technique is illustrated in application to airfoil problems, and listings of a FORTRAN computer code for this usage are included.

Journal ArticleDOI
TL;DR: Taxonomic data, obtained for 141 Enterobacteriaceae strains for which 240 unit characters were recorded, were subjected to numerical taxonomy analysis employing 36 coefficients and it was found that 15 coefficients provided useful discriminating properties.
Abstract: Taxonomic data, obtained for 141 Enterobacteriaceae strains for which 240 unit characters were recorded, were subjected to numerical taxonomy analysis employing 36 coefficients. Clustering was by unweighted average linkage. From sorted similarity matrices, it was found that 15 coefficients, which included SSM, SH, STD, SJ, SNM, SO, SRT, SSHD, Sin−1(SSM), SP, So, SUN1, SUN4, SD, and SK2, provided useful discriminating properties. The coefficients SH and STD were found to provide results indistinguishable from SSM, and the coefficients SO and So yielded results very similar to those obtained with SSM coefficient.
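
Two of the coefficients named in the abstract above can be illustrated on binary character data. The following is a minimal sketch, assuming SSM denotes the simple matching coefficient and SJ the Jaccard coefficient in their usual numerical taxonomy definitions. The function names, the sample strains, and the choice to return 0.0 when no positive characters are present are illustrative assumptions.

```python
def match_counts(x, y):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1, and 0/0 character pairs."""
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def simple_matching(x, y):
    """S_SM = (a + d) / (a + b + c + d): negative matches count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """S_J = a / (a + b + c): negative matches are ignored."""
    a, b, c, _ = match_counts(x, y)
    denom = a + b + c
    return a / denom if denom else 0.0  # convention chosen here for the all-zero case


# Two hypothetical strains scored on five binary unit characters.
strain_1 = [1, 1, 0, 0, 1]
strain_2 = [1, 0, 0, 1, 1]
print(simple_matching(strain_1, strain_2), jaccard(strain_1, strain_2))
```

The two coefficients differ only in how they treat characters that are negative in both strains, which is one reason different coefficients can sort the same similarity matrix differently.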

Journal ArticleDOI
G.M. White
01 Aug 1977



Journal ArticleDOI
TL;DR: A new method of cluster analysis is described that groups coplanar points associated with different instants of time and an ornithological application illustrates the method.
Abstract: SUMMARY A new method of cluster analysis is described that groups coplanar points associated with different instants of time. An ornithological application illustrates the method.

Journal ArticleDOI
TL;DR: In this paper, atomistic calculations on a 1/2 {110} edge dislocation show a restricted tendency for helium atoms to cluster along this dislocation; clusters with up to 4 helium atoms have been studied.

Journal ArticleDOI
TL;DR: A decision-directed approach for classifying discrete data is presented, in which probable clusters are initiated by a sorting scheme based on the estimated probability distribution of the data and an arbitrary distance measure.
Abstract: This article presents a decision-directed approach for classifying discrete data. In the clustering algorithm, probable clusters are initiated through the use of a sorting scheme based on the estimated probability distribution of the data and an arbitrary distance measure. The subsequent iterative reclassification procedures are directed by the estimated distribution of each class. The distribution estimation adopted is modified from the dependence tree procedure. The algorithm performance is then evaluated through the use of simulated and clinical data. Finally, the algorithm is applied to disease categorization and to signs and symptoms extraction for each disease class.

Journal ArticleDOI
TL;DR: In this article, a relativistic quantum gas of pions, due to Bose-Einstein statistics and the resulting possibility of condensation, exhibits a structure similar to that obtained in the statistical bootstrap model, with clusters of condensed pions taking the place of fireballs.
Abstract: It is shown that a relativistic quantum gas of pions, due to Bose-Einstein statistics and the resulting possibility of condensation, exhibits a structure similar to that obtained in the statistical bootstrap model, with clusters of condensed pions taking the place of fireballs. The critical temperature T* of the BE system is, however, associated with a first-order phase transition from a gas of pions and clusters at low energy density to a system of condensed pions at high energy density. The phase transition and the nature of the two phases are investigated.