
Showing papers on "Cluster analysis published in 1985"


Journal ArticleDOI
TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets containing 2, 3, 4, or 5 distinct nonoverlapping clusters; the data sets were analyzed by four hierarchical clustering methods to provide a variety of clustering solutions.
Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.
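Many of the stopping rules evaluated in studies like this compare between-group to within-group dispersion; the Calinski-Harabasz index is a classic example. A minimal Python sketch on an invented 1-D data set (none of the paper's 30 procedures is reproduced here):

```python
# Sketch of one classic "stopping rule": the Calinski-Harabasz index,
# which peaks at a good choice for the number of clusters k.
# The tiny 1-D data set and candidate partitions are invented for illustration.

def calinski_harabasz(clusters):
    """CH = (B / (k - 1)) / (W / (n - k)) for a list of 1-D clusters."""
    points = [x for c in clusters for x in c]
    n, k = len(points), len(clusters)
    grand_mean = sum(points) / n
    # Between-group sum of squares, weighted by cluster size.
    b = sum(len(c) * (sum(c) / len(c) - grand_mean) ** 2 for c in clusters)
    # Within-group sum of squares.
    w = sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clusters)
    return (b / (k - 1)) / (w / (n - k))

partition_k2 = [[0, 1, 2], [10, 11, 12]]   # the true 2-cluster structure
partition_k3 = [[0, 1, 2], [10, 11], [12]] # an over-split 3-cluster partition

ch2 = calinski_harabasz(partition_k2)
ch3 = calinski_harabasz(partition_k3)
print(ch2 > ch3)   # the index prefers the true number of clusters
```

As the study cautions, no single criterion is reliable on all data; indices like this one are best used alongside others.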

3,551 citations


Journal ArticleDOI
TL;DR: An O(kn) approximation algorithm that guarantees solutions with an objective function value within twice the optimal value is presented, and it is shown that the algorithm succeeds as long as the set of points satisfies the triangle inequality.
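The standard O(kn) 2-approximation of this kind is the farthest-point heuristic: repeatedly add as a new center the point farthest from the current centers. A sketch under that assumption (details of the published algorithm may differ):

```python
# Farthest-point style 2-approximation for k-center clustering.
# Triangle inequality guarantees the solution radius is within 2x optimal.
import math

def euclid(p, q):
    return math.dist(p, q)

def k_center(points, k):
    centers = [points[0]]                      # arbitrary first center
    while len(centers) < k:
        # Add the point with the largest distance to its nearest center.
        farthest = max(points, key=lambda p: min(euclid(p, c) for c in centers))
        centers.append(farthest)
    return centers

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
centers = k_center(points, 2)
# Covering radius: worst-case distance from any point to its nearest center.
radius = max(min(euclid(p, c) for c in centers) for p in points)
print(centers, radius)
```

Each of the k rounds scans all n points once, giving the O(kn) bound.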

1,784 citations


Journal ArticleDOI
TL;DR: In this paper, eight empirical studies using attitudinal data to cluster countries are reviewed, the major dimensions accounting for similarities among countries are discussed, and a final synthesis of clusters is presented.
Abstract: Eight empirical studies using attitudinal data to cluster countries are reviewed. The major dimensions accounting for similarities among countries are discussed, and a final synthesis of clusters is presented.

1,579 citations


01 Jan 1985

396 citations


Journal ArticleDOI
TL;DR: Taxonomies, factor analysis and clustering are discussed as tools to investigate the structure of competitors within an industry (‘strategic groups’) and an example using cluster analysis is presented as one means of operationalizing this concept.
Abstract: Taxonomies, factor analysis and clustering are discussed as tools to investigate the structure of competitors within an industry ('strategic groups'). An example using cluster analysis is presented as one means of operationalizing this concept. Careful definition and selection of the dimensions used to identify the boundaries between strategic groups (their mobility barriers) are particularly crucial in the effective application of analytical tools.

379 citations


Journal ArticleDOI
J. A. Hartigan1
TL;DR: In this article, a number of statistical models for forming and evaluating clusters are reviewed, including hierarchical, multivariate, and mixture methods, and the failure of likelihood tests for the number of components.
Abstract: A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don't do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation. Mixture methods are examined, related to k-means, and the failure of likelihood tests for the number of components is noted. The DIP test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.
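The single linkage/minimum spanning tree connection noted in the review is concrete: cutting the k-1 heaviest MST edges yields the single-linkage k-clustering. A minimal 1-D sketch (invented data):

```python
# Single linkage via the minimum spanning tree: build the MST with Prim's
# algorithm, drop the heaviest edges, and read off the connected components.

def mst_edges(points):
    """Prim's algorithm over the complete graph with |x - y| edge weights."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        w, i, j = min(
            (abs(points[i] - points[j]), i, j)
            for i in in_tree for j in range(len(points)) if j not in in_tree
        )
        in_tree.add(j)
        edges.append((w, i, j))
    return edges

def single_linkage_cut(points, k):
    """Keep the n-k lightest MST edges and return the connected components."""
    edges = sorted(mst_edges(points))[: len(points) - k]
    comp = list(range(len(points)))            # naive union-find
    def find(i):
        while comp[i] != i:
            i = comp[i]
        return i
    for _, i, j in edges:
        comp[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())

print(single_linkage_cut([0, 1, 2, 10, 11, 12], k=2))  # [[0, 1, 2], [10, 11, 12]]
```

This is why single linkage inherits both the mathematical appeal of the MST and its chaining behavior.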

353 citations


Journal ArticleDOI
TL;DR: In this paper, the use of cluster analysis as a tool for system modularization is discussed and several clustering techniques are discussed and used on two medium-size systems and a group of small projects.
Abstract: This paper examines the use of cluster analysis as a tool for system modularization. Several clustering techniques are discussed and used on two medium-size systems and a group of small projects. The small projects are presented because they provide examples (that will fit into a paper) of certain types of phenomena. Data bindings between the routines of the system provide the basis for the clustering. It appears that the clustering of data bindings provides a meaningful view of system modularization.

312 citations


Journal ArticleDOI
TL;DR: In this paper, the use of co-citations to cluster the Science Citation Index (SCI) database is reviewed. Two proposed improvements in the methodology are introduced: fractional citation counting and variable level clustering with a maximum cluster size limit.
Abstract: Earlier experiments in the use of co-citations to cluster the Science Citation Index (SCI) database are reviewed. Two proposed improvements in the methodology are introduced: fractional citation counting and variable level clustering with a maximum cluster size limit. Results of an experiment using the 1979 SCI are described comparing the new methods with those previously employed. It is found that fractional citation counting helps reduce the bias toward high referencing fields such as biomedicine and biochemistry inherent in the use of an integer citation count threshold, and increases the range of subject matters covered by clusters. Variable level clustering, on the other hand, increases recall as measured by the percentage of highly cited items included in clusters. It is concluded that the two new methods used in combination will improve our ability to generate comprehensive maps of science as envisioned by Derek Price. This topic will be discussed in a forthcoming paper.
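Fractional citation counting is commonly implemented by having each citing paper distribute a total weight of 1 across its reference list, i.e., 1/R per cited item, which damps the advantage of high-referencing fields. A sketch under that assumption, with invented toy data:

```python
# Integer vs fractional citation counting: under fractional counting, a
# citation from a paper with R references contributes only 1/R.

def citation_counts(citing_papers):
    integer, fractional = {}, {}
    for refs in citing_papers.values():
        for cited in refs:
            integer[cited] = integer.get(cited, 0) + 1
            fractional[cited] = fractional.get(cited, 0.0) + 1.0 / len(refs)
    return integer, fractional

citing = {
    "A": ["X", "Y"],              # 2 references: each gets weight 1/2
    "B": ["X"],                   # 1 reference: weight 1
    "C": ["X", "Y", "Z", "W"],    # 4 references: each gets weight 1/4
}
integer, fractional = citation_counts(citing)
print(integer["X"], fractional["X"])   # 3 vs 1.75
```

Under an integer threshold, items cited from long reference lists (typical of biomedicine) are favored; the fractional counts shrink exactly those contributions.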

311 citations


Journal ArticleDOI
TL;DR: In this paper, a theoretical graphical technique for finding the intrinsic patterns in point data sets is described.
Abstract: A theoretical graphical technique for finding the intrinsic patterns in point data sets is described. It is applied to two- and three-dimensional galaxy distributions, as well as to comparable random samples and to numerical simulations.

261 citations


Journal ArticleDOI
TL;DR: The map shows a tightly integrated network of approximate disciplinary regions, unique in that for the first time links between mathematics and biomedical science have brought about a closure of the previously linear arrangement of disciplines.
Abstract: Previous attempts to map science using the co-citation clustering methodology are reviewed, and their shortcomings analyzed. Two enhancements of the methodology presented in Part I of the paper (fractional citation counting and variable level clustering) are briefly described, and a third enhancement, the iterative clustering of clusters, is introduced. When combined, these three techniques improve our ability to generate comprehensive and representative mappings of science across the multidisciplinary Science Citation Index (SCI) database. Results of a four step analysis of the 1979 SCI are presented, and the resulting map at the fourth iteration is described in detail. The map shows a tightly integrated network of approximate disciplinary regions, unique in that for the first time links between mathematics and biomedical science have brought about a closure of the previously linear arrangement of disciplines. Disciplinary balance between biomedical and physical science has improved, and the appearance of less cited subject areas, such as mathematics and applied science, makes this map the most comprehensive one yet produced by the co-citation methodology. Remaining problems and goals for future work are discussed.

221 citations


Journal ArticleDOI
TL;DR: A clustering algorithm based on a standard K-means approach which requires no user parameter specification is presented and experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
Abstract: Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and size of population to be clustered. Since a naive user of such a statistical clustering algorithm could not be expected, in general, to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. It is the purpose of this paper to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
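The core the paper builds on is standard K-means (Lloyd iterations), sketched here in plain Python on invented 1-D "patterns"; the published algorithm's parameter-free initialization is not reproduced.

```python
# Standard K-means core: alternate nearest-center assignment with
# mean-recomputation until the centers stabilize.

def kmeans(points, centers, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each pattern to its nearest cluster center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its assigned patterns.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
print(sorted(kmeans(points, centers=[0.0, 10.0])))   # converges to [1.0, 11.0]
```

The paper's contribution is removing the user-chosen clustering parameters around this loop, so a naive user need not tune anything per vocabulary or population.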

Journal ArticleDOI
TL;DR: Two new methodologies to overcome the drawbacks of the classical approach to parts grouping are developed and are very easy to implement because they take advantage of the information already stored in the CAD system.
Abstract: Parts grouping into families can be performed in flexible manufacturing systems (FMSs) to simplify two classes of problems: long horizon planning and short horizon planning. In this paper the emphasis is on the part families problem applicable to the short horizon planning. Traditionally, parts grouping was based on classification and coding systems, some of which are reviewed in this paper. To overcome the drawbacks of the classical approach to parts grouping, two new methodologies are developed. The methodologies presented are very easy to implement because they take advantage of the information already stored in the CAD system. One of the basic elements of this system is the algorithm for solving the part families problem. Some of the existing clustering algorithms for solving this problem are discussed. A new clustering algorithm has been developed. The computational complexity and some of the computational results of solving the part families problem are also discussed.


Journal ArticleDOI
TL;DR: By combining a nonparametric classifier, based on a clustering algorithm, with a quad-tree representation of the image, the scheme is simple to implement and performs well, giving satisfactory results at signal-to-noise ratios well below 1.

Journal ArticleDOI
TL;DR: A clustering method is presented to describe the discontinuities in a multivariate (multispecies) series of biological samples, obtained from a single station at successive times and produces a nonhierarchical partition of the series into nonoverlapping homogeneous groups, which are the steps of the ecological succession.
Abstract: A clustering method is presented to describe the discontinuities in a multivariate (multispecies) series of biological samples, obtained from a single station at successive times. The method takes into account the sequence of sampling (time contiguity constraint) and makes it possible to eliminate singletons. Such singletons can be found in most ecological series, due to random components or to external forcings such as a temporary shift of water masses or immigration and emigration at a fixed station. The clustering proceeds from a sample x sample association matrix, built with an appropriately chosen similarity or distance coefficient. Agglomerative clustering is applied with the time constraint, and a randomization test is performed to verify whether the fusion is valid. This test compares the number of "high" distances in the between-group matrix to that in the fusion matrix of the two groups tested. When a singleton is discovered, with this same test, it is temporarily removed from the study and the ...
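The time contiguity constraint means only segments adjacent in the sampling sequence may fuse. A minimal sketch of that idea (the series, the distance between groups, and the stopping threshold are invented; the randomization test and singleton handling of the published method are not reproduced):

```python
# Chronologically constrained agglomerative clustering: repeatedly fuse the
# closest pair of *adjacent* segments until the smallest gap exceeds a
# threshold, so the partition respects the order of sampling.

def constrained_clustering(series, threshold):
    segments = [[x] for x in series]           # start with one sample per group
    def gap(a, b):                             # distance between adjacent groups
        return abs(sum(a) / len(a) - sum(b) / len(b))
    while len(segments) > 1:
        # Only adjacent segments are candidates for fusion (time constraint).
        i = min(range(len(segments) - 1),
                key=lambda i: gap(segments[i], segments[i + 1]))
        if gap(segments[i], segments[i + 1]) > threshold:
            break
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments

series = [1.0, 1.2, 0.9, 5.0, 5.1, 9.0]
print(constrained_clustering(series, threshold=2.0))
```

The resulting nonoverlapping, time-ordered groups are what the paper interprets as steps of the ecological succession.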

Journal ArticleDOI
TL;DR: A weighted similarity measure heuristic applied to the problem of grouping machines into cells in group technology.
Abstract: A weighted similarity measure heuristic applied to the problem of grouping machines into cells in group technology.

Journal ArticleDOI
TL;DR: An algorithm for generating artificial data sets which contain distinct nonoverlapping clusters is presented, useful for generating test data sets for Monte Carlo validation research conducted on clustering methods or statistics.
Abstract: An algorithm for generating artificial data sets which contain distinct nonoverlapping clusters is presented. The algorithm is useful for generating test data sets for Monte Carlo validation research conducted on clustering methods or statistics. The algorithm generates data sets which contain either 1, 2, 3, 4, or 5 clusters. By default, the data are embedded in either a 4, 6, or 8 dimensional space. Three different patterns for assigning the points to the clusters are provided. One pattern assigns the points equally to the clusters while the remaining two schemes produce clusters of unequal sizes. Finally, a number of methods for introducing error in the data have been incorporated in the algorithm.
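A stripped-down sketch of the idea behind such generators: place cluster centers far apart relative to the within-cluster spread, then sample Gaussian noise around each center. The centers, spread, and sizes below are invented; the published algorithm's error-perturbation schemes and unequal-size patterns are not reproduced.

```python
# Generate labeled test data with distinct, nonoverlapping 1-D clusters.
import random

def generate(centers, n_per_cluster, sigma, seed=0):
    rng = random.Random(seed)                  # seeded for reproducible tests
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append(rng.gauss(center, sigma))
            labels.append(label)
    return points, labels

centers = [0.0, 50.0, 100.0]
points, labels = generate(centers, n_per_cluster=20, sigma=1.0)
# With this separation, every point is nearest its own generating center,
# i.e., the clusters are distinct and nonoverlapping.
ok = all(min(range(3), key=lambda i: abs(p - centers[i])) == lab
         for p, lab in zip(points, labels))
print(ok)
```

Because the true labels are known by construction, external recovery indices can be computed exactly, which is what makes such data useful for Monte Carlo validation.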

Journal ArticleDOI
TL;DR: The results confirmed the findings of previous Monte Carlo studies on clustering procedures in that accuracy was inversely related to coverage, and that algorithms using correlation as the similarity measure were significantly more accurate than those using Euclidean distances.
Abstract: Nine hierarchical and four nonhierarchical clustering algorithms were compared on their ability to resolve 200 multivariate normal mixtures. The effects of coverage, similarity measures, and cluster overlap were studied by including different levels of coverage for the hierarchical algorithms, Euclidean distances and Pearson correlation coefficients, and truncated multivariate normal mixtures in the analysis. The results confirmed the findings of previous Monte Carlo studies on clustering procedures in that accuracy was inversely related to coverage, and that algorithms using correlation as the similarity measure were significantly more accurate than those using Euclidean distances. No evidence was found for the assumption that the positive effects of the use of correlation coefficients are confined to unconstrained mixture models.
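The similarity-measure finding has a simple geometric reading: Pearson correlation responds to profile shape, while Euclidean distance also responds to elevation. Two invented profiles with identical shape but different levels illustrate the difference:

```python
# Correlation vs Euclidean distance as similarity measures between profiles.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

profile_a = [1.0, 2.0, 3.0]
profile_b = [11.0, 12.0, 13.0]    # same shape, shifted up by 10

print(pearson(profile_a, profile_b))    # close to 1: identical shape
print(euclidean(profile_a, profile_b))  # large: different elevation
```

A clustering algorithm using correlation would place these two profiles together; one using Euclidean distance would keep them far apart.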

01 Oct 1985
TL;DR: The main goal of this thesis is to compare clustered file searches and inverted file searches in order to determine under what circumstances one search is to be preferred over the other.
Abstract: The major component of a document retrieval system is the component that searches the document collection and selects the documents to be returned in response to a query. Since users wait for the results of the search, the component must be efficient as well as effective. The main goal of this thesis is to compare clustered file searches and inverted file searches in order to determine under what circumstances one search is to be preferred over the other. A preliminary goal is to define a good cluster search. Three types of agglomerative clustering strategies, the single link, the complete link, and the group average link methods, are investigated. Searches of the single link hierarchy, the cluster hierarchy used extensively in previous research, are shown to be inferior to searches of the other hierarchy types. Searches of the group average link and complete link hierarchies perform similarly for small collections; for larger collections, searches of the complete link hierarchy are more effective. A top-down search of the group average link hierarchy is the most time efficient search asymptotically. The experimental evidence suggests that the difference in the efficiency and effectiveness of the complete link and group average link searches is due to the restricted depth of the complete link hierarchy. The depth of the group average link hierarchy increases as the size of the collection increases, but the depth of the complete link hierarchy does not. Thus the largest clusters in the complete link hierarchy are not very large, and the clusters can be accurately represented by centroids. Since the depth of the hierarchy does not increase with collection size, searches of the complete link hierarchy should remain effective for larger collections. The top-down search of the complete link hierarchy is somewhat more effective than the inverted file search. 
The relative efficiency of the two searches depends on the relative efficiency of accessing a page and computing a similarity, since the cluster search accesses many more pages but computes fewer similarities than the inverted file search. For an inexpensive similarity measure, the inverted file search is much more efficient.


Journal ArticleDOI
TL;DR: A clustering algorithm making use of some properties of Sugeno's gλ measure is presented and its performance, when run on the well-known set of the iris data, is briefly described.

Proceedings Article
18 Aug 1985
TL;DR: It is clarified that conceptual clustering processes can be explicated as composed of three distinct but inter-dependent subprocesses, each of which may be characterized along a number of dimensions related to search, thus facilitating a better understanding of the conceptual clustering process as a whole.
Abstract: Methods for Conceptual Clustering may be explicated in two lights. Conceptual Clustering methods may be viewed as extensions to techniques of numerical taxonomy, a collection of methods developed by social and natural scientists for creating classification schemes over object sets. Alternatively, conceptual clustering may be viewed as a form of learning by observation or concept formation, as opposed to methods of learning from examples or concept identification. In this paper we survey and compare a number of conceptual clustering methods along dimensions suggested by each of these views. The point we most wish to clarify is that conceptual clustering processes can be explicated as being composed of three distinct but inter-dependent subprocesses: the process of deriving a hierarchical classification scheme; the process of aggregating objects into individual classes; and the process of assigning conceptual descriptions to object classes. Each subprocess may be characterized along a number of dimensions related to search, thus facilitating a better understanding of the conceptual clustering process as a whole.

Journal ArticleDOI
TL;DR: It is shown that by appropriate specification of the underlying model, the mixture maximum likelihood approach to clustering can be applied in the context of a three-way table and is illustrated using a soybean data set which consists of multiattribute measurements on a number of genotypes each grown in several environments.
Abstract: Clustering or classifying individuals into groups such that there is relative homogeneity within the groups and heterogeneity between the groups is a problem which has been considered for many years. Most available clustering techniques are applicable only to a two-way data set, where one of the modes is to be partitioned into groups on the basis of the other mode. Suppose, however, that the data set is three-way. Then what is needed is a multivariate technique which will cluster one of the modes on the basis of both of the other modes simultaneously. It is shown that by appropriate specification of the underlying model, the mixture maximum likelihood approach to clustering can be applied in the context of a three-way table. It is illustrated using a soybean data set which consists of multiattribute measurements on a number of genotypes each grown in several environments. Although the problem is set in the framework of clustering genotypes, the technique is applicable to other types of three-way data sets.
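The mixture maximum likelihood approach at its simplest is EM for a Gaussian mixture. The sketch below is a heavily simplified univariate two-component version (equal weights and unit variances held fixed, only the means updated); the paper's three-way extension is not reproduced, and the data and starting values are invented.

```python
# EM mean updates for a two-component univariate Gaussian mixture.
import math

def em_means(data, means, iters=30):
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point
        # (equal weights, unit variances).
        resp = []
        for x in data:
            p0 = math.exp(-0.5 * (x - means[0]) ** 2)
            p1 = math.exp(-0.5 * (x - means[1]) ** 2)
            resp.append(p0 / (p0 + p1))
        # M-step: responsibility-weighted mean updates.
        means = [
            sum(r * x for r, x in zip(resp, data)) / sum(resp),
            sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp),
        ]
    return means

data = [0.0, 0.2, -0.1, 0.1, 10.0, 10.2, 9.9, 10.1]
print(em_means(data, means=[1.0, 9.0]))   # near the true component means
```

Unlike hard assignment, each point contributes to every component in proportion to its responsibility, which is what makes the likelihood framework extensible to structured (e.g., three-way) data.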

Journal ArticleDOI
TL;DR: Concepts used in knowledge description are divided into tangible ones and intermediate ones depending on whether or not they appear in the input or the output of the system.
Abstract: Knowledge organization is a very important step in building an expert system. The problem is how to organize knowledge into a conceptual structure and thus make it complete, concise, and consistent. In this paper, concepts used in knowledge description are divided into tangible ones and intermediate ones depending on whether or not they appear in the input or the output of the system. Intermediate concepts and their relationships with tangible concepts are subjected to changes. A distance measure for rules and an algorithm for conceptual clustering are described. New intermediate concepts are generated using this algorithm. A few new concepts may replace a large number of old relationships and also generate new rules for the system. An experiment on traditional Chinese medicine shows that the proposed method produces results similar to those generated by experts.

Journal ArticleDOI
TL;DR: A two-level pipelined systolic pattern clustering array is proposed and the modularity and the regularity of the system architecture make it suited for VLSI implementations.
Abstract: Cluster analysis is a valuable tool in exploratory pattern analysis, especially when very little prior information about the data is available. In unsupervised pattern recognition and image segmentation applications, clustering techniques play an important role. The squared-error clustering technique is the most popular one among different clustering techniques. Due to the iterative nature of the squared-error clustering, it demands substantial CPU time, even for modest numbers of patterns. Recent advances in VLSI microelectronic technology triggered the idea of implementing the squared-error clustering directly in hardware. A two-level pipelined systolic pattern clustering array is proposed in this paper. The memory storage and access schemes are designed to enable a rhythmic data flow between processing units. Each processing unit is pipelined to further enhance the system performance. The total processing time for each pass of pattern labeling and cluster center updating is essentially dominated by the time required to fetch the pattern matrix once. Detailed architectural configuration, system performance evaluation, and simulation experiments are presented. The modularity and the regularity of the system architecture make it suited for VLSI implementations.

Journal ArticleDOI
TL;DR: The feasibility of representing the temporal structure of a multidimensional rainfall process with simpler stochastic models, and the robustness of model parameters across time scales, are investigated via controlled numerical experiments.
Abstract: The feasibility of representing the temporal structure of a multidimensional rainfall process with simpler stochastic models, and the robustness of model parameters across time scales, are investigated here via controlled numerical experiments. A multidimensional representation for precipitation, given in the theory recently proposed by E. Waymire et al. (1984), is used for simulating rainfall in space and time. The model produces moving storms with realistic mesoscale meteorological features, e.g., clustering, birth and death of cells, cell intensity attenuation in time and space, etc. Two-year traces of rainfall intensities at fixed gage stations were generated at intervals of 0.1 hours for three climates. These traces are then aggregated at different time scales ranging from 1 to 24 hours. First- and second-order statistics are evaluated from the above series at each aggregation level and they are used for estimating the parameters of three one-dimensional models of temporal rainfall at a point: (1) Poisson model with independent marks, (2) rectangular pulses with independent intensity and duration, and (3) Neyman-Scott with independent marks. Only at very high levels of aggregation was the compound Poisson model able to reproduce the statistical structure of the simulated traces. The rectangular pulses model was found to have parameters which vary significantly with the time scale of aggregation and, moreover, it leads to large distortions in the mean depth and duration of storm events. The Neyman-Scott scheme proved to be the most stable model throughout the different time scales and it also reproduced quite well the average storm characteristics at the event level. None of these three models was able to reproduce in a satisfactory manner the simulated extreme value distribution of the multidimensional model.

Proceedings Article
18 Aug 1985
TL;DR: This paper addresses a problem of induction (generalization learning) which is more difficult than any comparable work in AI and achieves considerable generality with superior noise management and low computational complexity.
Abstract: This paper addresses a problem of induction (generalization learning) which is more difficult than any comparable work in AI. The subject of the present research is a hard problem of new terms, a task of realistic constructive induction. While the approach is quite general, the system is analyzed and tested in an environment of heuristic search where noise management and incremental learning are necessary. Here constructive induction becomes feature formation from data represented in elementary form. A high-level attribute or feature such as "piece advantage" in checkers is much more abstract than an elementary descriptor or primitive such as contents of a checkerboard square. Features have often been used in evaluation functions; primitives are usually too detailed for this. To create abstract features from primitives (i.e. to restructure data descriptions), a new form of clustering is used which involves layering of knowledge and invariance of utility relationships related to data primitives and task goals. The scheme, which is both model- and data-driven, requires little background, domain-specific knowledge, but rather constructs it. The method achieves considerable generality with superior noise management and low computational complexity. Although the domains addressed are difficult, initial experimental results are encouraging.

Journal ArticleDOI
TL;DR: An algorithm for record clustering that is capable of detecting sudden changes in users' access patterns and then suggesting an appropriate assignment of records to blocks is presented.
Abstract: An algorithm for record clustering is presented. It is capable of detecting sudden changes in users' access patterns and then suggesting an appropriate assignment of records to blocks. It is conceptually simple, highly intuitive, does not need to classify queries into types, and avoids collecting individual query statistics. Experimental results indicate that it converges rapidly; its performance is about 50 percent better than that of the total sort method, and about 100 percent better than that of randomly assigning records to blocks.

Journal ArticleDOI
TL;DR: A districting algorithm is used to determine clusters including appropriate numbers of students, and for each cluster, a route and the stops along this route are determined.

Journal ArticleDOI
TL;DR: A branch and bound algorithm for optimal clustering is developed and applied to a variety of test problems and concludes that the method is practical for problems of up to 100 or so observations if the number of clusters is about 6 or less and the clusters are reasonably well separated.
Abstract: A branch and bound algorithm for optimal clustering is developed and applied to a variety of test problems. The objective function is minimization of within-group sum-of-squares, although the algorithm can be applied to loss functions which meet certain conditions. The algorithm is based on earlier work of Koontz et al. (1975). The efficiency of the method for determining optimal solutions is studied as a function of problem size, number of clusters, and underlying degree of separability of the observations. The value of the approach in determining lower bounds is also investigated. We conclude that the method is practical for problems of up to 100 or so observations if the number of clusters is about 6 or less and the clusters are reasonably well separated. If separation is poor and/or a larger number of clusters are sought, the computing time increases significantly. The approach provides very tight lower bounds early in the enumeration for problems with moderate separation and six or fewer clusters.
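What branch and bound accelerates is exact minimization of the within-group sum-of-squares over all assignments. For a tiny invented problem the search space can simply be enumerated, as below; branch and bound prunes this same tree with lower bounds, which is what makes ~100 observations feasible.

```python
# Exhaustive search for the assignment minimizing within-group sum-of-squares.
from itertools import product

def within_ss(points, assignment, k):
    total = 0.0
    for g in range(k):
        members = [p for p, a in zip(points, assignment) if a == g]
        if members:
            mean = sum(members) / len(members)
            total += sum((p - mean) ** 2 for p in members)
    return total

def optimal_clustering(points, k):
    best = None
    for assignment in product(range(k), repeat=len(points)):
        if len(set(assignment)) < k:        # require all k clusters nonempty
            continue
        wss = within_ss(points, assignment, k)
        if best is None or wss < best[0]:
            best = (wss, assignment)
    return best

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
wss, assignment = optimal_clustering(points, k=2)
print(wss, assignment)   # minimum within-group SS is 4.0
```

The enumeration grows as k^n, which is exactly why tight lower bounds early in the search matter so much in the paper's results.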