
Showing papers on "Cluster analysis" published in 1988




Journal ArticleDOI
TL;DR: This paper provides an introduction to the field of artificial neural nets by reviewing six important neural net models that can be used for pattern classification and exploring how some existing classification and clustering algorithms can be performed using simple neuron-like components.
Abstract: Artificial neural net models have been studied for many years in the hope of achieving human-like performance in the fields of speech and image recognition. These models are composed of many nonlinear computational elements operating in parallel and arranged in patterns reminiscent of biological neural nets. Computational elements or nodes are connected via weights that are typically adapted during use to improve performance. There has been a recent resurgence in the field of artificial neural nets caused by new net topologies and algorithms, analog VLSI implementation techniques, and the belief that massive parallelism is essential for high performance speech and image recognition. This paper provides an introduction to the field of artificial neural nets by reviewing six important neural net models that can be used for pattern classification. These nets are highly parallel building blocks that illustrate neural net components and design principles and can be used to construct more complex systems. In addition to describing these nets, a major emphasis is placed on exploring how some existing classification and clustering algorithms can be performed using simple neuron-like components. Single-layer nets can implement algorithms required by Gaussian maximum-likelihood classifiers and optimum minimum-error classifiers for binary patterns corrupted by noise. More generally, the decision regions required by any classification algorithm can be generated in a straightforward manner by three-layer feed-forward nets.
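The claim that single-layer nets suffice for Gaussian maximum-likelihood classification is easy to make concrete. A minimal sketch, assuming equal class priors and a shared isotropic covariance sigma2*I, under which the ML discriminant is linear and computable by one layer of weighted-sum units (names and data are illustrative, not from the paper):

```python
import numpy as np

def single_layer_gaussian_classifier(means, sigma2):
    """Single layer of weighted sums realizing the Gaussian ML decision rule
    (equal priors, shared isotropic covariance sigma2 * I; illustrative only)."""
    M = np.asarray(means, dtype=float)
    W = M / sigma2                                  # one weight vector per class
    b = -(M ** 2).sum(axis=1) / (2.0 * sigma2)      # per-class bias term
    return lambda X: np.argmax(np.asarray(X) @ W.T + b, axis=1)

means = [[0.0, 0.0], [3.0, 3.0]]                    # hypothetical class means
classify = single_layer_gaussian_classifier(means, sigma2=1.0)
print(classify([[0.2, -0.1], [2.8, 3.1]]))          # -> [0 1]
```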

3,164 citations


Book
01 Jan 1988
TL;DR: Presents the mixture likelihood approach to clustering, including a case study on the homogeneity of mixing proportions and an assessment of the performance of the mixture likelihood approach to clustering.
Abstract: General Introduction Introduction History of Mixture Models Background to the General Classification Problem Mixture Likelihood Approach to Clustering Identifiability Likelihood Estimation for Mixture Models via EM Algorithm Start Values for EM Algorithm Properties of Likelihood Estimators for Mixture Models Information Matrix for Mixture Models Tests for the Number of Components in a Mixture Partial Classification of the Data Classification Likelihood Approach to Clustering Mixture Models with Normal Components Likelihood Estimation for a Mixture of Normal Distributions Normal Homoscedastic Components Asymptotic Relative Efficiency of the Mixture Likelihood Approach Expected and Observed Information Matrices Assessment of Normality for Component Distributions: Partially Classified Data Assessment of Typicality: Partially Classified Data Assessment of Normality and Typicality: Unclassified Data Robust Estimation for Mixture Models Applications of Mixture Models to Two-Way Data Sets Introduction Clustering of Hemophilia Data Outliers in Darwin's Data Clustering of Rare Events Latent Classes of Teaching Styles Estimation of Mixing Proportions Introduction Likelihood Estimation Discriminant Analysis Estimator Asymptotic Relative Efficiency of Discriminant Analysis Estimator Moment Estimators Minimum Distance Estimators Case Study Homogeneity of Mixing Proportions Assessing the Performance of the Mixture Likelihood Approach to Clustering Introduction Estimators of the Allocation Rates Bias Correction of the Estimated Allocation Rates Estimated Allocation Rates of Hemophilia Data Estimated Allocation Rates for Simulated Data Other Methods of Bias Corrections Bias Correction for Estimated Posterior Probabilities Partitioning of Treatment Means in ANOVA Introduction Clustering of Treatment Means by the Mixture Likelihood Approach Fitting of a Normal Mixture Model to a RCBD with Random Block Effects Some Other Methods of Partitioning Treatment Means Example 1 Example 2 Example 3 Example 4 Mixture Likelihood Approach to the Clustering of Three-Way Data Introduction Fitting a Normal Mixture Model to Three-Way Data Clustering of Soybean Data Multidimensional Scaling Approach to the Analysis of Soybean Data References Appendix
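As a concrete illustration of the book's central machinery, here is a minimal EM loop for a univariate two-component normal mixture. This is a sketch with simplified start values and no stopping rule, not the book's treatment:

```python
import numpy as np

def em_normal_mixture(x, k=2, iters=100, seed=0):
    """Minimal EM for a univariate k-component normal mixture."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                      # mixing proportions
    mu = rng.choice(x, k, replace=False)          # crude start values for EM
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # E-step: posterior probability of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted data
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
print(em_normal_mixture(x))
```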

2,397 citations


Journal ArticleDOI
John Daugman
TL;DR: A three-layered neural network based on interlaminar interactions involving two layers with fixed weights and one layer with adjustable weights finds coefficients for complete conjoint 2-D Gabor transforms without restrictive conditions for image analysis, segmentation, and compression.
Abstract: A three-layered neural network is described for transforming two-dimensional discrete signals into generalized nonorthogonal 2-D Gabor representations for image analysis, segmentation, and compression. These transforms are conjoint spatial/spectral representations, which provide a complete image description in terms of locally windowed 2-D spectral coordinates embedded within global 2-D spatial coordinates. In the present neural network approach, based on interlaminar interactions involving two layers with fixed weights and one layer with adjustable weights, the network finds coefficients for complete conjoint 2-D Gabor transforms without restrictive conditions. In wavelet expansions based on a biologically inspired log-polar ensemble of dilations, rotations, and translations of a single underlying 2-D Gabor wavelet template, image compression is illustrated with ratios up to 20:1. Also demonstrated is image segmentation based on the clustering of coefficients in the complete 2-D Gabor transform.
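For readers unfamiliar with the transform's building block, the sketch below samples a single 2-D Gabor wavelet (a complex sinusoid under a Gaussian window). It is illustrative only and omits the log-polar ensemble and the coefficient-finding network:

```python
import numpy as np

def gabor_2d(size=32, wavelength=8.0, theta=0.0, sigma=6.0):
    """Sample one 2-D Gabor wavelet: a complex sinusoid under a Gaussian window."""
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half].astype(float)
    u = x * np.cos(theta) + y * np.sin(theta)           # rotated spatial coordinate
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(2j * np.pi * u / wavelength)
    return envelope * carrier

g = gabor_2d()
print(g.shape, np.abs(g).max())
```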

1,977 citations


Journal ArticleDOI
TL;DR: Reviews algorithms that allow hierarchic agglomerative clustering methods to be implemented for document retrieval; experimental evidence suggests that nearest neighbor clusters provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
Abstract: This article reviews recent research into the use of hierarchic agglomerative clustering methods for document retrieval. After an introduction to the calculation of interdocument similarities and to clustering methods that are appropriate for document clustering, the article discusses algorithms that can be used to allow the implementation of these methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies and the results are presented of a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering method. It is suggested that the complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
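A minimal sketch of the pipeline the article reviews, using SciPy's hierarchical clustering on cosine dissimilarities; the tiny term-frequency matrix is invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy term-frequency vectors, one row per document (illustrative data).
docs = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 2, 0, 2],
])
d = pdist(docs, metric='cosine')                   # interdocument dissimilarities
tree = linkage(d, method='complete')               # complete linkage hierarchy
print(fcluster(tree, t=2, criterion='maxclust'))   # cut the tree into two clusters
```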

842 citations


Journal ArticleDOI
TL;DR: The present simulation study examined the standardization problem and found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure.
Abstract: A methodological problem in applied clustering involves the decision of whether or not to standardize the input variables prior to the computation of a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision process. The present simulation study examined the standardization problem. A variety of data structures were generated which varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments. These involved error free data, error perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. It was found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.
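In code, the contrast the study draws is simple to state. A sketch of two of the strategies (the study compared eight; its range-based forms include variants that also subtract the variable's minimum):

```python
import numpy as np

def standardize_range(X):
    """Divide each variable by its range (one of the range-based strategies)."""
    return X / (X.max(axis=0) - X.min(axis=0))

def standardize_z(X):
    """Traditional z-score standardization, for comparison."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
print(standardize_range(X))
print(standardize_z(X))
```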

715 citations


Proceedings ArticleDOI
01 Jan 1988
TL;DR: For a fixed cluster size, this work gives a polynomial time approximation scheme that estimates the optimal number of clusters under the second measure of cluster size within factors arbitrarily close to 1.
Abstract: In a clustering problem, the aim is to partition a given set of n points in d-dimensional space into k groups, called clusters, so that points within each cluster are near each other. Two objective functions frequently used to measure the performance of a clustering algorithm are, for any Lq metric, (a) the maximum distance between pairs of points in the same cluster, and (b) the maximum distance between points in each cluster and a chosen cluster center; we refer to either measure as the cluster size. We show that one cannot approximate the optimal cluster size for a fixed number of clusters within a factor close to 2 in polynomial time, for two or more dimensions, unless P=NP. We also present an algorithm that achieves this factor of 2 in time O(n log k), and show that this running time is optimal in the algebraic decision tree model. For a fixed cluster size, on the other hand, we give a polynomial time approximation scheme that estimates the optimal number of clusters under the second measure of cluster size within factors arbitrarily close to 1. Our approach is extended to provide approximation algorithms for the restricted centers, suppliers, and weighted suppliers problems that run in optimal O(n log k) time and achieve optimal or nearly optimal approximation bounds.
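The factor-2 guarantee is classically attained by the greedy farthest-point heuristic, sketched below in O(nk) time; the paper's algorithm achieves the same factor in optimal O(n log k) time, so this is a simpler relative, not the paper's method:

```python
import numpy as np

def farthest_first_centers(points, k, seed=0):
    """Greedy farthest-point heuristic: a classical factor-2 approximation
    for the cluster-size measure based on distances to chosen centers."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    dist = np.linalg.norm(points - centers[0], axis=1)
    for _ in range(k - 1):
        i = np.argmax(dist)                   # point farthest from all chosen centers
        centers.append(points[i])
        dist = np.minimum(dist, np.linalg.norm(points - points[i], axis=1))
    return np.array(centers), dist.max()      # centers and resulting cluster size

pts = np.random.default_rng(1).random((200, 2))
centers, radius = farthest_first_centers(pts, k=3)
print(radius)
```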

485 citations


Proceedings ArticleDOI
05 Dec 1988
TL;DR: An algorithm that separates the pixels in the image into clusters based on both their intensity and their spatial location is developed, which performs better than the K-means algorithm and its nonadaptive extensions that incorporate spatial constraints by the use of Gibbs random fields.
Abstract: A generalization of the K-means clustering algorithm to include spatial constraints and to account for local intensity variations in the image is proposed. Spatial constraints are included by the use of a Gibbs random field model. Local intensity variations are accounted for in an iterative procedure involving averaging over a sliding window whose size decreases as the algorithm progresses. Results with an eight-neighbor Gibbs random field model applied to pictures of industrial objects and a variety of other images show that the algorithm performs better than the K-means algorithm and its nonadaptive extensions.
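A crude stand-in for the idea, not the paper's algorithm (which uses a sliding-window local mean and an eight-neighbor Gibbs model): a K-means-style label update whose per-pixel cost trades intensity fit against agreement with the four nearest neighbor labels, with beta playing the role of the spatial constraint:

```python
import numpy as np

def spatial_kmeans_labels(img, k=2, beta=0.5, iters=10):
    """K-means-style segmentation with a simple spatial penalty: each pixel's
    cost is its squared intensity distance to a class mean minus a bonus for
    agreeing with its 4 neighbors (a crude stand-in for the Gibbs prior)."""
    h, w = img.shape
    means = np.linspace(img.min(), img.max(), k)
    labels = np.abs(img[..., None] - means).argmin(-1)
    for _ in range(iters):
        # count agreeing neighbors for each candidate class at each pixel
        agree = np.zeros((h, w, k))
        padded = np.pad(labels, 1, mode='edge')
        for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nb = padded[1 + dy:h + 1 + dy, 1 + dx:w + 1 + dx]
            agree += (nb[..., None] == np.arange(k))
        cost = (img[..., None] - means) ** 2 - beta * agree
        labels = cost.argmin(-1)
        for c in range(k):                     # update the class means
            if (labels == c).any():
                means[c] = img[labels == c].mean()
    return labels

img = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.ones((8, 8)))
img += np.random.default_rng(0).normal(0, 0.2, img.shape)
print(spatial_kmeans_labels(img).sum())
```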

247 citations


Proceedings ArticleDOI
24 Jul 1988
TL;DR: A neural-network clustering algorithm proposed by T. Kohonen (1986, 88) is used to design a codebook for the vector quantization of images and the results are compared with coded images when the codebook is designed by the Linde-Buzo-Gray algorithm.
Abstract: A neural-network clustering algorithm proposed by T. Kohonen (1986, 88) is used to design a codebook for the vector quantization of images. This neural-network clustering algorithm, which is better known as the Kohonen self-organizing feature map, is a two-dimensional set of extensively interconnected nodes or processing units. The synaptic strengths between the input and the output nodes represent the centroid of the clusters after the network has been adapted to the input patterns. Input vectors are presented one at a time, and the weights connecting the input signals to the neurons are adaptively updated such that the point density function of the weights tends to approximate the probability density function of the input vector. Results are presented for a number of coded images using the codebook designed by the self-organizing feature map. The results are compared with coded images when the codebook is designed by the Linde-Buzo-Gray algorithm.
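A minimal sketch of this kind of codebook design (invented stand-in data; parameters and schedule are illustrative, not from the paper): train a small 2-D Kohonen map and use the learned weight vectors as VQ codewords.

```python
import numpy as np

def som_codebook(data, grid=(4, 4), iters=2000, lr0=0.5, radius0=2.0, seed=0):
    """Train a small 2-D Kohonen map; the learned weights serve as VQ codewords."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows * cols, data.shape[1]))           # codebook vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(iters):
        x = data[rng.integers(len(data))]                  # present one input vector
        winner = np.argmin(((W - x) ** 2).sum(axis=1))     # best-matching unit
        frac = t / iters
        lr = lr0 * (1 - frac)                              # shrinking step size
        radius = radius0 * (1 - frac) + 1e-9               # shrinking neighborhood
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2 * radius ** 2))                # neighborhood function
        W += lr * h[:, None] * (x - W)                     # pull units toward x
    return W

blocks = np.random.default_rng(1).random((500, 16))       # stand-in 4x4 image blocks
print(som_codebook(blocks).shape)                          # (16, 16) codebook
```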

247 citations


Journal ArticleDOI
TL;DR: The method is a direct extension of the method of Taylor (1987) incorporating a consensus sequence approach and allows considerable freedom in the control of the clustering of the sequences, so that the program can be adapted to particular problems.
Abstract: A method for the alignment of two or more biological sequences is described. The method is a direct extension of the method of Taylor (1987) incorporating a consensus sequence approach and allows considerable freedom in the control of the clustering of the sequences. At one extreme this is equivalent to the earlier method (Taylor 1987), whereas at the other, the clustering approaches the binary method of Feng and Doolittle (1987). Such freedom allows the program to be adapted to particular problems, which has the important advantage of resulting in considerable savings in computer time, allowing very large problems to be tackled. Besides a detailed analysis of the alignment of the cytochrome c superfamily, the clustering and alignment of the PIR sequence data bank (3500 sequences approx.) is described.

Journal ArticleDOI
TL;DR: The author illustrates the improved clustering of similar records that Gray codes can achieve with multiattribute hashing, and discusses how Gray codes could be applied to some retrieval methods designed for range queries, such as the grid file and the approach based on the so-called z-ordering.
Abstract: It is suggested that Gray codes be used to improve the performance of methods for partial match and range queries. Specifically, the author illustrates the improved clustering of similar records that Gray codes can achieve with multiattribute hashing. Gray codes are used instead of binary codes to map record signatures to buckets. In Gray codes, successive codewords differ in the value of exactly one bit position; thus, successive buckets hold records with similar record signatures. The proposed method achieves better clustering of similar records, thus reducing the I/O time. A mathematical model is developed to derive formulas giving the average performance of both methods, and it is shown that the proposed method achieves 0-50% relative savings over the binary codes. The author also discusses how Gray codes could be applied to some retrieval methods designed for range queries, such as the grid file and the approach based on the so-called z-ordering. Gray codes are also used to design good distance-preserving functions, which map a k-dimensional (k-D) space into a one-dimensional one, in such a way that points that are close in the k-D space are likely to be close in the 1-D space.
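The one-bit-change property that the method exploits is easy to check directly. A small sketch of the binary-reflected Gray code and its inverse:

```python
def to_gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def from_gray(g):
    """Invert the Gray code mapping."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Successive Gray codewords differ in exactly one bit, so records whose
# signatures are numerically adjacent land in nearby buckets.
for i in range(8):
    print(i, format(to_gray(i), '03b'), from_gray(to_gray(i)))
```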

Journal ArticleDOI
TL;DR: A forward selection procedure for identifying the subset of variables is proposed and studied in the context of complete linkage hierarchical clustering, and can be applied to other clustering methods, too.
Abstract: Standard clustering algorithms can completely fail to identify clear cluster structure if that structure is confined to a subset of the variables. A forward selection procedure for identifying the subset is proposed and studied in the context of complete linkage hierarchical clustering. The basic approach can be applied to other clustering methods, too.


Journal ArticleDOI
TL;DR: The purpose of this paper is to collect the main global and local, numerical and stochastic, convergence results for FCM in a brief and unified way.
Abstract: One of the main techniques embodied in many pattern recognition systems is cluster analysis — the identification of substructure in unlabeled data sets. The fuzzy c-means algorithms (FCM) have often been used to solve certain types of clustering problems. During the last two years several new local results concerning both numerical and stochastic convergence of FCM have been found. Numerical results describe how the algorithms behave when evaluated as optimization algorithms for finding minima of the corresponding family of fuzzy c-means functionals. Stochastic properties refer to the accuracy of minima of FCM functionals as approximations to parameters of statistical populations which are sometimes assumed to be associated with the data. The purpose of this paper is to collect the main global and local, numerical and stochastic, convergence results for FCM in a brief and unified way.
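For reference, a sketch of the standard fuzzy c-means alternating iteration that these convergence results concern (the fuzzifier m and the small epsilon guard are the usual choices; this is illustrative, not tied to any particular paper's notation):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Standard FCM alternating optimization of memberships U and centers V."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                     # memberships sum to 1
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]          # weighted cluster centers
        d2 = ((X[:, None, :] - V) ** 2).sum(-1) + 1e-12   # squared distances
        U = 1.0 / (d2 ** (1.0 / (m - 1.0)))               # membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, V

X = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(4, 1, (50, 2))])
U, V = fuzzy_c_means(X)
print(V)
```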

Journal ArticleDOI
TL;DR: A study is made of the properties of the three types of clustered sequences of nodes for hierarchies and DAGs, and algorithms are developed for generating the clustered sequences, retrieving the descendants of a given node, and inserting new nodes into existing clustered sequences of nodes which preserve their clustering properties.
Abstract: A DAG (directed acyclic graph) is an important data structure which requires efficient support in CAD (computer-aided design) databases. It typically arises from the design hierarchy, which describes complex designs in terms of subdesigns. A study is made of the properties of the three types of clustered sequences of nodes for hierarchies and DAGs, and algorithms are developed for generating the clustered sequences, retrieving the descendants of a given node, and inserting new nodes into existing clustered sequences of nodes which preserve their clustering properties. The performance of the clustered sequences is compared.

Journal ArticleDOI
TL;DR: A new divisive algorithm for multidimensional data clustering that produces much smaller quantization errors than the median-cut and mean-split algorithms, with solutions close to the local optima derived by the k-means iterative procedure.
Abstract: A new divisive algorithm for multidimensional data clustering is suggested. Based on the minimization of the sum-of-squared-errors, the proposed method produces much smaller quantization errors than the median-cut and mean-split algorithms. It is also observed that the solutions obtained from our algorithm are close to the local optimal ones derived by the k-means iterative procedure.
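A mean-split-style divisive step is easy to sketch: repeatedly split the cluster contributing the largest sum of squared errors, cutting at the mean of its widest dimension. This illustrates the family of methods being compared, not the paper's algorithm:

```python
import numpy as np

def divisive_split(X, n_clusters=4):
    """Split the cluster with the largest sum of squared errors at the mean of
    its widest dimension (a mean-split-style divisive step; illustrative only)."""
    clusters = [X]
    while len(clusters) < n_clusters:
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        i = int(np.argmax(sse))                 # cluster contributing most error
        c = clusters.pop(i)
        dim = int(c.var(axis=0).argmax())       # widest dimension
        mask = c[:, dim] <= c[:, dim].mean()    # cut at the mean along it
        clusters += [c[mask], c[~mask]]
    return clusters

X = np.random.default_rng(0).random((100, 3))
print([len(c) for c in divisive_split(X)])
```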

Journal ArticleDOI
TL;DR: A number of ways of investigating heterogeneity in a two-way contingency table are reviewed in this article, where the authors consider chi-square decompositions of the Pearson Chi-square statistic with respect to the nodes of a hierarchical clustering of the rows and/or the columns of the table.
Abstract: A number of ways of investigating heterogeneity in a two-way contingency table are reviewed. In particular, we consider chi-square decompositions of the Pearson chi-square statistic with respect to the nodes of a hierarchical clustering of the rows and/or the columns of the table. A cut-off point which indicates “significant clustering” may be defined on the binary trees associated with the respective row and column cluster analyses. This approach provides a simple graphical procedure which is useful in interpreting a significant chi-square statistic of a contingency table.
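To see the flavor of the decomposition, one can compare the Pearson chi-square of a table before and after merging two rows that a cluster analysis would join; the drop reflects the heterogeneity between the merged rows. A sketch with invented counts (the paper's exact decomposition is defined over the nodes of the row and column cluster trees):

```python
import numpy as np
from scipy.stats import chi2_contingency

# A small two-way contingency table (illustrative counts).
table = np.array([[20, 5, 10],
                  [10, 15, 5],
                  [5, 10, 20]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)

# Merge rows 0 and 1 and recompute: the reduction in chi-square reflects
# the heterogeneity between the two merged rows.
merged = np.vstack([table[0] + table[1], table[2]])
chi2_m, _, _, _ = chi2_contingency(merged)
print(chi2 - chi2_m)
```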

Book
01 Jan 1988
TL;DR: Presents theoretical foundations for machine learning, including the representation of complex knowledge by clauses and the equivalence between theorems and clauses.
Abstract:
1. Why Machine Learning and AI: The Contributions of AI to Learning Techniques
2. Theoretical Foundations for Machine Learning
3. Representation of Complex Knowledge by Clauses
4. Representation of Knowledge about Actions and the Addition of New Rules to a Knowledge Base
5. Learning by Doing
6. A Formal Presentation of Version Spaces
7. Explanation-Based Learning
8. Learning by Similarity Detection: The Empirical Approach
9. Learning by Similarity Detection: The 'Rational' Approach
10. Automatic Construction of Taxonomies: Techniques for Clustering
11. Debugging and Understanding in Depth: The Learning of Micro-Worlds
12. Learning by Analogy
Appendix 1: Equivalence Between Theorems and Clauses
Appendix 2: Synthesis of Predicates
Appendix 3: Machine Learning in Context

Journal ArticleDOI
TL;DR: In this paper, a finite mixture density model is proposed for the clustering of mixed mode data, and a simplex algorithm is used to obtain maximum likelihood estimates; several small-scale numerical examples indicate that its performance is relatively satisfactory.

Journal ArticleDOI
TL;DR: A method of calculating the maximum-likelihood clustering for the unsupervised estimation of polynomial models for the data in images of smooth surfaces or for range data for such surfaces is presented.
Abstract: A method of calculating the maximum-likelihood clustering for the unsupervised estimation of polynomial models for the data in images of smooth surfaces or for range data for such surfaces is presented. An image or a depth map of a region of smooth 3-D surface is modeled as a polynomial plus white noise. A region of physically meaningful textured image, such as the image of foliage, grass, or road in outdoor scenes or conductor or lintburn on a thick-film substrate, is modeled as a colored Gaussian-Markov random field (MRF) with a polynomial mean-value function. Unsupervised model parameter estimation is accomplished by determining the segmentation and model parameter values that maximize the likelihood of the data or a more general Bayesian performance functional. Agglomerative clustering is used for this purpose.

Journal ArticleDOI
TL;DR: Heuristic algorithms for batching a set of orders such that the total distance travelled by the order picking machine is minimized are presented and their efficiency and validity are illustrated through computer simulation.
Abstract: This paper deals with an order picking problem in an automated storage and retrieval system (AS/RS). We present heuristic algorithms for batching a set of orders such that the total distance travelled by the order picking machine is minimized. These algorithms are based on cluster analysis and their efficiency and validity are illustrated through computer simulation. The results show that the algorithms developed perform substantially better than those from previous studies.

Journal ArticleDOI
TL;DR: The classification method and the preliminary results obtained with liver biopsy electrophoretograms are described and heuristic clustering is also compared to other classification techniques.
Abstract: The interpretation of two-dimensional gel electrophoresis (2-DGE) profiles can be facilitated by artificial intelligence and machine learning programs. We have incorporated into our 2-DGE computer analysis system (termed MELANIE-Medical Electrophoresis Analysis Interactive Expert system) a program which automatically classifies 2-DGE patterns using heuristic clustering analysis. This program is a step toward machine learning. In this publication, we describe the classification method and the preliminary results obtained with liver biopsy electrophoretograms. Heuristic clustering is also compared to other classification techniques.

Proceedings ArticleDOI
01 May 1988
TL;DR: This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm, which has been implemented and applied to two document collections.
Abstract: The importance of a thesaurus in the successful operation of an information retrieval system is well recognized. Yet techniques which support the automatic generation of thesauri remain largely undiscovered. This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm. This method has been implemented and applied to two document collections. Preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation.

Journal ArticleDOI
TL;DR: This paper discusses effective and computationally feasible approaches for this task in situations where there are fairly large and complex data sets; the techniques stressed are all-subsets regression and a kind of recursive partition clustering.

Journal ArticleDOI
TL;DR: Two basic components of the knowledge based system, namely the expert system and the heuristic clustering algorithm, are discussed; the system considers alternative process plans and multiple machines for solving the generalized group technology problem.
Abstract: In this paper a knowledge based system (EXGT-S) for solving the generalized group technology problem is presented. The formulation of the group technology problem involves constraints related to machine capacity, material handling system capabilities, machine cell dimensions and technological requirements. It has been developed for an automated manufacturing system. EXGT-S is based on the tandem system architecture presented in Kusiak (1987). It considers alternative process plans and multiple machines. EXGT-S takes advantage of the developments in expert systems and optimization. Two basic components of the knowledge based system, namely the expert system and the heuristic clustering algorithm, are discussed. Each partial solution generated by the clustering algorithm is evaluated by the expert system, which modifies the search directions of the algorithm.

Proceedings ArticleDOI
11 Apr 1988
TL;DR: The authors propose a text-to-speech synthesis method based on automatic synthesis unit generation techniques using a natural speech database; the generated units are more consistent than those obtained through other methods, with the result that more intelligible speech can be reconstructed.
Abstract: The authors propose a text-to-speech synthesis method based on automatic synthesis unit generation techniques using a natural speech database. They have termed the automatic procedure context oriented clustering (COC). Using the COC procedure, 627 phonetic synthesis units were generated automatically based on 432 words uttered by a male speaker. This systematic approach has several advantages. First, as synthesis units can be generated automatically without any a priori phonological knowledge, it is easy to change the number of units and voices. Second, following from this, the technique can be applied to any language. Third, the generation of allophonic synthesis units is not dependent on human decisions but on the statistical characteristics of spectral parameters in natural speech. Thus, the generated units are more consistent than those obtained through other methods, with the result that more intelligible speech can be reconstructed.

Book
01 Jun 1988
TL;DR: Experimental results are reported showing that Voronoi trees are a proper and very efficient tool for the representation of proximity properties and generation of suitable clusterings.

Journal ArticleDOI
TL;DR: In this paper, it is shown that the clustering structures in the observed universe and in numerical simulations are not well represented by a homogeneous measure on a fractal, and it is proposed that a good description of these clustering structures is given by multifractals - fractals having more than one scaling index.
Abstract: It is shown that the clustering structures in the observed universe and in numerical simulations are not well represented by a homogeneous measure on a fractal. It is proposed that a good description of these clustering structures is given by multifractals - fractals having more than one scaling index. The multifractal characteristics of a data sample and a numerical simulation of an axion-dominated universe are evaluated. The clustering structures revealed by the multifractal analysis are quite similar, both showing a variation of dimensionality from 1 to 3 over similar ranges of scales. The multifractal description provides a natural way of scaling the numerical models to the data. If a bias is applied to the simulation, the result is a distribution of points that has almost constant dimension, 1.5, on all scales. This reduction of the distribution of dimensionality by biasing may be a problem for theories that invoke naive biasing schemes.

Proceedings ArticleDOI
01 Feb 1988
TL;DR: In this paper, the authors established formal bounds on the combinatorics of this approach and showed that the expected complexity of recognizing isolated objects is quadratic in the number of model and sensory fragments, but exponential in the size of the correct interpretation.
Abstract: The problem of recognizing rigid objects from noisy sensory data has been successfully attacked in previous work by using a constrained search approach. Empirical investigations have shown the method to be very effective when recognizing and localizing isolated objects, but less effective when dealing with occluded objects where much of the sensory data arises from objects other than the one of interest. When clustering techniques such as the Hough transform are used to isolate likely subspaces of the search space, empirical performance in cluttered scenes improves considerably. In this note, we establish formal bounds on the combinatorics of this approach. Under some simple assumptions, we show that the expected complexity of recognizing isolated objects is quadratic in the number of model and sensory fragments, but that the expected complexity of recognizing objects in cluttered environments is exponential in the size of the correct interpretation. We also provide formal bounds on the efficacy of using the Hough transform to preselect likely subspaces, showing that the problem remains exponential, but that in practical terms the size of the problem is significantly decreased.