
Showing papers on "Cluster analysis published in 1992"


Journal ArticleDOI
01 Jun 1992
TL;DR: A document browsing technique that employs document clustering as its primary operation is presented, along with fast (linear-time) clustering algorithms that provide a powerful new access paradigm.
Abstract: Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
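The linear-time behaviour the authors argue for can be illustrated with a single-pass "leader" clustering sketch: each document is compared only against the current cluster leaders, so one scan over the corpus suffices. This is not the paper's actual algorithm; the cosine similarity and the threshold value are assumptions for the example.

```python
def cosine(a, b):
    # a, b: sparse term-frequency dicts for two documents
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(docs, threshold=0.3):
    # each doc joins the first cluster whose leader is similar enough,
    # otherwise it starts a new cluster (one pass, linear in len(docs))
    leaders, clusters = [], []
    for i, doc in enumerate(docs):
        for j, leader in enumerate(leaders):
            if cosine(doc, leader) >= threshold:
                clusters[j].append(i)
                break
        else:
            leaders.append(doc)
            clusters.append([i])
    return clusters

docs = [
    {"cluster": 2, "fast": 1},
    {"cluster": 1, "fast": 2},
    {"retrieval": 2, "query": 1},
]
print(leader_cluster(docs))  # → [[0, 1], [2]]
```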

1,596 citations


Journal ArticleDOI
TL;DR: Two stochastic algorithms are derived from this general Classification EM algorithm, incorporating random perturbations, to reduce the initial-position dependence of the classical optimization clustering algorithms.

810 citations


Proceedings ArticleDOI
01 Jun 1992
TL;DR: It is shown that optimal effectiveness occurs when using only a small proportion of the indexing terms available, and that effectiveness peaks at a higher feature set size and lower effectiveness level for syntactic phrase indexing than for word-based indexing.
Abstract: Syntactic phrase indexing and term clustering have been widely explored as text representation techniques for text retrieval. In this paper we study the properties of phrasal and clustered indexing languages on a text categorization task, enabling us to study their properties in isolation from query interpretation issues. We show that optimal effectiveness occurs when using only a small proportion of the indexing terms available, and that effectiveness peaks at a higher feature set size and lower effectiveness level for syntactic phrase indexing than for word-based indexing. We also present results suggesting that traditional term clustering methods are unlikely to provide significantly improved text representations. An improved probabilistic text categorization method is also presented.

667 citations


Proceedings ArticleDOI
23 Feb 1992
TL;DR: The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets and the optimal feature set size for word-based indexing was found to be surprisingly low despite the large training sets.
Abstract: The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features) despite the large training sets. The extraction of new text features by syntactic analysis and feature clustering was investigated on the Reuters data set. Syntactic indexing phrases, clusters of these phrases, and clusters of words were all found to provide less effective representations than individual words.
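The feature-selection step described above can be sketched as scoring each word by a simple class-association statistic and keeping only the top-k terms. The scoring function here (absolute difference of per-class document frequencies) is an assumption for the example, not the paper's exact statistic.

```python
from collections import Counter

def select_features(docs, labels, k=10):
    # docs: lists of tokens; labels: 1 for in-category, 0 otherwise
    pos, neg = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for doc, y in zip(docs, labels):
        (pos if y else neg).update(set(doc))  # document frequency, not raw counts
    def score(w):
        # how differently the word is distributed across the two classes
        return abs(pos[w] / max(n_pos, 1) - neg[w] / max(n_neg, 1))
    vocab = set(pos) | set(neg)
    return sorted(vocab, key=score, reverse=True)[:k]

docs = [["trade", "wheat"], ["trade", "oil"], ["sport", "ball"], ["sport", "game"]]
labels = [1, 1, 0, 0]
print(select_features(docs, labels, k=2))  # the two perfectly predictive words
```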

585 citations


Journal ArticleDOI
Thrasyvoulos N. Pappas
TL;DR: The algorithm that is presented is a generalization of the K-means clustering algorithm to include spatial constraints and to account for local intensity variations in the image to preserve the most significant features of the originals, while removing unimportant details.
Abstract: The problem of segmenting images of objects with smooth surfaces is considered. The algorithm that is presented is a generalization of the K-means clustering algorithm to include spatial constraints and to account for local intensity variations in the image. Spatial constraints are included by the use of a Gibbs random field model. Local intensity variations are accounted for in an iterative procedure involving averaging over a sliding window whose size decreases as the algorithm progresses. Results with an 8-neighbor Gibbs random field model applied to pictures of industrial objects, buildings, aerial photographs, optical characters, and faces show that the algorithm performs better than the K-means algorithm and its nonadaptive extensions that incorporate spatial constraints by the use of Gibbs random fields. A hierarchical implementation is also presented that results in better performance and faster speed of execution. The segmented images are caricatures of the originals which preserve the most significant features, while removing unimportant details. They can be used in image recognition and as crude representations of the image.
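The core idea can be sketched in a toy form: ordinary K-means on pixel intensities, plus a Gibbs-style penalty that discourages a pixel from taking a label that disagrees with its 4-neighbors. The penalty weight `beta` is an assumed parameter; the paper's sliding-window local means and hierarchical implementation are not reproduced here.

```python
def segment(image, k=2, beta=1.0, iters=10):
    # image: 2-D list of intensities; returns a 2-D list of cluster labels
    h, w = len(image), len(image[0])
    lo = min(min(r) for r in image)
    hi = max(max(r) for r in image)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [[0] * w for _ in range(h)]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                def cost(c):
                    data = (image[y][x] - centers[c]) ** 2
                    # Gibbs-style term: count 4-neighbors with a different label
                    diff = sum(
                        labels[y + dy][x + dx] != c
                        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= y + dy < h and 0 <= x + dx < w
                    )
                    return data + beta * diff
                labels[y][x] = min(range(k), key=cost)
        # re-estimate centers as class means, as in plain K-means
        for c in range(k):
            pts = [image[y][x] for y in range(h) for x in range(w)
                   if labels[y][x] == c]
            if pts:
                centers[c] = sum(pts) / len(pts)
    return labels
```

With `beta=0` this degenerates to per-pixel K-means; increasing `beta` smooths the label field.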

575 citations


Journal ArticleDOI
TL;DR: In this paper, the authors summarize the current theoretical and experimental understanding of clustering phenomena on surfaces, with an emphasis on dynamical properties, including surface diffusion coefficients and adatom binding energies.

559 citations


Journal ArticleDOI
TL;DR: The suitability of a back-propagation neural network for classification of multispectral image data is explored and a methodology is developed for selection of both training parameters and data sets for the training phase.
Abstract: The suitability of a back-propagation neural network for classification of multispectral image data is explored. A methodology is developed for selection of both training parameters and data sets for the training phase. A new technique is also developed to accelerate the learning phase. To benchmark the network, the results are compared to those obtained using three other algorithms: a statistical contextual technique, a supervised piecewise linear classifier, and an unsupervised multispectral clustering algorithm. All three techniques were applied to simulated and real satellite imagery. Results from the classification of both Monte Carlo simulation and real imagery are summarized.

414 citations


Journal ArticleDOI
TL;DR: This paper identifies important characteristics of clustering algorithms and proposes a general framework for analyzing and evaluating such algorithms and presents an analytic performance comparison of Dominant Sequence Clustering (DSC), explaining why DSC is superior to other algorithms.

393 citations


Journal ArticleDOI
TL;DR: A novel approach is adopted which employs a hybrid clustering and least squares algorithm which significantly enhances the real-time or adaptive capability of radial basis function models.
Abstract: Recursive identification of non-linear systems is investigated using radial basis function networks. A novel approach is adopted which employs a hybrid clustering and least squares algorithm. The recursive clustering algorithm adjusts the centres of the radial basis function network while the recursive least squares algorithm estimates the connection weights of the network. Because these two recursive learning rules are both linear, rapid convergence is guaranteed and this hybrid algorithm significantly enhances the real-time or adaptive capability of radial basis function models. Applications to simulated and real data are included to demonstrate the effectiveness of this hybrid approach.
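The hybrid split described above can be illustrated offline: K-means places the radial basis centres, then linear least squares fits the output weights. The paper uses recursive (online) versions of both steps; the Gaussian width `sigma` and the bias term are assumptions of this sketch.

```python
import math

def kmeans_1d(xs, k, iters=25):
    # spread initial centres over the data range, then alternate assign/update
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def design_row(x, centers, sigma):
    # Gaussian basis responses plus a constant bias feature
    return [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers] + [1.0]

def solve(A, b):
    # Gaussian elimination with partial pivoting for the normal equations
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_rbf(xs, ys, k=3, sigma=1.0):
    # step 1: clustering chooses the centres; step 2: least squares fits weights
    centers = kmeans_1d(xs, k)
    Phi = [design_row(x, centers, sigma) for x in xs]
    n = k + 1
    AtA = [[sum(row[i] * row[j] for row in Phi) for j in range(n)] for i in range(n)]
    Atb = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(n)]
    w = solve(AtA, Atb)
    return lambda x: sum(wi * fi for wi, fi in zip(w, design_row(x, centers, sigma)))
```

Because both sub-problems are linear in their unknowns, each step has a closed-form update, which is what makes the recursive version of this scheme converge quickly.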

359 citations


Journal Article

322 citations


Journal ArticleDOI
TL;DR: The fuzzy systems performed well until over 50% of their fuzzy-associative-memory (FAM) rules were removed, and they also performed well when the key FAM equilibration rule was replaced with destructive, or 'sabotage', rules.
Abstract: Fuzzy control systems and neural-network control systems for backing up a simulated truck, and truck-and-trailer, to a loading dock in a parking lot are presented. The supervised backpropagation learning algorithm trained the neural network systems. The robustness of the neural systems was tested by removing random subsets of training data in learning sequences. The neural systems performed well but required extensive computation for training. The fuzzy systems performed well until over 50% of their fuzzy-associative-memory (FAM) rules were removed. They also performed well when the key FAM equilibration rule was replaced with destructive, or 'sabotage', rules. Unsupervised differential competitive learning (DCL) and product-space clustering adaptively generated FAM rules from training data. The original fuzzy control systems and neural control systems generated trajectory data. The DCL system rapidly recovered the underlying FAM rules. Product-space clustering converted the neural truck systems into structured sets of FAM rules that approximated the neural system's behavior.

Proceedings ArticleDOI
01 Jul 1992
TL;DR: Efficient new randomized and deterministic methods for transforming optimal solutions for a type of relaxed integer linear program into provably good solutions for the corresponding NP-hard discrete optimization problem are presented.
Abstract: We present efficient new randomized and deterministic methods for transforming optimal solutions for a type of relaxed integer linear program into provably good solutions for the corresponding NP-hard discrete optimization problem. Without any constraint violation, the ε-approximation problem for many problems of this type is itself NP-hard. Our methods provide polynomial-time ε-approximations while attempting to minimize the packing constraint violation. Our methods lead to the first known approximation algorithms with provable performance guarantees for the s-median problem, the tree pruning problem, and the generalized assignment problem. These important problems have numerous applications to data compression, vector quantization, memory-based learning, computer graphics, image processing, clustering, regression, network location, scheduling, and communication. We provide evidence via reductions that our approximation algorithms are nearly optimal in terms of the packing constraint violation. We also discuss some recent applications of our techniques to scheduling problems.

Journal ArticleDOI
01 Mar 1992
TL;DR: A hierarchical, agglomerative, symbolic clustering methodology based on a similarity measure that takes into consideration the position, span, and content of symbolic objects is proposed and is capable of discerning clusters in data sets made up of numeric as well as symbolic objects consisting of different types and combinations of qualitative and quantitative feature values.
Abstract: A hierarchical, agglomerative, symbolic clustering methodology based on a similarity measure that takes into consideration the position, span, and content of symbolic objects is proposed. The similarity measure used is of a new type in the sense that it is not just another aspect of dissimilarity. The clustering methodology forms composite symbolic objects using a Cartesian join operator when two symbolic objects are merged. The maximum and minimum similarity values at various merging levels permit the determination of the number of clusters in the data set. The composite symbolic objects representing different clusters give a description of the resulting classes and lead to knowledge acquisition. The algorithm is capable of discerning clusters in data sets made up of numeric as well as symbolic objects consisting of different types and combinations of qualitative and quantitative feature values. In particular, the algorithm is applied to fat-oil and microcomputer data. >


Proceedings ArticleDOI
08 Mar 1992
TL;DR: A fuzzy Kohonen clustering network which integrates the fuzzy c-means (FCM) model into the learning rate and updating strategies of the Kohonen network is proposed, and it is proved that the proposed scheme is equivalent to the c-Means algorithms.
Abstract: The authors propose a fuzzy Kohonen clustering network which integrates the fuzzy c-means (FCM) model into the learning rate and updating strategies of the Kohonen network. This yields an optimization problem related to FCM, and the numerical results show improved convergence as well as reduced labeling errors. It is proved that the proposed scheme is equivalent to the c-means algorithms. The new method can be viewed as a Kohonen type of FCM, but it is self-organizing, since the size of the update neighborhood and the learning rate in the competitive layer are automatically adjusted during learning. Anderson's IRIS data were used to illustrate this method. The results are compared with the standard Kohonen approach.
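The FCM model that the network integrates can be sketched in its plain batch form: alternate between updating the fuzzy memberships and the cluster centres. This is ordinary fuzzy c-means on scalar data, not the Kohonen-style network itself; the fuzzifier `m=2` is the usual default.

```python
def fcm(xs, c=2, m=2.0, iters=40):
    # batch fuzzy c-means on a list of scalars; returns sorted centres
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * (i + 0.5) / c for i in range(c)]
    for _ in range(iters):
        # membership of each point in each cluster (each row sums to 1)
        U = []
        for x in xs:
            d = [abs(x - v) or 1e-12 for v in centers]  # avoid divide-by-zero
            row = [1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1)) for j in range(c))
                   for i in range(c)]
            U.append(row)
        # centres become membership-weighted means
        centers = [
            sum(U[k][i] ** m * xs[k] for k in range(len(xs))) /
            sum(U[k][i] ** m for k in range(len(xs)))
            for i in range(c)
        ]
    return sorted(centers)
```

The network in the paper folds the membership values into the Kohonen learning rate, so that hard winner-take-all updates become soft, membership-weighted ones.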

Journal ArticleDOI
TL;DR: Several generalizations of the fuzzy c-shells (FCS) algorithm are presented for characterizing and detecting clusters that are hyperellipsoidal shells and show that the AFCS algorithm requires less memory than the HT-based methods, and it is at least an order of magnitude faster than theHT approach.
Abstract: Several generalizations of the fuzzy c-shells (FCS) algorithm are presented for characterizing and detecting clusters that are hyperellipsoidal shells. An earlier generalization, the adaptive fuzzy c-shells (AFCS) algorithm, is examined in detail and is found to have global convergence problems when the shapes to be detected are partial. New formulations are considered wherein the norm inducing matrix in the distance metric is unconstrained in contrast to the AFCS algorithm. The resulting algorithm, called the AFCS-U algorithm, performs better for partial shapes. Another formulation based on the second-order quadrics equation is considered. These algorithms can detect ellipses and circles in 2D data. They are compared with the Hough transform (HT)-based methods for ellipse detection. Existing HT-based methods for ellipse detection are evaluated, and a multistage method incorporating the good features of all the methods is used for comparison. Numerical examples of real image data show that the AFCS algorithm requires less memory than the HT-based methods, and it is at least an order of magnitude faster than the HT approach.

Proceedings ArticleDOI
07 Jun 1992
TL;DR: The clustering technique described provides a basis for automatic feature selection and dimensionality reduction and Adaptation of kernel shape provides a tradeoff of increased accuracy for increased complexity and training time.
Abstract: Probabilistic neural networks (PNNs) learn quickly from examples in one pass and asymptotically achieve the Bayes-optimal decision boundaries. The major disadvantage of a PNN stems from the fact that it requires one node or neuron for each training pattern. Various clustering techniques have been proposed to reduce this requirement to one node per cluster center. The correct choice of clustering technique will depend on the data distribution, data rate, and hardware implementation. Adaptation of kernel shape provides a tradeoff of increased accuracy for increased complexity and training time. The technique described also provides a basis for automatic feature selection and dimensionality reduction.
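A minimal PNN decision rule looks like this: one Gaussian kernel per training pattern, summed per class, with the largest class sum winning. The kernel width `sigma` is an assumed smoothing parameter; the clustering reduction discussed above would replace the per-pattern nodes with cluster centres while leaving this rule unchanged.

```python
import math

def pnn_classify(x, patterns, sigma=0.5):
    # patterns: list of (feature_vector, class_label) pairs;
    # one kernel node per training pattern, summed per class
    scores = {}
    for v, label in patterns:
        d2 = sum((a - b) ** 2 for a, b in zip(x, v))
        scores[label] = scores.get(label, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
    return max(scores, key=scores.get)
```

Training is a single pass (just storing the patterns), which is the one-pass learning property the abstract refers to.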

Journal ArticleDOI
TL;DR: A new formulation based on the treatment of the time window constraints as soft constraints that can be violated at a cost and heuristically decompose the problem into an assignment/clustering component and a series of routing and scheduling components is presented.
Abstract: The Vehicle Routing and Scheduling Problem with Time Window constraints is formulated as a mixed integer program, and optimization-based heuristics which extend the cluster-first, route-second algorithm of Fisher and Jaikumar are developed for its solution. We present a new formulation based on the treatment of the time window constraints as soft constraints that can be violated at a cost and we heuristically decompose the problem into an assignment/clustering component and a series of routing and scheduling components. Numerical results based on randomly generated and benchmark problem sets indicate that the algorithm compares favorably to state-of-the-art local insertion and improvement heuristics.

Book ChapterDOI
TL;DR: In experiments with both artificial and real data it is demonstrated that the multilayer SOM forms clusters that match the desired classes better than do direct SOMs, classical k-means, or Isodata algorithms.
Abstract: A multilayer hierarchical self-organizing map (HSOM) is discussed as an unsupervised clustering method. The HSOM is shown to form arbitrarily complex clusters, in analogy with multilayer feedforward networks. In addition, the HSOM provides a natural measure for the distance of a point from a cluster that weighs all the points belonging to the cluster appropriately. In experiments with both artificial and real data it is demonstrated that the multilayer SOM forms clusters that match the desired classes better than do direct SOMs, classical k-means, or Isodata algorithms.


Journal ArticleDOI
TL;DR: A new approach to the fuzzy c spherical shells algorithm is presented, which uses a cluster validity measure to identify good clusters, merges all compatible clusters, and eliminates spurious clusters to achieve the final results.
Abstract: The fuzzy c spherical shells (FCSS) algorithm is specially designed to search for clusters that can be described by circular arcs or, generally, by shells of hyperspheres. A new approach to the FCSS algorithm is presented. This algorithm is computationally and implementationally simpler than other clustering algorithms that have been suggested for this purpose. An unsupervised variant handles the case in which the number of clusters is not known in advance: it uses a cluster validity measure to identify good clusters, merges all compatible clusters, and eliminates spurious clusters to achieve the final results. Experimental results on several data sets are presented.

Journal ArticleDOI
TL;DR: In this paper, a new QCD-motivated clustering algorithm was proposed to define jets in lepton-hadron and hadronhadron collisions, which combines the k ⊥ algorithm, proposed earlier for e + e − annihilation, with a pre-clustering procedure that ensures the universal factorization of initial state collinear singularities.

Journal ArticleDOI
TL;DR: In this paper, a neural network clustering method for the part-machine grouping problem in group technology is presented, which utilizes binary-valued inputs and it can be trained without supervision.
Abstract: This paper presents a neural network clustering method for the part-machine grouping problem in group technology. Among several neural networks, a Carpenter-Grossberg network is selected because it uses binary-valued inputs and can be trained without supervision. It is shown that this adaptive leader algorithm can handle large, industry-size data sets owing to its computational efficiency. The algorithm was tested on three data sets from prior literature, and the solutions obtained were found to result in block diagonal forms. Some solutions were also found to be identical to solutions presented by others. Experiments on larger data sets, involving 10000 parts by 100 machine types, revealed that the method results in the identification of clusters with fast execution times. If a block diagonal structure existed in the input data, it was identified to a good degree of perfection. It was also found to be efficient with some imperfections i...

Journal ArticleDOI
TL;DR: This paper reviews methods of cluster analysis in the context of classifying patients on the basis of clinical and/or laboratory type observations, with particular attention devoted to the mixture likelihood-based approach.
Abstract: In this paper we review methods of cluster analysis in the context of classifying patients on the basis of clinical and/or laboratory type observations. Both hierarchical and non-hierarchical methods of clustering are considered, although the emphasis is on the latter type, with particular attention devoted to the mixture likelihood-based approach. For the purposes of dividing a given data set into g clusters, this approach fits a mixture model of g components, using the method of maximum likelihood. It thus provides a sound statistical basis for clustering. The important but difficult question of how many clusters there are in the data can be addressed within the framework of standard statistical theory, although theoretical and computational difficulties still remain. Two case studies, involving the cluster analysis of some haemophilia and diabetes data respectively, are reported to demonstrate the mixture likelihood-based approach to clustering.
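The mixture likelihood-based approach can be sketched for g = 2: fit a two-component univariate Gaussian mixture by EM (maximum likelihood), then assign each observation to the component with the higher posterior probability. Initialisation from the data extremes and the fixed iteration count are assumptions of this example.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, iters=60):
    # two-component 1-D Gaussian mixture fitted by EM
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        R = []
        for x in xs:
            p = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(p)
            R.append([pj / s for pj in p])
        # M-step: maximum likelihood updates of weights, means, variances
        for j in range(2):
            nj = sum(r[j] for r in R)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(R, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(R, xs)) / nj,
                         1e-6)
    # cluster assignment: component with the larger responsibility
    labels = [max(range(2), key=lambda j: r[j]) for r in R]
    return mu, labels
```

The "how many clusters" question mentioned in the abstract corresponds to choosing g, typically by comparing the maximized likelihoods of fits with different numbers of components.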

Book ChapterDOI
19 May 1992
TL;DR: This contribution addresses the problem of detection and tracking of moving vehicles in image sequences from traffic scenes recorded by a stationary camera by using a parameterized vehicle model and a recursive estimator based on a motion model for motion estimation.
Abstract: This contribution addresses the problem of detection and tracking of moving vehicles in image sequences from traffic scenes recorded by a stationary camera. In order to exploit the a priori knowledge about the shape and the physical motion of vehicles in traffic scenes, a parameterized vehicle model is used for an intraframe matching process and a recursive estimator based on a motion model is used for motion estimation. The initial guess about the position and orientation for the models are computed with the help of a clustering approach of moving image features. Shadow edges of the models are taken into account in the matching process. This enables tracking of vehicles under complex illumination conditions and within a small effective field of view. Results on real world traffic scenes are presented and open problems are outlined.

Proceedings ArticleDOI
08 Nov 1992
TL;DR: The DS quality measure, a general metric for evaluation of clustering algorithms, is established and motivates the RW-ST algorithm, a self-tuning clustering method based on random walks in the circuit netlist, which efficiently captures a globally good circuit clustering.
Abstract: The complexity of next-generation VLSI systems will exceed the capabilities of top-down layout synthesis algorithms, particularly in netlist partitioning and module placement. Bottom-up clustering is needed to “condense” the netlist so that the problem size becomes tractable to existing optimization methods. In this paper, we establish the DS quality measure, the first general metric for evaluation of clustering algorithms. The DS metric in turn motivates our RW-ST algorithm, a new self-tuning clustering method based on random walks in the circuit netlist. RW-ST efficiently captures a globally good circuit clustering. When incorporated within a two-phase iterative Fiduccia-Mattheyses partitioning strategy, the RW-ST clustering method improves bisection width by an average of 17% over previous matching-based methods.

Journal ArticleDOI
01 Oct 1992 - Proteins
TL;DR: The results of these clusterings indicate conservation of α- and β-structures even when sequence similarity is relatively low, and suggest that reliable structural and statistical analyses of three-dimensional protein structures should be based on unbiased data.
Abstract: Reliable structural and statistical analyses of three dimensional protein structures should be based on unbiased data. The Protein Data Bank is highly redundant, containing several entries for identical or very similar sequences. A technique was developed for clustering the known structures based on their sequences and contents of alpha- and beta-structures. First, sequences were aligned pairwise. A representative sample of sequences was then obtained by grouping similar sequences together, and selecting a typical representative from each group. The similarity significance threshold needed in the clustering method was found by analyzing similarities of random sequences. Because three dimensional structures for proteins of same structural class are generally more conserved than their sequences, the proteins were clustered also according to their contents of secondary structural elements. The results of these clusterings indicate conservation of alpha- and beta-structures even when sequence similarity is relatively low. An unbiased sample of 103 high resolution structures, representing a wide variety of proteins, was chosen based on the suggestions made by the clustering algorithm. The proteins were divided into structural classes according to their contents and ratios of secondary structural elements. Previous classifications have suffered from a subjective view of secondary structures, whereas here the classification was based on backbone geometry. The concise view led to reclassification of some structures. The representative set of structures facilitates unbiased analyses of relationships between protein sequence, function, and structure as well as of structural characteristics.
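The representative-selection step can be illustrated in toy form: greedily group sequences whose pairwise similarity exceeds a threshold and keep one representative per group. Here `difflib`'s ratio stands in for the paper's pairwise sequence alignment, and the threshold value is an assumed parameter, not the significance threshold derived from random sequences.

```python
import difflib

def representative_set(seqs, threshold=0.6):
    # greedy grouping: each sequence joins the first group whose
    # representative it matches above the threshold, else starts a group
    reps, groups = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if difflib.SequenceMatcher(None, s, r).ratio() >= threshold:
                groups[i].append(s)   # redundant with an existing representative
                break
        else:
            reps.append(s)            # s becomes a new representative
            groups.append([s])
    return reps, groups
```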

Proceedings Article
12 Jul 1992
TL;DR: It is shown that major learning processes, namely generalization and clustering, can be solved in a homogeneous way by using a similarity measure.
Abstract: There are still very few systems performing a Similarity Based Learning and using a First Order Logic (FOL) representation. This limitation comes from the intrinsic complexity of the learning processes in FOL and from the difficulty to deal with numerical knowledge in this representation. In this paper, we show that major learning processes, namely generalization and clustering, can be solved in a homogeneous way by using a similarity measure. As this measure is defined, the similarity computation comes down to a problem of solving a set of equations in several unknowns. The representation language used to express our examples is a subset of FOL that can express both quantitative knowledge and a relevance scale on the predicates.

Journal ArticleDOI
TL;DR: The adaptive fuzzy leader clustering (AFLC) architecture is a hybrid neural-fuzzy system that learns online in a stable and efficient manner and successfully classifies features extracted from real data, discrete or continuous, indicating the potential strength of this new clustering algorithm in analyzing complex data sets.
Abstract: A modular, unsupervised neural network architecture that can be used for clustering and classification of complex data sets is presented. The adaptive fuzzy leader clustering (AFLC) architecture is a hybrid neural-fuzzy system that learns online in a stable and efficient manner. The system uses a control structure similar to that found in the adaptive resonance theory (ART-1) network to identify the cluster centers initially. The initial classification of an input takes place in a two-stage process: a simple competitive stage and a distance metric comparison stage. The cluster prototypes are then incrementally updated by relocating the centroid position from fuzzy C-means (FCM) system equations for the centroids and the membership values. The operational characteristics of AFLC and the critical parameters involved in its operation are discussed. The AFLC algorithm is applied to the Anderson iris data and laser-luminescent finger image data. The AFLC algorithm successfully classifies features extracted from real data, discrete or continuous, indicating the potential strength of this new clustering algorithm in analyzing complex data sets.

Proceedings ArticleDOI
01 Jun 1992
TL;DR: This work investigates the performance of some of the best-known object clustering algorithms on four different workloads based upon the Tektronix benchmark and demonstrates that even when the workload and object graph are fixed, the choice of the clustering algorithm depends upon the goals of the system.
Abstract: We investigate the performance of some of the best-known object clustering algorithms on four different workloads based upon the Tektronix benchmark. For all four workloads, stochastic clustering gave the best performance for a variety of performance metrics. Since stochastic clustering is computationally expensive, it is interesting that for every workload there was at least one cheaper clustering algorithm that matched or almost matched stochastic clustering. Unfortunately, for each workload, the algorithm that approximated stochastic clustering was different. Our experiments also demonstrated that even when the workload and object graph are fixed, the choice of the clustering algorithm depends upon the goals of the system. For example, if the goal is to perform well on traversals of small portions of the database starting with a cold cache, the important metric is the per-traversal expansion factor, and a well-chosen placement tree will be nearly optimal; if the goal is to achieve a high steady-state performance with a reasonably large cache, the appropriate metric is the number of pages to which the clustering algorithm maps the active portion of the database. For this metric, the PRP clustering algorithm, which only uses access probabilities, achieves nearly optimal performance.