
Showing papers on "Cluster analysis" published in 1996


Proceedings Article
02 Aug 1996
TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

17,056 citations
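The density-based notion above maps directly onto modern tooling. A minimal sketch using scikit-learn's DBSCAN (a later reimplementation; eps and min_samples are that library's names for the neighborhood radius and density threshold, not the paper's notation):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape that partitional methods handle poorly.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: density threshold for a core point.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks noise; clusters of arbitrary shape get nonnegative ids.
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))
```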


Proceedings Article
01 Jan 1996
TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

14,297 citations


Proceedings ArticleDOI
01 Jun 1996
TL;DR: Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) as discussed by the authors is a data clustering method that is especially suitable for very large databases.
Abstract: Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

4,090 citations
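For readers who want to try the single-scan behavior, here is a short sketch using scikit-learn's Birch, a modern reimplementation (the threshold and branching_factor values below are illustrative choices, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three Gaussian blobs standing in for a large metric dataset.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(1000, 2)) for c in (0, 5, 10)])

# The CF-tree is built incrementally; data could equally arrive in chunks
# via partial_fit, mimicking a single scan over an out-of-core database.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```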


Journal ArticleDOI
TL;DR: Experimental results on a database of 400 trademark images show that an integrated color- and shape-based feature representation results in 99% of the images being retrieved within the top two positions.

1,017 citations


Journal ArticleDOI
01 Oct 1996
TL;DR: The self-organizing map method, which converts complex, nonlinear statistical relationships between high-dimensional data into simple geometric relationships on a low-dimensional display, can be utilized for many tasks: reduction of the amount of training data, speeding up learning, nonlinear interpolation and extrapolation, generalization, and effective compression of information for its transmission.
Abstract: The self-organizing map (SOM) method is a new, powerful software tool for the visualization of high-dimensional data. It converts complex, nonlinear statistical relationships between high-dimensional data into simple geometric relationships on a low-dimensional display. As it thereby compresses information while preserving the most important topological and metric relationships of the primary data elements on the display, it may also be thought to produce some kind of abstractions. The term self-organizing map signifies a class of mappings defined by error-theoretic considerations. In practice they result in certain unsupervised, competitive learning processes, computed by simple-looking SOM algorithms. Many industries have found the SOM-based software tools useful. The most important property of the SOM, orderliness of the input-output mapping, can be utilized for many tasks: reduction of the amount of training data, speeding up learning, nonlinear interpolation and extrapolation, generalization, and effective compression of information for its transmission.

845 citations
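The core SOM update fits in a few lines of NumPy. The sketch below is a bare-bones implementation with an illustrative grid size and decay schedules (all assumptions, not values from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))            # high-dimensional inputs (3-D here)
grid_w, grid_h = 10, 10                      # low-dimensional display lattice
weights = rng.normal(size=(grid_w, grid_h, 3))
# Precompute lattice coordinates for neighborhood distances.
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1)

n_steps = 5000
for t in range(n_steps):
    x = data[rng.integers(len(data))]
    # Best-matching unit: the node whose weight vector is closest to x.
    d = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    # Learning rate and neighborhood radius shrink over time.
    lr = 0.5 * (1 - t / n_steps)
    sigma = 3.0 * (1 - t / n_steps) + 0.5
    # A Gaussian neighborhood on the lattice pulls nearby nodes toward x,
    # which is what produces the topology-preserving, ordered mapping.
    lat_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
    h = np.exp(-lat_d2 / (2 * sigma ** 2))[..., None]
    weights += lr * h * (x - weights)
```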


Book
31 Aug 1996
TL;DR: This book covers classes and clusters, the geometry of data, a review of clustering algorithms, single-cluster clustering, partitioning of square and rectangular data tables, and hierarchy as a clustering structure.
Abstract: 1. Classes and Clusters. 2. Geometry of Data. 3. Clustering Algorithms: A Review. 4. Single Cluster Clustering. 5. Partition: Square Data Table. 6. Partition: Rectangular Table. 7. Hierarchy as a Clustering Structure.

739 citations


21 May 1996
TL;DR: This work presents an exact Expectation-Maximization algorithm for fitting the parameters of this mixture of factor analyzers, which concurrently performs clustering and dimensionality reduction and can be thought of as a reduced dimension mixture of Gaussians.
Abstract: Factor analysis, a statistical method for modeling the covariance structure of high dimensional data using a small number of latent variables, can be extended by allowing different local factor models in different regions of the input space. This results in a model which concurrently performs clustering and dimensionality reduction, and can be thought of as a reduced dimension mixture of Gaussians. We present an exact Expectation-Maximization algorithm for fitting the parameters of this mixture of factor analyzers.

705 citations
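The paper's exact EM updates are not reproduced here; as a rough illustration of the "clustering plus local dimensionality reduction" idea only, the following crude stand-in hard-assigns points with k-means and then fits a separate factor model to each cluster (not the authors' algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Two groups, each generated near its own 2-D subspace of a 10-D space.
X = np.vstack([rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10)) + m
               for m in (0.0, 5.0)])

# Hard clustering first (the paper instead does this jointly and softly via EM).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for k in range(2):
    # One local factor model per region of the input space.
    fa = FactorAnalysis(n_components=2).fit(X[labels == k])
    print(f"cluster {k}: {np.sum(labels == k)} points, "
          f"mean noise variance {fa.noise_variance_.mean():.3f}")
```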


01 Jan 1996
TL;DR: A data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is presented, and it is demonstrated that it is especially suitable for very large databases.
Abstract: Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

685 citations


Journal ArticleDOI
TL;DR: The results suggest that 2D descriptors and hierarchical clustering methods are best at separating biologically active molecules from inactives, a prerequisite for a good compound selection method.
Abstract: An evaluation of a variety of structure-based clustering methods for use in compound selection is presented. The use of MACCS, Unity and Daylight 2D descriptors; Unity 3D rigid and flexible descriptors; and two in-house 3D descriptors based on potential pharmacophore points is considered. The use of Ward's and group-average hierarchical agglomerative, Guenoche hierarchical divisive, and Jarvis-Patrick nonhierarchical clustering methods is compared. The results suggest that 2D descriptors and hierarchical clustering methods are best at separating biologically active molecules from inactives, a prerequisite for a good compound selection method. In particular, the combination of MACCS descriptors and Ward's clustering was optimal.

631 citations
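A sketch of the winning combination (binary 2D fingerprints plus Ward's hierarchical clustering), with random bit vectors standing in for MACCS keys, which would normally come from a cheminformatics toolkit such as RDKit:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# 166-bit MACCS-like fingerprints, one row per compound (random stand-ins).
fingerprints = rng.integers(0, 2, size=(200, 166)).astype(float)

# Ward's method is defined on Euclidean distances, so the bit vectors are
# used directly as coordinates here.
Z = linkage(fingerprints, method="ward")
labels = fcluster(Z, t=20, criterion="maxclust")   # cut the tree into 20 clusters
print("largest cluster size:", np.bincount(labels).max())
```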


Proceedings Article
01 Aug 1996
TL;DR: This paper proposes a formal notion of context-specific independence (CSI), based on regularities in the conditional probability tables (CPTs) at a node, and proposes a technique, analogous to (and based on) d-separation, for determining when such independence holds in a given network.
Abstract: Bayesian networks provide a language for qualitatively representing the conditional independence properties of a distribution. This allows a natural and compact representation of the distribution, eases knowledge acquisition, and supports effective inference algorithms. It is well-known, however, that there are certain independencies that we cannot capture qualitatively within the Bayesian network structure: independencies that hold only in certain contexts, i.e., given a specific assignment of values to certain variables. In this paper, we propose a formal notion of context-specific independence (CSI), based on regularities in the conditional probability tables (CPTs) at a node. We present a technique, analogous to (and based on) d-separation, for determining when such independence holds in a given network. We then focus on a particular qualitative representation scheme -- tree-structured CPTs -- for capturing CSI. We suggest ways in which this representation can be used to support effective inference algorithms. In particular, we present a structural decomposition of the resulting network which can improve the performance of clustering algorithms, and an alternative algorithm based on cutset conditioning.

614 citations
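A toy example of the tree-structured CPT idea: storing P(Y | A, B, C) as a decision tree makes contexts explicit, so fewer parameters are needed than in the full table (the variable names and probabilities below are invented for illustration):

```python
# A CPT for P(Y | A, B, C) stored as a decision tree: when A=1 the
# distribution no longer depends on B or C (context-specific independence).
cpt_tree = {
    "split": "A",
    1: {"dist": {0: 0.9, 1: 0.1}},                  # context A=1: B, C irrelevant
    0: {"split": "B",
        1: {"dist": {0: 0.3, 1: 0.7}},              # context A=0, B=1: C irrelevant
        0: {"split": "C",
            1: {"dist": {0: 0.5, 1: 0.5}},
            0: {"dist": {0: 0.8, 1: 0.2}}}},
}

def lookup(tree, assignment):
    """Walk the tree-CPT to the distribution for a full parent assignment."""
    while "dist" not in tree:
        tree = tree[assignment[tree["split"]]]
    return tree["dist"]

# 4 leaves instead of the 8 rows a full table over A, B, C would need.
print(lookup(cpt_tree, {"A": 1, "B": 0, "C": 1}))   # -> {0: 0.9, 1: 0.1}
```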


Book
29 Jan 1996
TL;DR: This book seeks to cover the areas of clustering and related methods of data analysis where major advances are being made, including hierarchical clustering, variable selection and weighting, additive trees and other network models, and the relevance of neural network models to clustering.
Abstract: At a moderately advanced level, this book seeks to cover the areas of clustering and related methods of data analysis where major advances are being made. Topics include: hierarchical clustering, variable selection and weighting, additive trees and other network models, relevance of neural network models to clustering, the role of computational complexity in cluster analysis, latent class approaches to cluster analysis, theory and method with applications of a hierarchical classes model in psychology and psychopathology, combinatorial data analysis, clusterwise aggregation of relations, review of the Japanese-language results on clustering, review of the Russian-language results on clustering and multidimensional scaling, practical advances, and significance tests.

Journal ArticleDOI
TL;DR: This work presents a new approach for clustering, based on the physical properties of an inhomogeneous ferromagnetic model, which outperforms other algorithms for toy problems as well as for real data.
Abstract: We present a new approach for clustering, based on the physical properties of an inhomogeneous ferromagnetic model. We do not assume any structure of the underlying distribution of the data. A Potts spin is assigned to each data point and short range interactions between neighboring points are introduced. Spin-spin correlations, measured (by a Monte Carlo procedure) in a superparamagnetic regime in which aligned domains appear, serve to partition the data points into clusters. Our method outperforms other algorithms for toy problems as well as for real data.


Journal ArticleDOI
TL;DR: A computer program is developed that automatically, systematically and rapidly clusters an ensemble of structures into a set of conformationally related subfamilies, and selects a representative structure from each cluster.
Abstract: Unlike structures determined by X-ray crystallography, which are deposited in the Brookhaven Protein Data Bank (Abola et al., 1987) as a single structure, each NMR-derived structure is often deposited as an ensemble containing many structures, each consistent with the restraint set used. However, there is often a need to select a single 'representative' structure, or a 'representative' subset of structures, from such an ensemble. This is useful, for example, in the case of homology modelling or when compiling a relational database of protein structures. It has been shown that cluster analysis, based on overall fold, followed by selection of the structure closest to the centroid of the largest cluster, is likely to identify a structure more representative of the ensemble than the commonly used minimized average structure (Sutcliffe, 1993). Two approaches to the problem of clustering ensembles of NMR-derived structures have been described. One of these (Adzhubei et al., 1995) performs the pairwise superposition of all structures using Cα atoms to generate a set of r.m.s. distances. After cluster analysis based on these distances, a user-defined cut-off is required to determine the final membership of clusters and therefore the representative structures. The other approach (Diamond, 1995) uses collective superpositions and rigid-body transformations. Again, the position at which to draw a cut-off based on the particular clustering pattern was not addressed. Whenever fixed values are used for the cut-off in clustering, there is a danger of missing 'true' clusters under the threshold imposed by the rigid cut-off value. Considering the highly diverse nature of NMR-derived ensembles of proteins, it would seem most appropriate to avoid the use of predefined values for determining clusters. In fact, of the 302 ensembles we have studied, the average pairwise r.m.s. distance across an ensemble varied from 0.29 to 11.3 Å (mean value 3.0, SD 1.9 Å). Here we present an automated method for cut-off determination that avoids the dangers of using fixed values for this purpose. We have developed a computer program that automatically, systematically and rapidly (i) clusters an ensemble of structures into a set of conformationally related subfamilies, and (ii) selects a representative structure from each cluster. The program uses the method of average linkage to define how clusters are built up, followed by the application of a penalty function that seeks to minimize simultaneously the number of clusters
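The overall recipe (average-linkage clustering of pairwise distances, with the cut chosen automatically rather than by a fixed threshold) can be sketched as follows; the silhouette score below is a stand-in penalty, since the paper's own function is not specified in this abstract:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Fake "structures": points whose Euclidean distances mimic pairwise RMSDs.
structures = np.vstack([rng.normal(loc=m, size=(10, 5)) for m in (0, 3, 6)])
D = squareform(np.linalg.norm(structures[:, None] - structures[None], axis=-1))

Z = linkage(D, method="average")          # average linkage, as in the paper
# Score each candidate cut instead of imposing one fixed cut-off value.
best = max(range(2, 10),
           key=lambda k: silhouette_score(structures,
                                          fcluster(Z, k, criterion="maxclust")))
print("chosen number of clusters:", best)
```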

Proceedings Article
03 Dec 1996
TL;DR: Experimental results indicate that the proposed techniques are useful for revealing hidden cluster structure in data sets of sequences.
Abstract: This paper discusses a probabilistic model-based approach to clustering sequences, using hidden Markov models (HMMs). The problem can be framed as a generalization of the standard mixture model approach to clustering in feature space. Two primary issues are addressed. First, a novel parameter initialization procedure is proposed, and second, the more difficult problem of determining the number of clusters K, from the data, is investigated. Experimental results indicate that the proposed techniques are useful for revealing hidden cluster structure in data sets of sequences.
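A hedged sketch of the mixture-of-HMMs idea using the hmmlearn package (an assumption; not the authors' code): fit one HMM per cluster, reassign each sequence to the model that scores it highest, and repeat:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# 40 toy sequences drawn from two regimes with different means.
seqs = [rng.normal(loc=(0 if i < 20 else 3), size=(50, 1)) for i in range(40)]

K = 2
assign = rng.integers(0, K, size=len(seqs))          # random initialization
for _ in range(5):                                   # a few hard-EM rounds
    models = []
    for k in range(K):
        Xk = np.vstack([s for s, a in zip(seqs, assign) if a == k])
        lengths = [len(s) for s, a in zip(seqs, assign) if a == k]
        models.append(GaussianHMM(n_components=2, n_iter=20).fit(Xk, lengths))
    # Reassign each sequence to the HMM giving the highest log-likelihood.
    assign = np.array([np.argmax([m.score(s) for m in models]) for s in seqs])
print("cluster sizes:", np.bincount(assign))
```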

Journal ArticleDOI
TL;DR: The validity-guided VGC algorithm uses cluster-validity information to guide a fuzzy (re)clustering process toward better solutions, and VGC's performance approaches that of the (supervised) k-nearest-neighbors algorithm.
Abstract: When clustering algorithms are applied to image segmentation, the goal is to solve a classification problem. However, these algorithms do not directly optimize classification quality. As a result, they are susceptible to two problems: 1) the criterion they optimize may not be a good estimator of "true" classification quality, and 2) they often admit many (suboptimal) solutions. This paper introduces an algorithm that uses cluster validity to mitigate problems 1 and 2. The validity-guided (re)clustering (VGC) algorithm uses cluster-validity information to guide a fuzzy (re)clustering process toward better solutions. It starts with a partition generated by a soft or fuzzy clustering algorithm. Then it iteratively alters the partition by applying (novel) split-and-merge operations to the clusters. Partition modifications that result in improved partition validity are retained. VGC is tested on both synthetic and real-world data. For magnetic resonance image (MRI) segmentation, evaluations by radiologists show that VGC outperforms the (unsupervised) fuzzy c-means algorithm, and VGC's performance approaches that of the (supervised) k-nearest-neighbors algorithm.
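VGC itself is not reproduced here, but the following sketch shows the ingredients it builds on: a bare fuzzy c-means loop plus one classical validity index (Bezdek's partition coefficient) used to score competing partitions before any split-and-merge refinement:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))        # fuzzy memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1) + 1e-12
        # Standard FCM membership update: u_ik = 1 / sum_j (d_ik/d_ij)^(2/(m-1)).
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=mu, size=(100, 2)) for mu in (0, 4, 8)])
for c in (2, 3, 4):
    _, U = fuzzy_cmeans(X, c)
    # Partition coefficient: closer to 1 means a crisper, more valid partition.
    print(c, "partition coefficient:", np.mean(np.sum(U ** 2, axis=1)))
```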

Proceedings ArticleDOI
01 Jun 1996
TL;DR: PBSM (Partition Based Spatial-Merge), a new algorithm for performing spatial join operation that is especially effective when neither of the inputs to the join have an index on the joining attribute, is described.
Abstract: This paper describes PBSM (Partition Based Spatial-Merge), a new algorithm for performing the spatial join operation. This algorithm is especially effective when neither of the inputs to the join has an index on the joining attribute. Such a situation could arise if both inputs to the join are intermediate results in a complex query, or in a parallel environment where the inputs must be dynamically redistributed. The PBSM algorithm partitions the inputs into manageable chunks, and joins them using a computational geometry based plane-sweeping technique. This paper also presents a performance study comparing the traditional indexed nested loops join algorithm, a spatial join algorithm based on joining spatial indices, and the PBSM algorithm. These comparisons are based on complete implementations of these algorithms in Paradise, a database system for handling GIS applications. Using real data sets, the performance study examines the behavior of these spatial join algorithms in a variety of situations, including the cases when both, one, or none of the inputs to the join have a suitable index. The study also examines the effect of clustering the join inputs on the performance of these join algorithms. The performance comparisons demonstrate the feasibility and applicability of the PBSM join algorithm.
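The partitioning idea can be sketched briefly: hash each rectangle into the grid cells it overlaps, then test candidate pairs within each cell (a simple nested loop stands in for the paper's plane sweep, and duplicate pairs from multi-cell rectangles are filtered):

```python
from collections import defaultdict
from itertools import product

def overlaps(a, b):
    ax1, ay1, ax2, ay2 = a; bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def cells(rect, size=10):
    """Grid cells a rectangle (x1, y1, x2, y2) intersects."""
    x1, y1, x2, y2 = rect
    return product(range(int(x1 // size), int(x2 // size) + 1),
                   range(int(y1 // size), int(y2 // size) + 1))

def spatial_join(R, S, size=10):
    grid = defaultdict(list)
    for i, r in enumerate(R):
        for c in cells(r, size):
            grid[c].append(i)
    out = set()
    for j, s in enumerate(S):
        for c in cells(s, size):
            for i in grid[c]:
                if (i, j) not in out and overlaps(R[i], s):
                    out.add((i, j))
    return out

R = [(0, 0, 5, 5), (20, 20, 25, 25)]
S = [(4, 4, 9, 9), (40, 40, 41, 41)]
print(spatial_join(R, S))   # -> {(0, 0)}
```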

Patent
22 May 1996
TL;DR: In this article, a system for analyzing a data file containing a plurality of data records with each data record containing plurality of parameters is provided, which includes an input (40) for receiving the data file and a data processor (32) having at least one of several data processing functions.
Abstract: A system (10) for analyzing a data file containing a plurality of data records with each data record containing a plurality of parameters is provided. The system (10) includes an input (40) for receiving the data file and a data processor (32) having at least one of several data processing functions. These data processing functions include, for example, a segmentation function (34) for segmenting the data records into a plurality of segments based on the parameters. The data processing functions also include a clustering function (36) for clustering the data records into a plurality of clusters containing data records having similar parameters. A prediction function (38) for predicting expected future results from the parameters in the data records may also be provided with the data processor (32).

Journal ArticleDOI
TL;DR: A self-organizing map (SOM) neural network clustering methodology is used and demonstrated to be superior to hierarchical clustering methods.

01 Jan 1996
TL;DR: This paper describes the incorporation of seven stand-alone clustering programs into S-PLUS, where they can now be used in a much more flexible way.
Abstract: This paper describes the incorporation of seven stand-alone clustering programs into S-PLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book of Kaufman and Rousseeuw (1990). These clustering methods were designed to be robust and to accept dissimilarity data as well as objects-by-variables data. Moreover, they each provide a graphical display and a quality index reflecting the strength of the clustering. The powerful graphics of S-PLUS made it possible to improve these graphical representations considerably. The integration of the clustering algorithms was performed according to the object-oriented principle supported by S-PLUS. The new functions have a uniform interface, and are compatible with existing S-PLUS functions. We will describe the basic idea and the use of each clustering method, together with its graphical features. Each function is briefly illustrated with an example.

Proceedings ArticleDOI
01 Mar 1996
TL;DR: Experience with HyPursuit suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies, and is encouraged by preliminary results on clustering based on both document contents and hyperlink structures.
Abstract: HyPursuit is a new hierarchical network search engine that clusters hypertext documents to structure a given information space for browsing and search activities. Our content-link clustering algorithm is based on the semantic information embedded in hyperlink structures and document contents. HyPursuit admits multiple, coexisting cluster hierarchies based on different principles for grouping documents, such as the Library of Congress catalog scheme and automatically created hypertext clusters. HyPursuit's abstraction functions summarize cluster contents to support scalable query processing. The abstraction functions satisfy system resource limitations with controlled information loss. The result of query processing operations on a cluster summary approximates the result of performing the operations on the entire information space. We constructed a prototype system comprising 100 leaf World Wide Web sites and a hierarchy of 42 servers that route queries to the leaf sites. Experience with our system suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies. We are also encouraged by preliminary results on clustering based on both document contents and hyperlink structures.

Journal ArticleDOI
TL;DR: Genetic Algorithms have been used in an attempt to optimize a specified objective function related to a clustering problem, and it is shown that the proposed method may improve the final output of K-Means where an improvement is possible.

Journal ArticleDOI
TL;DR: The Kohonen network, an unsupervised learning algorithm in artificial neural networks, performs self-organizing mapping, reduces the dimensionality of a complex data set, and is shown to produce easily comprehensible low-dimensional maps of the overall configuration of community groups in a target ecosystem.

Book ChapterDOI
18 Sep 1996
TL;DR: This paper describes some two dimensional plane drawing algorithms for clustered graphs and shows how to extend these algorithms to three dimensional multilevel drawings, and considers two conventions: straight-line convex drawings and orthogonal rectangular drawings.
Abstract: Clustered graphs are graphs with recursive clustering structures over the vertices. This type of structure appears in many systems. Examples include CASE tools, management information systems, VLSI design tools, and reverse engineering systems. Existing layout algorithms represent the clustering structure as recursively nested regions in the plane. However, as the structure becomes more and more complex, two dimensional plane representations tend to be insufficient. In this paper, firstly, we describe some two dimensional plane drawing algorithms for clustered graphs; then we show how to extend two dimensional plane drawings to three dimensional multilevel drawings. We consider two conventions: straight-line convex drawings and orthogonal rectangular drawings; and we show some examples.

Journal ArticleDOI
01 Jan 1996
TL;DR: Experiments show that the HEC network leads to a significant improvement in the clustering results over the K-means algorithm with Euclidean distance, and indicates that hyperellipsoidal shaped clusters are often encountered in practice.
Abstract: We propose a self-organizing network for hyperellipsoidal clustering (HEC). It consists of two layers. The first employs a number of principal component analysis subnetworks to estimate the hyperellipsoidal shapes of currently formed clusters. The second performs competitive learning using the cluster shape information from the first. The network performs partitional clustering using the proposed regularized Mahalanobis distance, which was designed to deal with the problems in estimating the Mahalanobis distance when the number of patterns in a cluster is less than or not considerably larger than the dimensionality of the feature space during clustering. This distance also achieves a tradeoff between hyperspherical and hyperellipsoidal cluster shapes so as to prevent the HEC network from producing unusually large or small clusters. The significance level of the Kolmogorov-Smirnov test on the distribution of the Mahalanobis distances of patterns in a cluster to the cluster center under the Gaussian cluster assumption is used as a compactness measure. The HEC network has been tested on a number of artificial data sets and real data sets. We also apply the HEC network to texture segmentation problems. Experiments show that the HEC network leads to a significant improvement in the clustering results over the K-means algorithm with Euclidean distance. Our results on real data sets also indicate that hyperellipsoidal shaped clusters are often encountered in practice.
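Stripped of the neural-network layers, the core idea can be sketched as a k-means-style loop with a per-cluster regularized Mahalanobis distance; the simple shrinkage toward the identity below is an assumed stand-in for the paper's regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])   # elongated
B = rng.normal(loc=(8, 0), size=(200, 2))                            # spherical
X = np.vstack([A, B])

K, lam = 2, 0.1
centers = X[rng.choice(len(X), K, replace=False)]
covs = [np.eye(2) for _ in range(K)]
for _ in range(20):
    # Assignment step: regularized Mahalanobis distance to each center,
    # so elongated clusters are not forced to be spherical.
    d = np.stack([np.einsum("ni,ij,nj->n", X - c,
                            np.linalg.inv((1 - lam) * S + lam * np.eye(2)),
                            X - c)
                  for c, S in zip(centers, covs)], axis=1)
    labels = np.argmin(d, axis=1)
    # Update step: per-cluster mean and covariance.
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    covs = [np.cov(X[labels == k].T) for k in range(K)]
print("cluster sizes:", np.bincount(labels))
```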

Journal ArticleDOI
TL;DR: This paper presents GALOIS, a system that automates and applies the theory of concept lattices, and describes a prototype user interface for browsing through the concept lattice of a document-term relation, possibly enriched with a thesaurus of terms.
Abstract: The theory of concept (or Galois) lattices provides a simple and formal approach to conceptual clustering. In this paper we present GALOIS, a system that automates and applies this theory. The algorithm utilized by GALOIS to build a concept lattice is incremental and efficient, each update being done in time at most quadratic in the number of objects in the lattice. Also, the algorithm may incorporate background information into the lattice, and through clustering, extend the scope of the theory. The application we present is concerned with information retrieval via browsing, for which we argue that concept lattices may represent major support structures. We describe a prototype user interface for browsing through the concept lattice of a document-term relation, possibly enriched with a thesaurus of terms. An experimental evaluation of the system performed on a medium-sized bibliographic database shows good retrieval performance and a significant improvement after the introduction of background knowledge.
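GALOIS builds the lattice incrementally and far more efficiently; for intuition only, here is a naive enumeration of the formal concepts of a tiny invented document-term context:

```python
from itertools import combinations

# Documents and the terms they contain (an invented toy context).
context = {
    "d1": {"clustering", "lattice"},
    "d2": {"clustering", "retrieval"},
    "d3": {"lattice", "retrieval"},
    "d4": {"clustering", "lattice", "retrieval"},
}
objects = list(context)
all_terms = set().union(*context.values())

def intent(objs):                 # terms shared by all objects in objs
    return set.intersection(*(context[o] for o in objs)) if objs else set(all_terms)

def extent(terms):                # objects containing all given terms
    return {o for o in objects if terms <= context[o]}

# A formal concept is a fixed point: extent(intent(A)) == A.
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(objects, r):
        A = extent(intent(set(objs)))
        concepts.add((frozenset(A), frozenset(intent(A))))
for ext, inte in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), sorted(inte))
```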

Proceedings ArticleDOI
18 Jun 1996
TL;DR: A Gabor feature representation for textured images is proposed, and its performance in pattern retrieval is evaluated on a large texture image database, and these features compare favorably with other existing texture representations.
Abstract: This paper addresses two important issues related to texture pattern retrieval: feature extraction and similarity search. A Gabor feature representation for textured images is proposed, and its performance in pattern retrieval is evaluated on a large texture image database. These features compare favorably with other existing texture representations. A simple hybrid neural network algorithm is used to learn the similarity by simple clustering in the texture feature space. With learning similarity the performance of similar pattern retrieval improves significantly. An important aspect of this work is its application to real image data. Texture feature extraction with similarity learning is used to search through large aerial photographs. Feature clustering enables efficient search of the database as our experimental results indicate.
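A sketch of a Gabor feature vector of the kind described: filter the image at a few scales and orientations and keep the mean and standard deviation of each response magnitude (the 4x6 filter bank below is an illustrative choice, not the paper's design):

```python
import numpy as np
from skimage.data import brick
from skimage.filters import gabor

image = brick().astype(float)        # any grayscale texture patch works
features = []
for frequency in (0.05, 0.1, 0.2, 0.4):
    for theta in np.linspace(0, np.pi, 6, endpoint=False):
        # Complex Gabor response at this scale/orientation.
        real, imag = gabor(image, frequency=frequency, theta=theta)
        mag = np.hypot(real, imag)
        features.extend([mag.mean(), mag.std()])
print("feature vector length:", len(features))    # 4 scales x 6 angles x 2 stats
```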

Journal ArticleDOI
TL;DR: A Fuzzy C-Means-based clustering method guided by an auxiliary (conditional) variable is introduced that reveals a structure within a family of patterns by considering their vicinity in a feature space along with the similarity of the values assumed by a certain conditional variable.

Proceedings ArticleDOI
Jerome R. Bellegarda, J. W. Butzberger, Yen-Lu Chow, Noah Coccaro, Devang Naik
07 May 1996
TL;DR: A new approach is proposed for the clustering of words in a given vocabulary based on a paradigm first formulated in the context of information retrieval, called latent semantic analysis, which leads to a parsimonious vector representation of each word in a suitable vector space.
Abstract: A new approach is proposed for the clustering of words in a given vocabulary. The method is based on a paradigm first formulated in the context of information retrieval, called latent semantic analysis. This paradigm leads to a parsimonious vector representation of each word in a suitable vector space, where familiar clustering techniques can be applied. The distance measure selected in this space arises naturally from the problem formulation. Preliminary experiments indicate that the clusters produced are intuitively satisfactory. Because these clusters are semantic in nature, this approach may prove useful as a complement to conventional class-based statistical language modeling techniques.
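A sketch of the pipeline on a toy corpus: SVD of a word-document count matrix gives each word a low-dimensional vector, which familiar clustering techniques can then handle (corpus and all sizes are illustrative stand-ins):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["stocks fell as markets closed lower",
        "the market rallied and stocks rose",
        "the team won the final game",
        "players scored late in the game"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)              # documents x words
words = vec.get_feature_names_out()

# Transpose so rows are words; SVD projects each word to a dense 2-D vector.
word_vecs = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts.T)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(word_vecs)
for k in range(2):
    print(k, sorted(w for w, l in zip(words, labels) if l == k))
```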

Journal ArticleDOI
TL;DR: The examples show that the semi-supervised approach provides MRI segmentations that are superior to ordinary fuzzy c-means and to the crisp k-nearest neighbor rule and further, that the new method ameliorates (P1)-(P3).