
Showing papers on "Cluster analysis" published in 1999


Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
* Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
* Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
* Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, and visualization

20,196 citations


Journal ArticleDOI
TL;DR: An overview of pattern clustering methods from a statistical pattern recognition perspective is presented, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners.
Abstract: Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

14,054 citations


Journal ArticleDOI
01 Jun 1999
TL;DR: A new algorithm is introduced for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure.
Abstract: Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only 'traditional' clustering information (e.g. representative points, arbitrarily shaped clusters), but also the intrinsic clustering structure. For medium-sized data sets, the cluster-ordering can be represented graphically, and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure, offering additional insights into the distribution and correlation of the data.
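The ordering-based approach described here is implemented in scikit-learn as OPTICS. A minimal usage sketch (not the authors' original code, and with an arbitrary min_samples choice) showing how the reachability profile exposes the density-based clustering structure:

```python
# Hedged sketch using scikit-learn's OPTICS implementation; the only
# required parameter is min_samples, and the resulting ordering covers a
# broad range of density thresholds at once.
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
optics = OPTICS(min_samples=10).fit(X)

# The augmented ordering: reachability values in visit order form a 1-D
# profile whose valleys correspond to density-based clusters.
reachability = optics.reachability_[optics.ordering_]
print(reachability[:10])
print(optics.labels_[:10])
```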

4,020 citations


Journal ArticleDOI
TL;DR: A systematic set of statistical algorithms are applied, based on whole-genome mRNA data, partitional clustering and motif discovery, to identify transcriptional regulatory sub-networks in yeast—without any a priori knowledge of their structure or any assumptions about their dynamics.
Abstract: Technologies to measure whole-genome mRNA abundances1,2,3 and methods to organize and display such data4,5,6,7,8,9,10 are emerging as valuable tools for systems-level exploration of transcriptional regulatory networks. For instance, it has been shown that mRNA data from 118 genes, measured at several time points in the developing hindbrain of mice, can be hierarchically clustered into various patterns (or 'waves') whose members tend to participate in common processes5. We have previously shown that hierarchical clustering can group together genes whose cis-regulatory elements are bound by the same proteins in vivo6. Hierarchical clustering has also been used to organize genes into hierarchical dendrograms on the basis of their expression across multiple growth conditions7. The application of Fourier analysis to synchronized yeast mRNA expression data has identified cell-cycle periodic genes, many of which have expected cis-regulatory elements8. Here we apply a systematic set of statistical algorithms, based on whole-genome mRNA data, partitional clustering and motif discovery, to identify transcriptional regulatory sub-networks in yeast—without any a priori knowledge of their structure or any assumptions about their dynamics. This approach uncovered new regulons (sets of co-regulated genes) and their putative cis-regulatory elements. We used statistical characterization of known regulons and motifs to derive criteria by which we infer the biological significance of newly discovered regulons and motifs. Our approach holds promise for the rapid elucidation of genetic network architecture in sequenced organisms in which little biology is known.

2,580 citations


Journal ArticleDOI
TL;DR: Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters, which is important for dealing with highly variable clusters.
Abstract: Clustering is a discovery process in data mining. It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model. By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters. Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters. Another set of schemes (the Rock algorithm, group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. By considering either interconnectivity or closeness only, these algorithms can select and merge the wrong pair of clusters. Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. Chameleon finds the clusters in the data set by using a two-phase algorithm. During the first phase, Chameleon uses a graph partitioning algorithm to cluster the data items into several relatively small subclusters. During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these subclusters.
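As a rough illustration of the merge criterion, the sketch below scores a candidate merge by relative interconnectivity times relative closeness raised to a user weight. The helpers `cut_weight` and `merge_score`, the half-internal-weight approximation, and the `alpha` value are assumptions for illustration, not the paper's exact formulas (Chameleon approximates internal interconnectivity via a min-bisection cut):

```python
import numpy as np

def cut_weight(S, a, b):
    """Total similarity of edges between index sets a and b."""
    return S[np.ix_(a, b)].sum()

def merge_score(S, ci, cj, alpha=2.0):
    ec = cut_weight(S, ci, cj)                # absolute interconnectivity
    # Internal interconnectivity, crudely approximated by half the total
    # internal edge weight (the paper uses a min-bisection cut instead).
    ec_i = cut_weight(S, ci, ci) / 2.0
    ec_j = cut_weight(S, cj, cj) / 2.0
    ri = ec / (0.5 * (ec_i + ec_j) + 1e-12)   # relative interconnectivity
    mean_cross = ec / (len(ci) * len(cj))     # mean crossing-edge weight
    mean_int = 0.5 * (ec_i / max(len(ci), 1) + ec_j / max(len(cj), 1))
    rc = mean_cross / (mean_int + 1e-12)      # relative closeness
    return ri * rc ** alpha                   # merge the highest-scoring pair

S = np.random.default_rng(0).random((10, 10)); S = (S + S.T) / 2
print(merge_score(S, [0, 1, 2], [3, 4]))
```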

2,111 citations


Journal ArticleDOI
TL;DR: PCA is formulated within a maximum likelihood framework, based on a specific form of gaussian latent variable model, which leads to a well-defined mixture model for probabilistic principal component analyzers, whose parameters can be determined using an expectation-maximization algorithm.
Abstract: Principal component analysis (PCA) is one of the most popular techniques for processing, compressing, and visualizing data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Therefore, previous attempts to formulate mixture models for PCA have been ad hoc to some extent. In this article, PCA is formulated within a maximum likelihood framework, based on a specific form of gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analyzers, whose parameters can be determined using an expectation-maximization algorithm. We discuss the advantages of this model in the context of clustering, density modeling, and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
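For a single analyzer, the maximum likelihood solution has a closed form: the weight matrix is built from the leading eigenvectors scaled by their eigenvalues minus the noise variance, and the noise variance is the mean of the discarded eigenvalues. A minimal numpy sketch, assuming the arbitrary rotation factor is taken as the identity (the mixture case instead uses EM, which is not shown here):

```python
# Hedged sketch of the closed-form ML solution for probabilistic PCA.
import numpy as np

def ppca_ml(X, q):
    """Fit probabilistic PCA with q latent dimensions by eigendecomposition."""
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False)           # sample covariance
    vals, vecs = np.linalg.eigh(S)             # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]     # sort descending
    sigma2 = vals[q:].mean()                   # ML noise variance: mean of
                                               # the discarded eigenvalues
    W = vecs[:, :q] * np.sqrt(np.maximum(vals[:q] - sigma2, 0.0))
    return mu, W, sigma2                       # model covariance: W W^T + sigma2 I

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
mu, W, sigma2 = ppca_ml(X, q=2)
print(W.shape, sigma2)
```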

1,927 citations


Book
10 Sep 1999
TL;DR: This book provides a comprehensive treatment of statistical pattern recognition, covering density estimation, linear and nonlinear discriminant analysis (including neural networks), classification trees, feature selection and extraction, and clustering.
Abstract: Introduction to statistical pattern recognition * Estimation * Density estimation * Linear discriminant analysis * Nonlinear discriminant analysis - neural networks * Nonlinear discriminant analysis - statistical methods * Classification trees * Feature selection and extraction * Clustering * Additional topics * Measures of dissimilarity * Parameter estimation * Linear algebra * Data * Probability theory

1,813 citations


Journal ArticleDOI
01 Jun 1999
TL;DR: A novel hybrid genetic algorithm that finds a globally optimal partition of a given data set into a specified number of clusters by hybridizing the GA with a classical gradient descent algorithm used in clustering, viz. the K-means algorithm.
Abstract: In this paper, we propose a novel hybrid genetic algorithm (GA) that finds a globally optimal partition of a given data set into a specified number of clusters. GAs used earlier in clustering employ either an expensive crossover operator to generate valid child chromosomes from parent chromosomes, or a costly fitness function, or both. To circumvent these expensive operations, we hybridize the GA with a classical gradient descent algorithm used in clustering, viz. the K-means algorithm. Hence the name genetic K-means algorithm (GKA). We define the K-means operator, one step of the K-means algorithm, and use it in GKA as a search operator instead of crossover. We also define a biased mutation operator specific to clustering, called distance-based mutation. Using finite Markov chain theory, we prove that GKA converges to the global optimum. It is observed in the simulations that GKA converges to the best known optimum corresponding to the given data, in concurrence with the convergence result. It is also observed that GKA searches faster than some of the other evolutionary algorithms used for clustering.
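A hedged sketch of the K-means operator as described, i.e., a single centroid-update plus reassignment step applied to a chromosome that encodes one cluster label per point; the random reseeding of empty clusters is an assumption of this illustration, not necessarily the paper's choice:

```python
import numpy as np

def kmeans_operator(X, labels, k, rng):
    # Update step: recompute each centroid from its current members; an
    # empty cluster is reseeded at a random point (illustrative choice).
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else X[rng.integers(len(X))] for j in range(k)])
    # Assignment step: reassign every point to its nearest centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels = rng.integers(0, 3, size=100)   # a random label "chromosome"
labels = kmeans_operator(X, labels, k=3, rng=rng)
print(np.bincount(labels, minlength=3))
```

Inside GKA this operator replaces crossover, while the separate distance-based mutation perturbs label assignments with probability related to point-to-centroid distances.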

1,326 citations


Proceedings ArticleDOI
23 Mar 1999
TL;DR: This work develops a robust hierarchical clustering algorithm, ROCK, that employs links and not distances when merging clusters, and shows that ROCK not only generates better quality clusters than traditional algorithms, but also exhibits good scalability properties.
Abstract: We study clustering algorithms for data with Boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for Boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm, ROCK, that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets. Our study shows that ROCK not only generates better quality clusters than traditional algorithms, but also exhibits good scalability properties.
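A minimal sketch of the link computation, assuming a precomputed similarity matrix and a neighbor threshold theta; the goodness measure ROCK applies on top of these counts when merging clusters is omitted:

```python
import numpy as np

def links(S, theta):
    """S: n x n similarity matrix; returns n x n counts of common neighbors."""
    A = (S >= theta).astype(int)    # neighbor adjacency
    np.fill_diagonal(A, 0)          # a point is not its own neighbor
    return A @ A                    # entry (a, b) = |N(a) & N(b)|

S = np.random.default_rng(1).random((6, 6)); S = (S + S.T) / 2
print(links(S, theta=0.5))
```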

1,322 citations


Journal ArticleDOI
TL;DR: This paper defines an appropriate stochastic error model on the input, and proves that under the conditions of the model, the algorithm recovers the cluster structure with high probability, and presents a practical heuristic based on the same algorithmic ideas.
Abstract: Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. The corresponding algorithmic problem is to cluster multicondition gene expression patterns. In this paper we describe a novel clustering algorithm that was developed for analysis of gene expression data. We define an appropriate stochastic error model on the input, and prove that under the conditions of the model, the algorithm recovers the cluster structure with high probability. The running time of the algorithm on an n-gene dataset is O(n^2 (log n)^c). We also present a practical heuristic based on the same algorithmic ideas. The heuristic was implemented and its performance is demonstrated on simulated data and on real gene expression data, with very promising results.
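The practical heuristic grows one cluster at a time by comparing each element's average affinity to the open cluster against a threshold. The sketch below is in that spirit (a CAST-style loop) rather than a faithful reimplementation, and it assumes similarities in [0, 1] with unit self-similarity:

```python
import numpy as np

def cast_like(S, t):
    """S: similarity matrix with unit diagonal; t in (0, 1) is the threshold."""
    n = len(S)
    unassigned, clusters = set(range(n)), []
    while unassigned:
        cluster, affinity = set(), np.zeros(n)  # affinity[i] = sum of S[i, c]
        changed = True
        while changed:
            changed = False
            outside = unassigned - cluster
            if outside:                          # add the best outside element
                best = max(outside, key=lambda i: affinity[i])
                if not cluster or affinity[best] >= t * len(cluster):
                    cluster.add(best); affinity += S[best]; changed = True
            if cluster:                          # drop a low-affinity member
                worst = min(cluster, key=lambda i: affinity[i])
                if affinity[worst] < t * len(cluster):
                    cluster.remove(worst); affinity -= S[worst]; changed = True
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters

rng = np.random.default_rng(2)
S = rng.random((8, 8)); S = (S + S.T) / 2; np.fill_diagonal(S, 1.0)
print(cast_like(S, t=0.5))
```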

1,241 citations


Journal ArticleDOI
TL;DR: A similarity measure that reduces the number of false positives, a new clustering algorithm designed specifically for grouping gene expression patterns, and an interactive graphical cluster analysis tool that allows user feedback and validation are described.
Abstract: Analysis procedures are needed to extract useful information from the large amount of gene expression data that is becoming available. This work describes a set of analytical tools and their application to yeast cell cycle data. The components of our approach are (1) a similarity measure that reduces the number of false positives, (2) a new clustering algorithm designed specifically for grouping gene expression patterns, and (3) an interactive graphical cluster analysis tool that allows user feedback and validation. We use the clusters generated by our algorithm to summarize genome-wide expression and to initiate supervised clustering of genes into biologically meaningful groups.

Journal ArticleDOI
01 Jun 1999
TL;DR: An algorithmic framework for solving the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves, is developed and tested.
Abstract: The clustering problem is well known in the database literature for its numerous applications in problems such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore discuss a generalization of the clustering problem, referred to as the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves. We develop an algorithmic framework for solving the projected clustering problem, and test its performance on synthetic data.

Proceedings ArticleDOI
23 Jun 1999
TL;DR: A Distributed Clustering Algorithm (DCA) and a Distributed Mobility-Adaptive Clustering (DMAC) algorithm are presented that partition the nodes of a fully mobile (ad hoc) network into clusters, giving the network a hierarchical organization.
Abstract: A Distributed Clustering Algorithm (DCA) and a Distributed Mobility-Adaptive Clustering (DMAC) algorithm are presented that partition the nodes of a fully mobile network (ad hoc network) into clusters, thus giving the network a hierarchical organization. Nodes are grouped by following a new weight-based criterion that allows the choice of the nodes that coordinate the clustering process based on node mobility-related parameters. The DCA is suitable for clustering "quasi-static" ad hoc networks. It is easy to implement and its time complexity is proven to be bounded by a network parameter that depends on the topology of the network rather than on its size, i.e., the number of network nodes. The DMAC algorithm adapts to the changes in the network topology due to the mobility of the nodes, and it is thus suitable for any mobile environment. Both algorithms are executed at each node with the sole knowledge of the identity of the one-hop neighbors, and induce on the network the same clustering structure.
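A centralized toy stand-in for the weight-based rule (the paper's algorithms are distributed and message-driven, which is not reproduced here): a node becomes a clusterhead when no heavier neighbor has already covered it. The function name and data layout are illustrative assumptions:

```python
def elect_clusterheads(weights, neighbors):
    """weights: dict node -> weight; neighbors: dict node -> set of nodes."""
    heads, covered = set(), set()
    # Process nodes from heaviest to lightest, mimicking the rule that a
    # node decides once all heavier neighbors have decided.
    for node in sorted(weights, key=weights.get, reverse=True):
        if node not in covered:
            heads.add(node)
            covered |= {node} | neighbors[node]
    return heads

print(elect_clusterheads({1: 0.9, 2: 0.5, 3: 0.7},
                         {1: {2}, 2: {1, 3}, 3: {2}}))  # -> {1, 3}
```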

Proceedings ArticleDOI
01 Dec 1999
TL;DR: A technique that computes comprehensive pair-wise mutual information for all genes in such a data set and shows how this technique was used on a public data set of 79 RNA expression measurements of 2,467 genes to construct 22 clusters, or Relevance Networks.
Abstract: Increasing numbers of methodologies are available to find functional genomic clusters in RNA expression data. We describe a technique that computes comprehensive pair-wise mutual information for all genes in such a data set. An association with a high mutual information means that one gene is non-randomly associated with another; we hypothesize this means the two are related biologically. By picking a threshold mutual information and using only associations at or above the threshold, we show how this technique was used on a public data set of 79 RNA expression measurements of 2,467 genes to construct 22 clusters, or Relevance Networks. The biological significance of each Relevance Network is explained.
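A hedged sketch of the pairwise mutual information computation, assuming a simple equal-width binning of expression values (the paper's exact discretization may differ):

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Estimate MI (bits) between two expression profiles via a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask])).sum())

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 79))   # toy: 100 genes x 79 measurements
print(mutual_information(expr[0], expr[1]))
# Keeping only gene pairs with MI at or above a chosen threshold yields a
# graph whose connected components are the Relevance Networks.
```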

Proceedings ArticleDOI
01 Aug 1999
TL;DR: An unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase, and a refinement to center adjustment, “vector average damping,” that further improves cluster quality.
Abstract: Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space, then clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tf-idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, "vector average damping," that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.
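A compact stand-in for the evaluated pipeline using scikit-learn (not the paper's system): tf-idf feature extraction followed by k-means, whose iterative center recomputation corresponds to the continuous center adjustment discussed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stocks fell sharply", "market rally continues",
        "team wins final", "player scores twice"]

# Phase 1: map each document to a point in tf-idf space.
X = TfidfVectorizer(max_features=1000).fit_transform(docs)
# Phase 2: group the points; each iteration re-adjusts the centers.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```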

Book
09 Jul 1999
TL;DR: This book presents the foundations of fuzzy cluster analysis, covering classical fuzzy clustering algorithms, linear, ellipsoidal, shell, and polygonal prototypes, cluster estimation models, cluster validity, and rule generation with clustering.
Abstract: Introduction. Basic Concepts. Classical Fuzzy Clustering Algorithms. Linear and Ellipsoidal Prototypes. Shell Prototypes. Polygonal Object Boundaries. Cluster Estimation Models. Cluster Validity. Rule Generation with Clustering. Appendix. Bibliography.

Journal ArticleDOI
TL;DR: There is increasing agreement that clustering helps small enterprises overcome growth constraints and compete in distant markets, but there is also recognition that this is not an automatic outcome; recent research on industrial clusters has made a major contribution to this shift in the debate.

Journal ArticleDOI
17 May 1999
TL;DR: This paper introduces Grouper, an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets, and reports on the first empirical comparison of user Web search behavior on a standard ranked-list presentation versus a clustered presentation.
Abstract: Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on most major search engines. The NorthernLight search engine organizes its output into "custom folders" based on pre-computed document labels, but does not reveal how the folders are generated or how well they correspond to users' interests. In this paper, we introduce Grouper, an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets. In addition, we report on the first empirical comparison of user Web search behavior on a standard ranked-list presentation versus a clustered presentation. By analyzing HuskySearch logs, we are able to demonstrate substantial differences in the number of documents followed, and in the amount of time and effort expended by users accessing search results through these two interfaces.

Journal ArticleDOI
TL;DR: The results suggest that the Kaufman initialization method induces more desirable convergence behaviour in the K-means algorithm than the random initialization method does.
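A sketch of the Kaufman initialization heuristic as usually described (the most central point first, then repeatedly the candidate that most reduces the total distance from every point to its nearest seed); the details here are a plausible reading of that heuristic, not taken from this paper:

```python
import numpy as np

def kaufman_init(X, k):
    """Return k seed points chosen by the Kaufman-style greedy heuristic."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    seeds = [int(D.sum(axis=1).argmin())]                # most central point
    while len(seeds) < k:
        nearest = D[:, seeds].min(axis=1)                # dist to chosen seeds
        # Gain of candidate c: total reduction in each point's distance to
        # its nearest seed if c were added as a new seed.
        gains = np.maximum(nearest[None, :] - D, 0.0).sum(axis=1)
        gains[seeds] = -1.0                              # never re-pick a seed
        seeds.append(int(gains.argmax()))
    return X[seeds]

X = np.random.default_rng(0).normal(size=(50, 2))
print(kaufman_init(X, k=3))
```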

Journal ArticleDOI
TL;DR: Experiments showed that Close is very efficient for mining dense and/or correlated data such as census-style data, and performs reasonably well for market-basket-style data.

Journal ArticleDOI
TL;DR: A novel framework for dynamically organizing mobile nodes in wireless ad hoc networks into clusters in which the probability of path availability can be bounded is presented, which supports an adaptive hybrid routing architecture that can be more responsive and effective when mobility rates are low and more efficient when mobility rates are high.
Abstract: This paper presents a novel framework for dynamically organizing mobile nodes in wireless ad hoc networks into clusters in which the probability of path availability can be bounded. The purpose of the (α, t) cluster is to help minimize the far-reaching effects of topological changes while balancing the need to support more optimal routing. A mobility model for ad hoc networks is developed and is used to derive expressions for the probability of path availability as a function of time. It is shown how this model provides the basis for dynamically grouping nodes into clusters using an efficient distributed clustering algorithm. Since the criteria for cluster organization depend directly upon path availability, the structure of the cluster topology is adaptive with respect to node mobility. Consequently, this framework supports an adaptive hybrid routing architecture that can be more responsive and effective when mobility rates are low and more efficient when mobility rates are high.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: This work considers a database with numerical attributes, in which each transaction is viewed as a multi-dimensional vector, and identifies new meaningful criteria of high density and correlation of dimensions for goodness of clustering in subspaces.
Abstract: Mining numerical data is a relatively difficult problem in data mining, and clustering is one of the techniques for it. We consider a database with numerical attributes, in which each transaction is viewed as a multi-dimensional vector. By studying the clusters formed by these vectors, we can discover certain behaviors hidden in the data. Traditional clustering algorithms find clusters in the full space of the data sets. This results in high-dimensional clusters, which are poorly comprehensible to humans. One important task in this setting is the ability to discover clusters embedded in the subspaces of a high-dimensional data set. This problem is known as subspace clustering. We follow the basic assumptions of the previous work, CLIQUE. It is found that the number of subspaces with clustering is very large, and a criterion called coverage is proposed in CLIQUE for pruning. In addition to coverage, we identify new useful criteria for this problem and propose an entropy-based algorithm called ENCLUS to handle them. Our major contributions are: (1) identify new meaningful criteria of high density and correlation of dimensions for goodness of clustering in subspaces, (2) introduce the use of entropy and provide evidence to support its use, (3) make use of two closure properties based on entropy to prune away uninteresting subspaces efficiently, (4) propose a mechanism to mine non-minimally correlated subspaces which are of interest because of strong clustering, and (5) carry out experiments to show the effectiveness of the proposed method.
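A minimal sketch of the entropy measurement at the core of ENCLUS, assuming an equal-width grid: a subspace with strong clustering concentrates mass in few cells and therefore scores low entropy, and entropy's closure properties let uninteresting subspaces be pruned:

```python
import numpy as np

def subspace_entropy(X, dims, bins=8):
    """Entropy (bits) of the data projected onto the given dimensions."""
    H, _ = np.histogramdd(X[:, dims], bins=bins)
    p = H.ravel() / H.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 0.1, (500, 2)),    # clustered dimensions
               rng.uniform(-1, 1, (500, 2))])   # uniform dimensions
print(subspace_entropy(X, [0, 1]))  # lower entropy: strong clustering
print(subspace_entropy(X, [2, 3]))  # higher entropy: no clustering
```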

Proceedings ArticleDOI
01 Aug 1999
TL;DR: This paper introduces a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical attributes and describes a very fast summarization-based algorithm called CACTUS that discovers exactly such clusters in the data.
Abstract: Clustering is an important data mining problem. Most of the earlier work on clustering focussed on numeric attributes, which have a natural ordering on their attribute values. Recently, clustering data with categorical attributes, whose attribute values do not have a natural ordering, has received some attention. However, previous algorithms do not give a formal description of the clusters they discover, and some of them assume that the user post-processes the output of the algorithm to identify the final clusters. In this paper, we introduce a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical attributes. We then describe a very fast summarization-based algorithm called CACTUS that discovers exactly such clusters in the data. CACTUS has two important characteristics. First, the algorithm requires only two scans of the dataset, and hence is very fast and scalable. Our experiments on a variety of datasets show that CACTUS outperforms previous work by a factor of 3 to 10. Second, CACTUS can find clusters in subsets of all attributes and can thus perform a subspace clustering of the data. This feature is important if clusters do not span all attributes, a likely scenario if the number of attributes is very large. In a thorough experimental evaluation, we study the performance of CACTUS on real and synthetic datasets.

Journal ArticleDOI
TL;DR: This paper addresses three major issues associated with conventional partitional clustering, namely, sensitivity to initialization, difficulty in determining the number of clusters, and sensitivity to noise and outliers with the proposed robust competitive agglomeration (RCA).
Abstract: This paper addresses three major issues associated with conventional partitional clustering, namely, sensitivity to initialization, difficulty in determining the number of clusters, and sensitivity to noise and outliers. The proposed robust competitive agglomeration (RCA) algorithm starts with a large number of clusters to reduce the sensitivity to initialization, and determines the actual number of clusters by a process of competitive agglomeration. Noise immunity is achieved by incorporating concepts from robust statistics into the algorithm. RCA assigns two different sets of weights for each data point: the first set of constrained weights represents degrees of sharing, and is used to create a competitive environment and to generate a fuzzy partition of the data set. The second set corresponds to robust weights, and is used to obtain robust estimates of the cluster prototypes. By choosing an appropriate distance measure in the objective function, RCA can be used to find an unknown number of clusters of various shapes in noisy data sets, as well as to fit an unknown number of parametric models simultaneously. Several examples, such as clustering/mixture decomposition, line/plane fitting, segmentation of range images, and estimation of motion parameters of multiple objects, are shown.

Journal ArticleDOI
TL;DR: MCLUST is a software package for cluster analysis written in Fortran and interfaced to the S-PLUS commercial software package, and it includes functions that combine hierarchical clustering, EM, and the Bayesian Information Criterion (BIC) in a comprehensive clustering strategy.
Abstract: MCLUST is a software package for cluster analysis written in Fortran and interfaced to the S-PLUS commercial software package. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models, with the possible addition of a Poisson noise term. MCLUST also includes functions that combine hierarchical clustering, EM, and the Bayesian Information Criterion (BIC) in a comprehensive clustering strategy. Methods of this type have shown promise in a number of practical applications, including character recognition, tissue segmentation, minefield and seismic fault detection, identification of textile flaws from images, and classification of astronomical data. A web page with related links can be found at
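A rough modern analogue of the EM-plus-BIC model selection strategy, using scikit-learn's Gaussian mixtures (the original package is Fortran/S-PLUS, varies the covariance parameterization, and seeds EM from model-based hierarchical clustering; none of that is reproduced here):

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Fit mixtures over a range of component counts and keep the best BIC
# (scikit-learn's bic() is lower-is-better).
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 8)),
    key=lambda m: m.bic(X),
)
print(best.n_components, best.bic(X))
```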

Journal ArticleDOI
TL;DR: Two different clustering algorithms are presented and used to identify regions of similar activations in an fMRI experiment involving a visual stimulus, and a novel metric is employed that measures the similarity between the activation stimulus and the fMRI signal.

Proceedings Article
31 Jul 1999
TL;DR: This paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior and presents EM algorithms for different variants of the aspect model.
Abstract: This paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. Two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. We present EM algorithms for different variants of the aspect model and derive an approximate EM algorithm based on a variational principle for the two-sided clustering model. The benefits of the different models are experimentally investigated on a large movie data set.

Proceedings ArticleDOI
30 Aug 1999
TL;DR: A clustering tool called Bunch is developed that creates a system decomposition automatically by treating clustering as an optimization problem and a feature that enables the integration of designer knowledge about the system structure into an otherwise fully automatic clustering process is described.
Abstract: Software systems are typically modified in order to extend or change their functionality, improve their performance, port them to different platforms, and so on. For developers, it is crucial to understand the structure of a system before attempting to modify it. The structure of a system, however, may not be apparent to new developers, because the design documentation is non-existent or, worse, inconsistent with the implementation. This problem could be alleviated if developers were somehow able to produce high-level system decomposition descriptions from the low-level structures present in the source code. We have developed a clustering tool called Bunch that creates a system decomposition automatically by treating clustering as an optimization problem. The paper describes the extensions made to Bunch in response to feedback we received from users. The most important extension, in terms of the quality of results and execution efficiency, is a feature that enables the integration of designer knowledge about the system structure into an otherwise fully automatic clustering process. We use a case study to show how our new features simplified the task of extracting the subsystem structure of a medium size program, while exposing an interesting design flaw in the process.

Journal ArticleDOI
TL;DR: A new approach is developed, which allows the use of the k-means paradigm to efficiently cluster large categorical data sets and a fuzzy k-modes algorithm is presented and the effectiveness of the algorithm is demonstrated with experimental results.
Abstract: This correspondence describes extensions to the fuzzy k-means algorithm for clustering categorical data. By using a simple matching dissimilarity measure for categorical objects and modes instead of means for clusters, a new approach is developed, which allows the use of the k-means paradigm to efficiently cluster large categorical data sets. A fuzzy k-modes algorithm is presented and the effectiveness of the algorithm is demonstrated with experimental results.
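A hedged sketch of the two ingredients named above: the simple matching dissimilarity and a membership-weighted mode update. The fuzzifier value and helper names are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def matching_dissimilarity(x, mode):
    """Number of attributes on which a categorical object and a mode differ."""
    return int((x != mode).sum())

def update_mode(Xc, memberships=None, m=1.5):
    """Column-wise (weighted) most frequent category over cluster members Xc.

    In the fuzzy variant each object's vote is its membership raised to the
    fuzzifier m; with memberships=None this reduces to the crisp k-modes update.
    """
    w = np.ones(len(Xc)) if memberships is None else memberships ** m
    mode = []
    for col in Xc.T:
        cats = np.unique(col)
        mode.append(cats[np.argmax([w[col == c].sum() for c in cats])])
    return np.array(mode)

Xc = np.array([["a", "x"], ["a", "y"], ["b", "y"]])
print(update_mode(Xc))                                   # -> ['a' 'y']
print(matching_dissimilarity(Xc[0], update_mode(Xc)))    # -> 1
```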

Journal ArticleDOI
01 Dec 1999
TL;DR: An equivalence between the concepts of fuzzy clustering and soft competitive learning in clustering algorithms is proposed as a unifying framework in the comparison of clustering systems.
Abstract: For pt. I see ibid., p. 775-85. In part I, an equivalence between the concepts of fuzzy clustering and soft competitive learning in clustering algorithms is proposed on the basis of the existing literature. Moreover, a set of functional attributes is selected for use as dictionary entries in the comparison of clustering algorithms. In this paper, five clustering algorithms taken from the literature are reviewed, assessed and compared on the basis of the selected properties of interest. These clustering models are (1) self-organizing map (SOM); (2) fuzzy learning vector quantization (FLVQ); (3) fuzzy adaptive resonance theory (fuzzy ART); (4) growing neural gas (GNG); (5) fully self-organizing simplified adaptive resonance theory (FOSART). Although our theoretical comparison is fairly simple, it yields observations that may appear paradoxical. First, only FLVQ, fuzzy ART, and FOSART exploit concepts derived from fuzzy set theory (e.g., relative and/or absolute fuzzy membership functions). Secondly, only SOM, FLVQ, GNG, and FOSART employ soft competitive learning mechanisms, which are affected by asymptotic misbehaviors in the case of FLVQ; i.e., only SOM, GNG, and FOSART are considered effective fuzzy clustering algorithms.