Showing papers on "Cluster analysis" published in 2003


Journal ArticleDOI
TL;DR: A novel graph theoretic clustering algorithm, "Molecular Complex Detection" (MCODE), that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes is described.
Abstract: Recent advances in proteomics technologies such as two-hybrid, phage display and mass spectrometry have enabled us to create a detailed map of biomolecular interaction networks. Initial mapping efforts have already produced a wealth of data. As the size of the interaction set increases, databases and computational methods will be required to store, visualize and analyze the information in order to effectively aid in knowledge discovery. This paper describes a novel graph theoretic clustering algorithm, "Molecular Complex Detection" (MCODE), that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes. The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The algorithm has the advantage over other graph clustering methods of having a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network and allows examination of cluster interconnectivity, which is relevant for protein networks. Protein interaction and complex information from the yeast Saccharomyces cerevisiae was used for evaluation. Dense regions of protein interaction networks can be found, based solely on connectivity data, many of which correspond to known protein complexes. The algorithm is not affected by a known high rate of false positives in data from high-throughput interaction techniques. The program is available from ftp://ftp.mshri.on.ca/pub/BIND/Tools/MCODE .
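As a rough illustration of the seeded, density-based expansion the abstract describes (not the authors' implementation, whose vertex weighting uses the highest k-core of each neighborhood), a minimal sketch in Python with networkx might look like the following; the simpler neighborhood-density weight, the `vwp` cutoff, and the minimum cluster size are illustrative assumptions:

```python
# A minimal MCODE-style sketch: weight vertices by local neighborhood density, then grow
# a cluster outward from the highest-weight unused seed, adding neighbors whose weight is
# within `vwp` (vertex weight percentage) of the seed's weight. Illustrative only.
import networkx as nx

def neighborhood_weight(G, v):
    """Density of the subgraph induced by v and its neighbors."""
    nodes = set(G.neighbors(v)) | {v}
    return nx.density(G.subgraph(nodes))

def mcode_like_clusters(G, vwp=0.2):
    w = {v: neighborhood_weight(G, v) for v in G}
    seen, clusters = set(), []
    for seed in sorted(G, key=w.get, reverse=True):
        if seed in seen:
            continue
        cluster, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            for n in G.neighbors(u):
                if n not in cluster and n not in seen and w[n] >= w[seed] * (1 - vwp):
                    cluster.add(n)
                    frontier.append(n)
        seen |= cluster
        if len(cluster) > 2:          # keep complexes of at least three proteins
            clusters.append(cluster)
    return clusters
```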

4,599 citations


Journal ArticleDOI
TL;DR: This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings and proposes three effective and efficient techniques for obtaining high-quality combiners (consensus functions).
Abstract: This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters which then compete for each object to determine the combined clustering. Due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. Promising results are obtained in all three situations for synthetic as well as real data-sets.
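A minimal sketch of the first, similarity-based consensus function (in the spirit of the paper's cluster-based similarity partitioning): build a co-association matrix from the input partitionings and recluster it. The choice of average-linkage clustering and the toy example are assumptions for illustration:

```python
# Similarity-based consensus: count how often each pair of objects is co-clustered across
# the input partitionings, then recluster that induced similarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_labels(labelings, k):
    """labelings: list of 1-D integer label arrays over the same n objects."""
    labelings = [np.asarray(l) for l in labelings]
    n = len(labelings[0])
    coassoc = np.zeros((n, n))
    for l in labelings:
        coassoc += (l[:, None] == l[None, :])
    coassoc /= len(labelings)            # fraction of partitions agreeing on each pair
    dist = 1.0 - coassoc                 # turn similarity into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")

# example: combine three noisy partitions of six objects into k=2 consensus clusters
parts = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
print(consensus_labels(parts, k=2))
```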

4,375 citations


Journal ArticleDOI
TL;DR: The global k-means algorithm is presented which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N executions of the k-Means algorithm from suitable initial positions.
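A hedged sketch of the incremental procedure the summary describes, using scikit-learn's KMeans: centers are added one at a time, and each data point is tried as the initial position of the new center. This is an illustration (it assumes K >= 2), not the authors' code:

```python
# Global-k-means-style incremental clustering: grow from 1 to K centers, trying every data
# point as the candidate position for the new center and keeping the lowest-inertia fit.
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, K):
    centers = X.mean(axis=0, keepdims=True)          # the 1-cluster solution
    best = None
    for k in range(2, K + 1):
        best = None
        for x in X:                                   # N candidate seeds for the new center
            init = np.vstack([centers, x])
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        centers = best.cluster_centers_
    return best
```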

2,544 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that many real networks in nature and society share two generic properties: they are scale-free and they display a high degree of clustering, implying that small groups of nodes organize in a hierarchical manner into increasingly large groups, while maintaining a scale free topology.
Abstract: Many real networks in nature and society share two generic properties: they are scale-free and they display a high degree of clustering. We show that these two features are the consequence of a hierarchical organization, implying that small groups of nodes organize in a hierarchical manner into increasingly large groups, while maintaining a scale-free topology. In hierarchical networks, the degree of clustering characterizing the different groups follows a strict scaling law, which can be used to identify the presence of a hierarchical organization in real networks. We find that several real networks, such as the World Wide Web, the actor network, the Internet at the domain level, and the semantic web obey this scaling law, indicating that hierarchy is a fundamental characteristic of many complex systems.
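One way to probe the scaling law mentioned above is to plot the average clustering coefficient of nodes against their degree; the reported signature of hierarchy is C(k) decaying roughly as k^-1. The sketch below, using networkx, is illustrative only (a Barabási-Albert graph is scale-free but not hierarchical, so it should not show the decay):

```python
# Measure the average local clustering coefficient as a function of node degree; in a
# hierarchical network this curve should fall off approximately as k^-1.
import networkx as nx
from collections import defaultdict

def clustering_vs_degree(G):
    cc = nx.clustering(G)
    by_degree = defaultdict(list)
    for node, k in G.degree():
        if k > 1:                      # clustering is trivially zero for degree < 2
            by_degree[k].append(cc[node])
    return {k: sum(v) / len(v) for k, v in sorted(by_degree.items())}

# example on a scale-free but non-hierarchical graph: expect a roughly flat curve
print(clustering_vs_degree(nx.barabasi_albert_graph(2000, 3)))
```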

1,981 citations


Proceedings ArticleDOI
09 Jul 2003
TL;DR: This paper proposes a distributed, randomized clustering algorithm to organize the sensors in a wireless sensor network into clusters, and extends this algorithm to generate a hierarchy of clusterheads and observes that the energy savings increase with the number of levels in the hierarchy.
Abstract: A wireless network consisting of a large number of small sensors with low-power transceivers can be an effective tool for gathering data in a variety of environments. The data collected by each sensor is communicated through the network to a single processing center that uses all reported data to determine characteristics of the environment or detect an event. The communication or message passing process must be designed to conserve the limited energy resources of the sensors. Clustering sensors into groups, so that sensors communicate information only to clusterheads and then the clusterheads communicate the aggregated information to the processing center, may save energy. In this paper, we propose a distributed, randomized clustering algorithm to organize the sensors in a wireless sensor network into clusters. We then extend this algorithm to generate a hierarchy of clusterheads and observe that the energy savings increase with the number of levels in the hierarchy. Results in stochastic geometry are used to derive solutions for the values of parameters of our algorithm that minimize the total energy spent in the network when all sensors report data through the clusterheads to the processing center.
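A toy sketch of the single-level randomized step described above: each sensor elects itself clusterhead with probability p and every other sensor joins the nearest clusterhead. The value of p, the uniform deployment, and the absence of an energy model are simplifying assumptions for illustration:

```python
# Randomized single-level clustering of sensors: probabilistic clusterhead election,
# followed by nearest-clusterhead assignment. Illustrative, not the paper's algorithm.
import random, math

def cluster_sensors(positions, p=0.1, seed=0):
    rng = random.Random(seed)
    heads = [i for i in range(len(positions)) if rng.random() < p] or [0]
    assignment = {}
    for i, (x, y) in enumerate(positions):
        assignment[i] = min(heads, key=lambda h: math.dist((x, y), positions[h]))
    return heads, assignment

positions = [(random.random(), random.random()) for _ in range(50)]
heads, assignment = cluster_sensors(positions)
print(len(heads), "clusterheads elected")
```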

1,935 citations


Proceedings ArticleDOI
13 Jun 2003
TL;DR: A new symbolic representation of time series is introduced that is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series.
Abstract: The parallel explosions of interest in streaming data, and data mining of time series have had surprisingly little intersection. This is in spite of the fact that time series data are typically streaming data. The main reason for this apparent paradox is the fact that the vast majority of work on streaming data explicitly assumes that the data is discrete, whereas the vast majority of time series data is real valued. Many researchers have also considered transforming real valued time series into symbolic representations, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities, in addition to allowing formerly "batch-only" problems to be tackled by the streaming community. While many symbolic representations of time series have been introduced over the past decades, they all suffer from three fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. Finally, most of these symbolic approaches require one to have access to all the data, before creating the symbolic representation. This last feature explicitly thwarts efforts to use the representations with streaming algorithms. In this work we introduce a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. Finally, our representation allows the real valued data to be converted in a streaming fashion, with only an infinitesimal time and space overhead. We will demonstrate the utility of our representation on the classic data mining tasks of clustering, classification, query by content and anomaly detection.
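A compact sketch of the kind of symbolic conversion described (in the style of SAX): z-normalize, apply piecewise aggregate approximation (PAA), then map segment means to symbols using breakpoints that divide the standard normal distribution into equiprobable regions. The 4-symbol alphabet and the segment count are illustrative choices:

```python
# SAX-like conversion of a real-valued series into a short symbol string.
import numpy as np

BREAKPOINTS = [-0.6745, 0.0, 0.6745]       # equiprobable regions of N(0, 1), alphabet size 4
ALPHABET = "abcd"

def to_symbols(series, n_segments=8):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                 # z-normalize
    segments = np.array_split(x, n_segments)               # PAA: mean of each segment
    paa = np.array([seg.mean() for seg in segments])
    idx = np.searchsorted(BREAKPOINTS, paa)                # map segment means to symbols
    return "".join(ALPHABET[i] for i in idx)

print(to_symbols(np.sin(np.linspace(0, 4 * np.pi, 128))))
```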

1,922 citations


Proceedings ArticleDOI
28 Jul 2003
TL;DR: This paper proposes a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus that surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
Abstract: In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
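A minimal sketch of the clustering rule described above using scikit-learn: factorize a (tf-idf) document-term matrix with NMF and assign each document to the latent axis on which it has the largest coefficient. The toy corpus, the tf-idf weighting, and the NMF initialization are assumptions for illustration:

```python
# NMF-based document clustering: the cluster of each document is the latent topic (axis)
# with the largest projection value in the factorization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the stock market fell sharply today",
    "investors worry about the market and stocks",
    "the team won the football match",
    "a late goal decided the football game",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # documents x terms
W = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(X)
labels = W.argmax(axis=1)                                       # largest projection = cluster
print(labels)
```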

1,903 citations


Book ChapterDOI
09 Sep 2003
TL;DR: A fundamentally different philosophy for data stream clustering is discussed which is guided by application-centered requirements and uses the concepts of a pyramidal time frame in conjunction with a microclustering approach.
Abstract: The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component which periodically stores detailed summary statistics and an offline component which uses only these summary statistics. The offline component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.
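A stripped-down sketch of the kind of additive micro-cluster summary the online component maintains (a count, a linear sum, and a squared sum per cluster), absorbing points in a single pass. The pyramidal time frame, timestamps, and the deletion/merging logic of the actual framework are omitted, and all names and thresholds below are ours:

```python
# One-pass maintenance of additive micro-cluster statistics for stream clustering.
import numpy as np

class MicroCluster:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), p * p      # count, linear sum, squared sum

    def absorb(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += p * p

    @property
    def centroid(self):
        return self.ls / self.n

    @property
    def radius(self):
        var = self.ss / self.n - self.centroid ** 2
        return float(np.sqrt(np.maximum(var, 0).sum()))

def stream_cluster(points, max_clusters=10, radius_factor=2.0):
    clusters = []
    for p in points:
        p = np.asarray(p, dtype=float)
        if clusters:
            nearest = min(clusters, key=lambda c: np.linalg.norm(c.centroid - p))
            if np.linalg.norm(nearest.centroid - p) <= radius_factor * (nearest.radius or 1.0):
                nearest.absorb(p)
                continue
        if len(clusters) < max_clusters:
            clusters.append(MicroCluster(p))
        else:
            nearest.absorb(p)   # the full algorithm would instead merge or retire old clusters
    return clusters
```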

1,836 citations


Journal ArticleDOI
TL;DR: A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
Abstract: In this paper we present a new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data. The method can best be thought of as an analysis approach, to guide and assist in the use of any of a wide range of available clustering algorithms. We call the new methodology consensus clustering, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. Finally, it provides for a visualization tool to inspect cluster number, membership, and boundaries. We present the results of our experiments on both simulated data and real gene expression data aimed at evaluating the effectiveness of the methodology in discovering biologically meaningful clusters.
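A brief sketch of the resampling-based consensus matrix idea: repeatedly subsample the items, cluster each subsample (k-means stands in here for any base algorithm), and record how often each pair of items lands in the same cluster among the runs where both were sampled. The subsampling fraction and run count are illustrative:

```python
# Consensus matrix from resampled clustering runs; entries near 0 or 1 indicate pairs that
# are consistently separated or consistently grouped, which underlies the stability check.
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, runs=50, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    return together / np.maximum(sampled, 1)
```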

1,831 citations


Journal ArticleDOI
TL;DR: TGICL is a pipeline for analysis of large Expressed Sequence Tags and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters to produce longer, more complete consensus sequences.
Abstract: TGICL is a pipeline for analysis of large Expressed Sequence Tags (EST) and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters (optionally with quality values) to produce longer, more complete consensus sequences. The system can run on multi-CPU architectures including SMP and PVM.

1,703 citations


Book
01 Jan 2003
TL;DR: This book introduces data mining concepts and techniques, covering core topics such as classification, clustering, and association rules, along with advanced topics (web, spatial, and temporal mining) and a survey of data mining products.
Abstract: I. INTRODUCTION. 1. Introduction. 2. Related Concepts. 3. Data Mining Techniques. II. CORE TOPICS. 4. Classification. 5. Clustering. 6. Association Rules. III. ADVANCED TOPICS. 7. Web Mining. 8. Spatial Mining. 9. Temporal Mining. IV. APPENDIX. 10. Data Mining Products.

Journal ArticleDOI
TL;DR: Novelty detection is extremely important in a multitude of applications, including signal processing, computer vision, pattern recognition, data mining, and robotics.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: The approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval by assuming that regions in an image can be described using a small vocabulary of blobs.
Abstract: Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor intensive procedure and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs. Blobs are generated from image features using clustering. Given a training set of images with annotations, we show that probabilistic models allow us to predict the probability of generating a word given the blobs in an image. This may be used to automatically annotate and retrieve images given a word as a query. We show that relevance models allow us to derive these probabilities in a natural way. Experiments show that the annotation performance of this cross-media relevance model is almost six times as good (in terms of mean precision) as a model based on word-blob co-occurrence, and twice as good as a state-of-the-art model derived from machine translation. Our approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval.

Journal ArticleDOI
TL;DR: The most exhaustive set of time series experiments ever attempted, re-implementing the contribution of more than two dozen papers, and testing them on 50 real world, highly diverse datasets support the claim that there is a need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.
Abstract: In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim. Much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offer an amount of “improvement” that would have been completely dwarfed by the variance that would have been observed by testing on many real world datasets, or the variance that would have been observed by changing minor (unstated) implementation details. To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contribution of more than two dozen papers, and testing them on 50 real world, highly diverse datasets. Our empirical results strongly support our assertion, and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.

Journal ArticleDOI
TL;DR: This article provides an overview of applications of cluster-sample methods, both to cluster samples and to panel data sets, noting how accounting for multi-level clustering can have dramatic effects on t statistics.
Abstract: Inference methods that recognize the clustering of individual observations have been available for more than 25 years. Brent Moulton (1990) caught the attention of economists when he demonstrated the serious biases that can result in estimating the effects of aggregate explanatory variables on individual-specific response variables. The source of the downward bias in the usual ordinary least-squares (OLS) standard errors is the presence of an unobserved, state-level effect in the error term. More recently, John Pepper (2002) showed how accounting for multi-level clustering can have dramatic effects on t statistics. While adjusting for clustering is much more common than it was 10 years ago, inference methods robust to cluster correlation are not used routinely across all relevant settings. In this paper, I provide an overview of applications of cluster-sample methods, both to cluster samples and to panel data sets.
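For reference, the cluster-robust ("sandwich") variance estimator that underlies much of this literature has the standard OLS form below; the notation (clusters g = 1, ..., G, regressor blocks X_g, residual vectors u_g) is ours and is not taken from the article itself:

```latex
% Standard cluster-robust (sandwich) variance estimator for OLS, notation ours:
\widehat{\operatorname{Var}}(\hat{\beta})
  = (X'X)^{-1}\left(\sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g\right)(X'X)^{-1}
% where X_g stacks the regressors and \hat{u}_g the OLS residuals of cluster g.
```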

Proceedings ArticleDOI
24 Aug 2003
TL;DR: This work presents an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages and demonstrates that the algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.
Abstract: Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory---the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.
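In symbols, the optimization the abstract describes can be paraphrased as follows, with row clustering and column clustering denoted by hatted variables (notation ours); since I(X;Y) is fixed by the data, maximizing the clustered mutual information is the same as minimizing the information loss:

```latex
% Information-theoretic co-clustering objective, paraphrased from the abstract:
\max_{\hat{X},\hat{Y}} I(\hat{X};\hat{Y})
  \quad\Longleftrightarrow\quad
\min_{\hat{X},\hat{Y}} \left[ I(X;Y) - I(\hat{X};\hat{Y}) \right]
% subject to fixed numbers of row clusters and column clusters.
```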

Journal ArticleDOI
Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi
15 Sep 2003
TL;DR: Evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors is reported.
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
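The NCD itself is simple to compute from compressed lengths; a tiny sketch using zlib as the compressor is below (the paper's tool supports several compressors and builds a quartet-based tree on top of the resulting distance matrix):

```python
# Normalized compression distance from compressed lengths of the inputs and their concatenation.
import zlib

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = (len(zlib.compress(s, 9)) for s in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# near-identical strings score close to 0; unrelated data scores close to 1
print(ncd(b"the quick brown fox" * 50, b"the quick brown fox" * 49 + b"!"))
print(ncd(b"the quick brown fox" * 50, bytes(range(256)) * 4))
```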

Journal ArticleDOI
TL;DR: This review focuses on application of statistical tools and techniques in analysis of genetic diversity at the intraspecific level in crop plants.
Abstract: Knowledge about germplasm diversity and genetic relationships among breeding materials could be an invaluable aid in crop improvement strategies. A number of methods are currently available for analysis of genetic diversity in germplasm accessions, breeding lines, and populations. These methods have relied on pedigree data, morphological data, agronomic performance data, biochemical data, and more recently molecular (DNA-based) data. For reasonably accurate and unbiased estimates of genetic diversity, adequate attention has to be devoted to (i) sampling strategies; (ii) utilization of various data sets on the basis of the understanding of their strengths and constraints; (iii) choice of genetic distance measure(s), clustering procedures, and other multivariate methods in analyses of data; and (iv) objective determination of genetic relationships. Judicious combination and utilization of statistical tools and techniques, such as bootstrapping, is vital for addressing complex issues related to data analysis and interpretation of results from different types of data sets, particularly through clustering procedures. This review focuses on application of statistical tools and techniques in analysis of genetic diversity at the intraspecific level in crop plants.

Proceedings Article
09 Dec 2003
TL;DR: A unified framework for extending Local Linear Embedding, Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling as well as for Spectral Clustering is provided.
Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data.

Proceedings ArticleDOI
Yu, Shi
13 Oct 2003
TL;DR: This work proposes a principled account on multiclass spectral clustering by solving a relaxed continuous optimization problem by eigen-decomposition and clarifying the role of eigenvectors as a generator of all optimal solutions through orthonormal transforms.
Abstract: We propose a principled account on multiclass spectral clustering. Given a discrete clustering formulation, we first solve a relaxed continuous optimization problem by eigen-decomposition. We clarify the role of eigenvectors as a generator of all optimal solutions through orthonormal transforms. We then solve an optimal discretization problem, which seeks a discrete solution closest to the continuous optima. The discretization is efficiently computed in an iterative fashion using singular value decomposition and non-maximum suppression. The resulting discrete solutions are nearly global-optimal. Our method is robust to random initialization and converges faster than other clustering methods. Experiments on real image segmentation are reported.

Proceedings ArticleDOI
23 Mar 2003
TL;DR: The Joint Clustering technique reduces computational cost by more than an order of magnitude, compared to the current state of the art techniques, allowing non-centralized implementation on mobile clients.
Abstract: We present a WLAN location determination technique, the Joint Clustering technique, that uses: (1) signal strength probability distributions to address the noisy wireless channel, and (2) clustering of locations to reduce the computational cost of searching the radio map. The Joint Clustering technique reduces computational cost by more than an order of magnitude, compared to the current state of the art techniques, allowing non-centralized implementation on mobile clients. Results from 802.11-equipped iPAQ implementations show that the new technique gives user location to within 7 feet with over 90% accuracy.

Journal ArticleDOI
TL;DR: This work describes a streaming algorithm that effectively clusters large data streams and provides empirical evidence of the algorithm's performance on synthetic and real data streams.
Abstract: The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.

Proceedings Article
09 Dec 2003
TL;DR: An improved algorithm for learning k while clustering based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution, which works well, and better than a recent method based on the BIC penalty for model complexity.
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model's complexity.
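A condensed sketch of the G-means loop as described: fit k-means, project each cluster's points onto their principal direction, test that projection for normality, and increase k when the test rejects. SciPy's Anderson-Darling test stands in for the paper's statistic, and the split and stopping details are simplified:

```python
# G-means-style selection of k: grow k until every cluster passes a Gaussianity test
# on the 1-D projection of its points onto the cluster's principal direction.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def looks_gaussian(points, significance_index=2):   # index 2 ~ 5% level in scipy's table
    if len(points) < 8:
        return True
    centered = points - points.mean(axis=0)
    direction = np.linalg.svd(centered, full_matrices=False)[2][0]   # principal axis
    proj = centered @ direction
    result = anderson(proj, dist="norm")
    return result.statistic < result.critical_values[significance_index]

def g_means(X, max_k=32):
    k = 1
    while k < max_k:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        needs_split = [c for c in range(k) if not looks_gaussian(X[km.labels_ == c])]
        if not needs_split:
            return km
        k += len(needs_split)
    return KMeans(n_clusters=max_k, n_init=10, random_state=0).fit(X)
```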

Proceedings ArticleDOI
08 Dec 2003
TL;DR: This paper proposes two new approaches to using PSO to cluster data, one which basically usesPSO to refine the clusters formed by K-means, and the other which uses PSO in a different way to seed the initial swarm.
Abstract: This paper proposes two new approaches to using PSO to cluster data. It is shown how PSO can be used to find the centroids of a user specified number of clusters. The algorithm is then extended to use K-means clustering to seed the initial swarm. This second algorithm basically uses PSO to refine the clusters formed by K-means. The new PSO algorithms are evaluated on six data sets, and compared to the performance of K-means clustering. Results show that both PSO clustering techniques have much potential.
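A bare-bones sketch of the first approach described (PSO searching directly for centroids): each particle encodes a full set of k centroids, fitness is the mean distance of points to their nearest centroid, and the usual inertia/cognitive/social velocity update moves the swarm. The constants and the fitness definition are illustrative; seeding one particle with a K-means solution, as in the second approach, would simply replace one initial position:

```python
# PSO-based clustering: particles are candidate centroid sets, scored by quantization error.
import numpy as np

def pso_cluster(X, k, n_particles=20, iters=50, w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pos = X[rng.integers(0, n, size=(n_particles, k))]          # init centroids from data points
    vel = np.zeros_like(pos)

    def fitness(centroids):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.min(axis=1).mean()                         # mean distance to nearest centroid

    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmin()].copy()
    return gbest                                                # best set of k centroids found
```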

Posted Content
TL;DR: The normalized compression distance (NCD), as discussed by the authors, is a similarity metric computed from compressed file lengths that approximates the universality of the normalized information distance (NID).
Abstract: We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric, normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

Proceedings Article
01 Jan 2003
TL;DR: A novel clustering technique that addresses problems with varying densities and high dimensionality, while the use of core points handles problems with shape and size, and a number of optimizations that allow the algorithm to handle large data sets are discussed.
Abstract: Finding clusters in data, especially high dimensional data, is challenging when the clusters are of widely differing shapes, sizes, and densities, and when the data contains noise and outliers. We present a novel clustering technique that addresses these issues. Our algorithm first finds the nearest neighbors of each data point and then redefines the similarity between pairs of points in terms of how many nearest neighbors the two points share. Using this definition of similarity, our algorithm identifies core points and then builds clusters around the core points. The use of a shared nearest neighbor definition of similarity alleviates problems with varying densities and high dimensionality, while the use of core points handles problems with shape and size. While our algorithm can find the "dense" clusters that other clustering algorithms find, it also finds clusters that these approaches overlook, i.e., clusters of low or medium density which represent relatively uniform regions "surrounded" by non-uniform or higher density areas. We experimentally show that our algorithm performs better than traditional methods (e.g., K-means, DBSCAN, CURE) on a variety of data sets: KDD Cup '99 network intrusion data, NASA Earth science time series data, two-dimensional point sets, and documents. The run-time complexity of our technique is O(n^2) if the similarity matrix has to be constructed. However, we discuss a number of optimizations that allow the algorithm to handle large data sets efficiently.
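A rough sketch of the shared-nearest-neighbor similarity and a simple core-point criterion, following the description above; the full algorithm's cluster-building, noise handling, and parameter choices are omitted, and the thresholds shown are illustrative:

```python
# Shared-nearest-neighbor (SNN) similarity: count common k-nearest neighbors between
# mutual kNN pairs, then pick core points from a simple SNN density proxy.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]     # drop each point itself
    sets = [set(row) for row in neigh]
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if j in sets[i] and i in sets[j]:                  # only mutual kNN pairs
                sim[i, j] = sim[j, i] = len(sets[i] & sets[j])
    return sim

def core_points(sim, density_threshold=5):
    density = (sim > 0).sum(axis=1)                            # a simple SNN density proxy
    return np.where(density >= density_threshold)[0]
```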

Proceedings ArticleDOI
24 Aug 2003
TL;DR: This work presents a method for k-means clustering when different sites contain different attributes for a common set of entities, where each site learns the cluster of each entity, but learns nothing about the attributes at other sites.
Abstract: Privacy and security concerns can prevent sharing of data, derailing data mining projects. Distributed knowledge discovery, if done correctly, can alleviate this problem. The key is to obtain valid results, while providing guarantees on the (non)disclosure of data. We present a method for k-means clustering when different sites contain different attributes for a common set of entities. Each site learns the cluster of each entity, but learns nothing about the attributes at other sites.

Proceedings Article
21 Aug 2003
TL;DR: Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data.
Abstract: We investigate how random projection can best be used for clustering high dimensional data. Random projection has been shown to have promising theoretical properties. In practice, however, we find that it results in highly unstable clustering performance. Our solution is to use random projection in a cluster ensemble approach. Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data. To gain insights into the performance improvement obtained by our ensemble method, we analyze and identify the influence of the quality and the diversity of the individual clustering solutions on the final ensemble performance.
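A short sketch of the pipeline described: cluster several random projections of the data and then combine the resulting labelings with a consensus function (for instance the co-association combiner sketched earlier in this listing). K-means stands in here for the paper's base clustering algorithm, and all dimensions and counts are illustrative:

```python
# Generate an ensemble of clusterings, each on a different random projection of the data.
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans

def random_projection_runs(X, k, n_runs=10, dim=5, seed=0):
    labelings = []
    for r in range(n_runs):
        Xp = GaussianRandomProjection(n_components=dim, random_state=seed + r).fit_transform(X)
        labelings.append(KMeans(n_clusters=k, n_init=10, random_state=r).fit_predict(Xp))
    return labelings   # feed these into a consensus / co-association combiner
```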

Proceedings ArticleDOI
09 Jul 2003
TL;DR: This paper considers the problem of power control when nodes are nonhomogeneously dispersed in space, and provides three solutions for joint clustering and power control, and establishes that all three protocols ensure that packets ultimately reach their intended destinations.
Abstract: In this paper, we consider the problem of power control when nodes are nonhomogeneously dispersed in space. In such situations, one seeks to employ per packet power control depending on the source and destination of the packet. This gives rise to a joint problem which involves not only power control but also clustering. We provide three solutions for joint clustering and power control. The first protocol, CLUSTERPOW, aims to increase the network capacity by increasing spatial reuse. We provide a simple and modular architecture to implement CLUSTERPOW at the network layer. The second, Tunnelled CLUSTERPOW, allows a finer optimization by using encapsulation, but we do not know of an efficient way to implement it. The last, MINPOW, whose basic idea is not new, provides an optimal routing solution with respect to the total power consumed in communication. Our contribution includes a clean implementation of MINPOW at the network layer without any physical layer support. We establish that all three protocols ensure that packets ultimately reach their intended destinations. We provide a software architectural framework for our implementation as a network layer protocol. The architecture works with any routing protocol, and can also be used to implement other power control schemes. Details of the implementation in Linux are provided.

Proceedings ArticleDOI
18 Jun 2003
TL;DR: Two appearance-based methods for clustering a set of images of 3D (three-dimensional) objects into disjoint subsets corresponding to individual objects, based on the concept of illumination cones and another affinity measure based on image gradient comparisons are introduced.
Abstract: We introduce two appearance-based methods for clustering a set of images of 3D (three-dimensional) objects, acquired under varying illumination conditions, into disjoint subsets corresponding to individual objects. The first algorithm is based on the concept of illumination cones. According to the theory, the clustering problem is equivalent to finding convex polyhedral cones in the high-dimensional image space. To efficiently determine the conic structures hidden in the image data, we introduce the concept of conic affinity, which measures the likelihood of a pair of images belonging to the same underlying polyhedral cone. For the second method, we introduce another affinity measure based on image gradient comparisons. The algorithm operates directly on the image gradients by comparing the magnitudes and orientations of the image gradient at each pixel. Both methods have clear geometric motivations, and they operate directly on the images without the need for feature extraction or computation of pixel statistics. We demonstrate experimentally that both algorithms are surprisingly effective in clustering images acquired under varying illumination conditions with two large, well-known image data sets.