
Showing papers on "Cluster analysis published in 2002"


Journal ArticleDOI
TL;DR: This work develops and analyzes low-energy adaptive clustering hierarchy (LEACH), a protocol architecture for microsensor networks that combines the ideas of energy-efficient cluster-based routing and media access together with application-specific data aggregation to achieve good performance in terms of system lifetime, latency, and application-perceived quality.
Abstract: Networking together hundreds or thousands of cheap microsensor nodes allows users to accurately monitor a remote environment by intelligently combining the data from the individual nodes. These networks require robust wireless communication protocols that are energy efficient and provide low latency. We develop and analyze low-energy adaptive clustering hierarchy (LEACH), a protocol architecture for microsensor networks that combines the ideas of energy-efficient cluster-based routing and media access together with application-specific data aggregation to achieve good performance in terms of system lifetime, latency, and application-perceived quality. LEACH includes a new, distributed cluster formation technique that enables self-organization of large numbers of nodes, algorithms for adapting clusters and rotating cluster head positions to evenly distribute the energy load among all the nodes, and techniques to enable distributed signal processing to save communication resources. Our results show that LEACH can improve system lifetime by an order of magnitude compared with general-purpose multihop approaches.
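The distributed, rotating cluster-head election LEACH describes can be pictured with a short sketch. This is an illustrative Python sketch with my own function names, not the authors' code; it assumes the paper's randomized rotation scheme, in which each node elects itself with a round-dependent threshold so the energy-hungry cluster-head role circulates.

```python
import random

def leach_threshold(p, r):
    """Election threshold for round r, where p is the desired fraction of
    cluster heads; nodes that served recently are excluded via `eligible`."""
    return p / (1 - p * (r % int(1 / p)))

def elect_cluster_heads(nodes, p, r, eligible):
    """Each eligible node independently becomes a cluster head with
    probability leach_threshold(p, r), so the role rotates over rounds."""
    t = leach_threshold(p, r)
    return [n for n in nodes if n in eligible and random.random() < t]
```

Over roughly 1/p rounds every node serves as cluster head about once, spreading the energy load evenly, which is one of the mechanisms behind the lifetime gains reported.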

10,296 citations


Journal ArticleDOI
TL;DR: This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.
Abstract: In k-means clustering, we are given a set of n data points in d-dimensional space ℝd and an integer k, and the problem is to determine a set of k points in ℝd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
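The Lloyd iteration that the filtering algorithm accelerates can be sketched in plain Python. This sketch omits the kd-tree (the paper's actual contribution) and only shows the assignment and centroid-update steps, with names of my choosing:

```python
import random

def lloyd_kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: repeat (assign each point to its nearest
    center, recompute each center as its cluster's mean). The filtering
    algorithm speeds up the assignment step with a kd-tree, omitted here."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest center by squared Euclidean distance
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # update step: move each center to its cluster mean (keep old center
        # if the cluster emptied)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers
```

On well-separated data the assignment step does less work per iteration in the filtered version, which is exactly the data-sensitive behaviour the analysis quantifies.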

5,288 citations


Journal ArticleDOI
TL;DR: This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled.
Abstract: Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent development...

4,123 citations


Proceedings Article
01 Jan 2002
TL;DR: This paper presents an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝn, learns a distance metric over ℝn that respects these relationships.
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many "plausible" ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider "similar." For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝn, learns a distance metric over ℝn that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
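Metrics of this kind are typically parameterized as d_A(x, y) = sqrt((x − y)ᵀ A (x − y)) for a positive semidefinite matrix A. As a hedged illustration, the helper below evaluates such a distance for a diagonal A, one natural special case of that parameterization; the function name and interface are mine:

```python
import math

def mahalanobis(x, y, a_diag):
    """Distance d_A(x, y) = sqrt((x-y)^T A (x-y)) for a diagonal metric
    matrix A given by its diagonal entries a_diag. With a_diag all ones this
    reduces to the ordinary Euclidean distance."""
    return math.sqrt(sum(a * (xi - yi) ** 2
                         for a, xi, yi in zip(a_diag, x, y)))
```

Learning then amounts to choosing A so that similar pairs end up close under d_A and dissimilar pairs far apart, which the paper casts as a convex program.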

3,176 citations


Journal ArticleDOI
TL;DR: The novelty of the approach is that it does not use a model selection criterion to choose one among a set of preestimated candidate models; instead, it seamlessly integrates estimation and model selection in a single algorithm.
Abstract: This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify to the good performance of our approach.

2,182 citations


Journal ArticleDOI
TL;DR: Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines.
Abstract: Summary: A versatile, platform-independent and easy-to-use Java suite for large-scale gene expression analysis was developed. Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines. The results of the clustering are transparent across all implemented methods and enable the analysis of the outcome of different algorithms and parameters. Additionally, mapping of gene expression data onto chromosomal sequences was implemented to enhance promoter analysis and investigation of transcriptional control mechanisms. Availability: http://genome.tugraz.at Contact: zlatko.trajanoski@tugraz.at

1,768 citations


Journal ArticleDOI
TL;DR: In this article, the authors present the results of a large library of cosmological N-body simulations, using power-law initial spectra, showing that, when transformed under the self-similarity scaling, the scale-free spectra define a nonlinear locus that is clearly shallower than would be required under stable clustering.
Abstract: We present the results of a large library of cosmological N-body simulations, using power-law initial spectra. The nonlinear evolution of the matter power spectra is compared with the predictions of existing analytic scaling formulae based on the work of Hamilton et al. The scaling approach has assumed that highly nonlinear structures obey `stable clustering' and are frozen in proper coordinates. Our results show that, when transformed under the self-similarity scaling, the scale-free spectra define a nonlinear locus that is clearly shallower than would be required under stable clustering. Furthermore, the small-scale nonlinear power increases as both the power-spectrum index n and the density parameter Omega decrease, and this evolution is not well accounted for by the previous scaling formulae. This breakdown of stable clustering can be understood as resulting from the modification of dark-matter haloes by continuing mergers. These effects are naturally included in the analytic `halo model' for nonlinear structure; using this approach we are able to fit both our scale-free results and also our previous CDM data. This approach is more accurate than the commonly used Peacock-Dodds formula and should be applicable to more general power spectra. Code to evaluate nonlinear power spectra using this method is available online; following publication, we will make the power-law simulation data available through the Virgo website.

1,693 citations


Journal ArticleDOI
TL;DR: An improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix is created, and a Python and a Perl interface to the C Clustering Library is generated, thereby combining the flexibility of a scripting language with the speed of C.
Abstract: SUMMARY We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. AVAILABILITY The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.

1,493 citations


Journal ArticleDOI
TL;DR: In this article, the authors identify two shortcomings of existing research on the clustering phenomenon and argue for the need to establish a specific theory of the cluster where learning occupies centre stage.
Abstract: A number of possible advantages of industry agglomeration—or spatial clustering—have been identified in the research literature, notably those related to shared costs for infrastructure, the build-up of a skilled labour force, transaction efficiency, and knowledge spillovers leading to firm learning and innovation. We identify two shortcomings of existing research on the clustering phenomenon. First, the abundance of theoretical concepts and explanations stands in sharp contrast with the general lack of work aimed at validating these mechanisms empirically and the contradictory evidence found in recent empirical work in the field. Second, there is still a lack of a unified theoretical framework for analyzing spatial clustering. In an attempt to remedy the latter shortcoming, this paper investigates the nature of the cluster from a knowledge-creation or learning perspective. We argue for the need to establish a specific theory of the cluster where learning occupies centre stage. The basic requirements for ...

1,454 citations


Journal ArticleDOI
TL;DR: An on-demand distributed clustering algorithm for multi-hop packet radio networks that takes into consideration the ideal degree, transmission power, mobility, and battery power of mobile nodes, and is aimed to reduce the computation and communication costs.
Abstract: In this paper, we propose an on-demand distributed clustering algorithm for multi-hop packet radio networks. These types of networks, also known as ad hoc networks, are dynamic in nature due to the mobility of nodes. The association and dissociation of nodes to and from clusters perturb the stability of the network topology, and hence a reconfiguration of the system is often unavoidable. However, it is vital to keep the topology stable as long as possible. The clusterheads, forming a dominant set in the network, determine the topology and its stability. The proposed weight-based distributed clustering algorithm takes into consideration the ideal degree, transmission power, mobility, and battery power of mobile nodes. The time required to identify the clusterheads depends on the diameter of the underlying graph. We try to keep the number of nodes in a cluster around a pre-defined threshold to facilitate the optimal operation of the medium access control (MAC) protocol. The non-periodic procedure for clusterhead election is invoked on demand and aims to reduce the computation and communication costs. The clusterheads, operating in "dual" power mode, connect the clusters, which helps in routing messages from a node to any other node. We observe a trade-off between the uniformity of the load handled by the clusterheads and the connectivity of the network. Simulation experiments are conducted to evaluate the performance of our algorithm in terms of the number of clusterheads, reaffiliation frequency, and dominant set updates. Results show that our algorithm performs better than existing ones and is also tunable to different kinds of network conditions.
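A weight-based election of this kind can be pictured as ranking nodes by a combined score over the four quantities the abstract lists. The sketch below is illustrative only: the field names and weighting factors are mine, not the paper's.

```python
def combined_weight(node, c1=0.7, c2=0.2, c3=0.05, c4=0.05):
    """Fitness of a node for clusterhead duty: a weighted sum of its degree
    difference from the ideal, sum of distances to neighbours, mobility, and
    consumed battery power. Lower weight means better suited. The factor
    values are illustrative placeholders."""
    return (c1 * node["degree_diff"] + c2 * node["dist_sum"]
            + c3 * node["mobility"] + c4 * node["battery_used"])

def elect(nodes, **factors):
    """Pick the node with the smallest combined weight as clusterhead."""
    return min(nodes, key=lambda n: combined_weight(n, **factors))
```

Tuning the factors trades one concern against another, e.g. emphasizing low mobility yields more stable but possibly less load-balanced clusters, which mirrors the trade-off the paper observes.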

1,419 citations


Journal ArticleDOI
TL;DR: In this paper, a Gaussian kernel based clustering method using support vector machines (SVM) is proposed to find the minimal enclosing sphere, which can separate into several components, each enclosing a separate cluster of points.
Abstract: We present a novel clustering method using the approach of support vector machines. Data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several components, each enclosing a separate cluster of points. We present a simple algorithm for identifying these clusters. The width of the Gaussian kernel controls the scale at which the data is probed while the soft margin constant helps cope with outliers and overlapping clusters. The structure of a dataset is explored by varying the two parameters, maintaining a minimal number of support vectors to assure smooth cluster boundaries. We demonstrate the performance of our algorithm on several datasets.

Journal ArticleDOI
TL;DR: This article evaluates the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely the Davies-Bouldin index, Dunn's index, the Calinski-Harabasz index, and a recently developed index I.
Abstract: In this article, we evaluate the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely the Davies-Bouldin index, Dunn's index, the Calinski-Harabasz index, and a recently developed index I. Based on a relation between the index I and Dunn's index, a lower bound on the value of the former is theoretically estimated in order to obtain a unique hard K-partition when the data set has distinct substructures. The effectiveness of the different validity indices and clustering methods in automatically evolving the appropriate number of clusters is demonstrated experimentally for both artificial and real-life data sets with the number of clusters varying from two to ten. Once the appropriate number of clusters is determined, the SA-based clustering technique is used for proper partitioning of the data into the said number of clusters.
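As a concrete example of a validity index, Dunn's index (one of the four evaluated) is the smallest inter-cluster distance divided by the largest intra-cluster distance; larger values indicate compact, well-separated clusters. A minimal 1-D Python sketch, with my own naming:

```python
def dunn_index(clusters):
    """Dunn's index for a partition given as lists of 1-D values:
    min inter-cluster distance / max intra-cluster distance.
    Higher is better (compact, well-separated clusters)."""
    def d(a, b):
        return abs(a - b)
    # smallest distance between points in different clusters
    inter = min(d(a, b)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for a in ci for b in cj)
    # largest distance between points in the same cluster (the diameter)
    diam = max(d(a, b) for c in clusters for a in c for b in c)
    return inter / diam
```

Sweeping the number of clusters and taking the partition that maximizes such an index is the basic mechanism by which validity indices "evolve" the appropriate cluster count.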

Journal ArticleDOI
TL;DR: A new clustering method is proposed, called CLARANS, whose aim is to identify spatial structures that may be present in the data, and two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes are developed.
Abstract: Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.

Proceedings ArticleDOI
10 Dec 2002
TL;DR: A communication protocol named LEACH (low-energy adaptive clustering hierarchy) is modified and its stochastic cluster-head selection algorithm is extended by a deterministic component to reduce the power consumption of wireless microsensor networks.
Abstract: This paper focuses on reducing the power consumption of wireless microsensor networks. Therefore, a communication protocol named LEACH (low-energy adaptive clustering hierarchy) is modified. We extend LEACH's stochastic cluster-head selection algorithm by a deterministic component. Depending on the network configuration an increase of network lifetime by about 30% can be accomplished. Furthermore, we present a new approach to define lifetime of microsensor networks using three new metrics FND (First Node Dies), HNA (Half of the Nodes Alive), and LND (Last Node Dies).
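The three lifetime metrics can be computed directly from the round in which each node dies. A small sketch; note that the exact HNA convention (taken here as the round of the ⌈n/2⌉-th death, i.e. when half the nodes remain alive) is my assumption, not necessarily the paper's definition:

```python
def lifetime_metrics(death_rounds):
    """Given the round in which each sensor node died, compute the three
    network-lifetime metrics: FND (first node dies), HNA (half of the nodes
    alive; here the round of the ceil(n/2)-th death), LND (last node dies)."""
    rounds = sorted(death_rounds)
    n = len(rounds)
    return {"FND": rounds[0],
            "HNA": rounds[(n + 1) // 2 - 1],
            "LND": rounds[-1]}
```

Reporting all three gives a fuller picture than a single "lifetime" number: FND matters when every node is critical, LND when any surviving node keeps the application useful.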

Journal ArticleDOI
TL;DR: The standard scale-free network model is extended to include a "triad formation step" and the clustering coefficient is shown to be tunable simply by changing a control parameter---the average number of triad formation trials per time step.
Abstract: We extend the standard scale-free network model to include a "triad formation step." We analyze the geometric properties of networks generated by this algorithm both analytically and by numerical calculations, and find that our model possesses the same characteristics as the standard scale-free networks, such as the power-law degree distribution and the small average geodesic length, but with high clustering at the same time. In our model, the clustering coefficient is also shown to be tunable simply by changing a control parameter: the average number of triad formation trials per time step.
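A growth process of this flavor, preferential attachment plus an occasional triangle-closing step, can be sketched as below. This is a loose illustration under my own seed-clique initialization and parameter names, not the authors' exact model:

```python
import random

def triad_pa_network(n, m, p_triad, seed=0):
    """Grow a graph to n nodes; each new node attaches to m targets. A target
    is chosen by preferential attachment (degree-weighted `stubs` pool), but
    with probability p_triad the next edge instead goes to a neighbour of the
    previous target, closing a triangle and raising the clustering."""
    rng = random.Random(seed)
    adj = {i: set(range(m + 1)) - {i} for i in range(m + 1)}  # seed clique
    stubs = [i for i in adj for _ in adj[i]]  # each node appears degree times
    for v in range(m + 1, n):
        adj[v] = set()
        last = None
        while len(adj[v]) < m:
            if last is not None and rng.random() < p_triad:
                cand = [u for u in adj[last] if u != v and u not in adj[v]]
            else:
                cand = []
            t = rng.choice(cand) if cand else rng.choice(stubs)
            if t == v or t in adj[v]:
                continue  # resample on self-loops or duplicate edges
            adj[v].add(t)
            adj[t].add(v)
            stubs += [v, t]
            last = t
    return adj
```

Raising `p_triad` increases the clustering coefficient while leaving the degree-weighted attachment, and hence the power-law degree distribution, in place, which is the tunability the abstract describes.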

Book
01 Jan 2002
TL;DR: This chapter discusses Parameter Optimisation Algorithms, Density Modelling and Clustering, Single-Layer Networks, and Radial Basis Functions.
Abstract: Introduction.- Parameter Optimisation Algorithms.- Density Modelling and Clustering.- Single-Layer Networks.- The Multi-Layer Perceptron.- Radial Basis Functions.- Visualisation and Latent Variable Models.- Sampling.- Bayesian Techniques.- Gaussian Processes.- Linear Algebra and Matrices.- Algorithm Error Analysis.- Function Index.- Subject Index.

Journal ArticleDOI
TL;DR: This paper compares the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets and provides evidence that Ant-Miner is competitive with CN2 with respect to predictive accuracy and that the rule lists it discovers are considerably simpler than those discovered by CN2.
Abstract: The paper proposes an algorithm for data mining called Ant-Miner (ant-colony-based data miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is inspired by both research on the behavior of real ant colonies and some data mining concepts as well as principles. We compare the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets. The results provide evidence that: 1) Ant-Miner is competitive with CN2 with respect to predictive accuracy, and 2) the rule lists discovered by Ant-Miner are considerably simpler (smaller) than those discovered by CN2.

Journal ArticleDOI
TL;DR: In this paper, the performance of many available graphical and statistical methodologies used to classify water samples is compared, including: Collins bar diagram, pie diagram, Stiff pattern diagram, Schoeller plot, Piper diagram, Q-mode hierarchical cluster analysis, K-means clustering, principal components analysis, and fuzzy k-means clustering.
Abstract: A robust classification scheme for partitioning water chemistry samples into homogeneous groups is an important tool for the characterization of hydrologic systems. In this paper we test the performance of the many available graphical and statistical methodologies used to classify water samples including: Collins bar diagram, pie diagram, Stiff pattern diagram, Schoeller plot, Piper diagram, Q-mode hierarchical cluster analysis, K-means clustering, principal components analysis, and fuzzy k-means clustering. All the methods are discussed and compared as to their ability to cluster, ease of use, and ease of interpretation. In addition, several issues related to data preparation, database editing, data-gap filling, data screening, and data quality assurance are discussed and a database construction methodology is presented.

Journal ArticleDOI
TL;DR: It is shown that the eigenvectors of a kernel matrix which defines the implicit mapping provide a means to estimate the number of clusters inherent within the data, and a computationally simple iterative procedure is presented for the subsequent feature space partitioning of the data.
Abstract: The article presents a method for both the unsupervised partitioning of a sample of data and the estimation of the possible number of inherent clusters which generate the data. This work exploits the notion that performing a nonlinear data transformation into some high-dimensional feature space increases the probability of the linear separability of the patterns within the transformed space and therefore simplifies the associated data structure. It is shown that the eigenvectors of a kernel matrix which defines the implicit mapping provide a means to estimate the number of clusters inherent within the data, and a computationally simple iterative procedure is presented for the subsequent feature space partitioning of the data.

Journal ArticleDOI
TL;DR: The approach assigns genes to context-dependent and potentially overlapping 'transcription modules', thus overcoming the main limitations of traditional clustering methods, and uses the method to elucidate regulatory properties of cellular pathways and to characterize cis-regulatory elements.
Abstract: Standard clustering methods can classify genes successfully when applied to relatively small data sets, but have limited use in the analysis of large-scale expression data, mainly owing to their assignment of a gene to a single cluster. Here we propose an alternative method for the global analysis of genome-wide expression data. Our approach assigns genes to context-dependent and potentially overlapping ‘transcription modules’, thus overcoming the main limitations of traditional clustering methods. We use our method to elucidate regulatory properties of cellular pathways and to characterize cis-regulatory elements. By applying our algorithm systematically to all of the available expression data on Saccharomyces cerevisiae, we identify a comprehensive set of overlapping transcriptional modules. Our results provide functional predictions for numerous genes, identify relations between modules and present a global view on the transcriptional network.

Journal ArticleDOI
TL;DR: A new prediction-based resampling method, Clest, is developed, to estimate the number of clusters in a dataset, and was generally found to be more accurate and robust than the six existing methods considered in the study.
Abstract: Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: to estimate the number of clusters, if any, in a dataset; and to allocate tumor samples to these clusters, and assess the confidence of cluster assignments for individual samples. Here we address the first of these problems. We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study. Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.

01 Jan 2002
TL;DR: The authors compare these two approaches using data simulated from a setting where true group membership is known; the results indicate that LC substantially outperforms the K-means technique.
Abstract: Recent developments in latent class (LC) analysis and associated software to include continuous variables offer a model-based alternative to more traditional clustering approaches such as K-means. In this paper, the authors compare these two approaches using data simulated from a setting where true group membership is known. The authors choose a setting favourable to K-means by simulating data according to the assumptions made in both discriminant analysis (DISC) and K-means clustering. Since the information on true group membership is used in DISC but not in clustering approaches in general, the authors use the results obtained from DISC as a gold standard in determining an upper bound on the best possible outcome that might be expected from a clustering technique. The results indicate that LC substantially outperforms the K-means technique. A truly surprising result is that the LC performance is so good that it is virtually indistinguishable from the performance of DISC.

Journal ArticleDOI
TL;DR: It is found that the connectivity structure of the Internet presents statistical distributions settled in a well-defined stationary state and the large-scale properties are characterized by a scale-free topology consistent with previous observations.
Abstract: We study the large-scale topological and dynamical properties of real Internet maps at the autonomous system level, collected in a 3-yr time interval. We find that the connectivity structure of the Internet presents statistical distributions settled in a well-defined stationary state. The large-scale properties are characterized by a scale-free topology consistent with previous observations. Correlation functions and clustering coefficients exhibit a remarkable structure due to the underlying hierarchical organization of the Internet. The study of the Internet time evolution shows a growth dynamics with aging features typical of recently proposed growing network models. We compare the properties of growing network models with the present real Internet data analysis.

Journal ArticleDOI
TL;DR: The use of cluster analysis as an exploratory data analysis tool requires a powerful program system to test different data preparation, processing and clustering methods, including the ability to present the results in a number of easy-to-grasp graphics.

01 Jan 2002
TL;DR: This paper undertakes the first extensive review and empirical comparison of all proposed techniques for mining time series data and introduces a novel algorithm that is empirically shown to be superior to all others in the literature.
Abstract: In recent years, there has been an explosion of interest in mining time series databases. As with most computer science problems, representation of the data is the key to efficient and effective solutions. One of the most commonly used representations is piecewise linear approximation. This representation has been used by various researchers to support clustering, classification, indexing and association rule mining of time series data. A variety of algorithms have been proposed to obtain this representation, with several algorithms having been independently rediscovered several times. In this paper, we undertake the first extensive review and empirical comparison of all proposed techniques. We show that all these algorithms have fatal flaws from a data mining perspective. We introduce a novel algorithm that we empirically show to be superior to all others in the literature.
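Of the basic segmentation approaches such a review covers, the sliding-window scheme is the easiest to sketch: grow a segment until the straight line between its endpoints misses some point by more than a tolerance, then start a new segment. The following is an illustrative sketch under my own naming, not the paper's algorithm:

```python
def sliding_window_pla(series, max_error):
    """Sliding-window piecewise linear approximation of a 1-D series.
    Segments are returned as lists of values; consecutive segments share
    their boundary point's successor as the new start."""
    def seg_error(seg):
        # max vertical deviation from the line through the segment endpoints
        n = len(seg) - 1
        if n == 0:
            return 0.0
        y0, y1 = seg[0], seg[-1]
        return max(abs(y - (y0 + (y1 - y0) * i / n)) for i, y in enumerate(seg))
    segments, start, end = [], 0, 2
    while end <= len(series):
        if seg_error(series[start:end]) > max_error:
            segments.append(series[start:end - 1])  # close the segment
            start, end = end - 1, end + 1           # restart at breaking point
        else:
            end += 1
    segments.append(series[start:])
    return segments
```

Sliding-window is greedy and online; the bottom-up and top-down families the review compares trade that streaming property for globally better segmentations.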

Proceedings ArticleDOI
04 Nov 2002
TL;DR: It is suggested that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance.
Abstract: Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.

Proceedings ArticleDOI
12 May 2002
TL;DR: The main features of the adaptation of an immune network model include: automatic determination of the population size, combination of local with global search, defined convergence criterion, and capability of locating and maintaining stable local optima solutions.
Abstract: This paper presents the adaptation of an immune network model, originally proposed to perform information compression and data clustering, to solve multimodal function optimization problems. The algorithm is described theoretically and empirically compared with similar approaches from the literature. The main features of the algorithm include: automatic determination of the population size, combination of local with global search (exploitation plus exploration of the fitness landscape), defined convergence criterion, and capability of locating and maintaining stable local optima solutions.

Proceedings ArticleDOI
05 Jun 2002
TL;DR: This work considers the question of whether there exists a simple and practical approximation algorithm for k-means clustering, and presents a local improvement heuristic based on swapping centers in and out that yields a (9+ε)-approximation algorithm.
Abstract: In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the extremely high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We show that the approximation factor is almost tight, by giving an example for which the algorithm achieves an approximation factor of (9-ε). To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
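The core move of the heuristic, swapping one center out and one candidate in whenever that lowers the cost, can be sketched as follows. This is a simplified single-swap local search restricted to data points as candidate centers, written for clarity rather than efficiency; it is not the paper's analyzed algorithm, and the seeding and tie-breaking choices are our own:

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum())

def single_swap_heuristic(points, k, seed=0):
    """Greedy 1-swap local search over data points as candidate centers.

    Repeatedly replaces one current center with one data point whenever
    the swap strictly lowers the k-means cost; stops at a local optimum.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                trial = centers.copy()
                trial[i] = p
                # strict-improvement threshold avoids cycling on ties
                if kmeans_cost(points, trial) < kmeans_cost(points, centers) - 1e-12:
                    centers = trial
                    improved = True
    return centers
```

In the paper the swap step is combined with Lloyd's iterations, which refine each center to its cluster's continuous centroid between swaps; only the pure swap search is shown here.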

Journal ArticleDOI
TL;DR: The DDMCMC paradigm provides a unifying framework in which the role of many existing segmentation algorithms are revealed as either realizing Markov chain dynamics or computing importance proposal probabilities and generalizes these segmentation methods in a principled way.
Abstract: This paper presents a computational paradigm called Data-Driven Markov Chain Monte Carlo (DDMCMC) for image segmentation in the Bayesian statistical framework. The paper contributes to image segmentation in four aspects. First, it designs efficient and well-balanced Markov Chain dynamics to explore the complex solution space and, thus, achieves a nearly global optimal solution independent of initial segmentations. Second, it presents a mathematical principle and a K-adventurers algorithm for computing multiple distinct solutions from the Markov chain sequence and, thus, it incorporates intrinsic ambiguities in image segmentation. Third, it utilizes data-driven (bottom-up) techniques, such as clustering and edge detection, to compute importance proposal probabilities, which drive the Markov chain dynamics and achieve tremendous speedup in comparison to the traditional jump-diffusion methods. Fourth, the DDMCMC paradigm provides a unifying framework in which the roles of many existing segmentation algorithms, such as edge detection, clustering, region growing, split-merge, snake/balloon, and region competition, are revealed as either realizing Markov chain dynamics or computing importance proposal probabilities. Thus, the DDMCMC paradigm combines and generalizes these segmentation methods in a principled way. The DDMCMC paradigm adopts seven parametric and nonparametric image models for intensity and color at various regions. We test the DDMCMC paradigm extensively on both color and gray-level images and some results are reported in this paper.

Proceedings ArticleDOI
07 Aug 2002
TL;DR: This work describes a streaming algorithm that effectively clusters large data streams and provides empirical evidence of the algorithm's performance on synthetic and real data streams.
Abstract: Streaming data analysis has recently attracted attention in numerous applications including telephone records, Web documents and click streams. For such analysis, single-pass algorithms that consume a small amount of memory are critical. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
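To illustrate the single-pass, small-memory constraint the abstract emphasizes: a common baseline for streaming clustering is sequential (online) k-means, which keeps only k centroids and their counts, so memory stays O(k) regardless of stream length. The sketch below is this generic baseline, not the paper's algorithm; the class name and first-k-points seeding are our own choices:

```python
import numpy as np

class OnlineKMeans:
    """Single-pass (sequential) k-means: each arriving point only
    nudges its nearest centroid, so memory stays O(k)."""

    def __init__(self, k):
        self.k = k
        self.centers = None   # k x dim array once seeded
        self.counts = None    # points assigned to each centroid so far
        self._seed = []

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.centers is None:
            # seed the centroids with the first k points of the stream
            self._seed.append(x)
            if len(self._seed) == self.k:
                self.centers = np.stack(self._seed)
                self.counts = np.ones(self.k, dtype=int)
            return
        i = int(((self.centers - x) ** 2).sum(axis=1).argmin())
        self.counts[i] += 1
        # running-mean update pulls the winning centroid toward x
        self.centers[i] += (x - self.centers[i]) / self.counts[i]
```

Each point is touched exactly once and then discarded, which is the defining property of the streaming setting; more sophisticated algorithms of the kind the paper describes add provable cost guarantees on top of this one-pass discipline.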