
Showing papers on "CURE data clustering algorithm" published in 2001


Proceedings Article
03 Jan 2001
TL;DR: A simple spectral clustering algorithm that can be implemented using a few lines of Matlab is presented, and tools from matrix perturbation theory are used to analyze the algorithm, and give conditions under which it can be expected to do well.
Abstract: Despite many empirical successes of spectral clustering methods—algorithms that cluster points using eigenvectors of matrices derived from the data—there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.
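The recipe the paper analyzes is short enough to sketch directly. Below is a minimal NumPy version of the standard normalized spectral clustering steps (Gaussian affinity, symmetric normalization, top-k eigenvectors, row renormalization, k-means on the embedded rows); the kernel width sigma is a hand-picked illustrative parameter, not a value from the paper:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Gaussian affinity A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), A_ii = 0.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Symmetric normalization L = D^(-1/2) A D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Embed each point as a row over the k largest eigenvectors,
    # renormalize the rows, and run ordinary k-means on them.
    _, vecs = eigh(L)              # eigenvalues in ascending order
    V = vecs[:, -k:]
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```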

9,043 citations


Journal ArticleDOI
02 Dec 2001
TL;DR: This paper introduces the fundamental concepts of clustering, surveys the widely known clustering algorithms in a comparative way, and illustrates the issues that recent algorithms leave under-addressed.
Abstract: Cluster analysis aims at identifying groups of similar objects and, therefore, helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business and social sciences. Especially in recent years, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper introduces the fundamental concepts of clustering while it surveys the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of the clustering process regarding the quality assessment of the clustering results. This is also related to the inherent features of the data set under study. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper illustrates the issues that are under-addressed by recent algorithms and gives the trends in the clustering process.

2,643 citations


Proceedings Article
28 Jun 2001
TL;DR: This paper demonstrates how the popular k-means clustering algorithm can be profitably modified to make use of information about the problem domain that is available in addition to the data instances themselves.
Abstract: Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.
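A rough sketch of how the assignment step of k-means can be made constraint-aware, in the spirit of this paper: each point is assigned to the nearest cluster that does not violate a must-link or cannot-link constraint against already-placed points. The function name and the handling of infeasible points are simplifications in this sketch, not the paper's exact procedure:

```python
import numpy as np

def constrained_kmeans(X, k, must_link, cannot_link, n_iter=100, seed=0):
    # Symmetrize the constraint lists so each pair is checked both ways.
    must_link = must_link + [(j, i) for i, j in must_link]
    cannot_link = cannot_link + [(j, i) for i, j in cannot_link]
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new = np.full(len(X), -1)
        for i, x in enumerate(X):
            # Try clusters nearest-first, skipping assignments that would
            # violate a constraint against an already-placed point.
            for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
                if all(new[j] in (-1, c) for a, j in must_link if a == i) and \
                   all(new[j] != c for a, j in cannot_link if a == i):
                    new[i] = c
                    break
        if np.array_equal(new, labels):
            break                           # assignments stabilized
        labels = new
        centers = np.array([X[labels == c].mean(axis=0)
                            if np.any(labels == c) else centers[c]
                            for c in range(k)])
    return labels
```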

2,641 citations


Journal ArticleDOI
TL;DR: The empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality; the authors would not recommend PCA before clustering except in special circumstances.
Abstract: Motivation: There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real and synthetic gene expression data sets, we compared the quality of clusters obtained from the original data to the quality of clusters obtained after projecting onto subsets of the principal component axes. Results: Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances.
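The paper's comparison is easy to reproduce in miniature with scikit-learn: cluster the raw variables and the first few principal components, then score each partition against known labels. The dataset, cluster count, and number of components below are illustrative stand-ins for the gene expression sets used in the paper:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

# Cluster the original variables.
raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster the projection onto the first two principal components.
pcs = PCA(n_components=2).fit_transform(X)
pc = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)

print("ARI on raw variables :", adjusted_rand_score(y, raw))
print("ARI on first two PCs :", adjusted_rand_score(y, pc))
```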

1,134 citations


Proceedings ArticleDOI
29 Nov 2001
TL;DR: This paper proposes a new algorithm for graph partitioning with an objective function that follows the min-max clustering principle, and demonstrates that a linearized search order based on linkage differential is better than that based on the Fiedler vector, providing another effective partitioning method.
Abstract: An important application of graph partitioning is data clustering using a graph model - the pairwise similarities between all data objects form a weighted graph adjacency matrix that contains all necessary information for clustering. In this paper, we propose a new algorithm for graph partitioning with an objective function that follows the min-max clustering principle. The relaxed version of the optimization of the min-max cut objective function leads to the Fiedler vector in spectral graph partitioning. Theoretical analyses of min-max cut indicate that it leads to balanced partitions, and lower bounds are derived. The min-max cut algorithm is tested on newsgroup data sets and is found to out-perform other current popular partitioning/clustering methods. The linkage-based refinements to the algorithm further improve the quality of clustering substantially. We also demonstrate that a linearized search order based on linkage differential is better than that based on the Fiedler vector, providing another effective partitioning method.
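The spectral relaxation mentioned above can be sketched in a few lines: the second eigenvector (Fiedler vector) of the generalized problem (D - W)x = lambda D x suggests a two-way split. Thresholding the vector at zero, as below, is the crudest cut-point choice; the paper's linkage-based search order refines it. The example similarity matrix is made up:

```python
import numpy as np
from scipy.linalg import eigh

def fiedler_bipartition(W):
    """Two-way split from the Fiedler vector of (D - W) x = lambda D x."""
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)       # generalized symmetric eigenproblem
    return vecs[:, 1] >= 0         # sign of the Fiedler vector

# Two 3-node cliques joined by one weak edge.
W = np.array([[0, 1, 1, .1, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [.1, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(fiedler_bipartition(W))      # splits the two cliques apart
```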

903 citations


Journal ArticleDOI
TL;DR: The model-based approach has superior performance on synthetic data sets, consistently selecting the correct model and the number of clusters; the validity of the Gaussian mixture assumption is also explored on different transformations of real data.
Abstract: Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a ‘good’ clustering method and determining the ‘correct’ number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications. Results: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption by assessing the degree to which these real gene expression data sets fit multivariate Gaussian distributions, both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits. Availability: MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development.
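A scikit-learn sketch that loosely mirrors this model-based workflow: fit Gaussian mixtures over a grid of cluster counts and covariance models and let BIC choose both at once. The synthetic data and the model grid are illustrative, and sklearn's GaussianMixture stands in for MCLUST:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Grid over number of components and covariance structure;
# BIC picks both the model and the number of clusters.
best = min(
    (GaussianMixture(n_components=k, covariance_type=cov,
                     random_state=0).fit(X)
     for k in range(1, 8)
     for cov in ("spherical", "diag", "tied", "full")),
    key=lambda m: m.bic(X),
)
print(best.n_components, best.covariance_type)
```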

890 citations


Journal ArticleDOI
TL;DR: It is demonstrated that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

778 citations


Journal ArticleDOI
TL;DR: This work provides a systematic framework for assessing the results of clustering algorithms for gene expression data sets by applying a clustering algorithm to the data from all but one experimental condition.
Abstract: Motivation: Many clustering algorithms have been proposed for the analysis of gene expression data, but little guidance is available to help choose among them. We provide a systematic framework for assessing the results of clustering algorithms. Clustering algorithms attempt to partition the genes into groups exhibiting similar patterns of variation in expression level. Our methodology is to apply a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters—meaningful clusters should exhibit less variation in the remaining condition than clusters formed by chance. Results: We successfully applied our methodology to compare six clustering algorithms on four gene expression data sets. We found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality. Availability: The software is under development.
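The leave-one-condition-out scheme is straightforward to sketch: cluster the genes on all columns but one, then measure the within-cluster spread of the held-out column (lower means more predictive clusters). In this sketch k-means stands in for an arbitrary clustering algorithm, and the aggregate score follows the general shape of a figure of merit rather than the paper's exact definition:

```python
import numpy as np
from sklearn.cluster import KMeans

def figure_of_merit(X, k):
    """X: genes x conditions. Sum over conditions of held-out spread."""
    n, m = X.shape
    total = 0.0
    for e in range(m):
        train = np.delete(X, e, axis=1)      # cluster on all but column e
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(train)
        # Within-cluster squared deviation on the held-out column.
        sq = sum(((X[labels == c, e] - X[labels == c, e].mean()) ** 2).sum()
                 for c in np.unique(labels))
        total += np.sqrt(sq / n)
    return total
```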

707 citations


Proceedings ArticleDOI
Tom Chiu, Dongping Fang, John Chen, Yao Wang, Christopher Jeris
26 Aug 2001
TL;DR: A distance measure is proposed that enables clustering data with both continuous and categorical attributes and is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them.
Abstract: Clustering is a widely used technique in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either continuous or categorical attributes. However, datasets with mixed types of attributes are common in real life data mining problems. In this paper, we propose a distance measure that enables clustering data with both continuous and categorical attributes. This distance measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculation of this measure is memory efficient as it depends only on the merging cluster pair and not on all the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm using our distance measure based on the framework of BIRCH. Similar to BIRCH, our algorithm first performs a pre-clustering step by scanning the entire dataset and storing the dense regions of data records in terms of summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from the ability to handle mixed types of attributes, our algorithm differs from BIRCH in that we add a procedure that enables the algorithm to automatically determine the appropriate number of clusters, and a new strategy for assigning cluster membership to noisy data. For data with mixed types of attributes, our experimental results confirm that the algorithm not only generates better quality clusters than the traditional k-means algorithm, but also exhibits good scalability properties and is able to identify the underlying number of clusters in the data correctly. The algorithm is implemented in the commercial data mining tool Clementine 6.0 which supports the PMML standard of data mining model deployment.

686 citations


Proceedings ArticleDOI
01 Jan 2001
TL;DR: This paper presents a clustering scheme to create a hierarchical control structure for multi-hop wireless networks, together with an efficient distributed implementation of the clustering algorithm that lets a set of wireless nodes create the desired clusters.
Abstract: In this paper we present a clustering scheme to create a hierarchical control structure for multi-hop wireless networks. A cluster is defined as a subset of vertices, whose induced graph is connected. In addition, a cluster is required to obey certain constraints that are useful for management and scalability of the hierarchy. All these constraints cannot be met simultaneously for general graphs, but we show how such a clustering can be obtained for wireless network topologies. Finally, we present an efficient distributed implementation of our clustering algorithm for a set of wireless nodes to create the set of desired clusters.

616 citations


Proceedings ArticleDOI
01 Dec 2001
TL;DR: The method can be used with any clustering algorithm and provides a means of rationally defining an optimum number of clusters, and can also detect the lack of structure in data.
Abstract: We present a method for visually and quantitatively assessing the presence of structure in clustered data. The method exploits measurements of the stability of clustering solutions obtained by perturbing the data set. Stability is characterized by the distribution of pairwise similarities between clusterings obtained from sub samples of the data. High pairwise similarities indicate a stable clustering pattern. The method can be used with any clustering algorithm; it provides a means of rationally defining an optimum number of clusters, and can also detect the lack of structure in data. We show results on artificial and microarray data using a hierarchical clustering algorithm.
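A sketch of the subsampling idea: draw pairs of subsamples, cluster each, and compare the two labelings on the points they share; consistently high similarity suggests the clustering pattern is stable. The clusterer, subsample fraction, and the use of the adjusted Rand index (rather than the paper's own similarity measure) are illustrative substitutions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        la = AgglomerativeClustering(n_clusters=k).fit_predict(X[a])
        lb = AgglomerativeClustering(n_clusters=k).fit_predict(X[b])
        # Compare the two clusterings on the points both subsamples share.
        _, ia, ib = np.intersect1d(a, b, return_indices=True)
        scores.append(adjusted_rand_score(la[ia], lb[ib]))
    return float(np.mean(scores))   # high mean similarity = stable k
```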

Journal ArticleDOI
TL;DR: This work presents the first practical algorithm for the optimal linear leaf ordering of trees that are generated by hierarchical clustering, and shows how optimal leaf ordering can reveal biological structure that is not observed with an existing heuristic ordering method.
Abstract: We present the first practical algorithm for the optimal linear leaf ordering of trees that are generated by hierarchical clustering. Hierarchical clustering has been extensively used to analyze gene expression data, and we show how optimal leaf ordering can reveal biological structure that is not observed with an existing heuristic ordering method. For a tree with n leaves, there are 2^(n-1) linear orderings consistent with the structure of the tree. Our optimal leaf ordering algorithm runs in time O(n^4), and we present further improvements that make the running time of our algorithm practical.
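SciPy's hierarchy module now ships an optimal leaf ordering routine for linkage trees, so the technique can be exercised in a few lines; the random data here is purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage, optimal_leaf_ordering
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(20, 5))
d = pdist(X)                        # condensed pairwise distances
Z = linkage(d, method="average")    # hierarchical clustering tree
Z_opt = optimal_leaf_ordering(Z, d) # reorder leaves for similarity
print(leaves_list(Z_opt))           # leaf order after optimization
```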

Proceedings ArticleDOI
29 Nov 2001
TL;DR: A clustering validity procedure is presented, which evaluates the results of clustering algorithms on data sets and defines a validity index, S_Dbw, based on well-defined clustering criteria, enabling the selection of optimal input parameter values for a clustering algorithm that result in the best partitioning of a data set.
Abstract: Clustering is a mostly unsupervised procedure and the majority of clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation regarding its validity. In this paper we present a clustering validity procedure, which evaluates the results of clustering algorithms on data sets. We define a validity index, S_Dbw, based on well-defined clustering criteria enabling the selection of optimal input parameter values for a clustering algorithm that result in the best partitioning of a data set. We evaluate the reliability of our index both theoretically and experimentally, considering three representative clustering algorithms run on synthetic and real data sets. We also carried out an evaluation study to compare S_Dbw performance with other known validity indices. Our approach performed favorably in all cases, even those in which other indices failed to indicate the correct partitions in a data set.

Journal ArticleDOI
TL;DR: In this article, a modified version of the K-means algorithm that adopts a novel nonmetric distance measure based on the idea of "point symmetry" is proposed; this distance measure can be applied in data clustering and human face detection.
Abstract: We propose a modified version of the K-means algorithm to cluster data. The proposed algorithm adopts a novel nonmetric distance measure based on the idea of "point symmetry". This kind of "point symmetry distance" can be applied in data clustering and human face detection. Several data sets are used to illustrate its effectiveness.
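One common formulation of a point symmetry distance scores a point x against a centre c by how well the reflection of x through c is matched by some other data point; the paper's exact weighting may differ from this NumPy sketch. Assigning each point to the centre that minimizes this quantity gives the modified assignment step:

```python
import numpy as np

def point_symmetry_distance(x, c, X):
    """Small when some point in X lies near the mirror image of x through c."""
    v = x - c
    u = X - c                              # vectors from c to every point
    num = np.linalg.norm(v + u, axis=1)    # how far from perfect symmetry
    den = np.linalg.norm(v) + np.linalg.norm(u, axis=1)
    ratio = num / np.maximum(den, 1e-12)
    # Exclude x itself as its own mirror candidate.
    ratio[np.all(np.isclose(X, x), axis=1)] = np.inf
    return ratio.min()
```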

Journal ArticleDOI
TL;DR: A two-phase clustering algorithm for outlier detection is proposed, which first modifies the traditional k-means algorithm in Phase 1 by using the heuristic "if a new input pattern is far enough away from all clusters' centers, then assign it as a new cluster center".
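The quoted Phase-1 heuristic is a one-pass rule and is easy to sketch: scan the points once and promote any point that is farther than a threshold from every existing center to a new cluster center. The threshold below is a made-up illustrative parameter:

```python
import numpy as np

def online_centers(X, threshold):
    centers = [X[0]]                  # first point seeds the first cluster
    for x in X[1:]:
        d = np.linalg.norm(np.array(centers) - x, axis=1)
        if d.min() > threshold:
            centers.append(x)         # far from all centers: new cluster
    return np.array(centers)
```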

Journal ArticleDOI
01 Jan 2001
TL;DR: This paper performs an experimental comparison between three batch algorithms for model-based clustering on high-dimensional discrete-variable datasets, and finds that the Expectation–Maximization (EM) algorithm significantly outperforms the other methods.
Abstract: We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a "winner take all" version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic). We find that the EM algorithm significantly outperforms the other methods, and proceed to investigate the effect of various initialization schemes on the final solution produced by the EM algorithm. The initializations that we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of hierarchical agglomerative clustering. Although the methods are substantially different, they lead to learned models that are strikingly similar in quality.

Book ChapterDOI
02 Jul 2001
TL;DR: This paper addresses the problem of finding consistent clusters in data partitions, proposing the analysis of the most common associations performed in a majority voting scheme; the methodology is evaluated in the context of k-means clustering, and a new clustering algorithm, voting-k-means, is presented.
Abstract: Given an arbitrary data set, to which no particular parametrical, statistical or geometrical structure can be assumed, different clustering algorithms will in general produce different data partitions. In fact, several partitions can also be obtained by using a single clustering algorithm due to dependencies on initialization or the selection of the value of some design parameter. This paper addresses the problem of finding consistent clusters in data partitions, proposing the analysis of the most common associations performed in a majority voting scheme. Combination of clustering results is performed by transforming data partitions into a co-association sample matrix, which maps coherent associations. This matrix is then used to extract the underlying consistent clusters. The proposed methodology is evaluated in the context of k-means clustering, and a new clustering algorithm, voting-k-means, is presented. Examples, using both simulated and real data, show how this majority voting combination scheme simultaneously handles the problems of selecting the number of clusters, and dependency on initialization. Furthermore, resulting clusters are not constrained to be hyperspherically shaped.
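A sketch of the co-association voting scheme: run k-means several times, count how often each pair of points falls in the same cluster, keep the pairs that co-occur in a majority of runs, and read the consistent clusters off as connected components. The per-run k and the 0.5 majority threshold below are illustrative choices:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def voting_clusters(X, n_runs=20, k=10, threshold=0.5):
    n = len(X)
    co = np.zeros((n, n))
    for r in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=r).fit_predict(X)
        co += labels[:, None] == labels[None, :]   # co-association counts
    co /= n_runs
    # Majority-voted pairs form a graph; its components are the clusters.
    _, comp = connected_components(co > threshold, directed=False)
    return comp
```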

Journal ArticleDOI
TL;DR: In this article, a density-based unsupervised clustering approach for detecting natural patterns in data (further denoted as NP) is presented, and its performance is illustrated for data sets with different types of clusters.

Journal ArticleDOI
TL;DR: CLIFF, an algorithm for clustering biological samples using gene expression microarray data, outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.
Abstract: We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.

Journal ArticleDOI
TL;DR: A genetic algorithm is proposed for clustering data into compact spherical clusters; it can be used in two ways, user-controlled or automatic clustering, where a heuristic strategy is applied to find a good clustering.

Proceedings ArticleDOI
02 Apr 2001
TL;DR: It is shown that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'.
Abstract: Clustering in spatial data mining is to group similar objects based on their distance, connectivity, or their relative density in space. In the real world there exist many physical obstacles such as rivers, lakes and highways, and their presence may affect the result of clustering substantially. We study the problem of clustering in the presence of obstacles and define it as a COD (Clustering with Obstructed Distance) problem. As a solution to this problem, we propose a scalable clustering algorithm, called COD-CLARANS. We discuss various forms of pre-processed information that could enhance the efficiency of COD-CLARANS. In the strictest sense, the COD problem can be treated as a change in distance function and thus could be handled by current clustering algorithms by changing the distance function. However, we show that by pushing the task of handling obstacles into COD-CLARANS instead of abstracting it at the distance function level, more optimization can be done in the form of a pruning function E'. We conduct various performance studies to show that COD-CLARANS is both efficient and effective.

Book ChapterDOI
04 Jan 2001
TL;DR: In this article, a scalable constrained clustering algorithm is developed which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints.
Abstract: Constrained clustering--finding clusters that satisfy user-specified constraints--is highly desirable in many applications. In this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. A scalable constraint-clustering algorithm is developed in this study which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints. Our algorithm consists of two phases: pivot movement and deadlock resolution. For both phases, we show that finding the optimal solution is NP-hard. We then propose several heuristics and show how our algorithm can scale up for large data sets using the heuristic of micro-cluster sharing. By experiments, we show the effectiveness and efficiency of the heuristics.

Proceedings ArticleDOI
26 Aug 2001
TL;DR: An efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data that consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find is presented.
Abstract: We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.

Proceedings ArticleDOI
27 May 2001
TL;DR: This work proposes a hybrid GA based on clustering, which considerably reduces the number of evaluations without any loss of performance; the algorithm divides the whole population into several clusters and evaluates only one representative per cluster.
Abstract: To solve a general problem with genetic algorithms, it is desirable to maintain the population size as large as possible. In some cases, however, the cost to evaluate each individual is relatively high, and it is difficult to maintain a large population. To solve this problem, we propose a hybrid GA based on clustering, which considerably reduces the number of evaluations without any loss of performance. The algorithm divides the whole population into several clusters, and evaluates only one representative for each cluster. The fitness values of other individuals are estimated indirectly from the representative fitness values, which makes it possible to maintain a large population with fewer evaluations. Several benchmark tests have been conducted and the results show that the proposed GA is very efficient.

Proceedings ArticleDOI
25 Jul 2001
TL;DR: A new fuzzy clustering algorithm is proposed for categorical multivariate data where only co-occurrence relations among individuals and categories are given and the criterion to obtain clusters is not available.
Abstract: This paper proposes a new fuzzy clustering algorithm for categorical multivariate data. The conventional fuzzy clustering algorithms form fuzzy clusters so as to minimize the total distance from cluster centers to data points. However, they cannot be applied to the case where only co-occurrence relations among individuals and categories are given and the criterion to obtain clusters is not available. The proposed method enables us to handle that kind of data set by maximizing the degree of aggregation among clusters. The clustering results by the proposed method show similarity to those of correspondence analysis or Hayashi's (1952) quantification method Type III. Numerical examples show the usefulness of our method.

Journal ArticleDOI
TL;DR: A new framework for microarray gene-expression data clustering is described; within this framework, a number of rigorous and efficient clustering algorithms have been developed, including two with guaranteed global optimality, and implemented in the software package EXCAVATOR.
Abstract: This paper describes a new framework for microarray gene-expression data clustering. The foundation of this framework is a minimum spanning tree (MST) representation of a set of multidimensional gene expression data. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multidimensional clustering problem to a tree partitioning problem. We have demonstrated that though the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages in representing a set of multi-dimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which otherwise are highly computationally challenging; and (2) as an MST-based clustering does not depend on the detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms in a software package, EXCAVATOR. To demonstrate its effectiveness, we have tested it on two data sets, i.e., expression data from yeast Saccharomyces cerevisiae, and Arabidopsis expression data in response to chitin elicitation.
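The simplest of the tree-partitioning objectives described here, cutting the heaviest tree edges, fits in a few lines of SciPy; this sketch is not EXCAVATOR itself, just the basic MST-cut idea:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, k):
    # MST of the complete pairwise-distance graph.
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    # Deleting the k-1 heaviest tree edges leaves k subtrees
    # (ties at the threshold may split further).
    weights = np.sort(mst[mst > 0])
    threshold = weights[-(k - 1)] if k > 1 else np.inf
    mst[mst >= threshold] = 0
    _, labels = connected_components(mst, directed=False)
    return labels
```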

Proceedings ArticleDOI
09 Jan 2001
TL;DR: This paper motivates and introduces a new model of clustering that is in the spirit of the “PAC (probably approximately correct)” learning model, and gives examples of efficient PAC-clustering algorithms.
Abstract: Clustering is of central importance in a number of disciplines including Machine Learning, Statistics, and Data Mining. This paper has two foci: (1) It describes how existing algorithms for clustering can benefit from simple sampling techniques arising from work in statistics [Pol84]. (2) It motivates and introduces a new model of clustering that is in the spirit of the “PAC (probably approximately correct)” learning model, and gives examples of efficient PAC-clustering algorithms.

Proceedings Article
11 Sep 2001
TL;DR: C2P, a new clustering algorithm for large spatial databases, is presented; it exploits spatial access methods for the determination of closest pairs and attains the advantages of hierarchical clustering and graph-theoretic algorithms, providing both efficiency and quality of the clustering result.
Abstract: In this paper we present C2P, a new clustering algorithm for large spatial databases, which exploits spatial access methods for the determination of closest pairs. Several extensions are presented for scalable clustering in large databases that contain clusters of various shapes and outliers. Due to its characteristics, the proposed algorithm attains the advantages of hierarchical clustering and graph-theoretic algorithms, providing both efficiency and quality of the clustering result. The superiority of C2P is verified both with analytical and experimental results.

Proceedings Article
01 Jan 2001
TL;DR: In this paper, the authors present a linear time algorithm for computing a 2-approximation to the k-centre clustering of a set of n points in R^d. This is a slight improvement over the algorithm of T. Feder and D. Greene (1988), that runs in Θ(n log k) time (which is optimal in the comparison model).
Abstract: Given a set of moving points in R^d, we show that one can cluster them in advance, using a small number of clusters, so that at any point in time this static clustering is competitive with the optimal k-centre clustering of the point-set at this point in time. The advantage of this approach is that it avoids the usage of kinetic data structures and as such it does not need to update the clustering as time passes. To implement this static clustering efficiently, we describe a simple technique for speeding up clustering algorithms, and apply it to achieve faster clustering algorithms for several problems. In particular, we present a linear time algorithm for computing a 2-approximation to the k-centre clustering of a set of n points in R^d. This is a slight improvement over the algorithm of T. Feder and D. Greene (1988), that runs in Θ(n log k) time (which is optimal in the comparison model).
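For context, the classic farthest-point traversal (due to Gonzalez) is the standard O(nk) 2-approximation for k-centre that these running times are measured against; the paper's contribution is bringing the time down to linear by a different construction. A minimal sketch:

```python
import numpy as np

def gonzalez_k_center(X, k, seed=0):
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]          # arbitrary first center
    d = np.linalg.norm(X - X[idx[0]], axis=1)  # distance to nearest center
    for _ in range(k - 1):
        nxt = int(d.argmax())                  # farthest point so far
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[idx], float(d.max())              # centers and covering radius
```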

Proceedings ArticleDOI
06 Jul 2001
TL;DR: This work presents a primal-dual based constant factor approximation algorithm for clustering to minimize the sum of cluster diameters, a simple greedy algorithm that achieves a logarithmic approximation and also applies when the distance function is asymmetric, and an incremental clustering algorithm that maintains a solution whose cost is at most a constant factor times that of optimal, with a constant factor blowup in the number of clusters.
Abstract: We study the problem of clustering points in a metric space so as to minimize the sum of cluster diameters. Significantly improving on previous results, we present a primal-dual based constant factor approximation algorithm for this problem. We present a simple greedy algorithm that achieves a logarithmic approximation which also applies when the distance function is asymmetric. The previous best known result obtained a logarithmic approximation with a constant factor blowup in the number of clusters. We also obtain an incremental clustering algorithm that maintains a solution whose cost is at most a constant factor times that of optimal with a constant factor blowup in the number of clusters.