
Linear-Time Approximation Schemes for Clustering Problems in any Dimensions∗

Amit Kumar
Dept of Comp Sc & Engg, Indian Institute of Technology, New Delhi-110016, India
amitk@cse.iitd.ernet.in

Yogish Sabharwal
IBM Research - India, Plot 4, Block C, Vasant Kunj Inst. Area, New Delhi-110070, India
ysabharwal@in.ibm.com

Sandeep Sen
Dept of Comp Sc & Engg, Indian Institute of Technology, New Delhi-110016, India
ssen@cse.iitd.ernet.in

September 23, 2009

∗ Preliminary versions of the results have appeared earlier in IEEE Symposium on Foundations of Computer Science, 2004 [17] and International Colloquium on Automata, Languages and Programming, 2005 [18].
Abstract

We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1 + ε) approximations with probability ≥ 1/2 and running times of O(2^{(k/ε)^{O(1)}} dn). These are the first algorithms for these problems whose running times are linear in the size of the input (nd for n points in d dimensions) assuming k and ε are fixed. Our method is general enough to be applicable to clustering problems satisfying certain simple properties and is likely to have further applications.
1 Introduction
The problem of clustering a group of data items into similar groups is one of the most widely studied
problems in computer science. Clustering has applications in a variety of areas, for example, data mining, information retrieval, image processing, and web search ([5, 9, 22, 11]). Given the wide range of applications, many different definitions of clustering exist in the literature ([10, 4]). Most of these definitions begin by defining a notion of distance (similarity) between two data items and then try to form clusters so that data items with small distance between them get clustered together.

Often, clustering problems arise in a geometric setting, i.e., the data items are points in a high dimensional Euclidean space. In such settings, it is natural to define the distance between two points as the Euclidean distance between them. Two of the most popular definitions of clustering are the k-means clustering problem and the k-median clustering problem. Given a set of points P, the k-means clustering problem seeks to find a set K of k centers such that ∑_{p∈P} d(p, K)^2 is minimized, whereas the k-median clustering problem seeks to find a set K of k centers such that ∑_{p∈P} d(p, K) is minimized. Note that the points in K can be arbitrary points in the Euclidean space. Here d(p, K) refers to the distance between p and the closest center in K. We can think of this as each point in P being assigned to the closest center; the points that get assigned to the same center form a cluster. These problems are NP-hard even for k = 2 (when the dimension is not fixed) [7]. Interestingly, the center in the optimal solution to the 1-mean problem is the same as the center of mass of the points. However, in the case of the 1-median problem, also known as the Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a closed form, we can obtain an approximation to the optimal 1-median in O(1) time (independent of the number of points). There are many useful variations of these clustering problems; for example, in the discrete versions of these problems, the centers that we seek should belong to the input set of points.
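To make these two objectives concrete, here is a minimal Python sketch (not from the paper; the function and variable names are illustrative) that evaluates d(p, K) and both clustering costs for a given set of centers K.

```python
import math

def dist(p, c):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def d_to_centers(p, K):
    # d(p, K): distance from p to the closest center in K.
    return min(dist(p, c) for c in K)

def kmeans_cost(P, K):
    # k-means objective: sum of squared distances to the closest center.
    return sum(d_to_centers(p, K) ** 2 for p in P)

def kmedian_cost(P, K):
    # k-median objective: sum of distances to the closest center.
    return sum(d_to_centers(p, K) for p in P)

# Small example: four points on a line, centers at 0 and 10.
P = [(0.0,), (1.0,), (9.0,), (10.0,)]
K = [(0.0,), (10.0,)]
print(kmeans_cost(P, K), kmedian_cost(P, K))  # 2.0 2.0
```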
1.1 Related work
A lot of research has been devoted to solving these problems exactly (see [14] and the references therein). Even the best known algorithms for the k-median and the k-means problem take at least Ω(n^d) time. Recently, more attention has been devoted to finding (1 + ε)-approximation algorithms for these problems, where ε can be an arbitrarily small constant. This has resulted in algorithms with substantially improved running times. Further, if we look at the applications of these problems, they often involve mapping subjective features to points in the Euclidean space. Since there is an error inherent in this mapping, finding a (1 + ε)-approximate solution is within acceptable limits for the actual applications.
The fastest exact algorithm for the k-means clustering problem was proposed by Inaba et al. [14]. They observed that the number of Voronoi partitions of k points in ℝ^d is O(n^{kd}) and so the optimal k-means clustering could be determined exactly in time O(n^{kd+1}). They also proposed a randomized (1 + ε)-approximation algorithm for the 2-means clustering problem with running time O(n/ε^d). Matoušek [19] proposed a deterministic (1 + ε)-approximation algorithm for the k-means problem with running time O(n ε^{-2k^2 d} log^k n).
By generalizing the technique of Arora [1], Arora et al. [2] presented an O(n^{O(1/ε)+1}) time (1 + ε)-approximation algorithm for the k-median problem where points lie in the plane. This was significantly improved by Kolliopoulos et al. [16], who proposed an algorithm with a running time of O(ρ n log n log k) for the discrete version of the problem, where the medians must belong to the input set and ρ = exp[O((1 + log 1/ε)/ε)^{d-1}].
Recently, Badoiu et al. [3] proposed a (1 + ε)-approximation algorithm for k-median clustering with a running time of O(2^{(k/ε)^{O(1)}} d^{O(1)} n log^{O(k)} n). Their algorithm can be extended to k-means with some modifications. de la Vega et al. [8] proposed a (1 + ε)-approximation algorithm for the k-means problem which works well for points in high dimensions. The running time of this algorithm is O(g(k, ε) n log^k n) where g(k, ε) = exp[(k^3/ε^8)(ln(k/ε)) ln k].
Recently, Har-Peled and Mazumdar [13] proposed (1 + ε)-approximation algorithms for the k-median, discrete k-median and k-means clustering problems in low dimensions. They obtained a running time of O(n + k^{O(1)} log^{O(1)} n) for the k-median problem, O(n + k^{O(1)} log^{O(1)} n) for the discrete k-median problem and O(n + k^{k+2} ε^{-(2d+1)k} log^{k+1} n log^k (1/ε)) for the k-means problem. For approximating the 1-median of a set of points, Indyk [15] proposed an algorithm that finds a (1 + ε)-approximate 1-median in time O(n/ε^2) with constant probability.
Figure 1 summarizes the recent results for these problems in the context of (1 + ε)-approximation algorithms. Some of these algorithms are randomized, with the expected running time holding good for any input.
Problem     Result                                                                   Reference
1-median    O(n/ε^2)                                                                 Indyk [15]
k-median    O(n^{O(1/ε)+1}) for d = 2                                                Arora [1]
            O(ρ n log n log k) (discrete only),                                      Kolliopoulos et al. [16]
              where ρ = exp[O((1 + log 1/ε)/ε)^{d-1}]
            O(2^{(k/ε)^{O(1)}} d^{O(1)} n log^k n)                                   Badoiu et al. [3]
            O(n + k^{O(1)} log^{O(1)} n) (discrete also)                             Har-Peled et al. [13]
k-means     O(n ε^{-2k^2 d} log^k n)                                                 Matoušek [19]
            O(g(k, ε) n log^k n),                                                    de la Vega et al. [8]
              where g(k, ε) = exp[(k^3/ε^8)(ln(k/ε)) ln k]
            O(n + k^{k+2} ε^{-(2d+1)k} log^{k+1} n log^k(1/ε)) (discrete also)       Har-Peled et al. [13]

Figure 1: Summary of previous results on k-means and k-median clustering.

1.2 Our results and techniques

The general algorithm we present solves a large class of clustering problems satisfying a set of conditions (cf. Section 3). We show that the k-means problem, the k-median problem and the discrete k-means problem all satisfy the required conditions and therefore belong to this class of clustering
problems. One important condition that the clustering problems must satisfy is the existence of
an algorithm to generate a candidate set of points such that at least one of these points is a close
approximation to the optimal center for k = 1 (one cluster). Further, the running time of this
algorithm as well as the size of this candidate set should be independent of n. Based on such a
subroutine, we show how to approximate all the centers in the optimal solution in an iterative
manner.
The running times of O(2^{(k/ε)^{O(1)}} nd) of our algorithms are better than those of the previously known algorithms for these problems, especially when d is very large. In fact, these are the first algorithms for the k-means, k-median and the discrete k-means clustering problems that have running time linear in the size of the input for fixed k and ε. The algorithms in this paper have the additional advantage of simplicity, as the only technique involved is random sampling. Our method is based on using random sampling to identify a small set of candidate centers. In contrast, an alternate strategy [13] involves identifying significantly smaller sets of points, called coresets, such that solving the clustering problem on the coresets yields a solution for the original set of points. In a subsequent work [6], further improvements were obtained using a clever combination of the two techniques (cf. Section 8).
The main drawback of our algorithm is that the running time has exponential dependence on (k/ε)^{O(1)}. We would however like to note that Guruswami and Indyk [12] showed that it is NP-hard to obtain a PTAS for the k-median problem for arbitrary k and d = Ω(log n). Since we avoid exponential dependence on d, this implies that the exponential dependence on k is inherent. An algorithm that avoids exponential dependence on k, like that of Arora et al. [2], has doubly exponential dependence on d, which is arguably worse in most situations.
We also present a randomized (1 + ε)-approximation algorithm for the 1-median problem which runs in time O(2^{(1/ε)^{O(1)}} d), assuming that the points are stored in a suitable data structure such as an array, where a point can be randomly sampled in constant time. All our algorithms yield the desired result with constant probability (which can be made as close to 1 as we wish by a constant number of repetitions).
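For illustration, the constant-probability guarantee can be amplified as follows (a sketch, not from the paper; `run` stands for any of the randomized algorithms and is assumed to return a solution together with its cost):

```python
def boost(run, repetitions):
    # If each run succeeds (returns a (1 + eps)-approximate solution) with
    # probability at least 1/2, then keeping the cheapest solution over t
    # independent runs fails with probability at most (1/2)^t.
    best_solution, best_cost = None, float("inf")
    for _ in range(repetitions):
        solution, cost = run()
        if cost < best_cost:
            best_solution, best_cost = solution, cost
    return best_solution, best_cost
```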
The remainder of the paper is organized as follows. In Section 2, we define clustering problems. In Section 3, we present a simplified algorithm for the 2-means clustering problem. In Section 4, we describe a general approach for solving clustering problems efficiently. In subsequent sections we give applications of the general method by showing that this class of problems includes the k-means, k-median and discrete k-means problems. In Section 5.3, we also describe an efficient approximation algorithm for the 1-median problem. In Section 7, we extend our algorithms to efficiently handle weighted point sets. We conclude by stating some open problems and some interesting developments subsequent to the publication of earlier versions of this work in Section 8.
2 Clustering Problems
In this section, we give a general definition of clustering problems.
We shall define a clustering problem by two parameters: an integer k and a real-valued cost function f(Q, x), where Q is a set of points and x is a point in a Euclidean space. We shall denote this clustering problem as C(f, k). The input to C(f, k) is a set of points in a Euclidean space.
Given an instance P of n points, C(f, k) seeks to partition them into k sets, which we shall denote as clusters. Let these clusters be C_1, . . . , C_k. A solution also finds k points, which we call centers, c_1, . . . , c_k. We shall say that c_i is the center of cluster C_i (or the points in C_i are assigned to c_i). The objective of the problem is to minimize the quantity ∑_{i=1}^{k} f(C_i, c_i).
This is a fairly general definition. Let us see some important special cases.

k-median: f_1(Q, x) = ∑_{q∈Q} d(q, x).

k-means: f_2(Q, x) = ∑_{q∈Q} d(q, x)^2.

We can also encompass the discrete versions of these problems, i.e., cases where the centers have to be one of the points in P. In such problems, we can make f(Q, x) unbounded if x ∉ Q.
As stated earlier, we shall assume that we are given a constant ε > 0, and we are interested in finding (1 + ε)-approximation algorithms for these clustering problems.
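As an illustration of this definition (a sketch with invented helper names, not code from the paper), the objective of C(f, k) can be evaluated for any plug-in cost function f, and the discrete variant is obtained by returning an unbounded cost whenever the center is not an input point:

```python
import math

def f_median(Q, x):
    # f_1(Q, x): sum of distances from the points of Q to x.
    return sum(math.dist(q, x) for q in Q)

def f_means(Q, x):
    # f_2(Q, x): sum of squared distances from the points of Q to x.
    return sum(math.dist(q, x) ** 2 for q in Q)

def f_discrete_means(P):
    # Discrete version: the center must be one of the input points of P.
    points = set(P)
    return lambda Q, x: f_means(Q, x) if x in points else float("inf")

def clustering_cost(clusters, centers, f):
    # Objective of C(f, k): sum over clusters of f(C_i, c_i).
    return sum(f(C_i, c_i) for C_i, c_i in zip(clusters, centers))
```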
We now give some definitions. Let us fix a clustering problem C(f, k). Although we should parametrize all our definitions by f, we avoid doing so because the clustering problem will be clear from the context.

Definition 2.1. Given a point set P, let OPT_k(P) be the cost of the optimal solution to the clustering problem C(f, k) on input P.

Definition 2.2. Given a set of points P and a set of k points C, let OPT_k(P, C) be the cost of the optimal solution to C(f, k) on P when the set of centers is C.
3 Algorithm for 2-means Clustering
In this section we describe the algorithm for 2-means clustering. The 2-means clustering algorithm contains many of the ideas inherent in the more general algorithm. This makes it easier to understand the more general algorithm described in the next section.
Consider an instance of the 2-means problem where we are given a set P of n points in ℝ^d. We seek to solve C(f_2, 2), where f_2 corresponds to the k-means cost function as defined in Section 2. We first look at some properties of the 1-means problem.

Definition 3.1. For a set of points P, define the centroid, c(P), of P as the point (∑_{p∈P} p)/|P|.

Claim 3.1. For any point x ∈ ℝ^d,

    f_2(P, x) = f_2(P, c(P)) + |P| · d(c(P), x)^2.    (1)

Proof.

    f_2(P, x) = ∑_{p∈P} ||p - x||^2
              = ∑_{p∈P} ||p - c(P) + c(P) - x||^2
              = ∑_{p∈P} ||p - c(P)||^2 + 2 (c(P) - x) · ∑_{p∈P} (p - c(P)) + ∑_{p∈P} ||c(P) - x||^2
              = ∑_{p∈P} ||p - c(P)||^2 + ∑_{p∈P} ||c(P) - x||^2
              = f_2(P, c(P)) + |P| · d(c(P), x)^2,

where the second last equality follows from the fact that ∑_{p∈P} (p - c(P)) = 0.
From this we can make the following observation.
Fact 3.2. Any optimal solution to the 1-means problem with respect to an input point set P chooses
c(P ) as the center.
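A quick numerical check of identity (1) and Fact 3.2 (an illustrative sketch assuming points are given as coordinate tuples; not part of the paper):

```python
import math

def centroid(P):
    # c(P): coordinate-wise mean of the points in P.
    d = len(P[0])
    return tuple(sum(p[i] for p in P) / len(P) for i in range(d))

def f2(P, x):
    # 1-means cost of P with center x.
    return sum(math.dist(p, x) ** 2 for p in P)

P = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = centroid(P)
x = (5.0, -1.0)
lhs = f2(P, x)
rhs = f2(P, c) + len(P) * math.dist(c, x) ** 2
assert abs(lhs - rhs) < 1e-9   # identity (1)
assert f2(P, c) <= f2(P, x)    # the centroid minimizes the 1-means cost
```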
We can deduce an important property of any optimal solution to the 2-means clustering problem. Suppose we are given an optimal solution to the 2-means clustering problem with respect to the input P. Let C = {c_1, c_2} be the set of centers constructed by this solution. C produces a partitioning of the point set P into 2 clusters, namely, P_1, P_2. P_i is the set of points for which the closest point in C is c_i. In other words, the clusters correspond to the points in the Voronoi regions in ℝ^d with respect to C. Now, Fact 3.2 implies that c_i must be the centroid of P_i for i = 1, 2.
Inaba et al. [14] showed that the centroid of a small random sample of points in P can be a good approximation to c(P).

Lemma 3.3. [14] Let T be a set obtained by independently sampling m points uniformly at random from a point set P. Then, for any δ > 0,

    f_2(P, c(T)) < (1 + 1/(δm)) · OPT_1(P)

holds with probability at least 1 - δ.
Therefore, if we choose m as 2/ε, then with probability at least 1/2, we get a (1 + ε)-approximation to OPT_1(P) by taking the center as the centroid of T (setting δ = 1/2 in Lemma 3.3 gives 1 + 1/(δm) = 1 + ε). Thus, a constant size sample can quickly yield a good approximation to the optimal 1-means solution.
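A minimal sketch of this sampling step (illustrative code, not the paper's exact procedure), using Lemma 3.3 with δ = 1/2 and m = ⌈2/ε⌉:

```python
import math
import random

def centroid(T):
    d = len(T[0])
    return tuple(sum(p[i] for p in T) / len(T) for i in range(d))

def approx_1_means_center(P, eps):
    # Sample m = ceil(2/eps) points independently and uniformly at random
    # (with replacement) and return the centroid of the sample; by Lemma 3.3
    # this is a (1 + eps)-approximate 1-means center with probability >= 1/2.
    m = math.ceil(2.0 / eps)
    T = [random.choice(P) for _ in range(m)]
    return centroid(T)
```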
Suppose P′ is a subset of P and we want to get a good approximation to the optimal 1-mean for the point set P′. Following Lemma 3.3, we would like to sample from P′. But the problem is that P′ is not explicitly given to us. The following lemma states that if the size of P′ is close to that of P, then we can sample a slightly larger set of points from P and hopefully this sample would contain enough random samples from P′. Let us define things more formally first. Let P be a set of points and P′ be a subset of P such that |P′| ≥ θ|P|, where θ is a constant between 0 and 1. Suppose we take a sample S of size 4/(θε) from P. Now we consider all possible subsets of size 2/ε of S. For each of these subsets S′, we compute its centroid c(S′), and consider this as a potential center for the 1-means problem instance on P′. In other words, we consider f_2(P′, c(S′)) for all such subsets S′. The following lemma shows that one of these subsets must give a close enough approximation to the optimal 1-means solution for P′.
Lemma 3.4. The following event happens with constant probability:

    min_{S′ ⊆ S, |S′| = 2/ε} f_2(P′, c(S′)) ≤ (1 + ε) · OPT_1(P′)
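A sketch of the candidate-generation step just described (illustrative code under the stated assumptions, not the paper's exact procedure): sample 4/(θε) points from P and return the centroid of every subset of size 2/ε as a candidate center; by Lemma 3.4, with constant probability at least one candidate is a (1 + ε)-approximate 1-means center for the unknown subset P′.

```python
import math
import random
from itertools import combinations

def centroid(T):
    d = len(T[0])
    return tuple(sum(p[i] for p in T) / len(T) for i in range(d))

def candidate_centers(P, theta, eps):
    # Sample S of size 4/(theta * eps) from P; since |P'| >= theta * |P|,
    # S is expected to contain enough points of the unknown subset P'.
    sample_size = math.ceil(4.0 / (theta * eps))
    subset_size = math.ceil(2.0 / eps)
    S = [random.choice(P) for _ in range(sample_size)]
    # The centroids of all subsets S' of size 2/eps are the candidate centers;
    # their number depends only on theta and eps, not on |P|.
    return [centroid(list(Sp)) for Sp in combinations(S, subset_size)]
```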
