What is the problem of finding a PTAS for the k-median?

An interesting open problem is whether there exist coresets for the k-median or k-means clustering problems of size independent of n and having only polynomial dependence in d.Another interesting open problem is to find a PTAS for the k-means clustering problem, even for fixed dimensions.

What is the weighted random sampling procedure for a clustering problem?

It is important to note here that given a Random Sampling Procedure for an unweighted clustering problem, the corresponding Weighted Random Sampling Procedure for the weighted version of the problem can be simply obtained by performing weighted sampling as described above.

(Open Access) Linear-time approximation schemes for clustering problems in any dimensions (2010) | Amit Kumar

Q: What are the contributions mentioned in the paper "Linear-time approximation schemes for clustering problems in any dimensions∗" ?

The authors present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, their approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield ( 1 + ε ) approximations with probability ≥ 1/2 and running times of O ( 2 ( k/ε ) O ( 1 ) dn ). Their method is general enough to be applicable to clustering problems satisfying certain simple properties and is likely to have further applications.

Linear-Time Approximation Schemes for Clustering Problems in

any Dimensions

∗

Amit Kumar

Dept of Comp Sc & Engg

Indian Institute of

Technology

New Delhi-110016, India

amitk@cse.iitd.ernet.in

Yogish Sabharwal

IBM Research - India

Plot 4, Block C,

Vasant Kunj Inst. Area

New Delhi-110070, India

ysabharwal@in.ibm.com

Sandeep Sen

Dept of Comp Sc & Engg

Indian Institute of

Technology

New Delhi-110016, India

ssen@cse.iitd.ernet.in

September 23, 2009

Abstract

We present a gener al approach for designing approximation algorithms for a fundamental

class of geometric clustering problems in arbitrary dimensions. More speciﬁcally, our approach

leads to simple randomized algorithms for the k-means, k-median and discrete k-means pro blems

that yield (1 + ε) approximations with probability ≥ 1/2 and running times o f O(2

(k/ε)

O(1)

dn).

These are the ﬁrst algorithms for these problems whose running times are linear in the s iz e of

the input (nd for n points in d dimensions) assuming k and ε are ﬁxed. Our method is general

enough to be applicable to clustering pr oblems satisfying certa in simple properties and is likely

to have further applications.

1 Introduction

The problem of clustering a group of data items into similar groups is one of the most widely studied

problems in computer science. Clustering has ap plications in a variety of areas, for example, data

mining, information retrieval, image processing, and web search ([5, 9, 22, 11]). Given the wide

range of applications, many diﬀerent deﬁnitions of clustering exist in the literature ([10, 4]). Most of

these deﬁnitions begin by deﬁ ning a notion of distance (similarity) between two data items and then

try to form clusters so that data items with small distance between them get clustered together.

Often, clustering prob lems arise in a geometric setting, i.e., the data items are points in a high

dimensional Euclidean space. In such settings, it is natural to deﬁn e the distance between two

points as the Euclidean distance between them. Two of the most popu lar deﬁ nitions of clustering

are the k-means clustering problem and the k-median clustering problem. Given a set of points

P , the k-means clustering problems seeks to ﬁnd a set K of k centers, such that

p∈P

d(p, K)

minimized, whereas the k-median clustering problems seeks to ﬁnd a set K of k centers, such that

p∈P

d(p, K) is minimized. Note that the points in K can be arbitrary points in the Euclidean

space. Here d(p, K) refers to the distance between p and the closest center in K. We can think

of this as each point in P gets assigned to the closest center. The points that get assigned to the

same center form a cluster. These problems are NP-hard for even k = 2 (when dimension is not

∗

Preliminary versions of the results have appeared earlier in IEEE Symposium on Foundations of Computer

Science, 2004[17] and International Colloquium on Automata, Languages and Programming, 2005[18].

ﬁxed) [7 ]. Interestingly, the center in the optimal solution to the 1-mean problem is the same as

the center of mass of the points. However, in the case of th e 1-median problem, also kn own as the

Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a

closed form, we can obtain an approximation to the optimal 1-median in O(1) time (indepen dent of

the nu mber of points). Th ere are many useful variations to these clustering problems, for example,

in the discrete versions of these problems, the centers that we seek s hould belong to the input set

of points.

1.1 Related work

A lot of research h as been devoted to solving these problems exactly (see [14] and th e references

therein). Even the best known algorithms for the k-median and the k-means problem take at

least Ω(n

) time. Recently, more attention has been devoted to ﬁnding (1 + ε)-approximation

algorithm for these problems, where ε can be an arbitrarily sm all constant. This has resulted in

algorithms with substantially improved running times. Further, if we look at the applications of

these problems, they often involve mapping subjective features to points in the Euclidean space.

Since there is an error inherent in this mapping, ﬁnding a (1 + ε)-approximate solution is within

acceptable limits f or the actual applications.

The fastest exact algorithm for the k-means clustering problem was proposed by In aba et al.

[14]. They obs erved that the number of Voronoi partitions of k points in ℜ

is O(n

) and so the

optimal k-means clustering could be determined exactly in time O(n

kd+1

). They also proposed a

randomized (1 + ε)-approximation algorithm for the 2-means clustering problem with running time

O(n/ε

). Matouˇsek [19] proposed a deterministic (1 + ε)-approximation algorithm for the k-means

problem with running time O(nε

−2k

log

n).

By generalizing the technique of Arora [1], Arora et al. [2] presented a O(n

O(1/ε)+1

) time

(1 + ε)-approximation algorithm f or the k-median problem where points lie in the plane. This was

signiﬁcantly impr oved by Kolliopoulos et al. [16] who proposed an algorithm w ith a running time

of O(nlognlogk) for the discrete version of the problem, where the m ed ians mus t belong to the

input set and  = exp[O ((1 + log1/ε)/ε)

d−1

Recently, Badoiu et al. [3] proposed a (1 + ε)-approximation algorithm for k-median clustering

with a running time of O(2

(k/ε)

O(1)

nlog

O(k)

n). Their algorithm can be extended to k-means

with some modiﬁcations. de la Vega et al. [8] proposed a (1 + ε)-approximation algorithm for

the k-means problem which works well for points in high dimensions. The running time of this

algorithm is O(g(k, ε)nlog

n) where g(k, ε) = exp[(k

/ε

)(ln(k/ε)lnk].

Recently, Har-Peled and Mazumdar [13] proposed (1 + ε)-approximation algorithms for the

k-median, discrete k-median and the k-means clustering in low dimensions. They obtained a

runnin g time of O(n + k

O(1)

log

O(1)

n) for th e k-median problem, O(n + k

O(1)

log

O(1)

n) for the

discrete k-median problem and O(n + k

k+2

−(2d+1)k

log

k+1

nlog

) for the k-means problem. For

approximating the 1-median of a set of points, Indyk [15] proposed an algorithm that ﬁnds a (1+ε)

approximate 1-median in time O(n/ε

) with constant probability.

Table 1.1 su mmarizes the recent results for the problems, in the context of (1+ε)-approximation

algorithms. Some of these algorithms are ran domized with the expected ruining time holding good

for any input.

1.2 Our results and techniques

The general algorithm we present solves a large class of clustering prob lems satisfying a set of

conditions (cf. S ection 3). We show that the k-means problem, k-median problem and the discrete

Problem Result Reference

1-median O(n/ε

) Indyk [15]

k-median O(n

O(1/ε)+1

) for d = 2 Arora [1]

O(nlognlogk) (discrete only) Kolliopoulos et al. [16]

O(2

(k/ε)

O(1)

n log

Badoiu et al. [3]

O(n + k

O(1)

log

O(1)

n) (discrete also) Har-Peled et al. [13]

where  = exp[O((1 + log1/ε)/ε)

d−1

]

k-means O(nε

−2k

log

n) Matouˇsek [19]

O(g(k, ε)nlog

n) de la Vega et al. [8]

g(k, ε) = exp[(k

/ε

)(ln(k/ε)lnk]

O(n + k

k+2

−(2d+1)k

log

k+1

nlog

) (discrete also)

Har-Peled et al. [13]

Figure 1: Summary of previous results on k-means and k-median clustering.

k-means problem, all satisfy the required conditions and therefore belong to this class of clustering

problems. One important condition that the clustering problems must satisfy is the existence of

an algorithm to generate a candidate set of points such that at least one of these points is a close

approximation to the optimal center for k = 1 (one cluster). Further, the running time of this

algorithm as well as the size of this candidate set should be independent of n. Based on such a

subroutine, we sh ow how to approximate all the centers in the optimal solution in an iterative

manner.

The runn ing times of O(2

(k/ε)

O(1)

nd) of our algorithms are better than the previously known

algorithms for th ese problems, specially when d is very large. In fact, these are the ﬁrst algorithms

for the k-means, k-median and the discrete k-means clustering problems that have running time

linear in the size of the input for ﬁxed k and ε. The algorithms in this paper h ave the additional

advantage of simplicity as the only technique involved is random sampling. Our method is based

on using random sampling to identify a small set of candidate centers. In contrast, an alternate

strategy [13] involves identifying signiﬁcantly small sets of points, called coresets, such that solving

the clustering problem on the coresets yields a solution for th e original set of points. In a subsequent

work [6], fu rther improvements were obtained using a clever comb ination of the two techniques (cf.

Section 8).

The main drawback of our algorithm is that the running time has expon ential d ependence on





O(1)

. We wou ld however like to note that Guruswamy and Indyk [12] showed that it is NP-hard

to obtain a PTAS for the k-median problem for arbitrary k and d ≥ Ω(log n). Since we avoid

exponential dependence on d, this implies that the exponential dependence on k is inherent. An

algorithm that avoids exponential dependen ce on k, like Arora et al.[2] has doubly exponential

dependence on d which is arguably wors e in most situations.

We also present a randomized (1 + ε)-approximation algorithm for the 1-median problem which

runs in time O(2

1/ε

O(1)

d), assuming that the points are stored in a suitable d ata structure such as

an array, where a point can be randomly sampled in constant time. All our algorithms yield the

desired result with constant probability (which can be made as close to 1 as we wish by a constant

number of repetitions).

The remaining paper is organized as follows. In Section 2, we deﬁne clustering problems. In

Section 3 we present a simpliﬁed algorithm for the 2-means clustering problem. In Section 4,

we describe a general approach for solving clustering problems eﬃciently. In subsequent sections

we give applications of the general method by showing that this class of problems includes the

k-means, k-median and discrete k-means prob lems. In Section 5.3, we also describe an eﬃcient

approximation algorithm for the 1-median problem. In Section 7, we extend our algorithms for

eﬃciently handling weighted point sets. We conclude by stating some open problems and some

interesting developments subsequent to the publication of earlier versions of this work in Section 8.

2 Clustering Problems

In this section, we give a general deﬁnition of clustering problems.

We shall d eﬁ ne a clustering problem by two parameters – an integer k and a real-valued cost

function f(Q, x), where Q is a set of points, and x is a point in an Euclidean sp ace. We shall den ote

this clustering problem as C(f, k). The input to C(f, k) is a set of points in an Euclidean space.

Given an in stance P of n points, C(f, k) s eeks to partition them into k sets, which we shall

denote as clusters. Let these clusters be C

, . . . , C

. A solution also ﬁnds k points, which we call

centers, c

, . . . , c

. We shall say that c

is the center of cluster C

(or the points in C

are assigned

to c

). The objective of the problem is to m inimize the quantity

i=1

f(C

, c

This is a fairly general deﬁnition. Let us see some important special cases.

• k-median : f

(Q, x) =

q∈Q

d(q, x).

• k-means : f

(Q, x) =

q∈Q

d(q, x)

We can also encompass the discrete versions of these problems, i.e., cases where the centers

have to be one of the points in P . In such problems, we can make f (Q, x) unb ounded if x /∈ Q.

As stated earlier, we shall assum e that we are given a constant ε > 0, and we are interested in

ﬁnding (1 + ε)-approximation algorithms for these clustering problems.

We n ow give some deﬁn itions. Let us ﬁ x a clustering problem C(f, k). Although we should

parametrized all our deﬁnitions by f, we avoid this because the clustering problem will be clear

from the context.

Deﬁnition 2.1. Given a point set P , let OPT

(P ) be the cost of the optimal solution to the clustering

problem C(f, k) on input P .

Deﬁnition 2.2. Given a set of points P and a set of k points C, let OPT

(P, C) be the cost of the

optimal solution to C(f, k) on P when the set of centers is C.

3 Algorithm for 2-means Clustering

In this section we describe the algorithm for 2-means clustering. The 2-means clustering algo-

rithm contains many of the ideas inherent in the more general algorithm. This makes it easier to

understand the more general algorithm described in the next section.

Consider an instance of the 2-means problem where we are given a set P of n points in ℜ

. We

seek to ﬁnd C(f

, 2) where f

corresponds to the k-means cost function as deﬁned in Section 2.

We ﬁrst look at some properties of the 1-means problem.

Deﬁnition 3.1. For a set of points P , deﬁne the centroid, c(P ), of P as the point

p∈P

|P |

Claim 3.1. For any point x ∈ ℜ

(P, x) = f

(P, c(P )) + |P | ·d(c(P ), x)

. (1)

Proof.

(P, x) =

p∈P

||p − x||

p∈P

||p − c(P ) + c(P ) − x||

p∈P

||p − c(P )||

p∈P

||c(P) − x||

= f

(P, c(P )) + |P | ·d(c(P ), x)

where the second last equality follows from the fact that

p∈P

||p −c(P)|| = 0.

From this we can make the following observation.

Fact 3.2. Any optimal solution to the 1-means problem with respect to an input point set P chooses

c(P ) as the center.

We can deduce an important p roperty of any optimal solution to the 2-means clustering problem.

Suppose we are given an optimal solution to the 2-means clustering problem with respect to the

input P . Let C = {c

, c

} be the set of centers constructed by this solution. C produces a

partitioning of the point set P into 2 clusters, namely, P

, P

. P

is the set of points for which the

closest point in C is c

. In other words, the clusters correspond to the points in the Voronoi regions

in ℜ

with respect to C. Now, Fact 3.2 imp lies that c

mu st be the centroid of P

for i = 1, 2.

Inaba et. al. [14] showed th at the centroid of a small random sample of points in P can be a

go od approximation to c(P ).

Lemma 3.3. [14] Let T be a set obtained by independently sampling m points uniformly at random

from a point set P . Then, for any δ > 0,

(S, c(T )) <



1 +

δm



OPT

(P )

holds with probability at least 1 − δ.

Therefore, if we ch oose m as

, then with probability at least 1/2, we get a (1+ε)-approximation

to OPT

(P ) by taking the center as the centroid of T . Thus, a constant size sample can quickly

yield a good approximation to the optimal 1-means solution.

Suppose P

′

is a subset of P and we want to get a good approximation to the optimal 1-mean

for the point set P

′

. Following Lemma 3.3, we would like to sample from P

′

. But the problem

is that P

′

is not explicitly given to us. The following lemma states that if the size of P

′

is close

to th at of P , then we can sample a slightly larger set of points from P and hopefully this sample

would contain enough random samples from P

′

. Let us deﬁne things more formally ﬁrst. Let P be

a set of points and P

′

be a subset of P such that |P

′

| ≥ θ|P |, w here θ is a constant between 0 and

1. Suppose we take a sample S of size

θε

from P . Now we consider all possible subsets of size

of S. For each of these subsets S

′

, we compute its centroid c(S

′

), and consider this as a potential

center for the 1-means problem instance on P

′

. In other words, we consider f

′

, c(S

′

)) for all

such subsets S

′

. T he following lemma shows that one of these subsets must give a close enough

approximation to the optimal 1-means s olution for P

′

Lemma 3.4. The following event happens with constant probability

min

′

⊂S,|S

′

, c(S

′

)) ≤ (1 + ε)OPT

′

)

Linear-time approximation schemes for clustering problems in any dimensions

Figures

Citations

Journal of the ACM

Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

A Unified Framework for Approximating and Clustering Data

Clustering with Spectral Norm and the k-means Algorithm

Clustering with Spectral Norm and the k-Means Algorithm

References

Pattern Classification

Indexing by Latent Semantic Analysis

Color indexing

Syntactic clustering of the Web

Efficient and effective querying by image content

Related Papers (5)

On coresets for k-means and k-median clustering

Approximate clustering via core-sets

Least squares quantization in PCM

k-means++: the advantages of careful seeding

Local Search Heuristics for k -Median and Facility Location Problems

Frequently Asked Questions (3)

Q1. What are the contributions mentioned in the paper "Linear-time approximation schemes for clustering problems in any dimensions∗" ?

Q2. What is the problem of finding a PTAS for the k-median?

Q3. What is the weighted random sampling procedure for a clustering problem?