A Simple Linear Time (1 + ε)-Approximation Algorithm for k-Means Clustering
in Any Dimensions
Amit Kumar
Dept. of Computer Science
& Engg., IIT Delhi
New Delhi-110016, India
amitk@cse.iitd.ernet.in
Yogish Sabharwal
IBM India Research Lab
Block-I, IIT Delhi,
New Delhi-110016, India
ysabharwal@in.ibm.com
Sandeep Sen¹
Dept. of Computer Science
& Engg., IIT Delhi
New Delhi-110016, India
ssen@cse.iitd.ernet.in
Abstract
We present the first linear time (1+ε)-approximation algorithm for the k-means problem for fixed k and ε. Our algorithm runs in O(nd) time, which is linear in the size of the input. Another feature of our algorithm is its simplicity - the only technique involved is random sampling.
1. Introduction
The problem of clustering a group of data items into sim-
ilar groups is one of the most widely studied problems in
computer science. Clustering has applications in a variety of
areas, for example, data mining, information retrieval, im-
age processing, and web search ([5, 7, 14, 9]). Given the
wide range of applications, many different definitions of
clustering exist in the literature ([8, 4]). Most of these defi-
nitions begin by defining a notion of distance between two
data items and then try to form clusters so that data items
with small distance between them get clustered together.
Often, clustering problems arise in a geometric setting,
i.e., the data items are points in a high dimensional Eu-
clidean space. In such settings, it is natural to define the
distance between two points as the Euclidean distance be-
tween them. One of the most popular definitions of cluster-
ing is the k-means clustering problem. Given a set of points
P, the k-means clustering problem seeks to find a set K of k centers, such that
Σ_{p∈P} d(p, K)^2
is minimized. Note that the points in K can be arbitrary
points in the Euclidean space. Here d(p, K) refers to the dis-
tance between p and the closest center in K. We can think
¹ Author's present address: Dept. of Computer Science and Engineering, IIT Kharagpur 721302.
of this as each point in P gets assigned to the closest cen-
ter in K. The points that get assigned to the same center
form a cluster. The k-means problem is NP-hard even for
k = 2. Another popular definition of clustering is the k-
median problem. This is defined in the same manner as the
k-means problem except for the fact that the objective func-
tion is Σ_{p∈P} d(p, K). Observe that the distance measure
used in the definition of the k-means problem is not a met-
ric. This might lead one to believe that solving the k-means
problem is more difficult than the k-median problem. How-
ever, in this paper, we give strong evidence that this may not
be the case.
A lot of research has been devoted to solving the k-
means problem exactly (see [11] and the references therein).
Even the best known algorithms for this problem take at
least Ω(n^d) time. Recently, some work has been devoted
to finding (1 + ε)-approximation algorithms for the k-
means problem, where ε can be an arbitrarily small con-
stant. This has led to algorithms with much improved run-
ning time. Further, if we look at the applications of the k-
means problem, they often involve mapping subjective fea-
tures to points in the Euclidean space. Since there is an error
inherent in this mapping, finding a (1 + ε)-approximate so-
lution does not lead to a deterioration in the solution for the
actual application.
In this paper, we give the first truly linear time (1 + ε)-
approximation algorithm for the k-means problem. Treat-
ing k and ε as constants, our algorithm runs in O(nd) time,
which is linear in the size of the input. Another feature of
our algorithm is its simplicity - the only technique involved
is random sampling.
1.1. Related work
The fastest exact algorithm for the k-means clustering
problem was proposed by Inaba et al. [11]. They observed that the number of Voronoi partitions of k points in ℝ^d is O(n^{kd}) and so the optimal k-means clustering could be determined exactly in time O(n^{kd+1}). They also proposed a randomized (1 + ε)-approximation algorithm for the 2-means clustering problem with running time O(n/ε^d). Matousek [13] proposed a deterministic (1 + ε)-approximation algorithm for the k-means problem with running time O(n ε^{−2k^2 d} log^k n). Badoiu et al. [3] proposed a (1 + ε)-approximation algorithm for the k-median clustering problem with running time O(2^{(k/ε)^{O(1)}} d^{O(1)} n log^{O(k)} n). Their algorithm can be extended to get a (1 + ε)-approximation algorithm for the k-means clustering problem with a similar running time. de la Vega et al. [6] proposed a (1 + ε)-approximation algorithm for the k-means problem which works well for points in high dimensions. The running time of this algorithm is O(g(k, ε) n log^k n), where g(k, ε) = exp[(k^3/ε^8)(ln(k/ε)) ln k]. Recently, Har-Peled et al. [10] proposed a (1 + ε)-approximation algorithm for the k-means clustering whose running time is O(n + k^{k+2} ε^{−(2d+1)k} log^{k+1} n log^k(1/ε)). Their algorithm is also fairly complicated and relies on several results in computational geometry that depend exponentially on the number of dimensions. So this is more suitable for low dimensions only.
There exist other definitions of clustering, for example,
k-median clustering where the objective is to minimize the
sum of the distances to the nearest center and k-center clus-
tering, where the objective is to minimize the maximum dis-
tance (see [1, 2, 3, 10, 12] and references therein).
1.2. Our contributions
We present a linear time (1 + ε)-approximation algo-
rithm for the k-means problem. Treating k and ε as con-
stants, the running time of our algorithm is better in com-
parison to the previously known algorithms for this prob-
lem. However, the algorithm due to Har-Peled and Mazum-
dar [10] deserves careful comparison. Note that their algo-
rithm, though linear in n, is not linear in the input size of
the problem, which is dn (for n points in d dimensions).
Therefore, their algorithm is better only for low dimen-
sions; for d = Ω(log n), our algorithm is much faster. Even
use of the Johnson-Lindenstrauss lemma will not make the run-
ning time comparable as it has its own overheads. Many re-
cent algorithms rely on techniques like exponential grid or
scaling that have high overheads. For instance, normaliz-
ing with respect to the minimum distance between points may incur an extra Ω(n) cost per point depending on the com-
putational model. In [3], the authors have used rounding
techniques based on approximations of the optimal k-center
value without specifying the cost incurred in the process.
The techniques employed in our algorithm have no such
hidden overheads.
The 2-means clustering problem has also gener-
ated enough research interest in the past. Our algo-
rithm yields a (1 + ε)-approximation algorithm for the
2-means clustering problem with constant probabil-
ity in time O(2^{(1/ε)^{O(1)}} dn). This is the first dimension-independent (in the exponent) algorithm for this problem that runs in linear time.
The basic idea of our algorithm is very simple. We be-
gin with the observation of Inaba et al. [11] that given a set
of points, their centroid can be very well approximated by
sampling a constant number of points and finding the cen-
troid of this sample. So if we knew the clusters formed by
the optimal solution, we can get good approximations to the
actual centers. Of course, we do not know this fact. How-
ever, if we sample O (k) points, we know that we will get
a constant number of points from the largest cluster. Thus,
by trying all subsets of constant size from this sample, we
can essentially sample points from the largest cluster. In this
way, we can estimate the centers of large clusters. How-
ever, in order to sample from the smaller clusters, we need
to prune points from the larger clusters. This pruning has to balance two facts: we would not like to remove points from the smaller clusters, and yet we want to remove enough points from the larger clusters.
Our algorithm appears very similar in spirit to that of
Badoiu et al. [3]. In fact both these algorithms begin with the same premise of random sampling. However, in order to sample from the smaller clusters, their algorithm has to guess the sizes of the smaller clusters and the distances between clusters. This causes an O(log^k n) multiplicative fac-
tor in the running time of their algorithm. We completely
avoid this extra factor by a much more careful pruning al-
gorithm. Moreover this makes our algorithm considerably
simpler.
2. Preliminaries
Let P be a set of n points in the Euclidean space ℝ^d. Given a set of k points K, which we also denote as centers, define the k-means cost of P with respect to K, ∆(P, K), as
∆(P, K) = Σ_{p∈P} d(p, K)^2,
where d(p, K) denotes the distance between p and the closest point to p in K. The k-means problem seeks to find a set¹ K of size k such that ∆(P, K) is minimized. Let ∆_k(P) denote the cost of the optimal solution to the k-means problem with respect to P.
¹ In this paper we have addressed the unconstrained problem, where this set can consist of any k points in ℝ^d.
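To make the objective concrete, the following is a minimal sketch (ours, not the paper's; the function name and the (n, d)/(k, d) array layout are assumptions) of the cost ∆(P, K) that the algorithm tries to minimize:

```python
import numpy as np

def kmeans_cost(P: np.ndarray, K: np.ndarray) -> float:
    """Delta(P, K): sum over p in P of the squared distance from p to its nearest center in K.

    P is an (n, d) array of points, K is a (k, d) array of centers.
    """
    # Pairwise squared distances between every point and every center, shape (n, k).
    sq_dists = ((P[:, None, :] - K[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest center.
    return float(sq_dists.min(axis=1).sum())
```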

If K happens to be a singleton set {y}, then we shall de-
note ∆(P, K) by ∆(P, y). Similar comments apply when
P is a singleton set.
Definition 2.1. We say that the point set P is (k, ε)-irreducible if ∆_{k−1}(P) ≥ (1 + 32ε)∆_k(P). Otherwise we say that the point set is (k, ε)-reducible.
Reducibility basically captures the fact that if instead of
finding the optimal k-means solution, we find the optimal
(k 1)-means solution, we will still be close to the former
solution. We now look at some properties of the 1-means
problem.
2.1. Properties of the 1-means problem
Definition 2.2. For a set of points P , define the centroid,
c(P ), of P as the point
(Σ_{p∈P} p) / |P|.
For any point x ∈ ℝ^d, it is easy to check that
∆(P, x) = ∆(P, c(P)) + |P| · ∆(c(P), x).    (1)
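For completeness, here is the short calculation behind (1), which the text leaves to the reader (standard algebra, added as a reading aid):

```latex
\begin{aligned}
\Delta(P, x) &= \sum_{p \in P} \lVert p - x \rVert^2 \\
             &= \sum_{p \in P} \lVert p - c(P) \rVert^2
                + 2 \Big( \sum_{p \in P} \big(p - c(P)\big) \Big) \cdot \big(c(P) - x\big)
                + |P| \, \lVert c(P) - x \rVert^2 .
\end{aligned}
```

Since Σ_{p∈P} (p − c(P)) = 0 by the definition of the centroid, the middle term vanishes, which gives exactly ∆(P, x) = ∆(P, c(P)) + |P| · ∆(c(P), x).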
From this we can make the following observation.
Fact 2.1. Any optimal solution to the 1-means problem with
respect to an input point set P chooses c(P) as the center.
We can also deduce an important property of any opti-
mal solution to the k-means problem. Suppose we are given
an optimal solution to the k-means problem with respect
to the input P. Let K = {x_1, . . . , x_k} be the set of centers constructed by this solution. K produces a partitioning of the point set P into k clusters, namely, P_1, . . . , P_k. P_i is the set of points for which the closest point in K is x_i. In other words, the clusters correspond to the points in the Voronoi regions in ℝ^d with respect to K. Now, Fact 2.1 implies that x_i must be the centroid of P_i for all i.
Since we will be interested in fast algorithms for comput-
ing good approximations to the k-means problem, we first
consider the case k = 1. Inaba et al. [11] showed that the
centroid of a small random sample of points in P can be a
good approximation to c(P ).
Lemma 2.2. [11] Let T be a set of m points obtained by independently sampling m points uniformly at random from a point set P. Then, for any δ > 0,
∆(P, c(T)) < (1 + 1/(δm)) ∆_1(P)
holds with probability at least 1 − δ.
Therefore, if we choose m as 2/ε, then with probability at least 1/2, we get a (1 + ε)-approximation to ∆_1(P) by taking the center as the centroid of T. Thus, a constant size sample can quickly yield a good approximation to the optimal 1-means solution.
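As an illustration of how Lemma 2.2 is typically used (our own sketch with hypothetical names; the paper itself only needs the statement): sample m = ⌈2/ε⌉ points, return their centroid, and with probability at least 1/2 the resulting 1-means cost is within a (1 + ε) factor of ∆_1(P).

```python
import numpy as np

def approx_1means_center(P: np.ndarray, eps: float, rng: np.random.Generator) -> np.ndarray:
    """Centroid of a uniform random sample of m = ceil(2/eps) points of P.

    By Lemma 2.2 with delta = 1/2, Delta(P, c) <= (1 + eps) * Delta_1(P)
    holds with probability at least 1/2 for the returned point c.
    """
    m = int(np.ceil(2.0 / eps))
    # Independent uniform sampling corresponds to drawing indices with replacement.
    idx = rng.integers(0, len(P), size=m)
    return P[idx].mean(axis=0)
```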
Suppose P′ is a subset of P and we want to get a good approximation to the optimal 1-means for the point set P′. Following Lemma 2.2, we would like to sample from P′. But the problem is that P′ is not explicitly given to us. The following lemma states that if the size of P′ is close to that of P, then we can sample a slightly larger set of points from P and hopefully this sample would contain enough random samples from P′. Let us define things more formally first.
Let P be a set of points and P′ be a subset of P such that |P′| ≥ β|P|, where β is a constant between 0 and 1. Suppose we take a sample S of size 4/(βε) from P. Now we consider all possible subsets of size 2/ε of S. For each of these subsets S′, we compute its centroid c(S′), and consider this as a potential center for the 1-means problem instance on P′. In other words, we consider ∆(P′, c(S′)) for all such subsets S′. The following lemma shows that one of these subsets must give a close enough approximation to the optimal 1-means solution for P′.
Lemma 2.3. (Superset Sampling Lemma) The following event happens with constant probability:
min_{S′ : S′ ⊆ S, |S′| = 2/ε} ∆(P′, c(S′)) ≤ (1 + ε)∆_1(P′).
Proof. With constant probability, S contains at least 2/ε points from P′. The rest follows from Lemma 2.2.
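The way the lemma is applied later can be sketched as follows (our own code, not the paper's; the sample size 4/(βε) and the subset size 2/ε come from the discussion above, everything else is an assumption). The point is that one of the returned candidate centroids is, with constant probability, a (1 + ε)-approximate 1-means center for the unknown subset P′:

```python
from itertools import combinations
import numpy as np

def candidate_centers(P: np.ndarray, beta: float, eps: float,
                      rng: np.random.Generator) -> list:
    """Centroids of all subsets of size ceil(2/eps) of a uniform sample of size ceil(4/(beta*eps)).

    If P' is an (unknown) subset of P with |P'| >= beta * |P|, then with constant
    probability some returned centroid c satisfies Delta(P', c) <= (1 + eps) * Delta_1(P').
    """
    sample_size = int(np.ceil(4.0 / (beta * eps)))
    subset_size = int(np.ceil(2.0 / eps))
    S = P[rng.integers(0, len(P), size=sample_size)]
    # One candidate per subset; the number of candidates depends only on beta and eps, not on n.
    return [S[list(c)].mean(axis=0) for c in combinations(range(sample_size), subset_size)]
```

Note that the number of candidates is independent of n, which is what later lets the algorithm try all of them without leaving linear time.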
We use the standard notation B(p, r) to denote the open
ball of radius r around a point p.
We assume the input parameter ε for the approximation factor satisfies 0 < ε ≤ 1.
3. A linear time algorithm for 2-means clustering
Before considering the k-means problem, we consider
the 2-means problem. This contains many of the ideas in-
herent in the more general algorithm. So it will make it eas-
ier to understand the more general algorithm.
Theorem 3.1. Given a point set P of size n in ℝ^d, there exists an algorithm which produces a (1 + ε)-approximation to the optimal 2-means solution on the point set P with constant probability. Further, this algorithm runs in time O(2^{(1/ε)^{O(1)}} dn).
Proof. Let α = ε/64. We can assume that P is (2, α)-irreducible. Indeed suppose P is (2, α)-reducible. Then ∆_1(P) ≤ (1 + ε/2)∆_2(P). We can get a solution to the 1-means problem for P by computing the centroid of P in O(nd) time. The cost of this solution is at most (1 + ε/2)∆_2(P). Thus we have shown the theorem if P is (2, α)-reducible.
Consider an optimal 2-means solution for P. Let c_1 and c_2 be the two centers in this solution. Let P_1 be the points which are closer to c_1 than c_2 and P_2 be the points closer to c_2 than c_1. So c_1 is the centroid of P_1 and c_2 that of P_2. Without loss of generality, assume that |P_1| ≥ |P_2|.
Since |P_1| ≥ |P|/2, Lemma 2.3 implies that if we sample a set S of size O(1/ε) from P and look at the set of centroids of all subsets of S of size 2/ε, then at least one of these centroids, call it c′_1, has the property that ∆(P_1, c′_1) ≤ (1 + α)∆(P_1, c_1). Since our algorithm is going to cycle through all such subsets of S, we can assume that we have found such a point c′_1.
Let the distance between c_1 and c_2 be t, i.e., d(c_1, c_2) = t.
Lemma 3.2. d(c_1, c′_1) ≤ t/4.
Proof. Suppose d(c_1, c′_1) > t/4. Equation (1) implies that
∆(P_1, c′_1) − ∆(P_1, c_1) = |P_1| · ∆(c_1, c′_1) ≥ t^2|P_1|/16.
But we also know that the left hand side is at most α∆(P_1, c_1). Thus we get t^2|P_1| ≤ 16α∆(P_1, c_1).
Applying Equation (1) once again, we see that
∆(P_1, c_2) = ∆(P_1, c_1) + t^2|P_1| ≤ (1 + 16α)∆(P_1, c_1).
Therefore, ∆(P, c_2) ≤ (1 + 16α)∆(P_1, c_1) + ∆(P_2, c_2) ≤ (1 + 16α)∆_2(P). This contradicts the fact that P is (2, α)-irreducible.
Now consider the ball B(c′_1, t/4). The previous lemma implies that this ball is contained in the ball B(c_1, t/2) of radius t/2 centered at c_1. So B(c′_1, t/4) is contained in P_1. Since we are looking for the point c_2, we can delete the points in this ball and hope that the resulting point set has a good fraction of points from P_2.
This is what we prove next. Let P′_1 denote the point set P_1 − B(c′_1, t/4). Let P′ denote P′_1 ∪ P_2. As we noted above, P_2 is a subset of P′.
Claim 3.3. |P_2| ≥ α|P′_1|.
Proof. Suppose not, i.e., |P_2| < α|P′_1|. Notice that
∆(P_1, c′_1) ≥ ∆(P′_1, c′_1) ≥ t^2|P′_1|/16.
Since ∆(P_1, c′_1) ≤ (1 + α)∆(P_1, c_1), it follows that
t^2|P′_1| ≤ 16(1 + α)∆(P_1, c_1).    (2)
So,
∆(P, c_1) = ∆(P_1, c_1) + ∆(P_2, c_1)
          = ∆(P_1, c_1) + ∆(P_2, c_2) + t^2|P_2|
          ≤ ∆(P_1, c_1) + ∆(P_2, c_2) + 16α(1 + α)∆(P_1, c_1)
          ≤ (1 + 32α)∆(P_1, c_1) + ∆(P_2, c_2)
          ≤ (1 + 32α)∆_2(P),
where the second equation follows from (1), while the third inequality follows from (2) and the fact |P_2| < α|P′_1|. But this contradicts the fact that P is (2, α)-irreducible. This proves the claim.
The above claim combined with Lemma 2.2 implies that if we sample O(1/α^2) points from P′, and consider the centroids of all subsets of size 2/α in this sample, then with constant probability we shall get a point c′_2 for which ∆(P_2, c′_2) ≤ (1 + α)∆(P_2, c_2). Thus, we get the centers c′_1 and c′_2 which satisfy the requirements of our lemma.
The only problem is that we do not know the value of
the parameter t. We will somehow need to guess this value
and yet maintain the fact that our algorithm takes only lin-
ear amount of time.
We can assume that we have found c′_1 (this does not require any assumption on t). Now we need to sample from P′ (recall that P′ is the set of points obtained by removing the points in P distant at most t/4 from c′_1). Suppose we know the parameter i such that
n/2^i ≤ |P′| ≤ n/2^{i−1}.
Consider the points of P in descending order of distance from c′_1. Let Q′_i be the first n/2^{i−1} points in this sequence. Notice that P′ is a subset of Q′_i and |P′| ≥ |Q′_i|/2. Also we can find Q′_i in linear time (because we can locate the point at position n/2^{i−1} in linear time). Since |P_2| ≥ α|P′|, we see that |P_2| ≥ α|Q′_i|/2. Thus, Lemma 2.2 implies that it is enough to sample O(1/α^2) points from Q′_i to locate c′_2 (with constant probability of course).
But the problem with this scheme is that we do not know the value i. One option is to try all possible values of i, which will imply a running time of O(n log n) (treating the terms involving α and d as constant). Also note that we cannot use approximate range searching because preprocessing takes O(n log n) time.
We somehow need to combine the sampling and the idea of guessing the value of i. Our algorithm proceeds as follows. It tries values of i in the order 0, 1, 2, . . .. In iteration i, we find the set of points Q′_i. Note that Q′_{i+1} is a subset of Q′_i. In fact Q′_{i+1} is the half of Q′_i which is farther from c′_1. So in iteration (i+1), we can begin from the set of points Q′_i (instead of P′). We can find the candidate point c′_2 by sampling from Q′_{i+1}. Thus we can find Q′_{i+1} in time linear in |Q′_{i+1}| only.
Further, in iteration i, we also maintain the sum ∆(P − Q′_i, c′_1). Since ∆(P − Q′_{i+1}, c′_1) = ∆(P − Q′_i, c′_1) + ∆(Q′_i − Q′_{i+1}, c′_1), we can compute ∆(P − Q′_{i+1}, c′_1) in iteration i + 1 in time linear in |Q′_{i+1}|. This is needed because when we find a candidate c′_2 in iteration i + 1, we need to compute the 2-means solution when all points in P − Q′_i are assigned to c′_1 and the points in Q′_i are assigned to the nearer of c′_1 and c′_2. We can do this in time linear in |Q′_{i+1}| if we maintain the quantities ∆(P − Q′_i, c′_1) for all i.
Thus, we see that iteration i takes time linear in |Q′_i|. Since the |Q′_i|'s decrease by a factor of 2, the overall running time for a given value of c′_1 is O(2^{(1/ε)^{O(1)}} dn). Since the number of possible candidates for c′_1 is O(2^{(1/ε)^{O(1)}}), the running time is as stated.
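The bookkeeping described above can be summarized in the following sketch (ours, for one fixed candidate c′_1; names, constants and the use of a single upfront sort instead of repeated linear-time selection are simplifications, not the paper's exact procedure):

```python
from itertools import combinations
import numpy as np

def best_cost_for_fixed_c1(P: np.ndarray, c1: np.ndarray, alpha: float,
                           rng: np.random.Generator) -> float:
    """Halving scheme for one candidate c'_1: iteration i works on Q_i, the points
    farthest from c1, while `removed_cost` holds Delta(P - Q_i, c1) for the rest."""
    dist2 = ((P - c1) ** 2).sum(axis=1)        # squared distances to c1
    Q = np.argsort(-dist2)                     # indices in descending distance
                                               # (the paper uses linear-time selection per halving)
    best = float(dist2.sum())                  # cost of assigning every point to c1
    removed_cost = 0.0                         # Delta(P - Q_i, c1)
    while len(Q) >= 1:
        # Candidate c'_2's from Q via superset sampling (sizes indicative only).
        sample_size = min(len(Q), int(np.ceil(4.0 / alpha ** 2)))
        subset_size = min(sample_size, int(np.ceil(2.0 / alpha)))
        S = P[Q[rng.integers(0, len(Q), size=sample_size)]]
        for comb in combinations(range(sample_size), subset_size):
            c2 = S[list(comb)].mean(axis=0)
            # Points outside Q_i go to c1; points in Q_i go to the nearer of c1 and c2.
            d2_c2 = ((P[Q] - c2) ** 2).sum(axis=1)
            best = min(best, removed_cost + float(np.minimum(dist2[Q], d2_c2).sum()))
        # Halve Q_i: charge the nearer half to c1 and keep only the farther half.
        half = len(Q) // 2
        removed_cost += float(dist2[Q[half:]].sum())
        Q = Q[:half]
    return best
```

For any fixed α, the inner subset enumeration is the dominant but n-independent factor, which matches the 2^{(1/ε)^{O(1)}} term in the stated running time.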
Claim 3.4. The cost, ∆, reported by the algorithm satisfies ∆_2(P) ≤ ∆ ≤ (1 + α)∆_2(P).
Proof. ∆ ≥ ∆_2(P) is obvious as we are associating each point with one of the 2 centers being reported and accumulating the corresponding cost. Now, consider the case when we have the candidate center set where each center is a (1 + α)-approximate centroid of its respective cluster. As we are associating each point to the approximate centroid of the corresponding cluster or a center closer than it, it follows that ∆ ≤ (1 + α)∆_2(P). If we report the minimum cost clustering, C, then since the actual cost of the clustering (due to the corresponding Voronoi partitioning) can only be better than the cost that we report (because we associate some points with approximate centroids of the corresponding cluster rather than the closest center), we have ∆(C) ≤ (1 + α)∆_2(P).
This proves the theorem.
4. A linear time algorithm for k-means clustering
We now present the general k-means algorithm. We first
present a brief outline of the algorithm.
4.1. Outline
Our algorithm begins on the same lines as the 2-means
algorithm. Again, we can assume that the solution is ir-
reducible, i.e., removing one of the centers does not cre-
ate a solution which has cost within a small factor of the
optimal solution. Consider an optimal solution which has
centers c
1
, . . . , c
k
and which correspondingly partitions the
point set P into clusters P
1
, . . . , P
k
. Assume that |P
1
|
· · · |P
k
|. Our goal again will be to find approximations
c
0
1
, . . . , c
0
k
to c
1
, . . . , c
k
respectively.
Suppose we have found centers c′_1, . . . , c′_i. Suppose t is the distance between the closest pair of centroids {c_1, . . . , c_i} and {c_{i+1}, . . . , c_k}. As in the case of k = 2, we can show that the points at distance at most t/4 from {c′_1, . . . , c′_i} get assigned to c_1, . . . , c_i by the optimal solution. So, we can delete these points. Now we can show that among the remaining points, the size of P_{i+1} is significant. Therefore, we can use random sampling to obtain a center c′_{i+1} which is a pretty good estimate of c_{i+1}. Of course we do not know the value of t, and so a naive implementation of this idea gives an O(n(log n)^k) time algorithm.
Algorithm k-means(P, k, ε)
Inputs : Point set P ,
Number of clusters k,
Approximation ratio ε.
Output : k-means clustering of P .
1. For i = 1 to k do
Obtain the clustering
Irred-k-means(P, i, i, ∅, ε/64k, 0).
2. Return the clustering which has minimum cost.
Figure 1. The k-means Algorithm
So far the algorithm looks very similar to the k = 2
case. But now we want to modify it to a linear time algo-
rithm. This is where the algorithm gets more involved. As
mentioned above, we can not guess the parameter t. So we
try to guess the size of the point set obtained by removing
the balls of radius t/4 around {c_1, . . . , c_i}. So we work with
the remaining point set with the hope that the time taken for
this remaining point set will also be small and so the over-
all time will be linear. Although similar in spirit to the k = 2
case, we still need to prove some more details in this case.
Now, we describe the actual k-means algorithm.
4.2. The algorithm
The algorithm is described in Figures 1 and 2. Figure 1
is the main algorithm. The inputs are the point set P , k and
an approximation factor ε. Let α denote ε/64k. The algo-
rithm k-means(P , k, ε) tries to find the highest i such that
P is (i, α)-irreducible. Essentially we are saying that it is
enough to find i centers only. Since we do not know this
value of i, the algorithm tries all possible values of i.
We now describe the algorithm Irred-k-means(Q, m, k, C, α, Sum). We have found a set C of k − m centers already. The points in P − Q have been assigned to C. We need to assign the remaining points in Q. The case m = 0 is clear. In step 2, we try to find a new center by the random sampling method. This will work provided a good fraction of the points in Q do not get assigned to C. If this is not the case then in step 3, we assign half of the points in Q to C and call the algorithm recursively with this reduced point set. For the base case, when |C| = 0, as P_1 is the largest cluster, we require to sample only O(k) points. This is tackled in Step 2. Step 3 is not performed in this case, as there are no centers.
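Since Figure 2 itself is not reproduced above, the following is only a structural sketch of Irred-k-means reconstructed from this description (the sample and subset sizes are placeholders rather than the paper's constants, and all names are ours); the driver at the bottom mirrors Figure 1:

```python
from itertools import combinations
import numpy as np

def cost_to_centers(Q: np.ndarray, C: list) -> float:
    """Sum of squared distances from each point of Q to its nearest center in C."""
    centers = np.asarray(C)
    d2 = ((Q[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def irred_k_means(Q: np.ndarray, m: int, k: int, C: list, alpha: float,
                  acc: float, rng: np.random.Generator) -> float:
    """Skeleton of Irred-k-means(Q, m, k, C, alpha, Sum): Q are the remaining points,
    m centers are still to be found, C holds the k - m centers found so far, and
    acc is the cost of the points of P - Q already assigned to C."""
    if m == 0:
        # All centers found: assign the remaining points to their nearest center in C.
        return acc + cost_to_centers(Q, C)
    best = float("inf")
    # Step 2: guess a new center by superset sampling on Q, recurse with one fewer center to find.
    sample_size = min(len(Q), max(1, int(np.ceil(4.0 * k / alpha))))
    subset_size = min(sample_size, max(1, int(np.ceil(2.0 / alpha))))
    S = Q[rng.integers(0, len(Q), size=sample_size)]
    for comb in combinations(range(sample_size), subset_size):
        c_new = S[list(comb)].mean(axis=0)
        best = min(best, irred_k_means(Q, m - 1, k, C + [c_new], alpha, acc, rng))
    # Step 3 (only if some centers exist): assign the half of Q nearest to C, recurse on the rest.
    if C and len(Q) > 1:
        d2 = ((Q[:, None, :] - np.asarray(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        order = np.argsort(d2)                      # nearest to C first
        near, far = order[: len(Q) // 2], order[len(Q) // 2:]
        best = min(best, irred_k_means(Q[far], m, k, C, alpha, acc + float(d2[near].sum()), rng))
    return best

def k_means(P: np.ndarray, k: int, eps: float, rng: np.random.Generator) -> float:
    """Driver of Figure 1: try i = 1, ..., k centers and keep the cheapest cost found."""
    alpha = eps / (64.0 * k)
    return min(irred_k_means(P, i, i, [], alpha, 0.0, rng) for i in range(1, k + 1))
```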
