Linear-time approximation schemes for clustering problems in any dimensions

08 Feb 2010 · Journal of the ACM (ACM) · Vol. 57, Iss. 2, pp. 5

Linear-Time Approximation Schemes for Clustering Problems in
any Dimensions
Amit Kumar
Dept of Comp Sc & Engg
Indian Institute of
Technology
New Delhi-110016, India
amitk@cse.iitd.ernet.in
Yogish Sabharwal
IBM Research - India
Plot 4, Block C,
Vasant Kunj Inst. Area
New Delhi-110070, India
ysabharwal@in.ibm.com
Sandeep Sen
Dept of Comp Sc & Engg
Indian Institute of
Technology
New Delhi-110016, India
ssen@cse.iitd.ernet.in
September 23, 2009
Abstract
We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1 + ε) approximations with probability ≥ 1/2 and running times of O(2^{(k/ε)^{O(1)}} dn). These are the first algorithms for these problems whose running times are linear in the size of the input (nd for n points in d dimensions) assuming k and ε are fixed. Our method is general enough to be applicable to clustering problems satisfying certain simple properties and is likely to have further applications.
1 Introduction
The problem of clustering a group of data items into similar groups is one of the most widely studied
problems in computer science. Clustering has applications in a variety of areas, for example, data mining, information retrieval, image processing, and web search ([5, 9, 22, 11]). Given the wide range of applications, many different definitions of clustering exist in the literature ([10, 4]). Most of these definitions begin by defining a notion of distance (similarity) between two data items and then try to form clusters so that data items with small distance between them get clustered together.

Often, clustering problems arise in a geometric setting, i.e., the data items are points in a high dimensional Euclidean space. In such settings, it is natural to define the distance between two points as the Euclidean distance between them. Two of the most popular definitions of clustering are the k-means clustering problem and the k-median clustering problem. Given a set of points P, the k-means clustering problem seeks to find a set K of k centers such that Σ_{p∈P} d(p, K)^2 is minimized, whereas the k-median clustering problem seeks to find a set K of k centers such that Σ_{p∈P} d(p, K) is minimized. Note that the points in K can be arbitrary points in the Euclidean space. Here d(p, K) refers to the distance between p and the closest center in K. We can think of this as each point in P being assigned to the closest center. The points that get assigned to the same center form a cluster. These problems are NP-hard even for k = 2 (when the dimension is not fixed) [7].
Preliminary versions of the results have appeared earlier in IEEE Symposium on Foundations of Computer Science, 2004 [17] and International Colloquium on Automata, Languages and Programming, 2005 [18].
Interestingly, the center in the optimal solution to the 1-mean problem is the same as the center of mass of the points. However, in the case of the 1-median problem, also known as the Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a closed form, we can obtain an approximation to the optimal 1-median in O(1) time (independent of the number of points). There are many useful variations of these clustering problems; for example, in the discrete versions of these problems, the centers that we seek should belong to the input set of points.
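To make the two objectives concrete, here is a small sketch (ours, not from the paper; the function names are illustrative) that evaluates both costs for a given candidate set of centers K, using d(p, K) as defined above.

```python
import math

def dist(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_to_centers(p, K):
    # d(p, K): distance from p to the closest center in K.
    return min(dist(p, c) for c in K)

def k_means_cost(P, K):
    # Sum over all points of the squared distance to the nearest center.
    return sum(d_to_centers(p, K) ** 2 for p in P)

def k_median_cost(P, K):
    # Sum over all points of the distance to the nearest center.
    return sum(d_to_centers(p, K) for p in P)

# Tiny example: two well-separated groups in the plane, one center near each.
P = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
K = [(0.05, 0.0), (5.05, 5.0)]
print(k_means_cost(P, K), k_median_cost(P, K))
```

The induced clusters are simply the points sharing a nearest center, matching the description above.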
1.1 Related work
A lot of research has been devoted to solving these problems exactly (see [14] and the references therein). Even the best known algorithms for the k-median and the k-means problems take at least Ω(n^d) time. Recently, more attention has been devoted to finding (1 + ε)-approximation algorithms for these problems, where ε can be an arbitrarily small constant. This has resulted in algorithms with substantially improved running times. Further, if we look at the applications of these problems, they often involve mapping subjective features to points in a Euclidean space. Since there is an error inherent in this mapping, finding a (1 + ε)-approximate solution is within acceptable limits for the actual applications.
The fastest exact algorithm for the k-means clustering problem was proposed by Inaba et al. [14]. They observed that the number of Voronoi partitions of k points in R^d is O(n^{kd}) and so the optimal k-means clustering can be determined exactly in time O(n^{kd+1}). They also proposed a randomized (1 + ε)-approximation algorithm for the 2-means clustering problem with running time O(n/ε^d). Matoušek [19] proposed a deterministic (1 + ε)-approximation algorithm for the k-means problem with running time O(n ε^{−2k^2 d} log^k n).
By generalizing the technique of Arora [1], Arora et al. [2] presented an O(n^{O(1/ε)+1}) time (1 + ε)-approximation algorithm for the k-median problem where the points lie in the plane. This was significantly improved by Kolliopoulos et al. [16], who proposed an algorithm with a running time of O(ρ n log n log k) for the discrete version of the problem, where the medians must belong to the input set and ρ = exp[O((1 + log 1/ε)/ε)^{d−1}].
Recently, Badoiu et al. [3] proposed a (1 + ε)-approximation algorithm for k-median clustering with a running time of O(2^{(k/ε)^{O(1)}} d^{O(1)} n log^{O(k)} n). Their algorithm can be extended to k-means with some modifications. de la Vega et al. [8] proposed a (1 + ε)-approximation algorithm for the k-means problem which works well for points in high dimensions. The running time of this algorithm is O(g(k, ε) n log^k n) where g(k, ε) = exp[(k^3/ε^8)(ln(k/ε)) ln k].
Recently, Har-Peled and Mazumdar [13] proposed (1 + ε)-approximation algorithms for the k-median, discrete k-median and k-means clustering problems in low dimensions. They obtained a running time of O(n + k^{O(1)} log^{O(1)} n) for the k-median problem, O(n + k^{O(1)} log^{O(1)} n) for the discrete k-median problem and O(n + k^{k+2} ε^{−(2d+1)k} log^{k+1} n log^k(1/ε)) for the k-means problem. For approximating the 1-median of a set of points, Indyk [15] proposed an algorithm that finds a (1 + ε)-approximate 1-median in time O(n/ε^2) with constant probability.
Figure 1 summarizes the recent results for these problems in the context of (1 + ε)-approximation algorithms. Some of these algorithms are randomized, with the expected running time bound holding for any input.
1.2 Our results and techniques
The general algorithm we present solves a large class of clustering problems satisfying a set of conditions (cf. Section 3). We show that the k-means, k-median and discrete k-means problems all satisfy the required conditions and therefore belong to this class of clustering problems.

Problem     Result                                                                Reference
-------------------------------------------------------------------------------------------------
1-median    O(n/ε^2)                                                              Indyk [15]
k-median    O(n^{O(1/ε)+1}) for d = 2                                             Arora [1]
            O(ρ n log n log k) (discrete only),                                   Kolliopoulos et al. [16]
              where ρ = exp[O((1 + log 1/ε)/ε)^{d−1}]
            O(2^{(k/ε)^{O(1)}} d^{O(1)} n log^{O(k)} n)                           Badoiu et al. [3]
            O(n + k^{O(1)} log^{O(1)} n) (discrete also)                          Har-Peled et al. [13]
k-means     O(n ε^{−2k^2 d} log^k n)                                              Matoušek [19]
            O(g(k, ε) n log^k n),                                                 de la Vega et al. [8]
              where g(k, ε) = exp[(k^3/ε^8)(ln(k/ε)) ln k]
            O(n + k^{k+2} ε^{−(2d+1)k} log^{k+1} n log^k(1/ε)) (discrete also)    Har-Peled et al. [13]

Figure 1: Summary of previous results on k-means and k-median clustering.
One important condition that the clustering problems must satisfy is the existence of an algorithm to generate a candidate set of points such that at least one of these points is a close approximation to the optimal center for k = 1 (a single cluster). Further, the running time of this algorithm as well as the size of this candidate set should be independent of n. Based on such a subroutine, we show how to approximate all the centers in the optimal solution in an iterative manner.
The running times of O(2^{(k/ε)^{O(1)}} nd) of our algorithms are better than those of the previously known algorithms for these problems, especially when d is very large. In fact, these are the first algorithms for the k-means, k-median and the discrete k-means clustering problems that have running time linear in the size of the input for fixed k and ε. The algorithms in this paper have the additional advantage of simplicity, as the only technique involved is random sampling. Our method is based on using random sampling to identify a small set of candidate centers. In contrast, an alternate strategy [13] involves identifying significantly smaller sets of points, called coresets, such that solving the clustering problem on the coresets yields a solution for the original set of points. In a subsequent work [6], further improvements were obtained using a clever combination of the two techniques (cf. Section 8).

The main drawback of our algorithm is that the running time has exponential dependence on k/ε^{O(1)}. We would however like to note that Guruswami and Indyk [12] showed that it is NP-hard to obtain a PTAS for the k-median problem for arbitrary k and d = Ω(log n). Since we avoid exponential dependence on d, this implies that the exponential dependence on k is inherent. An algorithm that avoids exponential dependence on k, like that of Arora et al. [2], has doubly exponential dependence on d, which is arguably worse in most situations.
We also present a randomized (1 + ε)-approximation algorithm for the 1-median problem which runs in time O(2^{1/ε^{O(1)}} d), assuming that the points are stored in a suitable data structure such as an array, where a point can be randomly sampled in constant time. All our algorithms yield the desired result with constant probability (which can be made as close to 1 as we wish by a constant number of repetitions).
The rest of the paper is organized as follows. In Section 2, we define clustering problems. In Section 3 we present a simplified algorithm for the 2-means clustering problem. In Section 4, we describe a general approach for solving clustering problems efficiently. In subsequent sections we give applications of the general method by showing that this class of problems includes the k-means, k-median and discrete k-means problems. In Section 5.3, we also describe an efficient approximation algorithm for the 1-median problem. In Section 7, we extend our algorithms to handle weighted point sets efficiently. We conclude by stating some open problems and some interesting developments subsequent to the publication of earlier versions of this work in Section 8.
2 Clustering Problems
In this section, we give a general definition of clustering problems.
We shall define a clustering problem by two parameters: an integer k and a real-valued cost function f(Q, x), where Q is a set of points and x is a point in a Euclidean space. We shall denote this clustering problem as C(f, k). The input to C(f, k) is a set of points in a Euclidean space.

Given an instance P of n points, C(f, k) seeks to partition them into k sets, which we shall denote as clusters. Let these clusters be C_1, . . . , C_k. A solution also finds k points, which we call centers, c_1, . . . , c_k. We shall say that c_i is the center of cluster C_i (or the points in C_i are assigned to c_i). The objective of the problem is to minimize the quantity Σ_{i=1}^{k} f(C_i, c_i).
This is a fairly general definition. Let us see some important special cases.
k-median: f_1(Q, x) = Σ_{q∈Q} d(q, x).

k-means: f_2(Q, x) = Σ_{q∈Q} d(q, x)^2.

We can also encompass the discrete versions of these problems, i.e., cases where the centers have to be one of the points in P. In such problems, we can make f(Q, x) unbounded if x ∉ Q.

As stated earlier, we shall assume that we are given a constant ε > 0, and we are interested in finding (1 + ε)-approximation algorithms for these clustering problems.
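As a concrete (and purely illustrative) rendering of these special cases, the cost functions can be written directly and plugged into the C(f, k) objective; the helper dist is the Euclidean distance as in the sketch in Section 1.

```python
import math

def dist(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def f1(Q, x):
    # k-median cost of assigning all points of cluster Q to center x.
    return sum(dist(q, x) for q in Q)

def f2(Q, x):
    # k-means cost: squared distances from the points of Q to center x.
    return sum(dist(q, x) ** 2 for q in Q)

def f2_discrete(Q, x):
    # Discrete variant as described above: unbounded cost if x is not in Q.
    return f2(Q, x) if x in Q else float("inf")

def clustering_objective(clusters, centers, f):
    # Objective of C(f, k): the sum of f(C_i, c_i) over the k clusters.
    return sum(f(C, c) for C, c in zip(clusters, centers))
```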
We now give some definitions. Let us fix a clustering problem C(f, k). Although we should parametrize all our definitions by f, we avoid this because the clustering problem will be clear from the context.

Definition 2.1. Given a point set P, let OPT_k(P) be the cost of the optimal solution to the clustering problem C(f, k) on input P.

Definition 2.2. Given a set of points P and a set of k points C, let OPT_k(P, C) be the cost of the optimal solution to C(f, k) on P when the set of centers is C.
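For the per-point separable costs f_1 and f_2 above, OPT_k(P, C) with a fixed center set C is attained by assigning every point to a nearest center; the sketch below (ours, valid under that separability assumption rather than for arbitrary f) computes it this way, reusing dist and f2 from the previous sketch.

```python
def opt_with_fixed_centers(P, C, f):
    # OPT_k(P, C) for separable costs: assign each point to a nearest center
    # and sum f over the induced (possibly empty) clusters.
    clusters = [[] for _ in C]
    for p in P:
        i = min(range(len(C)), key=lambda j: dist(p, C[j]))
        clusters[i].append(p)
    return sum(f(Q, c) for Q, c in zip(clusters, C) if Q)

# Example: OPT_2(P, C) under the k-means cost f2.
P = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
C = [(0.0, 0.0), (5.0, 5.0)]
print(opt_with_fixed_centers(P, C, f2))
```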
3 Algorithm for 2-means Clustering
In this section we describe the algorithm for 2-means clustering. The 2-means clustering algorithm contains many of the ideas inherent in the more general algorithm. This makes it easier to understand the more general algorithm described in the next section.
Consider an instance of the 2-means problem where we are given a set P of n points in R^d. We seek to solve C(f_2, 2), where f_2 corresponds to the k-means cost function as defined in Section 2.
We first look at some properties of the 1-means problem.
Definition 3.1. For a set of points P, define the centroid, c(P), of P as the point (Σ_{p∈P} p) / |P|.

Claim 3.1. For any point x ∈ R^d,

    f_2(P, x) = f_2(P, c(P)) + |P| · d(c(P), x)^2.    (1)

Proof.

    f_2(P, x) = Σ_{p∈P} ||p − x||^2
              = Σ_{p∈P} ||p − c(P) + c(P) − x||^2
              = Σ_{p∈P} ||p − c(P)||^2 + Σ_{p∈P} ||c(P) − x||^2
              = f_2(P, c(P)) + |P| · d(c(P), x)^2

where the second last equality follows from the fact that Σ_{p∈P} (p − c(P)) = 0, so the cross term 2 Σ_{p∈P} (p − c(P)) · (c(P) − x) vanishes.
From this we can make the following observation.
Fact 3.2. Any optimal solution to the 1-means problem with respect to an input point set P chooses
c(P ) as the center.
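The identity in Claim 3.1 (and hence Fact 3.2) is easy to check numerically; the following small sketch (ours) compares both sides of Equation (1) on random points.

```python
import random

def centroid(P):
    # c(P): coordinate-wise mean of the point set.
    d = len(P[0])
    return tuple(sum(p[i] for p in P) / len(P) for i in range(d))

def f2(Q, x):
    # k-means cost of assigning the points of Q to the single center x.
    return sum(sum((a - b) ** 2 for a, b in zip(q, x)) for q in Q)

random.seed(0)
P = [tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(50)]
x = (2.0, -1.0, 0.5)
c = centroid(P)
lhs = f2(P, x)
rhs = f2(P, c) + len(P) * sum((a - b) ** 2 for a, b in zip(c, x))
assert abs(lhs - rhs) < 1e-9  # Equation (1): both sides agree.
```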
We can deduce an important property of any optimal solution to the 2-means clustering problem. Suppose we are given an optimal solution to the 2-means clustering problem with respect to the input P. Let C = {c_1, c_2} be the set of centers constructed by this solution. C produces a partitioning of the point set P into 2 clusters, namely, P_1 and P_2. P_i is the set of points for which the closest point in C is c_i. In other words, the clusters correspond to the points in the Voronoi regions in R^d with respect to C. Now, Fact 3.2 implies that c_i must be the centroid of P_i for i = 1, 2.
Inaba et al. [14] showed that the centroid of a small random sample of points in P can be a good approximation to c(P).

Lemma 3.3. [14] Let T be a set obtained by independently sampling m points uniformly at random from a point set P. Then, for any δ > 0,

    f_2(P, c(T)) < (1 + 1/(δm)) OPT_1(P)

holds with probability at least 1 − δ.

Therefore, if we choose m as 2/ε, then with probability at least 1/2, we get a (1 + ε)-approximation to OPT_1(P) by taking the center as the centroid of T. Thus, a constant size sample can quickly yield a good approximation to the optimal 1-means solution.
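A direct way to use Lemma 3.3 with δ = 1/2, sketched below under our own naming (centroid is as in the earlier sketch), is to sample m = ⌈2/ε⌉ points and return the centroid of the sample.

```python
import math
import random

def approx_1_means_center(P, eps, rng=random):
    # Sample m = ceil(2/eps) points independently and uniformly at random
    # from P and return the centroid of the sample; by Lemma 3.3 (with
    # delta = 1/2) this is a (1 + eps)-approximate 1-means center with
    # probability at least 1/2.
    m = max(1, math.ceil(2.0 / eps))
    T = [rng.choice(P) for _ in range(m)]
    return centroid(T)
```

Repeating the sampling a constant number of times and keeping the returned center of smallest f_2 cost drives the failure probability down, as noted at the end of Section 1.2.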
Suppose P′ is a subset of P and we want to get a good approximation to the optimal 1-mean for the point set P′. Following Lemma 3.3, we would like to sample from P′. But the problem is that P′ is not explicitly given to us. The following lemma states that if the size of P′ is close to that of P, then we can sample a slightly larger set of points from P and hopefully this sample will contain enough random samples from P′. Let us define things more formally first. Let P be a set of points and P′ be a subset of P such that |P′| ≥ θ|P|, where θ is a constant between 0 and 1. Suppose we take a sample S of size 4/(θε) from P. Now we consider all possible subsets of size 2/ε of S. For each of these subsets S′, we compute its centroid c(S′), and consider this as a potential center for the 1-means problem instance on P′. In other words, we consider f_2(P′, c(S′)) for all such subsets S′. The following lemma shows that one of these subsets must give a close enough approximation to the optimal 1-means solution for P′.
Lemma 3.4. The following event happens with constant probability:

    min_{S′ ⊆ S, |S′| = 2/ε} f_2(P′, c(S′)) ≤ (1 + ε) OPT_1(P′)
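The candidate-generation step described before Lemma 3.4 can be sketched as follows (our illustration; in the algorithm P′ is not known, so only the candidate list is actually computed, and the commented usage line assumes P′ were available just to illustrate the lemma). The helpers centroid and f2 are from the earlier sketches.

```python
import math
import random
from itertools import combinations

def candidate_1_means_centers(P, theta, eps, rng=random):
    # Draw a sample S of ceil(4/(theta*eps)) points uniformly from P, then
    # return the centroids of all subsets of S of size ceil(2/eps).  For
    # fixed theta and eps the number of candidates is independent of |P|.
    s = math.ceil(4.0 / (theta * eps))
    m = math.ceil(2.0 / eps)
    S = [rng.choice(P) for _ in range(s)]
    return [centroid(list(sub)) for sub in combinations(S, m)]

# Lemma 3.4: with constant probability, the best of these candidates is a
# (1 + eps)-approximate 1-means center for the (unknown) subset P_prime:
# best = min(candidate_1_means_centers(P, theta, eps),
#            key=lambda c: f2(P_prime, c))
```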
