In projective clustering we are given a set of $n$ points in $\mathbb{R}^d$ and wish to cluster them to a set $S$ of $k$ linear subspaces in $\mathbb{R}^d$ according to some given distance function. An $\varepsilon$-coreset for this problem is a weighted (scaled) subset of the input points such that, for every such possible $S$, the sum of these distances is approximated up to a factor of $(1+\varepsilon)$. We reduce the size of existing coresets by giving the first $O(\log(m))$-approximation for clustering with $m$ lines that runs in $O(ndm)$ time, compared to the existing $\exp(m)$ solution. We then project the points onto these lines and prove that, for a sufficiently large $m$, we obtain a coreset for projective clustering. Our algorithm also generalizes to handle outliers. Experimental results and open code are also provided.
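To make the projection step concrete, here is a minimal Python sketch of snapping each point to its nearest line and summing the resulting distances. It is not the paper's algorithm: it assumes the $m$ lines are already given, that each line passes through the origin, and that the distance function is Euclidean. The helper names (`project_onto_line`, `project_onto_nearest_line`) are hypothetical and chosen for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_line(p, direction):
    """Orthogonal projection of p onto the line through the origin
    spanned by `direction` (assumption: lines pass through the origin)."""
    norm = math.sqrt(dot(direction, direction))
    u = [a / norm for a in direction]
    t = dot(p, u)
    return [t * a for a in u]

def dist_to_line(p, direction):
    """Euclidean distance from p to the line spanned by `direction`."""
    q = project_onto_line(p, direction)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def project_onto_nearest_line(points, directions):
    """Project every point onto its closest line; also return the
    sum of point-to-nearest-line distances."""
    projected, total = [], 0.0
    for p in points:
        best = min(directions, key=lambda d: dist_to_line(p, d))
        projected.append(project_onto_line(p, best))
        total += dist_to_line(p, best)
    return projected, total

# Toy input: three points in the plane and the two coordinate axes as lines.
points = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
directions = [[1.0, 0.0], [0.0, 1.0]]
projected, cost = project_onto_nearest_line(points, directions)
```

In this toy input, only the first point lies off both lines, so it alone contributes to the cost. How the $m$ lines are chosen, and how the projected points are weighted to form a coreset, is the subject of the paper and is not reproduced here.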

Faster Projective Clustering Approximation of Big Data.

In this work, we study the $k$-means cost function. The (Euclidean) $k$-means problem can be described as follows: given a dataset $X \subseteq \mathbb{R}^d$ and a positive integer $k$, find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \stackrel{def}{=} \sum_{x \in X} \min_{c \in C} ||x - c||^2$ is minimized. Let $\Delta_k(X) \stackrel{def}{=} \min_{C \subseteq \mathbb{R}^d, |C| = k} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. It is simple to observe that for any dataset $X$, $\Delta_k(X)$ decreases as $k$ increases. We try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and a small precision parameter $\varepsilon > 0$, let $\mathcal{L}_{X}^{k, \varepsilon}$ denote the smallest integer such that $\Delta_{\mathcal{L}_{X}^{k, \varepsilon}}(X) \leq \varepsilon \cdot \Delta_{k}(X)$. We show upper and lower bounds on this quantity. Our techniques generalize to the metric $k$-median problem in arbitrary metric spaces, and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O \left(\mathcal{L}_{X}^{k, \frac{\varepsilon}{c}} \right)$, where $c$ is some fixed constant, such that $\Phi(S, X) \leq \varepsilon \cdot \Delta_k(X)$ using the $D^2$-sampling algorithm, which is also known as the $k$-means++ seeding procedure. We also discuss some applications of our bounds.
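The cost function $\Phi(C, X)$ and the $D^2$-sampling procedure mentioned above can be sketched in a few lines of Python. This is a minimal illustration of standard $k$-means++ seeding, not the paper's analysis or code. The function names (`kmeans_cost`, `d2_sampling`) and the toy dataset are assumptions introduced here.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance ||x - c||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans_cost(centers, data):
    """Phi(C, X): sum over x in X of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in data)

def d2_sampling(data, num_centers, rng):
    """k-means++ seeding: the first center is uniform at random; each later
    center is drawn with probability proportional to its squared distance
    to the nearest center chosen so far."""
    centers = [rng.choice(data)]
    while len(centers) < num_centers:
        weights = [min(sq_dist(x, c) for c in centers) for x in data]
        if sum(weights) == 0:
            break  # every point already coincides with a chosen center
        centers.append(rng.choices(data, weights=weights, k=1)[0])
    return centers

# Toy dataset (hypothetical): two tight pairs of points far apart.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
two_center_cost = kmeans_cost([[0.0, 0.0], [10.0, 0.0]], data)
sampled = d2_sampling(data, 4, random.Random(0))
sampled_cost = kmeans_cost(sampled, data)
```

Because a point that is already a center has sampling weight zero, drawing as many centers as there are distinct points selects every point and drives the cost to zero. This is the extreme case of the cost falling as the number of centers grows.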

References

On the k-Means/Median Cost Function.

Related Papers (5)

Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics

Tight Sensitivity Bounds For Smaller Coresets

Small space representations for metric min-sum k-clustering and their applications

Coresets for clustering in Euclidean spaces: importance sampling is nearly optimal

Minimum Coresets for Maxima Representation of Multidimensional Data