Deterministic Coresets for k-Means of Big Sparse Data

doi:10.3390/A13040092

Open Access Journal Article

Artem Barger, +1 more

14 Apr 2020

Algorithms, Vol. 13, Iss. 4, p. 92

TLDR

The paper suggests the first subset coreset for k-means whose size is independent of the dimension d; it is also the first deterministic coreset construction whose resulting size is not exponential in d.

Abstract:

Let P be a set of n points in R^d, let k ≥ 1 be an integer, and let ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates the cost of any given set Q ⊆ R^d of k centers. That is, the sum of squared distances from every point in P to its closest point in Q is the same, up to a factor of 1 ± ε, as the weighted sum of squared distances from C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to be in P, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic, and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing coresets of this kind are randomized, and their size has at least linear dependency on the dimension d. We suggest the first such coreset of size independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.
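
The coreset property described in the abstract can be checked numerically for a single candidate set of centers. The sketch below is only an illustration of the definition, not the paper's construction: it computes cost(P, Q) as the sum over points of the squared distance to the nearest center in Q, and compares it with the weighted cost of a subset C. The function names `kmeans_cost` and `coreset_error` are hypothetical, and the toy data (each point repeated three times, with the distinct points weighted by 3) is an assumption chosen so that the weighted subset reproduces the full cost.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Weighted sum of squared distances from each point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    if weights is None:
        weights = np.ones(len(points))  # unweighted input: every point counts once
    # Pairwise differences via broadcasting, shape (n_points, n_centers, d)
    diffs = points[:, None, :] - centers[None, :, :]
    # Squared Euclidean distances, shape (n_points, n_centers)
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Each point pays the squared distance to its closest center
    return float(np.dot(weights, sq_dists.min(axis=1)))

def coreset_error(P, C, w, Q):
    """Relative error of the weighted cost of (C, w) versus the cost of P, for one Q."""
    full = kmeans_cost(P, Q)
    approx = kmeans_cost(C, Q, w)
    return abs(approx - full) / full

# Toy data (assumption): three distinct points, each repeated three times.
base = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 1.0]])
P = np.repeat(base, 3, axis=0)
C = base                      # candidate coreset: the distinct points
w = np.full(len(C), 3.0)      # weight = multiplicity of each point in P
Q = np.array([[0.5, 0.5], [3.0, 1.0]])  # one arbitrary set of k = 2 centers

print(coreset_error(P, C, w, Q))
```

A genuine ε-coreset must keep this relative error at most ε for every set Q of k centers, so a check like this on a few sampled queries can only refute a candidate coreset, never certify it.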

Citations

Survey on Technique and User Profiling in Unsupervised Machine Learning Method

Online Coresets for Clustering with Bregman Divergences.

Visible-NIR spectral characterization and grade inversion modelling study of the Derni copper deposit

Overview of accurate coresets

On Coresets for Fair Regression and Individually Fair Clustering
