Open Access · Journal ArticleDOI

Deterministic Coresets for k-Means of Big Sparse Data

Artem Barger, +1 more
- 14 Apr 2020
- Vol. 13, Iss. 4, pp. 92
TLDR
The first such coreset of size independent of d is suggested, which is also the first deterministic coreset construction whose resulting size is not exponential in d.
Abstract
Let P be a set of n points in R^d, let k ≥ 1 be an integer, and let ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates any given set Q ⊆ R^d of k centers: the sum of squared distances from every point in P to its closest center in Q equals, up to a factor of 1 ± ε, the corresponding weighted sum over C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to lie in P or in other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic, and distributed data via the classic merge-and-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing coresets of this kind are randomized, and their size has at least a linear dependency on the dimension d. We suggest the first such coreset whose size is independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia, computed using Amazon’s cloud.
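To make the guarantee above concrete, the following is a minimal Python sketch (not the paper's construction) that evaluates the ε-coreset inequality for one candidate set of centers. The helper names kmeans_cost and is_eps_coreset, and the uniform subsample used to exercise them, are illustrative assumptions rather than anything from the original work.

import numpy as np

def kmeans_cost(points, centers, weights=None):
    # Sum of (weighted) squared distances from each point to its nearest center.
    if weights is None:
        weights = np.ones(len(points))
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((weights * d2.min(axis=1)).sum())

def is_eps_coreset(P, C, w, Q, eps):
    # The abstract's guarantee for a single center set Q:
    # |cost_w(C, Q) - cost(P, Q)| <= eps * cost(P, Q).
    full = kmeans_cost(P, Q)
    return abs(kmeans_cost(C, Q, w) - full) <= eps * full

# Toy usage: a uniform sample with rescaled weights, used only to exercise the check
# (it carries no worst-case guarantee, unlike the deterministic coreset in the paper).
rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 5))
idx = rng.choice(len(P), size=100, replace=False)
C, w = P[idx], np.full(100, len(P) / 100.0)
Q = rng.normal(size=(3, 5))  # an arbitrary set of k = 3 centers
print(is_eps_coreset(P, C, w, Q, eps=0.1))

The merge-and-reduce trees mentioned in the abstract build such weighted subsets on chunks of a stream and recursively re-coreset their unions, so the same check applies unchanged to a streamed or distributed result.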


Citations
Journal ArticleDOI

Survey on Technique and User Profiling in Unsupervised Machine Learning Method

TL;DR: This research provides a framework that outlines unsupervised machine learning methods for user profiling (UP) based on essential data attributes. Academics and professionals may use the framework to determine which unsupervised techniques are best suited for creating strong profiles or data-driven user segmentation.
Posted Content

Online Coresets for Clustering with Bregman Divergences.

TL;DR: Algorithms that create coresets in an online setting for clustering problems under a wide subset of Bregman divergences are presented, along with non-parametric coresets that are larger by a factor of $O(\log n)$ (where $n$ is the number of points) and carry a similar additive guarantee.
Journal ArticleDOI

Visible-NIR spectral characterization and grade inversion modelling study of the Derni copper deposit

TL;DR: Several Derni copper grade inversion models based on two machine learning algorithms, the backpropagation neural network and the radial basis function (RBF) neural network, were developed, and the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the accuracy of each model.
Proceedings Article

On Coresets for Fair Regression and Individually Fair Clustering

TL;DR: This paper defines coresets for fair regression with statistical parity (SP) constraints and for individually fair clustering, and shows that to obtain such coresets it is sufficient to sample points with probabilities that depend on a combination of the sensitivity score and a carefully chosen term derived from the fairness constraints. A rough illustrative sketch of this sampling idea follows below.
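As a hedged illustration of generic sensitivity-based sampling (the fairness-dependent term and the exact scores are specific to that paper and are not reproduced here), a sketch in Python might look as follows; the function name and the assumption of precomputed sensitivities are hypothetical.

import numpy as np

def sensitivity_sample(P, sensitivities, m, rng=None):
    # Generic importance sampling: pick m points with probability proportional
    # to their (assumed, precomputed) sensitivity scores, then reweight each
    # picked point by 1 / (m * p_i) so weighted cost sums remain unbiased.
    rng = rng or np.random.default_rng()
    p = sensitivities / sensitivities.sum()
    idx = rng.choice(len(P), size=m, replace=True, p=p)
    return P[idx], 1.0 / (m * p[idx])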
References
Journal ArticleDOI

Generalizing the Hough transform to detect arbitrary shapes

TL;DR: It is shown how the boundaries of an arbitrary non-analytic shape can be used to construct a mapping between image space and Hough transform space, which makes the generalized Hough transform a kind of universal transform that can be used to find arbitrarily complex shapes.
Proceedings ArticleDOI

On coresets for k-means and k-median clustering

TL;DR: This paper shows the existence of small coresets for the problems of computing k-median/means clustering for points in low dimension, and improves the fastest known algorithms for (1+ε)-approximate k-means and k-median.
Book ChapterDOI

The Planar k-Means Problem is NP-Hard

TL;DR: It is shown that this well-known problem is NP-hard even for instances in the plane, answering an open question posed by Dasgupta [6].