Deterministic Coresets for k-Means of Big Sparse Data
Artem Barger, Dan Feldman +1 more
TLDR
The first such coreset of size independent of d is suggested, which is also the first deterministic coreset construction whose resulting size is not exponential in d.
Abstract:
Let P be a set of n points in R^d, k ≥ 1 be an integer, and ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates the cost of P for any given set Q ⊆ R^d of k centers. That is, the sum of squared distances from every point in P to its closest point in Q equals, up to a factor of 1 ± ε, the weighted sum of squared distances from the points of C to their closest points in Q. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to be in P, or to other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic, and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset of P implies that it preserves the sparsity of the data. However, existing such coresets are randomized, and their size has at least linear dependency on the dimension d. We suggest the first such coreset of size independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.
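The coreset guarantee described in the abstract can be illustrated with a minimal sketch. The function names `kmeans_cost` and `within_eps` are hypothetical and not taken from the paper, and the toy "coreset" below simply collapses duplicate points into weighted representatives. That is a trivial exact construction chosen for illustration only; it is not the authors' algorithm. The check also tests a single center set Q, whereas the definition requires the bound to hold for every Q of size k.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Weighted sum of squared distances from each point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_points, k).
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dists.min(axis=1)
    if weights is None:
        return float(nearest.sum())
    return float(np.dot(np.asarray(weights, dtype=float), nearest))

def within_eps(P, C, w, Q, eps):
    """Check the (1 ± eps) bound for one particular center set Q."""
    full = kmeans_cost(P, Q)
    approx = kmeans_cost(C, Q, w)
    return (1 - eps) * full <= approx <= (1 + eps) * full

# Toy data (assumed for illustration): six points, three distinct locations.
P = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0],
              [4.0, 3.0], [4.0, 3.0], [4.0, 3.0]])

# Trivial exact weighted subset: one copy of each distinct point,
# weighted by its multiplicity in P.
C, counts = np.unique(P, axis=0, return_counts=True)
w = counts.astype(float)

Q = np.array([[0.0, 0.0], [4.0, 4.0]])  # k = 2 candidate centers
print(kmeans_cost(P, Q), kmeans_cost(C, Q, w), within_eps(P, C, w, Q, 0.1))
```

Because C consists of rows of P, any sparsity in the input rows carries over unchanged, which is the property the abstract attributes to subset coresets.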
Citations
Journal Article
Survey on Technique and User Profiling in Unsupervised Machine Learning Method
TL;DR: This research provides a framework that outlines Unsupervised Machine Learning (UML) methods for User Profiling (UP) based on essential data attributes. Academics and professionals may use the framework to determine which UML techniques are best for creating strong profiles or data-driven user segmentation.
Posted Content
Online Coresets for Clustering with Bregman Divergences.
TL;DR: Algorithms are presented that create coresets in an online setting for clustering problems under a wide subset of Bregman divergences. Non-parametric coresets are also presented, which are larger by a factor of $O(\log n)$ ($n$ is the number of points) and have a similar additive guarantee.
Journal Article
Visible-NIR spectral characterization and grade inversion modelling study of the Derni copper deposit
TL;DR: Several Derni copper grade inversion models based on two machine learning algorithms, the backpropagation neural network and the radial basis function (RBF) neural network, were developed, and the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the accuracy of each model.
Proceedings Article
On Coresets for Fair Regression and Individually Fair Clustering
TL;DR: This paper defines coresets for Fair Regression with Statistical Parity (SP) constraints and for Individually Fair Clustering, and shows that to obtain such coresets it is sufficient to sample points with probabilities that depend on a combination of the sensitivity score and a carefully chosen term determined by the fairness constraints.