Open Access · Proceedings Article · DOI

Tight Sensitivity Bounds For Smaller Coresets

TLDR
In this paper, the authors study ε-coresets for dimensionality reduction: for a (possibly very large) matrix A ∈ R^(n×d), an ε-coreset is a small scaled subset of its n rows that approximates their sum of squared distances to every affine k-dimensional subspace of R^d, up to a factor of 1±ε.
Abstract
An ε-coreset for the dimensionality reduction problem on a (possibly very large) matrix A ∈ R^(n×d) is a small scaled subset of its n rows that approximates their sum of squared distances to every affine k-dimensional subspace of R^d, up to a factor of 1±ε. Such a coreset is useful for speeding up the computation of a low-rank approximation (k-SVD/k-PCA) while using little memory. Coresets are also useful for handling streaming, dynamic, and distributed data in parallel. With high probability, non-uniform sampling based on the so-called leverage score or sensitivity of each row of A yields a coreset, and the size of the sampled coreset is near-linear in the total sum of these sensitivity bounds. We provide algorithms that compute provably tight bounds for the sensitivity of each input row. Our approach is based on two ingredients: (i) an iterative algorithm that computes the exact sensitivity of each row, up to arbitrarily small precision, for (non-affine) k-subspaces, and (ii) a general reduction that computes a coreset for affine subspaces given a coreset for (non-affine) subspaces in R^d. Experimental results on real-world datasets, including the English Wikipedia document-term matrix, show that our bounds yield significantly smaller, data-dependent coresets in practice as well. Full open source code is provided.
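The sampling step behind such coresets is simple once per-row sensitivity bounds are available; the paper's contribution is computing provably tighter bounds. Below is a minimal numpy sketch (an illustration under standard assumptions, not the paper's algorithm) that uses the classical SVD leverage scores as the sensitivity upper bounds; the paper's tighter, data-dependent bounds would replace s here.

import numpy as np

def sensitivity_sample(A, m, rng=np.random.default_rng(0)):
    # Thin SVD: rows of U form an orthonormal basis of A's row space.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    s = np.sum(U**2, axis=1)   # classical leverage scores (upper-bound the sensitivities)
    p = s / s.sum()            # sample each row with probability proportional to its bound
    idx = rng.choice(A.shape[0], size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])     # importance weights make the sampled cost unbiased
    return A[idx], w           # m weighted rows: the candidate coreset

The coreset size m needed for a 1±ε guarantee grows near-linearly with s.sum(), which is why tighter per-row bounds directly shrink the coreset.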


Citations
Posted Content

Fast and Accurate Least-Mean-Squares Solvers

TL;DR: In this paper, the authors propose a faster algorithm that computes a weighted subset of sparsified input points in O(nd + d^4 log n) time, using O(log n) calls to Caratheodory's construction on small but "smart" subsets.
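For context, here is a minimal numpy sketch of the classical Caratheodory construction that this paper accelerates (an illustration, not the paper's fast algorithm): it reduces n weighted points in R^d to at most d+1 points with the same weighted sum and total weight, but needs roughly n rounds of linear algebra rather than the paper's O(nd + d^4 log n) total time.

import numpy as np

def caratheodory(P, w, tol=1e-12):
    # Reduce (P, w) to at most d+1 weighted points with the same
    # weighted sum and total weight (Caratheodory's theorem).
    P, w = np.asarray(P, float), np.asarray(w, float)
    while P.shape[0] > P.shape[1] + 1:
        # Find v != 0 with sum(v) = 0 and sum_i v_i * P[i] = 0:
        # any null-space vector of the d x (n-1) difference matrix works.
        v_rest = np.linalg.svd((P[1:] - P[0]).T)[2][-1]
        v = np.concatenate(([-v_rest.sum()], v_rest))
        if not np.any(v > tol):
            v = -v                                # ensure a positive entry exists
        alpha = np.min(w[v > tol] / v[v > tol])   # largest step keeping weights >= 0
        w = w - alpha * v                         # at least one weight hits zero
        keep = w > tol
        P, w = P[keep], w[keep]
    return P, w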
Posted Content

Sets Clustering

TL;DR: The first PTAS (1+ε approximation) for the sets-k-means problem that takes time near-linear in n is obtained; this is the first such result even for the sets-mean problem on the plane.
Posted Content

Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation.

TL;DR: This paper suggests replacing the rank-k l2-approximation by an l_p-approximation, for p ∈ [1,2], and provides practical and provable approximation algorithms to compute it for any p ≥ 1, based on modern techniques in computational geometry.
Journal ArticleDOI

Fast and Accurate Least-Mean-Squares Solvers for High Dimensional Data

TL;DR: In this paper, the authors propose an algorithm that computes a small subset of the input vectors, with positive weights, whose weighted sum equals the sum of all the input vectors.
References
Journal ArticleDOI

Regression Shrinkage and Selection via the Lasso

TL;DR: A new method for estimation in linear models, called the lasso, is proposed: it minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant.
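In symbols, the lasso described above solves the constrained least-squares problem (standard notation, with tuning constant t):

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,

which drives some coefficients exactly to zero as t shrinks, giving variable selection for free.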
Journal ArticleDOI

Regularization and variable selection via the elastic net

TL;DR: It is shown that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation, and an algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like the LARS algorithm does for the lasso.
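As a quick hands-on illustration of the two penalties (a sketch using scikit-learn's coordinate-descent solvers with arbitrary example hyperparameters, not the paper's LARS-EN algorithm):

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]                           # sparse ground truth
y = X @ beta + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)                    # pure L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mixed L1/L2 penalty
print(np.count_nonzero(lasso.coef_), np.count_nonzero(enet.coef_))

Both fits are sparse; the elastic net's extra L2 term stabilizes the solution when predictors are correlated.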
Journal ArticleDOI

X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling

TL;DR: In this article, a system of deviations from the means of n variables, with standard deviations σ_1, σ_2, …, σ_n and correlations r_12, r_13, r_23, …, r_(n−1)n, is considered, and a criterion is derived for judging whether such deviations can reasonably be supposed to have arisen from random sampling.
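The criterion derived there is what is now called Pearson's chi-squared statistic; in modern notation, for observed counts o_i and expected counts e_i over k cells,

\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i},

with large values indicating that the deviations are unlikely to have arisen from random sampling alone.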
Proceedings Article

A public domain dataset for human activity recognition using smartphones

TL;DR: An Activity Recognition database is described, built from the recordings of 30 subjects doing Activities of Daily Living while carrying a waist-mounted smartphone with embedded inertial sensors; the database is released to the public domain on a well-known online repository.
Proceedings ArticleDOI

K-means clustering via principal component analysis

TL;DR: It is proved that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, which indicates that unsupervised dimension reduction is closely related to unsupervised learning.
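A small empirical illustration of this connection (a sketch, not the paper's derivation): k-means run in the top k−1 principal components typically recovers nearly the same partition as k-means on the raw data.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=3, n_features=10, random_state=0)
k = 3
labels_raw = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
Z = PCA(n_components=k - 1).fit_transform(X)     # the "continuous" relaxation subspace
labels_pca = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
print(adjusted_rand_score(labels_raw, labels_pca))  # close to 1 on well-separated blobs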