Open Access, Proceedings Article

Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering

TLDR
Using coresets with the merge-and-reduce approach, the authors obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. For cost functions other than squared Euclidean distances, they suggest a simple recursive coreset construction.
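The merge-and-reduce approach mentioned here can be sketched as a binary counter over summaries. The sketch below is illustrative only: `MergeAndReduce`, `reduce_placeholder`, and `current_summary` are hypothetical names, and the reduce step is plain uniform sampling with reweighting, which stands in for a real coreset construction and is not the authors' method.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (point, weight)


def reduce_placeholder(points: List[Weighted], size: int,
                       rng: random.Random) -> List[Weighted]:
    """Stand-in for a coreset construction (assumption, not the paper's).

    Uniformly samples `size` weighted points and rescales their weights
    so that the total weight of the input is preserved.
    """
    if len(points) <= size:
        return list(points)
    total = sum(w for _, w in points)
    sample = rng.sample(points, size)
    scale = total / sum(w for _, w in sample)
    return [(p, w * scale) for p, w in sample]


class MergeAndReduce:
    """Keeps at most one summary per level, like digits of a binary counter."""

    def __init__(self, size: int, seed: int = 0) -> None:
        self.size = size
        self.rng = random.Random(seed)
        self.buffer: List[Weighted] = []          # unreduced leaf in progress
        self.levels: List[Optional[List[Weighted]]] = []

    def insert(self, point: Point) -> None:
        self.buffer.append((point, 1.0))
        if len(self.buffer) < self.size:
            return
        # The leaf is full: carry it upward, merging equal-level summaries.
        carried, self.buffer = self.buffer, []
        level = 0
        while level < len(self.levels) and self.levels[level] is not None:
            merged = self.levels[level] + carried                        # merge
            carried = reduce_placeholder(merged, self.size, self.rng)    # reduce
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = carried

    def current_summary(self) -> List[Weighted]:
        """Union of the pending buffer and every stored level."""
        out = list(self.buffer)
        for stored in self.levels:
            if stored is not None:
                out.extend(stored)
        return out


if __name__ == "__main__":
    stream = MergeAndReduce(size=50)
    rng = random.Random(1)
    for _ in range(1000):
        stream.insert((rng.random(), rng.random()))
    summary = stream.current_summary()
    print(len(summary), sum(w for _, w in summary))
```

With n streamed points the structure holds about log2(n / size) levels of at most `size` points each, which is where memory polynomial in log n comes from. Summaries built on separate machines can be combined with the same merge and reduce steps, which is the sense in which the scheme parallelizes.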
Abstract
… ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε²) right singular vectors (principal components) of A.

A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ~ n.

Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d.

For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size …
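Read as a formula, the k-means instance of the first claim can be sketched as follows. This is a hedged restatement rather than a quotation: the symbols a_i (the i-th row of A), A_m (the best rank-m approximation of A), C (any set of k centers in ℝ^d) and dist² are introduced here for illustration, and taking the constant to be the squared Frobenius norm of the discarded part is an assumption, since the abstract does not say what the constant is.

```latex
\[
  \operatorname{dist}^2(A, C) \;=\; \sum_{i=1}^{n} \min_{c \in C} \lVert a_i - c \rVert_2^2 ,
\]
\[
  \Bigl|\, \operatorname{dist}^2(A, C)
     - \bigl( \operatorname{dist}^2(A_m, C) + \lVert A - A_m \rVert_F^2 \bigr) \Bigr|
  \;\le\; \varepsilon \cdot \operatorname{dist}^2(A, C),
  \qquad m = O(k/\varepsilon^2).
\]
```

Under this reading, the k-means cost on the rows of A_m differs from the cost on A by a term that does not depend on C, up to relative error ε, so a clustering that is optimal for the projected rows is close to optimal for the original rows.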


Citations
Journal Article

Visual Place Recognition: A Survey

TL;DR: A survey of the visual place recognition research landscape is presented, introducing the concepts behind place recognition, how a “place” is defined in a robotics context, and the major components of a place recognition system.
Journal Article

Mining big data: current status, and forecast to the future

TL;DR: This issue introduces four articles, written by influential scientists in the field, covering the most interesting and state-of-the-art topics on Big Data mining, and presents a broad overview of the topic, its current status, controversy, and a forecast to the future.
Journal Article

Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications

TL;DR: An extensive literature review over the period 2002-2013 of machine learning methods that were used to address common issues in WSNs is presented and a comparative guide is provided to aid WSN designers in developing suitable machine learning solutions for their specific application challenges.
Journal Article

Big data analytics: a survey

TL;DR: This survey asks how to develop a high-performance platform that analyzes big data efficiently and how to design mining algorithms that extract useful information from it.
Book

Sketching as a Tool for Numerical Linear Algebra

TL;DR: A survey of linear sketching algorithms for numerical linear algebra, covering least squares and robust regression problems, low-rank approximation, and graph sparsification.