Author

Dan Feldman

Bio: Dan Feldman is an academic researcher from the University of Haifa. The author has contributed to research in topics including Coreset and Approximation algorithm. The author has an h-index of 26 and has co-authored 120 publications receiving 3,168 citations. Previous affiliations of Dan Feldman include Vassar College & University of Southern California.


Papers
Proceedings ArticleDOI
06 Jan 2013
TL;DR: Combining the authors' coresets with the merge-and-reduce approach yields embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering; for other cost functions, a simple recursive construction produces small coresets.
Abstract: We prove that the sum of squared Euclidean distances from the n rows of an n×d matrix A to any compact set spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means clustering of their projection onto the O(k/ε²) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1+ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ≈ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that are polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces small coresets.

353 citations
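The projection idea described in the abstract can be illustrated with a short sketch: project the rows of A onto the span of the first roughly k/ε² right singular vectors and cluster the projected rows. This is only a rough illustration, assuming numpy and scikit-learn; the choice m = ⌈k/ε²⌉ and the use of KMeans are illustrative stand-ins, not the paper's exact construction or constants.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_on_svd_projection(A, k, eps=0.5):
        # Rank parameter on the order of k / eps^2 (illustrative constant).
        m = min(A.shape[1], int(np.ceil(k / eps ** 2)))
        # First m right singular vectors of A.
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        V = Vt[:m]                      # shape (m, d)
        A_proj = A @ V.T @ V            # rows of A projected onto span(V)
        return KMeans(n_clusters=k, n_init=10).fit_predict(A_proj)

    # Example: 1000 points in 500 dimensions, clustered via the projection.
    A = np.random.randn(1000, 500)
    labels = kmeans_on_svd_projection(A, k=5, eps=0.5)

Because the projection preserves the sum of squared distances to any low-dimensional set up to a (1+ε)-factor, a good clustering of the projected rows is also a good clustering of the original rows.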

Proceedings ArticleDOI
06 Jun 2011
TL;DR: A unified framework for constructing coresets and approximate clustering for general sets of functions is presented, together with generalizations to squared distances, distances to the qth power, and deterministic constructions.
Abstract: Given a set F of n positive functions over a ground set X, we consider the problem of computing x* that minimizes the expression ∑f∈F f(x) over x ∈ X. A typical application is shape fitting, where we wish to approximate a set P of n elements (say, points) by a shape x from a (possibly infinite) family X of shapes. Here, each point p ∈ P corresponds to a function f such that f(x) is the distance from p to x, and we seek a shape x that minimizes the sum of distances from each point in P. In the k-clustering variant, each x ∈ X is a tuple of k shapes, and f(x) is the distance from p to its closest shape in x. Our main result is a unified framework for constructing coresets and approximate clustering for such general sets of functions. To achieve our results, we forge a link between the classic and well-defined notion of ε-approximations from the theory of PAC learning and VC dimension, and the relatively new (and not so consistent) paradigm of coresets, which are a kind of "compressed representation" of the input set F. Using traditional techniques, a coreset usually implies an LTAS (linear time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). For several function families F for which coresets are known not to exist, or for which the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximations, or coresets that are large but contained in a low-dimensional space. We demonstrate our unified framework by applying it to projective clustering problems. We obtain new coreset constructions and significantly smaller coresets than those that have appeared in the literature in recent years, for problems such as: k-median [Har-Peled and Mazumdar, STOC'04], [Chen, SODA'06], [Langberg and Schulman, SODA'10]; k-line median [Feldman, Fiat and Sharir, FOCS'06], [Deshpande and Varadarajan, STOC'07]; projective clustering [Deshpande et al., SODA'06], [Deshpande and Varadarajan, STOC'07]; linear ℓp regression [Clarkson and Woodruff, STOC'09]; low-rank approximation [Sarlos, FOCS'06]; subspace approximation [Shyamalkumar and Varadarajan, SODA'07], [Feldman, Monemizadeh, Sohler and Woodruff, SODA'10], [Deshpande, Tulsiani, and Vishnoi, SODA'11]. The running times of the corresponding optimization problems are also significantly improved. We show how to generalize the results of our framework to squared distances (as in k-means), distances to the qth power, and deterministic constructions.

350 citations
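As one concrete instance of the sampling view taken by this framework, the sketch below draws a non-uniform sample with probabilities proportional to a crude per-point importance score and reweights it so that the weighted sum-of-squared-distances cost is an unbiased estimate of the original cost. The score used here (distance to the mean plus a uniform term) is only a placeholder for the sensitivity bounds developed in the paper; numpy is assumed.

    import numpy as np

    def sensitivity_sampling_coreset(P, k, size, seed=None):
        # Rough sketch of importance (sensitivity-style) sampling for a
        # sum-of-squared-distances cost. The crude per-point score below is
        # illustrative only, not the bound proved in the paper.
        rng = np.random.default_rng(seed)
        n = len(P)
        center = P.mean(axis=0)                     # placeholder rough solution
        d2 = ((P - center) ** 2).sum(axis=1)
        s = d2 / d2.sum() + 1.0 / n                 # crude importance score
        prob = s / s.sum()
        idx = rng.choice(n, size=size, p=prob)
        weights = 1.0 / (size * prob[idx])          # unbiased importance weights
        return P[idx], weights

Evaluating any fixed set of k centers on the weighted sample gives, in expectation, exactly the cost on P; sensitivity-based sampling probabilities are what make the estimate uniformly accurate over all candidate solutions at once.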

Posted Content
TL;DR: In this paper, the authors develop and analyze a method to reduce a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set.
Abstract: We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first k principal components of the reduced set will return approximately the first k principal components of the original set, and computing the centers of a k-means clustering on the reduced set will return an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space and that the sets are mergeable. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composability, see Indyk et al., PODS 2014). It allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation the coreset sizes are also independent of the number of input points. Our method is based on projecting the points onto a low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques including k-means clustering, principal component analysis and subspace clustering. The main conceptual contribution is a new coreset definition that allows one to charge costs that appear for every solution to an additive constant.

318 citations
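The mergeability (composability) property described above is what makes the standard merge-and-reduce pattern work for streams; a minimal sketch of that pattern follows. The coreset construction itself is passed in as reduce_fn, which is hypothetical here (any routine mapping weighted points to a weighted summary of bounded size); numpy is assumed.

    import numpy as np

    def merge_and_reduce(chunks, reduce_fn, target_size):
        # chunks: iterable of point arrays arriving over the stream.
        # reduce_fn(points, weights, target_size) -> (points, weights),
        # a hypothetical coreset construction of bounded output size.
        stack = []                                   # list of (level, points, weights)
        for chunk in chunks:
            pts, wts = reduce_fn(chunk, np.ones(len(chunk)), target_size)
            level = 0
            while stack and stack[-1][0] == level:   # merge buckets of equal level
                _, p2, w2 = stack.pop()
                pts = np.vstack([pts, p2])
                wts = np.concatenate([wts, w2])
                pts, wts = reduce_fn(pts, wts, target_size)
                level += 1
            stack.append((level, pts, wts))
        # Final summary: union of all remaining buckets.
        all_p = np.vstack([p for _, p, _ in stack])
        all_w = np.concatenate([w for _, _, w in stack])
        return all_p, all_w

Because the union of two coresets is a coreset of the union, re-reducing merged buckets keeps only O(log n) buckets of roughly target_size points in memory while reading the data in a single pass.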

Posted Content
TL;DR: In this article, a unified framework for constructing coresets and approximate clustering for general sets of functions is presented, linking ε-approximations from PAC learning and VC dimension to the coreset paradigm.
Abstract: Given a set F of n positive functions over a ground set X, we consider the problem of computing x* that minimizes the expression ∑f∈F f(x) over x ∈ X. A typical application is shape fitting, where we wish to approximate a set P of n elements (say, points) by a shape x from a (possibly infinite) family X of shapes. Here, each point p ∈ P corresponds to a function f such that f(x) is the distance from p to x, and we seek a shape x that minimizes the sum of distances from each point in P. In the k-clustering variant, each x ∈ X is a tuple of k shapes, and f(x) is the distance from p to its closest shape in x. Our main result is a unified framework for constructing coresets and approximate clustering for such general sets of functions. To achieve our results, we forge a link between the classic and well-defined notion of ε-approximations from the theory of PAC learning and VC dimension, and the relatively new (and not so consistent) paradigm of coresets, which are a kind of "compressed representation" of the input set F. Using traditional techniques, a coreset usually implies an LTAS (linear time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). We show how to generalize the results of our framework to squared distances (as in k-means), distances to the qth power, and deterministic constructions.

249 citations
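The generalization to qth-power distances mentioned at the end of the abstract only changes the cost function being approximated. A small helper (illustrative only, assuming numpy) that evaluates this weighted cost, either on a coreset or on the full input:

    import numpy as np

    def clustering_cost(P, weights, centers, q=2):
        # Weighted k-clustering cost with distances raised to the q-th power
        # (q = 2 recovers the k-means objective, q = 1 the k-median objective).
        # Distance from every point to every center, then take the nearest one.
        D = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        return float(np.sum(weights * D.min(axis=1) ** q))

A coreset for this family of costs guarantees that the value returned on the weighted summary is within a (1±ε)-factor of the value on the original points, for every choice of centers.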

Proceedings ArticleDOI
06 Jun 2007
TL;DR: Every unweighted point set P has a weak coreset of size Poly(k, 1/ε) for the k-means clustering problem, i.e., its size is independent of the cardinality |P| of the point set and of the dimension d of the Euclidean space R^d.
Abstract: Given a point set P ⊆ R^d, the k-means clustering problem is to find a set C = (c_1, ..., c_k) of k points and a partition of P into k clusters C_1, ..., C_k such that the sum of squared errors ∑_{i=1}^{k} ∑_{p ∈ C_i} ||p − c_i||² is minimized. For given centers this cost function is minimized by assigning points to the nearest center. The k-means cost function is probably the most widely used cost function in the area of clustering. In this paper we show that every unweighted point set P has a weak (ε, k)-coreset of size Poly(k, 1/ε) for the k-means clustering problem, i.e., its size is independent of the cardinality |P| of the point set and of the dimension d of the Euclidean space R^d. A weak coreset is a weighted set S ⊆ P together with a set T such that T contains a (1+ε)-approximation for the optimal cluster centers from P, and for every set of k centers from T the cost of the centers for S is a (1±ε)-approximation of the cost for P. We apply our weak coreset to obtain a PTAS for the k-means clustering problem with running time O(nkd + d·Poly(k/ε) + 2^{O(k/ε)}).

214 citations
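A weak coreset, as defined above, pairs the weighted sample S with a candidate set T that already contains near-optimal centers, so an approximation algorithm only has to search within T. The sketch below (assuming numpy; exhaustive search over candidate k-tuples, feasible only because T has size poly(k/ε)) shows how such a pair would be used:

    import numpy as np
    from itertools import combinations

    def pick_centers_from_weak_coreset(S, w, T, k):
        # Given a (hypothetical) weak coreset: weighted points (S, w) and a
        # small candidate set T containing near-optimal centers, return the
        # k-tuple of candidates that is cheapest on the coreset.
        best, best_cost = None, np.inf
        for cand in combinations(range(len(T)), k):
            C = T[list(cand)]
            D = np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2)
            cost = float(np.sum(w * D.min(axis=1) ** 2))
            if cost < best_cost:
                best, best_cost = C, cost
        return best, best_cost

The paper's PTAS is roughly along these lines: build the weak coreset, search over candidate center sets drawn from T, and return the cheapest one, which is where the 2^{O(k/ε)} term in the running time comes from.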


Cited by
Christopher M. Bishop
01 Jan 2006
TL;DR: This textbook covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, continuous latent variables, sequential data, and combining models.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Journal ArticleDOI
01 Apr 1988 - Nature
TL;DR: In this paper, a sedimentological core and petrographic characterisation of samples from eleven boreholes from the Lower Carboniferous of Bowland Basin (Northwest England) is presented.
Abstract: Deposits of clastic carbonate-dominated (calciclastic) sedimentary slope systems in the rock record have been identified mostly as linearly-consistent carbonate apron deposits, even though most ancient clastic carbonate slope deposits fit submarine fan systems better. Calciclastic submarine fans are consequently rarely described and are poorly understood, and very little is known especially about mud-dominated calciclastic submarine fan systems. Presented in this study is a sedimentological core and petrographic characterisation of samples from eleven boreholes from the Lower Carboniferous of the Bowland Basin (Northwest England) that reveals a >250 m thick calciturbidite complex deposited in a calciclastic submarine fan setting. Seven facies are recognised from core and thin-section characterisation and are grouped into three carbonate turbidite sequences. They include: 1) calciturbidites, comprising mostly high- to low-density, wavy-laminated bioclast-rich facies; 2) low-density densite mudstones, which are characterised by planar-laminated and unlaminated mud-dominated facies; and 3) calcidebrites, which are muddy or hyper-concentrated debris-flow deposits occurring as poorly-sorted, chaotic, mud-supported floatstones.

9,929 citations

Reference EntryDOI
15 Oct 2004

2,118 citations

Journal ArticleDOI
TL;DR: A survey of the visual place recognition research landscape is presented, introducing the concepts behind place recognition, how a “place” is defined in a robotics context, and the major components of a place recognition system.
Abstract: Visual place recognition is a challenging problem due to the vast range of ways in which the appearance of real-world places can vary. In recent years, improvements in visual sensing capabilities, an ever-increasing focus on long-term mobile robot autonomy, and the ability to draw on state-of-the-art research in other disciplines—particularly recognition in computer vision and animal navigation in neuroscience—have all contributed to significant advances in visual place recognition systems. This paper presents a survey of the visual place recognition research landscape. We start by introducing the concepts behind place recognition—the role of place recognition in the animal kingdom, how a “place” is defined in a robotics context, and the major components of a place recognition system. Long-term robot operations have revealed that changing appearance can be a significant factor in visual place recognition failure; therefore, we discuss how place recognition solutions can implicitly or explicitly account for appearance change within the environment. Finally, we close with a discussion on the future of visual place recognition, in particular with respect to the rapid advances being made in the related fields of deep learning, semantic scene understanding, and video description.

933 citations

Book ChapterDOI
20 Dec 2013

780 citations