Author

Yuval Rabani

Bio: Yuval Rabani is an academic researcher from the Hebrew University of Jerusalem. The author has contributed to research in topics including approximation algorithms and competitive analysis. The author has an h-index of 44 and has co-authored 139 publications receiving 6,781 citations. Previous affiliations of Yuval Rabani include the University of Toronto and Tel Aviv University.


Papers
Proceedings ArticleDOI
23 May 1998
TL;DR: Significantly improving and extending recent results of Kleinberg, the authors construct data structures whose size is polynomial in the size of the database and search algorithms that run in time nearly linear or nearly quadratic in the dimension.
Abstract: We address the problem of designing data structures that allow efficient search for approximate nearest neighbors. More specifically, given a database consisting of a set of vectors in some high dimensional Euclidean space, we want to construct a space-efficient data structure that would allow us to search, given a query vector, for the closest or nearly closest vector in the database. We also address this problem when distances are measured by the L1 norm and in the Hamming cube. Significantly improving and extending recent results of Kleinberg, we construct data structures whose size is polynomial in the size of the database and search algorithms that run in time nearly linear or nearly quadratic in the dimension. (Depending on the case, the extra factors are polylogarithmic in the size of the database.)

400 citations
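The nearly linear query times above come from dimension-reduction ideas. As a rough illustration of that flavor only (not the paper's actual construction), the sketch below uses a random Gaussian projection in the spirit of Johnson-Lindenstrauss to shrink the dimension before a brute-force scan; all parameter choices are ours.

```python
# Illustrative sketch only: random-projection dimension reduction before
# nearest-neighbor search, in the spirit of (but not identical to) the
# paper's construction. All parameter choices here are ad hoc.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 512, 32          # database size, original dim, reduced dim

database = rng.standard_normal((n, d))
query = rng.standard_normal(d)

# A Gaussian projection approximately preserves Euclidean distances
# (Johnson-Lindenstrauss); scaling by 1/sqrt(k) keeps norms comparable.
proj = rng.standard_normal((k, d)) / np.sqrt(k)
db_low = database @ proj.T       # shape (n, k)
q_low = proj @ query             # shape (k,)

# Brute-force scan in the reduced space: O(nk) instead of O(nd) per query.
approx_idx = np.argmin(np.linalg.norm(db_low - q_low, axis=1))
exact_idx = np.argmin(np.linalg.norm(database - query, axis=1))
print(approx_idx, exact_idx)     # often, though not always, agree
```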

Journal ArticleDOI
TL;DR: This work investigates variants of Lloyd's heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and proposes and justifies a clusterability criterion for data sets.
Abstract: We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

398 citations
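For context, the Lloyd iteration that the paper builds on alternates an assignment step and a centroid step, as in the minimal sketch below; the paper's actual contributions (the seeding process and the clusterability analysis) are not reproduced here.

```python
# Minimal sketch of the classic Lloyd iteration the paper builds on;
# the paper's seeding process and clusterability criterion are NOT shown.
import numpy as np

def lloyd(points, centers, iters=20):
    """Alternate assignment and centroid steps from given starting centers."""
    for _ in range(iters):
        # Assignment step: each point goes to its nearest current center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels
```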

Journal ArticleDOI
TL;DR: It is shown that the minimum cut ratio is within a factor of O(log k) of the maximum concurrent flow for k-commodity flow instances with arbitrary capacities and demands; an algorithm finds a cut whose ratio is within O(log k) of the maximum concurrent flow, and thus of the optimal min-cut ratio.
Abstract: It is shown that the minimum cut ratio is within a factor of O(log k) of the maximum concurrent flow for k-commodity flow instances with arbitrary capacities and demands. This improves upon the previously best-known bound of O(log² k) and is existentially tight, up to a constant factor. An algorithm for finding a cut with ratio within a factor of O(log k) of the maximum concurrent flow, and thus of the optimal min-cut ratio, is presented.

303 citations
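In symbols, writing f* for the maximum concurrent flow and taking the minimum cut ratio over cuts S (notation ours), the result pins the k-commodity flow-cut gap down to:

```latex
% Flow-cut gap for k-commodity instances, as stated in the abstract.
% f^* is the maximum concurrent flow; the middle term is the minimum cut ratio.
f^* \;\le\; \min_{S} \frac{\mathrm{capacity}(S)}{\mathrm{demand}(S)} \;\le\; O(\log k)\, f^*
```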

Journal ArticleDOI
TL;DR: Significantly improving and extending recent results of Kleinberg, the authors construct data structures whose size is polynomial in the size of the database and search algorithms that run in time nearly linear or nearly quadratic in the dimension.
Abstract: We address the problem of designing data structures that allow efficient search for approximate nearest neighbors. More specifically, given a database consisting of a set of vectors in some high dimensional Euclidean space, we want to construct a space-efficient data structure that would allow us to search, given a query vector, for the closest or nearly closest vector in the database. We also address this problem when distances are measured by the L1 norm and in the Hamming cube. Significantly improving and extending recent results of Kleinberg, we construct data structures whose size is polynomial in the size of the database and search algorithms that run in time nearly linear or nearly quadratic in the dimension. (Depending on the case, the extra factors are polylogarithmic in the size of the database.)

300 citations

Proceedings ArticleDOI
21 Oct 2006
TL;DR: This work investigates variants of Lloyd's heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and proposes and justifies a clusterability criterion for data sets.
Abstract: We investigate variants of Lloyd's heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.

248 citations


Cited by
Proceedings ArticleDOI
07 Jan 2007
TL;DR: By augmenting k-means with a very simple, randomized seeding technique, this work obtains an algorithm that is Θ(log k)-competitive with the optimal clustering.
Abstract: The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

7,539 citations
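The seeding technique referred to is widely known as k-means++ or D² seeding: each new center is drawn with probability proportional to its squared distance from the centers already chosen. A short sketch (variable names ours):

```python
# Sketch of the D^2 ("k-means++") seeding described in the abstract:
# each new center is drawn with probability proportional to the squared
# distance from the nearest center chosen so far.
import numpy as np

def dsquared_seeding(points, k, rng=np.random.default_rng()):
    n = len(points)
    centers = [points[rng.integers(n)]]          # first center: uniform
    for _ in range(k - 1):
        d2 = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centers], axis=0
        )
        probs = d2 / d2.sum()                    # bias toward far-away points
        centers.append(points[rng.choice(n, p=probs)])
    return np.array(centers)
```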

Proceedings ArticleDOI
23 May 1998
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Abstract: We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.

4,478 citations
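The guarantee these algorithms provide is worth making concrete: a c-approximate answer must lie within a factor c of the true nearest distance. A tiny brute-force checker of that condition (ours, purely illustrative):

```python
# Brute-force check of the c-approximate nearest neighbor guarantee:
# a returned point p is a valid answer for query q iff
# dist(q, p) <= c * min over all data points x of dist(q, x).
import numpy as np

def is_c_approx_nn(data, q, p, c):
    true_nn_dist = np.min(np.linalg.norm(data - q, axis=1))
    return np.linalg.norm(p - q) <= c * true_nn_dist
```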

Journal ArticleDOI
TL;DR: This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation, and presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions.
Abstract: Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log k) floating-point operations (flops) in contrast to O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multiprocessor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.

3,248 citations
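The prototype algorithm in this framework is compact enough to sketch. The version below follows the survey's two-stage outline (a randomized range finder followed by a deterministic SVD of the compressed matrix), with our own variable names, modest oversampling, and no power iterations:

```python
# Sketch of the basic randomized low-rank approximation scheme the survey
# describes: sample the range of A with a random matrix, orthonormalize,
# then do deterministic linear algebra on the small compressed matrix.
import numpy as np

def randomized_svd(A, k, oversample=10, rng=np.random.default_rng()):
    m, n = A.shape
    # Stage A: randomized range finder. Columns of A @ omega sample range(A).
    omega = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ omega)            # orthonormal basis, ~k+p columns
    # Stage B: deterministic SVD of the small matrix B = Q^T A.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]            # rank-k approximation factors
```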

Proceedings ArticleDOI
08 Jun 2004
TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the ℓp norm, based on p-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p < 1.
Abstract: We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the ℓp norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the ℓp norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1.

3,109 citations
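For p = 2 the Gaussian distribution is 2-stable, and the scheme's hash functions take a particularly simple form: project onto a random Gaussian direction, shift, and quantize into buckets of width r. A sketch with arbitrary parameters of our choosing:

```python
# Sketch of one p-stable LSH hash function for the Euclidean (p = 2) case:
# h(v) = floor((a . v + b) / r), with a drawn from a 2-stable (Gaussian)
# distribution and b uniform in [0, r). Nearby points collide more often.
import numpy as np

rng = np.random.default_rng(0)

def make_hash(d, r, rng=rng):
    a = rng.standard_normal(d)      # Gaussian => 2-stable, so a.v tracks ||v||_2
    b = rng.uniform(0, r)           # random offset makes the quantizer unbiased
    return lambda v: int(np.floor((a @ v + b) / r))

h = make_hash(d=128, r=4.0)
v = rng.standard_normal(128)
print(h(v), h(v + 0.01))            # a small perturbation usually keeps the bucket
```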

Journal ArticleDOI
TL;DR: In this paper, it was shown that, given an integer k ≥ 1, (1 + ϵ)-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.
Abstract: Consider a set S of n data points in real d-dimensional space, R^d, where distances are measured using any Minkowski metric. In nearest neighbor searching, we preprocess S into a data structure so that, given any query point q ∈ R^d, the closest point of S to q can be reported quickly. Given any positive real ϵ, a data point p is a (1 + ϵ)-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ϵ) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R^d in O(dn log n) time and O(dn) space, so that given a query point q ∈ R^d and ϵ > 0, a (1 + ϵ)-approximate nearest neighbor of q can be computed in O(c_{d,ϵ} log n) time, where c_{d,ϵ} ≤ d⌈1 + 6d/ϵ⌉^d is a factor depending only on dimension and ϵ. In general, we show that given an integer k ≥ 1, (1 + ϵ)-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.

2,813 citations
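Guarantees of this (1 + ϵ) form are exposed by standard libraries; for example, SciPy's k-d tree accepts an eps argument with exactly this semantics. A small usage sketch (an analogous structure, not the paper's BBD tree):

```python
# Usage sketch: SciPy's k-d tree supports (1 + eps)-approximate queries via
# the `eps` argument; the k-th returned distance is guaranteed to be within
# (1 + eps) of the true k-th nearest distance. This is analogous to, not the
# same as, the BBD-tree data structure from the paper.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.standard_normal((10_000, 8))
tree = cKDTree(points)

query = rng.standard_normal(8)
dist, idx = tree.query(query, k=3, eps=0.5)   # (1 + 0.5)-approximate 3-NN
print(dist, idx)
```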