
Showing papers on "Locality-sensitive hashing" published in 1999


Proceedings Article
07 Sep 1999
TL;DR: Experimental results indicate that the novel hashing-based scheme for approximate similarity search scales well even for a relatively large number of dimensions, and that it improves running time over methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Abstract: The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications are rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).
Footnotes: Supported by NAVY N00014-96-1-1221 grant and NSF Grant IIS-9811904. Supported by Stanford Graduate Fellowship and NSF NYI Award CCR-9357849. Supported by ARO MURI Grant DAAH04-96-1-0007, NSF Grant IIS-9811904, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
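The hashing idea in the abstract above (nearby points should collide more often than distant ones) can be illustrated with a minimal sketch. It assumes binary vectors under Hamming distance and uses bit sampling, where each table hashes a point by a random subset of its coordinates. The names `build_tables`, `query`, `num_tables`, and `bits_per_key` are hypothetical and chosen for this example; this is not the paper's implementation.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def build_tables(points, num_tables, bits_per_key, seed=0):
    """Build num_tables hash tables, each keyed by a random sample of bit positions."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(num_tables):
        positions = rng.sample(range(dim), bits_per_key)
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            # Points that agree on all sampled positions land in the same bucket,
            # which is more likely the smaller their Hamming distance is.
            buckets[tuple(p[i] for i in positions)].append(idx)
        tables.append((positions, buckets))
    return tables


def query(tables, points, q):
    """Return the index of the closest point among those colliding with q, or None."""
    candidates = set()
    for positions, buckets in tables:
        key = tuple(q[i] for i in positions)
        candidates.update(buckets.get(key, ()))
    if not candidates:
        return None
    return min(candidates, key=lambda i: hamming(points[i], q))


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [[rng.randint(0, 1) for _ in range(32)] for _ in range(200)]
    tabs = build_tables(pts, num_tables=8, bits_per_key=10)
    print(query(tabs, pts, pts[5]))
```

Only the points that share a bucket with the query are compared exactly, so the cost depends on bucket sizes rather than on the full database. More tables raise the chance of finding a close point; longer keys shrink the buckets.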

3,705 citations


Journal Article
TL;DR: The authors introduce a new class of Boolean functions, called rotation symmetric, whose evaluation is especially efficient.
Abstract: Efficient hashing is a centerpiece of modern cryptography. The progress in computing technology enables us to use 64-bit machines with the promise of 128-bit machines in the near future. To exploit fully the technology for fast hashing, we need to be able to design cryptographically strong Boolean functions in many variables which can be evaluated faster using partial evaluations from the previous rounds. We introduce a new class of Boolean functions whose evaluation is especially efficient and we call them rotation symmetric. Basic cryptographic properties of rotation-symmetric functions are investigated in a broader context of symmetric functions. An algorithm for the design of rotation-symmetric functions is given and two classes of functions are examined. These classes are important from a practical point of view as their forms are short. We show that shortening of rotation-symmetric functions paradoxically leads to a more expensive evaluation process.
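As a small illustration of the defining property, a Boolean function on n bits is rotation symmetric when its value is unchanged by a cyclic rotation of its input. The sketch below checks that property by brute force over all inputs. The helper names and the example function are assumptions made for this illustration and are not taken from the paper.

```python
def rotate_left(x, n):
    """Cyclically rotate the n-bit integer x left by one position."""
    mask = (1 << n) - 1
    return ((x << 1) & mask) | (x >> (n - 1))


def is_rotation_symmetric(f, n):
    """Check f(x) == f(rotate_left(x)) for every n-bit input x."""
    return all(f(x) == f(rotate_left(x, n)) for x in range(1 << n))


def example(x, n=4):
    """XOR of the products of cyclically adjacent bits (all rotations of one monomial)."""
    bits = [(x >> i) & 1 for i in range(n)]
    return sum(bits[i] & bits[(i + 1) % n] for i in range(n)) % 2


if __name__ == "__main__":
    print(is_rotation_symmetric(example, 4))          # rotation symmetric
    print(is_rotation_symmetric(lambda x: x & 1, 4))  # depends on one fixed bit
```

Checking a single rotation step is enough, because every other rotation is a repetition of that step.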

122 citations


Proceedings Article
01 May 1999
TL;DR: This work investigates the exact nearest neighbor search problem and the related problem of exact partial match within the asymmetric communication model first used by Miltersen to study data structure problems, and derives non-trivial asymptotic lower bounds for the exact problem that stand in contrast to known algorithms for approximate nearest neighbor search.
Abstract: In spite of extensive and continuing research, for various geometric search problems (such as nearest neighbor search), the best algorithms known have performance that degrades exponentially in the dimension. This phenomenon is sometimes called the curse of dimensionality. Recent results [37, 38, 40] show that in some sense it is possible to avoid the curse of dimensionality for the approximate nearest neighbor search problem. But must the exact nearest neighbor search problem suffer this curse? We provide some evidence in support of the curse. Specifically we investigate the exact nearest neighbor search problem and the related problem of exact partial match within the asymmetric communication model first used by Miltersen [43] to study data structure problems. We derive non-trivial asymptotic lower bounds for the exact problem that stand in contrast to known algorithms for approximate nearest neighbor search.

118 citations


Book Chapter
02 May 1999
TL;DR: This paper compares the parameter sizes and software performance of several recent constructions for universal hash functions (bucket hashing, polynomial hashing, Toeplitz hashing, division hashing, evaluation hashing, and MMH hashing), using constructions defined to offer a comparable security level so that the comparison is objective.
Abstract: This paper compares the parameter sizes and software performance of several recent constructions for universal hash functions: bucket hashing, polynomial hashing, Toeplitz hashing, division hashing, evaluation hashing, and MMH hashing. An objective comparison between these widely varying approaches is achieved by defining constructions that offer a comparable security level. It is also demonstrated how the security of these constructions compares favorably to existing MAC algorithms, the security of which is less understood.
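To make the notion of a universal hash family concrete, here is a minimal sketch of the textbook polynomial-evaluation construction over a prime field. It is not necessarily the exact variant benchmarked in the paper. The prime `P`, the function name `poly_hash`, and the block encoding are assumptions chosen only for illustration.

```python
import secrets

P = (1 << 61) - 1  # a Mersenne prime, picked here only for illustration


def poly_hash(key, blocks, p=P):
    """Evaluate blocks[0]*key^L + ... + blocks[L-1]*key (mod p) by Horner's rule.

    Each block must be an integer in [0, p). Two distinct messages of the same
    length L collide for at most L of the p possible keys, because their
    difference is a nonzero polynomial of degree at most L in the key.
    """
    acc = 0
    for b in blocks:
        acc = (acc + b) * key % p
    return acc


if __name__ == "__main__":
    key = secrets.randbelow(P)  # the secret key selects one function of the family
    print(poly_hash(key, [1, 2, 3]))
```

The collision bound is what makes the family usable for message authentication: an adversary who does not know the key cannot predict which of the few colliding keys, if any, is in use.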

87 citations



01 Jan 1999
TL;DR: This paper describes how the distance distribution of the query object can be used to determine a suitable stopping condition with probabilistic guarantees on the quality of the result, and analyzes performance of both sequential and index-based PAC-NN algorithms.
Abstract: In this paper we introduce a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the "dimensionality curse" which prevents current approaches from being applied in high-dimensional spaces. PAC-NN queries return, with probability at least 1 − δ, a (1+ε)-approximate NN: an object whose distance from the query q is less than (1+ε) times the distance between q and its NN. We describe how the distance distribution of the query object can be used to determine a suitable stopping condition with probabilistic guarantees on the quality of the result, and then analyze performance of both sequential and index-based PAC-NN algorithms. This shows that PAC-NN queries can be efficiently processed even on very high-dimensional spaces and that control can be exerted in order to trade off between the accuracy of the result and the cost.
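The stopping condition described above can be sketched for the sequential case. This is a minimal illustration, not the authors' algorithm: it assumes that a radius `r_delta` has already been derived from the query's distance distribution such that the true nearest neighbor lies closer than `r_delta` with probability at most δ. The function name and parameters are hypothetical.

```python
import math


def pac_nn_sequential(data, q, epsilon, r_delta, dist=math.dist):
    """Scan data in order; stop once a point within (1 + epsilon) * r_delta of q is seen.

    Returns (best_object, best_distance, number_of_objects_scanned).
    r_delta is assumed to be precomputed from the distance distribution of q.
    """
    best, best_d, scanned = None, math.inf, 0
    for obj in data:
        scanned += 1
        d = dist(obj, q)
        if d < best_d:
            best, best_d = obj, d
        # If the true NN is at distance >= r_delta (probability >= 1 - delta),
        # then best_d <= (1 + epsilon) * r_delta is within the approximation bound.
        if best_d <= (1 + epsilon) * r_delta:
            break
    return best, best_d, scanned


if __name__ == "__main__":
    points = [(10.0, 10.0), (1.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    print(pac_nn_sequential(points, (0.0, 0.0), epsilon=0.5, r_delta=1.0))
```

A larger ε or δ loosens the threshold and ends the scan sooner; setting `r_delta` to 0 degenerates into an exact linear scan unless an object at distance 0 is found.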

4 citations


Journal Article
TL;DR: This work proposes a new scheme for dynamic hashing in which the growth of a file occurs at a rate of (n+k)/n per full expansion, where n is the number of pages of the file and k is a given integer constant which is smaller than n, as compared to a rate of two in linear hashing.
Abstract: The goal of dynamic hashing is to design a function and a file structure that allow the address space allocated to the file to be increased and reduced without reorganizing the whole file. We propose a new scheme for dynamic hashing in which the growth of a file occurs at a rate of (n+k)/n per full expansion, where n is the number of pages of the file and k is a given integer constant which is smaller than n, as compared to a rate of two in linear hashing. Like linear hashing, the proposed scheme (called linear spiral hashing) requires no index; however, when a split occurs, the proposed scheme may or may not add one more physical page, whereas linear hashing always adds one. Therefore, linear spiral hashing can maintain a more stable performance through the file expansions and achieve much better storage utilization than linear hashing. According to our performance analysis, linear spiral hashing can achieve nearly 97 percent storage utilization, compared to 78 percent for linear hashing; this is also verified by a simulation study.
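The growth rates compared in the abstract above are easy to check numerically. The sketch below only evaluates the stated ratio (n+k)/n against the factor of two quoted for linear hashing; it does not model the hashing scheme itself. The function name is hypothetical, and k is assumed to be positive.

```python
from fractions import Fraction


def growth_rate(n, k):
    """Growth factor per full expansion, (n + k) / n, assuming 0 < k < n."""
    if not 0 < k < n:
        raise ValueError("expected 0 < k < n")
    return Fraction(n + k, n)


if __name__ == "__main__":
    for n, k in [(4, 1), (100, 10), (100, 99)]:
        # The factor always stays strictly between 1 and 2 because k < n.
        print(n, k, float(growth_rate(n, k)))
```

Because k is smaller than n, each full expansion grows the file by less than the doubling of linear hashing, which is consistent with the smoother expansion behavior the abstract describes.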

3 citations