
Showing papers on "Locality-sensitive hashing published in 1998"


01 Jan 1998
TL;DR: ANN is a library of C++ objects and procedures that supports approximate nearest neighbor searching, and is written as a testbed for a class of nearest neighbor searching algorithms, particularly those based on orthogonal decompositions of space.
Abstract: ANN is a library of C++ objects and procedures that supports approximate nearest neighbor searching. In nearest neighbor searching, we are given a set of data points S in real d-dimensional space, R^d, and are to build a data structure such that, given any query point q ∈ R^d, the nearest data point to q can be found efficiently. In general, we are given k ≥ 1, and are asked to return the k nearest neighbors to q in S. In approximate nearest neighbor searching, an error bound ε ≥ 0 is also given. The search algorithm returns k distinct points of S, such that the ratio between the distance to the ith point reported and the true ith nearest neighbor is at most 1 + ε. Among the features of ANN are the following. It supports k-nearest neighbor searching, by specifying k with the query. It supports both exact and approximate nearest neighbor searching, by specifying an approximation factor ε ≥ 0 with the query. It supports all Minkowski distance metrics, including the L1 (Manhattan), L2 (Euclidean), and L∞ (Max) metrics. There are no exponential factors in space, implying that the data structure is practical even for very large data sets in high dimensional spaces, irrespective of ε. ANN is written as a testbed for a class of nearest neighbor searching algorithms, particularly those based on orthogonal decompositions of space. These include k-d trees [3, 4], balanced box-decomposition trees [2], and other related spatial data structures (see Samet [5]). The library supports a number of different methods for building search structures. It also supports two methods for searching these structures: standard tree-ordered search [1] and priority search [2]. In priority search, the cells of the data structure are visited in increasing order of distance from the query point. In addition to the library, there are two programs provided for testing and evaluating the performance of various search methods. The first, called ann_test, provides a primitive script language that allows the user to generate data sets and query sets, either by reading from a file or randomly through the use of a number of built-in point distributions. Any of a …
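
The (1+ε) guarantee is easy to state in code. The following is a minimal brute-force sketch, not ANN's actual API: it checks which points of a toy data set may legally be reported as (1+ε)-approximate nearest neighbors, using a Minkowski distance helper. All names (minkowskiDist, epsilon) are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

// Minkowski L_p distance; p = 1 (Manhattan), p = 2 (Euclidean).
// By convention here, pass p <= 0 to mean the L_inf (Max) metric.
double minkowskiDist(const std::vector<double>& a,
                     const std::vector<double>& b, double p) {
    double acc = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double d = std::fabs(a[i] - b[i]);
        if (p <= 0.0) acc = std::max(acc, d);   // L_inf
        else          acc += std::pow(d, p);
    }
    return (p <= 0.0) ? acc : std::pow(acc, 1.0 / p);
}

int main() {
    // Toy data set S in R^2 and a query point q.
    std::vector<std::vector<double>> S = {{0, 0}, {3, 4}, {1, 1}, {5, 5}};
    std::vector<double> q = {1.2, 0.9};
    double epsilon = 0.1;   // allowed relative error

    // Exact nearest distance by brute force (the baseline ANN approximates).
    double best = std::numeric_limits<double>::max();
    for (const auto& p : S)
        best = std::min(best, minkowskiDist(p, q, 2.0));

    // Any point within (1 + epsilon) * best may legally be reported.
    for (const auto& p : S) {
        double d = minkowskiDist(p, q, 2.0);
        if (d <= (1.0 + epsilon) * best)
            std::cout << "(" << p[0] << ", " << p[1]
                      << ") is a valid (1+eps)-approximate NN, dist " << d << "\n";
    }
}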

438 citations


Proceedings ArticleDOI
23 Feb 1998
TL;DR: This work precomputes the result of any nearest neighbor search, which corresponds to computing the Voronoi cell of each data point; although based on a precomputation of the solution space, the technique is dynamic, and experiments demonstrate high efficiency for uniformly distributed as well as real data.
Abstract: Similarity search in multimedia databases requires efficient support of nearest neighbor search on a large set of high dimensional points as a basic operation for query processing. As recent theoretical results show, state of the art approaches to nearest neighbor search are not efficient in higher dimensions. In our new approach, we therefore precompute the result of any nearest neighbor search, which corresponds to a computation of the Voronoi cell of each data point. In a second step, we store the Voronoi cells in an index structure efficient for high dimensional data spaces. As a result, nearest neighbor search corresponds to a simple point query on the index structure. Although our technique is based on a precomputation of the solution space, it is dynamic, i.e. it supports insertions of new data points. An extensive experimental evaluation of our technique demonstrates the high efficiency for uniformly distributed as well as real data. We obtained a significant reduction of the search time compared to nearest neighbor search in the X-tree (up to a factor of 4).
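
To make the "nearest neighbor search becomes a point query" idea concrete, here is a deliberately simplified sketch. It replaces the paper's Voronoi cells and high-dimensional index structure with a uniform 2-D grid whose cells each store the data point nearest their center; this is only an analogy (and only approximate where a grid cell straddles a Voronoi boundary), not the authors' method. All names are ours.

#include <algorithm>
#include <iostream>
#include <vector>

struct Pt { double x, y; };

// Precompute, for every cell of a G x G grid over [0,1]^2, the index of
// the data point nearest the cell center. This plays the role of the
// precomputed solution space (exact Voronoi cells, in the paper).
std::vector<int> precompute(const std::vector<Pt>& data, int G) {
    std::vector<int> nearest(G * G);
    for (int r = 0; r < G; ++r)
        for (int c = 0; c < G; ++c) {
            double cx = (c + 0.5) / G, cy = (r + 0.5) / G;
            int best = 0; double bestD = 1e300;
            for (int i = 0; i < (int)data.size(); ++i) {
                double dx = data[i].x - cx, dy = data[i].y - cy;
                double d = dx * dx + dy * dy;
                if (d < bestD) { bestD = d; best = i; }
            }
            nearest[r * G + c] = best;
        }
    return nearest;
}

int main() {
    std::vector<Pt> data = {{0.1, 0.2}, {0.8, 0.3}, {0.5, 0.9}};
    int G = 64;
    std::vector<int> cells = precompute(data, G);  // done once, offline

    // Query time: the nearest neighbor search is now a simple point lookup.
    Pt q = {0.7, 0.4};
    int r = std::min(G - 1, (int)(q.y * G));
    int c = std::min(G - 1, (int)(q.x * G));
    std::cout << "approx NN of query: point " << cells[r * G + c] << "\n";
}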

170 citations


Patent
15 Oct 1998
TL;DR: In this paper, a method and apparatus for using a hashing function to store data in a cache memory, where the hashing function used is changed periodically, is described; if the data at the index generated by the current hashing function does not match the incoming data, previous hashing functions are used to repeat the search.
Abstract: A method and apparatus for using a hashing function to store data in a cache memory. Briefly, a method and apparatus is provided for using a hashing function to store data in a cache memory where the hashing function used is changed periodically. In one embodiment, the cache memory stores the data, an indicator of the hashing function used and the index value generated by the hashing function used. To retrieve data from the cache memory, the current hashing function is used to generate an index for the incoming data. The data at the index is checked to determine whether the stored data matches the incoming data. If the data at the index generated by the current hashing function does not match the incoming data, previous hashing functions are used to repeat the search.
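
A minimal sketch of the scheme as described: each cache slot stores the data together with an indicator of the hashing function that placed it, and lookups try the current hashing function first, then fall back to previous ones. The class name, the hash family, and the rotation policy are all illustrative assumptions, not taken from the patent.

#include <cstdint>
#include <iostream>
#include <vector>

// A direct-mapped cache whose hashing function changes periodically.
class RehashingCache {
    struct Slot { bool valid = false; uint32_t key = 0; int fn = -1; };
    std::vector<Slot> slots_;
    int currentFn_ = 0;                        // index of the current hash function
    static constexpr int kNumFns = 3;

    size_t hash(uint32_t key, int fn) const {  // a small family of hash functions
        static const uint32_t mul[kNumFns] = {2654435761u, 40503u, 2246822519u};
        return (key * mul[fn]) % slots_.size();
    }
public:
    explicit RehashingCache(size_t n) : slots_(n) {}

    void rotate() { currentFn_ = (currentFn_ + 1) % kNumFns; }  // periodic change

    void insert(uint32_t key) {
        // Store the data plus an indicator of the hashing function used.
        slots_[hash(key, currentFn_)] = {true, key, currentFn_};
    }

    bool lookup(uint32_t key) const {
        // Try the current hashing function first, then previous ones.
        for (int back = 0; back < kNumFns; ++back) {
            int fn = (currentFn_ - back + kNumFns) % kNumFns;
            const Slot& s = slots_[hash(key, fn)];
            if (s.valid && s.key == key && s.fn == fn) return true;
        }
        return false;
    }
};

int main() {
    RehashingCache cache(128);
    cache.insert(42);
    cache.rotate();   // the hashing function changes
    std::cout << (cache.lookup(42) ? "hit\n" : "miss\n");  // found via the old function
}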

53 citations


Journal ArticleDOI
TL;DR: A non-expansive hashing scheme wherein any set of a given size from a large universe may be stored in a bounded amount of memory, with retrieval in a bounded number of operations.
Abstract: In a non-expansive hashing scheme, similar inputs are stored in memory locations which are close. We develop a non-expansive hashing scheme wherein any set of a given size from a large universe may be stored in a memory of bounded size, and where retrieval takes a bounded number of operations. We explain how to use non-expansive hashing schemes for efficient storage and retrieval of noisy data. A dynamic version of this hashing scheme is presented as well.
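
The defining property of a non-expansive hash h is that |h(x) - h(y)| <= |x - y|, so similar inputs land in nearby memory locations. The toy check below verifies this property for a trivial bucketing map; it says nothing about the paper's actual construction, which achieves non-expansiveness together with small memory and fast retrieval, which this toy does not.

#include <cstdlib>
#include <iostream>

// A trivially non-expansive map on non-negative integers: h(x) = x / W.
// Dividing by W >= 1 can never stretch a gap: |x/W - y/W| <= |x - y|.
long h(long x) { const long W = 16; return x / W; }

int main() {
    // Spot-check the non-expansiveness property on sampled pairs.
    for (long x = 0; x <= 2000; x += 7)
        for (long y = x; y <= x + 100; y += 13)
            if (std::labs(h(x) - h(y)) > std::labs(x - y)) {
                std::cout << "non-expansiveness violated\n";
                return 1;
            }
    std::cout << "h is non-expansive on the sampled pairs\n";
}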

9 citations



09 Jun 1998
TL;DR: It is found that the indexing technique used in geometric hashing is much more efficient in the 3-D case than in the 2-D case, and that the use of aspect models and more complex transformation functions in the 2-D approach causes incorrect, degenerate solutions that do not occur in the 3-D case.
Abstract: In this paper we compare our 3-D geometric hashing approach to object recognition with a 2-D geometric hashing approach developed by Gavrila and Groen. We apply both methods to the same recognition task using real images. We found that the indexing technique used in geometric hashing is much more efficient in the 3-D case than it is in the 2-D case. Furthermore, the use of aspect models and more complex transformation functions in the 2-D approach causes incorrect, degenerate solutions that do not occur in the 3-D case.
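
For readers unfamiliar with the indexing technique being compared, here is a minimal 2-D geometric hashing sketch, restricted to translation-only bases for brevity (both papers use richer transformation classes): offsets of model points relative to a chosen basis point are quantized and stored in a table, and at recognition time scene offsets vote for (model, basis) pairs. The names and toy models are illustrative, not taken from either paper.

#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct Pt { int x, y; };
using Key  = std::pair<int, int>;   // quantized offset
using Vote = std::pair<int, int>;   // (model id, basis index)

// Build the hash table: for every model and every choice of basis point,
// index the quantized offsets of all remaining points under that basis.
std::map<Key, std::vector<Vote>> buildTable(
        const std::vector<std::vector<Pt>>& models, int q) {
    std::map<Key, std::vector<Vote>> table;
    for (int m = 0; m < (int)models.size(); ++m)
        for (int b = 0; b < (int)models[m].size(); ++b)
            for (int i = 0; i < (int)models[m].size(); ++i) {
                if (i == b) continue;
                Key k{(models[m][i].x - models[m][b].x) / q,
                      (models[m][i].y - models[m][b].y) / q};
                table[k].push_back({m, b});
            }
    return table;
}

int main() {
    // Two toy "models" (point sets); quantization step q absorbs noise.
    std::vector<std::vector<Pt>> models = {
        {{0, 0}, {4, 0}, {0, 4}},                 // model 0: an L shape
        {{0, 0}, {8, 0}, {8, 8}, {0, 8}}};        // model 1: a square
    int q = 2;
    auto table = buildTable(models, q);

    // Scene: model 0 translated by (10, 10). Take scene point 0 as basis,
    // hash the offsets of the remaining points, and vote.
    std::vector<Pt> scene = {{10, 10}, {14, 10}, {10, 14}};
    std::map<Vote, int> votes;
    for (int i = 1; i < (int)scene.size(); ++i) {
        Key k{(scene[i].x - scene[0].x) / q, (scene[i].y - scene[0].y) / q};
        auto it = table.find(k);
        if (it != table.end())
            for (const Vote& v : it->second) ++votes[v];
    }
    for (const auto& [v, n] : votes)
        std::cout << "model " << v.first << " basis " << v.second
                  << ": " << n << " votes\n";
}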

2 citations


Journal ArticleDOI
Isidore Rigoutsos, Alex Delis
TL;DR: A two-stage methodology that uses knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities; the method is generally applicable and independent of the hashing function used.
Abstract: Increasingly larger data sets are being stored in networked architectures. Many of the available data structures are not easily amenable to parallel realizations. Hashing schemes show promise in that respect for the simple reason that the underlying data structure can be decomposed and spread among the set of cooperating nodes with minimal communication and maintenance requirements. In all cases, storage utilization and load balancing are issues that need to be addressed. One can identify two basic approaches to tackle the problem. One way is to address it as part of the design of the data structure that is used to store and retrieve the data. The other is to maintain the data structure intact but address the problem separately. The method that we present here falls in the latter category and is applicable whenever a hash table is the preferred data structure. Intrinsically attached to the hash table is a hashing function that allows one to partition a possibly unbounded set of data items into a finite set of groups; the hashing function provides the partitioning by assigning each data item to one of the groups. In general, the hashing function cannot guarantee that the various groups will have the same cardinality on average, for all possible data item distributions. In this paper, we propose a two-stage methodology that uses the knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used. We show the power of the methodology using both synthetic and real-world databases. The derived quasi-uniform storage occupancy and associated load-balancing gains are significant.
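
A minimal sketch of the two-stage shape described above, under assumptions of our own: stage one estimates the expected cardinality of each raw hash group, and stage two builds a reassignment table mapping groups to nodes so that expected loads even out, leaving the hashing function itself untouched. The greedy largest-first packing here merely stands in for the paper's reorganization procedure.

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    const int numGroups = 16, numNodes = 4;

    // Stage 1: expected cardinality of each raw hash group (synthetic,
    // deliberately skewed, standing in for knowledge of the hash function).
    std::vector<long> load(numGroups);
    for (int g = 0; g < numGroups; ++g)
        load[g] = 1L << (g % 5);

    // Stage 2: reassign groups to nodes, largest group to lightest node.
    std::vector<int> order(numGroups);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] > load[b]; });

    std::vector<long> nodeLoad(numNodes, 0);
    std::vector<int> groupToNode(numGroups);   // the reassignment table
    for (int g : order) {
        int n = (int)(std::min_element(nodeLoad.begin(), nodeLoad.end())
                      - nodeLoad.begin());
        groupToNode[g] = n;
        nodeLoad[n] += load[g];
    }

    for (int n = 0; n < numNodes; ++n)
        std::cout << "node " << n << " expected load " << nodeLoad[n] << "\n";
    // The lookup path is unchanged: node = groupToNode[hash(item) % numGroups].
}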

2 citations