
Showing papers on "Locality-sensitive hashing published in 2008"


Proceedings Article
08 Dec 2008
TL;DR: The problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard, and a spectral method is obtained whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian.
Abstract: Semantic hashing[1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel data-point. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state-of-the art.
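
To make the relaxed recipe concrete, here is a minimal numpy sketch of the procedure the abstract describes: build a neighborhood graph over the data, take eigenvectors of its Laplacian, and threshold them to obtain bits. The Gaussian-kernel graph, the kernel width sigma, and the code length n_bits are illustrative choices, not values from the paper.

```python
import numpy as np

def spectral_codes(X, n_bits=8, sigma=1.0):
    """Toy spectral binary codes: threshold graph-Laplacian eigenvectors."""
    # Affinity matrix from a Gaussian kernel (a simple choice of graph).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)

    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W

    # Eigenvectors with the smallest eigenvalues; skip the trivial constant one.
    eigvals, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, 1:n_bits + 1]

    # Thresholding each eigenvector at zero yields one bit per vector.
    return (V > 0).astype(np.uint8)

codes = spectral_codes(np.random.randn(200, 16), n_bits=8)
print(codes.shape)  # (200, 8) binary codes
```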

2,641 citations


Journal ArticleDOI
TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the lower bound for hashing-based algorithms recently obtained.
Abstract: In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.

1,759 citations


Journal ArticleDOI
TL;DR: This lecture note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar entries in large databases using a novel and interesting class of algorithms that are known as randomized algorithms.
Abstract: This lecture note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar entries in large databases. This approach belongs to a novel and interesting class of algorithms that are known as randomized algorithms. A randomized algorithm does not guarantee an exact answer but instead provides a high probability guarantee that it will return the correct answer or one close to it. By investing additional computational effort, the probability can be pushed as high as desired.
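
The "invest more effort, push the probability higher" point can be seen in a few lines of code: if one random-hyperplane hash table finds the true neighbor only with some probability p, querying T independent tables and taking the union of candidates misses it only with probability (1 - p)^T. The data, table count, and bit count below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32))            # toy database vectors
q = X[42] + 0.05 * rng.standard_normal(32)     # a query close to item 42

def build_table(X, n_bits, rng):
    """One LSH table: n_bits random hyperplanes define a bucket key."""
    planes = rng.standard_normal((n_bits, X.shape[1]))
    table = {}
    for i, x in enumerate(X):
        key = bytes((planes @ x > 0).astype(np.uint8))
        table.setdefault(key, []).append(i)
    return planes, table

def probe(planes, table, q):
    """Return the candidates stored in the query's bucket."""
    return table.get(bytes((planes @ q > 0).astype(np.uint8)), [])

# Each table finds the true neighbor only with some probability p;
# querying T independent tables misses it with probability (1 - p)^T.
tables = [build_table(X, n_bits=12, rng=rng) for _ in range(8)]
candidates = set()
for planes, table in tables:
    candidates.update(probe(planes, table, q))
print(42 in candidates, len(candidates))
```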

326 citations


01 Jan 2008
TL;DR: This lecture note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar entries in large databases using a novel and interesting class of algorithms known as randomized algorithms.
Abstract: The Internet has brought us a wealth of data, all now available at our fingertips. We can easily carry in our pockets thousands of songs, hundreds of thousands of images, and hundreds of hours of video. But even with the rapid growth of computer performance, we don’t have the processing power to search this amount of data by brute force. This lecture note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar entries in large databases. This approach belongs to a novel and interesting class of algorithms that are known as randomized algorithms. A randomized algorithm does not guarantee an exact answer but instead provides a high probability guarantee that it will return the correct answer or one close to it. By investing additional computational effort, the probability can be pushed as high as desired.

207 citations


Proceedings ArticleDOI
20 Jul 2008
TL;DR: SpotSigs is a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls that provides an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search.
Abstract: Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
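
A rough sketch of the spot-signature idea: a stopword antecedent followed by a short chain of the next content terms, with document similarity taken over the resulting sets. The stopword list, chain length, and tokenizer are illustrative stand-ins, not the paper's configuration.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "was", "to", "of", "and", "in", "that", "it", "there"}

def spot_signatures(text, chain_len=2):
    """Toy spot signatures: a stopword antecedent plus the next
    chain_len non-stopword terms (illustrative parameters)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok not in STOPWORDS:
            continue
        chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
        if len(chain) == chain_len:
            sigs.add(tok + ":" + ":".join(chain))
    return sigs

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = spot_signatures("The senate passed the bill after a long debate in the chamber.")
s2 = spot_signatures("The senate passed the bill after a lengthy debate in the chamber.")
print(jaccard(s1, s2))
```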

182 citations


Proceedings ArticleDOI
Wei Dong, Zhe Wang, William Josephson, Moses Charikar, Kai Li
26 Oct 2008
TL;DR: A statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH, is presented, which can accurately predict the average search quality and latency given a small sample dataset, and an adaptive LSH search algorithm is devised to determine the probing parameter dynamically for each query.
Abstract: Although Locality-Sensitive Hashing (LSH) is a promising approach to similarity search in high-dimensional spaces, it has not been considered practical, partly because its search quality is sensitive to several parameters that are quite data dependent. Previous research on LSH, though it obtained interesting asymptotic results, provides little guidance on how these parameters should be chosen, and tuning parameters for a given dataset remains a tedious process. To address this problem, we present a statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH. Our model can accurately predict the average search quality and latency given a small sample dataset. Apart from automatic parameter tuning with the performance model, we also use the model to devise an adaptive LSH search algorithm to determine the probing parameter dynamically for each query. The adaptive probing method addresses the problem that, even though the average performance is tuned to be optimal, the variance of the performance is extremely high. We experimented with three different datasets, including audio, images and 3D shapes, to evaluate our methods. The results show the accuracy of the proposed model: the predicted recall errors are within 5% of the real values in most cases, and the adaptive search method reduces the standard deviation of recall by about 50% over the existing method.
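
A simplified sketch of the probing side: hash the query with p-stable functions, then visit perturbed neighboring buckets until enough candidates are collected. Real multi-probe LSH scores and orders full perturbation sets, and the paper's model chooses the probing parameter per query; the single-coordinate perturbations and thresholds here are only illustrative.

```python
import numpy as np
from itertools import chain

def hash_key(v, A, b, W):
    """p-stable LSH key: one integer per hash function."""
    return tuple(np.floor((A @ v + b) / W).astype(int))

def probes(key, max_probes):
    """Query bucket first, then single-coordinate +/-1 perturbations
    (a simplified stand-in for the full multi-probe sequence)."""
    yield key
    deltas = chain.from_iterable(((i, +1), (i, -1)) for i in range(len(key)))
    for n, (i, d) in enumerate(deltas):
        if n >= max_probes:
            break
        k = list(key)
        k[i] += d
        yield tuple(k)

def adaptive_search(q, table, A, b, W, min_candidates=20, max_probes=16):
    """Probe nearby buckets until enough candidates are collected."""
    found = []
    for key in probes(hash_key(q, A, b, W), max_probes):
        found.extend(table.get(key, []))
        if len(found) >= min_candidates:
            break
    return found

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 16))
A, b, W = rng.standard_normal((8, 16)), rng.uniform(0, 4.0, 8), 4.0
table = {}
for i, x in enumerate(X):
    table.setdefault(hash_key(x, A, b, W), []).append(i)
print(len(adaptive_search(X[0], table, A, b, W)))
```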

164 citations


Patent
25 Feb 2008
TL;DR: In this paper, a set of cluster centroids (e.g., feature vectors of their corresponding clusters) are retrieved from a memory based on the feature vector of the document using a locality sensitive hashing function.
Abstract: Documents from a data stream are clustered by first generating a feature vector for each document. A set of cluster centroids (e.g., feature vectors of their corresponding clusters) are retrieved from a memory based on the feature vector of the document using a locality sensitive hashing function. The centroids may be retrieved by retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid, and retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory. Documents may then be clustered into one or more of the candidate clusters using distance measures from the feature vector of the document to the cluster centroids.
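
A toy rendering of the pipeline the abstract outlines: compute a feature vector, look up candidate cluster centroids via an LSH bucket table, test distances, then assign to the nearest candidate or start a new cluster. The bit count, distance threshold, and lack of centroid updating are simplifications for illustration, not details from the patent.

```python
import numpy as np

class LSHClusterer:
    """Toy streaming clusterer: an LSH bucket table maps hash keys to
    cluster ids; incoming documents only compare against the candidate
    centroids found in their bucket (illustrative parameters)."""

    def __init__(self, dim, n_bits=10, max_dist=0.8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.max_dist = max_dist
        self.buckets = {}      # hash key -> list of cluster ids
        self.centroids = []    # cluster id -> feature vector

    def _key(self, v):
        return bytes((self.planes @ v > 0).astype(np.uint8))

    def add(self, v):
        key = self._key(v)
        candidates = self.buckets.get(key, [])
        if candidates:
            dists = [np.linalg.norm(v - self.centroids[c]) for c in candidates]
            if min(dists) <= self.max_dist:
                return candidates[int(np.argmin(dists))]
        cid = len(self.centroids)
        self.centroids.append(v)
        self.buckets.setdefault(key, []).append(cid)
        return cid

clusterer = LSHClusterer(dim=16)
docs = np.random.randn(100, 16)
labels = [clusterer.add(d) for d in docs]
print(len(set(labels)), "clusters")
```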

156 citations


Proceedings ArticleDOI
26 Oct 2008
TL;DR: This work defines a more reliable a posteriori model that takes into account prior knowledge about the queries and the searched objects; it outperforms other multi-probe LSH schemes while offering better quality control, and comparisons to the basic LSH technique show that the method allows consistent improvements in both space and time efficiency.
Abstract: Efficient high-dimensional similarity search structures are essential for building scalable content-based search systems on feature-rich multimedia data. In the last decade, Locality Sensitive Hashing (LSH) has been proposed as an indexing technique for approximate similarity search. Among the most recent variations of LSH, multi-probe LSH techniques have been shown to overcome the super-linear space cost drawback of common LSH. Multi-probe LSH is built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by previous work on probabilistic similarity search structures and improves upon recent theoretical work on multi-probe and query-adaptive LSH. Whereas these methods are based on likelihood criteria that a given bucket contains query results, we define a more reliable a posteriori model that takes into account prior knowledge about the queries and the searched objects. This prior knowledge allows better quality control of the search and a more accurate selection of the most probable buckets. We implemented a nearest neighbors search based on this paradigm and performed experiments on several real visual feature datasets. We show that our a posteriori scheme outperforms other multi-probe LSH schemes while offering better quality control. Comparisons to the basic LSH technique show that our method allows consistent improvements in both space and time efficiency.

145 citations


Journal ArticleDOI
TL;DR: An automatic method for measuring content-based music similarity, enhancing the current generation of music search engines and recommender systems; the method is compatible with locality-sensitive hashing, allowing implementation with retrieval times several orders of magnitude faster than those using exhaustive distance computations.
Abstract: We propose an automatic method for measuring content-based music similarity, enhancing the current generation of music search engines and recommender systems. Many previous approaches to track similarity require brute-force, pair-wise processing between all audio features in a database and therefore are not practical for large collections. However, in an Internet-connected world, where users have access to millions of musical tracks, efficiency is crucial. Our approach uses features extracted from unlabeled audio data and near-neighbor retrieval using a distance threshold, determined by analysis, to solve a range of retrieval tasks. The tasks require temporal features, analogous to the technique of shingling used for text retrieval. To measure similarity, we count pairs of audio shingles, between a query and target track, that are below a distance threshold. The distribution of between-shingle distances is different for each database; therefore, we present an analysis of the distribution of minimum distances between shingles and a method for estimating a distance threshold for optimal retrieval performance. The method is compatible with locality-sensitive hashing (LSH), allowing implementation with retrieval times several orders of magnitude faster than those using exhaustive distance computations. We evaluate the performance of our proposed method on three contrasting music similarity tasks: retrieval of mis-attributed recordings (fingerprint), retrieval of the same work performed by different artists (cover songs), and retrieval of edited and sampled versions of a query track by remix artists (remixes). Our method achieves near-perfect performance in the first two tasks and 75% precision at 70% recall in the third task. Each task was performed on a test database comprising 4.5 million audio shingles.
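
The core similarity count is easy to state in code: for two tracks represented as sets of audio shingles, count the query-target pairs whose distance falls below the estimated threshold. The brute-force version below shows the measure itself; the article's point is that LSH makes the same count feasible without scanning every pair. Shapes and the threshold are illustrative.

```python
import numpy as np

def shingle_match_count(query_shingles, target_shingles, threshold):
    """Count (query, target) shingle pairs closer than `threshold`
    (brute force here; LSH would be used to avoid the full scan)."""
    q2 = np.sum(query_shingles**2, axis=1)[:, None]
    t2 = np.sum(target_shingles**2, axis=1)[None, :]
    d2 = q2 + t2 - 2.0 * query_shingles @ target_shingles.T
    return int(np.sum(d2 < threshold**2))

rng = np.random.default_rng(0)
q = rng.standard_normal((50, 12))            # query track shingles
t = q + 0.1 * rng.standard_normal((50, 12))  # a near-duplicate target track
print(shingle_match_count(q, t, threshold=1.0))
```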

118 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: A query by humming method based on locality sensitive hashing (LSH) is proposed; to retrieve audio signals, an automatic melody transcription method is applied to construct the melody database directly from music recordings, and the corresponding retrieval results are reported.
Abstract: This paper proposes a query by humming method based on locality sensitive hashing (LSH). The method constructs an index of melodic fragments by extracting pitch vectors from a database of melodies. In retrieval, the method automatically transcribes a sung query into notes and then extracts pitch vectors similarly to the index construction. For each query pitch vector, the method searches for similar melodic fragments in the database to obtain a list of candidate melodies. This is performed efficiently by using LSH. The candidate melodies are ranked by their distance to the entire query and returned to the user. In our experiments, the method achieved mean reciprocal rank of 0.885 for 2797 queries when searching from a database of 6030 MIDI melodies. To retrieve audio signals, we apply an automatic melody transcription method to construct the melody database directly from music recordings and report the corresponding retrieval results.

105 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: A novel formulation is presented that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method, which significantly outperforms VP-trees, a well-known method for distance-based indexing.
Abstract: A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as locality sensitive hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
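
A sketch of how binary hash functions can be built from nothing but a distance function: project each object onto the "line" through two pivot objects using only pairwise distances, then threshold. The pivot selection and median threshold below are illustrative choices in the spirit of distance-based hashing, not the paper's exact construction; the toy string distance simply shows that no vector-space structure is required.

```python
import random

def line_projection(x, p1, p2, dist):
    """'Project' x onto the line through pivots p1, p2 using only
    pairwise distances (no vector-space structure required)."""
    d12 = dist(p1, p2)
    return (dist(x, p1)**2 + d12**2 - dist(x, p2)**2) / (2.0 * d12)

def make_binary_hash(sample, dist, rng=random):
    """Pick a random pivot pair and threshold the projection at the
    sample median so each bit is roughly balanced (illustrative choice)."""
    p1, p2 = rng.sample(sample, 2)
    values = sorted(line_projection(x, p1, p2, dist) for x in sample)
    t = values[len(values) // 2]
    return lambda x: int(line_projection(x, p1, p2, dist) > t)

def string_dist(a, b):
    """A toy non-Euclidean distance on strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b)) + abs(len(a) - len(b))

sample = ["kitten", "sitting", "mitten", "bitten", "sitter", "knitting"]
h = make_binary_hash(sample, string_dist)
print([h(s) for s in sample])   # one bit per object; several such bits form a table key
```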

Journal ArticleDOI
TL;DR: This work studies and compares several HTML style similarity measures based on both textual and extra-textual features in HTML source code, and proposes a flexible algorithm to cluster a large collection of documents according to these measures.
Abstract: Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories). Those pages built using the same generating method (template or script) share a common “look and feel” that is not easily detected by common text classification methods, but is more related to stylometry. In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique. We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.

Proceedings ArticleDOI
24 Feb 2008
TL;DR: This paper presents a general pattern-based behavior synthesis framework which can efficiently extract similar structures in programs and applies it to FPGA resource optimization with the observation that multiplexors are particularly expensive on FPGAs.
Abstract: Pattern-based synthesis has drawn wide interest from researchers who try to utilize the regularity in applications for design optimizations. In this paper we present a general pattern-based behavior synthesis framework which can efficiently extract similar structures in programs. Our approach is highly scalable thanks to advanced pruning techniques, including locality sensitive hashing and characteristic vectors. The similarity of structures is captured by a mismatch-tolerant metric: graph edit distance. The edit distance between two graphs is the minimum number of vertex/edge insertion, deletion, and substitution operations needed to transform one graph into the other. Graph edit distance can naturally handle various program variations such as bit-width variations, structure variations and port variations. In addition, we apply our pattern-based synthesis system to FPGA resource optimization, with the observation that multiplexors are particularly expensive on FPGA platforms. Using knowledge of the discovered patterns, the resource binding step can intelligently generate the data-path to reduce interconnect costs. Experiments show our approach can, on average, reduce the total area by about 20% with 7% latency overhead on Xilinx Virtex-4 FPGAs, compared to the traditional behavior synthesis flow.

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This paper presents metric learning algorithms that scale linearly with dimensionality, permitting efficient optimization, storage, and evaluation of the learned metric, and shows that the learned metric achieves excellent quality with respect to various criteria.
Abstract: The success of popular algorithms such as k-means clustering or nearest neighbor search depends on the assumption that the underlying distance functions reflect domain-specific notions of similarity for the problem at hand. The distance metric learning problem seeks to optimize a distance function subject to constraints that arise from fully supervised or semi-supervised information. Several recent algorithms have been proposed to learn such distance functions in low-dimensional settings. One major shortcoming of these methods is their failure to scale to high-dimensional problems that are becoming increasingly ubiquitous in modern data mining applications. In this paper, we present metric learning algorithms that scale linearly with dimensionality, permitting efficient optimization, storage, and evaluation of the learned metric. This is achieved through our main technical contribution, which provides a framework based on the log-determinant matrix divergence that enables efficient optimization of structured, low-parameter Mahalanobis distances. Experimentally, we evaluate our methods across a variety of high-dimensional domains, including text, statistical software analysis, and collaborative filtering, showing that our methods scale to data sets with tens of thousands or more features. We show that our learned metric can achieve excellent quality with respect to various criteria. For example, in the context of metric learning for nearest neighbor classification, we show that our methods achieve 24% higher accuracy over the baseline distance. Additionally, our methods yield very good precision while providing recall measures up to 20% higher than other baseline methods such as latent semantic analysis.
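
For reference, the two central quantities the abstract relies on, written out in numpy: the squared Mahalanobis distance under a learned matrix A, and the LogDet matrix divergence used to regularize it. The matrices and vectors below are arbitrary examples, not learned values.

```python
import numpy as np

def mahalanobis(x, y, A):
    """Squared Mahalanobis distance d_A(x, y) = (x - y)^T A (x - y)."""
    diff = x - y
    return float(diff @ A @ diff)

def logdet_divergence(A, A0):
    """LogDet matrix divergence between positive-definite A and A0:
    tr(A A0^-1) - log det(A A0^-1) - d."""
    M = A @ np.linalg.inv(A0)
    sign, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - A.shape[0])

d = 5
A0 = np.eye(d)                      # prior metric (identity = Euclidean)
A = 1.1 * np.eye(d)                 # a slightly rescaled learned metric
x, y = np.random.randn(d), np.random.randn(d)
print(mahalanobis(x, y, A), logdet_divergence(A, A0))
```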

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper proposes a new fingerprint indexing and retrieval scheme using the scale-invariant feature transform (SIFT), which has been widely used in generic image retrieval, and shows the effectiveness of the proposed scheme.
Abstract: Most current fingerprint indexing schemes utilize features based on global textures and minutiae structures. To extend the existing technology of feature extraction, this paper proposes a new fingerprint indexing and retrieval scheme using the scale-invariant feature transform (SIFT), which has been widely used in generic image retrieval. With a slight loss in effectiveness, we reduce the number of features generated from one fingerprint for efficiency. To cope with the uncertainty of acquisition (e.g. partialness, distortion), we use a composite set of features to form multiple impressions for the fingerprint representation. In the index construction phase, the use of locality-sensitive hashing (LSH) allows us to perform similarity queries by examining only a small fraction of the database. Experiments on the FVC2000 and FVC2002 databases show the effectiveness of our proposed scheme.

Proceedings ArticleDOI
12 May 2008
TL;DR: Improvements to Locality-Sensitive Hashing are made by performing an on-line selection of the most appropriate hash functions from a pool of functions to greatly reduce the search complexity for a given level of accuracy.
Abstract: It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Many signal processing methods suffer from this computing cost. Dramatic performance gains can be obtained by using approximate search, such as the popular Locality-Sensitive Hashing. This paper improves LSH by performing an on-line selection of the most appropriate hash functions from a pool of functions. An additional improvement originates from the use of E8 lattices for geometric hashing instead of one-dimensional random projections. A performance study based on state-of-the-art high-dimensional descriptors computed on real images shows that our improvements to LSH greatly reduce the search complexity for a given level of accuracy.

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This work introduces several approximations based on the properties of concomitant order statistics and discrete transforms that perform almost as well, with significantly reduced computational cost.
Abstract: Locality Sensitive Hash functions are invaluable tools for approximate near neighbor problems in high dimensional spaces. In this work, we are focused on LSH schemes where the similarity metric is the cosine measure. The contribution of this work is a new class of locality sensitive hash functions for the cosine similarity measure based on the theory of concomitants, which arises in order statistics. Consider n i.i.d. sample pairs {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} obtained from a bivariate distribution f(X, Y). Concomitant theory captures the relation between the order statistics of X and Y in the form of a rank distribution given by Prob(Rank(Y_i) = j | Rank(X_i) = k). We exploit properties of the rank distribution to develop a locality sensitive hash family that has excellent collision rate properties for the cosine measure. The computational cost of the basic algorithm is high for large hash lengths. We introduce several approximations based on the properties of concomitant order statistics and discrete transforms that perform almost as well, with significantly reduced computational cost. We demonstrate the practical applicability of our algorithms by using them to find similar images in an image repository.
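
For context, here is the classical random-hyperplane family for the cosine measure that this work offers an alternative to: each hyperplane contributes one bit, and two vectors agree on a bit with probability 1 - theta/pi, where theta is the angle between them. This is the baseline scheme, not the concomitant-based construction from the paper.

```python
import numpy as np

def simhash_bits(v, planes):
    """Classical random-hyperplane LSH for cosine similarity:
    one bit per hyperplane, Pr[bits agree] = 1 - theta/pi."""
    return (planes @ v > 0).astype(np.uint8)

rng = np.random.default_rng(0)
planes = rng.standard_normal((256, 64))     # 256 hash bits for 64-dim data

u = rng.standard_normal(64)
v = u + 0.3 * rng.standard_normal(64)       # a vector at a small angle to u

agree = np.mean(simhash_bits(u, planes) == simhash_bits(v, planes))
theta = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(agree, 1 - theta / np.pi)             # empirical vs. expected agreement
```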

Journal ArticleDOI
TL;DR: This work presents the Balanced Exploration and Exploitation Model Search (BEEM) algorithm that works very well especially for these difficult scenes and achieves significant speedups compared to the state of the art algorithms.
Abstract: The estimation of the epipolar geometry is especially difficult when the putative correspondences include a low percentage of inlier correspondences and/or a large subset of the inliers is consistent with a degenerate configuration of the epipolar geometry that is totally incorrect. This work presents the balanced exploration and exploitation model (BEEM) search algorithm, which works very well especially for these difficult scenes. The algorithm handles these two problems in a unified manner. It includes the following main features: 1) balanced use of three search techniques: global random exploration, local exploration near the current best solution, and local exploitation to improve the quality of the model, 2) exploitation of available prior information to accelerate the search process, 3) use of the best found model to guide the search process, escape from degenerate models, and define an efficient stopping criterion, 4) presentation of a simple and efficient method to estimate the epipolar geometry from two scale-invariant feature transform (SIFT) correspondences, and 5) use of the locality-sensitive hashing (LSH) approximate nearest neighbor algorithm for fast putative correspondence generation. The resulting algorithm when tested on real images with or without degenerate configurations gives quality estimations and achieves significant speedups compared to the state-of-the-art algorithms.

Journal ArticleDOI
Shumeet Baluja, Michele Covell
TL;DR: A method to learn a similarity function from only weakly labeled positive examples is described, used as the basis of a hash function to severely constrain the number of points considered for each lookup in a large corpus of high-dimensional data points.
Abstract: The problem of efficiently finding similar items in a large corpus of high-dimensional data points arises in many real-world tasks, such as music, image, and video retrieval. Beyond the scaling difficulties that arise with lookups in large data sets, the complexity in these domains is exacerbated by an imprecise definition of similarity. In this paper, we describe a method to learn a similarity function from only weakly labeled positive examples. Once learned, this similarity function is used as the basis of a hash function to severely constrain the number of points considered for each lookup. Tested on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) are ever considered for each lookup. To increase efficiency, no comparisons in the original high-dimensional space of points are required. The performance far surpasses, in terms of both efficiency and accuracy, a state-of-the-art Locality-Sensitive-Hashing-based (LSH) technique for the same problem and data set.

Book ChapterDOI
01 Jan 2008
TL;DR: This work quantitatively analyze the performance of exact and approximate nearest-neighbors algorithms on increasingly high-dimensional problems in the context of sampling-based motion planning and studies the impact of the dimension, number of samples, distance metrics, and sampling schemes on the efficiency and accuracy.
Abstract: We quantitatively analyze the performance of exact and approximate nearest-neighbors algorithms on increasingly high-dimensional problems in the context of sampling-based motion planning. We study the impact of the dimension, number of samples, distance metrics, and sampling schemes on the efficiency and accuracy of nearest-neighbors algorithms. Efficiency measures computation time and accuracy indicates similarity between exact and approximate nearest neighbors.

Proceedings ArticleDOI
26 Oct 2008
TL;DR: A randomized algorithm to embed a set of features into a single high-dimensional vector to simplify the feature-set matching problem; it can achieve accuracy comparable to the state-of-the-art feature-set matching methods, while requiring significantly less space and time.
Abstract: As the commonly used representation of a feature-rich data object has evolved from a single feature vector to a set of feature vectors, a key challenge in building a content-based search engine for feature-rich data is to match feature-sets efficiently. Although substantial progress has been made during the past few years, existing approaches are still inefficient and inflexible for building a search engine for massive datasets. This paper presents a randomized algorithm to embed a set of features into a single high-dimensional vector to simplify the feature-set matching problem. The main idea is to project feature vectors into an auxiliary space using locality sensitive hashing and to represent a set of features as a histogram in the auxiliary space. A histogram is simply a high dimensional vector, and efficient similarity measures like L1 and L2 distances can be employed to approximate feature-set distance measures. We evaluated the proposed approach under three different task settings, i.e. content-based image search, image object recognition and near-duplicate video clip detection. The experimental results show that the proposed approach is indeed effective and flexible. It can achieve accuracy comparable to the feature-set matching methods, while requiring significantly less space and time. For object recognition with the Caltech 101 dataset, our method runs 25 times faster to achieve the same precision as Pyramid Matching Kernel, the state-of-the-art feature-set matching method.
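
A compact sketch of the embedding idea: hash every local feature of an object into an LSH bucket, count bucket occupancies, and compare the resulting histograms with an L1 distance. The bit count, bucket count, and synthetic feature sets are illustrative.

```python
import numpy as np

def lsh_histogram(features, planes, n_buckets):
    """Embed a set of feature vectors as a histogram over LSH buckets:
    each feature hashes to one bucket, so the whole set becomes one vector."""
    bits = (features @ planes.T > 0).astype(np.int64)
    keys = bits @ (1 << np.arange(planes.shape[0]))
    hist = np.bincount(keys % n_buckets, minlength=n_buckets).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
planes = rng.standard_normal((12, 128))      # 12-bit LSH keys for 128-dim features

A = rng.standard_normal((300, 128))                         # feature set of object A
B = np.vstack([A[:250], rng.standard_normal((50, 128))])    # a mostly overlapping set

hA, hB = lsh_histogram(A, planes, 4096), lsh_histogram(B, planes, 4096)
print(np.abs(hA - hB).sum())                 # L1 distance between the embeddings
```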

Proceedings Article
01 Jan 2008
TL;DR: An efficient mapping scheme based on p-stable Locality Sensitive Hashing is presented to assign hash buckets to peers in a Chord-style overlay network, and load balancing is considered by harnessing estimates of the resulting data mapping, which follows a normal distribution.
Abstract: We consider K-Nearest Neighbor search for high dimensional data in large-scale structured Peer-to-Peer networks. We present an efficient mapping scheme based on p-stable Locality Sensitive Hashing to assign hash buckets to peers in a Chord-style overlay network. To minimize network traffic, we process queries in an incremental top-K fashion, leveraging a locality-preserving mapping to the peer space. Furthermore, we consider load balancing by harnessing estimates of the resulting data mapping, which follows a normal distribution. We report on a comprehensive performance evaluation using high dimensional real-world data, demonstrating the suitability of our approach.
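
A minimal sketch of the building blocks: a p-stable (Gaussian projection) LSH key for a vector, and a mapping of that bucket key onto a Chord-style identifier ring. The uniform SHA-1 placement below is only a stand-in; the paper designs a locality-preserving mapping to the peer space and uses it for incremental top-K query processing.

```python
import hashlib
import numpy as np

def pstable_key(v, A, b, W):
    """p-stable (Gaussian projection) LSH: h_i(v) = floor((a_i . v + b_i) / W)."""
    return tuple(int(h) for h in np.floor((A @ v + b) / W))

def peer_for_key(key, ring_bits=32):
    """Place a bucket on a Chord-style identifier ring by hashing its key.
    (Uniform hashing is a stand-in; the paper uses a locality-preserving mapping.)"""
    digest = hashlib.sha1(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % (2 ** ring_bits)

rng = np.random.default_rng(0)
dim, n_hashes, W = 20, 6, 4.0
A = rng.standard_normal((n_hashes, dim))
b = rng.uniform(0, W, n_hashes)

v = rng.standard_normal(dim)
key = pstable_key(v, A, b, W)
print(key, peer_for_key(key))
```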

Proceedings ArticleDOI
30 Oct 2008
TL;DR: A new method for robust content-based video copy detection based on local spatio-temporal features; as shown by experimental validation, these features bring additional robustness and discriminativity to the task of video footage reuse detection in news broadcasts.
Abstract: In this paper, we present a new method for robust content-based video copy detection based on local spatio-temporal features. As we show by experimental validation, the use of local spatio-temporal features instead of purely spatial ones brings additional robustness and discriminativity. Efficient operation is ensured by using the new spatio-temporal features proposed in [20]. To cope with the high dimensionality of the resulting descriptors, these features are incorporated in a disk-based index and query system based on p-stable locality sensitive hashing. The system is applied to the task of video footage reuse detection in news broadcasts. Results are reported on 88 hours of news broadcast data from the TRECVID2006 dataset.

Proceedings ArticleDOI
09 Sep 2008
TL;DR: This paper presents simple and space-efficient Bounded-LSH to map a non-uniform data space into load-balanced hash buckets that contain approximately equal numbers of objects, requiring fewer hash tables while maintaining a high probability of returning the closest objects to requests.
Abstract: Similarity search has been widely studied in peer-to-peer environments. In this paper, we propose the Bounded Locality Sensitive Hashing (Bounded LSH) method for similarity search in P2P file systems. Compared to basic Locality Sensitive Hashing (LSH), Bounded LSH improves on space saving and quick query response in similarity search, especially for high-dimensional data objects that exhibit a non-uniform distribution. We present simple and space-efficient Bounded-LSH to map a non-uniform data space into load-balanced hash buckets that contain approximately equal numbers of objects. Load-balanced hash buckets in Bounded-LSH, in turn, require fewer hash tables while maintaining a high probability of returning the closest objects to requests. Our experiments based on synthetic and real-world datasets showed the feasibility, query efficiency and space efficiency of our proposed method.

Proceedings Article
01 Jan 2008
TL;DR: The utility of reverse nearest neighbor search is demonstrated by showing how it can help improve classification accuracy, and exact and approximate algorithms are proposed that do not require pre-computation of nearest neighbor distances and can potentially prune off most of the search space.
Abstract: Reverse nearest neighbor queries are useful in identifying objects that are of significant influence or importance. Existing methods either rely on pre-computation of nearest neighbor distances, do not scale well with high dimensionality, or do not produce exact solutions. In this work we motivate and investigate the problem of reverse nearest neighbor search on high dimensional, multimedia data. We propose exact and approximate algorithms that do not require pre-computation of nearest neighbor distances, and can potentially prune off most of the search space. We demonstrate the utility of reverse nearest neighbor search by showing how it can help improve the classification accuracy.

Proceedings ArticleDOI
12 May 2008
TL;DR: In this paper, the authors introduce the idea of permutation-grouping to intelligently design the hash functions that are used to index the LSH tables, which helps to overcome the inefficiencies introduced by hashing real-world data that is noisy, structured, and most importantly is not independently and identically distributed.
Abstract: The combination of MinHash-based signatures and locality-sensitive hashing (LSH) schemes has been effectively used for finding approximate matches in very large audio and image retrieval systems. In this study, we introduce the idea of permutation-grouping to intelligently design the hash functions that are used to index the LSH tables. This helps to overcome the inefficiencies introduced by hashing real-world data that is noisy, structured, and most importantly is not independently and identically distributed. Through extensive tests, we find that permutation-grouping dramatically increases the efficiency of the overall retrieval system by lowering the number of low-probability candidates that must be examined by 30-50%.
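
For background, a plain MinHash-plus-banding sketch of the kind of retrieval system being tuned: each set gets a signature of per-permutation minima, and the signature is cut into bands that serve as LSH table keys. The paper's permutation-grouping chooses how hash functions are grouped into keys more intelligently than the naive contiguous banding shown here.

```python
import numpy as np

def minhash_signature(item_set, n_hashes, prime=2_147_483_647, seed=0):
    """MinHash signature: for each hash function keep the minimum
    hashed value over the set's elements."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, n_hashes)
    b = rng.integers(0, prime, n_hashes)
    items = np.array([hash(x) % prime for x in item_set])
    return ((a[:, None] * items[None, :] + b[:, None]) % prime).min(axis=1)

def band_keys(signature, rows_per_band):
    """Group signature entries into bands; each band is one LSH key.
    (Permutation-grouping would choose these groups more carefully.)"""
    sig = signature.reshape(-1, rows_per_band)
    return [tuple(band) for band in sig]

s1 = minhash_signature({"a", "b", "c", "d", "e"}, n_hashes=12)
s2 = minhash_signature({"a", "b", "c", "d", "f"}, n_hashes=12)
shared = set(band_keys(s1, 3)) & set(band_keys(s2, 3))
print(len(shared), "matching bands")  # any match makes the pair a candidate
```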

01 Jan 2008
TL;DR: This work proposes an approximate computation technique for inter-object distances in binary data sets based on locality sensitive hashing, which scales up with the number of objects and is much faster than the “brute-force” computation of these distances.
Abstract: We propose an approximate computation technique for inter-object distances in binary data sets. Our approach is based on locality sensitive hashing, scales up with the number of objects, and is much faster than the “brute-force” computation of these distances.

Proceedings ArticleDOI
09 Jun 2008
TL;DR: This study looks at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy, and computes sketches of such trees by propagating min-hash computations up the tree, analyzed in the context of locality-sensitive hashing.
Abstract: In this study we propose sketching algorithms for computing similarities between hierarchical data. Specifically, we look at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets wherein chapters, sections, and paragraphs represent different levels in the hierarchy. Such a representation is richer than viewing the document simply as a set of words. We measure distance between trees using the best possible super-imposition that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure since the leaf-labeled trees of height one can be viewed as sets and can be recursively extended to trees of larger height by viewing them as set of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH) where the probability of two sketches matching is high when two trees are similar and low when the two trees are far under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties while propagating two min-hashes results in good sketches.

Proceedings ArticleDOI
19 May 2008
TL;DR: A scalable localization algorithm is proposed for incremental databases of high dimensional features, and the Monte Carlo localization (MCL) algorithm is extended by employing exact Euclidean locality sensitive hashing (LSH).
Abstract: In recent years, high-dimensional descriptive features have been widely used for feature-based robot localization. However, the space/time costs of building and retrieving the map database tend to be significant due to the high dimensionality. In addition, most existing databases work well only on batch problems and are difficult to build incrementally by a mapper robot. In this paper, a scalable localization algorithm is proposed for incremental databases of high dimensional features. The Monte Carlo localization (MCL) algorithm is extended by employing exact Euclidean locality sensitive hashing (LSH). The robustness and efficiency of the proposed algorithms have been demonstrated using the Radish dataset.