
Showing papers on "Locality-sensitive hashing published in 2001"


Journal ArticleDOI
TL;DR: This work develops a family of algorithms for mining association rules among infrequent, highly correlated items, employing a combination of random sampling and hashing techniques, and provides both an analysis of the algorithms and experiments on real and synthetic data for a comparative performance evaluation.
Abstract: Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar Web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
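The hashing half of this sampling-and-hashing combination is typically a min-hashing scheme for estimating the similarity of item columns without support pruning. As a minimal sketch (an illustration of the technique, not the authors' exact algorithm), a min-hash signature can be built from the set of baskets in which an item appears; items that co-occur in nearly the same baskets get similar signatures even when their absolute support is tiny:

```python
import random

def minhash_signature(item_rows, num_hashes, seed=0):
    """Min-hash signature for the set of basket/row ids in which an
    item appears.  Each hash approximates a random permutation of
    row ids via (a*x + b) mod p.  Signatures to be compared must be
    built with the same seed, so the same hash functions are used."""
    rng = random.Random(seed)
    p = 2_147_483_647  # a Mersenne prime larger than any row id
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * r + b) % p for r in item_rows) for a, b in coeffs]

def estimated_jaccard(sig_x, sig_y):
    """The fraction of agreeing signature positions estimates the
    Jaccard similarity |rows(x) & rows(y)| / |rows(x) | rows(y)|."""
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)
```

Two rare items whose baskets overlap heavily will show a high estimated similarity, which is exactly the signal a support-based method like a priori would miss.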

370 citations


01 Jan 2001
TL;DR: This paper presents a robust audio hashing scheme for content identification, shows that the extracted hashes are robust against severe compression although bit errors do occur, and proposes a structured search algorithm that allows searching databases containing over 100,000 songs.
Abstract: Nowadays most audio content identification systems are based on watermarking technology. In this paper we present a different technology, referred to as robust audio hashing. By extracting robust features and translating them into a bit string, we get an object called a robust hash. Content can then be identified by comparing hash values of a received audio clip with the hash values of previously stored original audio clips. A distinguishing feature of the proposed hash scheme is its ability to extract a bit string for every so many milliseconds. More precisely, for every windowed time interval a hash value of 32 bits is computed by thresholding energy differences of several frequency bands. A sequence of 256 hash values, corresponding to approximately 3 seconds of audio, can uniquely identify a song. Experimental results show that the proposed scheme is robust against severe compression, but bit errors do occur. This implies that searching and matching is a non-trivial task for large databases. For instance, a brute force search approach is already prohibitive for databases containing hash values of more than 100 songs. Therefore we propose a scheme that exploits a structured search algorithm that allows searching databases containing over 100,000 songs.
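The bit-extraction step described above (a 32-bit hash per windowed interval, obtained by thresholding energy differences of several frequency bands) can be sketched as follows. The band count of 33 and the difference-of-differences rule across consecutive frames are assumptions modeled on published robust audio hashing work, not necessarily this paper's precise formula:

```python
def subfingerprint(prev_band_energies, band_energies):
    """Derive one 32-bit hash value from the energies of 33
    frequency bands in the current and previous frame: bit m is the
    sign of the difference-of-differences between adjacent bands
    and consecutive frames (the thresholding step)."""
    assert len(band_energies) == len(prev_band_energies) == 33
    bits = 0
    for m in range(32):
        d_now = band_energies[m] - band_energies[m + 1]
        d_prev = prev_band_energies[m] - prev_band_energies[m + 1]
        bits = (bits << 1) | (1 if d_now - d_prev > 0 else 0)
    return bits

def hamming(h1, h2):
    """Bit-error count between two 32-bit sub-fingerprints;
    matching compares sequences of these against a threshold."""
    return bin(h1 ^ h2).count("1")
```

A sequence of 256 such values then identifies roughly 3 seconds of audio, and matching tolerates the bit errors the abstract mentions by comparing Hamming distances rather than requiring exact equality.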

317 citations


Book ChapterDOI
05 Sep 2001
TL;DR: This work develops a recommendation system, termed Yoda, that is designed to support large-scale Web-based applications requiring highly accurate recommendations in real-time and introduces a hybrid approach that combines collaborative filtering (CF) and content-based querying to achieve higher accuracy.
Abstract: Recommendation systems are applied to personalize and customize the Web environment. We have developed a recommendation system, termed Yoda, that is designed to support large-scale Web-based applications requiring highly accurate recommendations in real-time. With Yoda, we introduce a hybrid approach that combines collaborative filtering (CF) and content-based querying to achieve higher accuracy. Yoda is structured as a tunable model that is trained off-line and employed for real-time recommendation on-line. The on-line process benefits from an optimized aggregation function with low complexity that allows real-time weighted aggregation of the soft classification of active users to predefined recommendation sets. Leveraging the localized distribution of the recommendable items, the same aggregation function is further optimized for the off-line process to reduce the time complexity of constructing the pre-defined recommendation sets of the model. To further improve the scalability of the off-line process, we also propose a filtering mechanism, FLSH, that extends the Locality Sensitive Hashing technique by incorporating a novel distance measure that satisfies specific requirements of our application. Our end-to-end experiments show that while Yoda's complexity is low and remains constant as the number of users and/or items grows, its accuracy surpasses that of the basic nearest-neighbor method by a wide margin (in most cases more than 100%).

137 citations


Patent
24 Apr 2001
TL;DR: In this article, the authors describe an implementation of a technology for recognizing the perceptual similarity of the content of digital goods, which produces hash values for digital goods that are proximally near each other, when the digital goods contain similar content.
Abstract: An implementation of a technology is described herein for recognizing the perceptual similarity of the content of digital goods. At least one implementation, described herein, introduces a new hashing technique. More particularly, this hashing technique produces hash values for digital goods that are proximally near each other, when the digital goods contain perceptually similar content. In other words, if the content of digital goods is perceptually similar, then their hash values are, likewise, similar. The hash values are proximally near each other. This is unlike conventional hashing techniques where the hash values of goods with perceptually similar content are far apart with high probability in some distance sense (e.g., Hamming). This abstract itself is not intended to limit the scope of this patent. The scope of the present invention is pointed out in the appended claims.

82 citations


Proceedings ArticleDOI
Cheng Yang
21 Oct 2001
TL;DR: The algorithm tries to capture the intuitive notion of similarity perceived by humans: two pieces are similar if they are fully or partially based on the same score, even if they are performed by different people or at different speeds.
Abstract: We present a prototype method of indexing raw-audio music files in a way that facilitates content-based similarity retrieval. The algorithm tries to capture the intuitive notion of similarity perceived by humans: two pieces are similar if they are fully or partially based on the same score, even if they are performed by different people or at different speeds. Local peaks in signal power are identified in each audio file, and a spectral vector is extracted near each peak. Nearby peaks are selectively grouped together to form "characteristic sequences" which are used as the basis for indexing. A hashing scheme known as "locality-sensitive hashing" is employed to index the high-dimensional vectors. Retrieval results are ranked based on the number of final matches filtered by some linearity criteria.
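The abstract names locality-sensitive hashing without fixing a hash family; a common choice for high-dimensional real-valued vectors such as spectral features is random-hyperplane (cosine) LSH, sketched below as an assumed illustration rather than the paper's exact scheme:

```python
import random

def make_lsh(dim, num_bits, seed=0):
    """Random-hyperplane LSH: each bit is the sign of the dot
    product with a random Gaussian vector, so vectors separated by
    a small angle receive the same key with high probability."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

    def h(v):
        key = 0
        for plane in planes:
            dot = sum(p * x for p, x in zip(plane, v))
            key = (key << 1) | (1 if dot >= 0 else 0)
        return key

    return h
```

Indexing then amounts to bucketing each characteristic-sequence vector by its key; at query time only the vectors sharing the query's bucket (optionally across several independent tables) are compared exactly.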

75 citations


Proceedings ArticleDOI
01 Oct 2001
TL;DR: A novel algorithm for adaptive nearest neighbor computations for high dimensional feature vectors when the number of items in the database is large, which exploits the correlations between two consecutive nearest neighbor searches when the underlying similarity metric is changing.
Abstract: Relevance feedback is often used in refining similarity retrievals in image and video databases. Typically this involves modification of the similarity metrics based on the user feedback and recomputing a set of nearest neighbors using the modified similarity values. Such nearest neighbor computations are expensive given that typical image features, such as color and texture, are represented in high dimensional spaces. Search complexity is a critical issue when dealing with large databases, and this issue has not received much attention in relevance feedback research. Most of the current methods report results on very small data sets, of the order of a few thousand items, where a sequential (and hence exhaustive) search is practical. The main contribution of this paper is a novel algorithm for adaptive nearest neighbor computations for high dimensional feature vectors when the number of items in the database is large. The proposed method exploits the correlations between two consecutive nearest neighbor searches when the underlying similarity metric is changing, and filters out a significant number of candidates in a two-stage search and retrieval process, thus reducing the number of I/O accesses to the database. Detailed experimental results are provided using a set of about 700,000 images. Comparison to the existing method shows an order of magnitude overall improvement.
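One simple way to exploit the correlation between consecutive searches, sketched here as an assumption rather than the paper's actual filter, uses a bound between diagonal weighted-Euclidean metrics: since each weight satisfies w_new[i] >= min_j(w_new[j]/w_old[j]) * w_old[i], we get d_new(q,x)^2 >= min_j(w_new[j]/w_old[j]) * d_old(q,x)^2, so distances stored from the previous search can prune candidates before any feature vector is re-read from disk:

```python
def filter_candidates(old_dists, w_old, w_new, tau):
    """Given distances computed under the previous diagonal
    weighted-Euclidean metric (old_dists: id -> distance), keep
    only the ids that can still fall within radius tau under the
    new weights, using the lower bound
    d_new^2 >= min_i(w_new[i] / w_old[i]) * d_old^2."""
    ratio = min(wn / wo for wn, wo in zip(w_new, w_old))
    return [i for i, d in old_dists.items() if ratio * d * d <= tau * tau]
```

Only the survivors of this first stage need their exact new-metric distances computed in the second stage, which is where the reduction in I/O accesses comes from.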

69 citations


Proceedings ArticleDOI
30 Sep 2001
TL;DR: A novel access structure for similarity search in metric databases, called Similarity Hashing (SH), is proposed: a multi-level hash structure consisting of search-separable bucket sets on each level, suitable for distributed and parallel implementations.
Abstract: A novel access structure for similarity search in metric databases, called Similarity Hashing (SH), is proposed. It is a multi-level hash structure, consisting of search-separable bucket sets on each level. The structure supports easy insertion and bounded search costs, because at most one bucket needs to be accessed at each level for range queries up to a pre-defined value of search radius. At the same time, the pivot-based strategy significantly reduces the number of distance computations. Contrary to tree organizations, the SH structure is suitable for distributed and parallel implementations.
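The pivot-based reduction of distance computations rests on the triangle inequality: since |d(q,p) - d(x,p)| <= d(q,x), precomputed distances to a pivot p can rule out objects without ever evaluating the (possibly expensive) metric. A minimal sketch follows; the function names and flat layout are illustrative, and SH's actual multi-level bucket organization is not reproduced here:

```python
def can_skip(d_q_pivot, d_x_pivot, radius):
    """Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x).  If the
    stored pivot distances already differ by more than the search
    radius, x cannot be within radius of q."""
    return abs(d_q_pivot - d_x_pivot) > radius

def range_query(query, data, pivot_dists, dist, d_q_pivot, radius):
    """Range search that consults stored pivot distances first and
    only computes dist() for objects that survive the filter."""
    out = []
    for i, x in enumerate(data):
        if can_skip(d_q_pivot, pivot_dists[i], radius):
            continue
        if dist(query, x) <= radius:
            out.append(i)
    return out
```

With several pivots the filters compose (an object is skipped if any pivot rules it out), which is how such structures keep the number of real distance computations low.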

28 citations



Proceedings ArticleDOI
18 Apr 2001
TL;DR: This work proposes an index structure, the ANN-tree (approximate nearest neighbor tree), which is demonstrably more efficient than existing structures like the R*-tree and is a preferable index structure for both exact and approximate nearest neighbor searches.
Abstract: We explore the problem of approximate nearest neighbor searches. We propose an index structure, the ANN-tree (approximate nearest neighbor tree) to solve this problem. The ANN-tree supports high accuracy nearest neighbor search. The actual nearest neighbor of a query point can usually be found in the first leaf page accessed. The accuracy increases to near 100% if a second page is accessed. This is not achievable via traditional indexes. Even if an exact nearest neighbor query is desired, the ANN-tree is demonstrably more efficient than existing structures like the R*-tree. This makes the ANN-tree a preferable index structure for both exact and approximate nearest neighbor searches. We present the index in detail and provide experimental results on both real and synthetic data sets.

16 citations


Proceedings ArticleDOI
05 Oct 2001
TL;DR: A new approach for processing nearest neighbor search with the Euclidean metric, which searches over only a small subset of the original space and effectively approximates clusters by encapsulating them into geometrically regular shapes and also computes better upper and lower bounds of the distances from the query point to the clusters.
Abstract: The nearest neighbor search is an important operation widely used in multimedia databases. In higher dimensions, most previous methods for nearest neighbor search become inefficient and must compute nearest neighbor distances to a large fraction of the points in the space. In this paper, we present a new approach for processing nearest neighbor search with the Euclidean metric, which searches over only a small subset of the original space. This approach effectively approximates clusters by encapsulating them into geometrically regular shapes and also computes better upper and lower bounds of the distances from the query point to the clusters. To show the effectiveness of the proposed approach, we perform extensive experiments. The results reveal that the proposed approach significantly outperforms the X-tree as well as the sequential scan.
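The simplest instance of such cluster bounds encapsulates each cluster in a ball; the paper's geometrically regular shapes yield tighter bounds than this, so the following is only a hedged sketch of the pruning idea:

```python
import math

def sphere_bounds(query, center, radius):
    """Lower and upper bounds on the distance from a query point to
    any point of a cluster enclosed in ball(center, radius)."""
    d = math.dist(query, center)
    return max(d - radius, 0.0), d + radius

def prune_clusters(query, clusters, best_so_far):
    """Skip clusters whose lower bound already exceeds the current
    nearest-neighbor distance; return the surviving cluster ids,
    closest lower bound first, so promising clusters are read
    before less promising ones."""
    keep = []
    for cid, (center, radius) in clusters.items():
        lo, _ = sphere_bounds(query, center, radius)
        if lo <= best_so_far:
            keep.append((lo, cid))
    return [cid for _, cid in sorted(keep)]
```

Tighter upper bounds shrink best_so_far faster, and tighter lower bounds prune more clusters, which is why better-fitting enclosing shapes translate directly into fewer points examined.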

7 citations


Proceedings Article
01 Jan 2001
TL;DR: The proposed algorithm, which makes use of the triangle inequality property, is considered from a function minimization perspective, and simulation results are provided that suggest better performance than that obtained with spatial partition techniques such as Elias and k-d tree, for moderate size point sets.
Abstract: This paper describes a solution to the nearest neighbor problem. The proposed algorithm, which makes use of the triangle inequality property, is considered from a function minimization perspective. The distance function is regularized through the computation of distance to a reference point; an initial starting point is rapidly found, and used in an iterative refinement using search over a sorted list. The algorithm is described, and simulation results are provided that suggest better performance than that obtained with spatial partition techniques such as Elias and k-d tree, for moderate size point sets.
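The sorted-list search described above can be sketched as an annulus-style scan (an illustrative reconstruction, not the paper's exact procedure): points are sorted by distance to the reference point, the search starts at the position matching the query's own reference distance, and expands outward until the triangle-inequality lower bound |d(q,ref) - d(x,ref)| <= d(q,x) shows that no remaining point can improve on the best distance found:

```python
import bisect
import math

def build_index(points, ref):
    """Sort points by their distance to a fixed reference point,
    the regularization step described in the abstract."""
    return sorted((math.dist(p, ref), p) for p in points)

def nearest(entries, query, ref):
    """Scan outward from the query's reference distance; stop once
    the smaller reference-distance gap exceeds the best distance,
    since |d(q,ref) - d(x,ref)| bounds d(q,x) from below."""
    dq = math.dist(query, ref)
    keys = [k for k, _ in entries]
    i = bisect.bisect_left(keys, dq)   # rapid initial starting point
    best, best_p = float("inf"), None
    lo, hi = i - 1, i
    while lo >= 0 or hi < len(entries):
        gap_lo = dq - keys[lo] if lo >= 0 else float("inf")
        gap_hi = keys[hi] - dq if hi < len(entries) else float("inf")
        if min(gap_lo, gap_hi) > best:
            break                      # no remaining point can win
        if gap_lo <= gap_hi:
            _, p = entries[lo]; lo -= 1
        else:
            _, p = entries[hi]; hi += 1
        d = math.dist(query, p)
        if d < best:
            best, best_p = d, p
    return best_p, best
```

The iterative refinement terminates early whenever the annulus gap on both sides exceeds the best distance, so for well-spread data only a small slice of the sorted list is ever touched.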