Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over the lifetime, 1894 publications have been published within this topic receiving 69362 citations.


Papers
Proceedings ArticleDOI
14 May 1975
TL;DR: A study of the performance measures obtained during tests of "Distribution-dependent" hashing functions indicates that in certain cases, distribution-dependent methods perform better than the division method.
Abstract: In this paper procedures are studied for storing, accessing, updating, and reorganizing data in large files whose organization is direct, an organization used when a fast response time is required. "Distribution-dependent" hashing functions and the division method are compared as methods of indirect addressing. "Distribution-dependent" hashing functions are characterized. These hashing functions generate addresses from a set of keys by using knowledge of the distribution of that key set within the key space or range of keys. A study of the performance measures obtained during tests of these functions on several key sets indicates that in certain cases, distribution-dependent methods perform better than the division method. This result is extended by a demonstration that distribution-dependent hashing functions can accommodate a change in the distribution of keys without being redefined. A number of insertions to and deletions from the key set can be made before a distribution-dependent hashing function gives poorer performance than the division method under identical circumstances. If many additions are made to a set of keys, it becomes necessary to reorganize, in a larger storage area, the direct file of records identified by that key set. Although processor time must be sacrificed in order to redefine a distribution-dependent hashing function, the division method requires substantially greater access time in a reorganizational situation.

9 citations
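The contrast in the abstract above can be sketched in a few lines. The division method is standard (key modulo table size); the "distribution-dependent" variant below is only an illustrative toy that assigns addresses from a key's empirical rank in a sampled key set, so addresses follow the key distribution. Function names and the rank-based scheme are assumptions for illustration, not the paper's exact constructions.

```python
import bisect

def division_hash(key: int, table_size: int = 101) -> int:
    """Division method: the address is the key modulo the table size
    (commonly a prime)."""
    return key % table_size

def make_distribution_dependent_hash(sample_keys, table_size: int = 101):
    """Toy distribution-dependent scheme: build a hash from knowledge of
    the key set's distribution, here via ranks in a sorted sample."""
    boundaries = sorted(sample_keys)
    n = max(len(boundaries), 1)

    def h(key: int) -> int:
        # Rank of the key within the sample, scaled to the address space.
        rank = bisect.bisect_left(boundaries, key)
        return min(rank * table_size // n, table_size - 1)

    return h
```

Because the rank-based function spreads addresses according to where keys actually fall, it can tolerate some drift in the key distribution before needing redefinition, which mirrors the paper's observation.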

Proceedings ArticleDOI
13 Jul 2014
TL;DR: Experiments carried out on an archive of aerial images show that the presented hashing methods are one hundred times faster than those that exploit an exact nearest neighbor search while keeping a high retrieval accuracy.
Abstract: This paper presents hashing based approximate nearest neighbor search algorithms that allow fast and accurate image retrieval in huge remote sensing data archives. Hashing methods aim at mapping high-dimensional image feature vectors into short binary codes based on hashing functions. Then, the image retrieval is accomplished according to Hamming distances of image hash codes. In particular, in this paper two hashing methods are adopted for RS image retrieval problems. The former aims at defining hash functions in the kernel space by using only unlabeled images. The latter leverages the semantic similarity given in terms of annotated images to define more distinctive hash functions in the kernel space. The effectiveness of both methods is analyzed in terms of RS image retrieval accuracy as well as retrieval time. Experiments carried out on an archive of aerial images show that the presented hashing methods are one hundred times faster than those that exploit an exact nearest neighbor search while keeping a high retrieval accuracy.

9 citations
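The retrieval pipeline in the abstract above (feature vectors → short binary codes → ranking by Hamming distance) can be sketched as follows. This is a minimal sketch using plain random hyperplanes; the paper's methods are kernel-based and possibly supervised, which this does not reproduce, and all array shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature vectors for an "archive" of images plus one query.
dim, n_bits = 64, 32
archive = rng.standard_normal((1000, dim))
query = rng.standard_normal(dim)

# Hash functions: random hyperplanes; one bit per hyperplane side.
planes = rng.standard_normal((dim, n_bits))

def to_code(x):
    """Map feature vector(s) to short binary codes."""
    return (x @ planes > 0).astype(np.uint8)

codes = to_code(archive)      # precomputed archive codes
q_code = to_code(query)

# Retrieval: rank archive items by Hamming distance between hash codes.
hamming = np.count_nonzero(codes != q_code, axis=1)
top10 = np.argsort(hamming)[:10]
```

The speedup the paper reports comes from this last step: comparing 32-bit codes with XOR/popcount-style operations is far cheaper than exact nearest-neighbor distance computations in the original feature space.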

Journal ArticleDOI
Deng Cai1
TL;DR: In this article, a simple but effective novel hash index search approach was proposed and a thorough comparison of eleven popular hashing algorithms was made; random-projection-based LSH ranked first, contradicting the claims in the other ten hashing articles.
Abstract: Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. During the past decade, numerous hashing algorithms have been proposed to solve this problem. Every proposed algorithm claims to outperform Locality Sensitive Hashing (LSH), which is the most popular hashing method. However, the evaluation in these hashing articles was not thorough enough, and the claims should be re-examined. If implemented correctly, almost all the hashing methods will have their performance improved as the code length increases. However, many existing hashing articles only report the performance with code lengths shorter than 128. In this article, we carefully revisit the problem of search-with-a-hash-index and analyze the pros and cons of two popular hash index search procedures. Then we propose a simple but effective novel hash index search approach and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, the random-projection-based Locality Sensitive Hashing ranked first, which is in contradiction to the claims in all the other ten hashing articles. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the code used in the article is released on GitHub, which can be used as a testing platform for a fair comparison between various hashing algorithms.

9 citations
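Random-projection-based LSH, the method the article above ranks first, is indeed simple: the sign pattern of a few random projections becomes a bucket key, and a query probes its own bucket in the index. The sketch below is a single-table, single-probe version under assumed toy dimensions; real implementations use multiple tables and/or multi-probing.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)

dim, n_bits = 32, 12
data = rng.standard_normal((5000, dim))
planes = rng.standard_normal((dim, n_bits))

def code(x) -> int:
    """Pack the sign pattern of random projections into an integer bucket key."""
    bits = (x @ planes > 0).astype(int)
    return int(bits @ (2 ** np.arange(n_bits)))

# Build the hash index: bucket key -> list of point ids.
index = defaultdict(list)
for i, x in enumerate(data):
    index[code(x)].append(i)

def query(q):
    """Return candidate ids from the query's bucket (single-probe search)."""
    return index.get(code(q), [])
```

The search-procedure question the article studies is exactly what `query` glosses over: how many neighboring buckets to probe, and in what order, dominates the accuracy/speed trade-off at a fixed code length.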

Proceedings ArticleDOI
04 Apr 2016
TL;DR: This paper focuses on reducing the computation time of a state-of-the-art duplicate detection algorithm, and constructs uniform vector representations for the products and applies the Multi-component Similarity Method (MSM).
Abstract: The number of online shops is growing daily, and many Web shops focus on the same product types, like consumer electronics. Since Web shops use different product representations, it is hard to compare products among different Web shops. Duplicate detection methods aim to solve this problem by identifying the same products in different Web shops. In this paper, we focus on reducing the computation time of a state-of-the-art duplicate detection algorithm. First, we construct uniform vector representations for the products. We use these vectors as input for a Locality Sensitive Hashing (LSH) algorithm, which pre-selects potential duplicates. Finally, duplicate products are found by applying the Multi-component Similarity Method (MSM). Compared to the original MSM, the number of needed computations can be reduced by 95% with only a minor 9% decrease in the F1-measure.

9 citations
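The pre-selection step described above can be sketched with MinHash banding, a common LSH choice for set-valued product representations; the paper's exact scheme and vector construction may differ, and the product data, salts, and band sizes below are all illustrative. Only pairs colliding in at least one band become candidates for the expensive similarity computation (MSM itself is not shown).

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

random.seed(0)

# Toy token-set representations of products across Web shops.
products = {
    "shopA/tv1": {"samsung", "55", "uhd", "tv"},
    "shopB/tv9": {"samsung", "55", "uhd", "television"},
    "shopA/cam": {"canon", "eos", "camera"},
}

n_hashes, bands = 12, 4          # 4 bands of 3 rows each
rows = n_hashes // bands
salts = [random.getrandbits(32) for _ in range(n_hashes)]

def hval(salt: int, token: str) -> int:
    # crc32 as a cheap deterministic hash (Python's hash() is salted per run).
    return zlib.crc32(f"{salt}:{token}".encode())

def signature(tokens):
    """MinHash signature: per hash function, the minimum token hash."""
    return [min(hval(s, t) for t in tokens) for s in salts]

# Band each signature; products sharing a band bucket are candidate pairs.
buckets = defaultdict(set)
for pid, tokens in products.items():
    sig = signature(tokens)
    for b in range(bands):
        buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(pid)

candidates = set()
for ids in buckets.values():
    candidates.update(combinations(sorted(ids), 2))
```

This is where the 95% reduction in computations comes from: MSM only needs to run on `candidates`, not on all product pairs, at the cost of occasionally missing a true duplicate whose bands never collide.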

Posted Content
TL;DR: This paper proposes Secure Locality Sensitive Indexing (SLSI), built on a secure binary embedding scheme generated from a novel probabilistic transformation over a locality sensitive hashing family; it has sub-linear query time and handles honest-but-curious parties.
Abstract: In Near-Neighbor Search (NNS), a client queries a database (held by a server) for the most similar data (near-neighbors) given a certain similarity metric. The Privacy-Preserving variant (PP-NNS) requires that neither the server nor the client shall learn information about the other party's data except what can be inferred from the outcome of NNS. The overwhelming growth in the size of current datasets and the lack of a truly secure server in the online world render the existing solutions impractical, either due to their high computational requirements or non-realistic assumptions which potentially compromise privacy. PP-NNS having query time {\it sub-linear} in the size of the database has been suggested as an open research direction by Li et al. (CCSW'15). In this paper, we provide the first such algorithm, called Secure Locality Sensitive Indexing (SLSI), which has a sub-linear query time and the ability to handle honest-but-curious parties. At the heart of our proposal lies a secure binary embedding scheme generated from a novel probabilistic transformation over a locality sensitive hashing family. We provide information-theoretic bounds for the privacy guarantees and support our theoretical claims with substantial empirical evidence on real-world datasets.

9 citations


Network Information
Related Topics (5)
Deep learning
79.8K papers, 2.1M citations
84% related
Feature extraction
111.8K papers, 2.1M citations
83% related
Convolutional neural network
74.7K papers, 2M citations
83% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Support vector machine
73.6K papers, 1.7M citations
82% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    43
2022    108
2021    88
2020    110
2019    104
2018    139