Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over its lifetime, 1,894 publications have been published on this topic, receiving 69,362 citations.


Papers
Journal ArticleDOI
TL;DR: In this article, the authors classify deep supervised hashing methods into pairwise, ranking-based, pointwise, and quantization methods according to how the similarities of the learned hash codes are measured.
Abstract: Nearest neighbor search aims at obtaining the samples in the database with the smallest distances to the queries, which is a basic task in a range of fields, including computer vision and data mining. Hashing is one of the most widely used methods for its computational and storage efficiency. With the development of deep learning, deep hashing methods show more advantages than traditional methods. In this survey, we investigate current deep hashing algorithms in detail, including deep supervised hashing and deep unsupervised hashing. Specifically, we categorize deep supervised hashing methods into pairwise methods, ranking-based methods, pointwise methods, and quantization, according to how the similarities of the learned hash codes are measured. Moreover, deep unsupervised hashing is categorized into similarity reconstruction-based methods, pseudo-label-based methods, and prediction-free self-supervised learning-based methods based on their semantic learning manners. We also introduce three related important topics: semi-supervised deep hashing, domain adaptation deep hashing, and multi-modal deep hashing. Meanwhile, we present some commonly used public datasets and the schemes used to measure the performance of deep hashing algorithms. Finally, we discuss some potential research directions.

9 citations
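
The survey above concerns learned hash codes, but the mechanics of code-based nearest neighbor search are easiest to see with the classical, non-learned baseline: random-hyperplane LSH, which produces binary codes whose Hamming distance approximates angular similarity. A minimal sketch; the dimensions, code length, and function names are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_hyperplanes(dim, n_bits):
    """Sample random hyperplanes; each one contributes one bit of the code."""
    return rng.standard_normal((n_bits, dim))

def hash_codes(X, planes):
    """Sign of the projection onto each hyperplane -> one binary code per row."""
    return (X @ planes.T > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes, k=5):
    """Rank database items by Hamming distance to the query code."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists)[:k]

# Toy usage: 1000 database vectors, 64-bit codes.
X = rng.standard_normal((1000, 128))
planes = fit_hyperplanes(128, 64)
db_codes = hash_codes(X, planes)
q_code = hash_codes(X[:1], planes)[0]
print(hamming_rank(q_code, db_codes))  # nearest items by code distance
```

Deep hashing methods replace the random projections with a learned network, but the storage and ranking machinery stays the same.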

Proceedings Article
01 Dec 2015
TL;DR: In this article, the authors propose data-dependent dispatching that takes advantage of the structure of similar data points to improve the performance of distributed machine learning, and demonstrate that their technique scales strongly with the available computing power.
Abstract: In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, that classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data-dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving that existing scalable heuristics perform well in natural non-worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality-sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real-world image and advertising datasets. We also demonstrate that our technique scales strongly with the available computing power.

9 citations
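
The LSH-based partitioning this paper compares against can be sketched in a few lines: hash each point with sign random projections and map each bucket to a machine, so similar points tend to co-locate. This is only the baseline idea, not the paper's dispatching algorithm, and all parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def lsh_dispatch(X, n_machines, n_bits=8):
    """Dispatch rows of X to machines by bucketing sign-random-projection codes.

    Similar points share codes with high probability, so they tend to land
    on the same machine. Note the per-machine load can be unbalanced, which
    is one of the issues the paper's method addresses directly.
    """
    planes = rng.standard_normal((n_bits, X.shape[1]))
    codes = (X @ planes.T > 0) @ (1 << np.arange(n_bits))  # pack bits into an int
    return codes % n_machines  # bucket id -> machine id

X = rng.standard_normal((10_000, 32))
assignment = lsh_dispatch(X, n_machines=8)
print(np.bincount(assignment, minlength=8))  # per-machine load
```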

Journal ArticleDOI
TL;DR: A non-expansive hashing scheme in which similar inputs are stored in nearby memory locations, any set from a large universe may be stored compactly, and retrieval is efficient.
Abstract: In a non-expansive hashing scheme, similar inputs are stored in memory locations which are close. We develop a non-expansive hashing scheme wherein any set drawn from a large universe may be stored compactly and retrieved efficiently. We explain how to use non-expansive hashing schemes for efficient storage and retrieval of noisy data. A dynamic version of this hashing scheme is presented as well.

9 citations
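
The defining property here, non-expansiveness (similar inputs land in nearby memory locations), can be illustrated with a toy table that addresses items by a quantized random projection and probes adjacent slots at retrieval. This sketch only demonstrates the property; the paper's actual construction and its size and retrieval bounds are more involved:

```python
import numpy as np

rng = np.random.default_rng(2)

class LocalityPreservingTable:
    """Toy table in which similar keys land in nearby slots."""

    def __init__(self, dim, n_slots, cell_width=0.5):
        self.a = rng.standard_normal(dim)  # random projection direction
        self.n_slots = n_slots
        self.w = cell_width
        self.slots = [[] for _ in range(n_slots)]

    def _addr(self, x):
        # Quantized 1-D projection: close inputs differ by few slots.
        return int(x @ self.a / self.w) % self.n_slots

    def store(self, x, value):
        self.slots[self._addr(x)].append((x, value))

    def retrieve(self, x, probe=1):
        """Look up x, also probing neighboring slots to tolerate noise."""
        addr = self._addr(x)
        hits = []
        for d in range(-probe, probe + 1):
            hits.extend(self.slots[(addr + d) % self.n_slots])
        return hits

table = LocalityPreservingTable(dim=16, n_slots=97)
key = rng.standard_normal(16)
table.store(key, "payload")
noisy = key + 0.01 * rng.standard_normal(16)  # perturbed copy of the key
print(table.retrieve(noisy))  # usually recovers the stored item
```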

01 Jan 2008
TL;DR: SpotSigs, presented in this paper, is a new algorithm for extracting and matching signatures for near-duplicate detection in large Web crawls, designed to favor natural-language portions of Web pages over advertisements and navigational bars.
Abstract: Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near-duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match, and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.

9 citations
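
A simplified reading of the signature scheme described above: each stopword acts as an anchor that contributes the short chain of content words following it, and documents are compared by Jaccard similarity over these signature sets. The stopword list and chain length below are illustrative choices, not the paper's tuned values:

```python
# Minimal spot-signature sketch (assumed simplification of SpotSigs).
STOPWORDS = {"the", "a", "an", "is", "was", "there", "of", "to", "and"}

def spot_signatures(text, chain_len=2):
    """Each stopword anchor + the chain of content words that follows it."""
    tokens = text.lower().split()
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok in STOPWORDS:
            chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                sigs.add((tok, *chain))
    return sigs

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "The quick brown fox jumps over the lazy dog near a quiet river"
doc2 = "The quick brown fox leaped over the lazy dog near a quiet river"
print(jaccard(spot_signatures(doc1), spot_signatures(doc2)))
```

Because signatures are anchored at stopwords, which are frequent in natural language but rare in boilerplate like navigation bars, the comparison naturally concentrates on the article text.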

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This paper proposes a set-based summarization method that aggregates sets of similar nodes in each iteration, thus providing scalability, and presents scalable solutions for lossless summarization of both attributed and non-attributed graphs.
Abstract: Graph summarization is a valuable approach for in-memory processing of a big graph. A summary graph is compact, yet it maintains the overall characteristics of the underlying graph, making it suitable for querying and visualization. To summarize a big graph, the idea is to compress the similar nodes in dense regions of the graph. The existing approaches find these similar nodes either by node ordering or by pair-wise similarity computations. The former approaches are scalable but cannot simultaneously consider attribute and neighborhood similarity among the nodes. In contrast, the pair-wise summarization methods can consider both similarity aspects but are impractical for a big graph. In this paper, we propose a set-based summarization method that aggregates sets of similar nodes in each iteration, which provides scalability. To find each set, we approximate the candidate similar nodes without node ordering or explicit similarity computations by using Locality Sensitive Hashing (LSH). In conjunction with an information-theoretic approach, we present scalable solutions for lossless summarization of both attributed and non-attributed graphs.

9 citations
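
One plausible way to approximate candidate similar nodes with LSH, as the abstract describes, is MinHash over each node's neighbor set: nodes with highly overlapping neighborhoods collide in the same bucket without any pair-wise comparison. The abstract does not specify the exact LSH configuration, so the sketch below is an assumed MinHash variant:

```python
import random

random.seed(3)

# MinHash signatures over neighbor sets: nodes whose adjacency lists have
# high Jaccard similarity tend to produce the same signature and thus land
# in the same bucket, avoiding explicit pair-wise similarity computations.
N_HASHES = 16
MASKS = [random.getrandbits(32) for _ in range(N_HASHES)]

def minhash(neighbors):
    """One min value per hash function over the node's neighbor set."""
    return tuple(min(hash(v) ^ m for v in neighbors) for m in MASKS)

def candidate_groups(adj):
    """Bucket nodes by signature; nodes sharing a bucket are merge candidates."""
    buckets = {}
    for node, neighbors in adj.items():
        if neighbors:
            buckets.setdefault(minhash(neighbors), []).append(node)
    return [group for group in buckets.values() if len(group) > 1]

adj = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},  # same neighborhood as "a" -> same bucket
    "c": {"x", "q"},
}
print(candidate_groups(adj))  # [['a', 'b']]
```

Using the full signature as the bucket key trades recall for precision; a banded scheme, as in standard MinHash LSH, would loosen that trade-off.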


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations (84% related)
Feature extraction: 111.8K papers, 2.1M citations (83% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Feature (computer vision): 128.2K papers, 1.7M citations (82% related)
Support vector machine: 73.6K papers, 1.7M citations (82% related)
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  43
2022  108
2021  88
2020  110
2019  104
2018  139