Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over its lifetime, 1,894 publications on this topic have received 69,362 citations.


Papers
Proceedings ArticleDOI
13 Oct 2015
TL;DR: A deep self-taught hashing algorithm (DSTH) generates a set of pseudo labels by analyzing the data itself, then learns hash functions for novel data using discriminative deep models, and generalizes to support both supervised and unsupervised cases by adaptively incorporating label information.
Abstract: Hashing algorithms have been widely used to speed up image retrieval due to their compact binary codes and fast distance calculation. Combining hashing with deep learning boosts performance by learning accurate representations and complicated hashing functions. So far, the most striking successes in deep hashing have mostly involved discriminative models, which require labels. To apply deep hashing to datasets without labels, we propose a deep self-taught hashing algorithm (DSTH), which generates a set of pseudo labels by analyzing the data itself and then learns the hash functions for novel data using discriminative deep models. Furthermore, we generalize DSTH to support both supervised and unsupervised cases by adaptively incorporating label information. We use two different deep learning frameworks to train the hash functions to deal with the out-of-sample problem and reduce time complexity without loss of accuracy. We have conducted extensive experiments to investigate different settings of DSTH and compared it with state-of-the-art counterparts on six publicly available datasets. The experimental results show that DSTH outperforms the others on all datasets.

55 citations
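The self-taught recipe above lends itself to a compact illustration. Below is a minimal sketch (not the authors' implementation) in which the pseudo binary labels come from signs of PCA projections of the data itself, and per-bit logistic regressions stand in for the discriminative deep models so novel points can still be hashed out-of-sample; all data, dimensions, and bit counts are hypothetical.

```python
# A minimal sketch of the self-taught idea: pseudo binary labels are derived
# from the data itself (here, signs of PCA projections), and a discriminative
# model is then trained to predict those bits so that novel (out-of-sample)
# points can be hashed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 64))   # hypothetical feature vectors
X_query = rng.normal(size=(5, 64))

n_bits = 16
# Step 1: self-taught pseudo labels -- signs of the top PCA projections.
pseudo_bits = (PCA(n_components=n_bits).fit_transform(X_train) > 0).astype(int)

# Step 2: learn one discriminative hash function per bit.
hash_fns = [LogisticRegression(max_iter=1000).fit(X_train, pseudo_bits[:, b])
            for b in range(n_bits)]

def hash_codes(X):
    """Binary codes for novel data via the learned hash functions."""
    return np.stack([f.predict(X) for f in hash_fns], axis=1)

print(hash_codes(X_query))
```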

Book ChapterDOI
24 Sep 2012
TL;DR: A very simple and effective strategy for sub-linear time near neighbor search is developed by creating hash tables directly from the bits generated by b-bit minwise hashing.
Abstract: Numerous applications in search, databases, machine learning, and computer vision, can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.

54 citations
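As a concrete illustration of the strategy, here is a minimal sketch (not the paper's code) that builds a hash table directly from the bits of minwise hashes: each binary data point is treated as a set of indices, k minwise hashes are computed, only the lowest b bits of each are kept, and their concatenation indexes a bucket. The hash family and toy data are assumptions.

```python
# Building a hash table from b-bit minwise hashes: concatenating the lowest
# b bits of k minwise hashes yields a bucket key, so lookups touch only a
# small candidate set instead of the whole database.
import random
from collections import defaultdict

random.seed(0)
B, K = 2, 8                      # b bits kept per hash, k hashes per key
PRIME = (1 << 61) - 1
params = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(K)]

def bucket_key(feature_set):
    key = 0
    for a, c in params:
        m = min((a * x + c) % PRIME for x in feature_set)   # minwise hash
        key = (key << B) | (m & ((1 << B) - 1))             # keep lowest b bits
    return key

table = defaultdict(list)
data = [{1, 5, 9, 42}, {1, 5, 9, 40}, {7, 8, 100}]   # toy binary data as sets
for i, s in enumerate(data):
    table[bucket_key(s)].append(i)

query = {1, 5, 9, 42}
print(table[bucket_key(query)])   # candidate near neighbors of the query
```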

Journal ArticleDOI
TL;DR: The paper provides an assessment methodology and a sample implementation called FRASH, a framework to test similarity hashing algorithms, and applies it to the well-known similarity hashing approaches ssdeep and sdhash to show their strengths and weaknesses.

54 citations
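A FRASH-style property test is easy to sketch. The example below checks fragment detection, i.e., whether a similarity digest still matches when only part of the original input survives; it assumes the python-ssdeep bindings (and the underlying libfuzzy) are installed, and is only an illustration of the kind of test FRASH automates.

```python
# Fragment-detection test in the spirit of FRASH: a good similarity hash
# should report a non-trivial match between a file and a fragment of it.
import os
import ssdeep  # assumption: python-ssdeep bindings are available

original = os.urandom(64 * 1024)           # a 64 KiB random test input
fragment = original[: len(original) // 2]  # keep only the first half

h_full = ssdeep.hash(original)
h_frag = ssdeep.hash(fragment)

# compare() returns a 0-100 match score; FRASH aggregates such scores over
# many inputs and fragment sizes to rate an algorithm's strengths.
print(ssdeep.compare(h_full, h_frag))
```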

Proceedings Article
30 Apr 2020
TL;DR: A new framework for building space partitions is developed, reducing the problem to balanced graph partitioning followed by supervised classification; the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.
Abstract: Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces (Andoni et al. 2018b,c), we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner (Sanders and Schulz 2013) and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS (Aumuller et al. 2017), our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.

53 citations
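The two-stage recipe is simple to sketch. In the snippet below, KMeans stands in for the balanced KaHIP graph partitioning and a small scikit-learn MLP stands in for the neural network, so this is a hedged illustration of the framework rather than the authors' pipeline; the data, sizes, and multi-probe depth are all hypothetical.

```python
# Neural LSH-style recipe: (1) partition the indexed points, (2) train a
# classifier on the partition labels, (3) route a query to its few most
# probable bins and scan only those candidates.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))           # indexed points

# Stage 1: unsupervised space partition (KMeans as a stand-in for KaHIP).
parts = KMeans(n_clusters=16, n_init=5, random_state=0).fit_predict(X)

# Stage 2: supervised classification of the partition labels.
router = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                       random_state=0).fit(X, parts)

query = rng.normal(size=(1, 32))
probs = router.predict_proba(query)[0]
# Probe the most probable bins (multi-probe), then scan only those points.
top_bins = np.argsort(probs)[::-1][:3]
candidates = np.flatnonzero(np.isin(parts, top_bins))
print(len(candidates), "candidates out of", len(X))
```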

Journal ArticleDOI
TL;DR: This article shows how to avoid one-against-all comparisons of a query spectrum against the very large number of peptides generated by in silico digestion of protein sequences in a database, and that the method can also be used effectively for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately.
Abstract: Motivation: Due to recent advances in mass spectrometry technology, there has been an exponential increase in the amount of data generated in the past few years. Database searches have not been able to keep up with this data explosion. Thus, speeding up data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database to filter potential candidates, followed by detailed scoring and ranking of the filtered candidates. Results: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process the peptides in the database such that, for each query spectrum, we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum, and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain a 111-fold speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately. Contact: tingchen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online.

53 citations
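Reading the construction as sign random projections (SimHash) over unit-normalized spectra, a minimal sketch looks as follows; the binned spectrum dimension, bit count, and toy database are hypothetical, and this is not the paper's implementation.

```python
# Random-spectrum LSH as sign random projections: each hash bit is the sign
# of the cosine between a unit-normalized spectrum and a random "spectrum"
# vector, so similar spectra tend to land in the same bucket.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_BITS = 200, 12                 # binned m/z dimension (hypothetical)

def embed(spectrum):
    """L2-normalize a binned spectrum into a unit vector."""
    v = np.asarray(spectrum, dtype=float)
    return v / np.linalg.norm(v)

random_spectra = rng.normal(size=(N_BITS, DIM))   # one random vector per bit

def lsh_key(spectrum):
    # Sign of the cosine with each random spectrum gives one hash bit.
    bits = random_spectra @ embed(spectrum) > 0
    return bits.astype(int).tobytes()

# Only peptides landing in the query's bucket are scored in detail,
# avoiding the one-against-all comparison.
db = rng.random(size=(1000, DIM))
buckets = {}
for i, s in enumerate(db):
    buckets.setdefault(lsh_key(s), []).append(i)

query = db[0] + 0.01 * rng.normal(size=DIM)       # a noisy copy of entry 0
print(buckets.get(lsh_key(query), []))
```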


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations, 84% related
Feature extraction: 111.8K papers, 2.1M citations, 83% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Support vector machine: 73.6K papers, 1.7M citations, 82% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    43
2022    108
2021    88
2020    110
2019    104
2018    139