scispace - formally typeset
Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over its lifetime, 1,894 publications have been published within this topic, receiving 69,362 citations.


Papers
Journal ArticleDOI
TL;DR: The proposed algorithms run faster than standard LSH-based implementations while keeping prediction accuracy within reasonable limits, and they have a large positive impact on aggregate diversity, which has recently become an important evaluation measure for recommender algorithms.
Abstract: Neighborhood-based collaborative filtering (CF) methods are widely used in recommender systems because they are easy to implement and highly effective. One of the significant challenges of these methods is scaling with the increasing amount of data, since finding nearest neighbors requires a search over all of the data. Approximate nearest neighbor (ANN) methods eliminate this exhaustive search by only looking at the data points that are likely to be similar. Locality sensitive hashing (LSH) is a well-known technique for ANN search in high dimensional spaces, and it is also effective in solving the scalability problem of neighborhood-based CF. In this study, we provide novel improvements to current LSH-based recommender algorithms and make a systematic evaluation of LSH in neighborhood-based CF. In addition, we conduct extensive experiments on real-life datasets to investigate various parameters of LSH and their effects on multiple metrics used to evaluate recommender systems. Our proposed algorithms have better running time than standard LSH-based implementations while keeping prediction accuracy within reasonable limits. The proposed algorithms also have a large positive impact on aggregate diversity, which has recently become an important evaluation measure for recommender algorithms.

14 citations
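The paper's specific improvements are not reproduced here, but the core idea it builds on, using LSH to avoid an exhaustive neighbor search in CF, can be sketched as follows. The function name and the toy rating matrix are illustrative, not taken from the paper; signed random projections (an LSH family for cosine similarity) stand in for whatever hash family the authors use.

```python
import numpy as np

def lsh_buckets(vectors, n_bits=8, seed=0):
    """Hash each vector to an n_bits signature via signed random
    projections; vectors with identical signatures share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    signs = vectors @ planes > 0
    keys = signs @ (1 << np.arange(n_bits))   # pack bits into an int key
    buckets = {}
    for idx, key in enumerate(keys):
        buckets.setdefault(int(key), []).append(idx)
    return buckets

# toy user-item rating matrix (6 users, 4 items); the neighbor search
# for a user is then restricted to the users sharing that user's bucket
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
], dtype=float)
buckets = lsh_buckets(ratings, n_bits=4)
```

Because similar users tend to collide, each bucket is a small candidate set, which is what removes the all-pairs scan from neighborhood-based CF.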

Journal ArticleDOI
15 Aug 2018-Symmetry
TL;DR: A clustering approach built on the Apache Spark framework shows superior precision and noise robustness compared with recent research, and comparison with similar approaches shows the proposed method's advantages in scalability, high performance, and low computation cost.
Abstract: Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted a lot of research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is built on the Apache Spark framework and tested on several prevalent datasets. In the first step of the algorithm, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations; each step of the algorithm is completely independent of the others, and no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the approach. Density is defined on the basis of an Ordered Weighted Averaging (OWA) distance, which makes clusters more homogeneous. According to the density of each node, local density peaks are detected adaptively; by merging the local peaks, final cluster centers are obtained, and the remaining data points are assigned to the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published work. Cluster validity indexes show its superior precision and noise robustness relative to recent research, and comparison with similar approaches shows advantages in scalability, performance, and computational cost. The method is a general clustering approach; gene expression clustering is used as a sample application.

14 citations
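As a rough illustration of the partition-then-cluster idea in that abstract (substituting plain random-projection bucketing for the paper's Bayesian LSH, and Euclidean distance for its OWA distance), each partition can find its local density peak independently, which is what makes the step embarrassingly parallel in a Spark setting:

```python
import numpy as np

def density_peaks_per_partition(points, n_bits=2, radius=1.0, seed=0):
    """Partition points with random-projection LSH, then within each
    partition pick the local density peak: the point with the most
    neighbors within `radius`. Partitions never interact, so each one
    can be processed on a separate worker."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((points.shape[1], n_bits))
    keys = (points @ planes > 0) @ (1 << np.arange(n_bits))
    peaks = []
    for key in np.unique(keys):
        part = points[keys == key]
        d = np.linalg.norm(part[:, None] - part[None, :], axis=-1)
        density = (d < radius).sum(axis=1)      # neighbor count per point
        peaks.append(part[int(np.argmax(density))])
    return np.array(peaks)

# two well-separated toy blobs; each should land in its own partition
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal([10, 0], 0.3, (50, 2)),
                 rng.normal([-10, 0], 0.3, (50, 2))])
peaks = density_peaks_per_partition(pts)
```

The peaks returned here play the role of the local density peaks that the paper merges into final cluster centers.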

Proceedings Article
01 Jan 2016
TL;DR: The authors show that solving the sparsified facility location problem is almost optimal under the standard assumptions of the problem, and they analyze a different method of sparsification that is a better model for methods such as Locality Sensitive Hashing to accelerate the nearest neighbor computations.
Abstract: The facility location problem is widely used for summarizing large datasets and has additional applications in sensor placement, image retrieval, and clustering. One difficulty of this problem is that submodular optimization algorithms require the calculation of pairwise benefits for all items in the dataset. This is infeasible for large problems, so recent work proposed to only calculate nearest neighbor benefits. One limitation is that several strong assumptions were invoked to obtain provable approximation guarantees. In this paper we establish that these extra assumptions are not necessary: solving the sparsified problem will be almost optimal under the standard assumptions of the problem. We then analyze a different method of sparsification that is a better model for methods such as Locality Sensitive Hashing to accelerate the nearest neighbor computations, and we extend the use of the problem to a broader family of similarities. We validate our approach by demonstrating that it rapidly generates interpretable summaries.

14 citations
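A minimal sketch of the sparsified problem the paper studies: keep only each item's k largest pairwise benefits (as a nearest-neighbor or LSH-based sparsification would), then run the standard greedy algorithm for facility location on the sparse matrix. The function name and the toy similarity matrix are illustrative, not the paper's.

```python
import numpy as np

def sparsified_facility_location(sim, k, budget):
    """Greedy facility-location summarization on a k-sparsified benefit
    matrix: maximize sum_i max_{j in S} sim[i, j] over |S| = budget."""
    n = sim.shape[0]
    sparse = np.zeros_like(sim)
    for i in range(n):
        top = np.argpartition(sim[i], -k)[-k:]   # keep k largest benefits
        sparse[i, top] = sim[i, top]
    best = np.zeros(n)            # benefit each item currently receives
    chosen = []
    for _ in range(budget):
        # marginal gain of adding each candidate facility j
        gains = np.maximum(sparse, best[:, None]).sum(axis=0) - best.sum()
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, sparse[:, j])
    return chosen

# two obvious clusters {0,1} and {2,3}; a budget of 2 exemplars should
# pick one representative from each
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
chosen = sparsified_facility_location(sim, k=2, budget=2)
```

The paper's result is that this sparsified greedy remains almost optimal without the extra assumptions of prior work; the sketch only shows the objective being optimized, not the analysis.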

Proceedings ArticleDOI
26 Jul 2011
TL;DR: Frequency Based Locality Sensitive Hashing can reduce the extra space cost by several orders of magnitude with less (or similar) time cost while achieving better search quality compared with LSH based on p-stable distributions.
Abstract: Nearest Neighbor (NN) search is of major importance to many applications, such as information retrieval and data mining. However, finding the NN in high dimensional space has been proved to be time-consuming. In recent years, Locality Sensitive Hashing (LSH) has been proposed to solve the Approximate Nearest Neighbor (ANN) problem. The main drawback of LSH is that it requires a great deal of memory to achieve good performance, which makes it poorly suited to today's massive-data applications. We analyze the generic LSH scheme as well as the properties of LSH hash functions based on p-stable distributions, and we propose a new LSH scheme called Frequency Based Locality Sensitive Hashing (FBLSH). FBLSH uses just one p-stable hash function per hash table and sets a frequency threshold m: only points that collide with the query point more than m times can be candidate ANNs. FBLSH is easy to implement, and our experiments show that it reduces the extra space cost by several orders of magnitude at lower (or similar) time cost while achieving better search quality than LSH based on p-stable distributions.

14 citations
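Based only on the description in the abstract, FBLSH might be sketched as follows: each of L tables keeps a single p-stable hash function (Gaussian projections are 2-stable), and a point becomes a candidate only if it collides with the query in more than m tables. Parameter names and default values here are illustrative guesses, not the authors' settings.

```python
import numpy as np
from collections import Counter

class FBLSH:
    """Sketch of frequency-based LSH: h(v) = floor((a.v + b) / w) with
    Gaussian a (2-stable); candidates must collide in more than m of
    the L tables."""
    def __init__(self, dim, L=20, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((L, dim))
        self.b = rng.uniform(0, w, size=L)
        self.w = w
        self.tables = [dict() for _ in range(L)]

    def _hashes(self, v):
        return np.floor((self.a @ v + self.b) / self.w).astype(int)

    def index(self, points):
        for i, p in enumerate(np.asarray(points, dtype=float)):
            for t, h in zip(self.tables, self._hashes(p)):
                t.setdefault(int(h), []).append(i)

    def candidates(self, q, m):
        counts = Counter()   # collision count per indexed point
        for t, h in zip(self.tables, self._hashes(np.asarray(q, float))):
            counts.update(t.get(int(h), []))
        return [i for i, c in counts.items() if c > m]

fb = FBLSH(dim=2, L=20, w=4.0, seed=0)
fb.index([[0.0, 0.0], [100.0, 100.0]])
cands = fb.candidates([0.1, 0.0], m=5)   # nearby point should survive
```

The space saving in the abstract comes from storing one hash value per table instead of a concatenation of many; the frequency threshold m replaces the filtering that concatenated hashes normally provide.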

Proceedings Article
19 Jun 2011
TL;DR: A novel mechanism called Reservoir Counting is described for application in online Locality Sensitive Hashing; it yields significant savings in the streaming setting, allowing either a larger number of signatures to be maintained or an increased level of approximation accuracy at a similar memory footprint.
Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. The technique yields significant savings in the streaming setting: at a similar memory footprint, it allows either maintaining a larger number of signatures or achieving a higher level of approximation accuracy.

14 citations
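The abstract gives no implementation details, but the general flavor of replacing an exact counter with a fixed-size reservoir sample can be illustrated as follows. This is a simplified analogue, not the paper's algorithm: it estimates the sign of a stream of +1/-1 updates, as needed for each bit of an online random-projection signature, from a small reservoir instead of an exact running count.

```python
import random

class ReservoirSign:
    """Illustrative sketch: keep a size-k reservoir sample of a +1/-1
    stream (Algorithm R) and report the majority sign of the sample,
    so memory stays O(k) regardless of stream length."""
    def __init__(self, k=15, seed=0):
        self.k = k
        self.seen = 0
        self.reservoir = []
        self.rng = random.Random(seed)

    def update(self, x):                 # x is +1 or -1
        self.seen += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(x)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.k:               # replace with prob k / seen
                self.reservoir[j] = x

    def sign(self):                      # majority vote of the sample
        return 1 if sum(self.reservoir) >= 0 else -1

rs = ReservoirSign(k=15, seed=0)
for _ in range(900):
    rs.update(+1)
for _ in range(100):
    rs.update(-1)
```

A strongly positive stream keeps a positive majority in the reservoir with high probability, so the estimated signature bit matches the exact one while using constant memory.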


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations (84% related)
Feature extraction: 111.8K papers, 2.1M citations (83% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Feature (computer vision): 128.2K papers, 1.7M citations (82% related)
Support vector machine: 73.6K papers, 1.7M citations (82% related)
Performance
Metrics: number of papers in the topic in previous years

Year	Papers
2023	43
2022	108
2021	88
2020	110
2019	104
2018	139