Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over its lifetime, 1,894 publications have been published on this topic, receiving 69,362 citations.


Papers
Journal Article
TL;DR: Experimental results show that the fast niching versions of the multimodal algorithms exhibit performance similar to or even better than that of their original counterparts, while the execution time of the algorithms is significantly reduced.
Abstract: Niching techniques have recently been incorporated into evolutionary algorithms (EAs) for multisolution optimization in multimodal landscapes. However, existing niching techniques inevitably increase the time complexity of basic EAs because they compute the pairwise distance matrix of individuals. In this paper, we propose a fast niching technique that avoids pairwise distance calculations by introducing locality-sensitive hashing, an efficient algorithm for approximately retrieving nearest neighbors. Individuals are projected into a number of buckets by hash functions; similar individuals have a higher probability of being hashed into the same bucket than dissimilar ones. Interactions between individuals are then limited to candidates that fall in the same bucket, achieving local evolution. It is proved that the complexity of the proposed fast niching is linear in the population size. In addition, the mechanism induces stable niching behavior and inherently maintains a balance between the exploration and exploitation of multiple optima. The theoretical analysis conducted in this paper shows that the proposed technique provides bounds on the exploration and exploitation probabilities. Experimental results show that the fast niching versions of the multimodal algorithms exhibit performance similar to or even better than that of their original counterparts. More importantly, the execution time of the algorithms is significantly reduced.

37 citations
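The bucketing step described in the abstract above is easy to picture in code. Below is a minimal sketch of grouping a population into LSH buckets with p-stable (random-projection) hash functions, so that niche interactions can be restricted to within-bucket candidates; the parameters (n_hashes, w) and the NumPy implementation are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def lsh_buckets(population, n_hashes=4, w=0.5, seed=0):
    """Group individuals into buckets via p-stable (random-projection) LSH.

    Each individual x is mapped to the key (floor((a_i . x + b_i) / w))_i;
    similar individuals are likely to share a key, so niching interactions
    can be restricted to one bucket instead of the full O(N^2) pairings.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_hashes, population.shape[1]))  # projection directions
    b = rng.uniform(0, w, size=n_hashes)                  # random offsets
    keys = np.floor((population @ a.T + b) / w).astype(int)
    buckets = {}
    for idx, key in enumerate(map(tuple, keys)):
        buckets.setdefault(key, []).append(idx)
    return buckets

pop = np.random.rand(100, 2)  # 100 individuals in a 2-D search space
for key, members in lsh_buckets(pop).items():
    pass  # evolve the members of this niche locally (selection, crossover, ...)
```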

Proceedings Article
10 Apr 2018
TL;DR: This paper proposes the ACE (Arrays of (locality-sensitive) Count Estimators) algorithm, which can be 60x faster than most state-of-the-art unsupervised anomaly detection algorithms and has appealing privacy properties.
Abstract: Anomaly detection is one of the most frequent and important subroutines deployed in large-scale data processing applications. Despite being a well-studied topic, existing techniques for unsupervised anomaly detection require storing significant amounts of data, which is prohibitive from memory, latency, and privacy perspectives, especially for small mobile devices with ultra-low memory budgets and limited computational power. In this paper, we propose the ACE (Arrays of (locality-sensitive) Count Estimators) algorithm, which can be 60x faster than most state-of-the-art unsupervised anomaly detection algorithms. In addition, ACE has appealing privacy properties. Our experiments show that the ACE algorithm has a significantly smaller memory footprint (< 4 MB in our experiments), which can fit in the Level 3 cache of any modern processor. At the core of the ACE algorithm is a novel statistical estimator derived from the sampling view of locality-sensitive hashing (LSH). This view is significantly different from, and more efficient than, the widely popular view of LSH for near-neighbor search. We show the superiority of the ACE algorithm over 11 popular baselines on 3 benchmark datasets, including the KDD-Cup99 data, the largest available public benchmark, comprising more than half a million entries with ground-truth anomaly labels.

37 citations
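To make the counting view in the abstract above concrete, here is a rough sketch of arrays of count estimators indexed by signed-random-projection fingerprints; points whose buckets carry low counts across the arrays score as anomalous. The class name and parameters (n_arrays, n_bits) are hypothetical, not taken from the ACE paper.

```python
import numpy as np

class CountEstimatorArrays:
    """Arrays of counters indexed by signed-random-projection fingerprints."""

    def __init__(self, dim, n_arrays=20, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        # one set of random hyperplanes per array
        self.planes = rng.normal(size=(n_arrays, n_bits, dim))
        self.counts = np.zeros((n_arrays, 2 ** n_bits), dtype=np.int64)
        self.weights = 2 ** np.arange(n_bits)
        self.rows = np.arange(n_arrays)

    def _fingerprints(self, x):
        # sign of each projection -> n_bits binary code -> integer bucket index
        codes = (np.einsum('abd,d->ab', self.planes, x) > 0).astype(int)
        return codes @ self.weights

    def insert(self, x):
        self.counts[self.rows, self._fingerprints(x)] += 1

    def score(self, x):
        # low mean bucket count = few similar points observed = more anomalous
        return -self.counts[self.rows, self._fingerprints(x)].mean()

est = CountEstimatorArrays(dim=8)
data = np.random.randn(1000, 8)
for row in data:
    est.insert(row)
scores = [est.score(row) for row in data]  # higher score = more anomalous
```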

Book Chapter
20 Apr 2015
TL;DR: This article presents a highly scalable approach to computing the nearest neighbors of objects that focuses on preserving neighborhoods rather than distances, using an ensemble of space-filling curves, and demonstrates that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved.
Abstract: Popular outlier detection methods require the pairwise comparison of objects to compute the nearest neighbors. This inherently quadratic problem is not scalable to large data sets, making multidimensional outlier detection for big data still an open challenge. Existing approximate neighbor search methods are designed to preserve distances as well as possible. In this article, we present a highly scalable approach to compute the nearest neighbors of objects that instead focuses on preserving neighborhoods well using an ensemble of space-filling curves. We show that the method has near-linear complexity, can be distributed to clusters for computation, and preserves neighborhoods—but not distances—better than established methods such as locality sensitive hashing and projection indexed nearest neighbors. Furthermore, we demonstrate that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved, an effect that can be explained by relating our method to outlier detection ensembles. At the same time, the outlier detection process is accelerated by two orders of magnitude.

37 citations
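As a concrete illustration of the ensemble-of-space-filling-curves idea above, the sketch below builds candidate neighbor sets from windows in the sorted order of several randomly shifted Z-order (Morton) curves; the paper's actual curves, shifts, and window sizes may differ, and points are assumed to lie in the unit hypercube.

```python
import numpy as np

def morton_key(p, bits=10):
    """Interleave the bits of quantized coordinates into one Z-order key."""
    q = (p * (2 ** bits - 1)).astype(np.uint32)
    key = 0
    for b in range(bits):
        for d, c in enumerate(q):
            key |= ((int(c) >> b) & 1) << (b * len(q) + d)
    return key

def candidate_neighbors(points, n_curves=4, window=8, seed=0):
    """Union of sorted-order windows over several shifted Z-order curves."""
    rng = np.random.default_rng(seed)
    n = len(points)
    cand = [set() for _ in range(n)]
    for _ in range(n_curves):
        shift = rng.random(points.shape[1])
        keys = [morton_key((p + shift) / 2.0) for p in points]  # stay in [0, 1)
        order = np.argsort(keys)
        for pos, i in enumerate(order):
            lo, hi = max(0, pos - window), min(n, pos + window + 1)
            cand[i].update(int(order[j]) for j in range(lo, hi) if order[j] != i)
    return cand  # refine with exact distances to get approximate kNN

points = np.random.rand(500, 3)  # assumed to lie in the unit hypercube
neighbors = candidate_neighbors(points)
```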

Journal Article
TL;DR: This work considers the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the entire database, and proposes two hashing-based solutions that retrieve near points in sublinear time.
Abstract: We consider the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the entire database. For this problem, we propose two hashing-based solutions. Our first approach maps the data to 2-bit binary keys that are locality sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and the hyperplane query. Both use hashing to retrieve near points in sublinear time. Our first method's preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods' tradeoffs and show that they make it practical to perform active selection with millions of unlabeled points.

37 citations
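The first method's two-bit key can be sketched compactly. On a hedged reading of the abstract, a database point x is keyed by the signs of two independent Gaussian projections, while a hyperplane query with normal w flips the sign of one projection, so keys collide most often when x is nearly perpendicular to w, i.e. close to the hyperplane. The construction below is an assumption drawn from that reading, not a verified reproduction of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim):
    """One two-bit hash: a pair of independent Gaussian projections u, v."""
    u, v = rng.normal(size=dim), rng.normal(size=dim)
    point_key = lambda x: (np.sign(u @ x), np.sign(v @ x))     # database point
    query_key = lambda w: (np.sign(u @ w), np.sign(-(v @ w)))  # hyperplane normal
    return point_key, query_key

# Points whose keys collide with the query's key are candidates that
# (approximately) minimize distance to the hyperplane, as used in
# margin-based active selection; in practice many such hashes are combined.
point_key, query_key = make_hash(dim=5)
x, w = rng.normal(size=5), rng.normal(size=5)
collides = point_key(x) == query_key(w)  # more likely when x is near the hyperplane
```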

Journal Article
TL;DR: The experimental results show that the proposed stratified sampling based clustering algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.
Abstract: Large-scale data analysis is a challenging and relevant task for present-day research and industry. As a promising data analysis tool, clustering is becoming more important in the era of big data. In large-scale data clustering, sampling is an efficient and widely used approximation technique. Recently, several sampling-based clustering algorithms have attracted considerable attention in large-scale data analysis owing to their efficiency. However, some of these existing algorithms have low clustering accuracy, whereas others have high computational complexity. To overcome these deficiencies, a stratified sampling based clustering algorithm for large-scale data is proposed in this paper. Its basic steps are: (1) obtaining a number of representative samples, via a stratified sampling scheme, from different strata formed by a locality-sensitive hashing technique; (2) partitioning the chosen samples into different clusters using the fuzzy c-means clustering algorithm; (3) assigning the out-of-sample objects to their closest clusters via a data-labeling technique. The performance of the proposed algorithm is compared with state-of-the-art sampling-based fuzzy c-means clustering algorithms on several large-scale data sets, including synthetic and real ones. The experimental results show that the proposed algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.

37 citations
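The three-step pipeline in the abstract above maps naturally to a short script. In the sketch below, LSH buckets form the strata, a handful of points is sampled per stratum, and plain k-means stands in for fuzzy c-means to keep the example compact; all names and parameters are illustrative, not the paper's.

```python
import numpy as np

def stratified_sample(X, n_hashes=3, w=1.0, per_stratum=5, seed=0):
    """Step 1: form strata as LSH buckets, then sample a few points from each."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_hashes, X.shape[1]))
    b = rng.uniform(0, w, size=n_hashes)
    keys = np.floor((X @ a.T + b) / w).astype(int)
    strata = {}
    for i, k in enumerate(map(tuple, keys)):
        strata.setdefault(k, []).append(i)
    return np.array([i for members in strata.values()
                     for i in rng.choice(members, min(per_stratum, len(members)),
                                         replace=False)])

def kmeans(X, k, iters=50, seed=0):
    """Step 2 stand-in: cluster the sampled points (fuzzy c-means in the paper)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers

X = np.random.rand(10_000, 4)
centers = kmeans(X[stratified_sample(X)], k=3)
# Step 3: assign every out-of-sample object to its closest cluster center.
labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
```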


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations, 84% related
Feature extraction: 111.8K papers, 2.1M citations, 83% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Support vector machine: 73.6K papers, 1.7M citations, 82% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    43
2022    108
2021    88
2020    110
2019    104
2018    139