Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over its lifetime, 1,894 publications have been published on this topic, receiving 69,362 citations.


Papers
Journal Article
TL;DR: Experimental results show that the fast niching versions of the multimodal algorithms exhibit performance similar to or even better than that of their original counterparts, while the execution time of the algorithms is significantly reduced.
Abstract: Niching techniques have recently been incorporated into evolutionary algorithms (EAs) for multisolution optimization in multimodal landscapes. However, existing niching techniques inevitably increase the time complexity of basic EAs because they compute the pairwise distance matrix of individuals. In this paper, we propose a fast niching technique that avoids pairwise distance calculations by introducing locality-sensitive hashing, an efficient algorithm for approximately retrieving nearest neighbors. Individuals are projected into a number of buckets by hash functions; similar individuals have a higher probability of being hashed into the same bucket than dissimilar ones. Interactions between individuals are then limited to candidates that fall in the same bucket, achieving local evolution. It is proved that the complexity of the proposed fast niching is linear in the population size. In addition, the mechanism induces stable niching behavior and inherently maintains a balance between the exploration and exploitation of multiple optima. The theoretical analysis conducted in this paper shows that the proposed technique provides bounds on the exploration and exploitation probabilities. Experimental results show that the fast niching versions of the multimodal algorithms exhibit performance similar to or even better than that of their original counterparts. More importantly, the execution time of the algorithms is significantly reduced.

37 citations
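The bucketing step described in the abstract above is easy to picture in code. Below is a minimal sketch of grouping a population into LSH buckets with p-stable (random-projection) hash functions, so that niche interactions can be restricted to within-bucket candidates; the parameters (n_hashes, w) and the NumPy implementation are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def lsh_buckets(population, n_hashes=4, w=0.5, seed=0):
    """Group individuals into buckets via p-stable (random-projection) LSH.

    Each individual x is mapped to the key (floor((a_i . x + b_i) / w))_i;
    similar individuals are likely to share a key, so niching interactions
    can be restricted to one bucket instead of the full O(N^2) pairings.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_hashes, population.shape[1]))  # projection directions
    b = rng.uniform(0, w, size=n_hashes)                  # random offsets
    keys = np.floor((population @ a.T + b) / w).astype(int)
    buckets = {}
    for idx, key in enumerate(map(tuple, keys)):
        buckets.setdefault(key, []).append(idx)
    return buckets

pop = np.random.rand(100, 2)  # 100 individuals in a 2-D search space
for key, members in lsh_buckets(pop).items():
    pass  # evolve the members of this niche locally (selection, crossover, ...)
```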

Proceedings Article
10 Apr 2018
TL;DR: This paper proposes the ACE (Arrays of (locality-sensitive) Count Estimators) algorithm, which can be 60x faster than most state-of-the-art unsupervised anomaly detection algorithms and has appealing privacy properties.
Abstract: Anomaly detection is one of the most frequent and important subroutines deployed in large-scale data processing applications. Despite being a well-studied topic, existing techniques for unsupervised anomaly detection require storing significant amounts of data, which is prohibitive from memory, latency, and privacy perspectives, especially for small mobile devices with ultra-low memory budgets and limited computational power. In this paper, we propose the ACE (Arrays of (locality-sensitive) Count Estimators) algorithm, which can be 60x faster than most state-of-the-art unsupervised anomaly detection algorithms. In addition, ACE has appealing privacy properties. Our experiments show that the ACE algorithm has a significantly smaller memory footprint (< 4 MB in our experiments), which can fit in the Level 3 cache of any modern processor. At the core of the ACE algorithm is a novel statistical estimator derived from the sampling view of locality-sensitive hashing (LSH). This view is significantly different from, and more efficient than, the widely popular view of LSH for near-neighbor search. We show the superiority of the ACE algorithm over 11 popular baselines on 3 benchmark datasets, including the KDD-Cup99 data, the largest available public benchmark, comprising more than half a million entries with ground-truth anomaly labels.

37 citations
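To make the counting view in the abstract above concrete, here is a rough sketch of arrays of count estimators indexed by signed-random-projection fingerprints; points whose buckets carry low counts across the arrays score as anomalous. The class name and parameters (n_arrays, n_bits) are hypothetical, not taken from the ACE paper.

```python
import numpy as np

class CountEstimatorArrays:
    """Arrays of counters indexed by signed-random-projection fingerprints."""

    def __init__(self, dim, n_arrays=20, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        # one set of random hyperplanes per array
        self.planes = rng.normal(size=(n_arrays, n_bits, dim))
        self.counts = np.zeros((n_arrays, 2 ** n_bits), dtype=np.int64)
        self.weights = 2 ** np.arange(n_bits)
        self.rows = np.arange(n_arrays)

    def _fingerprints(self, x):
        # sign of each projection -> n_bits binary code -> integer bucket index
        codes = (np.einsum('abd,d->ab', self.planes, x) > 0).astype(int)
        return codes @ self.weights

    def insert(self, x):
        self.counts[self.rows, self._fingerprints(x)] += 1

    def score(self, x):
        # low mean bucket count = few similar points observed = more anomalous
        return -self.counts[self.rows, self._fingerprints(x)].mean()

est = CountEstimatorArrays(dim=8)
data = np.random.randn(1000, 8)
for row in data:
    est.insert(row)
scores = [est.score(row) for row in data]  # higher score = more anomalous
```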

Book Chapter
20 Apr 2015
TL;DR: This article presents a highly scalable approach to computing the nearest neighbors of objects that focuses on preserving neighborhoods rather than distances, using an ensemble of space-filling curves, and demonstrates that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved.
Abstract: Popular outlier detection methods require the pairwise comparison of objects to compute the nearest neighbors. This inherently quadratic problem is not scalable to large data sets, making multidimensional outlier detection for big data still an open challenge. Existing approximate neighbor search methods are designed to preserve distances as well as possible. In this article, we present a highly scalable approach to compute the nearest neighbors of objects that instead focuses on preserving neighborhoods well using an ensemble of space-filling curves. We show that the method has near-linear complexity, can be distributed to clusters for computation, and preserves neighborhoods—but not distances—better than established methods such as locality sensitive hashing and projection indexed nearest neighbors. Furthermore, we demonstrate that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved, an effect that can be explained by relating our method to outlier detection ensembles. At the same time, the outlier detection process is accelerated by two orders of magnitude.

37 citations
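As a concrete illustration of the ensemble-of-space-filling-curves idea above, the sketch below builds candidate neighbor sets from windows in the sorted order of several randomly shifted Z-order (Morton) curves; the paper's actual curves, shifts, and window sizes may differ, and points are assumed to lie in the unit hypercube.

```python
import numpy as np

def morton_key(p, bits=10):
    """Interleave the bits of quantized coordinates into one Z-order key."""
    q = (p * (2 ** bits - 1)).astype(np.uint32)
    key = 0
    for b in range(bits):
        for d, c in enumerate(q):
            key |= ((int(c) >> b) & 1) << (b * len(q) + d)
    return key

def candidate_neighbors(points, n_curves=4, window=8, seed=0):
    """Union of sorted-order windows over several shifted Z-order curves."""
    rng = np.random.default_rng(seed)
    n = len(points)
    cand = [set() for _ in range(n)]
    for _ in range(n_curves):
        shift = rng.random(points.shape[1])
        keys = [morton_key((p + shift) / 2.0) for p in points]  # stay in [0, 1)
        order = np.argsort(keys)
        for pos, i in enumerate(order):
            lo, hi = max(0, pos - window), min(n, pos + window + 1)
            cand[i].update(int(order[j]) for j in range(lo, hi) if order[j] != i)
    return cand  # refine with exact distances to get approximate kNN

points = np.random.rand(500, 3)  # assumed to lie in the unit hypercube
neighbors = candidate_neighbors(points)
```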

Journal Article
TL;DR: This work considers the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the entire database, and proposes two hashing-based solutions that retrieve near points in sublinear time.
Abstract: We consider the problem of retrieving the database points nearest to a given hyperplane query without exhaustively scanning the entire database. For this problem, we propose two hashing-based solutions. Our first approach maps the data to 2-bit binary keys that are locality sensitive for the angle between the hyperplane normal and a database point. Our second approach embeds the data into a vector space where the Euclidean norm reflects the desired distance between the original points and the hyperplane query. Both use hashing to retrieve near points in sublinear time. Our first method's preprocessing stage is more efficient, while the second has stronger accuracy guarantees. We apply both to pool-based active learning: taking the current hyperplane classifier as a query, our algorithm identifies those points (approximately) satisfying the well-known minimal distance-to-hyperplane selection criterion. We empirically demonstrate our methods' tradeoffs and show that they make it practical to perform active selection with millions of unlabeled points.

37 citations
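The first method's two-bit key can be sketched compactly. On a hedged reading of the abstract, a database point x is keyed by the signs of two independent Gaussian projections, while a hyperplane query with normal w flips the sign of one projection, so keys collide most often when x is nearly perpendicular to w, i.e. close to the hyperplane. The construction below is an assumption drawn from that reading, not a verified reproduction of the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim):
    """One two-bit hash: a pair of independent Gaussian projections u, v."""
    u, v = rng.normal(size=dim), rng.normal(size=dim)
    point_key = lambda x: (np.sign(u @ x), np.sign(v @ x))     # database point
    query_key = lambda w: (np.sign(u @ w), np.sign(-(v @ w)))  # hyperplane normal
    return point_key, query_key

# Points whose keys collide with the query's key are candidates that
# (approximately) minimize distance to the hyperplane, as used in
# margin-based active selection; in practice many such hashes are combined.
point_key, query_key = make_hash(dim=5)
x, w = rng.normal(size=5), rng.normal(size=5)
collides = point_key(x) == query_key(w)  # more likely when x is near the hyperplane
```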

Journal Article
TL;DR: The experimental results show that the proposed stratified sampling based clustering algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.
Abstract: Large-scale data analysis is a challenging and relevant task for present-day research and industry. As a promising data analysis tool, clustering is becoming more important in the era of big data. In large-scale data clustering, sampling is an efficient and widely used approximation technique. Recently, several sampling-based clustering algorithms have attracted considerable attention in large-scale data analysis owing to their efficiency. However, some of these existing algorithms have low clustering accuracy, whereas others have high computational complexity. To overcome these deficiencies, a stratified sampling based clustering algorithm for large-scale data is proposed in this paper. Its basic steps are: (1) obtaining a number of representative samples, via a stratified sampling scheme, from different strata formed by a locality-sensitive hashing technique; (2) partitioning the chosen samples into different clusters using the fuzzy c-means clustering algorithm; (3) assigning the out-of-sample objects to their closest clusters via a data-labeling technique. The performance of the proposed algorithm is compared with state-of-the-art sampling-based fuzzy c-means clustering algorithms on several large-scale data sets, including synthetic and real ones. The experimental results show that the proposed algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.

37 citations
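The three-step pipeline in the abstract above maps naturally to a short script. In the sketch below, LSH buckets form the strata, a handful of points is sampled per stratum, and plain k-means stands in for fuzzy c-means to keep the example compact; all names and parameters are illustrative, not the paper's.

```python
import numpy as np

def stratified_sample(X, n_hashes=3, w=1.0, per_stratum=5, seed=0):
    """Step 1: form strata as LSH buckets, then sample a few points from each."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_hashes, X.shape[1]))
    b = rng.uniform(0, w, size=n_hashes)
    keys = np.floor((X @ a.T + b) / w).astype(int)
    strata = {}
    for i, k in enumerate(map(tuple, keys)):
        strata.setdefault(k, []).append(i)
    return np.array([i for members in strata.values()
                     for i in rng.choice(members, min(per_stratum, len(members)),
                                         replace=False)])

def kmeans(X, k, iters=50, seed=0):
    """Step 2 stand-in: cluster the sampled points (fuzzy c-means in the paper)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers

X = np.random.rand(10_000, 4)
centers = kmeans(X[stratified_sample(X)], k=3)
# Step 3: assign every out-of-sample object to its closest cluster center.
labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
```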


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations, 84% related
Feature extraction: 111.8K papers, 2.1M citations, 83% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Support vector machine: 73.6K papers, 1.7M citations, 82% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    43
2022    108
2021    88
2020    110
2019    104
2018    139