Topic

Locality-sensitive hashing

About: Locality-sensitive hashing is a research topic. Over the lifetime, 1894 publications have been published within this topic receiving 69362 citations.


Papers
Proceedings ArticleDOI
Zehua Zhao, Min Gao, Fengji Luo, Yi Zhang, Qingyu Xiong
19 Jul 2020
TL;DR: Empirical experiments demonstrate the effectiveness of LSHWE in cyberbullying detection, particularly on the "deliberately obfuscated words" problem; LSHWE is also highly efficient, representing tens of thousands of words in a few minutes on a typical single machine.
Abstract: Word embedding methods use low-dimensional vectors to represent the words in a corpus. Such vectors can capture lexical semantics and greatly improve cyberbullying detection performance. However, existing word embedding methods have a major limitation in the cyberbullying detection task: they cannot represent "deliberately obfuscated words" well, i.e., spellings users substitute for bullying words in order to evade detection. These obfuscated words are often treated as "rare words" with little contextual information and are removed during preprocessing. In this paper, we propose a word embedding method called LSHWE to address this limitation, based on the idea that deliberately obfuscated words have high context similarity with their corresponding bullying words. LSHWE has two steps: first, it generates the nearest-neighbor matrix from the co-occurrence matrix and the nearest-neighbor list obtained by Locality Sensitive Hashing (LSH); second, it uses an LSH-based autoencoder to learn word representations from these two matrices. In particular, the reconstructed nearest-neighbor matrix generated by the LSH-based autoencoder makes the representations of deliberately obfuscated words close to those of their corresponding bullying words. To improve efficiency, LSHWE uses LSH to generate both the nearest-neighbor list and the reconstructed nearest-neighbor list. Empirical experiments demonstrate the effectiveness of LSHWE in cyberbullying detection, particularly on the "deliberately obfuscated words" problem. Moreover, LSHWE is highly efficient: it can represent tens of thousands of words in a few minutes on a typical single machine.
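The LSH step the paper relies on can be illustrated with a minimal random-hyperplane sketch (my own toy example, not the paper's implementation): vectors with similar contexts receive identical bit signatures and land in the same bucket, so near neighbors can be found without comparing every pair.

```python
import numpy as np

def lsh_signatures(vectors, n_bits=16, seed=0):
    """Hash each row with random hyperplanes: bit = sign of dot product."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes > 0).astype(np.uint8)

def buckets(signatures):
    """Group row indices by identical bit signature (one LSH table)."""
    table = {}
    for i, sig in enumerate(signatures):
        table.setdefault(sig.tobytes(), []).append(i)
    return table

# Two near-identical "context vectors" (e.g. an obfuscated word and its
# bullying counterpart) agree on far more hash bits than an unrelated one.
rng = np.random.default_rng(1)
base = rng.standard_normal(64)
vecs = np.stack([base,
                 base + 1e-3 * rng.standard_normal(64),  # near-duplicate
                 rng.standard_normal(64)])               # unrelated
sigs = lsh_signatures(vecs)
table = buckets(sigs)
```

For cosine similarity, the probability that two vectors agree on a given bit is 1 minus their angle over pi, which is what makes this hash "locality sensitive".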

7 citations

Proceedings ArticleDOI
15 Jul 2013
TL;DR: This paper proposes an alternative hardware-assisted search algorithm estimated to provide ≥0.95 recall on a 1-trillion feature vector database within 700μs at <150W, even when the hashing bit error rate (BER) is 20% with 1-bit quantization.
Abstract: Content-based search, such as audio/video fingerprinting, identifies a piece of query content by matching its perceptual features against those from a database of reference content. Such matching is challenging in both scalability and robustness, even with state-of-the-art methods like Locality Sensitive Hashing (LSH). Previously, Vote Count, a hardware-assisted algorithm, was proposed to provide such scalability and robustness. We have analyzed this algorithm and found that it would consume very high power, to the point of making cooling impractical. In this paper, we propose an alternative hardware-assisted search algorithm that is estimated to use very low power while providing scalability and robustness. It is estimated to provide ≥0.95 recall on a 1-trillion feature vector (~23M hours of video at a 12fps signature rate) database within 700μs at <150W, even when the hashing bit error rate (BER) is 20% with 1-bit quantization. This amounts to over 1000× power and energy savings compared with highly competitive configurations of LSH, at lower expected system cost and a saving of millions of dollars per year in electricity cost alone.
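The core matching problem the hardware accelerates can be sketched in a few lines (a toy illustration under my own assumptions, not the paper's algorithm): with 1-bit-quantized signatures, a query whose bits are corrupted at a 20% error rate is still far closer in Hamming distance to its true reference than to any unrelated entry.

```python
import random

def hamming(a, b):
    """Number of differing bits between two equal-length signatures."""
    return bin(a ^ b).count("1")

N_BITS = 128
rng = random.Random(42)

# A toy "reference database" of 1-bit-quantized feature signatures.
database = [rng.getrandbits(N_BITS) for _ in range(1000)]

# Corrupt one reference with a 20% bit error rate to simulate the query.
src = 123
query = database[src]
for i in range(N_BITS):
    if rng.random() < 0.20:
        query ^= 1 << i

# Exhaustive nearest-neighbour search by Hamming distance still recovers
# the source: ~20% of bits flip, while unrelated signatures differ in ~50%.
best = min(range(len(database)), key=lambda i: hamming(query, database[i]))
```

The paper's contribution is doing this kind of noisy match at trillion-entry scale within a microsecond-level budget; the brute-force loop above is only the functional specification of what must be computed.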

7 citations

Proceedings ArticleDOI
25 Oct 2010
TL;DR: Experimental results show that the method significantly reduces the training time of the best-learner search procedure, while its performance remains comparable with state-of-the-art methods.
Abstract: AdaBoost has proved to be a successful statistical learning method for concept detection, with strong discrimination and generalization performance. However, training a concept detector with boosting is computationally expensive, especially on large-scale datasets. The bottleneck of the training phase is selecting the best learner among a massive set of learners. Traditional approaches to selecting a weak classifier usually run in O(NT), with N examples and T learners. In this paper, we treat best-learner selection as a Nearest Neighbor Search problem in function space instead of feature space. With the help of the Locality Sensitive Hashing (LSH) algorithm, the best-learner search can be sped up to O(NL), where L is the number of buckets in LSH. In our experiments, L (~600) is much smaller than T (~500,000). In addition, by studying the distribution of weak learners and candidate query points, we present an efficient method that partitions the weak-learner points and the feasible region of query points as uniformly as possible, achieving significant improvements in both recall and precision over the random projection used in the traditional LSH algorithm. Experimental results show that our method significantly reduces training time while remaining comparable with state-of-the-art methods.
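The speed-up idea can be sketched as follows (a minimal illustration under my own assumptions; the learner "points" here are random vectors standing in for responses in function space): instead of scoring all T learners against the boosting target, hash everything into LSH buckets and score only the learners that share the target's bucket.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LEARNERS, DIM, BITS = 5000, 32, 12

# Each weak learner is a point in "function space" (e.g. its vector of
# responses on the training examples; random here for illustration).
learners = rng.standard_normal((N_LEARNERS, DIM))
target = learners[17].copy()  # the ideal response we want to approximate

# One LSH table of random hyperplanes: only learners in the target's
# bucket are scored, instead of scanning all T of them.
planes = rng.standard_normal((DIM, BITS))
codes = learners @ planes > 0
target_code = target @ planes > 0

candidates = np.flatnonzero((codes == target_code).all(axis=1))
best = candidates[np.argmin(
    np.linalg.norm(learners[candidates] - target, axis=1))]
```

With 12 hash bits there are up to 4096 buckets, so the candidate list is orders of magnitude smaller than the full learner set, which is the O(NT) to O(NL) reduction the abstract describes.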

7 citations

Proceedings ArticleDOI
19 Jul 2010
TL;DR: Experimental results confirm that the proposed hashing method shows robustness against geometrical and topological attacks and provides a unique hash for each model and key.
Abstract: In this paper, a robust 3D mesh hashing method based on a key-dependent 3D surface feature is developed. The main objectives of the proposed hashing method are robustness against content-preserving attacks and blind detection without any attack-specific preprocessing. To achieve these objectives, the method projects all vertices to the shape coordinates of 3D SSD and curvedness, then segments the shape coordinates into rectangular blocks and computes the block shape intensity using a permutation key and a random key. A hash is generated by binarizing the block shape intensity. Experimental results confirm that the proposed hashing method is robust against geometrical and topological attacks and provides a unique hash for each model and key.

7 citations

Journal ArticleDOI
TL;DR: A modification of an existing incremental algorithm, probability-based incremental association rule discovery, that reduces both the number of scans over the original database and the number of candidate itemsets generated for frequent and expected frequent 2-itemsets, yielding faster execution than previous methods.
Abstract: Discovery of association rules is one of the most interesting areas of research in data mining, extracting co-occurrences of itemsets. In a dynamic database where new transactions are inserted, keeping patterns up to date and discovering new patterns are challenging problems of great practical importance: insertions may introduce new association rules and invalidate some existing ones. It is therefore important to study efficient algorithms for incremental update of association rules in large databases. In this paper, we modify an existing incremental algorithm, probability-based incremental association rule discovery, which uses the principle of Bernoulli trials to find frequent and expected frequent k-itemsets. The frequent and expected frequent k-itemsets are determined from candidate k-itemsets, and generating and testing this candidate set is a time-consuming step. To reduce the number of candidate 2-itemsets and avoid repeatedly scanning the database to check a large candidate set, we apply a hash technique to the generation of candidate 2-itemsets, especially the frequent and expected frequent 2-itemsets, to improve the performance of the probability-based algorithm. The algorithm thus reduces both the number of scans over the original database and the number of candidate itemsets generated for frequent and expected frequent 2-itemsets. As a result, it runs faster than previous methods. We also conduct simulation experiments, and the results show that the proposed algorithm performs well.
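The hash-based pruning of candidate 2-itemsets can be illustrated with a small PCY/DHP-style sketch (my own toy example in that spirit, not this paper's exact incremental algorithm): while counting single items on the first pass, every pair is also hashed into a small bucket array, and on the second pass a pair is a candidate only if both its items are frequent and its bucket count reaches the support threshold.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b", "c"}, {"a", "b", "c"}, {"d", "e"},
]
MIN_SUP = 3
N_BUCKETS = 7

# Pass 1: count single items AND hash every pair into a small bucket array.
item_count = Counter()
bucket_count = [0] * N_BUCKETS
for t in transactions:
    item_count.update(t)
    for pair in combinations(sorted(t), 2):
        bucket_count[hash(pair) % N_BUCKETS] += 1

frequent_items = {i for i, c in item_count.items() if c >= MIN_SUP}

# Pass 2: a pair is a candidate only if both items are frequent AND its
# bucket is frequent -- the hash table prunes pairs before any rescan.
candidates = {
    pair
    for t in transactions
    for pair in combinations(sorted(t & frequent_items), 2)
    if bucket_count[hash(pair) % N_BUCKETS] >= MIN_SUP
}
support = Counter(
    pair for t in transactions
    for pair in combinations(sorted(t), 2) if pair in candidates
)
frequent_pairs = {p for p, c in support.items() if c >= MIN_SUP}
```

Bucket counts only over-estimate pair support (collisions add counts, never remove them), so the pruning is safe: no truly frequent pair is ever discarded.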

7 citations


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations (84% related)
Feature extraction: 111.8K papers, 2.1M citations (83% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Feature (computer vision): 128.2K papers, 1.7M citations (82% related)
Support vector machine: 73.6K papers, 1.7M citations (82% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    43
2022    108
2021    88
2020    110
2019    104
2018    139