
Showing papers on "Locality-sensitive hashing published in 2020"


Posted Content
TL;DR: A novel transformer variant named the Adaptive Clustering Transformer (ACT) is proposed to reduce the computation cost for high-resolution input; it achieves a good balance between accuracy and computation cost (FLOPs).
Abstract: End-to-end Object Detection with Transformer (DETR) performs object detection with a Transformer and achieves performance comparable to two-stage detectors such as Faster R-CNN. However, DETR needs huge computational resources for training and inference due to the high-resolution spatial input. In this paper, a novel transformer variant named the Adaptive Clustering Transformer (ACT) is proposed to reduce the computation cost for high-resolution input. ACT clusters the query features adaptively using Locality Sensitive Hashing (LSH) and approximates the query-key interaction with a prototype-key interaction. ACT reduces the quadratic O(N^2) complexity inside self-attention to O(NK), where K is the number of prototypes in each layer. ACT can be a drop-in module replacing the original self-attention module without any training. ACT achieves a good balance between accuracy and computation cost (FLOPs). The code is available as supplementary material to ease replication and verification of the experiments.
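The clustering-and-prototype step is the core of the method and is easy to picture in code. The following is a minimal NumPy sketch under simplifying assumptions (single head, hyperplane LSH for the clustering, bucket means as prototypes); it illustrates the O(NK) idea only and is not the authors' implementation, which operates on multi-head attention inside DETR.

```python
import numpy as np

def lsh_codes(x, n_bits=8, seed=0):
    # Sign-random-projection (hyperplane) LSH: one bit per random hyperplane,
    # packed into an integer bucket id.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(x.shape[1], n_bits))
    bits = (x @ planes) > 0
    return bits @ (1 << np.arange(n_bits))

def prototype_attention(Q, K, V, n_bits=8):
    # Replace softmax(Q K^T) V by prototype-key attention: queries falling into the
    # same LSH bucket share one prototype, so the cost is O(N*K) for K buckets.
    codes = lsh_codes(Q, n_bits)
    out = np.empty((Q.shape[0], V.shape[1]))
    for c in np.unique(codes):
        idx = np.where(codes == c)[0]
        proto = Q[idx].mean(axis=0)                  # prototype query of this bucket
        logits = proto @ K.T / np.sqrt(K.shape[1])
        w = np.exp(logits - logits.max())
        out[idx] = (w / w.sum()) @ V                 # broadcast to all bucket members
    return out
```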

109 citations


Journal ArticleDOI
TL;DR: A multi-dimensional, quality-ensemble-driven recommendation approach named RecLSH-TOPSIS, based on LSH and TOPSIS (Technique for Order Preference by Similarity to Ideal Solution), is proposed; it makes privacy-preserving edge service recommendations across multiple QoS dimensions.

76 citations


Journal ArticleDOI
TL;DR: A multi-stage model for anomaly detection is proposed that rectifies the problems of traditional DBSCAN by using kernel-based locality-sensitive hashing and a Davies–Bouldin-index-based K-medoids approach.

72 citations


Posted Content
TL;DR: This survey investigates current deep hashing algorithms, including deep supervised hashing and deep unsupervised hashing, and categorizes deep supervised hashing methods into pairwise, ranking-based, pointwise, and quantization methods according to how the similarities of the learned hash codes are measured.
Abstract: Nearest neighbor search finds the data points in a database whose distances to the query are smallest, a fundamental problem in various domains such as computer vision, recommendation systems, and machine learning. Hashing is one of the most widely used methods because of its computational and storage efficiency. With the development of deep learning, deep hashing methods show more advantages than traditional methods. In this paper, we present a comprehensive survey of deep hashing algorithms. Specifically, we categorize deep supervised hashing methods into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, classification-oriented preserving, and quantization, according to the manner of preserving the similarities. In addition, we introduce other topics such as deep unsupervised hashing and multi-modal deep hashing methods. We also present some commonly used public datasets and the schemes used to measure the performance of deep hashing algorithms. Finally, we discuss some potential research directions in the conclusion.

59 citations


Proceedings Article
30 Apr 2020
TL;DR: A new framework for building space partitions is developed that reduces the problem to balanced graph partitioning followed by supervised classification; the partitions obtained by Neural LSH consistently outperform those found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.
Abstract: Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces (Andoni et al. 2018b,c), we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner (Sanders and Schulz 2013) and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS (Aumuller et al. 2017), our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.
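A rough sketch of the two-stage recipe (partition the data, then learn a classifier that routes queries to partitions) is shown below. It is only illustrative: the paper partitions a kNN graph with KaHIP, whereas this sketch substitutes KMeans labels as a stand-in so the routing idea can be run end to end with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

def build_partitions(X, n_parts=16):
    # Stand-in partitioner: the paper derives balanced partitions from a kNN graph
    # with KaHIP; plain KMeans is used here only to produce labels to learn from.
    labels = KMeans(n_clusters=n_parts, n_init=10, random_state=0).fit_predict(X)
    router = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
    router.fit(X, labels)                                  # supervised classification step
    buckets = {p: np.where(labels == p)[0] for p in range(n_parts)}
    return router, buckets

def query(router, buckets, X, q, n_probe=3):
    # Probe the partitions the router considers most likely, then scan them exactly.
    probs = router.predict_proba(q[None])[0]
    top = router.classes_[np.argsort(probs)[::-1][:n_probe]]
    cand = np.concatenate([buckets[p] for p in top])
    return cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]
```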

53 citations


Journal ArticleDOI
TL;DR: A unique amplified locality-sensitive hashing (LSH)-based service recommendation method, SRAmplified-LSH, is proposed in the article; it guarantees a good balance between recommendation accuracy and efficiency while protecting users' privacy information.

45 citations


Journal ArticleDOI
Hong Zhong, Li Zhanfei, Jie Cui, Yue Sun, Lu Liu
TL;DR: This paper proposes an efficient dynamic multi-keyword fuzzy search scheme for encrypted cloud data to support dynamic file updates and demonstrates that this scheme is more efficient than existing similar schemes.

31 citations


Journal ArticleDOI
01 May 2020
TL;DR: Based on virtual hypersphere partitioning, a novel disk-based indexing and searching scheme, VHP, is proposed to answer c-ANN queries; it achieves up to 2x speedup in running time over state-of-the-art methods.
Abstract: Locality sensitive hashing (LSH) is a widely practiced c-approximate nearest neighbor (c-ANN) search algorithm in high-dimensional spaces. The state-of-the-art LSH-based algorithm searches an unbounded and irregular space to identify candidates, which jeopardizes efficiency. To address this issue, we introduce the concept of virtual hypersphere partitioning. The core idea is to impose a virtual hypersphere, centered at the query, in the original feature space and only examine points inside the hypersphere. The search space of a hypersphere is isotropic and bounded, and thus more efficient than the existing one. In practice, we use multiple physical hyperspheres with different radii in corresponding projection subspaces to emulate the single virtual hypersphere. We also develop a principled method to compute the hypersphere radii for a given success probability. Based on virtual hypersphere partitioning, we propose a novel disk-based indexing and searching scheme, VHP, to answer c-ANN queries. In the indexing phase, VHP stores LSH projections in independent B+-trees. To process a query, VHP keeps increasing the radii of the physical hyperspheres coordinately, which in effect amounts to enlarging the virtual hypersphere, to accommodate more candidates until the success probability is met. Rigorous theoretical analysis shows that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with a probability guarantee. Extensive experiments on a variety of datasets, including billion-scale ones, demonstrate that VHP achieves different tradeoffs between efficiency and accuracy, with up to 2x speedup in running time over state-of-the-art methods.
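The coordinated-radius idea can be sketched compactly. Below is a toy NumPy rendering under simplifying assumptions: one-dimensional LSH projections held in arrays rather than B+-trees, radii grown geometrically, and a fixed candidate-count stopping rule in place of the paper's success-probability criterion. Candidates returned this way would still be re-ranked by exact distance afterwards.

```python
import numpy as np

def build_projections(X, m=8, seed=0):
    # m one-dimensional LSH projections; the paper stores each in its own B+-tree.
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], m))
    return A, X @ A

def virtual_sphere_candidates(P, q_proj, r0=0.5, grow=1.5, min_cand=100):
    # Grow the projected radii in lock-step; enlarging all physical hyperspheres
    # together emulates enlarging one virtual hypersphere around the query.
    r = r0
    while True:
        inside = np.all(np.abs(P - q_proj) <= r, axis=1)
        if inside.sum() >= min_cand or inside.all():
            return np.where(inside)[0]        # candidate ids for exact re-ranking
        r *= grow

# A, P = build_projections(X); cands = virtual_sphere_candidates(P, q @ A)
```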

30 citations


Proceedings ArticleDOI
23 Aug 2020
TL;DR: This work presents Grale, a scalable method developed to address the problem of graph design for graphs with billions of nodes; it operates by fusing together different measures of (potentially weak) similarity to create a graph that exhibits high task-specific homophily between its nodes.
Abstract: How can we find the right graph for semi-supervised learning? In real world applications, the choice of which edges to use for computation is the first step in any graph learning process. Interestingly, there are often many types of similarity available to choose as the edges between nodes, and the choice of edges can drastically affect the performance of downstream semi-supervised learning systems. However, despite the importance of graph design, most of the literature assumes that the graph is static. In this work, we present Grale, a scalable method we have developed to address the problem of graph design for graphs with billions of nodes. Grale operates by fusing together different measures of (potentially weak) similarity to create a graph which exhibits high task-specific homophily between its nodes. Grale is designed for running on large datasets. We have deployed Grale in more than 20 different industrial settings at Google, including datasets which have tens of billions of nodes, and hundreds of trillions of potential edges to score. By employing locality sensitive hashing techniques, we greatly reduce the number of pairs that need to be scored, allowing us to learn a task specific model and build the associated nearest neighbor graph for such datasets in hours, rather than the days or even weeks that might be required otherwise. We illustrate this through a case study where we examine the application of Grale to an abuse classification problem on YouTube with hundreds of millions of items. In this application, we find that Grale detects a large number of malicious actors on top of hard-coded rules and content classifiers, increasing the total recall by 89% over those approaches alone.
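The role LSH plays here is to prune the roughly n^2 space of potential edges before the learned similarity model is applied. A small sketch of that pruning step is given below, using hyperplane LSH over feature vectors; the pair-scoring model itself and Grale's multi-similarity fusion are out of scope, and the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def lsh_candidate_pairs(X, n_tables=4, n_bits=12, seed=0):
    # Only pairs that collide in at least one table are handed to the (expensive)
    # task-specific similarity model, instead of scoring all ~n^2 pairs.
    rng = np.random.default_rng(seed)
    pairs = set()
    for _ in range(n_tables):
        planes = rng.normal(size=(X.shape[1], n_bits))
        codes = ((X @ planes) > 0) @ (1 << np.arange(n_bits))
        buckets = {}
        for i, c in enumerate(codes):
            buckets.setdefault(int(c), []).append(i)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs
```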

27 citations


Posted Content
TL;DR: In this paper, the authors propose to learn high-dimensional, sparse representations that have representational capacity similar to dense embeddings while being more efficient, because sparse matrix multiplication can be much faster than dense multiplication.
Abstract: Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. Approximate methods based on learning compact representations, such as locality sensitive hashing, product quantization, and PCA, have been widely explored for this problem. In this work, in contrast to learning compact representations, we propose to learn high dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.
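The regularizer described above has a very small footprint in code. The PyTorch sketch below shows one plausible form of the relaxed-FLOPs penalty, estimating each dimension's activation probability by its mean absolute value over a batch; the exact relaxation and weighting used in the paper may differ, and lam is an assumed tuning knob.

```python
import torch

def flops_regularizer(batch_embeddings):
    # Continuous relaxation of retrieval FLOPs: if dimension j is "active" with
    # probability p_j, the expected multiplications per query-document pair scale
    # with sum_j p_j^2; p_j is estimated by the mean absolute activation in the batch.
    p = batch_embeddings.abs().mean(dim=0)
    return (p ** 2).sum()

# total_loss = task_loss + lam * flops_regularizer(emb)   # lam: sparsity/accuracy knob
```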

25 citations


Proceedings ArticleDOI
20 Apr 2020
TL;DR: A novel and easy-to-implement disk-based method named R2LSH is proposed to answer ANN queries in high-dimensional spaces; rigorous theoretical analysis reveals that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with a probability guarantee.
Abstract: Locality sensitive hashing (LSH) is a widely practiced c-approximate nearest neighbor (c-ANN) search algorithm because of its appealing theoretical guarantee and empirical performance. However, available LSH-based solutions do not achieve a good balance between cost and quality because of certain limitations in their index structures. In this paper, we propose a novel and easy-to-implement disk-based method named R2LSH to answer ANN queries in high-dimensional spaces. In the indexing phase, R2LSH maps data objects into multiple two-dimensional projected spaces. In each space, a group of B+-trees is constructed to characterize the corresponding data distribution. In the query phase, by setting a query-centric ball in each projected space and using a dynamic counting technique, R2LSH efficiently determines candidates and returns query results with the required quality. Rigorous theoretical analysis reveals that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with probability guarantee. Extensive experiments on real datasets verify the superiority of R2LSH over state-of-the-art methods.

Proceedings Article
01 Jan 2020
TL;DR: The algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters; SMYRF can be used interchangeably with dense attention before and after training.
Abstract: We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$, where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. On the contrary, prior fast attention methods impose constraints (e.g. queries and keys share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using $50\%$ less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens on BigGAN on CelebA-HQ.
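The balanced-clustering mechanism is the part that is simplest to illustrate. The NumPy sketch below hashes queries and keys with a shared random projection, sorts them, splits both into equal-size groups, and attends only within matched groups; SMYRF's asymmetric transformations and adaptive LSH scheme are omitted, so treat this as the general balanced-cluster pattern rather than the paper's method.

```python
import numpy as np

def balanced_cluster_attention(Q, K, V, n_clusters=8, seed=0):
    # Hash with a shared random projection, sort queries and keys by hash value,
    # split both into equal-size groups, and attend only within matched groups.
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(size=d)
    q_groups = np.array_split(np.argsort(Q @ a), n_clusters)
    k_groups = np.array_split(np.argsort(K @ a), n_clusters)
    out = np.zeros((n, V.shape[1]))
    for qi, ki in zip(q_groups, k_groups):
        logits = Q[qi] @ K[ki].T / np.sqrt(d)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        out[qi] = (w / w.sum(axis=1, keepdims=True)) @ V[ki]
    return out
```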

Journal ArticleDOI
01 May 2020
TL;DR: This work proposes indexable distance estimating codes (iDEC), a new solution framework for ANN that extends and improves the locality-sensitive hashing framework in a fundamental and systematic way, and discovers deep connections between Error-Estimating Codes (EEC), LSH, and iDEC.
Abstract: Approximate Nearest Neighbor (ANN) search is a fundamental algorithmic problem, with numerous applications in many areas of computer science. In this work, we propose indexable distance estimating codes (iDEC), a new solution framework to ANN that extends and improves the locality sensitive hashing (LSH) framework in a fundamental and systematic way. Empirically, an iDEC-based solution has a low index space complexity of O(n) and can achieve a low average query time complexity of approximately O(log n). We show that our iDEC-based solutions for ANN in Hamming and edit distances outperform the respective state-of-the-art LSH-based solutions for both in-memory and external-memory processing. We also show that our iDEC-based in-memory ANN-H solution is more scalable than all existing solutions. We also discover deep connections between Error-Estimating Codes (EEC), LSH, and iDEC.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: A novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework is proposed with a theoretical guarantee; the experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.
Abstract: Locality-Sensitive Hashing (LSH) is one of the most popular methods for c-Approximate Nearest Neighbor Search (c-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for k-LCCS search. The insight of the LCCS search framework is that close data objects will have a longer LCCS than far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports c-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.
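For readers unfamiliar with the LCCS primitive, the brute-force check below shows one plausible reading of the definition used in the abstract: the longest run of positions, taken circularly from a common starting offset, on which two equal-length strings agree. The paper's Circular Shift Array computes this far more efficiently; this quadratic version is only for intuition.

```python
def lccs(x, y):
    # One reading of the Longest Circular Co-Substring: the longest run of positions,
    # taken circularly and starting at the same offset in both strings, on which
    # x and y agree. Close objects agree on long runs with high probability.
    assert len(x) == len(y)
    m = len(x)
    best = 0
    for start in range(m):
        run = 0
        while run < m and x[(start + run) % m] == y[(start + run) % m]:
            run += 1
        best = max(best, run)
    return best

# lccs("10110100", "10010101") -> 4, the longest agreeing circular run
```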

Journal ArticleDOI
TL;DR: A transfer learning-based HIF detection method is proposed under a cloud-edge collaboration framework of the Internet of Things, which can solve the problem of insufficient data by integrating historical data from multiple distribution networks.
Abstract: High impedance faults (HIFs) in distribution networks are hard to describe and detect precisely because of the complexity and randomness of their features. Therefore, traditional feature analysis methods may lack sufficient reliability and generalization, which makes data-based methods a more appropriate option. However, according to previous statistical analyses, in practical scenarios, only a small quantity of historical HIF data (less than 20%) can be recorded and utilized. In this article, a transfer learning-based HIF detection method is proposed under a cloud-edge collaboration framework of the Internet of Things, which can solve the problem of insufficient data by integrating historical data from multiple distribution networks. Through the cloud-edge collaboration framework, all features from different distribution networks are first integrated to form a basic cloud convolutional neural network model for HIF detection. The features are extracted and updated by edge computers based on the accurate synchronous measurements provided by distribution-level phasor measurement units. To unify the data scales of the different distribution networks, principal component analysis is adopted during feature extraction. Specific to each distribution network, the target HIF detection model is transferred from the basic cloud model by fine-tuning. Furthermore, a data augmentation method based on locality sensitive hashing is proposed to improve the performance of the transferred model. The proposed HIF detection method can be operated in both online and offline modes. The performance was verified on seven different distribution networks in numerical simulations and on one practical experimental distribution network.


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A hashing function for face template protection is presented that improves the correctness of existing algorithms while maintaining security; it reaches the REQ-WBP (Weak Biometric Privacy) security level, which implies irreversibility.
Abstract: In this paper, we present a hashing function for the application of face template protection, which improves the correctness of existing algorithms while maintaining the security simultaneously. The novel architecture is constructed from four components: a self-defined concept called padding people, Random Fourier Features, Support Vector Machine, and Locality Sensitive Hashing. The proposed method is trained, with one-shot and multi-shot enrollment, to encode the user's biometric data to a predefined output with high probability. The predefined hashing output is cryptographically hashed and stored as a secure face template. Predesigning outputs ensures the strict requirements of biometric cryptosystems, namely randomness and unlinkability. We prove that our method reaches the REQ-WBP (Weak Biometric Privacy) security level, which implies irreversibility. The efficacy of our approach is evaluated on the widely used CMU-PIE, FEI, and FERET databases; our matching performance achieves a 100% genuine acceptance rate at 0% false acceptance rate for all three databases and enrollment types. To our knowledge, our matching results outperform most state-of-the-art results.

Journal ArticleDOI
TL;DR: A scalable mining algorithm to discover contextual outliers in relevant subspaces, built on the MapReduce programming model and running on a Hadoop cluster, is proposed; the experimental results validate the effectiveness, interpretability, scalability, and extensibility of the algorithm.
Abstract: In this paper, we propose a scalable mining algorithm to discover contextual outliers using relevant subspaces. We develop the mining algorithm using the MapReduce programming model running on a Hadoop cluster. Relevant subspaces, which effectively capture the local distribution of various datasets, are quantified using local sparseness of attribute dimensions. We design a novel way of calculating local outlier factors in a relevant subspace with the probability density of local datasets; this new approach can effectively reflect the outlier degree of a data object that does not satisfy the distribution of the local dataset in the relevant subspace. Attribute dimensions of a relevant subspace and local outlier factors are expressed as vital contextual information, which improves the interpretability of outliers. Importantly, the N data objects with the largest local outlier factor values are categorized as contextual outliers in our solution. To this end, our scalable mining algorithm, which incorporates the locality sensitive hashing distributed strategy, is implemented on a Hadoop cluster. The experimental results validate the effectiveness, interpretability, scalability, and extensibility of the algorithm using both synthetic data and stellar spectral data as experimental datasets.

Proceedings ArticleDOI
20 Apr 2020
TL;DR: This work proposes a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data, and designs a novel hashing technique, called Single Pass Signature (SPS), that avoids the multiple passes over the data required by conventional LSH indexing.
Abstract: Scalable subsequence matching is critical for supporting analytics on big time series, from mining and prediction to hypothesis testing. However, state-of-the-art subsequence matching techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but also the query response time deteriorates quickly as the length of the query subsequence exceeds several hundred data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash functions that perform multiple passes over the data and thus is impractical for big time series. In this work, we propose a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data. As a foundation of ChainLink, we design a novel hashing technique, called Single Pass Signature (SPS), that successfully tackles the above problem. In particular, we prove theoretically and demonstrate experimentally that the similarity proximity of the indexed subsequences is preserved by our proposed single-pass SPS scheme. Leveraging this SPS innovation, ChainLink then adopts a three-step approach for scalable index building: (1) in-place data re-organization within each partition to enable efficient record-level random access to all subsequences, (2) parallel building of hash-based local indices on top of the re-organized data using our SPS scheme for efficient search within each partition, and (3) efficient aggregation of the local indices to construct a centralized yet highly compact global index for effective pruning of irrelevant partitions during query processing. ChainLink achieves the above three steps in one single map-reduce process. Our experimental evaluation shows that ChainLink indices are compact at less than 2% of dataset size while state-of-the-art index sizes tend to be almost the same size as the dataset. Better still, ChainLink is up to 2 orders of magnitude faster in its index construction time compared to state-of-the-art techniques, while improving both the final query response time by up to 10 fold and the result accuracy by 15%.

Journal ArticleDOI
01 Aug 2020
TL;DR: This tutorial reviews exact and approximate methods such as cover trees, locality sensitive hashing, product quantization, and proximity graphs; it also discusses the selectivity estimation problem and shows how researchers are bringing state-of-the-art ML techniques to bear on it.
Abstract: Similarity query processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications. Recently, embedding and auto-encoding methods as well as pre-trained models have gained popularity. They basically deal with high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing for high-dimensional data. Meanwhile, new techniques have emerged to tackle this long-standing problem theoretically and empirically. In this tutorial, we summarize existing solutions, especially recent advancements from both database (DB) and machine learning (ML) communities, and analyze their strengths and weaknesses. We review exact and approximate methods such as cover tree, locality sensitive hashing, product quantization, and proximity graphs. We also discuss the selectivity estimation problem and show how researchers are bringing in state-of-the-art ML techniques to address the problem. By highlighting the strong connections between DB and ML, we hope that this tutorial provides an impetus towards new ML for DB solutions and vice versa.

Book ChapterDOI
09 Jan 2020
TL;DR: The performance of the scheme is evaluated in terms of retrieval precision and search efficiency on distinct and similar image categories; experimental results show that the proposed scheme outperforms existing state-of-the-art SCBIR schemes.
Abstract: Secure Content-Based Image Retrieval (SCBIR) is gaining enormous importance due to its applications involving highly sensitive images comprising medical and personally identifiable data, such as clinical decision-making, biometric matching, and multimedia search. SCBIR on outsourced images is realized by generating a secure searchable index from features like color, shape, and texture in unencrypted images. We focus on enhancing the efficiency of SCBIR by combining two visual descriptors which serve as a modified feature descriptor. To improve the efficacy of search, pre-filter tables are generated using Locality Sensitive Hashing (LSH) and the resulting adjacent hash buckets are joined to enhance retrieval precision. The top-k relevant images are securely retrieved using the Secure k-Nearest Neighbors (kNN) algorithm. Performance of the scheme is evaluated based on retrieval precision and search efficiency on distinct and similar image categories. Experimental results show that the proposed scheme outperforms existing state-of-the-art SCBIR schemes.
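The joining of adjacent hash buckets amounts to probing codes near the query's own LSH code. The plaintext sketch below illustrates that idea only; the secure index, encryption, and secure kNN ranking used in the scheme are not modeled, and table and q_code are assumed to be an LSH bucket dictionary and the query's packed bit code built elsewhere.

```python
from itertools import combinations

def adjacent_bucket_codes(code, n_bits, max_flips=1):
    # "Joining adjacent hash buckets": also probe every bucket whose code differs
    # from the query's LSH code in at most max_flips bit positions, trading a
    # little extra lookup work for better retrieval precision.
    codes = {code}
    for k in range(1, max_flips + 1):
        for positions in combinations(range(n_bits), k):
            flipped = code
            for b in positions:
                flipped ^= (1 << b)
            codes.add(flipped)
    return codes

# candidates = set().union(*(table.get(c, []) for c in adjacent_bucket_codes(q_code, 16)))
```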

Proceedings ArticleDOI
14 Jun 2020
TL;DR: Fairness is considered in the sense of equal opportunity: all points within distance r from the query should have the same probability of being returned (a problem first studied in the low-dimensional case by Hu, Qiao, and Tao); among other results, a data structure is developed for fair similarity search under inner product that requires nearly linear space and exploits locality-sensitive filters.
Abstract: Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r>0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.
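The black-box construction can be conveyed with a short rejection-sampling sketch: pick one of the query's LSH buckets proportionally to its size, pick a point inside it uniformly, and accept with probability inversely proportional to how many of the query's buckets contain that point. The interface below (tables as dictionaries, hash_fns as callables) is assumed for illustration, and the final filter to points actually within distance r is left to the caller.

```python
import random

def uniform_near_sample(tables, hash_fns, q):
    # Fair sampling over the union of q's LSH buckets: choose a bucket proportionally
    # to its size, a point uniformly inside it, then accept with probability 1/deg(p),
    # where deg(p) is the number of q's buckets containing p. Every colliding point
    # is then returned with the same probability.
    buckets = [t.get(h(q), []) for t, h in zip(tables, hash_fns)]
    sizes = [len(b) for b in buckets]
    if sum(sizes) == 0:
        return None
    while True:
        j = random.choices(range(len(buckets)), weights=sizes)[0]
        p = random.choice(buckets[j])
        deg = sum(p in b for b in buckets)
        if random.random() < 1.0 / deg:
            return p
```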

Journal ArticleDOI
TL;DR: This work presents a scalable FPGA-based system architecture to accelerate the comparison of binary strings, optimized for high throughput using hundreds of computing elements arranged in a systolic array.
Abstract: This paper is concerned with Field Programmable Gate Arrays (FPGA)-based systems for energy-efficient high-throughput string comparison. Modern applications which involve comparisons across large data sets, such as large sequence sets in molecular biology, are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range of 16 to 2048 bits, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH), one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O(m*n) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high throughput using hundreds of computing elements arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second of 512-bit signatures was achieved, using a design with 384 parallel processing elements and a clock frequency of 200 MHz. This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design executed on an NVIDIA GTX1060, it performs nearly five times faster.
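For orientation, the computational kernel being accelerated is just an XOR-and-popcount per signature pair, repeated m*n times. A few lines of plain Python make the workload concrete; the FPGA design parallelizes this scan across hundreds of systolic processing elements.

```python
def hamming(a, b):
    # Signatures held as Python ints (e.g. 512-bit); XOR then population count.
    return bin(a ^ b).count("1")          # int.bit_count() on Python 3.10+

def scan(query_sig, db_sigs, k=10):
    # The m*n comparison load the FPGA parallelizes, done here as a serial scan
    # for a single query signature.
    order = sorted(range(len(db_sigs)), key=lambda i: hamming(query_sig, db_sigs[i]))
    return order[:k]
```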

Journal ArticleDOI
TL;DR: The software msSLASH is presented, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique; it is demonstrated that the algorithm significantly reduces the number of spectral comparisons and, as a result, achieves a 2-9X speedup over the existing spectral library searching algorithm SpectraST.
Abstract: With the accumulation of MS/MS spectra collected in spectral libraries, the spectral library searching approach emerges as an important approach for peptide identification in proteomics, complementary to the commonly used protein database searching approach, in particular for the proteomic analyses of well-studied model organisms, such as human. Existing spectral library searching algorithms compare a query MS/MS spectrum with each spectrum in the library with matched precursor mass and charge state, which may become computationally intensive with the rapidly growing library size. Here, the software msSLASH, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique, is presented. The algorithm first converts the library and query spectra into bit-strings using LSH functions, and then computes the similarity between the spectra with highly similar bit-string. Using the spectral library searching of large real-world MS/MS spectra datasets, it is demonstrated that the algorithm significantly reduced the number of spectral comparisons, and as a result, achieved 2-9X speedup in comparison with existing spectral library searching algorithm SpectraST. The spectral searching algorithm is implemented in C/C++, and is ready to be used in proteomic data analyses.

Posted Content
TL;DR: This work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner and shows that BioHash outperforms previously published benchmarks for various hashing methods.
Abstract: The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.
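To make the "sparse expansive" code format concrete, here is a small NumPy sketch in the spirit of FlyHash: a sparse binary random projection into a much higher-dimensional space followed by a winner-take-all step that keeps only the top-k activations. BioHash keeps this output format but learns the projection with a biologically plausible plasticity rule, which is not reproduced here; all parameter values are illustrative.

```python
import numpy as np

def flyhash(X, expansion=20, k=32, density=0.1, seed=0):
    # Sparse expansive hashing: sparse binary random projection to a much higher
    # dimension, then winner-take-all keeps only the k largest activations per row.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = (rng.random((d, expansion * d)) < density).astype(float)
    A = X @ W
    H = np.zeros_like(A)
    top = np.argpartition(A, -k, axis=1)[:, -k:]
    np.put_along_axis(H, top, 1.0, axis=1)
    return H                              # sparse, high-dimensional binary codes
```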

Book ChapterDOI
30 Sep 2020
TL;DR: This work presents a novel index structure called radius-optimized Locality Sensitive Hashing (roLSH), and extensive experimental analysis on real datasets shows the performance benefit of roLSH over existing state-of-the-art LSH techniques.
Abstract: Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate nearest neighbors. Existing state-of-the-art LSH techniques mainly focus on minimizing the total number of I/Os while sacrificing the overall processing time. The main time-consuming process in LSH techniques is the process of finding neighboring points in projected spaces. We present a novel index structure called radius-optimized Locality Sensitive Hashing (roLSH). With the help of sampling techniques and Neural Networks, we present two techniques to find neighboring points in projected spaces efficiently, without sacrificing the accuracy of the results. Our extensive experimental analysis on real datasets shows the performance benefit of roLSH over existing state-of-the-art LSH techniques.


Journal ArticleDOI
TL;DR: An efficient mechanism is provided for detecting plagiarism in repositories of Model-Driven Engineering (MDE) assignments, based on adapting Locality Sensitive Hashing, an approximate nearest neighbor search technique, to the modeling technical space.
Abstract: Reports suggest plagiarism is a common occurrence in universities. While plagiarism detection mechanisms exist for textual artifacts, this is less so for non-code related ones such as software desi...

Proceedings ArticleDOI
01 Mar 2020
TL;DR: In this paper, an implicit self-regularization is proposed that pushes the mean and variance of filter angles in a network towards 90° and 0° simultaneously to achieve (near) orthogonality among the filters, without using any other explicit regularization.
Abstract: In this paper, we investigate the empirical impact of orthogonality regularization (OR) in deep learning, either solo or collaboratively. Recent works on OR showed some promising results on the accuracy. In our ablation study, however, we do not observe such significant improvement from existing OR techniques compared with the conventional training based on weight decay, dropout, and batch normalization. To identify the real gain from OR, inspired by the locality sensitive hashing (LSH) in angle estimation, we propose to introduce an implicit self-regularization into OR to push the mean and variance of filter angles in a network towards 90° and 0° simultaneously to achieve (near) orthogonality among the filters, without using any other explicit regularization. Our regularization can be implemented as an architectural plug-in and integrated with an arbitrary network. We reveal that OR helps stabilize the training process and leads to faster convergence and better generalization.
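The objective, a mean pairwise filter angle near 90° with small variance, can be written down directly as a penalty. The PyTorch sketch below is an explicit-loss rendering of that objective for intuition only; the paper instead realizes it implicitly as an architectural plug-in, so this is a stand-in rather than the authors' mechanism.

```python
import math
import torch

def filter_angle_regularizer(weight):
    # Drive the mean pairwise angle between filters toward 90 degrees and its
    # variance toward 0, computed from the off-diagonal cosine similarities.
    W = torch.nn.functional.normalize(weight.flatten(1), dim=1)   # (filters, fan_in)
    cos = (W @ W.t()).clamp(-0.999, 0.999)
    mask = ~torch.eye(cos.size(0), dtype=torch.bool, device=cos.device)
    ang = torch.acos(cos[mask])                                   # pairwise angles, radians
    return (ang.mean() - math.pi / 2) ** 2 + ang.var()
```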

Journal ArticleDOI
TL;DR: The proposed conLSH-based aligner is compared with rHAT, which is popularly used for aligning SMRT reads, and is found to outperform it comprehensively in both speed and memory requirements.