
Showing papers on "Locality-sensitive hashing published in 2020"


Posted Content
TL;DR: A novel transformer variant named the Adaptive Clustering Transformer (ACT) is proposed to reduce the computation cost for high-resolution input; it achieves a good balance between accuracy and computation cost (FLOPs).
Abstract: End-to-end Object Detection with Transformer (DETR) performs object detection with a Transformer and achieves performance comparable to two-stage detectors such as Faster R-CNN. However, DETR needs huge computational resources for training and inference due to the high-resolution spatial input. In this paper, a novel transformer variant named the Adaptive Clustering Transformer (ACT) is proposed to reduce the computation cost for high-resolution input. ACT clusters the query features adaptively using Locality Sensitive Hashing (LSH) and approximates the query-key interaction with a prototype-key interaction. ACT reduces the quadratic O(N^2) complexity inside self-attention to O(NK), where K is the number of prototypes in each layer. ACT can be a drop-in module replacing the original self-attention module without any training. ACT achieves a good balance between accuracy and computation cost (FLOPs). The code is available as supplementary material to ease replication and verification of the experiments.
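The clustering-and-prototype step is the core of the method and is easy to picture in code. The following is a minimal NumPy sketch under simplifying assumptions (single head, hyperplane LSH for the clustering, bucket means as prototypes); it illustrates the O(NK) idea only and is not the authors' implementation, which operates on multi-head attention inside DETR.

```python
import numpy as np

def lsh_codes(x, n_bits=8, seed=0):
    # Sign-random-projection (hyperplane) LSH: one bit per random hyperplane,
    # packed into an integer bucket id.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(x.shape[1], n_bits))
    bits = (x @ planes) > 0
    return bits @ (1 << np.arange(n_bits))

def prototype_attention(Q, K, V, n_bits=8):
    # Replace softmax(Q K^T) V by prototype-key attention: queries falling into the
    # same LSH bucket share one prototype, so the cost is O(N*K) for K buckets.
    codes = lsh_codes(Q, n_bits)
    out = np.empty((Q.shape[0], V.shape[1]))
    for c in np.unique(codes):
        idx = np.where(codes == c)[0]
        proto = Q[idx].mean(axis=0)                  # prototype query of this bucket
        logits = proto @ K.T / np.sqrt(K.shape[1])
        w = np.exp(logits - logits.max())
        out[idx] = (w / w.sum()) @ V                 # broadcast to all bucket members
    return out
```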

109 citations


Journal ArticleDOI
TL;DR: A multi-dimensional, quality-ensemble-driven recommendation approach named RecLSH-TOPSIS, based on LSH and TOPSIS (Technique for Order Preference by Similarity to Ideal Solution), is proposed; it makes privacy-preserving edge service recommendations across multiple QoS dimensions.

76 citations


Journal ArticleDOI
TL;DR: A multi-stage model for anomaly detection is proposed that rectifies the problems of traditional DBSCAN by using kernel-based locality-sensitive hashing and a Davies–Bouldin-index-based K-medoids approach.

72 citations


Posted Content
TL;DR: This survey investigates current deep hashing algorithms, including deep supervised hashing and deep unsupervised hashing, and categorizes deep supervised hashing methods into pairwise, ranking-based, pointwise, and quantization methods according to how the similarities of the learned hash codes are measured.
Abstract: Nearest neighbor search finds the data points in a database whose distances to the query are smallest, a fundamental problem in various domains such as computer vision, recommendation systems, and machine learning. Hashing is one of the most widely used methods because of its computational and storage efficiency. With the development of deep learning, deep hashing methods show more advantages than traditional methods. In this paper, we present a comprehensive survey of deep hashing algorithms. Specifically, we categorize deep supervised hashing methods into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, classification-oriented preserving, and quantization, according to the manner of preserving the similarities. In addition, we introduce other topics such as deep unsupervised hashing and multi-modal deep hashing methods. We also present some commonly used public datasets and the schemes used to measure the performance of deep hashing algorithms. Finally, we discuss some potential research directions in the conclusion.

59 citations


Proceedings Article
30 Apr 2020
TL;DR: A new framework for building space partitions is developed that reduces the problem to balanced graph partitioning followed by supervised classification; the partitions obtained by Neural LSH consistently outperform those found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.
Abstract: Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces (Andoni et al. 2018b,c), we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner (Sanders and Schulz 2013) and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS (Aumuller et al. 2017), our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.
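A rough sketch of the two-stage recipe (partition the data, then learn a classifier that routes queries to partitions) is shown below. It is only illustrative: the paper partitions a kNN graph with KaHIP, whereas this sketch substitutes KMeans labels as a stand-in so the routing idea can be run end to end with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

def build_partitions(X, n_parts=16):
    # Stand-in partitioner: the paper derives balanced partitions from a kNN graph
    # with KaHIP; plain KMeans is used here only to produce labels to learn from.
    labels = KMeans(n_clusters=n_parts, n_init=10, random_state=0).fit_predict(X)
    router = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
    router.fit(X, labels)                                  # supervised classification step
    buckets = {p: np.where(labels == p)[0] for p in range(n_parts)}
    return router, buckets

def query(router, buckets, X, q, n_probe=3):
    # Probe the partitions the router considers most likely, then scan them exactly.
    probs = router.predict_proba(q[None])[0]
    top = router.classes_[np.argsort(probs)[::-1][:n_probe]]
    cand = np.concatenate([buckets[p] for p in top])
    return cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]
```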

53 citations


Journal ArticleDOI
TL;DR: A unique amplified locality-sensitive hashing (LSH)-based service recommendation method, SRAmplified-LSH, is proposed in the article; it guarantees a good balance between recommendation accuracy and efficiency while protecting users' privacy information.

45 citations


Journal ArticleDOI
Hong Zhong, Li Zhanfei, Jie Cui, Yue Sun, Lu Liu
TL;DR: This paper proposes an efficient dynamic multi-keyword fuzzy search scheme for encrypted cloud data to support dynamic file updates and demonstrates that this scheme is more efficient than existing similar schemes.

31 citations


Journal ArticleDOI
01 May 2020
TL;DR: Based on virtual hypersphere partitioning, a novel disk-based indexing and searching scheme, VHP, is proposed to answer c-ANN queries; it achieves up to 2x speedup in running time over state-of-the-art methods.
Abstract: Locality sensitive hashing (LSH) is a widely practiced c-approximate nearest neighbor (c-ANN) search algorithm in high-dimensional spaces. The state-of-the-art LSH-based algorithm searches an unbounded and irregular space to identify candidates, which jeopardizes efficiency. To address this issue, we introduce the concept of virtual hypersphere partitioning. The core idea is to impose a virtual hypersphere, centered at the query, in the original feature space and only examine points inside the hypersphere. The search space of a hypersphere is isotropic and bounded, and thus more efficient than the existing one. In practice, we use multiple physical hyperspheres with different radii in corresponding projection subspaces to emulate the single virtual hypersphere. We also develop a principled method to compute the hypersphere radii for a given success probability. Based on virtual hypersphere partitioning, we propose a novel disk-based indexing and searching scheme, VHP, to answer c-ANN queries. In the indexing phase, VHP stores LSH projections in independent B+-trees. To process a query, VHP keeps increasing the radii of the physical hyperspheres coordinately, which in effect amounts to enlarging the virtual hypersphere, to accommodate more candidates until the success probability is met. Rigorous theoretical analysis shows that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with a probability guarantee. Extensive experiments on a variety of datasets, including billion-scale ones, demonstrate that VHP achieves different tradeoffs between efficiency and accuracy, with up to 2x speedup in running time over state-of-the-art methods.
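The coordinated-radius idea can be sketched compactly. Below is a toy NumPy rendering under simplifying assumptions: one-dimensional LSH projections held in arrays rather than B+-trees, radii grown geometrically, and a fixed candidate-count stopping rule in place of the paper's success-probability criterion. Candidates returned this way would still be re-ranked by exact distance afterwards.

```python
import numpy as np

def build_projections(X, m=8, seed=0):
    # m one-dimensional LSH projections; the paper stores each in its own B+-tree.
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], m))
    return A, X @ A

def virtual_sphere_candidates(P, q_proj, r0=0.5, grow=1.5, min_cand=100):
    # Grow the projected radii in lock-step; enlarging all physical hyperspheres
    # together emulates enlarging one virtual hypersphere around the query.
    r = r0
    while True:
        inside = np.all(np.abs(P - q_proj) <= r, axis=1)
        if inside.sum() >= min_cand or inside.all():
            return np.where(inside)[0]        # candidate ids for exact re-ranking
        r *= grow

# A, P = build_projections(X); cands = virtual_sphere_candidates(P, q @ A)
```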

30 citations


Proceedings ArticleDOI
23 Aug 2020
TL;DR: This work presents Grale, a scalable method developed to address the problem of graph design for graphs with billions of nodes; it operates by fusing together different measures of (potentially weak) similarity to create a graph that exhibits high task-specific homophily between its nodes.
Abstract: How can we find the right graph for semi-supervised learning? In real world applications, the choice of which edges to use for computation is the first step in any graph learning process. Interestingly, there are often many types of similarity available to choose as the edges between nodes, and the choice of edges can drastically affect the performance of downstream semi-supervised learning systems. However, despite the importance of graph design, most of the literature assumes that the graph is static. In this work, we present Grale, a scalable method we have developed to address the problem of graph design for graphs with billions of nodes. Grale operates by fusing together different measures of (potentially weak) similarity to create a graph which exhibits high task-specific homophily between its nodes. Grale is designed for running on large datasets. We have deployed Grale in more than 20 different industrial settings at Google, including datasets which have tens of billions of nodes, and hundreds of trillions of potential edges to score. By employing locality sensitive hashing techniques, we greatly reduce the number of pairs that need to be scored, allowing us to learn a task specific model and build the associated nearest neighbor graph for such datasets in hours, rather than the days or even weeks that might be required otherwise. We illustrate this through a case study where we examine the application of Grale to an abuse classification problem on YouTube with hundreds of millions of items. In this application, we find that Grale detects a large number of malicious actors on top of hard-coded rules and content classifiers, increasing the total recall by 89% over those approaches alone.
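The role LSH plays here is to prune the roughly n^2 space of potential edges before the learned similarity model is applied. A small sketch of that pruning step is given below, using hyperplane LSH over feature vectors; the pair-scoring model itself and Grale's multi-similarity fusion are out of scope, and the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def lsh_candidate_pairs(X, n_tables=4, n_bits=12, seed=0):
    # Only pairs that collide in at least one table are handed to the (expensive)
    # task-specific similarity model, instead of scoring all ~n^2 pairs.
    rng = np.random.default_rng(seed)
    pairs = set()
    for _ in range(n_tables):
        planes = rng.normal(size=(X.shape[1], n_bits))
        codes = ((X @ planes) > 0) @ (1 << np.arange(n_bits))
        buckets = {}
        for i, c in enumerate(codes):
            buckets.setdefault(int(c), []).append(i)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs
```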

27 citations


Posted Content
TL;DR: In this paper, the authors propose to learn high-dimensional, sparse representations that have representational capacity similar to dense embeddings while being more efficient, because sparse matrix multiplication can be much faster than dense multiplication.
Abstract: Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. Approximate methods based on learning compact representations, such as locality sensitive hashing, product quantization, and PCA, have been widely explored for this problem. In this work, in contrast to learning compact representations, we propose to learn high dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets.
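The regularizer described above has a very small footprint in code. The PyTorch sketch below shows one plausible form of the relaxed-FLOPs penalty, estimating each dimension's activation probability by its mean absolute value over a batch; the exact relaxation and weighting used in the paper may differ, and lam is an assumed tuning knob.

```python
import torch

def flops_regularizer(batch_embeddings):
    # Continuous relaxation of retrieval FLOPs: if dimension j is "active" with
    # probability p_j, the expected multiplications per query-document pair scale
    # with sum_j p_j^2; p_j is estimated by the mean absolute activation in the batch.
    p = batch_embeddings.abs().mean(dim=0)
    return (p ** 2).sum()

# total_loss = task_loss + lam * flops_regularizer(emb)   # lam: sparsity/accuracy knob
```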

25 citations


Proceedings ArticleDOI
20 Apr 2020
TL;DR: A novel and easy-to-implement disk-based method named R2LSH is proposed to answer ANN queries in high-dimensional spaces; rigorous theoretical analysis reveals that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with a probability guarantee.
Abstract: Locality sensitive hashing (LSH) is a widely practiced c-approximate nearest neighbor (c-ANN) search algorithm because of its appealing theoretical guarantee and empirical performance. However, available LSH-based solutions do not achieve a good balance between cost and quality because of certain limitations in their index structures. In this paper, we propose a novel and easy-to-implement disk-based method named R2LSH to answer ANN queries in high-dimensional spaces. In the indexing phase, R2LSH maps data objects into multiple two-dimensional projected spaces. In each space, a group of B+-trees is constructed to characterize the corresponding data distribution. In the query phase, by setting a query-centric ball in each projected space and using a dynamic counting technique, R2LSH efficiently determines candidates and returns query results with the required quality. Rigorous theoretical analysis reveals that the proposed algorithm supports c-ANN search for arbitrarily small c ≥ 1 with probability guarantee. Extensive experiments on real datasets verify the superiority of R2LSH over state-of-the-art methods.

Proceedings Article
01 Jan 2020
TL;DR: The algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters; SMYRF can be used interchangeably with dense attention before and after training.
Abstract: We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$, where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. On the contrary, prior fast attention methods impose constraints (e.g. queries and keys share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using $50\%$ less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens on BigGAN on CelebA-HQ.
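The balanced-clustering mechanism is the part that is simplest to illustrate. The NumPy sketch below hashes queries and keys with a shared random projection, sorts them, splits both into equal-size groups, and attends only within matched groups; SMYRF's asymmetric transformations and adaptive LSH scheme are omitted, so treat this as the general balanced-cluster pattern rather than the paper's method.

```python
import numpy as np

def balanced_cluster_attention(Q, K, V, n_clusters=8, seed=0):
    # Hash with a shared random projection, sort queries and keys by hash value,
    # split both into equal-size groups, and attend only within matched groups.
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(size=d)
    q_groups = np.array_split(np.argsort(Q @ a), n_clusters)
    k_groups = np.array_split(np.argsort(K @ a), n_clusters)
    out = np.zeros((n, V.shape[1]))
    for qi, ki in zip(q_groups, k_groups):
        logits = Q[qi] @ K[ki].T / np.sqrt(d)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        out[qi] = (w / w.sum(axis=1, keepdims=True)) @ V[ki]
    return out
```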

Journal ArticleDOI
01 May 2020
TL;DR: This work proposes indexable distance estimating codes (iDEC), a new solution framework for ANN that extends and improves the locality-sensitive hashing framework in a fundamental and systematic way, and discovers deep connections between Error-Estimating Codes (EEC), LSH, and iDEC.
Abstract: Approximate Nearest Neighbor (ANN) search is a fundamental algorithmic problem, with numerous applications in many areas of computer science. In this work, we propose indexable distance estimating codes (iDEC), a new solution framework to ANN that extends and improves the locality sensitive hashing (LSH) framework in a fundamental and systematic way. Empirically, an iDEC-based solution has a low index space complexity of O(n) and can achieve a low average query time complexity of approximately O(log n). We show that our iDEC-based solutions for ANN in Hamming and edit distances outperform the respective state-of-the-art LSH-based solutions for both in-memory and external-memory processing. We also show that our iDEC-based in-memory ANN-H solution is more scalable than all existing solutions. We also discover deep connections between Error-Estimating Codes (EEC), LSH, and iDEC.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: A novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework is proposed with a theoretical guarantee; the experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.
Abstract: Locality-Sensitive Hashing (LSH) is one of the most popular methods for c-Approximate Nearest Neighbor Search (c-ANNS) in high-dimensional spaces. In this paper, we propose a novel LSH scheme based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH) with a theoretical guarantee. We introduce a novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for k-LCCS search. The insight of the LCCS search framework is that close data objects will have a longer LCCS than far-apart ones with high probability. LCCS-LSH is LSH-family-independent, and it supports c-ANNS with different kinds of distance metrics. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.
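For readers unfamiliar with the LCCS primitive, the brute-force check below shows one plausible reading of the definition used in the abstract: the longest run of positions, taken circularly from a common starting offset, on which two equal-length strings agree. The paper's Circular Shift Array computes this far more efficiently; this quadratic version is only for intuition.

```python
def lccs(x, y):
    # One reading of the Longest Circular Co-Substring: the longest run of positions,
    # taken circularly and starting at the same offset in both strings, on which
    # x and y agree. Close objects agree on long runs with high probability.
    assert len(x) == len(y)
    m = len(x)
    best = 0
    for start in range(m):
        run = 0
        while run < m and x[(start + run) % m] == y[(start + run) % m]:
            run += 1
        best = max(best, run)
    return best

# lccs("10110100", "10010101") -> 4, the longest agreeing circular run
```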

Journal ArticleDOI
TL;DR: A transfer learning-based HIF detection method is proposed under a cloud-edge collaboration framework of the Internet of Things, which can solve the problem of insufficient data by integrating historical data from multiple distribution networks.
Abstract: High impedance faults (HIFs) in distribution networks are hard to describe and detect precisely because of the complexity and randomness of their features. Therefore, traditional feature analysis methods may lack sufficient reliability and generalization, which makes data-based methods a more appropriate option. However, according to previous statistical analyses, in practical scenarios, only a small quantity of historical HIF data (less than 20%) can be recorded and utilized. In this article, a transfer learning-based HIF detection method is proposed under a cloud-edge collaboration framework of the Internet of Things, which can solve the problem of insufficient data by integrating historical data from multiple distribution networks. Through the cloud-edge collaboration framework, all features from different distribution networks are first integrated to form a basic cloud convolutional neural network model for HIF detection. The features are extracted and updated by edge computers based on the accurate synchronous measurements provided by distribution-level phasor measurement units. To unify the data scales of the different distribution networks, principal component analysis is adopted during feature extraction. Specific to each distribution network, the target HIF detection model is transferred from the basic cloud model by fine-tuning. Furthermore, a data augmentation method based on locality sensitive hashing is proposed to improve the performance of the transferred model. The proposed HIF detection method can be operated in both online and offline modes. The performance was verified on seven different distribution networks in numerical simulations and on one practical experimental distribution network.


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A hashing function for face template protection is presented that improves the correctness of existing algorithms while maintaining security; it reaches the REQ-WBP (Weak Biometric Privacy) security level, which implies irreversibility.
Abstract: In this paper, we present a hashing function for the application of face template protection, which improves the correctness of existing algorithms while maintaining the security simultaneously. The novel architecture is constructed from four components: a self-defined concept called padding people, Random Fourier Features, Support Vector Machine, and Locality Sensitive Hashing. The proposed method is trained, with one-shot and multi-shot enrollment, to encode the user's biometric data to a predefined output with high probability. The predefined hashing output is cryptographically hashed and stored as a secure face template. Predesigning outputs ensures the strict requirements of biometric cryptosystems, namely randomness and unlinkability. We prove that our method reaches the REQ-WBP (Weak Biometric Privacy) security level, which implies irreversibility. The efficacy of our approach is evaluated on the widely used CMU-PIE, FEI, and FERET databases; our matching performance achieves a 100% genuine acceptance rate at 0% false acceptance rate for all three databases and enrollment types. To our knowledge, our matching results outperform most state-of-the-art results.

Journal ArticleDOI
TL;DR: A scalable mining algorithm to discover contextual outliers in relevant subspaces, built on the MapReduce programming model and running on a Hadoop cluster, is proposed; the experimental results validate the effectiveness, interpretability, scalability, and extensibility of the algorithm.
Abstract: In this paper, we propose a scalable mining algorithm to discover contextual outliers using relevant subspaces. We develop the mining algorithm using the MapReduce programming model running on a Hadoop cluster. Relevant subspaces, which effectively capture the local distribution of various datasets, are quantified using local sparseness of attribute dimensions. We design a novel way of calculating local outlier factors in a relevant subspace with the probability density of local datasets; this new approach can effectively reflect the outlier degree of a data object that does not satisfy the distribution of the local dataset in the relevant subspace. Attribute dimensions of a relevant subspace and local outlier factors are expressed as vital contextual information, which improves the interpretability of outliers. Importantly, the N data objects with the largest local outlier factor values are categorized as contextual outliers in our solution. To this end, our scalable mining algorithm, which incorporates the locality sensitive hashing distributed strategy, is implemented on a Hadoop cluster. The experimental results validate the effectiveness, interpretability, scalability, and extensibility of the algorithm using both synthetic data and stellar spectral data as experimental datasets.

Proceedings ArticleDOI
20 Apr 2020
TL;DR: This work proposes a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data, and designs a novel hashing technique, called Single Pass Signature (SPS), that avoids the multiple passes over the data required by conventional LSH indexing.
Abstract: Scalable subsequence matching is critical for supporting analytics on big time series, from mining and prediction to hypothesis testing. However, state-of-the-art subsequence matching techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but also the query response time deteriorates quickly as the length of the query subsequence exceeds several hundred data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash functions that perform multiple passes over the data and thus is impractical for big time series. In this work, we propose a lightweight distributed indexing framework, called ChainLink, that supports approximate kNN queries over TB-scale time series data. As a foundation of ChainLink, we design a novel hashing technique, called Single Pass Signature (SPS), that successfully tackles the above problem. In particular, we prove theoretically and demonstrate experimentally that the similarity proximity of the indexed subsequences is preserved by our proposed single-pass SPS scheme. Leveraging this SPS innovation, ChainLink then adopts a three-step approach for scalable index building: (1) in-place data re-organization within each partition to enable efficient record-level random access to all subsequences, (2) parallel building of hash-based local indices on top of the re-organized data using our SPS scheme for efficient search within each partition, and (3) efficient aggregation of the local indices to construct a centralized yet highly compact global index for effective pruning of irrelevant partitions during query processing. ChainLink achieves the above three steps in one single map-reduce process. Our experimental evaluation shows that ChainLink indices are compact at less than 2% of dataset size while state-of-the-art index sizes tend to be almost the same size as the dataset. Better still, ChainLink is up to 2 orders of magnitude faster in its index construction time compared to state-of-the-art techniques, while improving both the final query response time by up to 10 fold and the result accuracy by 15%.

Journal ArticleDOI
01 Aug 2020
TL;DR: This tutorial reviews exact and approximate methods such as cover trees, locality sensitive hashing, product quantization, and proximity graphs; it also discusses the selectivity estimation problem and shows how researchers are bringing state-of-the-art ML techniques to bear on it.
Abstract: Similarity query processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications. Recently, embedding and auto-encoding methods as well as pre-trained models have gained popularity. They basically deal with high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing for high-dimensional data. Meanwhile, new techniques have emerged to tackle this long-standing problem theoretically and empirically. In this tutorial, we summarize existing solutions, especially recent advancements from both database (DB) and machine learning (ML) communities, and analyze their strengths and weaknesses. We review exact and approximate methods such as cover tree, locality sensitive hashing, product quantization, and proximity graphs. We also discuss the selectivity estimation problem and show how researchers are bringing in state-of-the-art ML techniques to address the problem. By highlighting the strong connections between DB and ML, we hope that this tutorial provides an impetus towards new ML for DB solutions and vice versa.

Book ChapterDOI
09 Jan 2020
TL;DR: The performance of the scheme is evaluated in terms of retrieval precision and search efficiency on distinct and similar image categories; experimental results show that the proposed scheme outperforms existing state-of-the-art SCBIR schemes.
Abstract: Secure Content-Based Image Retrieval (SCBIR) is gaining enormous importance due to its applications involving highly sensitive images comprising medical and personally identifiable data, such as clinical decision-making, biometric matching, and multimedia search. SCBIR on outsourced images is realized by generating a secure searchable index from features like color, shape, and texture in unencrypted images. We focus on enhancing the efficiency of SCBIR by combining two visual descriptors which serve as a modified feature descriptor. To improve the efficacy of search, pre-filter tables are generated using Locality Sensitive Hashing (LSH) and the resulting adjacent hash buckets are joined to enhance retrieval precision. The top-k relevant images are securely retrieved using the Secure k-Nearest Neighbors (kNN) algorithm. Performance of the scheme is evaluated based on retrieval precision and search efficiency on distinct and similar image categories. Experimental results show that the proposed scheme outperforms existing state-of-the-art SCBIR schemes.
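The joining of adjacent hash buckets amounts to probing codes near the query's own LSH code. The plaintext sketch below illustrates that idea only; the secure index, encryption, and secure kNN ranking used in the scheme are not modeled, and table and q_code are assumed to be an LSH bucket dictionary and the query's packed bit code built elsewhere.

```python
from itertools import combinations

def adjacent_bucket_codes(code, n_bits, max_flips=1):
    # "Joining adjacent hash buckets": also probe every bucket whose code differs
    # from the query's LSH code in at most max_flips bit positions, trading a
    # little extra lookup work for better retrieval precision.
    codes = {code}
    for k in range(1, max_flips + 1):
        for positions in combinations(range(n_bits), k):
            flipped = code
            for b in positions:
                flipped ^= (1 << b)
            codes.add(flipped)
    return codes

# candidates = set().union(*(table.get(c, []) for c in adjacent_bucket_codes(q_code, 16)))
```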

Proceedings ArticleDOI
14 Jun 2020
TL;DR: Fairness is considered in the sense of equal opportunity: all points within distance r from the query should have the same probability of being returned (a problem first studied in the low-dimensional case by Hu, Qiao, and Tao); among other results, a data structure is developed for fair similarity search under inner product that requires nearly linear space and exploits locality-sensitive filters.
Abstract: Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r>0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.
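The black-box construction can be conveyed with a short rejection-sampling sketch: pick one of the query's LSH buckets proportionally to its size, pick a point inside it uniformly, and accept with probability inversely proportional to how many of the query's buckets contain that point. The interface below (tables as dictionaries, hash_fns as callables) is assumed for illustration, and the final filter to points actually within distance r is left to the caller.

```python
import random

def uniform_near_sample(tables, hash_fns, q):
    # Fair sampling over the union of q's LSH buckets: choose a bucket proportionally
    # to its size, a point uniformly inside it, then accept with probability 1/deg(p),
    # where deg(p) is the number of q's buckets containing p. Every colliding point
    # is then returned with the same probability.
    buckets = [t.get(h(q), []) for t, h in zip(tables, hash_fns)]
    sizes = [len(b) for b in buckets]
    if sum(sizes) == 0:
        return None
    while True:
        j = random.choices(range(len(buckets)), weights=sizes)[0]
        p = random.choice(buckets[j])
        deg = sum(p in b for b in buckets)
        if random.random() < 1.0 / deg:
            return p
```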

Journal ArticleDOI
TL;DR: This work presents a scalable FPGA-based system architecture to accelerate the comparison of binary strings, optimized for high throughput using hundreds of computing elements arranged in a systolic array.
Abstract: This paper is concerned with Field Programmable Gate Arrays (FPGA)-based systems for energy-efficient high-throughput string comparison. Modern applications which involve comparisons across large data sets, such as large sequence sets in molecular biology, are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range of 16 to 2048 bits, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH), one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O(m*n) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high throughput using hundreds of computing elements arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second of 512-bit signatures was achieved, using a design with 384 parallel processing elements and a clock frequency of 200 MHz. This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design executed on an NVIDIA GTX1060, it performs nearly five times faster.
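For orientation, the computational kernel being accelerated is just an XOR-and-popcount per signature pair, repeated m*n times. A few lines of plain Python make the workload concrete; the FPGA design parallelizes this scan across hundreds of systolic processing elements.

```python
def hamming(a, b):
    # Signatures held as Python ints (e.g. 512-bit); XOR then population count.
    return bin(a ^ b).count("1")          # int.bit_count() on Python 3.10+

def scan(query_sig, db_sigs, k=10):
    # The m*n comparison load the FPGA parallelizes, done here as a serial scan
    # for a single query signature.
    order = sorted(range(len(db_sigs)), key=lambda i: hamming(query_sig, db_sigs[i]))
    return order[:k]
```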

Journal ArticleDOI
TL;DR: The software msSLASH is presented, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique; it is demonstrated that the algorithm significantly reduces the number of spectral comparisons and, as a result, achieves a 2-9X speedup over the existing spectral library searching algorithm SpectraST.
Abstract: With the accumulation of MS/MS spectra collected in spectral libraries, the spectral library searching approach emerges as an important approach for peptide identification in proteomics, complementary to the commonly used protein database searching approach, in particular for the proteomic analyses of well-studied model organisms, such as human. Existing spectral library searching algorithms compare a query MS/MS spectrum with each spectrum in the library with matched precursor mass and charge state, which may become computationally intensive with the rapidly growing library size. Here, the software msSLASH, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique, is presented. The algorithm first converts the library and query spectra into bit-strings using LSH functions, and then computes the similarity between the spectra with highly similar bit-string. Using the spectral library searching of large real-world MS/MS spectra datasets, it is demonstrated that the algorithm significantly reduced the number of spectral comparisons, and as a result, achieved 2-9X speedup in comparison with existing spectral library searching algorithm SpectraST. The spectral searching algorithm is implemented in C/C++, and is ready to be used in proteomic data analyses.

Posted Content
TL;DR: This work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner and shows that BioHash outperforms previously published benchmarks for various hashing methods.
Abstract: The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.
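To make the "sparse expansive" code format concrete, here is a small NumPy sketch in the spirit of FlyHash: a sparse binary random projection into a much higher-dimensional space followed by a winner-take-all step that keeps only the top-k activations. BioHash keeps this output format but learns the projection with a biologically plausible plasticity rule, which is not reproduced here; all parameter values are illustrative.

```python
import numpy as np

def flyhash(X, expansion=20, k=32, density=0.1, seed=0):
    # Sparse expansive hashing: sparse binary random projection to a much higher
    # dimension, then winner-take-all keeps only the k largest activations per row.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = (rng.random((d, expansion * d)) < density).astype(float)
    A = X @ W
    H = np.zeros_like(A)
    top = np.argpartition(A, -k, axis=1)[:, -k:]
    np.put_along_axis(H, top, 1.0, axis=1)
    return H                              # sparse, high-dimensional binary codes
```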

Book ChapterDOI
30 Sep 2020
TL;DR: This work presents a novel index structure called radius-optimized Locality Sensitive Hashing (roLSH), and extensive experimental analysis on real datasets shows the performance benefit of roLSH over existing state-of-the-art LSH techniques.
Abstract: Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate nearest neighbors. Existing state-of-the-art LSH techniques mainly focus on minimizing the total number of I/Os while sacrificing the overall processing time. The main time-consuming process in LSH techniques is the process of finding neighboring points in projected spaces. We present a novel index structure called radius-optimized Locality Sensitive Hashing (roLSH). With the help of sampling techniques and Neural Networks, we present two techniques to find neighboring points in projected spaces efficiently, without sacrificing the accuracy of the results. Our extensive experimental analysis on real datasets shows the performance benefit of roLSH over existing state-of-the-art LSH techniques.


Journal ArticleDOI
TL;DR: An efficient mechanism is provided for detecting plagiarism in repositories of Model-Driven Engineering (MDE) assignments, based on adapting Locality Sensitive Hashing, an approximate nearest neighbor search technique, to the modeling technical space.
Abstract: Reports suggest plagiarism is a common occurrence in universities. While plagiarism detection mechanisms exist for textual artifacts, this is less so for non-code related ones such as software desi...

Proceedings ArticleDOI
01 Mar 2020
TL;DR: In this paper, an implicit self-regularization is proposed that pushes the mean and variance of filter angles in a network towards 90° and 0° simultaneously to achieve (near) orthogonality among the filters, without using any other explicit regularization.
Abstract: In this paper, we investigate the empirical impact of orthogonality regularization (OR) in deep learning, either solo or collaboratively. Recent works on OR showed some promising results on the accuracy. In our ablation study, however, we do not observe such significant improvement from existing OR techniques compared with the conventional training based on weight decay, dropout, and batch normalization. To identify the real gain from OR, inspired by the locality sensitive hashing (LSH) in angle estimation, we propose to introduce an implicit self-regularization into OR to push the mean and variance of filter angles in a network towards 90° and 0° simultaneously to achieve (near) orthogonality among the filters, without using any other explicit regularization. Our regularization can be implemented as an architectural plug-in and integrated with an arbitrary network. We reveal that OR helps stabilize the training process and leads to faster convergence and better generalization.
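The objective, a mean pairwise filter angle near 90° with small variance, can be written down directly as a penalty. The PyTorch sketch below is an explicit-loss rendering of that objective for intuition only; the paper instead realizes it implicitly as an architectural plug-in, so this is a stand-in rather than the authors' mechanism.

```python
import math
import torch

def filter_angle_regularizer(weight):
    # Drive the mean pairwise angle between filters toward 90 degrees and its
    # variance toward 0, computed from the off-diagonal cosine similarities.
    W = torch.nn.functional.normalize(weight.flatten(1), dim=1)   # (filters, fan_in)
    cos = (W @ W.t()).clamp(-0.999, 0.999)
    mask = ~torch.eye(cos.size(0), dtype=torch.bool, device=cos.device)
    ang = torch.acos(cos[mask])                                   # pairwise angles, radians
    return (ang.mean() - math.pi / 2) ** 2 + ang.var()
```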

Journal ArticleDOI
TL;DR: The proposed conLSH-based aligner is compared with rHAT, which is popularly used for aligning SMRT reads, and is found to outperform it comprehensively in both speed and memory requirements.