Showing papers on "Locality-sensitive hashing published in 2019"


Journal ArticleDOI
TL;DR: This paper proposes a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing.
Abstract: Hashing methods have been widely used for efficient similarity retrieval on large scale image database. Traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because the deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information, but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing approach, to perform more effective hash function learning by simultaneously preserving semantic similarity and underlying data structures. The main contributions are as follows: 1) We propose a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing. 2) A semi-supervised deep hashing network is designed to extensively exploit both labeled and unlabeled data, in which we propose an online graph construction method to benefit from the evolving deep features during training to better capture semantic neighbors. To the best of our knowledge, the proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Experimental results on five widely-used data sets show that our proposed approach outperforms the state-of-the-art hashing methods.
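To make the shape of such a semi-supervised objective concrete, the sketch below combines a supervised pairwise term over labeled code pairs with a graph-embedding term over all (labeled and unlabeled) points. It is a minimal NumPy illustration of this kind of loss, not the authors' exact formulation or network; the function name, the inner-product form of the supervised term, and the weighting parameter lam are all assumptions.

```python
import numpy as np

def semi_supervised_hash_loss(B, labeled_pairs, S, W, lam=0.1):
    """Toy semi-supervised hashing objective (illustrative, not the paper's exact loss).

    B: (n, c) relaxed hash codes in [-1, 1] produced by the network
    labeled_pairs: list of (i, j) index pairs with known semantic labels
    S: dict mapping (i, j) -> 1 (semantically similar) or 0 (dissimilar)
    W: (n, n) neighborhood-graph weights over labeled and unlabeled points
    lam: weight of the unsupervised embedding term
    """
    n, c = B.shape
    # Empirical (supervised) term: push code inner products toward the labels.
    emp = sum((B[i] @ B[j] / c - S[(i, j)]) ** 2 for i, j in labeled_pairs)
    # Embedding (unsupervised) term: graph neighbors should get nearby codes.
    emb = sum(W[i, j] * np.sum((B[i] - B[j]) ** 2)
              for i in range(n) for j in range(n) if W[i, j] > 0)
    return emp + lam * emb

# Tiny synthetic check: 6 items, 16-bit relaxed codes, a sparse neighbor graph.
rng = np.random.default_rng(0)
B = np.tanh(rng.normal(size=(6, 16)))
pairs, S = [(0, 1), (2, 3)], {(0, 1): 1, (2, 3): 0}
W = (rng.random((6, 6)) > 0.7).astype(float)
print(semi_supervised_hash_loss(B, pairs, S, W))
```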

140 citations


Journal ArticleDOI
01 Feb 2019-Energy
TL;DR: The simulation results show that the proposed combination model based on a linear kernel function outperforms all the other comparison models from 1-step to 3-step forecasting and provides a promising and effective alternative for short-term wind power prediction.

78 citations


Journal ArticleDOI
TL;DR: This paper has generated cancelable IrisCode features, coined as locality sampled code (LSC), which simultaneously provides strong security guarantees and satisfactory system performance, and formally analyzed the security guarantees of non-invertibility, revocability, and unlinkability.
Abstract: Iris-based biometric models are widely recognized to be one of the most accurate forms for authenticating individual identities. Features extracted from the captured iris images (known as IrisCodes) conventionally get stored in their native format over a data repository. However, from a security aspect, the stored templates are highly vulnerable to a wide spectrum of adversarial attack forms. The study in this paper addresses this issue by introducing a privacy-preserving and secure biometric scheme based on the notion of locality sensitive hashing (LSH). In this paper, we have generated cancelable IrisCode features, coined as locality sampled code (LSC), which simultaneously provides strong security guarantees and satisfactory system performance. The functionality of our proposed framework pivots around the fact that intra-class IrisCode samples are “close” to each other, due to which they hash to the same location. Alternatively, the inter-class IrisCodes features are comparatively dissimilar and consequently hash to different locations. We have rigorously examined the intrinsic properties of the LSCs by estimating the intra-class and inter-class collision probabilities for two distinct IrisCodes. Furthermore, we have formally analyzed the security guarantees of non-invertibility, revocability, and unlinkability in our model by establishing various bounds on the adversarial success probability. Extensive empirical tests on the CASIAv3 and IITD benchmark iris databases demonstrate the superior performance of our proposed model, for which we have obtained the best EERs of 0.105% and 1.4%, respectively.
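The paper's locality sampled codes involve a dedicated construction, but the collision behavior they rely on can be illustrated with plain bit-sampling LSH for binary codes under Hamming distance: intra-class codes that differ on only a few bits collide in most tables, while dissimilar codes rarely do. The sketch below is a generic illustration of that effect with made-up code lengths and table counts, not the LSC scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bit_samplers(code_len, bits_per_hash, num_tables):
    """One random subset of bit positions per hash table (generic bit-sampling LSH)."""
    return [rng.choice(code_len, size=bits_per_hash, replace=False)
            for _ in range(num_tables)]

def hash_code(binary_code, samplers):
    # Each hash value is the tuple of sampled bits; two codes collide in a table
    # with probability (1 - hamming_fraction) ** bits_per_hash.
    return [tuple(binary_code[idx]) for idx in samplers]

# Two intra-class samples: the probe differs from the enrolled code on ~5% of bits.
enrolled = rng.integers(0, 2, size=2048)
probe = enrolled.copy()
probe[rng.choice(2048, size=100, replace=False)] ^= 1

samplers = make_bit_samplers(2048, bits_per_hash=12, num_tables=20)
collisions = sum(a == b for a, b in zip(hash_code(enrolled, samplers),
                                        hash_code(probe, samplers)))
print(f"{collisions} of 20 tables collide for the genuine pair")
```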

47 citations


Journal ArticleDOI
TL;DR: The DENCAST system is proposed, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction).
Abstract: Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value < 0.05) state-of-the-art distributed regression methods, in both single and multi-target settings.

45 citations


Journal ArticleDOI
TL;DR: An LSH method, called Order Min Hash (OMH), is presented as a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences.
Abstract: MOTIVATION Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy. RESULTS We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH. AVAILABILITY AND IMPLEMENTATION The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
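A rough sketch of the idea as described above: like minHash, each sketch entry selects a few k-mers with the smallest hash values, but the selected k-mers are recorded in the order they occur in the sequence, so transpositions change the sketch even when k-mer content is identical. This is only an illustrative reading of the construction (parameter choices, tie handling and repeated k-mers are ignored); the reference implementation is at the URL above.

```python
import hashlib

def order_min_hash(seq, k=4, l=3, num_hashes=8):
    """Illustrative Order-Min-Hash-style sketch (not the reference implementation)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for h in range(num_hashes):
        def hval(mer, h=h):
            return hashlib.sha1(f"{h}:{mer}".encode()).hexdigest()
        # l distinct k-mers with the smallest hash values under this hash function...
        chosen = sorted(set(kmers), key=hval)[:l]
        # ...recorded in the order of their first occurrence in the sequence.
        sketch.append(tuple(sorted(chosen, key=kmers.index)))
    return sketch

a = order_min_hash("ACGTACGGTTACGT")
b = order_min_hash("ACGTACGGTTACGA")   # one substitution at the end
print(sum(x == y for x, y in zip(a, b)), "of 8 sketch entries agree")
```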

39 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed stratified sampling based clustering algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.
Abstract: Large-scale data analysis is a challenging and relevant task for present-day research and industry. As a promising data analysis tool, clustering is becoming more important in the era of big data. In large-scale data clustering, sampling is an efficient and widely used approximation technique. Recently, several sampling-based clustering algorithms have attracted considerable attention in large-scale data analysis owing to their efficiency. However, some of these existing algorithms have low clustering accuracy, whereas others have high computational complexity. To overcome these deficiencies, a stratified sampling based clustering algorithm for large-scale data is proposed in this paper. Its basic steps include: (1) obtaining a number of representative samples, via a stratified sampling scheme, from different strata formed by a locality sensitive hashing technique; (2) partitioning the chosen samples into different clusters using the fuzzy c-means clustering algorithm; (3) assigning the out-of-sample objects to their closest clusters via a data labeling technique. The performance of the proposed algorithm is compared with state-of-the-art sampling-based fuzzy c-means clustering algorithms on several large-scale data sets, including synthetic and real ones. The experimental results show that the proposed algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.
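A minimal sketch of that three-step pipeline, with random-hyperplane LSH buckets serving as strata. To keep it short, scikit-learn's k-means stands in for fuzzy c-means, and the sampling rate, number of hyperplanes and cluster count are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 16))                 # stand-in for a large data set

# (1) Form strata with random-hyperplane LSH: each sign pattern is one stratum,
#     then draw a proportional sample from every stratum.
planes = rng.normal(size=(8, X.shape[1]))
strata = (X @ planes.T > 0).astype(int) @ (1 << np.arange(8))
sample_idx = []
for key in np.unique(strata):
    members = np.flatnonzero(strata == key)
    take = max(1, int(0.01 * len(members)))        # ~1% per stratum
    sample_idx.extend(rng.choice(members, size=take, replace=False))
sample_idx = np.asarray(sample_idx)

# (2) Cluster only the sample (k-means here as a stand-in for fuzzy c-means).
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[sample_idx])

# (3) Label every out-of-sample object with its closest cluster.
labels = model.predict(X)
print(np.bincount(labels))
```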

37 citations


Proceedings ArticleDOI
08 Apr 2019
TL;DR: LAZO is a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set.
Abstract: Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n^2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to computing containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query, instead of only returning the candidate set; ii) it permits querying for a specific threshold on the fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
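LAZO's estimator rests on redefining Jaccard similarity so that set cardinalities enter the picture; the sketch below only illustrates the underlying set identity that makes this possible, namely that a MinHash estimate of JS plus the two cardinalities is enough to recover the intersection size and hence a containment (JC) estimate. The MinHash construction and the example sets are generic, not LAZO's index.

```python
import numpy as np

def minhash_signature(items, num_hashes=128, seed=7):
    """Plain MinHash signature over a set of hashable items."""
    rng = np.random.default_rng(seed)
    a, b = rng.integers(1, 2**31 - 1, size=(2, num_hashes))
    p = 2**31 - 1
    table = np.array([[(ai * hash(x) + bi) % p for ai, bi in zip(a, b)] for x in items])
    return table.min(axis=0)

def jaccard_estimate(sig_x, sig_y):
    return float(np.mean(sig_x == sig_y))

def containment_from_jaccard(js, n_x, n_y):
    # |X ∩ Y| = JS * (|X| + |Y|) / (1 + JS), so JC(X, Y) = |X ∩ Y| / |X|.
    intersection = js * (n_x + n_y) / (1.0 + js)
    return intersection / n_x

X = set(range(0, 1_000))                  # |X| = 1000
Y = set(range(500, 3_000))                # |Y| = 2500, 500 shared values
js = jaccard_estimate(minhash_signature(X), minhash_signature(Y))
print("JS estimate:", round(js, 3), " (true 0.167)")
print("JC(X, Y) estimate:", round(containment_from_jaccard(js, len(X), len(Y)), 3),
      " (true 0.5)")
```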

36 citations


Journal ArticleDOI
TL;DR: ‘low‐density’ locality sensitive hashing is introduced to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning, allowing for the discovery of novel lineages.
Abstract: Motivation Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce 'low-density' locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning. Results On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges. Availability and implementation Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal. Supplementary information Supplementary data are available at Bioinformatics online.

28 citations


Journal ArticleDOI
TL;DR: This paper introduces two contributions aiming to provide clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering, and proposes a scalable procedure to approximate the density gradient ascent.

27 citations


Journal ArticleDOI
TL;DR: An efficient approximation method based on locality sensitive hashing is proposed, which first retrieves candidate time series and then exploits their hash values to compute distance estimates for pruning; its benefits in terms of query efficiency are demonstrated when dealing with a collection of multivariate time series.

25 citations


Journal ArticleDOI
TL;DR: An improved matching technique based on an enhanced copy-move forgery detection (CMFD) pipeline via k-means clustering is proposed, which can enhance the detection accuracy in a significant manner and reduce the processing time with LSH-based matching.
Abstract: The goal of copy-move forgery is to manipulate the semantics of an image. This can be performed by cloning a region of an image and subsequently pasting it onto a different region within the same image. Accordingly, this paper proposes an improved matching technique based on an enhanced CMFD pipeline via a k-means clustering technique. By deploying k-means clustering to group the overlapping blocks, the matching step was carried out independently within each cluster to speed up the matching process. In addition, clustering the feature vectors allowed the matching process to identify the matches accurately. Thus, in order to test the enhanced pipeline, it was combined with Zernike moments and locality sensitive hashing (LSH). The experimental results showed that the proposed method can enhance the detection accuracy in a significant manner. On top of that, the proposed pipeline can reduce the processing time with LSH-based matching.
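A compact sketch of that pipeline on a synthetic image: overlapping blocks are described by simple feature vectors (raw pixel blocks here stand in for Zernike moments), k-means groups the blocks, and matching is done only inside each cluster; an exhaustive within-cluster comparison stands in for the LSH-based matching. Block size, stride, cluster count and the duplicate threshold are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
img = rng.random((64, 64))
img[40:48, 40:48] = img[8:16, 8:16]                  # simulate a copy-moved region

B = 8                                                # block size, stride 2
blocks, positions = [], []
for y in range(0, img.shape[0] - B + 1, 2):
    for x in range(0, img.shape[1] - B + 1, 2):
        blocks.append(img[y:y + B, x:x + B].ravel())  # stand-in for Zernike features
        positions.append((y, x))
blocks = np.array(blocks)

# Group the block features so matching only happens within each cluster.
labels = KMeans(n_clusters=16, n_init=5, random_state=0).fit_predict(blocks)

matches = []
for c in range(16):
    idx = np.flatnonzero(labels == c)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dy = abs(positions[i][0] - positions[j][0])
            dx = abs(positions[i][1] - positions[j][1])
            # Ignore overlapping blocks; flag (near-)identical distant ones.
            if max(dy, dx) > B and np.allclose(blocks[i], blocks[j], atol=1e-8):
                matches.append((positions[i], positions[j]))
print("suspect block pairs:", matches)
```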

Journal ArticleDOI
TL;DR: This paper formulates the visual summarization task as a co-clustering problem and proposes an efficient algorithm based on locality sensitive hashing (LSH) that can easily scale to large graphs under reasonable interactive time constraints that previous related methods cannot satisfy.
Abstract: Bipartite graphs model the key relations in many large-scale real-world data: customers purchasing items, legislators voting for bills, people's affiliation with different social groups, faults occurring in vehicles, etc. However, it is challenging to visualize large-scale bipartite graphs with tens of thousands or even more nodes or edges. In this paper, we propose a novel visual summarization technique for bipartite graphs based on the minimum description length (MDL) principle. The method simultaneously groups the two different sets of nodes and constructs aggregated bipartite relations with balanced granularity and precision. It addresses the key trade-off that often occurs for visualizing large-scale and noisy data: acquiring a clear and uncluttered overview while maximizing the information content in it. We formulate the visual summarization task as a co-clustering problem and propose an efficient algorithm based on locality sensitive hashing (LSH) that can easily scale to large graphs under reasonable interactive time constraints that previous related methods cannot satisfy. The method leads to the opportunity of introducing a visual analytics framework with multiple levels-of-detail to facilitate interactive data exploration. In the framework, we also introduce a compact visual design inspired by adjacency list representation of graphs as the building block for a small multiples display to compare the bipartite relations for different subsets of data. We showcase the applicability and effectiveness of our approach by applying it on synthetic data with ground truth and performing case studies on real-world datasets from two application domains including roll-call vote record analysis and vehicle fault pattern analysis. Interviews with experts in the political science community and the automotive industry further highlight the benefits of our approach.

Proceedings ArticleDOI
13 May 2019
TL;DR: This paper investigates fast approximation of three interaction-based neural ranking algorithms using Locality Sensitive Hashing (LSH), which accelerates query-document interaction computation by using a runtime cache with precomputed term vectors, and speeds up kernel calculation by taking advantage of limited integer similarity values.
Abstract: Interaction-based neural ranking has been shown to be effective for document search using distributed word representations. However, the time or space required is very expensive for online query processing with neural ranking. This paper investigates fast approximation of three interaction-based neural ranking algorithms using Locality Sensitive Hashing (LSH). It accelerates query-document interaction computation by using a runtime cache with precomputed term vectors, and speeds up kernel calculation by taking advantage of limited integer similarity values. This paper presents the design choices with cost analysis, and an evaluation that assesses efficiency benefits and relevance tradeoffs for the tested datasets.
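The system described above caches precomputed term fingerprints and exploits small integer similarity values; the generic mechanism behind that is signed random projection LSH, where the cosine between two term embeddings is recovered from the (integer) Hamming distance of short bit fingerprints. The sketch below shows only that mechanism, with made-up dimensions and without the paper's runtime cache or kernel approximations.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, BITS = 300, 64                      # embedding size, fingerprint length
planes = rng.normal(size=(BITS, DIM))    # shared random hyperplanes

def fingerprint(vec):
    """64-bit signed-random-projection fingerprint of a term embedding."""
    bits = (planes @ vec > 0).astype(np.uint64)
    return np.uint64(np.sum(bits << np.arange(BITS, dtype=np.uint64)))

def approx_cosine(fp_a, fp_b):
    """Cosine estimate from the small-integer Hamming distance of two fingerprints."""
    hamming = bin(int(fp_a ^ fp_b)).count("1")
    return np.cos(np.pi * hamming / BITS)

q = rng.normal(size=DIM)                  # a query term vector
d = q + 0.3 * rng.normal(size=DIM)        # a similar document term vector
true_cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
print("true cosine:", round(float(true_cos), 3),
      " LSH estimate:", round(float(approx_cosine(fingerprint(q), fingerprint(d))), 3))
```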

Journal ArticleDOI
TL;DR: This study proposes an approach called Randomized Distributed Hashing (RDH), which uses Locality Sensitive Hashing (LSH) in a distributed scheme and is promising for searching images in large datasets with multiple nodes.

Journal ArticleDOI
28 May 2019-Sensors
TL;DR: A visual localization approach based on place recognition that combines the powerful ConvNet features and localized image sequence matching and shows good performances even in the presence of appearance and illumination changes is proposed.
Abstract: Convolutional Network (ConvNet), with its strong image representation ability, has achieved significant progress in the computer vision and robotic fields. In this paper, we propose a visual localization approach based on place recognition that combines the powerful ConvNet features and localized image sequence matching. The image distance matrix is constructed based on the cosine distance of extracted ConvNet features, and then a sequence search technique is applied on this distance matrix for the final visual recognition. To improve computational efficiency, the locality sensitive hashing (LSH) method is applied to achieve real-time performance with minimal accuracy degradation. We present extensive experiments on four real-world data sets to evaluate each of the specific challenges in visual recognition. A comprehensive performance comparison of different ConvNet layers (each defining a level of features) considering both appearance and illumination changes is conducted. Compared with the traditional approaches based on hand-crafted features and single image matching, the proposed method shows good performance even in the presence of appearance and illumination changes.

Journal ArticleDOI
01 Apr 2019
TL;DR: This work introduces a fast and efficient method for dynamically clustering records in an RDF data management system, Tunable-LSH, which can auto-tune to achieve the aforementioned clustering objective with high accuracy even when the workloads change.
Abstract: The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of RDF data that are publicly available. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than what they were designed to handle. Consequently, there is a growing need for developing workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront, but as SPARQL queries are executed, we keep track of records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically and in constant-time where a record needs to be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve the aforementioned clustering objective with high accuracy even when the workloads change. Experimental evaluation of Tunable-LSH in an RDF data management system as well as in a standalone hashtable shows end-to-end performance gains over existing solutions.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This paper proposes a fast yet effective anomaly detection approach in multiple multi-dimensional data streams based on a combination of ideas, i.e., stream pre-processing, locality sensitive hashing and dynamic isolation forest, which achieves a magnitude increase in its efficiency compared with state-of-the-art approaches while maintaining competitive detection accuracy.
Abstract: Multiple multi-dimensional data streams are ubiquitous in the modern world, such as IoT applications, GIS applications and social networks. Detecting anomalies in such data streams in real-time is an important and challenging task. It is able to provide valuable information from data and thus assists decision-making. However, existing approaches for anomaly detection in multi-dimensional data streams have not properly considered the correlations among multiple multi-dimensional streams. Moreover, for multi-dimensional streaming data, online detection speed is often an important concern. In this paper, we propose a fast yet effective anomaly detection approach in multiple multi-dimensional data streams. This is based on a combination of ideas, i.e., stream pre-processing, locality sensitive hashing and dynamic isolation forest. Experiments on real datasets demonstrate that our approach achieves a magnitude increase in its efficiency compared with state-of-the-art approaches while maintaining competitive detection accuracy.

Journal ArticleDOI
TL;DR: A flexible and fast distributed video deduplication framework based on hash codes that is able to support the hash table indexing using any existing hashing algorithm in a distributed environment and can efficiently rank the candidate videos by exploring the similarities among the key frames over multiple tables using MapReduce strategy.
Abstract: The exponentially growing amount of video data being produced has led to tremendous challenges for video deduplication technology. Nowadays, many different deduplication approaches are being rapidly developed, but they are generally slow and their identification processes are somewhat inaccurate. To date, there has been little work studying a generic hash-based distributed framework and an efficient similarity ranking strategy for video deduplication. This paper proposes a flexible and fast distributed video deduplication framework based on hash codes. It is able to support hash table indexing using any existing hashing algorithm in a distributed environment and can efficiently rank the candidate videos by exploring the similarities among the key frames over multiple tables using a MapReduce strategy. Our experiments with a popular large-scale dataset demonstrate that the proposed framework can achieve satisfactory video deduplication performance.

Journal ArticleDOI
TL;DR: The proposed algorithms have better running time performance than standard LSH-based applications while keeping prediction accuracy within reasonable limits, and they have a large positive impact on aggregate diversity, which has recently become an important evaluation measure for recommender algorithms.
Abstract: Neighborhood-based collaborative filtering (CF) methods are widely used in recommender systems because they are easy to implement and highly effective. One of the significant challenges of these methods is the ability to scale with the increasing amount of data, since finding nearest neighbors requires a search over all of the data. Approximate nearest neighbor (ANN) methods eliminate this exhaustive search by only looking at the data points that are likely to be similar. Locality sensitive hashing (LSH) is a well-known technique for ANN search in high dimensional spaces. It is also effective in solving the scalability problem of neighborhood-based CF. In this study, we provide novel improvements to current LSH-based recommender algorithms and make a systematic evaluation of LSH in neighborhood-based CF. In addition, we conduct extensive experiments on real-life datasets to investigate various parameters of LSH and their effects on multiple metrics used to evaluate recommender systems. Our proposed algorithms have better running time performance than the standard LSH-based applications while keeping the prediction accuracy within reasonable limits. Also, the proposed algorithms have a large positive impact on aggregate diversity, which has recently become an important evaluation measure for recommender algorithms.
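To make the role of LSH in neighborhood-based CF concrete, the sketch below hashes user rating vectors with signed random projections and restricts the neighbor search for a prediction to users sharing a bucket in any table, instead of scanning all users. The table counts, bit widths, rating model and fallback rule are illustrative guesses, not the paper's tuned algorithms.

```python
import numpy as np

rng = np.random.default_rng(3)
R = rng.integers(0, 6, size=(5_000, 200)).astype(float)   # user x item ratings, 0 = unrated

NUM_TABLES, BITS = 6, 12
tables = []
for _ in range(NUM_TABLES):
    planes = rng.normal(size=(BITS, R.shape[1]))
    keys = (R @ planes.T > 0).astype(int) @ (1 << np.arange(BITS))
    buckets = {}
    for user, key in enumerate(keys):
        buckets.setdefault(int(key), []).append(user)
    tables.append((planes, buckets))

def candidate_neighbors(u):
    """Union of the user's buckets across tables, instead of a scan over all users."""
    cands = set()
    for planes, buckets in tables:
        key = int((R[u] @ planes.T > 0).astype(int) @ (1 << np.arange(BITS)))
        cands.update(buckets.get(key, []))
    cands.discard(u)
    return list(cands)

def predict(u, item):
    cands = candidate_neighbors(u)
    sims = np.array([R[u] @ R[v] / (np.linalg.norm(R[u]) * np.linalg.norm(R[v]) + 1e-9)
                     for v in cands])
    ratings = R[cands, item]
    rated = ratings > 0                          # keep only neighbors who rated the item
    if not rated.any():
        return R[u][R[u] > 0].mean()             # fall back to the user's mean rating
    return float(np.average(ratings[rated], weights=sims[rated]))

print(round(predict(0, 10), 2))
```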

Book ChapterDOI
12 Jun 2019
TL;DR: This paper proposes a fast kNN-based approach for Time Sensitive Anomaly Detection (kNN-TSAD), which can find outliers that present different behavior characteristics, including normal and abnormal characteristics, within different time intervals.
Abstract: Anomaly detection is an important data mining method aiming to discover outliers that show significant diversion from their expected behavior. A widely used criterion for determining outliers is based on the number of their neighboring elements, which are referred to as Nearest Neighbors (NN). Existing kNN-based Anomaly Detection (kNN-AD) algorithms cannot detect streaming outliers, which present time sensitive abnormal behavior characteristics in different time intervals. In this paper, we propose a fast kNN-based approach for Time Sensitive Anomaly Detection (kNN-TSAD), which can find outliers that present different behavior characteristics, including normal and abnormal characteristics, within different time intervals. The core idea of our proposal is that we combine the model of sliding window with Locality Sensitive Hashing (LSH) to monitor the distribution of streaming elements as well as the number of their Nearest Neighbors as time progresses. We use an ε-approximation scheme to implement the model of sliding window to compute Nearest Neighbors on the fly. We conduct extensive experiments to examine our approach for time sensitive anomaly detection using three real-world data sets. The results show that our approach can achieve significant improvement on recall and precision for anomaly detection within different time intervals. In particular, our approach achieves two orders of magnitude improvement on time consumption for streaming anomaly detection, when compared with traditional kNN-based anomaly detection algorithms, such as exact-Storm, approx-Storm, and MCOD, while it only uses 10% of their memory consumption.

Book ChapterDOI
05 Aug 2019
TL;DR: This paper proposes FRESH, an approximate and randomized approach for r-range search that leverages a locality sensitive hashing scheme for detecting candidate near neighbors of the query curve, and a subsequent pruning step based on a cascade of curve simplifications.
Abstract: This paper studies the r-range search problem for curves under the continuous Frechet distance: given a dataset S of n polygonal curves and a threshold r > 0, construct a data structure that, for any query curve q, efficiently returns all entries in S with distance at most r from q. We propose FRESH, an approximate and randomized approach for r-range search, that leverages a locality sensitive hashing scheme for detecting candidate near neighbors of the query curve, and a subsequent pruning step based on a cascade of curve simplifications. We experimentally compare FRESH to exact and deterministic solutions, and we show that high performance can be reached by suitably relaxing precision and recall.

Posted Content
TL;DR: In this paper, the authors proposed a scalable clustering algorithm based on Locality Sensitive Hashing (LSH) to approximate the density gradient ascent in mean shift clustering.
Abstract: In this paper we target the class of modal clustering methods where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is k-means clustering. Mean Shift clustering is a generalization of k-means clustering which computes arbitrarily shaped clusters, defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aiming to provide clustering algorithms with a linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate-sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation, written for the Spark/Scala ecosystem, is proposed. For all these considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.
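One way to read the approximate density gradient ascent: each Mean Shift update is computed over the points that share the query's LSH bucket rather than over the whole dataset. The sketch below uses standard p-stable (Euclidean) LSH for the buckets and a Gaussian kernel for the update; the bucket width, bandwidth and single-table indexing are illustrative simplifications, not the paper's procedure or its Spark implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 0.3, size=(500, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])       # three density modes

# Standard p-stable (Euclidean) LSH bucket key; a real index would use several tables.
BITS, WIDTH = 6, 1.0
A = rng.normal(size=(BITS, 2))
OFF = rng.uniform(0, WIDTH, size=BITS)

def bucket(p):
    return tuple(np.floor((A @ p + OFF) / WIDTH).astype(int))

index = {}
for i, p in enumerate(X):
    index.setdefault(bucket(p), []).append(i)

def mean_shift_step(p, bandwidth=0.5):
    """One gradient-ascent step using only the LSH-bucket neighbors of p."""
    idx = index.get(bucket(p), [])
    nbrs = X[idx] if idx else X                             # fall back to all points
    w = np.exp(-np.sum((nbrs - p) ** 2, axis=1) / (2 * bandwidth ** 2))
    return (w[:, None] * nbrs).sum(axis=0) / w.sum()

p = X[0].copy()
for _ in range(20):
    p = mean_shift_step(p)
print("final position (near a local mode):", np.round(p, 2))
```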

Journal ArticleDOI
Hangyu Li, Sarana Nutanong, Hong Xu, Chenyun Yu, Foryu Ha
TL;DR: A network-efficient solution called C2Net is proposed to improve the utilization of MapReduce combiners; it uses two graph partitioning schemes: minimum spanning tree for organizing LSH bucket replication and spectral clustering for runtime collision counting task scheduling.
Abstract: Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation involves identifying pairs (p, q) in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost estimation frequently turns into a bottleneck in a distributed processing environment, thus resulting in a challenge of achieving faster and more efficient similarity join. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20 percent data reduction and 50 percent reduction in shuffle time.

Posted Content
TL;DR: This paper proposes a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy, and proposes efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines.
Abstract: Many emerging use cases of data mining and machine learning operate on large datasets with data from heterogeneous sources, specifically with both sparse and dense components. For example, dense deep neural network embedding vectors are often used in conjunction with sparse textual features to provide high dimensional hybrid representation of documents. Efficient search in such hybrid spaces is very challenging as the techniques that perform well for sparse vectors have little overlap with those that work well for dense vectors. Popular techniques like Locality Sensitive Hashing (LSH) and its data-dependent variants also do not give good accuracy in high dimensional hybrid spaces. Even though hybrid scenarios are becoming more prevalent, currently there exist no efficient techniques in literature that are both fast and accurate. In this paper, we propose a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy. We also propose efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines. The performance of the proposed method is demonstrated on several datasets including a very large scale industrial dataset containing one billion vectors in a billion dimensional space, achieving over 10x speedup and higher accuracy against competitive baselines.

Journal ArticleDOI
TL;DR: This study utilizes the locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation and discovers the optimum number of hash functions in each band of the LSH scheme based on the candidate similarity threshold.
Abstract: In this study, we propose a scalable approach for automatically identifying similar candidate instance pairs in very large datasets. Efficient candidate pair generation is essential to many computational problems involving the calculation of instance similarities. Calculating similarities of instances with a large number of properties and efficiently matching a large number of similar instances in a scalable way are two significant bottlenecks of candidate instance pair generation. In our approach, we utilize the locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation. Based on the candidate similarity threshold, our algorithm automatically discovers the optimum number of hash functions in each band of LSH. Moreover, we evaluated the scalability of our approach and its effectiveness in the instance matching task using very large real-world datasets.
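The tuning described above rests on the standard MinHash banding relation: with b bands of r hash functions each, a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b, and the S-curve threshold sits near (1/b)^(1/r). The sketch below simply searches the divisors of the signature length for the r that best matches a target threshold; it is the textbook relation such tuning builds on, not the paper's algorithm itself.

```python
import math

def choose_band_config(num_hashes, threshold):
    """Pick rows-per-band r (and bands b = num_hashes // r) so that the LSH
    S-curve threshold (1/b) ** (1/r) is as close as possible to the target."""
    best = None
    for r in range(1, num_hashes + 1):
        if num_hashes % r:
            continue
        b = num_hashes // r
        t = (1.0 / b) ** (1.0 / r)
        if best is None or abs(t - threshold) < abs(best[2] - threshold):
            best = (r, b, t)
    return best

def candidate_probability(sim, r, b):
    """Probability that a pair with similarity `sim` becomes a candidate."""
    return 1.0 - (1.0 - sim ** r) ** b

r, b, t = choose_band_config(num_hashes=128, threshold=0.7)
print(f"r={r} hash functions per band, b={b} bands, S-curve threshold ~{t:.2f}")
for s in (0.5, 0.7, 0.9):
    print(f"  sim={s:.1f} -> candidate probability {candidate_probability(s, r, b):.3f}")
```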

Book ChapterDOI
09 Jan 2019
TL;DR: This work has leveraged the Suffix tree structure and Locality Sensitive Hashing to linearly cluster malicious programs and to reduce the number of signatures significantly.
Abstract: Security threats due to malicious executables are getting more serious. A lot of researchers are interested in combating malware attacks. In contrast, malicious users aim to increase the usage of polymorphic and metamorphic malware in order to increase the analysis cost and prevent being identified by anti-malware tools. Due to the intuitive similarity between different polymorphic variants of a malware family, clustering is an effective approach to deal with this problem; clustering accordingly is able to reduce the number of signatures. Therefore, we have leveraged the suffix tree structure and Locality Sensitive Hashing (LSH) to linearly cluster malicious programs and to reduce the number of signatures significantly.

Journal ArticleDOI
TL;DR: The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously.
Abstract: Similarity search is an essential operation in such domains as data mining and content-based information retrieval. This simple operation causes a considerable burden when the number of data records grows large, especially in big data applications. At the sacrifice of accuracy, approximate methods for finding similar items have been developed to deliver effective services in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. Various algorithms have been proposed for locality sensitive hashing, which basically try to narrow down the candidate data set to be examined. The candidate data set does not always contain all the data similar to the query, and thus the search results are approximate. An increase in the size of the candidate set improves the recall of similar items, but it degrades the processing speed. This paper is concerned with a method to increase the recall rate while not entailing significant cost. The method basically uses a random hyperplane partitioning technique to create buckets to which data objects are distributed. The nearest neighbors located on the other side of such hyperplanes can be false negatives when only the bucket to which the query belongs is examined for finding similar neighbors. The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously. The over-sized buckets are further split by adding additional hyperplanes to control the bucket sizes. To improve the processing speed, the algorithm is realized in the MapReduce paradigm on a Hadoop cluster. Some experimental results are presented to show its applicability.
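A small sketch of that idea under simplifying assumptions: each random hyperplane normally contributes one bit to the bucket key, but a point whose normalized projection falls within a margin of the hyperplane is replicated into both child buckets, so near neighbors separated by a hyperplane still share a bucket. The margin, plane count and query are illustrative, and the bucket-splitting and MapReduce parts of the method are omitted.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(9)
DIM, NUM_PLANES, MARGIN = 32, 8, 0.05     # MARGIN defines the hyperplane "vicinity"

planes = rng.normal(size=(NUM_PLANES, DIM))
planes /= np.linalg.norm(planes, axis=1, keepdims=True)

def bucket_keys(x):
    """All bucket keys a point belongs to; points within MARGIN of a hyperplane
    are treated as lying on both of its sides, so they land in several buckets."""
    proj = planes @ x / np.linalg.norm(x)
    sides = [(0, 1) if abs(p) < MARGIN else (int(p > 0),) for p in proj]
    return set(product(*sides))

index = {}
X = rng.normal(size=(10_000, DIM))
for i, x in enumerate(X):
    for key in bucket_keys(x):
        index.setdefault(key, set()).add(i)

q = X[0] + 0.01 * rng.normal(size=DIM)    # a near neighbor of point 0
candidates = set().union(*(index.get(k, set()) for k in bucket_keys(q)))
print("point 0 recalled:", 0 in candidates, "-",
      len(candidates), "candidates examined instead of", len(X))
```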

Posted Content
TL;DR: An online sketching algorithm is developed that can compress vectors into a tiny sketch consisting of small arrays of counters whose size scales as O(N^b log^2 N), where b < 1 depends on the stability of the near-neighbor search.
Abstract: We demonstrate the first possibility of a sub-linear memory sketch for solving the approximate near-neighbor search problem. In particular, we develop an online sketching algorithm that can compress N vectors into a tiny sketch consisting of small arrays of counters whose size scales as O(N^b log^2 N), where b < 1 depends on the stability of the near-neighbor search. This sketch is sufficient to identify the top-v near-neighbors with high probability. To the best of our knowledge, this is the first near-neighbor search algorithm that breaks the linear memory (O(N)) barrier. We achieve sub-linear memory by combining advances in locality sensitive hashing (LSH) based estimation, especially the recently-published ACE algorithm, with compressed sensing and heavy hitter techniques. We provide strong theoretical guarantees; in particular, our analysis sheds new light on the memory-accuracy tradeoff in the near-neighbor search setting and the role of sparsity in compressed sensing, which could be of independent interest. We rigorously evaluate our framework, which we call RACE (Repeated ACE) data structures, on a friend recommendation task on the Google plus graph with more than 100,000 high-dimensional vectors. RACE provides compression that is orders of magnitude better than the random projection based alternative, which is unsurprising given the theoretical advantage. We anticipate that RACE will enable both new theoretical perspectives on near-neighbor search and new methodologies for applications like high-speed data mining, internet-of-things (IoT), and beyond.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A possible method to locate a mobile device in massive multiple-input multiple-output (MIMO) systems, which represent a leading 5G technology candidate, is discussed; the proposed method has advantages of low latency and high localization accuracy compared with traditional algorithms.
Abstract: Fingerprint localization (FL) is one of the most efficient positioning schemes; it exploits the characteristics of the received signal or channel information to estimate the physical position. Although there are many available positioning techniques, most of them are used for indoor positioning. In this paper, we discuss a possible method to locate a mobile device in massive multiple-input multiple-output (MIMO) systems, which represent a leading 5G technology candidate. In the offline phase, the fingerprint matrix based on angle-delay channel power is extracted and compressed by a three-tuple (TT) method before being stored in the database. In the online phase, coarse classification and locality sensitive hashing (LSH) are used to process the data and obtain candidate reference points (RPs). Then weighted K nearest neighbors (WKNN) is applied to get the estimated location. The simulation results show that the proposed method has advantages of low latency and high localization accuracy compared with traditional algorithms.
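The final WKNN step can be written down compactly: given a candidate set of reference points (which the paper obtains through coarse classification and LSH), the position estimate is the inverse-distance-weighted average of the K closest fingerprints' coordinates. In the sketch below the fingerprints are synthetic stand-ins for angle-delay channel power vectors and the candidate set is simply the full database; K, the weighting rule and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)

# Offline: fingerprint database, one power vector per reference point (RP).
NUM_RP, FEAT = 400, 64
rp_coords = rng.uniform(0, 100, size=(NUM_RP, 2))          # RP positions in metres
rp_fps = rng.gamma(2.0, size=(NUM_RP, FEAT))                # stand-in fingerprints

def wknn_locate(query_fp, candidate_idx, k=4, eps=1e-6):
    """Weighted K-nearest-neighbour position estimate over a candidate set
    (the candidate set would normally come from coarse classification + LSH)."""
    d = np.linalg.norm(rp_fps[candidate_idx] - query_fp, axis=1)
    order = np.argsort(d)[:k]
    w = 1.0 / (d[order] + eps)                               # inverse-distance weights
    return (w[:, None] * rp_coords[candidate_idx][order]).sum(axis=0) / w.sum()

# Online: a query measured near RP 0 (noisy copy of its fingerprint).
query = rp_fps[0] + 0.05 * rng.normal(size=FEAT)
est = wknn_locate(query, candidate_idx=np.arange(NUM_RP))
print("true:", np.round(rp_coords[0], 1), " estimated:", np.round(est, 1))
```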

Proceedings ArticleDOI
10 Oct 2019
TL;DR: This work proposes a sketching (alternatively, dimensionality reduction) algorithm – BinSketch (Binary Data Sketch) – for sparse binary datasets and compares the performance of this algorithm with the state-of-the-art algorithms on the task of mean-square-error and ranking.
Abstract: Recent advancement of the WWW, IOT, social network, e-commerce, etc. have generated a large volume of data. These datasets are mostly represented by high dimensional and sparse datasets. Many fundamental subroutines of common data analytic tasks such as clustering, classification, ranking, nearest neighbour search, etc. scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm – BinSketch (Binary Data Sketch) – for sparse binary datasets. BinSketch preserves the binary version of the dataset after sketching and maintains estimates for multiple similarity measures such as Jaccard, Cosine, Inner-Product similarities, and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world datasets. We compare the performance of our algorithm with the state-of-the-art algorithms on the task of mean-square-error and ranking. Our proposed algorithm offers a comparable accuracy while suggesting a significant speedup in the dimensionality reduction time, with respect to the other candidate algorithms. Our proposal is simple, easy to implement, and therefore can be adopted in practice.