Showing papers on "Locality-sensitive hashing published in 2019"


Journal ArticleDOI
TL;DR: This paper proposes a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing.
Abstract: Hashing methods have been widely used for efficient similarity retrieval on large scale image database. Traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because the deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information, but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing approach, to perform more effective hash function learning by simultaneously preserving semantic similarity and underlying data structures. The main contributions are as follows: 1) We propose a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing. 2) A semi-supervised deep hashing network is designed to extensively exploit both labeled and unlabeled data, in which we propose an online graph construction method to benefit from the evolving deep features during training to better capture semantic neighbors. To the best of our knowledge, the proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Experimental results on five widely-used data sets show that our proposed approach outperforms the state-of-the-art hashing methods.
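To make the shape of such a semi-supervised objective concrete, the sketch below combines a supervised pairwise term over labeled code pairs with a graph-embedding term over all (labeled and unlabeled) points. It is a minimal NumPy illustration of this kind of loss, not the authors' exact formulation or network; the function name, the inner-product form of the supervised term, and the weighting parameter lam are all assumptions.

```python
import numpy as np

def semi_supervised_hash_loss(B, labeled_pairs, S, W, lam=0.1):
    """Toy semi-supervised hashing objective (illustrative, not the paper's exact loss).

    B: (n, c) relaxed hash codes in [-1, 1] produced by the network
    labeled_pairs: list of (i, j) index pairs with known semantic labels
    S: dict mapping (i, j) -> 1 (semantically similar) or 0 (dissimilar)
    W: (n, n) neighborhood-graph weights over labeled and unlabeled points
    lam: weight of the unsupervised embedding term
    """
    n, c = B.shape
    # Empirical (supervised) term: push code inner products toward the labels.
    emp = sum((B[i] @ B[j] / c - S[(i, j)]) ** 2 for i, j in labeled_pairs)
    # Embedding (unsupervised) term: graph neighbors should get nearby codes.
    emb = sum(W[i, j] * np.sum((B[i] - B[j]) ** 2)
              for i in range(n) for j in range(n) if W[i, j] > 0)
    return emp + lam * emb

# Tiny synthetic check: 6 items, 16-bit relaxed codes, a sparse neighbor graph.
rng = np.random.default_rng(0)
B = np.tanh(rng.normal(size=(6, 16)))
pairs, S = [(0, 1), (2, 3)], {(0, 1): 1, (2, 3): 0}
W = (rng.random((6, 6)) > 0.7).astype(float)
print(semi_supervised_hash_loss(B, pairs, S, W))
```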

140 citations


Journal ArticleDOI
01 Feb 2019-Energy
TL;DR: The simulation results show that the proposed combination model based on a linear kernel function outperforms all the other comparison models from 1-step to 3-step forecasting and provides a promising and effective alternative for short-term wind power prediction.

78 citations


Journal ArticleDOI
TL;DR: This paper has generated cancelable IrisCode features, coined as locality sampled code (LSC), which simultaneously provides strong security guarantees and satisfactory system performance, and formally analyzed the security guarantees of non-invertibility, revocability, and unlinkability.
Abstract: Iris-based biometric models are widely recognized to be one of the most accurate forms for authenticating individual identities. Features extracted from the captured iris images (known as IrisCodes) conventionally get stored in their native format over a data repository. However, from a security aspect, the stored templates are highly vulnerable to a wide spectrum of adversarial attack forms. The study in this paper addresses this issue by introducing a privacy-preserving and secure biometric scheme based on the notion of locality sensitive hashing (LSH). In this paper, we have generated cancelable IrisCode features, coined as locality sampled code (LSC), which simultaneously provides strong security guarantees and satisfactory system performance. The functionality of our proposed framework pivots around the fact that intra-class IrisCode samples are “close” to each other, due to which they hash to the same location. Alternatively, the inter-class IrisCodes features are comparatively dissimilar and consequently hash to different locations. We have rigorously examined the intrinsic properties of the LSCs by estimating the intra-class and inter-class collision probabilities for two distinct IrisCodes. Furthermore, we have formally analyzed the security guarantees of non-invertibility, revocability, and unlinkability in our model by establishing various bounds on the adversarial success probability. Extensive empirical tests on the CASIAv3 and IITD benchmark iris databases demonstrate the superior performance of our proposed model, for which we have obtained the best EERs of 0.105% and 1.4%, respectively.
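The paper's locality sampled codes involve a dedicated construction, but the collision behavior they rely on can be illustrated with plain bit-sampling LSH for binary codes under Hamming distance: intra-class codes that differ on only a few bits collide in most tables, while dissimilar codes rarely do. The sketch below is a generic illustration of that effect with made-up code lengths and table counts, not the LSC scheme itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bit_samplers(code_len, bits_per_hash, num_tables):
    """One random subset of bit positions per hash table (generic bit-sampling LSH)."""
    return [rng.choice(code_len, size=bits_per_hash, replace=False)
            for _ in range(num_tables)]

def hash_code(binary_code, samplers):
    # Each hash value is the tuple of sampled bits; two codes collide in a table
    # with probability (1 - hamming_fraction) ** bits_per_hash.
    return [tuple(binary_code[idx]) for idx in samplers]

# Two intra-class samples: the probe differs from the enrolled code on ~5% of bits.
enrolled = rng.integers(0, 2, size=2048)
probe = enrolled.copy()
probe[rng.choice(2048, size=100, replace=False)] ^= 1

samplers = make_bit_samplers(2048, bits_per_hash=12, num_tables=20)
collisions = sum(a == b for a, b in zip(hash_code(enrolled, samplers),
                                        hash_code(probe, samplers)))
print(f"{collisions} of 20 tables collide for the genuine pair")
```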

47 citations


Journal ArticleDOI
TL;DR: The DENCAST system is proposed, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction).
Abstract: Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value < 0.05) state-of-the-art distributed regression methods, in both single and multi-target settings.

45 citations


Journal ArticleDOI
TL;DR: An LSH method, called Order Min Hash (OMH), is presented as a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences.
Abstract: MOTIVATION Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy. RESULTS We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH. AVAILABILITY AND IMPLEMENTATION The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
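A rough sketch of the idea as described above: like minHash, each sketch entry selects a few k-mers with the smallest hash values, but the selected k-mers are recorded in the order they occur in the sequence, so transpositions change the sketch even when k-mer content is identical. This is only an illustrative reading of the construction (parameter choices, tie handling and repeated k-mers are ignored); the reference implementation is at the URL above.

```python
import hashlib

def order_min_hash(seq, k=4, l=3, num_hashes=8):
    """Illustrative Order-Min-Hash-style sketch (not the reference implementation)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for h in range(num_hashes):
        def hval(mer, h=h):
            return hashlib.sha1(f"{h}:{mer}".encode()).hexdigest()
        # l distinct k-mers with the smallest hash values under this hash function...
        chosen = sorted(set(kmers), key=hval)[:l]
        # ...recorded in the order of their first occurrence in the sequence.
        sketch.append(tuple(sorted(chosen, key=kmers.index)))
    return sketch

a = order_min_hash("ACGTACGGTTACGT")
b = order_min_hash("ACGTACGGTTACGA")   # one substitution at the end
print(sum(x == y for x, y in zip(a, b)), "of 8 sketch entries agree")
```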

39 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed stratified sampling based clustering algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.
Abstract: Large-scale data analysis is a challenging and relevant task for present-day research and industry. As a promising data analysis tool, clustering is becoming more important in the era of big data. In large-scale data clustering, sampling is an efficient and widely used approximation technique. Recently, several sampling-based clustering algorithms have attracted considerable attention in large-scale data analysis owing to their efficiency. However, some of these existing algorithms have low clustering accuracy, whereas others have high computational complexity. To overcome these deficiencies, a stratified sampling based clustering algorithm for large-scale data is proposed in this paper. Its basic steps include: (1) obtaining a number of representative samples, via a stratified sampling scheme, from different strata formed by a locality sensitive hashing technique; (2) partitioning the chosen samples into different clusters using the fuzzy c-means clustering algorithm; (3) assigning the out-of-sample objects to their closest clusters via a data labeling technique. The performance of the proposed algorithm is compared with state-of-the-art sampling-based fuzzy c-means clustering algorithms on several large-scale data sets, including synthetic and real ones. The experimental results show that the proposed algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency for large-scale data sets.
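A minimal sketch of that three-step pipeline, with random-hyperplane LSH buckets serving as strata. To keep it short, scikit-learn's k-means stands in for fuzzy c-means, and the sampling rate, number of hyperplanes and cluster count are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 16))                 # stand-in for a large data set

# (1) Form strata with random-hyperplane LSH: each sign pattern is one stratum,
#     then draw a proportional sample from every stratum.
planes = rng.normal(size=(8, X.shape[1]))
strata = (X @ planes.T > 0).astype(int) @ (1 << np.arange(8))
sample_idx = []
for key in np.unique(strata):
    members = np.flatnonzero(strata == key)
    take = max(1, int(0.01 * len(members)))        # ~1% per stratum
    sample_idx.extend(rng.choice(members, size=take, replace=False))
sample_idx = np.asarray(sample_idx)

# (2) Cluster only the sample (k-means here as a stand-in for fuzzy c-means).
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[sample_idx])

# (3) Label every out-of-sample object with its closest cluster.
labels = model.predict(X)
print(np.bincount(labels))
```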

37 citations


Proceedings ArticleDOI
08 Apr 2019
TL;DR: LAZO is a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set.
Abstract: Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n^2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to computing containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query, instead of only returning the candidate set; ii) it permits querying for a specific threshold on the fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
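LAZO's estimator rests on redefining Jaccard similarity so that set cardinalities enter the picture; the sketch below only illustrates the underlying set identity that makes this possible, namely that a MinHash estimate of JS plus the two cardinalities is enough to recover the intersection size and hence a containment (JC) estimate. The MinHash construction and the example sets are generic, not LAZO's index.

```python
import numpy as np

def minhash_signature(items, num_hashes=128, seed=7):
    """Plain MinHash signature over a set of hashable items."""
    rng = np.random.default_rng(seed)
    a, b = rng.integers(1, 2**31 - 1, size=(2, num_hashes))
    p = 2**31 - 1
    table = np.array([[(ai * hash(x) + bi) % p for ai, bi in zip(a, b)] for x in items])
    return table.min(axis=0)

def jaccard_estimate(sig_x, sig_y):
    return float(np.mean(sig_x == sig_y))

def containment_from_jaccard(js, n_x, n_y):
    # |X ∩ Y| = JS * (|X| + |Y|) / (1 + JS), so JC(X, Y) = |X ∩ Y| / |X|.
    intersection = js * (n_x + n_y) / (1.0 + js)
    return intersection / n_x

X = set(range(0, 1_000))                  # |X| = 1000
Y = set(range(500, 3_000))                # |Y| = 2500, 500 shared values
js = jaccard_estimate(minhash_signature(X), minhash_signature(Y))
print("JS estimate:", round(js, 3), " (true 0.167)")
print("JC(X, Y) estimate:", round(containment_from_jaccard(js, len(X), len(Y)), 3),
      " (true 0.5)")
```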

36 citations


Journal ArticleDOI
TL;DR: ‘low‐density’ locality sensitive hashing is introduced to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning, allowing for the discovery of novel lineages.
Abstract: Motivation Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce 'low-density' locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning. Results On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges. Availability and implementation Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal. Supplementary information Supplementary data are available at Bioinformatics online.

28 citations


Journal ArticleDOI
TL;DR: This paper introduces two contributions aiming to provide clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering, and proposes a scalable procedure to approximate the density gradient ascent.

27 citations


Journal ArticleDOI
TL;DR: An efficient approximation method based on locality sensitive hashing is proposed, which first retrieves candidate time series and then exploits their hash values to compute distance estimates for pruning; its benefits in terms of query efficiency are demonstrated when dealing with a collection of multivariate time series.

25 citations


Journal ArticleDOI
TL;DR: An improved matching technique based on an enhanced copy-move forgery detection (CMFD) pipeline via k-means clustering is proposed, which can enhance the detection accuracy in a significant manner and reduce the processing time with LSH-based matching.
Abstract: The goal of copy-move forgery is to manipulate the semantics of an image. This can be performed by cloning a region of an image and subsequently pasting it onto a different region within the same image. Accordingly, this paper proposes an improved matching technique based on an enhanced CMFD pipeline via a k-means clustering technique. By deploying k-means clustering to group the overlapping blocks, the matching step was carried out independently within each cluster to speed up the matching process. In addition, clustering the feature vectors allowed the matching process to identify the matches accurately. Thus, in order to test the enhanced pipeline, it was combined with Zernike moments and locality sensitive hashing (LSH). The experimental results showed that the proposed method can enhance the detection accuracy in a significant manner. On top of that, the proposed pipeline can reduce the processing time with LSH-based matching.
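A compact sketch of that pipeline on a synthetic image: overlapping blocks are described by simple feature vectors (raw pixel blocks here stand in for Zernike moments), k-means groups the blocks, and matching is done only inside each cluster; an exhaustive within-cluster comparison stands in for the LSH-based matching. Block size, stride, cluster count and the duplicate threshold are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
img = rng.random((64, 64))
img[40:48, 40:48] = img[8:16, 8:16]                  # simulate a copy-moved region

B = 8                                                # block size, stride 2
blocks, positions = [], []
for y in range(0, img.shape[0] - B + 1, 2):
    for x in range(0, img.shape[1] - B + 1, 2):
        blocks.append(img[y:y + B, x:x + B].ravel())  # stand-in for Zernike features
        positions.append((y, x))
blocks = np.array(blocks)

# Group the block features so matching only happens within each cluster.
labels = KMeans(n_clusters=16, n_init=5, random_state=0).fit_predict(blocks)

matches = []
for c in range(16):
    idx = np.flatnonzero(labels == c)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dy = abs(positions[i][0] - positions[j][0])
            dx = abs(positions[i][1] - positions[j][1])
            # Ignore overlapping blocks; flag (near-)identical distant ones.
            if max(dy, dx) > B and np.allclose(blocks[i], blocks[j], atol=1e-8):
                matches.append((positions[i], positions[j]))
print("suspect block pairs:", matches)
```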

Journal ArticleDOI
TL;DR: This paper formulates the visual summarization task as a co-clustering problem and proposes an efficient algorithm based on locality sensitive hashing (LSH) that can easily scale to large graphs under reasonable interactive time constraints that previous related methods cannot satisfy.
Abstract: Bipartite graphs model the key relations in many large-scale real-world data: customers purchasing items, legislators voting for bills, people's affiliation with different social groups, faults occurring in vehicles, etc. However, it is challenging to visualize large-scale bipartite graphs with tens of thousands or even more nodes or edges. In this paper, we propose a novel visual summarization technique for bipartite graphs based on the minimum description length (MDL) principle. The method simultaneously groups the two different sets of nodes and constructs aggregated bipartite relations with balanced granularity and precision. It addresses the key trade-off that often occurs for visualizing large-scale and noisy data: acquiring a clear and uncluttered overview while maximizing the information content in it. We formulate the visual summarization task as a co-clustering problem and propose an efficient algorithm based on locality sensitive hashing (LSH) that can easily scale to large graphs under reasonable interactive time constraints that previous related methods cannot satisfy. The method leads to the opportunity of introducing a visual analytics framework with multiple levels-of-detail to facilitate interactive data exploration. In the framework, we also introduce a compact visual design inspired by adjacency list representation of graphs as the building block for a small multiples display to compare the bipartite relations for different subsets of data. We showcase the applicability and effectiveness of our approach by applying it on synthetic data with ground truth and performing case studies on real-world datasets from two application domains including roll-call vote record analysis and vehicle fault pattern analysis. Interviews with experts in the political science community and the automotive industry further highlight the benefits of our approach.

Proceedings ArticleDOI
13 May 2019
TL;DR: This paper investigates fast approximation of three interaction-based neural ranking algorithms using Locality Sensitive Hashing (LSH), which accelerates query-document interaction computation by using a runtime cache with precomputed term vectors, and speeds up kernel calculation by taking advantage of limited integer similarity values.
Abstract: Interaction-based neural ranking has been shown to be effective for document search using distributed word representations. However, the time or space required is very expensive for online query processing with neural ranking. This paper investigates fast approximation of three interaction-based neural ranking algorithms using Locality Sensitive Hashing (LSH). It accelerates query-document interaction computation by using a runtime cache with precomputed term vectors, and speeds up kernel calculation by taking advantage of limited integer similarity values. This paper presents the design choices with cost analysis, and an evaluation that assesses efficiency benefits and relevance tradeoffs for the tested datasets.
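The system described above caches precomputed term fingerprints and exploits small integer similarity values; the generic mechanism behind that is signed random projection LSH, where the cosine between two term embeddings is recovered from the (integer) Hamming distance of short bit fingerprints. The sketch below shows only that mechanism, with made-up dimensions and without the paper's runtime cache or kernel approximations.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, BITS = 300, 64                      # embedding size, fingerprint length
planes = rng.normal(size=(BITS, DIM))    # shared random hyperplanes

def fingerprint(vec):
    """64-bit signed-random-projection fingerprint of a term embedding."""
    bits = (planes @ vec > 0).astype(np.uint64)
    return np.uint64(np.sum(bits << np.arange(BITS, dtype=np.uint64)))

def approx_cosine(fp_a, fp_b):
    """Cosine estimate from the small-integer Hamming distance of two fingerprints."""
    hamming = bin(int(fp_a ^ fp_b)).count("1")
    return np.cos(np.pi * hamming / BITS)

q = rng.normal(size=DIM)                  # a query term vector
d = q + 0.3 * rng.normal(size=DIM)        # a similar document term vector
true_cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
print("true cosine:", round(float(true_cos), 3),
      " LSH estimate:", round(float(approx_cosine(fingerprint(q), fingerprint(d))), 3))
```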

Journal ArticleDOI
TL;DR: This study proposes an approach called Randomized Distributed Hashing (RDH), which uses Locality Sensitive Hashing (LSH) in a distributed scheme and is promising for searching images in large datasets with multiple nodes.

Journal ArticleDOI
28 May 2019-Sensors
TL;DR: A visual localization approach based on place recognition that combines the powerful ConvNet features and localized image sequence matching and shows good performances even in the presence of appearance and illumination changes is proposed.
Abstract: Convolutional Network (ConvNet), with its strong image representation ability, has achieved significant progress in the computer vision and robotic fields. In this paper, we propose a visual localization approach based on place recognition that combines the powerful ConvNet features and localized image sequence matching. The image distance matrix is constructed based on the cosine distance of extracted ConvNet features, and then a sequence search technique is applied on this distance matrix for the final visual recognition. To improve computational efficiency, the locality sensitive hashing (LSH) method is applied to achieve real-time performance with minimal accuracy degradation. We present extensive experiments on four real-world data sets to evaluate each of the specific challenges in visual recognition. A comprehensive performance comparison of different ConvNet layers (each defining a level of features) considering both appearance and illumination changes is conducted. Compared with the traditional approaches based on hand-crafted features and single image matching, the proposed method shows good performance even in the presence of appearance and illumination changes.

Journal ArticleDOI
01 Apr 2019
TL;DR: This work introduces a fast and efficient method for dynamically clustering records in an RDF data management system, Tunable-LSH, which can auto-tune to achieve the aforementioned clustering objective with high accuracy even when the workloads change.
Abstract: The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of RDF data that are publicly available. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than what they were designed to handle. Consequently, there is a growing need for developing workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront, but as SPARQL queries are executed, we keep track of records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically and in constant-time where a record needs to be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve the aforementioned clustering objective with high accuracy even when the workloads change. Experimental evaluation of Tunable-LSH in an RDF data management system as well as in a standalone hashtable shows end-to-end performance gains over existing solutions.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This paper proposes a fast yet effective anomaly detection approach in multiple multi-dimensional data streams based on a combination of ideas, i.e., stream pre-processing, locality sensitive hashing and dynamic isolation forest, which achieves a magnitude increase in its efficiency compared with state-of-the-art approaches while maintaining competitive detection accuracy.
Abstract: Multiple multi-dimensional data streams are ubiquitous in the modern world, such as IoT applications, GIS applications and social networks. Detecting anomalies in such data streams in real-time is an important and challenging task. It is able to provide valuable information from data and thus assists decision-making. However, existing approaches for anomaly detection in multi-dimensional data streams have not properly considered the correlations among multiple multi-dimensional streams. Moreover, for multi-dimensional streaming data, online detection speed is often an important concern. In this paper, we propose a fast yet effective anomaly detection approach in multiple multi-dimensional data streams. This is based on a combination of ideas, i.e., stream pre-processing, locality sensitive hashing and dynamic isolation forest. Experiments on real datasets demonstrate that our approach achieves a magnitude increase in its efficiency compared with state-of-the-art approaches while maintaining competitive detection accuracy.

Journal ArticleDOI
TL;DR: A flexible and fast distributed video deduplication framework based on hash codes that is able to support the hash table indexing using any existing hashing algorithm in a distributed environment and can efficiently rank the candidate videos by exploring the similarities among the key frames over multiple tables using MapReduce strategy.
Abstract: The exponentially growing amount of video data being produced has led to tremendous challenges for video deduplication technology. Nowadays, many different deduplication approaches are being rapidly developed, but they are generally slow and their identification processes are somewhat inaccurate. To date, there has been little work studying a generic hash-based distributed framework and an efficient similarity ranking strategy for video deduplication. This paper proposes a flexible and fast distributed video deduplication framework based on hash codes. It is able to support hash table indexing using any existing hashing algorithm in a distributed environment and can efficiently rank the candidate videos by exploring the similarities among the key frames over multiple tables using a MapReduce strategy. Our experiments with a popular large-scale dataset demonstrate that the proposed framework can achieve satisfactory video deduplication performance.

Journal ArticleDOI
TL;DR: The proposed algorithms have better running time performance than standard LSH-based applications while keeping prediction accuracy within reasonable limits, and they have a large positive impact on aggregate diversity, which has recently become an important evaluation measure for recommender algorithms.
Abstract: Neighborhood-based collaborative filtering (CF) methods are widely used in recommender systems because they are easy to implement and highly effective. One of the significant challenges of these methods is the ability to scale with the increasing amount of data, since finding nearest neighbors requires a search over all of the data. Approximate nearest neighbor (ANN) methods eliminate this exhaustive search by only looking at the data points that are likely to be similar. Locality sensitive hashing (LSH) is a well-known technique for ANN search in high dimensional spaces. It is also effective in solving the scalability problem of neighborhood-based CF. In this study, we provide novel improvements to current LSH-based recommender algorithms and make a systematic evaluation of LSH in neighborhood-based CF. In addition, we conduct extensive experiments on real-life datasets to investigate various parameters of LSH and their effects on multiple metrics used to evaluate recommender systems. Our proposed algorithms have better running time performance than the standard LSH-based applications while keeping the prediction accuracy within reasonable limits. Also, the proposed algorithms have a large positive impact on aggregate diversity, which has recently become an important evaluation measure for recommender algorithms.
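To make the role of LSH in neighborhood-based CF concrete, the sketch below hashes user rating vectors with signed random projections and restricts the neighbor search for a prediction to users sharing a bucket in any table, instead of scanning all users. The table counts, bit widths, rating model and fallback rule are illustrative guesses, not the paper's tuned algorithms.

```python
import numpy as np

rng = np.random.default_rng(3)
R = rng.integers(0, 6, size=(5_000, 200)).astype(float)   # user x item ratings, 0 = unrated

NUM_TABLES, BITS = 6, 12
tables = []
for _ in range(NUM_TABLES):
    planes = rng.normal(size=(BITS, R.shape[1]))
    keys = (R @ planes.T > 0).astype(int) @ (1 << np.arange(BITS))
    buckets = {}
    for user, key in enumerate(keys):
        buckets.setdefault(int(key), []).append(user)
    tables.append((planes, buckets))

def candidate_neighbors(u):
    """Union of the user's buckets across tables, instead of a scan over all users."""
    cands = set()
    for planes, buckets in tables:
        key = int((R[u] @ planes.T > 0).astype(int) @ (1 << np.arange(BITS)))
        cands.update(buckets.get(key, []))
    cands.discard(u)
    return list(cands)

def predict(u, item):
    cands = candidate_neighbors(u)
    sims = np.array([R[u] @ R[v] / (np.linalg.norm(R[u]) * np.linalg.norm(R[v]) + 1e-9)
                     for v in cands])
    ratings = R[cands, item]
    rated = ratings > 0                          # keep only neighbors who rated the item
    if not rated.any():
        return R[u][R[u] > 0].mean()             # fall back to the user's mean rating
    return float(np.average(ratings[rated], weights=sims[rated]))

print(round(predict(0, 10), 2))
```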

Book ChapterDOI
12 Jun 2019
TL;DR: This paper proposes a fast kNN-based approach for Time Sensitive Anomaly Detection (kNN-TSAD), which can find outliers that present different behavior characteristics, including normal and abnormal characteristics, within different time intervals.
Abstract: Anomaly detection is an important data mining method aiming to discover outliers that show significant diversion from their expected behavior. A widely used criterion for determining outliers is based on the number of their neighboring elements, which are referred to as Nearest Neighbors (NN). Existing kNN-based Anomaly Detection (kNN-AD) algorithms cannot detect streaming outliers, which present time sensitive abnormal behavior characteristics in different time intervals. In this paper, we propose a fast kNN-based approach for Time Sensitive Anomaly Detection (kNN-TSAD), which can find outliers that present different behavior characteristics, including normal and abnormal characteristics, within different time intervals. The core idea of our proposal is that we combine the model of sliding window with Locality Sensitive Hashing (LSH) to monitor the distribution of streaming elements as well as the number of their Nearest Neighbors as time progresses. We use an ε-approximation scheme to implement the model of sliding window to compute Nearest Neighbors on the fly. We conduct extensive experiments to examine our approach for time sensitive anomaly detection using three real-world data sets. The results show that our approach can achieve significant improvement on recall and precision for anomaly detection within different time intervals. In particular, our approach achieves two orders of magnitude improvement on time consumption for streaming anomaly detection, when compared with traditional kNN-based anomaly detection algorithms, such as exact-Storm, approx-Storm, and MCOD, while it only uses 10% of their memory consumption.

Book ChapterDOI
05 Aug 2019
TL;DR: This paper proposes FRESH, an approximate and randomized approach for r-range search that leverages a locality sensitive hashing scheme for detecting candidate near neighbors of the query curve, and a subsequent pruning step based on a cascade of curve simplifications.
Abstract: This paper studies the r-range search problem for curves under the continuous Frechet distance: given a dataset S of n polygonal curves and a threshold r > 0, construct a data structure that, for any query curve q, efficiently returns all entries in S with distance at most r from q. We propose FRESH, an approximate and randomized approach for r-range search, that leverages a locality sensitive hashing scheme for detecting candidate near neighbors of the query curve, and a subsequent pruning step based on a cascade of curve simplifications. We experimentally compare FRESH to exact and deterministic solutions, and we show that high performance can be reached by suitably relaxing precision and recall.

Posted Content
TL;DR: In this paper, the authors proposed a scalable clustering algorithm based on Locality Sensitive Hashing (LSH) to approximate the density gradient ascent in mean shift clustering.
Abstract: In this paper we target the class of modal clustering methods where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is k-means clustering. Mean Shift clustering is a generalization of k-means clustering which computes arbitrarily shaped clusters, defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aiming to provide clustering algorithms with a linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate-sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation, written for the Spark/Scala ecosystem, is proposed. For all these considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.
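One way to read the approximate density gradient ascent: each Mean Shift update is computed over the points that share the query's LSH bucket rather than over the whole dataset. The sketch below uses standard p-stable (Euclidean) LSH for the buckets and a Gaussian kernel for the update; the bucket width, bandwidth and single-table indexing are illustrative simplifications, not the paper's procedure or its Spark implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 0.3, size=(500, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])       # three density modes

# Standard p-stable (Euclidean) LSH bucket key; a real index would use several tables.
BITS, WIDTH = 6, 1.0
A = rng.normal(size=(BITS, 2))
OFF = rng.uniform(0, WIDTH, size=BITS)

def bucket(p):
    return tuple(np.floor((A @ p + OFF) / WIDTH).astype(int))

index = {}
for i, p in enumerate(X):
    index.setdefault(bucket(p), []).append(i)

def mean_shift_step(p, bandwidth=0.5):
    """One gradient-ascent step using only the LSH-bucket neighbors of p."""
    idx = index.get(bucket(p), [])
    nbrs = X[idx] if idx else X                             # fall back to all points
    w = np.exp(-np.sum((nbrs - p) ** 2, axis=1) / (2 * bandwidth ** 2))
    return (w[:, None] * nbrs).sum(axis=0) / w.sum()

p = X[0].copy()
for _ in range(20):
    p = mean_shift_step(p)
print("final position (near a local mode):", np.round(p, 2))
```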

Journal ArticleDOI
Hangyu Li, Sarana Nutanong, Hong Xu, Chenyun Yu, Foryu Ha
TL;DR: A network-efficient solution called C2Net is proposed to improve the utilization of MapReduce combiners; it uses two graph partitioning schemes: minimum spanning tree for organizing LSH bucket replication and spectral clustering for runtime collision counting task scheduling.
Abstract: Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation involves identifying pairs (p, q) in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost estimation frequently turns into a bottleneck in a distributed processing environment, thus resulting in a challenge of achieving faster and more efficient similarity join. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20 percent data reduction and 50 percent reduction in shuffle time.

Posted Content
TL;DR: This paper proposes a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy, and proposes efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines.
Abstract: Many emerging use cases of data mining and machine learning operate on large datasets with data from heterogeneous sources, specifically with both sparse and dense components. For example, dense deep neural network embedding vectors are often used in conjunction with sparse textual features to provide high dimensional hybrid representation of documents. Efficient search in such hybrid spaces is very challenging as the techniques that perform well for sparse vectors have little overlap with those that work well for dense vectors. Popular techniques like Locality Sensitive Hashing (LSH) and its data-dependent variants also do not give good accuracy in high dimensional hybrid spaces. Even though hybrid scenarios are becoming more prevalent, currently there exist no efficient techniques in literature that are both fast and accurate. In this paper, we propose a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy. We also propose efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines. The performance of the proposed method is demonstrated on several datasets including a very large scale industrial dataset containing one billion vectors in a billion dimensional space, achieving over 10x speedup and higher accuracy against competitive baselines.

Journal ArticleDOI
TL;DR: This study utilizes the locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation and discovers the optimum number of hash functions in each band of the LSH scheme based on the candidate similarity threshold.
Abstract: In this study, we propose a scalable approach for automatically identifying similar candidate instance pairs in very large datasets. Efficient candidate pair generation is essential to many computational problems involving the calculation of instance similarities. Calculating similarities of instances with a large number of properties and efficiently matching a large number of similar instances in a scalable way are two significant bottlenecks of candidate instance pair generation. In our approach, we utilize the locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation. Based on the candidate similarity threshold, our algorithm automatically discovers the optimum number of hash functions in each band of LSH. Moreover, we evaluated the scalability of our approach and its effectiveness in the instance matching task using very large real-world datasets.
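The tuning described above rests on the standard MinHash banding relation: with b bands of r hash functions each, a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b, and the S-curve threshold sits near (1/b)^(1/r). The sketch below simply searches the divisors of the signature length for the r that best matches a target threshold; it is the textbook relation such tuning builds on, not the paper's algorithm itself.

```python
import math

def choose_band_config(num_hashes, threshold):
    """Pick rows-per-band r (and bands b = num_hashes // r) so that the LSH
    S-curve threshold (1/b) ** (1/r) is as close as possible to the target."""
    best = None
    for r in range(1, num_hashes + 1):
        if num_hashes % r:
            continue
        b = num_hashes // r
        t = (1.0 / b) ** (1.0 / r)
        if best is None or abs(t - threshold) < abs(best[2] - threshold):
            best = (r, b, t)
    return best

def candidate_probability(sim, r, b):
    """Probability that a pair with similarity `sim` becomes a candidate."""
    return 1.0 - (1.0 - sim ** r) ** b

r, b, t = choose_band_config(num_hashes=128, threshold=0.7)
print(f"r={r} hash functions per band, b={b} bands, S-curve threshold ~{t:.2f}")
for s in (0.5, 0.7, 0.9):
    print(f"  sim={s:.1f} -> candidate probability {candidate_probability(s, r, b):.3f}")
```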

Book ChapterDOI
09 Jan 2019
TL;DR: This work has leveraged the Suffix tree structure and Locality Sensitive Hashing to linearly cluster malicious programs and to reduce the number of signatures significantly.
Abstract: Security threats due to malicious executables are getting more serious. A lot of researchers are interested in combating malware attacks. In contrast, malicious users aim to increase the usage of polymorphic and metamorphic malware in order to increase the analysis cost and prevent being identified by anti-malware tools. Due to the intuitive similarity between different polymorphic variants of a malware family, clustering is an effective approach to deal with this problem; clustering accordingly is able to reduce the number of signatures. Therefore, we have leveraged the suffix tree structure and Locality Sensitive Hashing (LSH) to linearly cluster malicious programs and to reduce the number of signatures significantly.

Journal ArticleDOI
TL;DR: The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously.
Abstract: Similarity search is an essential operation in such domains as data mining and content-based information retrieval. This simple operation causes a considerable burden when the number of data records grows large, especially in big data applications. At the sacrifice of accuracy, approximate methods for finding similar items have been developed to deliver effective services in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. Various algorithms have been proposed for locality sensitive hashing, which basically try to narrow down the candidate data set to be examined. The candidate data set does not always contain all the data similar to the query, and thus the search results are approximate. An increase in the size of the candidate set improves the recall of similar items, but it degrades the processing speed. This paper is concerned with a method to increase the recall rate while not entailing significant cost. The method basically uses a random hyperplane partitioning technique to create buckets to which data objects are distributed. The nearest neighbors located on the other side of such hyperplanes can be false negatives when only the bucket to which the query belongs is examined for finding similar neighbors. The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously. The over-sized buckets are further split by adding additional hyperplanes to control the bucket sizes. To improve the processing speed, the algorithm is realized in the MapReduce paradigm on a Hadoop cluster. Some experimental results are presented to show its applicability.
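A small sketch of that idea under simplifying assumptions: each random hyperplane normally contributes one bit to the bucket key, but a point whose normalized projection falls within a margin of the hyperplane is replicated into both child buckets, so near neighbors separated by a hyperplane still share a bucket. The margin, plane count and query are illustrative, and the bucket-splitting and MapReduce parts of the method are omitted.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(9)
DIM, NUM_PLANES, MARGIN = 32, 8, 0.05     # MARGIN defines the hyperplane "vicinity"

planes = rng.normal(size=(NUM_PLANES, DIM))
planes /= np.linalg.norm(planes, axis=1, keepdims=True)

def bucket_keys(x):
    """All bucket keys a point belongs to; points within MARGIN of a hyperplane
    are treated as lying on both of its sides, so they land in several buckets."""
    proj = planes @ x / np.linalg.norm(x)
    sides = [(0, 1) if abs(p) < MARGIN else (int(p > 0),) for p in proj]
    return set(product(*sides))

index = {}
X = rng.normal(size=(10_000, DIM))
for i, x in enumerate(X):
    for key in bucket_keys(x):
        index.setdefault(key, set()).add(i)

q = X[0] + 0.01 * rng.normal(size=DIM)    # a near neighbor of point 0
candidates = set().union(*(index.get(k, set()) for k in bucket_keys(q)))
print("point 0 recalled:", 0 in candidates, "-",
      len(candidates), "candidates examined instead of", len(X))
```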

Posted Content
TL;DR: An online sketching algorithm is developed that can compress vectors into a tiny sketch consisting of small arrays of counters whose size scales as O(N^b log^2 N), where b < 1 depends on the stability of the near-neighbor search.
Abstract: We demonstrate the first possibility of a sub-linear memory sketch for solving the approximate near-neighbor search problem. In particular, we develop an online sketching algorithm that can compress N vectors into a tiny sketch consisting of small arrays of counters whose size scales as O(N^b log^2 N), where b < 1 depends on the stability of the near-neighbor search. This sketch is sufficient to identify the top-v near-neighbors with high probability. To the best of our knowledge, this is the first near-neighbor search algorithm that breaks the linear memory (O(N)) barrier. We achieve sub-linear memory by combining advances in locality sensitive hashing (LSH) based estimation, especially the recently-published ACE algorithm, with compressed sensing and heavy hitter techniques. We provide strong theoretical guarantees; in particular, our analysis sheds new light on the memory-accuracy tradeoff in the near-neighbor search setting and the role of sparsity in compressed sensing, which could be of independent interest. We rigorously evaluate our framework, which we call RACE (Repeated ACE) data structures, on a friend recommendation task on the Google plus graph with more than 100,000 high-dimensional vectors. RACE provides compression that is orders of magnitude better than the random projection based alternative, which is unsurprising given the theoretical advantage. We anticipate that RACE will enable both new theoretical perspectives on near-neighbor search and new methodologies for applications like high-speed data mining, internet-of-things (IoT), and beyond.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A possible method to locate a mobile device in massive multiple-input multiple-output (MIMO) systems, which represent a leading 5G technology candidate, is discussed; the proposed method has advantages of low latency and high localization accuracy compared with traditional algorithms.
Abstract: Fingerprint localization (FL) is one of the most efficient positioning schemes; it exploits the characteristics of the received signal or channel information to estimate the physical position. Although there are many available positioning techniques, most of them are used for indoor positioning. In this paper, we discuss a possible method to locate a mobile device in massive multiple-input multiple-output (MIMO) systems, which represent a leading 5G technology candidate. In the offline phase, the fingerprint matrix based on angle-delay channel power is extracted and compressed by a three-tuple (TT) method before being stored in the database. In the online phase, coarse classification and locality sensitive hashing (LSH) are used to process the data and obtain candidate reference points (RPs). Then weighted K nearest neighbors (WKNN) is applied to get the estimated location. The simulation results show that the proposed method has advantages of low latency and high localization accuracy compared with traditional algorithms.
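The final WKNN step can be written down compactly: given a candidate set of reference points (which the paper obtains through coarse classification and LSH), the position estimate is the inverse-distance-weighted average of the K closest fingerprints' coordinates. In the sketch below the fingerprints are synthetic stand-ins for angle-delay channel power vectors and the candidate set is simply the full database; K, the weighting rule and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)

# Offline: fingerprint database, one power vector per reference point (RP).
NUM_RP, FEAT = 400, 64
rp_coords = rng.uniform(0, 100, size=(NUM_RP, 2))          # RP positions in metres
rp_fps = rng.gamma(2.0, size=(NUM_RP, FEAT))                # stand-in fingerprints

def wknn_locate(query_fp, candidate_idx, k=4, eps=1e-6):
    """Weighted K-nearest-neighbour position estimate over a candidate set
    (the candidate set would normally come from coarse classification + LSH)."""
    d = np.linalg.norm(rp_fps[candidate_idx] - query_fp, axis=1)
    order = np.argsort(d)[:k]
    w = 1.0 / (d[order] + eps)                               # inverse-distance weights
    return (w[:, None] * rp_coords[candidate_idx][order]).sum(axis=0) / w.sum()

# Online: a query measured near RP 0 (noisy copy of its fingerprint).
query = rp_fps[0] + 0.05 * rng.normal(size=FEAT)
est = wknn_locate(query, candidate_idx=np.arange(NUM_RP))
print("true:", np.round(rp_coords[0], 1), " estimated:", np.round(est, 1))
```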

Proceedings ArticleDOI
10 Oct 2019
TL;DR: This work proposes a sketching (alternatively, dimensionality reduction) algorithm – BinSketch (Binary Data Sketch) – for sparse binary datasets and compares the performance of this algorithm with the state-of-the-art algorithms on the task of mean-square-error and ranking.
Abstract: Recent advancement of the WWW, IOT, social network, e-commerce, etc. have generated a large volume of data. These datasets are mostly represented by high dimensional and sparse datasets. Many fundamental subroutines of common data analytic tasks such as clustering, classification, ranking, nearest neighbour search, etc. scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm – BinSketch (Binary Data Sketch) – for sparse binary datasets. BinSketch preserves the binary version of the dataset after sketching and maintains estimates for multiple similarity measures such as Jaccard, Cosine, Inner-Product similarities, and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world datasets. We compare the performance of our algorithm with the state-of-the-art algorithms on the task of mean-square-error and ranking. Our proposed algorithm offers a comparable accuracy while suggesting a significant speedup in the dimensionality reduction time, with respect to the other candidate algorithms. Our proposal is simple, easy to implement, and therefore can be adopted in practice.