
Showing papers on "Locality-sensitive hashing published in 2021"


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Non-local sparse attention (NLSA) as mentioned in this paper is designed to retain the long-range modeling capability of the non-local operation while enjoying the robustness and high efficiency of sparse representation; spherical locality-sensitive hashing partitions the input space into hash buckets of related features.
Abstract: Both the Non-Local (NL) operation and sparse representation are crucial for Single Image Super-Resolution (SISR). In this paper, we investigate their combination and propose a novel Non-Local Sparse Attention (NLSA) with a dynamic sparse attention pattern. NLSA is designed to retain the long-range modeling capability of the NL operation while enjoying the robustness and high efficiency of sparse representation. Specifically, NLSA rectifies non-local attention with spherical locality sensitive hashing (LSH), which partitions the input space into hash buckets of related features. NLSA assigns each query signal to a bucket and computes attention only within that bucket. The resulting sparse attention prevents the model from attending to noisy and less-informative locations, while reducing the computational cost from quadratic to asymptotically linear in the spatial size. Extensive experiments validate the effectiveness and efficiency of NLSA. With a few non-local sparse attention modules, our architecture, called the non-local sparse network (NLSN), reaches state-of-the-art performance for SISR both quantitatively and qualitatively.

216 citations
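
The bucket-restricted attention described above can be illustrated with a minimal sketch. It uses plain sign random projections rather than the paper's spherical LSH, a single hashing round, and arbitrary sizes (512 features of dimension 64, 4 hash bits), so it is an illustration of the general idea, not the authors' implementation:

import numpy as np

def lsh_bucketed_attention(queries, keys, values, n_hashes=4, seed=0):
    # Hash features with sign random projections; attend only inside each bucket.
    n, dim = queries.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((dim, n_hashes))            # random hyperplanes
    codes = (queries @ planes > 0).astype(np.uint8)          # n_hashes-bit code per feature
    bucket_ids = codes @ (2 ** np.arange(n_hashes))          # integer bucket id

    out = np.zeros_like(values)
    for b in np.unique(bucket_ids):
        idx = np.where(bucket_ids == b)[0]
        q, k, v = queries[idx], keys[idx], values[idx]
        scores = q @ k.T / np.sqrt(dim)                      # attention restricted to this bucket
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ v
    return out

# toy usage: 512 feature vectors of dimension 64, hashed into at most 16 buckets
x = np.random.default_rng(1).standard_normal((512, 64))
print(lsh_bucketed_attention(x, x, x).shape)                 # (512, 64)

Restricting each softmax to one bucket is what turns the quadratic all-pairs attention into a cost that grows roughly linearly with the number of spatial features, at the price of ignoring cross-bucket interactions.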


Journal ArticleDOI
TL;DR: The classic Locality-Sensitive Hashing (LSH) technique is enhanced, after which an approach based on the enhanced LSH is proposed for accurate and less-sensitive cross-platform recommendation decision-making.
Abstract: Recommender systems are a promising way for users to quickly find the valuable information they are interested in from massive data. Concretely, by capturing a user's personalized preferences, a recommender system can use collaborative filtering to return a list of recommended items that best match those preferences. However, in the big data environment, the heavily fragmented distribution of the QoS (Quality of Service) data used for recommendation decision-making presents a large challenge: the QoS data from different platforms must be integrated while ensuring that the sensitive user information they contain remains secure. Furthermore, due to the common tradeoff between data availability and privacy in data-driven applications, protecting the sensitive user information contained in the QoS data will probably decrease the availability of the QoS data and finally produce inaccurate recommendation results. Considering these challenges, we enhance the classic Locality-Sensitive Hashing (LSH) technique, after which we propose an approach based on the enhanced LSH for accurate and less-sensitive cross-platform recommendation decision-making. Finally, extensive experiments are designed and run on the reputable WS-DREAM dataset. The test reports prove the benefits of our work compared to other competitive approaches in terms of recommendation accuracy, efficiency, and privacy protection.

89 citations


Journal ArticleDOI
TL;DR: A big data-driven and nonparametric model aided by 6G is proposed in this article to extract similar traffic patterns over time for accurate and efficient short-term traffic flow prediction in massive IoT, which is mainly based on time-aware locality-sensitive hashing (LSH).
Abstract: With the advent of the Internet of Things (IoT) and the increasing popularity of intelligent transportation systems, a large number of sensing devices are installed on roads to monitor traffic dynamics in real time. These sensors collect streaming traffic data distributed across different traffic sites, which constitutes the main source of big traffic data. Analyzing and mining such big traffic data in massive IoT can help traffic administrations make scientific and reasonable traffic scheduling decisions, so as to avoid prospective traffic congestion. However, such traffic decision making often requires frequent and massive data transmissions between distributed sensors and centralized cloud computing centers, which calls for lightweight data integration and accurate data analysis based on large-scale traffic data. In view of this challenge, a big data-driven and nonparametric model aided by 6G is proposed in this article to extract similar traffic patterns over time for accurate and efficient short-term traffic flow prediction in massive IoT, based mainly on time-aware locality-sensitive hashing (LSH). We design a wide range of experiments on a real-world big traffic data set to validate the feasibility of our proposal. Experimental reports demonstrate that the prediction accuracy and efficiency of our proposal are increased by 32.6% and 97.3%, respectively, compared with two other competitive approaches.

34 citations
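
A minimal, generic sketch of this kind of nonparametric pattern matching is shown below: it hashes fixed-length traffic windows with sign random projections and predicts the next reading from the recency-weighted values that followed colliding historical windows. The window length, hash width, and decay factor are illustrative assumptions, and the sketch is not the authors' time-aware LSH model:

import numpy as np

def predict_next_flow(history, window=12, n_hashes=6, decay=0.9, seed=0):
    # Predict the next traffic reading by averaging what followed
    # similar (LSH-colliding) historical windows.  Sketch only.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((window, n_hashes))

    def bucket(w):
        w = (w - w.mean()) / (w.std() + 1e-9)          # normalize the window shape
        return tuple((w @ planes > 0).astype(int))     # sign-random-projection code

    qb = bucket(history[-window:])
    num, den = 0.0, 0.0
    for t in range(window, len(history) - 1):
        if bucket(history[t - window:t]) == qb:        # similar historical pattern
            age = (len(history) - 1) - t
            w = decay ** age                           # time-aware: newer patterns weigh more
            num += w * history[t]                      # the value that followed that window
            den += w
    return num / den if den > 0 else history[-1]       # fall back to the last reading

# toy usage: a noisy daily-periodic flow series
t = np.arange(24 * 30)
flows = 100 + 40 * np.sin(2 * np.pi * t / 24) + np.random.default_rng(1).normal(0, 5, t.size)
print(round(predict_next_flow(flows), 1))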


Proceedings ArticleDOI
11 Jul 2021
TL;DR: Xia et al. as discussed by the authors proposed a method based on Locality Sensitive Hashing (LSH) that can detect near-duplicates in sublinear time for a given query.
Abstract: Recently, research on explainable recommender systems has drawn much attention from both academia and industry, resulting in a variety of explainable models. As a consequence, their evaluation approaches vary from model to model, which makes it quite difficult to compare the explainability of different models. To achieve a standard way of evaluating recommendation explanations, we provide three benchmark datasets for EXplanaTion RAnking (denoted as EXTRA), on which explainability can be measured by ranking-oriented metrics. Constructing such datasets, however, poses great challenges. First, user-item-explanation triplet interactions are rare in existing recommender systems, so finding alternatives becomes a challenge. Our solution is to identify nearly identical sentences from user reviews. This idea then leads to the second challenge, i.e., how to efficiently categorize the sentences in a dataset into different groups, since estimating the similarity between every pair of sentences has quadratic runtime complexity. To mitigate this issue, we provide a more efficient method based on Locality Sensitive Hashing (LSH) that can detect near-duplicates in sub-linear time for a given query. Moreover, we make our code publicly available to allow researchers in the community to create their own datasets.

26 citations
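
The near-duplicate grouping step can be sketched with MinHash signatures and banding, the standard LSH recipe for Jaccard similarity on shingled sentences. The shingle size, signature length, and banding parameters below are illustrative, and this is a generic sketch rather than the authors' released code:

import random
from collections import defaultdict
from itertools import combinations

def shingles(sentence, k=3):
    # Character k-shingles of a lower-cased sentence.
    s = sentence.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, num_perm=32, seed=42):
    # One salted hash per "permutation"; keep the minimum value for each.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return tuple(min(hash((salt, sh)) for sh in shingle_set) for salt in salts)

def lsh_buckets(sentences, bands=8, rows=4):
    # Band the signatures so near-duplicates collide in at least one band.
    buckets = defaultdict(list)
    for idx, sent in enumerate(sentences):
        sig = minhash_signature(shingles(sent), num_perm=bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    return buckets

sentences = [
    "The battery life of this phone is great",
    "Battery life of this phone is great!",
    "The camera quality is disappointing",
]
candidates = {pair
              for ids in lsh_buckets(sentences).values()
              for pair in combinations(sorted(ids), 2)}
print(candidates)   # with high probability only the two battery-life sentences pair up

Sentences whose signatures agree in at least one band become candidate near-duplicates, so only those pairs need an exact similarity check instead of all O(n^2) pairs.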



Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this article, a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index is proposed. Sequential SCAN variants are prohibitively slow on large graphs, and existing parallel variants do not effectively share work among queries with different SCAN parameter settings, which precomputing an index addresses.
Abstract: SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share work among queries with different SCAN parameter settings. Since users of SCAN often explore many parameter settings to find good clusterings, it is worthwhile to precompute an index that speeds up queries. This paper presents a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel algorithm improves upon the asymptotic work of the sequential algorithm by using integer sorting. It is also highly parallel, achieving logarithmic span (parallel time) for both index construction and clustering queries. Furthermore, we apply locality-sensitive hashing (LSH) to design a novel approximate SCAN algorithm and prove guarantees for its clustering behavior. We present an experimental evaluation of our algorithms on large real-world graphs. On a 48-core machine with two-way hyper-threading, our parallel index construction achieves 50--151× speedup over the construction of GS*-Index. In fact, even on a single thread, our index construction algorithm is faster than GS*-Index. Our parallel index query implementation achieves 5--32× speedup over GS*-Index queries across a range of SCAN parameter values, and our implementation is always faster than ppSCAN, a state-of-the-art parallel SCAN algorithm. Moreover, our experiments show that applying LSH results in faster index construction while maintaining good clustering quality.

13 citations
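
The structural similarity that SCAN thresholds can be approximated with MinHash sketches of node neighborhoods, which is the role LSH plays here. Below is a minimal sketch of that approximation only (not the parallel algorithm or its guarantees); the signature length and toy graph are illustrative:

import random
from math import sqrt

def minhash(s, num_perm=64, seed=7):
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def approx_structural_similarity(adj, u, v, num_perm=64):
    # Approximate SCAN's structural similarity
    #   sigma(u, v) = |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|)
    # using a MinHash estimate of the Jaccard similarity of closed neighborhoods.
    Nu, Nv = adj[u] | {u}, adj[v] | {v}
    hu, hv = minhash(Nu, num_perm), minhash(Nv, num_perm)
    jaccard = sum(a == b for a, b in zip(hu, hv)) / num_perm
    inter = jaccard * (len(Nu) + len(Nv)) / (1 + jaccard)   # intersection size recovered from Jaccard
    return inter / sqrt(len(Nu) * len(Nv))

# toy graph: two triangles joined by one edge
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(round(approx_structural_similarity(adj, 0, 1), 2))   # high: same triangle
print(round(approx_structural_similarity(adj, 2, 3), 2))   # lower: the bridge edge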


Proceedings ArticleDOI
09 Jun 2021
TL;DR: This paper introduces a new asymmetric transformation and develops the first two provable hyperplane hashing schemes, Nearest Hyperplane hashing (NH) and Furthest Hyperplane hashing (FH), for high-dimensional P2HNNS beyond the unit hypersphere.
Abstract: Point-to-Hyperplane Nearest Neighbor Search (P2HNNS) is a fundamental yet challenging problem with plenty of applications in various fields. Existing hyperplane hashing schemes enjoy sub-linear query time and achieve excellent performance on applications such as large-scale active learning with Support Vector Machines (SVMs). However, they deal with this problem only conditionally, under the strong assumption that all data objects are normalized, i.e., located on the unit hypersphere, and they may be arbitrarily bad without this assumption. In this paper, we introduce a new asymmetric transformation and develop the first two provable hyperplane hashing schemes, Nearest Hyperplane hashing (NH) and Furthest Hyperplane hashing (FH), for high-dimensional P2HNNS beyond the unit hypersphere. With this asymmetric transformation, we demonstrate that the hash functions of NH and FH are locality-sensitive to hyperplane queries, and both enjoy quality guarantees on query results. Moreover, we propose a data-dependent multi-partition strategy to boost the search performance of FH. NH can answer hyperplane queries in sub-linear time, while FH enjoys better practical performance. We evaluate NH and FH on five real-life datasets and show that they are around 3~100x faster than the best competitor on four out of five datasets, especially for recall in [20%, 80%]. Code is available at https://github.com/HuangQiang/P2HNNS.

10 citations
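
For reference, the quantity being searched is the Euclidean distance from a point to the query hyperplane. The brute-force baseline below only makes the problem concrete; it is the linear scan that NH and FH are designed to avoid, not the paper's schemes, and the data sizes and offset term are illustrative:

import numpy as np

def p2h_nearest(data, w, b=0.0):
    # Brute-force Point-to-Hyperplane NNS: return the point minimizing
    # its distance |w.x + b| / ||w|| to the query hyperplane.
    dists = np.abs(data @ w + b) / np.linalg.norm(w)
    return int(np.argmin(dists)), float(dists.min())

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 32))      # deliberately NOT normalized to the unit hypersphere
w, b = rng.standard_normal(32), 0.5       # hyperplane query (normal vector and offset)
print(p2h_nearest(X, w, b))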


Posted Content
TL;DR: This article proposes a query-efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks, which jointly leverages an attention mechanism and locality sensitive hashing (LSH) to reduce the query count.
Abstract: Existing black-box search methods have achieved high success rates in generating adversarial attacks against NLP models. However, such search methods are inefficient, as they do not consider the number of queries required to generate adversarial attacks. Also, prior attacks do not maintain a consistent search space while comparing different search methods. In this paper, we propose a query-efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks. Our attack jointly leverages an attention mechanism and locality sensitive hashing (LSH) to reduce the query count. We demonstrate the efficacy of our approach by comparing our attack with four baselines across three different search spaces. Further, we benchmark our results against the same search space used in prior attacks. Compared to prior attacks, on average we reduce the query count by 75% across all datasets and target models. We also demonstrate that our attack achieves a higher success rate than prior attacks in a limited-query setting.

10 citations


Proceedings ArticleDOI
09 Jun 2021
TL;DR: BiDens, as proposed in this paper, is a novel densification method that fills a sketch's empty bins with values from its non-empty bins in either the forward or backward direction more efficiently than existing densification methods.
Abstract: As an efficient tool for approximate similarity computation and search, Locality Sensitive Hashing (LSH) has been widely used in many research areas including databases, data mining, information retrieval, and machine learning. Classical LSH methods typically require hundreds or even thousands of hashing operations to compute the LSH sketch of each input item (e.g., a set or a vector); this cost is too expensive and even impractical for applications that must process data in real time. To address this issue, several fast methods such as OPH and BCWS have been proposed to efficiently compute LSH sketches; however, these methods may generate many sketches with empty bins, which can introduce large errors in similarity estimation and also limit their usage for fast similarity search. To solve this issue, we propose a novel densification method, BiDens. Compared with existing densification methods, BiDens fills a sketch's empty bins with values from its non-empty bins in either the forward or backward direction more efficiently. Furthermore, it densifies empty bins so as to satisfy the densification principle (i.e., the LSH property). Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.

9 citations
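
The setting can be sketched as follows: one-permutation hashing (OPH) produces a k-bin sketch in a single pass, but some bins come out empty, and a densification rule must fill them before the sketch can be used for estimation or indexing. The sketch below uses a naive forward-then-backward copy as the fill rule; it is a simplified stand-in for illustration only and, unlike BiDens, makes no claim to preserve the LSH property:

import random

def one_permutation_hash(items, k=16, seed=3):
    # One Permutation Hashing: hash each item once, split the hash range into
    # k bins, and keep the minimum value per bin (None marks an empty bin).
    rng = random.Random(seed)
    salt = rng.getrandbits(32)
    bins = [None] * k
    for x in items:
        h = hash((salt, x)) & 0xFFFFFFFF
        b, v = h % k, h // k
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    return bins

def densify(bins):
    # Toy densification: fill each empty bin from the bin visited just before,
    # in a forward pass and then a backward pass.
    k, out = len(bins), list(bins)
    for step in (1, -1):
        start = 0 if step == 1 else k - 1
        for i in range(start, start + step * k, step):
            if out[i] is None:
                out[i] = out[(i - step) % k]
    return out

def estimate_jaccard(s1, s2, k=32):
    a = densify(one_permutation_hash(s1, k))
    b = densify(one_permutation_hash(s2, k))
    return sum(x == y for x, y in zip(a, b)) / k

A, B = set(range(0, 60)), set(range(20, 80))     # true Jaccard = 40 / 80 = 0.5
print(estimate_jaccard(A, B))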


Journal ArticleDOI
TL;DR: Experiments carried out on the collected multi-sensor database show that the proposed indexing approach greatly improves the performance of fingerprint indexing and outperforms existing ones in almost all the cases.
Abstract: Searching for the identity of an unknown fingerprint in large databases is very challenging. The Minutia Cylinder-Code (MCC) has proved very effective in mapping a minutiae-based representation (positions and directions only) into a set of fixed-length, transformation-invariant binary vectors. Based on MCC, a Locality-Sensitive Hashing (LSH) scheme has been designed to index fingerprints in large databases, using a numerical approximation of the similarity between MCC vectors. However, this LSH scheme is not robust enough when there is some distortion between the template and the searched samples, such as fingerprints captured by different sensors. In this paper, we propose a finer hash bit selection method based on LSH. In addition, we take into consideration another feature, the single maximum collision, for indexing, and fuse the candidate lists produced by both indexing methods into the final candidate list. Experiments carried out on our collected multi-sensor database (2D and 3D databases) show that the proposed indexing approach greatly improves the performance of fingerprint indexing. Extensive evaluation was also conducted on public benchmark databases for fingerprint indexing, and the results demonstrate that the new approach outperforms existing ones in almost all cases.

9 citations
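
Hash-bit selection presupposes the basic LSH indexing of fixed-length binary codes, which can be sketched generically: each hash table keys records by a subset of bit positions, so codes that agree on those bits collide even if they differ elsewhere. The sketch below samples bit positions at random (the paper's contribution is precisely a finer, non-random selection, which is not reproduced here), and the code length, table count, and distortion level are illustrative:

import random
from collections import defaultdict

def build_bit_sampling_index(codes, n_tables=8, bits_per_key=16, code_len=448, seed=5):
    # Generic bit-sampling LSH for binary codes (e.g., MCC-like vectors):
    # each table keys records by a random subset of bit positions.
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        positions = rng.sample(range(code_len), bits_per_key)
        table = defaultdict(list)
        for rec_id, code in codes.items():
            table[tuple(code[p] for p in positions)].append(rec_id)
        tables.append((positions, table))
    return tables

def query(tables, code):
    # Union of all records colliding with the query in at least one table.
    candidates = set()
    for positions, table in tables:
        candidates.update(table.get(tuple(code[p] for p in positions), []))
    return candidates

rng = random.Random(0)
codes = {i: [rng.randint(0, 1) for _ in range(448)] for i in range(1000)}
probe = list(codes[0])
for p in rng.sample(range(448), 20):      # flip a few bits to simulate sensor distortion
    probe[p] ^= 1
print(0 in query(build_bit_sampling_index(codes), probe))   # True with very high probability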


Journal ArticleDOI
TL;DR: Comprehensive experiments on fingerprint vectors derived from several FVC fingerprint benchmarks, together with rigorous analysis, demonstrate decent secret retrieval performance while offering strong resilience against six major security and privacy attacks.
Abstract: Biometric cryptosystems for secret binding, such as fuzzy vault and fuzzy commitment, are provably secure and offer a convenient way to manage and protect secrets. Although numerous practical schemes have been reported, they are deficient in resisting several security and privacy attacks. In this paper, we propose a novel bio-cryptosystem based on three key ingredients, namely Index of Maximum (IoM) hashing, (m, k)-threshold secret sharing, and the notion of b-band mini vaults. IoM hashing is motivated by ranking-based Locality Sensitive Hashing theory and is meant for non-invertible transformation. On the other hand, the (m, k)-threshold secret sharing scheme and the b-band mini vaults manage to overcome inherent limitations of biometric cryptosystems when integrated with IoM hashing. The proposed scheme strikes a balance between performance and privacy/security protection. Unlike fuzzy vault and fuzzy commitment, which are primarily devised for unordered and binary biometrics, respectively, our scheme is tailored for feature-vector-based biometrics (vectorial biometrics). Comprehensive experiments on fingerprint vectors derived from several FVC fingerprint benchmarks and rigorous analysis demonstrate decent secret retrieval performance while offering strong resilience against six major security and privacy attacks.
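
A common variant of IoM hashing uses Gaussian random projections: each output symbol is the index of the largest of k random projections of the feature vector, so only rank information survives. The sketch below illustrates that variant with arbitrary parameters (m = 64 codes, k = 16 projections each); the seed stands in for a user-specific token, and this is not the paper's full scheme (no secret sharing or mini vaults):

import numpy as np

def iom_hash(feature, m=64, k=16, seed=123):
    # Index-of-Maximum hashing, Gaussian random projection variant: for each
    # of m codes, project the feature vector onto k random directions and
    # keep only the index of the largest projection (a rank-based code).
    rng = np.random.default_rng(seed)            # seed plays the role of a user-specific token
    projections = rng.standard_normal((m, k, feature.shape[0])) @ feature   # shape (m, k)
    return np.argmax(projections, axis=1)                                   # m indices in [0, k)

rng = np.random.default_rng(0)
enrolled = rng.standard_normal(300)                          # e.g., a fingerprint feature vector
probe = enrolled + 0.3 * rng.standard_normal(300)            # noisy capture of the same finger
impostor = rng.standard_normal(300)

same = np.mean(iom_hash(enrolled) == iom_hash(probe))
diff = np.mean(iom_hash(enrolled) == iom_hash(impostor))
print(f"matching code symbols: genuine {same:.2f} vs impostor {diff:.2f}")

Genuine captures of the same finger keep most argmax indices, while an impostor agrees on roughly 1/k of them, which is what makes the non-invertible code still usable for matching.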

Journal ArticleDOI
TL;DR: pQCF significantly outperforms state-of-the-art hashing-based CF, and QCF further increases recommendation accuracy compared to pQCF; block coordinate descent is developed for efficient optimization, and the learning of latent factors is shown to be seamlessly integrated with quantization.
Abstract: Because of strict response-time constraints, the efficiency of top-k recommendation is crucial for real-world recommender systems. Locality sensitive hashing and index-based methods usually store both index data and item feature vectors in main memory, so they can handle only a limited number of items. Hashing-based recommendation methods enjoy low memory cost and fast retrieval of items, but suffer from large accuracy degradation. In this paper, we propose product Quantized Collaborative Filtering (pQCF) for a better trade-off between efficiency and accuracy. pQCF decomposes the joint latent space of users and items into a Cartesian product of low-dimensional subspaces and learns a clustered representation within each subspace. A latent factor is then represented by a short code composed of subspace cluster indexes. A user's preference for an item can be efficiently calculated via table lookup. We then develop block coordinate descent for efficient optimization and reveal that the learning of latent factors is seamlessly integrated with quantization. We further investigate an asymmetric pQCF, dubbed QCF, where user latent factors are not quantized and are shared across different subspaces. Extensive experiments with 6 real-world datasets show that pQCF significantly outperforms state-of-the-art hashing-based CF and that QCF increases recommendation accuracy compared to pQCF.
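
The table-lookup scoring that makes product quantization fast can be sketched generically. The code below quantizes item factors post hoc with per-subspace k-means and scores items for an unquantized user factor by summing looked-up partial dot products; the subspace count and codebook size are illustrative, and the paper instead learns the quantization jointly with the latent factors rather than as a separate step:

import numpy as np
from sklearn.cluster import KMeans

def train_pq(latent, n_subspaces=4, n_clusters=256):
    # Split each latent vector into subvectors and learn a small codebook
    # (k-means centroids) per subspace; each item becomes n_subspaces bytes.
    subvecs = np.split(latent, n_subspaces, axis=1)
    codebooks = [KMeans(n_clusters=n_clusters, n_init=2, random_state=0).fit(sv)
                 for sv in subvecs]
    codes = np.stack([cb.predict(sv) for cb, sv in zip(codebooks, subvecs)], axis=1)
    return codebooks, codes.astype(np.uint8)

def top_k(user, codebooks, codes, k=10):
    # Precompute the user's dot product with every centroid of every subspace,
    # then score each item by summing the looked-up entries.
    user_sub = np.split(user, len(codebooks))
    tables = [cb.cluster_centers_ @ u for cb, u in zip(codebooks, user_sub)]
    scores = sum(table[codes[:, s]] for s, table in enumerate(tables))
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
items = rng.standard_normal((5000, 32)).astype(np.float32)   # item latent factors
user = rng.standard_normal(32).astype(np.float32)            # user latent factor (unquantized, as in QCF)
codebooks, codes = train_pq(items)
print(top_k(user, codebooks, codes))

With 4 subspaces of 256 centroids, each item costs 4 bytes of memory and 4 table lookups per score, which is what makes this kind of representation attractive under strict response-time constraints.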

Journal ArticleDOI
Deng Cai
TL;DR: In this article, a simple but effective novel hash index search approach is proposed and a thorough comparison of eleven popular hashing algorithms is carried out; surprisingly, random-projection-based LSH ranks first, contradicting the claims in the other ten hashing articles.
Abstract: Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. During the past decade, numerous hashing algorithms have been proposed to solve this problem. Every proposed algorithm claims to outperform Locality Sensitive Hashing (LSH), which is the most popular hashing method. However, the evaluations in these hashing articles were not thorough enough, and the claims should be re-examined. If implemented correctly, almost all hashing methods improve as the code length increases, yet many existing hashing articles only report performance with code lengths shorter than 128. In this article, we carefully revisit the problem of search with a hash index and analyze the pros and cons of two popular hash index search procedures. We then propose a simple but effective novel hash index search approach and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, random-projection-based Locality Sensitive Hashing ranks first, which contradicts the claims in all of the other ten hashing articles. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the code used in the article is released on GitHub, which can be used as a testing platform for a fair comparison between various hashing algorithms.
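
In its simplest form, random-projection LSH with a hash index reduces to Hamming ranking over sign-random-projection codes followed by exact re-ranking of a shortlist. The sketch below shows that baseline; the code length, shortlist size, and dataset are arbitrary choices, and the article's proposed search procedure is more refined than this:

import numpy as np

def srp_codes(X, n_bits, seed=0):
    # Binary codes from sign random projections (the classic LSH for cosine).
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return X @ planes > 0

def search(X, codes, q, q_code, n_candidates=100, k=10):
    # Hamming ranking: shortlist by Hamming distance on the binary codes,
    # then re-rank the shortlist with exact Euclidean distances.
    hamming = np.count_nonzero(codes != q_code, axis=1)
    shortlist = np.argpartition(hamming, n_candidates)[:n_candidates]
    exact = np.linalg.norm(X[shortlist] - q, axis=1)
    return shortlist[np.argsort(exact)[:k]]

rng = np.random.default_rng(1)
X = rng.standard_normal((50000, 128)).astype(np.float32)
q = X[0] + 0.05 * rng.standard_normal(128).astype(np.float32)
codes = srp_codes(X, n_bits=256)                      # longer codes help, as the article argues
result = search(X, codes, q, srp_codes(q[None, :], n_bits=256)[0])
print(result[:5])                                     # the perturbed copy of item 0 should rank at or near the top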

Proceedings ArticleDOI
09 Jun 2021
TL;DR: LDME, as discussed by the authors, is a correction-set-based graph summarization algorithm that produces compact output representations in a fast and scalable manner by using weighted locality sensitive hashing to reduce the number of comparisons required to find good node merges.
Abstract: Summarizing graphs is of paramount importance due to the diverse applications of large-scale graph analysis. A popular family of summarization methods is the group-based approach. The general idea consists of merging nodes of the original graph into supernodes of the summary graph, encoding original edges into superedges/correction set edges, and dropping certain superedges or correction set edges (for lossy summarization). The current state of the art has several steps in its computation that are serious bottlenecks in terms of running time and scalability. In this work, we propose the algorithm LDME, a correction-set-based graph summarization algorithm that produces compact output representations in a fast and scalable manner. To achieve this, we introduce (1) weighted locality sensitive hashing to drastically reduce the number of comparisons required to find good node merges, (2) an efficient way to compute the best quality merges that produces more compact outputs, and (3) a new sort-based encoding algorithm that is faster and more robust. Moreover, our algorithm provides performance tuning settings that allow trading compression for running time. On high compression settings, LDME achieves compression equal to or better than the state of the art with up to 53x speedup in running time. On high speed settings, LDME achieves up to two orders of magnitude speedup with only slightly lower compression.

Journal ArticleDOI
TL;DR: An adaptation of the ReliefF algorithm that simplifies its costliest step by approximating the nearest neighbor graph using locality-sensitive hashing (LSH), enabling it to process data sets that are too large for the original ReliefF.
Abstract: Feature selection algorithms, such as ReliefF, are very important for processing high-dimensionality data sets. However, widespread use of such popular and effective algorithms is limited by their computational cost. We describe an adaptation of the ReliefF algorithm that simplifies its costliest step by approximating the nearest neighbor graph using locality-sensitive hashing (LSH). The resulting ReliefF-LSH algorithm can process data sets that are too large for the original ReliefF, a capability further enhanced by a distributed implementation in Apache Spark. Furthermore, ReliefF-LSH obtains better results and is more generally applicable than currently available alternatives to the original ReliefF, as it can handle regression and multiclass data sets. The fact that it does not require any additional hyperparameters with respect to ReliefF also avoids costly tuning. A set of experiments demonstrates the validity of this new approach and confirms its good scalability.

Journal ArticleDOI
TL;DR: The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks and Long Short-Term Memory hidden units, to transform each tuple into a word representation distribution that captures similarities among tuples.
Abstract: In the Big Data era, Entity Resolution (ER) faces many challenges, such as high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the requirement of data quality evaluation. Moreover, despite more than seventy years of development efforts, there is still a high demand for democratizing ER to reduce human participation in tuning parameters, labeling data, defining blocking functions, and feature engineering. This study explores a novel Stacked Dedupe Learning ER system with high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) and Long Short-Term Memory (LSTM) hidden units, to transform each tuple into a word representation distribution that captures similarities among tuples. Also, where pre-trained word embeddings were not available, ways to learn and tune word representation distributions customized for ER tasks under different scenarios were considered. Furthermore, a Locality Sensitive Hashing (LSH) based blocking approach, which considers all attributes of a tuple and produces smaller blocks than traditional methods that use only a few attributes, was assessed. The algorithm was tested on multiple datasets, including benchmarks and multilingual data. The experimental results show that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared to existing solutions.

Proceedings ArticleDOI
Samuel McCauley
01 Jan 2021
TL;DR: This work achieves the first bounds for any approximation factor c, via a simple and easy-to-implement hash function, and shows how to apply these ideas to the closely-related Approximate Nearest Neighbor problem for edit distance, obtaining similar time bounds.
Abstract: Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess n strings of length d, to quickly answer queries q of the form: if there is a database string within edit distance r of q, return a database string within edit distance cr of q. Previous approaches to this problem either rely on very large (superconstant) approximation ratios c, or very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time O(d · 3^r · n^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.

Book ChapterDOI
29 Sep 2021
TL;DR: In this paper, the authors derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree and M-tree) and discuss fast approximations for it.
Abstract: Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, cosine similarity has become a popular alternative to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, cosine similarity is not a metric and does not satisfy the standard triangle inequality. Instead, many search techniques for cosine rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); we show that this bound is tight and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possibly other similarity measures, beyond the existing work for distance metrics.
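
The flavor of such a bound can be seen by passing through the angle theta = arccos(sim), which is a proper metric on the unit sphere. The following is one standard way to obtain a two-sided bound of this kind and may differ in details from the exact inequality and tightness result in the paper:

\theta_{xy} := \arccos\bigl(\operatorname{sim}(x,y)\bigr), \qquad
\lvert \theta_{AB} - \theta_{BC} \rvert \;\le\; \theta_{AC} \;\le\; \theta_{AB} + \theta_{BC}

\operatorname{sim}(A,C) \;\ge\;
\operatorname{sim}(A,B)\,\operatorname{sim}(B,C)
  - \sqrt{\bigl(1-\operatorname{sim}(A,B)^2\bigr)\bigl(1-\operatorname{sim}(B,C)^2\bigr)}

\operatorname{sim}(A,C) \;\le\;
\operatorname{sim}(A,B)\,\operatorname{sim}(B,C)
  + \sqrt{\bigl(1-\operatorname{sim}(A,B)^2\bigr)\bigl(1-\operatorname{sim}(B,C)^2\bigr)}

The upper bound is what a search structure needs for pruning: if the largest similarity a subtree could still achieve, computed through its routing object B, falls below the best similarity found so far, that subtree can be skipped without evaluating sim(A, C) exactly.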

Proceedings ArticleDOI
14 Aug 2021
TL;DR: A comprehensive review of high-dimensional similarity query processing for data science can be found in this article, where the authors introduce solutions from a variety of research communities, including data mining (DM), database (DB), machine learning (ML), computer vision (CV), natural language processing (NLP), and theoretical computer science (TCS).
Abstract: Similarity query (a.k.a. nearest neighbor query) processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications (e.g., classification & regression, deduplication, image retrieval, and recommender systems). Recently, representation learning and auto-encoding methods as well as pre-trained models have gained popularity. They basically deal with dense high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing. Meanwhile, new techniques have emerged to tackle this long-standing problem theoretically and empirically. This tutorial aims to provide a comprehensive review of high-dimensional similarity query processing for data science. It introduces solutions from a variety of research communities, including data mining (DM), database (DB), machine learning (ML), computer vision (CV), natural language processing (NLP), and theoretical computer science (TCS), thereby highlighting the interplay between modern computer science and artificial intelligence technologies. We first discuss the importance of high-dimensional similarity query processing in data science applications, and then review query processing algorithms such as cover tree, locality sensitive hashing, product quantization, proximity graphs, as well as recent advancements such as learned indexes. We analyze their strengths and weaknesses and discuss the selection of algorithms in various application scenarios. Moreover, we consider the selectivity estimation of high-dimensional similarity queries, and show how researchers are bringing in state-of-the-art ML techniques to address this problem. We expect that this tutorial will provide an impetus towards new technologies for data science.

Journal ArticleDOI
TL;DR: By integrating the power-law distribution of travel data with tourism recommendation technology, the authors' work solves the problem in traditional TRSs that recommendation results are overly narrow and lack serendipity, providing users with a wider range of choices and hence improving the user experience in TRSs.
Abstract: One challenge for tourism recommendation systems (TRSs) is the long-tail phenomenon of ratings or popularity among tourist products. This paper aims to improve the diversity and efficiency of TRSs by utilizing the power-law distribution of long-tail data. Using Sina Weibo check-in data as an example, the paper demonstrates that the long-tail phenomenon exists in user travel behaviors and fits the long-tail travel data with a power-law distribution. To address data sparsity in the long-tail part and increase the recommendation diversity of TRSs, the paper proposes a collaborative filtering (CF) recommendation algorithm combined with the power-law distribution. Furthermore, by combining the power-law distribution with locality sensitive hashing (LSH), the paper optimizes user similarity calculation to improve the calculation efficiency of TRSs. Comparison experiments show that the proposed algorithm greatly improves recommendation diversity and calculation efficiency while maintaining high precision and recall, providing a basis for further dynamic recommendation. TRSs offer a better solution to the problem of information overload in the tourism field; however, based on historical travel data over the whole population, most current TRSs tend to recommend hot and similar spots to users, lacking diversity and failing to provide personalized recommendations. Meanwhile, the large, high-dimensional, sparse data in online social networks (OSNs) brings a huge computational cost when calculating user similarity with traditional CF algorithms. In this paper, by integrating the power-law distribution of travel data with tourism recommendation technology, the authors' work solves the problem in traditional TRSs that recommendation results are overly narrow and lack serendipity, and provides users with a wider range of choices, hence improving the user experience in TRSs. Meanwhile, utilizing locality sensitive hash functions, the authors' work hashes users from high-dimensional vectors to one-dimensional integers and maps similar users into the same buckets, which enables fast nearest neighbor search in high-dimensional space and addresses the extreme sparsity of high-dimensional travel data. Furthermore, applying the hashing results to user similarity calculation greatly reduces computational complexity and improves the calculation efficiency of TRSs, which reduces the system load and enables TRSs to provide effective and timely recommendations for users.

Proceedings ArticleDOI
10 May 2021
TL;DR: NetSHa, as discussed by the authors, exploits the in-network computational capacity provided by programmable switches and designs a sort-reduce approach that drops potentially poor candidate answers and aggregates the good candidate answers.
Abstract: Locality Sensitive Hashing (LSH) is widely adopted to index similar data in high-dimensional space for approximate nearest neighbor search. With the rapid growth of datasets, recent interest in LSH has moved to the implementation of distributed search systems with low response time and high throughput. However, as the scale of concurrent queries and the volume of available data grow, large amounts of index messages still need to be transmitted to centralized servers for candidate-answer reduction and re-sorting. Hence, the network remains the bottleneck in distributed search systems. To address this gap, we turn our efforts to the network itself and propose NetSHa. NetSHa exploits the in-network computational capacity provided by programmable switches. Specifically, NetSHa designs a sort-reduce approach that drops potentially poor candidate answers and aggregates the good candidate answers on programmable switches, while preserving search quality. We implement NetSHa on Barefoot Tofino switches and evaluate it using three datasets (Random, Wiki, and Image). The experimental results show that NetSHa reduces the packet volume by up to 10 times and improves search efficiency by at least 3x, in comparison with typical LSH-based distributed search frameworks.

Journal ArticleDOI
TL;DR: In this article, the authors propose a combined method for the identification of near-duplicates in electronic scientific papers, which unifies methods for identifying near-duplicates in various types of content, including text data, mathematical formulas, and numerical data.
Abstract: Methods for identifying near-duplicates in electronic scientific papers containing content of the same type, for example text data, mathematical formulas, or numerical data, are described. For text data, a locality-sensitive hashing method is formalized that computes the Hamming distance between elements of the indices of electronic scientific papers; if the Hamming distance does not exceed a fixed numerical threshold, the papers are treated as containing near-duplicates. For numerical data, sub-sequences are formed for each scientific work, and the proximity between papers is determined as the Euclidean distance between vectors consisting of the numbers of these sub-sequences. To compare mathematical formulas, a method that compares formula patterns is used and the names of variables are compared. To identify near-duplicates in graphic information, two directions are distinguished: finding key points in the image, and applying locality-sensitive hashing to individual pixels of the image. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text information. A combined method for identifying near-duplicates in electronic scientific papers is proposed, which brings together the methods for identifying near-duplicates in the various types of data. To implement the combined method, an information-analytical system was devised that processes scientific materials depending on the content type. This makes it possible to reliably identify near-duplicates and to detect possible abuses and plagiarism in electronic scientific papers as broadly as possible: scientific articles, dissertations, monographs, conference materials, etc.
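
For the text-data component, a standard instance of this approach is a SimHash fingerprint per document compared by Hamming distance, where a small distance between fingerprints signals a near-duplicate. The sketch below is a generic illustration with an arbitrary fingerprint width, not the authors' formalization:

import hashlib

def simhash(text, n_bits=64):
    # SimHash fingerprint: sum signed bit contributions of word hashes,
    # then keep the sign pattern as an n_bits-bit integer.
    weights = [0] * n_bits
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(n_bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(n_bits) if weights[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc1 = "the experimental results confirm the effectiveness of the proposed method"
doc2 = "experimental results confirm the effectiveness of our proposed method"
doc3 = "we derive a closed form solution for the regularized problem"

print(hamming(simhash(doc1), simhash(doc2)))   # small distance: near-duplicates
print(hamming(simhash(doc1), simhash(doc3)))   # large distance: unrelated texts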

Journal ArticleDOI
TL;DR: This work proposes a new secure multi-keyword fuzzy search scheme for encrypted cloud data that leverages a random redundancy method to handle the determinism of the Bloom filter and resist the SNMF attack, and that allows users to conduct complicated fuzzy searches with the logic operations "AND", "OR", and "NOT" to meet more flexible and fine-grained query demands.
Abstract: Fuzzy keyword search is a necessary and important feature of information retrieval in modern cloud storage services, since users with insufficient knowledge may input typos or keywords with inconsistent formats. However, most existing fuzzy search schemes adopt Bloom filters and locality sensitive hashing, which cannot resist the Sparse Non-negative Matrix Factorization based attack (SNMF attack). To this end, we propose a new secure multi-keyword fuzzy search scheme for encrypted cloud data; our scheme leverages a random redundancy method to handle the determinism of the Bloom filter and resist the SNMF attack. Beyond privacy, our scheme uses a tree-based index construction to improve search efficiency and allows users to conduct complicated fuzzy searches with the logic operations "AND", "OR", and "NOT", which can meet more flexible and fine-grained query demands. Theoretical analysis and experiments on real-world data show the security and high performance of our scheme.

Journal ArticleDOI
TL;DR: The results show that the proposed VCQ and VIQ algorithms can both achieve much higher accuracy than state-of-the-art quantization methods, and although VCQ performs better than VIQ, ANN search with VIQ provides much higher search efficiency.
Abstract: Approximate Nearest Neighbor (ANN) search is the core problem in many large-scale machine learning and computer vision applications such as multimodal retrieval. Hashing is becoming increasingly popular, since it can provide efficient similarity search and compact data representations suitable for handling such large-scale ANN search problems. Most hashing algorithms concentrate on learning more effective projection functions. However, the accuracy loss in the quantization step has been ignored and barely studied. In this paper, we analyse the importance of various projected dimensions, distribute them into several groups and quantize them with two types of values which can both better preserve the neighborhood structure among data. One is Variable Integer-based Quantization (VIQ) that quantizes each projected dimension with integer values. The other is Variable Codebook-based Quantization (VCQ) that quantizes each projected dimension with corresponding codebook values. We conduct experiments on five common public data sets containing up to one million vectors. The results show that the proposed VCQ and VIQ algorithms can both achieve much higher accuracy than state-of-the-art quantization methods. Furthermore, although VCQ performs better than VIQ, ANN search with VIQ provides much higher search efficiency.

Book ChapterDOI
29 Jan 2021
TL;DR: Locality Sensitive Hashing (LSH) as mentioned in this paper is a popular technique for finding approximate nearest neighbors in high-dimensional spaces; it provides theoretical guarantees on the query results and is highly scalable.
Abstract: Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good enough results for faster performance. Locality Sensitive Hashing (LSH) is a popular technique for finding approximate nearest neighbors. There are two main benefits of LSH techniques: they provide theoretical guarantees on the query results, and they are highly scalable. The most dominant costs for existing external memory-based LSH techniques are algorithm time and index I/Os required to find candidate points. Existing works do not compare both of these costs in their evaluation. In this experimental survey paper, we show the impact of both these costs on the overall performance. We compare three state-of-the-art techniques on six real-world datasets, and show the importance of comparing these costs to achieve a more fair comparison.

Posted Content
TL;DR: Locality Sensitive Hashing (LSH) is one of the most popular techniques for approximate nearest neighbor search in high-dimensional spaces, as discussed by the authors, and its main benefits are its sub-linear query performance and theoretical guarantees on the query accuracy.
Abstract: Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for approximate nearest neighbor search in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any prior survey, we present how Locality Sensitive Hashing is utilized in different application domains.

Journal ArticleDOI
TL;DR: An efficient MVIF-CBMR method based on late fusion that combines the retrieval results of the Medio-Lateral Oblique and Cranio-Caudal views; it takes two query ROIs corresponding to the two views as input and displays the most similar ROIs for each view using a dynamic similarity assessment.

Journal ArticleDOI
TL;DR: This approach exploits properties of single and multiple random projections, which allows meaningful auxiliary information to be stored at internal nodes of a random projection tree and priority functions to be designed to guide the search process, resulting in improved nearest neighbor search performance.

Journal ArticleDOI
17 Jun 2021
TL;DR: In this article, the authors study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability of being returned.
Abstract: Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability of being returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: LSHvec, as mentioned in this paper, leverages Locality Sensitive Hashing (LSH) for k-mer encoding and adopts skip-gram with negative sampling to learn k-mer embeddings.
Abstract: Drawing on the analogy between natural language and the "genomic sequence language", we explore the applicability of word embeddings from natural language processing (NLP) to represent DNA reads in metagenomics studies. Here, the k-mer is the equivalent of a word in NLP and has been widely used in analyzing sequence data. However, directly replacing word embedding with k-mer embedding is problematic for two reasons: first, the number of distinct k-mers is far larger than the number of different words in a vocabulary, making the model too large to store in memory; second, sequencing errors create many novel k-mers (noise), which significantly degrade model performance. In this work, we introduce LSHvec, a model that leverages Locality Sensitive Hashing (LSH) for k-mer encoding to overcome these challenges. After k-mers are LSH encoded, we adopt skip-gram with negative sampling to learn k-mer embeddings. Experiments on labeled metagenomic datasets demonstrate that k-mer encoding using LSH can not only accelerate training and reduce the memory required to store the model, but also achieve higher accuracy than alternative encoding methods. We validate that LSHvec is robust on reads with high sequencing error rates and works well with any sequencing technology. In addition, the trained low-dimensional k-mer embeddings can potentially be used for accurate metagenomic read clustering and taxonomic classification. Finally, we demonstrate the capability of LSHvec by participating in the second round of the CAMI challenges and show that LSHvec is able to handle metagenome datasets that exceed terabytes in size through distributed training across multiple nodes.
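
The key step, mapping similar k-mers to a shared token before embedding, can be sketched with a simple LSH over a one-hot base encoding. The encoding, k, and bucket width below are toy assumptions and not the paper's exact encoder; the point is only that a read with a sequencing error keeps most of its tokens, whereas every exact k-mer covering the error changes:

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def lsh_token(kmer, planes):
    # Map a k-mer to a bucket id with sign random projections over its
    # one-hot base encoding, so k-mers differing by a few errors tend
    # to share a token.
    onehot = np.zeros(4 * len(kmer))
    for i, base in enumerate(kmer):
        onehot[4 * i + BASES[base]] = 1.0
    bits = onehot @ planes > 0
    return int(bits @ (2 ** np.arange(len(bits))))

def read_to_tokens(read, k=15, n_bits=8, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((4 * k, n_bits))
    return [lsh_token(read[i:i + k], planes) for i in range(len(read) - k + 1)]

read  = "ACGTACGTTGCATGCATGCCGTA"
noisy = "ACGTACGTTGCATGGATGCCGTA"          # one substitution (sequencing error)
a, b = read_to_tokens(read), read_to_tokens(noisy)
exact_same = sum(read[i:i + 15] == noisy[i:i + 15] for i in range(len(a)))
lsh_same = sum(x == y for x, y in zip(a, b))
print(f"identical exact k-mers: {exact_same} of {len(a)}; identical LSH tokens: {lsh_same} of {len(a)}")
# Typically several LSH tokens survive the error while no exact k-mer does.
# The resulting token sequences would then be fed to skip-gram with negative
# sampling to learn one embedding per LSH bucket instead of per distinct k-mer.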