
Showing papers on "Locality-sensitive hashing published in 2010"


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work proposes a semi-supervised hashing method that is formulated as minimizing empirical error on the labeled data while maximizing variance and independence of hash bits over the labeled and unlabeled data.
Abstract: Large scale image search has recently attracted considerable attention due to easy availability of huge amounts of data. Several hashing methods have been proposed to allow approximate but highly efficient search. Unsupervised hashing methods show good performance with metric distances but, in image search, semantic similarity is usually given in terms of labeled pairs of images. There exist supervised hashing methods that can handle such semantic similarity but they are prone to overfitting when labeled data is small or noisy. Moreover, these methods are usually very slow to train. In this work, we propose a semi-supervised hashing method that is formulated as minimizing empirical error on the labeled data while maximizing variance and independence of hash bits over the labeled and unlabeled data. The proposed method can handle both metric as well as semantic similarity. The experimental results on two large datasets (up to one million samples) demonstrate its superior performance over state-of-the-art supervised and unsupervised methods.

662 citations
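For illustration, a minimal sketch of how such an objective can be optimized, assuming linear hash functions of the form h(x) = sign(w^T x): the labeled pairs enter through a pairwise matrix S (+1 for similar, -1 for dissimilar) and the unlabeled data through a variance term, so the projections can be taken as top eigenvectors of an adjusted covariance matrix. Function names and the weight eta are illustrative, not the paper's notation.

```python
import numpy as np

def semi_supervised_hash_projections(X, Xl, S, n_bits, eta=1.0):
    """Sketch: learn hash projections from labeled pairs plus unlabeled variance.

    X  : (d, n)  all data (labeled + unlabeled), assumed zero-centered
    Xl : (d, l)  labeled subset
    S  : (l, l)  pairwise label matrix, +1 similar, -1 dissimilar, 0 unknown
    """
    # Empirical-fitness term: encourages similar pairs to share bit signs.
    fitness = Xl @ S @ Xl.T
    # Variance term over all data: encourages balanced, informative bits.
    variance = X @ X.T
    M = fitness + eta * variance
    # Top eigenvectors give the projection directions for the hash bits.
    eigvals, eigvecs = np.linalg.eigh(M)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    return W

def hash_codes(W, X):
    # Binary codes: sign of each projection, written as 0/1 per bit.
    return (W.T @ X > 0).astype(np.uint8)
```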


Proceedings Article
21 Jun 2010
TL;DR: This paper proposes a novel data-dependent projection learning method such that each hash function is designed to correct the errors made by the previous one sequentially, and shows significant performance gains over the state-of-the-art methods on two large datasets containing up to 1 million points.
Abstract: Hashing based Approximate Nearest Neighbor (ANN) search has attracted much attention due to its fast query time and drastically reduced storage. However, most of the hashing methods either use random projections or extract principal directions from the data to derive hash functions. The resulting embedding suffers from poor discrimination when compact codes are used. In this paper, we propose a novel data-dependent projection learning method such that each hash function is designed to correct the errors made by the previous one sequentially. The proposed method easily adapts to both unsupervised and semi-supervised scenarios and shows significant performance gains over the state-of-the-art methods on two large datasets containing up to 1 million points.

357 citations
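A schematic reading of the sequential idea, not the paper's actual update rule: after each bit is learned, pairs that the current bits already handle are down-weighted so the next projection concentrates on the remaining errors. All names and the reweighting step are illustrative.

```python
import numpy as np

def sequential_projections(Xl, S, n_bits):
    """Schematic: learn hash bits one at a time, each correcting its predecessors.

    Xl : (d, l) labeled data,  S : (l, l) pairwise similarity (+1 / -1).
    """
    W = []
    S_residual = S.astype(float).copy()
    for _ in range(n_bits):
        # Fit one projection to the current (reweighted) pairwise matrix.
        M = Xl @ S_residual @ Xl.T
        eigvals, eigvecs = np.linalg.eigh(M)
        w = eigvecs[:, -1]                      # leading eigenvector
        W.append(w)
        bits = np.sign(Xl.T @ w)                # +/-1 bit per labeled point
        agreement = np.outer(bits, bits)        # +1 if a pair gets the same bit
        # Down-weight pairs this bit already handles; keep the rest for later bits.
        S_residual = S_residual - 0.5 * agreement * (np.sign(S_residual) == agreement)
    return np.stack(W, axis=1)                  # (d, n_bits)
```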


Journal ArticleDOI
TL;DR: This paper compares several families of space hashing functions in a real setup and reveals that an unstructured quantizer significantly improves the accuracy of LSH, as it closely fits the data in the feature space.

327 citations
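For context, the structured baseline such comparisons typically use is the E2LSH-style p-stable hash function for Euclidean distance, which quantizes a random projection. A minimal sketch follows; parameter names and the bucket width w are illustrative.

```python
import numpy as np

class E2LSHHash:
    """Classic p-stable LSH for Euclidean distance: h(v) = floor((a.v + b) / w)."""

    def __init__(self, dim, n_funcs, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal((n_funcs, dim))   # Gaussian = 2-stable
        self.b = rng.uniform(0.0, w, size=n_funcs)     # random offsets
        self.w = w                                     # quantization width

    def bucket(self, v):
        # The concatenated hash values identify one bucket of the table.
        return tuple(np.floor((self.a @ v + self.b) / self.w).astype(int))
```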


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper proposes a supervised hashing method, i.e., the LAbel-regularized Max-margin Partition (LAMP) algorithm, which generates hash functions in a weakly-supervised setting and provides a collision bound that goes beyond pairwise data interaction, based on Markov random field theory.
Abstract: The explosive growth of vision data motivates the recent studies on efficient data indexing methods such as locality-sensitive hashing (LSH). Most existing approaches perform hashing in an unsupervised way. In this paper we move one step forward and propose a supervised hashing method, i.e., the LAbel-regularized Max-margin Partition (LAMP) algorithm. The proposed method generates hash functions in a weakly-supervised setting, where a small portion of sample pairs are manually labeled as “similar” or “dissimilar”. We formulate the task as a Constrained Convex-Concave Procedure (CCCP), which can be relaxed into a series of convex sub-problems solvable by efficient Quadratic Programming (QP). The proposed hashing method has two further characteristics: 1) most existing LSH approaches rely on linear feature representations; unfortunately, kernel tricks are often more natural for gauging the similarity between visual objects in vision research, which corresponds to possibly infinite-dimensional Hilbert spaces. The proposed LAMP method has natural support for kernel-based feature representations. 2) traditional hashing methods assume uniform data distributions; typically, the collision probability of two samples in hash buckets is determined only by their pairwise similarity, unrelated to the contextual data distribution. In contrast, we provide a collision bound that goes beyond pairwise data interaction, based on Markov random field theory. Extensive empirical evaluations are conducted on five widely-used benchmarks. It takes only several seconds to generate a new hashing function, and the adopted random supporting-vector scheme makes the LAMP algorithm scalable to large-scale problems. Experimental results validate the superiority of the LAMP algorithm over state-of-the-art kernel-based hashing methods.

166 citations
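The kernel-based hash functions discussed above can be pictured as sign functions over a kernel expansion on a small set of support samples. A hedged sketch follows; the LAMP optimization of the weights via CCCP/QP is not reproduced, and all names are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_hash_bit(x, support_points, alphas, bias, kernel=rbf_kernel):
    """One kernel-based hash bit: sign of a kernel expansion over support samples.

    In a LAMP-style method the weights `alphas` and `bias` would come from solving
    the max-margin partition problem; here they are simply given as inputs.
    """
    score = sum(a * kernel(s, x) for a, s in zip(alphas, support_points)) - bias
    return 1 if score > 0 else 0
```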


Journal ArticleDOI
TL;DR: This work improves LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases, and extends the LSB technique to solve another classic problem, Closest Pair (CP) search, in high-dimensional space.
Abstract: Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated into a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms an LSB-forest that has strong quality guarantees, but dramatically improves the efficiency of the previous LSH implementation with the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, while returning much better results. As a second step, we extend our LSB technique to solve another classic problem, Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which most existing solutions fail to do. We show that, using an LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, while still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, giving a substantial reduction in space and running time. In our experiments, our technique was faster than: (i) distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, while producing better results.

107 citations
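A toy sketch of the LSB-tree idea described above, under simplifying assumptions: each point's LSH hash values are interleaved into a single one-dimensional Z-order key, which can then be stored in any ordinary B-tree or sorted index, and keys close to the query's key are nearest-neighbor candidates. The fixed bit width and the plain sorted list stand in for the real B-tree machinery.

```python
import bisect

def zorder_key(hash_values, bits_per_value=8):
    """Interleave the bits of m non-negative LSH hash values into one integer."""
    key = 0
    for bit in range(bits_per_value):
        for h in hash_values:
            key = (key << 1) | ((h >> (bits_per_value - 1 - bit)) & 1)
    return key

class TinyLSBTree:
    """Toy stand-in for an LSB-tree: (z-order key, point id) pairs kept sorted."""

    def __init__(self):
        self.entries = []          # sorted list of (key, point_id)

    def insert(self, hash_values, point_id):
        bisect.insort(self.entries, (zorder_key(hash_values), point_id))

    def candidates(self, hash_values, n=8):
        # Points whose keys are closest to the query key become NN candidates.
        i = bisect.bisect_left(self.entries, (zorder_key(hash_values), -1))
        lo, hi = max(0, i - n), min(len(self.entries), i + n)
        return [pid for _, pid in self.entries[lo:hi]]
```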


Proceedings Article
01 Jan 2010
TL;DR: This work considers the problem of processing K-Nearest Neighbor queries over large datasets where the index is jointly maintained by a set of machines in a computing cluster, and proposes the RankReduce approach, which uses locality sensitive hashing (LSH) together with a MapReduce implementation.
Abstract: We consider the problem of processing K-Nearest Neighbor (KNN) queries over large datasets where the index is jointly maintained by a set of machines in a computing cluster. The proposed RankReduce approach uses locality sensitive hashing (LSH) together with a MapReduce implementation, which by design is a perfect match, as the hashing principle of LSH can be smoothly integrated into the mapping phase of MapReduce. The LSH algorithm assigns similar objects to the same fragments in the distributed file system, which enables an effective selection of potential candidate neighbors that are then reduced to the set of K-Nearest Neighbors. We address problems arising from the different characteristics of MapReduce and LSH to achieve an efficient search process on the one hand and high LSH accuracy on the other. We discuss several pitfalls and give detailed descriptions of how to circumvent them. We evaluate RankReduce using both synthetic data and a dataset obtained from Flickr.com, demonstrating the suitability of the approach.

102 citations
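A single-machine sketch of how LSH slots into the two MapReduce phases described above: the mapper emits (bucket, point) pairs so similar points co-locate, and the reducer ranks the candidates sharing the query's bucket. The real system runs on a cluster and a distributed file system; everything below (names, bucket width, sizes) is illustrative.

```python
from collections import defaultdict
import numpy as np

def lsh_bucket(point, projections, w=4.0):
    # p-stable style bucket id (projections: (m, d) Gaussian matrix).
    return tuple(np.floor(projections @ point / w).astype(int))

def map_phase(points, projections):
    # Map: emit (bucket, (point_id, point)) so that similar points co-locate.
    for pid, p in enumerate(points):
        yield lsh_bucket(p, projections), (pid, p)

def reduce_phase(bucket_items, query, k):
    # Reduce: rank the candidates that share the query's bucket by true distance.
    ranked = sorted(bucket_items, key=lambda it: np.linalg.norm(it[1] - query))
    return [pid for pid, _ in ranked[:k]]

# Tiny single-process driver imitating the shuffle step.
rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 16))
projections = rng.standard_normal((6, 16))

grouped = defaultdict(list)
for bucket, value in map_phase(points, projections):
    grouped[bucket].append(value)

query = points[0]
print(reduce_phase(grouped[lsh_bucket(query, projections)], query, k=5))
```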


Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper presents a novel and fast algorithm for learning binary hash functions for fast nearest neighbor retrieval, links the method to the idea of maximizing conditional entropy among pairs of bits, and derives an extremely efficient linear-time hash learning algorithm.
Abstract: Searching for approximate nearest neighbors in large-scale, high-dimensional data sets has been a challenging problem. This paper presents a novel and fast algorithm for learning binary hash functions for fast nearest neighbor retrieval. The nearest neighbors are defined according to the semantic similarity between the objects. Our method uses the information of these semantic similarities and learns a hash function with binary codes such that only objects with high similarity have small Hamming distance. The hash function is incrementally trained one bit at a time, and as bits are added to the hash code, Hamming distances between dissimilar objects increase. We further link our method to the idea of maximizing conditional entropy among pairs of bits and derive an extremely efficient linear-time hash learning algorithm. Experiments on similar image retrieval and celebrity face recognition show that our method produces clear improvements in performance over some state-of-the-art methods.

89 citations


Book ChapterDOI
05 Sep 2010
TL;DR: A modified NBNN, based on the hypothesis that each local descriptor is drawn from a class-dependent probability measure, that generalizes to optimal combinations of feature types by solving the parameter selection problem via hinge-loss minimization.
Abstract: Naive Bayes Nearest Neighbor (NBNN) is a feature-based image classifier that achieves an impressive degree of accuracy [1] by exploiting 'Image-to-Class' distances and by avoiding quantization of local image descriptors. It is based on the hypothesis that each local descriptor is drawn from a class-dependent probability measure. The density of the latter is estimated by a non-parametric kernel estimator, which is further simplified under the assumption that the normalization factor is class-independent. While leading to significant simplification, the assumption underlying the original NBNN is too restrictive and considerably degrades its generalization ability. The goal of this paper is to address this issue. As we relax the incriminated assumption we are faced with a parameter selection problem that we solve by hinge-loss minimization. We also show that our modified formulation naturally generalizes to optimal combinations of feature types. Experiments conducted on several datasets show that the gain over the original NBNN may attain up to 20 percentage points. We also take advantage of the linearity of optimal NBNN to perform classification by detection through efficient sub-window search [2], with yet another performance gain. As a result, our classifier outperforms--in terms of misclassification error--methods based on support vector machines and bags of quantized features on some datasets.

89 citations
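For readers unfamiliar with the baseline, NBNN classification itself reduces to summing descriptor-to-class nearest-neighbor distances and picking the minimizing class. A brute-force sketch follows; practical systems replace the inner search with approximate nearest neighbors, and all names are illustrative.

```python
import numpy as np

def nbnn_classify(image_descriptors, class_descriptors):
    """Naive Bayes Nearest Neighbor: pick the class minimizing the summed
    distance from each local descriptor to its nearest neighbor in that class.

    image_descriptors : (n, d) descriptors of the query image
    class_descriptors : dict   class_name -> (m_c, d) descriptors of that class
    """
    scores = {}
    for cls, D in class_descriptors.items():
        # Squared distances from every query descriptor to every class descriptor.
        dists = ((image_descriptors[:, None, :] - D[None, :, :]) ** 2).sum(-1)
        scores[cls] = dists.min(axis=1).sum()
    return min(scores, key=scores.get)
```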


Proceedings Article
11 Jul 2010
TL;DR: This paper utilizes the norm-keeping property of p-stable functions to ensure that the collision probability of two data points reflects their non-metric distance in the original feature space, and investigates various concrete examples to validate the proposed algorithm.
Abstract: Non-metric distances are often more reasonable than metric ones in terms of consistency with human perception. However, existing locality-sensitive hashing (LSH) algorithms can only support data gauged with metrics. In this paper we propose a novel locality-sensitive hashing algorithm targeting such non-metric data. Data in the original feature space are embedded into an implicit reproducing kernel Kreĭn space and then hashed to obtain binary bits. Here we utilize the norm-keeping property of p-stable functions to ensure that the collision probability of two data points reflects their non-metric distance in the original feature space. We investigate various concrete examples to validate the proposed algorithm. Extensive empirical evaluations illustrate its effectiveness in terms of accuracy and retrieval speedup.

76 citations


Proceedings ArticleDOI
19 Jul 2010
TL;DR: This paper presents a novel near-duplicate document detection method that is not only more accurate than commonly used methods such as Shingles and I-Match, but also shows consistent improvement across domains, a desirable property that existing methods lack.
Abstract: In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash-values as document signatures through the locality sensitive hashing scheme for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, a desirable property that existing methods lack.

71 citations
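A hedged sketch of the signature pipeline outlined above: a document becomes a sparse k-gram vector (here with uniform weights, whereas the paper learns the weights), and random-hyperplane LSH turns it into a short bit signature whose Hamming distance tracks cosine similarity. Dimensions and names are illustrative.

```python
import hashlib
import numpy as np

def kgram_vector(text, k=3, dim=2**16):
    """Sparse k-gram counts folded into a fixed-size vector by hashing."""
    v = np.zeros(dim)
    for i in range(len(text) - k + 1):
        g = text[i:i + k]
        idx = int(hashlib.md5(g.encode()).hexdigest(), 16) % dim
        v[idx] += 1.0                 # the paper learns these weights instead
    return v

def signature(v, hyperplanes):
    # Random-hyperplane LSH: one bit per hyperplane, preserving cosine similarity.
    return (hyperplanes @ v > 0).astype(np.uint8)

rng = np.random.default_rng(0)
planes = rng.standard_normal((64, 2**16))
s1 = signature(kgram_vector("locality sensitive hashing"), planes)
s2 = signature(kgram_vector("locality-sensitive hashing"), planes)
print("Hamming distance:", int((s1 != s2).sum()))
```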


Proceedings ArticleDOI
25 Oct 2010
TL;DR: A novel multi-label propagation scheme that encodes the tag information of an image as a unit label confidence vector, which naturally imposes inter-label constraints and manipulates labels interactively, and outperforms state-of-the-art algorithms.
Abstract: Annotating a large-scale image corpus requires a huge amount of human effort and is thus generally unaffordable, which directly motivates the recent development of semi-supervised or active annotation methods. In this paper we revisit this notoriously challenging problem and develop a novel multi-label propagation scheme, whereby both the efficacy and accuracy of large-scale image annotation are further enhanced. Our investigation starts from a survey of previous graph-propagation-based annotation approaches, wherein we analyze their main drawbacks when scaling up to large-scale datasets and handling the multi-label setting. Our proposed scheme outperforms the state-of-the-art algorithms by making the following contributions. 1) Unlike previous approaches that propagate over each label independently, our proposed large-scale multi-label propagation (LSMP) scheme encodes the tag information of an image as a unit label confidence vector, which naturally imposes inter-label constraints and manipulates labels interactively. It then utilizes the probabilistic Kullback-Leibler divergence to formulate the multi-label propagation problem. 2) We perform the multi-label propagation on the so-called hashing-based L1-graph, which is efficiently derived with the Locality Sensitive Hashing approach followed by sparse L1-graph construction within the individual hashing buckets. 3) An efficient iterative procedure with provable convergence is presented for problem optimization. Extensive experiments on the NUS-WIDE dataset (both the lite version with 56k images and the full version with 270k images) validate the effectiveness and scalability of the proposed approach.

Proceedings ArticleDOI
29 Oct 2010
TL;DR: An image-based indoor localization system using omnidirectional panoramic images tagged with location information, combining robust image matching by PCA-SIFT with a fast nearest neighbor search algorithm based on Locality Sensitive Hashing (LSH).
Abstract: In this paper, we develop an image-based indoor localization system using omnidirectional panoramic images to which location information has been added. By combining robust image matching by PCA-SIFT with a fast nearest neighbor search algorithm based on Locality Sensitive Hashing (LSH), our system can estimate users' positions accurately and quickly. To improve the precision, we introduce the "confidence" of the image matching results. We conducted experiments at the Railway Museum, where we obtained 426 omnidirectional panoramic reference images and 1067 supplemental images for image matching. Experimental results using 126 test images demonstrate that the location detection accuracy is above 90% with about 2.2 s of processing time.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: It is shown that with hashing, the sparse representation can be recovered with high probability because hashing preserves the restricted isometry property, and a theoretical analysis of the recognition rate is presented.
Abstract: We propose a face recognition approach based on hashing. The approach yields recognition rates comparable to the random l1 approach [18], which is considered the state-of-the-art. But our method is much faster: it is up to 150 times faster than [18] on the YaleB dataset. We show that with hashing, the sparse representation can be recovered with high probability because hashing preserves the restricted isometry property. Moreover, we present a theoretical analysis of the recognition rate of the proposed hashing approach. Experiments show a very competitive recognition rate and significant speedup compared with the state-of-the-art.

Proceedings ArticleDOI
06 Jun 2010
TL;DR: This work presents a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near-linear space and has O(1) query time, and designs an experiment with TCAMs within an enterprise Ethernet switch to validate that TLSH can be used to perform 1.5 million queries per second per 1 Gb/s port.
Abstract: Similarity search methods are widely used as kernels in various data mining and machine learning applications including those in computational biology, web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e. for high dimensional spaces, best known solutions offer little improvement over brute force search and thus are unsuitable for large scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial in n, where n is the size of the database. In this work we present a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near-linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time, using a data structure near-linear in the size of the database. TCAMs are high performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents 0, 1 or *. The * is a wild card representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH), wherein we hash database entries represented by vectors in Euclidean space into {0,1,*}. By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with one TCAM access and storage nearly linear in the size of the database. We validate our claims with extensive simulations using both real-world (Wikipedia) and synthetic (but illustrative) datasets. We observe that, using a TCAM of width 288 bits, it is possible to solve the approximate NNS problem on a database of one million points with high accuracy. Finally, we design an experiment with TCAMs within an enterprise Ethernet switch (Cisco Catalyst 4500) to validate that TLSH can be used to perform 1.5 million queries per second per 1 Gb/s port. We believe that this work can open new avenues in very high speed data mining.
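The ternary trick can be illustrated in a few lines: projections that fall close to their quantization boundary are encoded as '*', so either bit value matches, which is what a TCAM lookup provides in hardware. The software simulation below is purely illustrative; the margin, widths and encoding are not the paper's.

```python
import numpy as np

def ternary_code(v, planes, margin=0.25):
    """Map a vector to {'0','1','*'}: '*' where the projection is near the boundary."""
    proj = planes @ v
    code = np.where(proj > 0, '1', '0')
    code[np.abs(proj) < margin] = '*'        # wildcard absorbs borderline bits
    return ''.join(code)

def tcam_match(stored, query):
    # A TCAM reports a match when every non-wildcard position agrees.
    return all(s == q or s == '*' or q == '*' for s, q in zip(stored, query))

rng = np.random.default_rng(1)
planes = rng.standard_normal((32, 8))
x = rng.standard_normal(8)
y = x + 0.05 * rng.standard_normal(8)        # a near-duplicate of x
print(tcam_match(ternary_code(x, planes), ternary_code(y, planes)))
```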

Proceedings ArticleDOI
26 Apr 2010
TL;DR: A new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing is proposed, which achieves improved compression while scaling to tens of millions of documents.
Abstract: Web search engines depend on the full-text inverted index data structure. Because query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reductions in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.

Book ChapterDOI
05 Sep 2010
TL;DR: Two new algorithms that extend Spectral Hashing to non-Euclidean spaces and are able to retrieve similar objects in as low as O(K) time complexity, where K is the number of clusters in the data.
Abstract: Approximate Nearest Neighbor (ANN) methods such as Locality Sensitive Hashing, Semantic Hashing, and Spectral Hashing, provide computationally efficient procedures for finding objects similar to a query object in large datasets. These methods have been successfully applied to search web-scale datasets that can contain millions of images. Unfortunately, the key assumption in these procedures is that objects in the dataset lie in a Euclidean space. This assumption is not always valid and poses a challenge for several computer vision applications where data commonly lies in complex non-Euclidean manifolds. In particular, dynamic data such as human activities are commonly represented as distributions over bags of video words or as dynamical systems. In this paper, we propose two new algorithms that extend Spectral Hashing to non-Euclidean spaces. The first method considers the Riemannian geometry of the manifold and performs Spectral Hashing in the tangent space of the manifold at several points. The second method divides the data into subsets and takes advantage of the kernel trick to perform non-Euclidean Spectral Hashing. For a data set of N samples the proposed methods are able to retrieve similar objects in as low as O(K) time complexity, where K is the number of clusters in the data. Since K ≪ N, our methods are extremely efficient. We test and evaluate our methods on synthetic data generated from the Unit Hypersphere and the Grassmann manifold. Finally, we show promising results on a human action database.

Proceedings ArticleDOI
01 Oct 2010
TL;DR: A novel k-nearest neighbor search algorithm (KNNS) for proximity computation in motion planning that exploits the computational capabilities of many-core GPUs.
Abstract: We present a novel k-nearest neighbor search algorithm (KNNS) for proximity computation in motion planning that exploits the computational capabilities of many-core GPUs. Our approach uses locality sensitive hashing and cuckoo hashing to construct an efficient KNNS algorithm that has linear space and time complexity and exploits the multiple cores and data parallelism effectively. In practice, we see an order-of-magnitude improvement in speed and scalability over prior GPU-based KNNS algorithms. On some benchmarks, our KNNS algorithm improves the performance of the overall planner by 20–40 times for a CPU-based planner and up to 2 times for a GPU-based planner.

Proceedings ArticleDOI
31 Aug 2010
TL;DR: An FPGA-based parallel architecture to accelerate the statistical identification of multimedia applications while maintaining high classification accuracy, based on the k-Nearest Neighbors algorithm, which has been shown to be one of the most accurate machine learning algorithms for Internet traffic classification.
Abstract: Real-time classification of Internet traffic according to application types is vital for network management and surveillance. Identifying emerging applications based on well-known port numbers is no longer reliable. While deep packet inspection (DPI) solutions can be accurate, they require constant updates of signatures and become infeasible for encrypted payloads, especially in multimedia applications (e.g. Skype). Statistical approaches based on machine learning have thus been considered more promising and robust to encryption, privacy, protocol obfuscation, etc. However, the computational complexity of traffic classification using these statistical solutions is high, which prevents them from being deployed in systems that need to manage Internet traffic in real time. This paper proposes an FPGA-based parallel architecture to accelerate the statistical identification of multimedia applications while maintaining high classification accuracy. Specifically, we base our design on the k-Nearest Neighbors (k-NN) algorithm, which has been shown to be one of the most accurate machine learning algorithms for Internet traffic classification. To enable high-rate data streaming for real-time classification, we adopt locality sensitive hashing (LSH) for approximate k-NN. The LSH scheme is carefully designed to achieve high accuracy while being efficient to implement on an FPGA. Processing components in the architecture are optimized to realize high throughput. Extensive experiments and FPGA implementation results show that our design can achieve accuracy above 99% for classifying three main categories of multimedia applications from Internet traffic while sustaining 80 Gbps throughput for minimum-size (40 byte) packets.

Journal ArticleDOI
TL;DR: An incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity; the neighbors of a query tuple are efficiently identified by simply retrieving the tuples that appear in the same buckets as the query tuple, without completely scanning the original database.
Abstract: We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by simply retrieving those tuples that appear in the same buckets associated with the query tuple, without completely scanning the original database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.

Proceedings ArticleDOI
01 Mar 2010
TL;DR: This article proposes a new hashing framework for tree-structured data that maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots, and empirically demonstrates the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
Abstract: In this article we propose a new hashing framework for tree-structured data. Our method maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots. By coupling our pivot multisets with the idea of minwise hashing, we realize a fixed-size signature sketch of the tree-structured datum, yielding an effective mechanism for hashing such data. We discuss several potential pivot structures, study some of their theoretical properties, and discuss their implications for tree edit distance and properties related to perfect hashing. We then empirically demonstrate the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
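A hedged sketch of the two ingredients named above: each parent with an unordered pair of children contributes one wedge-shaped pivot, and a minwise-hash signature over the resulting multiset gives a fixed-size sketch whose agreement rate estimates Jaccard similarity. The exact pivot definition here is only a guess at "wedge-shaped"; the paper studies several variants.

```python
import hashlib
from itertools import combinations

def wedge_pivots(tree):
    """tree: dict node -> list of children; node names double as labels.
    Emits (parent, child_a, child_b) wedges with the child pair unordered."""
    pivots = []
    for parent, children in tree.items():
        for a, b in combinations(sorted(children), 2):
            pivots.append((parent, a, b))
    return pivots

def minhash_signature(items, n_hashes=64):
    # One minimum over the whole multiset per (salted) hash function.
    sig = []
    for i in range(n_hashes):
        salt = str(i).encode()
        sig.append(min(int(hashlib.md5(salt + repr(it).encode()).hexdigest(), 16)
                       for it in items))
    return sig

t1 = {'r': ['a', 'b', 'c'], 'a': ['d', 'e']}
t2 = {'r': ['a', 'b', 'x'], 'a': ['d', 'e']}
s1, s2 = minhash_signature(wedge_pivots(t1)), minhash_signature(wedge_pivots(t2))
print("estimated Jaccard:", sum(x == y for x, y in zip(s1, s2)) / len(s1))
```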

Book ChapterDOI
15 Jul 2010
TL;DR: An efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce, in which sub-structures are extracted in parallel from a large XML corpus, and Min-Hashing and locality sensitive hashing techniques are developed.
Abstract: XML is a de-facto standard for web data exchange and information representation. Efficient management of these large volumes of XML data brings challenges to conventional techniques. To cope with large-scale data, the MapReduce computing framework has recently attracted more and more attention in the database community as an efficient solution. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. First, sub-structures are extracted in parallel from a large XML corpus located on the cluster. Then Min-Hashing and locality sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search processing. An empirical study on the cluster with real, large datasets demonstrates the effectiveness and efficiency of our approach.

Journal ArticleDOI
TL;DR: A new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques, which reduces the CPU time required to cluster large compound libraries from several months to only a few days.
Abstract: Motivation: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries. Results: In this article, we introduce a new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques. First, we present EI-Search as a general purpose similarity search method for finding objects with similar features in large databases and apply it here to searching and clustering of large compound sets. The method embeds the compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on locality sensitive hashing (LSH). Second, to cluster large compound sets, we introduce the EI-Clustering algorithm that combines the EI-Search method with Jarvis–Patrick clustering. Both methods were tested on three large datasets with sizes ranging from about 260 000 to over 19 million compounds. In comparison to sequential search methods, the EI-Search method was 40–200 times faster, while maintaining comparable recall rates. The EI-Clustering method allowed us to significantly reduce the CPU time required to cluster these large compound libraries from several months to only a few days. Availability: Software implementations and online services have been developed based on the methods introduced in this study. The online services provide access to the generated clustering results and ultra-fast similarity searching of the PubChem Compound database with subsecond response time. Contact: thomas.girke@ucr.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Proceedings ArticleDOI
25 Oct 2010
TL;DR: This paper proposes a scalable semi-supervised multiple kernel learning method (S3MKL) that is suitable for real world image classification and personalized web image re-ranking with very little user interaction and designs a hashing system where multiple kernel locality sensitive hashing (MKLSH) are constructed with respect to different kernels.
Abstract: For large-scale image data mining, a challenging problem is to design a method that works efficiently with little ground-truth annotation and a mass of unlabeled or noisy data. As one of the major solutions, semi-supervised learning (SSL) has been deeply investigated and widely used in image classification, ranking and retrieval. However, most SSL approaches are not able to incorporate multiple information sources. Furthermore, no sample selection is done on unlabeled data, leading to the unpredictable risk brought by uncontrolled unlabeled data and a heavy computational burden that is unsuitable for learning on real-world datasets. In this paper, we propose a scalable semi-supervised multiple kernel learning method (S3MKL) to deal with the first problem. Our method imposes group LASSO regularization on the kernel coefficients to avoid over-fitting and conditional expectation consensus to regularize the behavior of different kernels on the unlabeled data. To reduce the risk of using unlabeled data, we also design a hashing system where multiple kernel locality sensitive hashing (MKLSH) functions are constructed with respect to different kernels to identify an "informative" and "compact" unlabeled training subset from a large unlabeled data corpus. Combining S3MKL with MKLSH, the method is suitable for real-world image classification and personalized web image re-ranking with very little user interaction. Comprehensive experiments are conducted to test the performance of our method, and the results show that it provides promising performance for large-scale real-world image classification and retrieval.

Book ChapterDOI
05 Sep 2010
TL;DR: This paper proposes the random locality sensitive vocabulary (RLSV) scheme, a simple scheme that generates and aggregates multiple visual vocabularies based on random projections, without taking clustering or training efforts.
Abstract: Visual vocabulary construction is an integral part of the popular Bag-of-Features (BOF) model. When visual data scale up (in terms of the dimensionality of features or/and the number of samples), most existing algorithms (e.g. k-means) become unfavorable due to the prohibitive time and space requirements. In this paper we propose the random locality sensitive vocabulary (RLSV) scheme towards efficient visual vocabulary construction in such scenarios. Integrating ideas from the Locality Sensitive Hashing (LSH) and the Random Forest (RF), RLSV generates and aggregates multiple visual vocabularies based on random projections, without taking clustering or training efforts. This simple scheme demonstrates superior time and space efficiency over prior methods, in both theory and practice, while often achieving comparable or even better performances. Besides, extensions to supervised and kernelized vocabulary constructions are also discussed and experimented with.
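A minimal sketch of the random-vocabulary idea, assuming each vocabulary is simply a set of random hyperplanes whose sign pattern is the visual word and several independent vocabularies are aggregated into concatenated histograms; sizes, seeds and names are illustrative.

```python
import numpy as np

class RandomLSHVocabulary:
    """One vocabulary = n_bits random hyperplanes; the visual word is the sign code."""

    def __init__(self, dim, n_bits, seed):
        self.n_bits = n_bits
        self.planes = np.random.default_rng(seed).standard_normal((n_bits, dim))

    def word(self, descriptor):
        bits = (self.planes @ descriptor > 0).astype(int)
        return int("".join(map(str, bits)), 2)      # word id in [0, 2**n_bits)

def bof_histogram(descriptors, vocabularies):
    # Aggregate multiple random vocabularies: concatenated per-vocabulary histograms.
    hists = []
    for voc in vocabularies:
        h = np.zeros(2 ** voc.n_bits)
        for d in descriptors:
            h[voc.word(d)] += 1
        hists.append(h / max(1, len(descriptors)))
    return np.concatenate(hists)

# Example: three independent 8-bit vocabularies over 128-d SIFT-like descriptors.
vocabs = [RandomLSHVocabulary(dim=128, n_bits=8, seed=s) for s in range(3)]
descs = np.random.default_rng(7).standard_normal((500, 128))
print(bof_histogram(descs, vocabs).shape)           # concatenated histogram, (3 * 256,)
```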

Proceedings ArticleDOI
18 Sep 2010
TL;DR: This work focuses on a recently presented Metric Index (M-Index), redefines its hashing and searching process in terms of LSH, and performs extensive measurements on two datasets to verify that the M-Index fulfills the conditions of the LSH concept.
Abstract: The concept of Locality-sensitive Hashing (LSH) has been successfully used for searching in high-dimensional data, and a number of locality-preserving hash functions have been introduced. In order to extend the applicability of the LSH approach to a general metric space, we focus on a recently presented Metric Index (M-Index), redefine its hashing and searching process in terms of LSH, and perform extensive measurements on two datasets to verify that the M-Index fulfills the conditions of the LSH concept. We discuss at length the "optimal" properties of LSH functions and the efficiency of a given LSH function with respect to kNN queries. The results also indicate that M-Index hashing and searching is more efficient than the tested standard LSH approach for Euclidean distance.

Proceedings ArticleDOI
25 Oct 2010
TL;DR: This work proposes a novel scheme called locality sensitive multi-leveled approximation (LSMA) that optimizes near duplicate video similarity matching over streams based on locality sensitive hashing under the EMD metric.
Abstract: Since near duplicates are ubiquitous across different data sources, increasing research effort has recently been devoted to near duplicate detection. Among near duplicate detection tasks, an important one is continuous near duplicate monitoring over video streams. Existing video monitoring techniques are not effective at handling the variations that commonly exist among near duplicates. Moreover, approaches proposed for near duplicate detection in archived video databases are inefficient when applied to high-speed video streams. In this work, we propose a framework for effective online monitoring of near duplicates over video streams. Specifically, we first propose a novel representation, the video cuboid signature, to describe a video segment. To capture the local spatio-temporal information of video subclips, we employ the Earth Mover's Distance (EMD) to measure the similarity between two signatures. Both the signature construction and the sequence similarity measure are processed incrementally by exploiting the inherent properties of the signature series. Then, we propose a novel scheme called locality sensitive multi-leveled approximation (LSMA) that optimizes near duplicate video similarity matching over streams based on locality sensitive hashing under the EMD metric. Extensive experiments demonstrate the high performance of our approach in terms of detection accuracy and time cost.

Proceedings Article
16 Jan 2010
TL;DR: A simple randomized data structure for two-dimensional point sets that allows fast nearest neighbor queries in many cases is presented and an implementation outperforms several previous implementations for commonly used benchmarks.
Abstract: We present a simple randomized data structure for two-dimensional point sets that allows fast nearest neighbor queries in many cases. An implementation outperforms several previous implementations for commonly used benchmarks.

Proceedings ArticleDOI
25 Oct 2010
TL;DR: A new multi-stage LSH scheme is suggested that consists of (i) extracting compact but accurate representations from audio tracks by exploiting the LSH idea to summarize them, and (ii) adequately organizing the resulting representations in LSH tables, retaining almost the same accuracy as an exact kNN retrieval.
Abstract: In order to improve the reliability and scalability of content-based retrieval of variant audio tracks from large music databases, we suggest a new multi-stage LSH scheme that consists of (i) extracting compact but accurate representations from audio tracks by exploiting the LSH idea to summarize audio tracks, and (ii) adequately organizing the resulting representations in LSH tables, retaining almost the same accuracy as an exact kNN retrieval. In the first stage, we use the major bins of successive chroma features to calculate a multi-probe histogram (MPH) that is concise but retains information about local temporal correlations. In the second stage, based on the order statistics (OS) of the MPH, we propose a new LSH scheme, OS-LSH, to organize and probe the histograms. The representation and organization of the audio tracks are storage efficient and support robust and scalable retrieval. Extensive experiments over a large dataset with 30,000 real audio tracks confirm the effectiveness and efficiency of the proposed scheme.

Proceedings ArticleDOI
05 Jul 2010
TL;DR: This paper proposes to use Multi-Probe Locality Sensitive Hashing (MPLSH) to index video clips for fast similarity search with high recall, which makes it possible to filter out a large number of dissimilar clips from the video database.
Abstract: Detection of duplicate or near-duplicate videos in large-scale databases plays an important role in video search. In this paper, we analyze the problem of near-duplicate detection and propose a practical and effective solution for real-time large-scale video retrieval. Unlike many existing approaches which make use of video frames or key-frames, our solution is based on a more discriminative signature of video clips. The feature used in this paper is an extension of ordinal measures, which have proven to be robust to changes in brightness, compression formats and compression ratios. For efficient retrieval, we propose to use Multi-Probe Locality Sensitive Hashing (MPLSH) to index the video clips for fast similarity search and high recall. MPLSH is able to filter out a large number of dissimilar clips from the video database. To refine the search process, we apply a slightly more expensive clip-based signature matching between a pair of videos. Experimental results on a data set of 12,790 videos [26] show that the proposed approach achieves at least a 6.5% average precision improvement over the baseline color histogram approach while satisfying real-time requirements. Furthermore, our approach is able to locate the frame offset of the copied segment in near-duplicate videos.
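The multi-probe idea referenced above can be shown compactly: besides the query's own bucket, buckets obtained by perturbing individual hash components by ±1 are also probed, so fewer hash tables are needed for the same recall. The one-step perturbation sequence below is illustrative; full MPLSH orders probes by their estimated success probability, and all names are assumptions.

```python
import numpy as np

def buckets_to_probe(query, projections, w=4.0, extra_probes=True):
    """Return the query's own LSH bucket plus one-step perturbations of each component."""
    base = np.floor((projections @ query) / w).astype(int)
    probes = [tuple(base)]
    if extra_probes:
        for i in range(len(base)):
            for delta in (-1, +1):
                shifted = base.copy()
                shifted[i] += delta
                probes.append(tuple(shifted))
    return probes

def query_index(index, query, projections, k):
    # index: dict bucket -> list of (point_id, point); gather candidates, rank them.
    candidates = []
    for b in buckets_to_probe(query, projections):
        candidates.extend(index.get(b, []))
    candidates.sort(key=lambda it: np.linalg.norm(it[1] - query))
    return [pid for pid, _ in candidates[:k]]
```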

Proceedings ArticleDOI
23 Sep 2010
TL;DR: A new method for content-based video copy detection is presented that includes fingerprint extraction and matching phases and has excellent performance in accuracy and speed.
Abstract: In this paper, a new method for content-based video copy detection is presented. This method includes fingerprint extraction and matching phases. In the fingerprint extraction phase, a video is represented by a set of Speeded Up Robust Features (SURF), which outperform other local features. In the fingerprint matching phase, Locality Sensitive Hashing (LSH) is applied to efficiently detect video copies. The experimental results show that the proposed method has excellent performance in accuracy and speed.