
Showing papers on "Locality-sensitive hashing" published in 2012


Proceedings ArticleDOI
Wei Liu1, Jun Wang2, Rongrong Ji1, Yu-Gang Jiang3, Shih-Fu Chang1 
16 Jun 2012
TL;DR: A novel kernel-based supervised hashing model is proposed that requires only a limited amount of supervised information, i.e., similar and dissimilar data pairs, and a feasible training cost to achieve high-quality hashing; it significantly outperforms the state of the art in searching both metric distance neighbors and semantically similar neighbors.
Abstract: Recent years have witnessed the growing popularity of hashing in large-scale vision problems. It has been shown that hashing quality can be boosted by incorporating supervised information into hash function learning. However, the existing supervised methods either lack adequate performance or often incur cumbersome model training. In this paper, we propose a novel kernel-based supervised hashing model which requires a limited amount of supervised information, i.e., similar and dissimilar data pairs, and a feasible training cost to achieve high-quality hashing. The idea is to map the data to compact binary codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs. Our approach differs from prior work in exploiting the equivalence between optimizing the code inner products and the Hamming distances. This enables us to sequentially and efficiently train the hash functions one bit at a time, yielding very short yet discriminative codes. We carry out extensive experiments on two image benchmarks with up to one million samples, demonstrating that our approach significantly outperforms the state of the art in searching both metric distance neighbors and semantically similar neighbors, with accuracy gains ranging from 13% to 46%.

1,461 citations
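
The equivalence between code inner products and Hamming distances that the kernel-based supervised hashing abstract above relies on can be checked directly. The sketch below is only an illustration, assuming codes take values in {-1, +1}; the function names and sample codes are made up for this example and are not taken from the paper. For codes of length r, the inner product equals r minus twice the Hamming distance, so optimizing one is the same as optimizing the other.

```python
from itertools import product


def hamming(a, b):
    """Number of positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def inner(a, b):
    """Inner product of two codes with entries in {-1, +1}."""
    return sum(x * y for x, y in zip(a, b))


# Two illustrative 8-bit codes (hypothetical values).
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, 1, -1, -1, -1, -1, -1]
r = len(a)

print(hamming(a, b))  # 3
print(inner(a, b))    # 2, which equals r - 2 * 3

# Exhaustive check of the identity for all pairs of 4-bit codes.
identity_holds = all(
    inner(u, v) == 4 - 2 * hamming(u, v)
    for u in product([-1, 1], repeat=4)
    for v in product([-1, 1], repeat=4)
)
print(identity_holds)  # True
```

Because the identity is linear, a small Hamming distance on similar pairs corresponds to a large inner product, which is the quantity a bit-by-bit training procedure can target.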


Journal ArticleDOI
TL;DR: Two algorithms for the approximate nearest neighbor problem in high dimensional spaces are presented; for data sets of size n living in $\mathbb{R}^d$, they achieve query times that are sub-linear in n and polynomial in d.
Abstract: We present two algorithms for the approximate nearest neighbor problem in high dimensional spaces. For data sets of size n living in $\mathbb{R}^d$, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree.

1,182 citations


Journal ArticleDOI
TL;DR: This work proposes a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and an information theoretic regularizer over both labeled and unlabeled sets, and presents three different semi-supervised hashing methods: orthogonal hashing, nonorthogonal hashing, and sequential hashing.
Abstract: Hashing-based approximate nearest neighbor (ANN) search in huge databases has become popular due to its computational and memory efficiency. The popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions based on random or principal projections. The resulting hashes are either not very accurate or are inefficient. Moreover, these methods are designed for a given metric similarity. On the contrary, semantic similarity is usually given in terms of pairwise labels of samples. There exist supervised hashing methods that can handle such semantic similarity, but they are prone to overfitting when labeled data are small or noisy. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and an information theoretic regularizer over both labeled and unlabeled sets. Based on this framework, we present three different semi-supervised hashing methods, including orthogonal hashing, nonorthogonal hashing, and sequential hashing. Particularly, the sequential hashing method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm can be extended to unsupervised domains where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.

834 citations


Proceedings Article
Jae-Pil Heo1, Youngwoon Lee1, Junfeng He2, Shih-Fu Chang2, Sung-Eui Yoon1 
16 Jun 2012
TL;DR: Extensive experiments show that the spherical hashing technique significantly outperforms six state-of-the-art hashing techniques based on hyperplanes across various image benchmarks of sizes ranging from one to 75 million GIST descriptors, which confirms the unique merits of the proposed idea of using hyperspheres to encode proximity regions in high-dimensional spaces.
Abstract: Many binary code encoding schemes based on hashing have been actively studied recently, since they can provide efficient similarity search, especially nearest neighbor search, and compact data representations suitable for handling large scale image databases in many computer vision problems. Existing hashing techniques encode high-dimensional data points by using hyperplane-based hashing functions. In this paper we propose a novel hypersphere-based hashing function, spherical hashing, to map more spatially coherent data points into a binary code compared to hyperplane-based hashing functions. Furthermore, we propose a new binary code distance function, spherical Hamming distance, that is tailored to our hypersphere-based binary coding scheme, and design an efficient iterative optimization process to achieve balanced partitioning of data points for each hash function and independence between hashing functions. Our extensive experiments show that our spherical hashing technique significantly outperforms six state-of-the-art hashing techniques based on hyperplanes across various image benchmarks of sizes ranging from one to 75 million GIST descriptors. The performance gains are consistent and large, up to 100% improvements. The excellent results confirm the unique merits of the proposed idea of using hyperspheres to encode proximity regions in high-dimensional spaces. Finally, our method is intuitive and easy to implement.

455 citations
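
To make the hypersphere-based hashing function from the spherical hashing abstract above concrete, here is a minimal sketch. It assumes each bit is 1 when the point lies inside a hypersphere given by a pivot and a radius; the pivots and radii below are hand-picked, hypothetical values, whereas the paper learns them with an iterative optimization that is not reproduced here.

```python
import math


def spherical_code(x, pivots, radii):
    """Bit i is 1 when x lies inside the i-th hypersphere (pivot, radius)."""
    return [1 if math.dist(x, p) <= t else 0 for p, t in zip(pivots, radii)]


# Hypothetical 2-D pivots and radii, chosen by hand for illustration only.
pivots = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
radii = [2.5, 2.5, 2.5]

code_a = spherical_code((1.0, 0.5), pivots, radii)
code_b = spherical_code((2.0, 0.0), pivots, radii)
print(code_a)  # [1, 0, 0]
print(code_b)  # [1, 1, 0]
```

Unlike a hyperplane, each sphere bounds a closed region, so points that share several 1-bits are confined to the intersection of those spheres.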


Journal ArticleDOI
TL;DR: It is shown how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sublinear time similarity search guarantees for a wide class of useful similarity functions.
Abstract: Fast retrieval methods are critical for many large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be efficiently searched. However, existing methods do not apply for high-dimensional kernelized data when the underlying feature embedding for the kernel is unknown. We show how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sublinear time similarity search guarantees for a wide class of useful similarity functions. Since a number of successful image-based kernels have unknown or incomputable embeddings, this is especially valuable for image retrieval tasks. We validate our technique on several data sets, and show that it enables accurate and fast performance for several vision problems, including example-based object classification, local feature matching, and content-based retrieval.

359 citations


Proceedings ArticleDOI
01 Apr 2012
TL;DR: This paper proposes an efficient scheme for similarity search over encrypted data, built on locality sensitive hashing, a state-of-the-art algorithm for fast near neighbor search in high dimensional spaces, and proves the scheme secure under a rigorous security definition.
Abstract: In recent years, due to the appealing features of cloud computing, large amounts of data have been stored in the cloud. Although cloud based services offer many advantages, privacy and security of the sensitive data are a big concern. To mitigate the concerns, it is desirable to outsource sensitive data in encrypted form. Encrypted storage protects the data against illegal access, but it complicates some basic yet important functionality, such as search on the data. To achieve search over encrypted data without compromising privacy, a considerable number of searchable encryption schemes have been proposed in the literature. However, almost all of them handle exact query matching but not similarity matching, a crucial requirement for real world applications. Although some sophisticated secure multi-party computation based cryptographic techniques are available for similarity tests, they are computationally intensive and do not scale for large data sources. In this paper, we propose an efficient scheme for similarity search over encrypted data. To do so, we utilize a state-of-the-art algorithm for fast near neighbor search in high dimensional spaces called locality sensitive hashing. To ensure the confidentiality of the sensitive data, we provide a rigorous security definition and prove the security of the proposed scheme under the provided definition. In addition, we provide a real world application of the proposed scheme and verify the theoretical results with empirical observations on a real dataset.

266 citations


Proceedings ArticleDOI
12 Aug 2012
TL;DR: This paper proposes a probabilistic model, called MLBE, to learn hash functions from multimodal data automatically, and devise an efficient algorithm for the learning of binary latent factors which corresponds to hash function learning.
Abstract: In recent years, both hashing-based similarity search and multimodal similarity search have aroused much research interest in the data mining and other communities. While hashing-based similarity search seeks to address the scalability issue, multimodal similarity search deals with applications in which data of multiple modalities are available. In this paper, our goal is to address both issues simultaneously. We propose a probabilistic model, called multimodal latent binary embedding (MLBE), to learn hash functions from multimodal data automatically. MLBE regards the binary latent factors as hash codes in a common Hamming space. Given data from multiple modalities, we devise an efficient algorithm for the learning of binary latent factors which corresponds to hash function learning. Experimental validation of MLBE has been conducted using both synthetic data and two realistic data sets. Experimental results show that MLBE compares favorably with two state-of-the-art models.

209 citations


Proceedings ArticleDOI
20 May 2012
TL;DR: This paper proposes to use a base of m single LSH functions to construct "dynamic" compound hash functions, and defines a new LSH scheme called Collision Counting LSH (C2LSH), which outperforms the state of the art method LSB-forest in high dimensional space.
Abstract: Locality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate NN Search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a "static" compound hash function for building a hash table. In this paper, we propose to use a base of m single LSH functions to construct "dynamic" compound hash functions, and define a new LSH scheme called Collision Counting LSH (C2LSH). If the number of LSH functions under which a data object o collides with a query object q is greater than a pre-specified collision threshold l, then o can be regarded as a good candidate of c-approximate NN of q. This is the basic idea of C2LSH. Our theoretical studies show that, by appropriately choosing the size of LSH function base m and the collision threshold l, C2LSH can have a guarantee on query quality. Notably, the parameter m is not affected by dimensionality of data objects, which makes C2LSH especially good for high dimensional NN search. The experimental studies based on synthetic datasets and four real datasets have shown that C2LSH outperforms the state of the art method LSB-forest in high dimensional space.

197 citations
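
The collision-counting idea in the C2LSH abstract above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes each single LSH function is a random projection of the form floor((a·x + b) / w), and the helper names, bucket width `w`, and the values of m and l are hypothetical choices. An object becomes a candidate when it collides with the query under more than l of the m functions.

```python
import math
import random


def make_lsh_base(m, dim, w, seed=0):
    """Draw m single LSH functions, each a pair (Gaussian vector a, offset b)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(m)
    ]


def hash_point(x, base, w):
    """Bucket id of x under every function in the base."""
    return [
        math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)
        for a, b in base
    ]


def collision_candidates(data, query, base, w, threshold):
    """Return (index, count) for objects colliding with the query more than `threshold` times."""
    q_hashes = hash_point(query, base, w)
    candidates = []
    for idx, x in enumerate(data):
        count = sum(h == hq for h, hq in zip(hash_point(x, base, w), q_hashes))
        if count > threshold:
            candidates.append((idx, count))
    return candidates


m, w, l = 20, 4.0, 10  # hypothetical base size, bucket width, collision threshold
base = make_lsh_base(m, dim=2, w=w, seed=1)
data = [[0.0, 0.0], [0.2, 0.1], [60.0, -45.0]]
cands = collision_candidates(data, [0.0, 0.0], base, w, threshold=l)
print(cands)
```

A point identical to the query always collides under all m functions, so it is always returned; how many other points pass the threshold depends on the random base.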


Journal ArticleDOI
01 Jan 2012
TL;DR: BayesLSH is a principled Bayesian algorithm for candidate pruning and similarity estimation using LSH; it can quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches.
Abstract: Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms are able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.

159 citations


Journal ArticleDOI
TL;DR: A new interaction modality for training which requires only yes-no type binary feedback instead of a precise category label is proposed, which is especially powerful in the presence of hundreds of categories.
Abstract: Machine learning techniques for computer vision applications like object recognition, scene classification, etc., require a large number of training samples for satisfactory performance. Especially when classification is to be performed over many categories, providing enough training samples for each category is infeasible. This paper describes new ideas in multiclass active learning to deal with the training bottleneck, making it easier to train large multiclass image classification systems. First, we propose a new interaction modality for training which requires only yes-no type binary feedback instead of a precise category label. The modality is especially powerful in the presence of hundreds of categories. For the proposed modality, we develop a Value-of-Information (VOI) algorithm that chooses informative queries while also considering user annotation cost. Second, we propose an active selection measure that works with many categories and is extremely fast to compute. This measure is employed to perform a fast seed search before computing VOI, resulting in an algorithm that scales linearly with dataset size. Third, we use locality sensitive hashing to provide a very fast approximation to active learning, which gives sublinear time scaling, allowing application to very large datasets. The approximation provides up to two orders of magnitude speedups with little loss in accuracy. Thorough empirical evaluation of classification accuracy, noise sensitivity, imbalanced data, and computational performance on a diverse set of image datasets demonstrates the strengths of the proposed algorithms.

140 citations


Proceedings Article
03 Dec 2012
TL;DR: One permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme, and the experiments with training SVM and logistic regression confirm that this one permutation scheme should perform similarly to the original (k-permutations) minwise hashing.
Abstract: Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
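
The one permutation procedure described in the abstract above (permute the columns once, split them evenly into k bins, keep the smallest nonzero location per bin) can be sketched as follows. This is a hedged illustration: a binary vector is represented as the set of its nonzero column indices, the function names are hypothetical, and the estimator used here (matched bins divided by bins that are not empty in both sketches) is stated as an assumption rather than as the paper's exact formula.

```python
import random


def one_permutation_sketch(items, dim, k, seed=0):
    """Permute the dim columns once, cut them into k equal bins, and keep
    the smallest nonzero offset per bin (None marks an empty bin)."""
    if dim % k != 0:
        raise ValueError("dim must be divisible by k in this sketch")
    perm = list(range(dim))
    random.Random(seed).shuffle(perm)  # perm[i] is the new position of column i
    width = dim // k
    sketch = [None] * k
    for i in items:
        bin_id, offset = divmod(perm[i], width)
        if sketch[bin_id] is None or offset < sketch[bin_id]:
            sketch[bin_id] = offset
    return sketch


def estimate_resemblance(s1, s2):
    """Matched bins over bins that are non-empty in at least one sketch.
    Assumes at least one of the two underlying sets is non-empty."""
    matched = sum(1 for a, b in zip(s1, s2) if a is not None and a == b)
    both_empty = sum(1 for a, b in zip(s1, s2) if a is None and b is None)
    return matched / (len(s1) - both_empty)


dim, k = 128, 16
set_a = set(range(0, 40))
set_b = set(range(60, 100))   # disjoint from set_a
set_c = set(range(20, 60))    # overlaps half of set_a

sk_a = one_permutation_sketch(set_a, dim, k)
sk_b = one_permutation_sketch(set_b, dim, k)
sk_c = one_permutation_sketch(set_c, dim, k)

print(estimate_resemblance(sk_a, sk_a))  # 1.0
print(estimate_resemblance(sk_a, sk_b))  # 0.0
print(estimate_resemblance(sk_a, sk_c))  # a noisy estimate of the true resemblance 20/60
```

Both sketches must use the same seed so that they share the single permutation; only one pass over the nonzeros is needed, instead of k passes with k permutations.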

Proceedings Article
03 Dec 2012
TL;DR: Extensive experiments on real data validate that, given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity, and SBLSH outperforms SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose the Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, which orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, π/2]. The extensive experiments on real data well validate that given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows the superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments.
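
The batch orthogonalization step that the SBLSH abstract above describes can be sketched with a plain Gram-Schmidt pass. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the batch size, and the number of batches are hypothetical, and the angle estimate (fraction of differing bits times π) is the usual sign-random-projection estimate.

```python
import math
import random


def gram_schmidt(vectors):
    """Orthonormalize a batch of vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            proj = sum(wi * ui for wi, ui in zip(w, u))
            w = [wi - proj * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis


def make_super_bit_projections(dim, batch_size, num_batches, seed=0):
    """Draw Gaussian vectors and orthogonalize them within each batch."""
    if batch_size > dim:
        raise ValueError("a batch cannot hold more than dim orthogonal vectors")
    rng = random.Random(seed)
    projections = []
    for _ in range(num_batches):
        batch = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]
        projections.extend(gram_schmidt(batch))
    return projections


def sign_code(x, projections):
    """One bit per projection: the sign of the inner product."""
    return [1 if sum(p * xi for p, xi in zip(proj, x)) >= 0 else 0 for proj in projections]


def estimate_angle(code_a, code_b):
    """Fraction of differing bits, scaled to an angle in [0, pi]."""
    return math.pi * sum(a != b for a, b in zip(code_a, code_b)) / len(code_a)


projections = make_super_bit_projections(dim=4, batch_size=4, num_batches=2)
x = [1.0, 2.0, -0.5, 0.3]
neg_x = [-xi for xi in x]
print(estimate_angle(sign_code(x, projections), sign_code(x, projections)))      # 0.0
print(estimate_angle(sign_code(x, projections), sign_code(neg_x, projections)))  # pi
```

Vectors are orthogonal only within a batch; different batches stay independent, which is why the code length is the batch size times the number of batches.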

Proceedings ArticleDOI
12 Aug 2012
TL;DR: Manhattan hashing (MH), which is based on Manhattan distance, is proposed to solve the problem of Hamming distance based hashing and can significantly outperform other state-of-the-art methods.
Abstract: Hashing is used to learn binary-code representation for data with expectation of preserving the neighborhood structure in the original feature space. Due to its fast query speed and reduced storage cost, hashing has been widely used for efficient nearest neighbor search in a large variety of applications like text and image retrieval. Most existing hashing methods adopt Hamming distance to measure the similarity (neighborhood) between points in the hashcode space. However, one problem with Hamming distance is that it may destroy the neighborhood structure in the original feature space, which violates the essential goal of hashing. In this paper, Manhattan hashing (MH), which is based on Manhattan distance, is proposed to solve the problem of Hamming distance based hashing. The basic idea of MH is to encode each projected dimension with multiple bits of natural binary code (NBC), based on which the Manhattan distance between points in the hashcode space is calculated for nearest neighbor search. MH can effectively preserve the neighborhood structure in the data to achieve the goal of hashing. To the best of our knowledge, this is the first work to adopt Manhattan distance with NBC for hashing. Experiments on several large-scale image data sets containing up to one million points show that our MH method can significantly outperform other state-of-the-art methods.
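
A small sketch may help show why Manhattan distance over natural binary code (NBC) blocks behaves differently from Hamming distance, as the Manhattan hashing abstract above argues. It assumes each projected dimension has already been quantized into one of 2^q regions and written as q bits; the bit-string representation and the function names are hypothetical choices for this example.

```python
def manhattan_distance(code_a, code_b, bits_per_dim):
    """Decode each q-bit block as an integer and sum the absolute differences."""
    if len(code_a) != len(code_b) or len(code_a) % bits_per_dim != 0:
        raise ValueError("codes must have equal length, a multiple of bits_per_dim")
    total = 0
    for start in range(0, len(code_a), bits_per_dim):
        a = int(code_a[start:start + bits_per_dim], 2)
        b = int(code_b[start:start + bits_per_dim], 2)
        total += abs(a - b)
    return total


def hamming_distance(code_a, code_b):
    """Number of differing bit positions."""
    return sum(x != y for x, y in zip(code_a, code_b))


# Two projected dimensions, 2 bits each (regions 0..3 per dimension).
near_a, near_b = "0110", "1001"   # regions (1, 2) vs (2, 1): adjacent regions
far_a, far_b = "0000", "1111"     # regions (0, 0) vs (3, 3): opposite ends

print(hamming_distance(near_a, near_b), manhattan_distance(near_a, near_b, 2))  # 4 2
print(hamming_distance(far_a, far_b), manhattan_distance(far_a, far_b, 2))      # 4 6
```

Hamming distance rates both pairs as equally far apart, while the Manhattan distance on decoded regions separates the adjacent pair from the distant one.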

Journal ArticleDOI
TL;DR: The results prove the relevance of such a new LSH scheme either providing far better accuracy in the context of image retrieval than euclidean scheme for an equivalent speed, or providing an equivalent accuracy but with a high gain in terms of processing speed.
Abstract: In the past 10 years, new powerful algorithms based on efficient data structures have been proposed to solve the problem of Nearest Neighbors search (or Approximate Nearest Neighbors search). If the Euclidean Locality Sensitive Hashing algorithm, which provides approximate nearest neighbors in a euclidean space with sublinear complexity, is probably the most popular, the euclidean metric does not always provide as accurate and as relevant results when considering similarity measure as the Earth-Mover Distance and χ2 distances. In this paper, we present a new LSH scheme adapted to χ2 distance for approximate nearest neighbors search in high-dimensional spaces. We define the specific hashing functions, we prove their local-sensitivity, and compare, through experiments, our method with the Euclidean Locality Sensitive Hashing algorithm in the context of image retrieval on real image databases. The results prove the relevance of such a new LSH scheme either providing far better accuracy in the context of image retrieval than euclidean scheme for an equivalent speed, or providing an equivalent accuracy but with a high gain in terms of processing speed.

Book ChapterDOI
25 Oct 2012
TL;DR: Because cryptographic hash functions exhibit the avalanche effect, applications that need to identify similar files (e.g. spam/virus detection) require Similarity Preserving Hashing; several approaches have emerged in recent years, with differing names, properties, strengths and weaknesses, owing to a missing definition.
Abstract: Hash functions are a widespread class of functions in computer science and used in several applications, e.g. in computer forensics to identify known files. One basic property of cryptographic Hash Functions is the avalanche effect that causes a significantly different output if an input is changed slightly. As some applications also need to identify similar files (e.g. spam/virus detection) this raised the need for Similarity Preserving Hashing. In recent years, several approaches came up, all with different namings, properties, strengths and weaknesses which is due to a missing definition.

Journal ArticleDOI
17 Jul 2012
TL;DR: An algorithm for optimizing the parameters and use of LSH is described, which returns the LSH parameters that allow an LSH index to meet the performance goal and have the minimum computational cost.
Abstract: Locality-sensitive hashing (LSH) is the basis of many algorithms that use a probabilistic approach to find nearest neighbors. We describe an algorithm for optimizing the parameters and use of LSH. Prior work ignores these issues or suggests a search for the best parameters. We start with two histograms: one that characterizes the distributions of distances to a point's nearest neighbors and the second that characterizes the distance between a query and any point in the data set. Given a desired performance level (the chance of finding the true nearest neighbor) and a simple computational cost model, we return the LSH parameters that allow an LSH index to meet the performance goal and have the minimum computational cost. We can also use this analysis to connect LSH to deterministic nearest-neighbor algorithms such as $k$-d trees and thus start to unify the two approaches.

Proceedings ArticleDOI
25 Mar 2012
TL;DR: In this paper, the authors proposed a binarization scheme for vectors of high dimension based on the recent concept of anti-sparse coding, and showed its excellent performance for approximate nearest neighbor search.
Abstract: This paper proposes a binarization scheme for vectors of high dimension based on the recent concept of anti-sparse coding, and shows its excellent performance for approximate nearest neighbor search. Unlike other binarization schemes, this framework allows, up to a scaling factor, the explicit reconstruction from the binary representation of the original vector. The paper also shows that random projections which are used in Locality Sensitive Hashing algorithms, are significantly outperformed by regular frames for both synthetic and real data if the number of bits exceeds the vector dimensionality, i.e., when high precision is required.

Posted Content
TL;DR: This paper proposes the distributed Layered LSH scheme, and proves that it exponentially decreases the network cost, while maintaining a good load balance between different machines.
Abstract: Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. At its core, LSH is based on hashing the data points to a number of buckets such that similar points are more likely to map to the same buckets. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring a network call per hash bucket lookup, this also entails a big network load. The Entropy LSH scheme proposed by Panigrahy significantly reduces the number of required hash tables by looking up a number of query offsets in addition to the query itself. While this improves the LSH space requirement, it does not help with (and in fact worsens) the search network efficiency, as now each query offset requires a network call. In this paper, focusing on the Euclidean space under the $l_2$ norm and building on Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost, while maintaining a good load balance between different machines. Our experiments also verify that our scheme results in a significant network traffic reduction that brings about large runtime improvement in real world applications.

Proceedings ArticleDOI
10 Dec 2012
TL;DR: Numerical experiments on image datasets demonstrate the useful behavior of the deep multi-view hashing (DMVH), compared to recently-proposed multi-modal deep network as well as existing shallow models of hashing.
Abstract: Hashing seeks an embedding of high-dimensional objects into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes with small Hamming distances. A variety of hashing methods have been developed, but most of them resort to a single view (representation) of data. However, objects are often described by multiple representations. For instance, images are described by a few different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate multiple representations into hashing, leading to multi-view hashing. In this paper we present a deep network for multi-view hashing, referred to as deep multi-view hashing, where each layer of hidden nodes is composed of view-specific and shared hidden nodes, in order to learn individual and shared hidden spaces from multiple views of data. Numerical experiments on image datasets demonstrate the useful behavior of our deep multi-view hashing (DMVH), compared to recently-proposed multi-modal deep network as well as existing shallow models of hashing.

Proceedings ArticleDOI
29 Oct 2012
TL;DR: This paper presents an efficient alternating optimization to learn the hashing functions and the optimal kernel combination, and shows that the proposed method can achieve 11% and 34% performance gains over state-of-the-art methods.
Abstract: Hashing methods, which generate binary codes to preserve certain similarity, recently have become attractive in many applications like large scale visual search. However, most state-of-the-art hashing methods only utilize a single feature type, while combining multiple features has been proved very helpful in image search. In this paper we propose a novel hashing approach that utilizes the information conveyed by different features. The multiple feature hashing can be formulated as a similarity preserving problem with optimal linearly-combined multiple kernels. Such formulation is not only compatible with general types of data and diverse types of similarities indicated by different visual features, but also helpful to achieve fast training and search. We present an efficient alternating optimization to learn the hashing functions and the optimal kernel combination. Experimental results on two well-known benchmarks, CIFAR-10 and NUS-WIDE, show that the proposed method can achieve 11% and 34% performance gains over state-of-the-art methods.

Proceedings Article
Wei Liu1, Jun Wang2, Yadong Mu1, Sanjiv Kumar3, Shih-Fu Chang1 
26 Jun 2012
TL;DR: The key idea is the bilinear form of the proposed hash functions, which leads to higher collision probability than the existing hyperplane hash functions when using random projections, which boosts the search performance over the random projection based solutions.
Abstract: Hyperplane hashing aims at rapidly searching nearest points to a hyperplane, and has shown practical impact in scaling up active learning with SVMs. Unfortunately, the existing randomized methods need long hash codes to achieve reasonable search accuracy and thus suffer from reduced search speed and large memory overhead. To this end, this paper proposes a novel hyperplane hashing technique which yields compact hash codes. The key idea is the bilinear form of the proposed hash functions, which leads to higher collision probability than the existing hyperplane hash functions when using random projections. To further increase the performance, we propose a learning based framework in which the bilinear functions are directly learned from the data. This results in short yet discriminative codes, and also boosts the search performance over the random projection based solutions. Large-scale active learning experiments carried out on two datasets with up to one million samples demonstrate the overall superiority of the proposed approach.

Journal ArticleDOI
TL;DR: Experimental results show that LSBF, compared with a baseline approach and other state-of-the-art work in the literature, takes less time to respond to AMQ and consumes much less storage space.
Abstract: In many network applications, Bloom filters are used to support exact-matching membership queries because they are a randomized, space-efficient data structure with a small probability of false answers. In this paper, we extend the standard Bloom filter to the Locality-Sensitive Bloom Filter (LSBF) to provide Approximate Membership Query (AMQ) service. We achieve this by replacing uniform and independent hash functions with locality-sensitive hash functions. Such replacement makes the storage in LSBF locality sensitive. Meanwhile, LSBF is space efficient and query responsive by employing the Bloom filter design. In the design of the LSBF structure, we propose a bit vector to reduce False Positives (FP). The bit vector can verify multiple attributes belonging to one member. We also use an active overflowed scheme to significantly decrease False Negatives (FN). Rigorous theoretical analysis (e.g., on FP, FN, and space overhead) shows that the design of LSBF is space compact and can provide accurate response to approximate membership queries. We have implemented LSBF in a real distributed system to perform extensive experiments using real-world traces. Experimental results show that LSBF, compared with a baseline approach and other state-of-the-art work in the literature (SmartStore and LSB-tree), takes less time to respond to AMQ and consumes much less storage space.

Proceedings ArticleDOI
29 Oct 2012
TL;DR: A novel framework for efficient large-scale video retrieval that integrates feature pooling and hashing in a single framework, and shows that the influence maximization problem is submodular, which allows a greedy optimization method to achieve a nearly optimal solution.
Abstract: This paper develops a novel framework for efficient large-scale video retrieval. We aim to find videos according to higher level similarities, which is beyond the scope of traditional near duplicate search. Following the popular hashing technique we employ compact binary codes to facilitate nearest neighbor search. Unlike the previous methods which capitalize on only one type of hash code for retrieval, this paper combines heterogeneous hash codes to effectively describe the diverse and multi-scale visual contents in videos. Our method integrates feature pooling and hashing in a single framework. In the pooling stage, we cast video frames into a set of pre-specified components, which capture a variety of semantics of video contents. In the hashing stage, we represent each video component as a compact hash code, and combine multiple hash codes into hash tables for effective search. To speed up the retrieval while retaining the most informative codes, we propose a graph-based influence maximization method to bridge the pooling and hashing stages. We show that the influence maximization problem is submodular, which allows a greedy optimization method to achieve a nearly optimal solution. Our method works very efficiently, retrieving thousands of video clips from the TRECVID dataset in about 0.001 second. For a larger scale synthetic dataset with 1M samples, it uses less than 1 second in response to 100 queries. Our method is extensively evaluated in both unsupervised and supervised scenarios, and the results on TRECVID Multimedia Event Detection and Columbia Consumer Video datasets demonstrate the success of our proposed technique.

Proceedings ArticleDOI
01 Apr 2012
TL;DR: A new Bi-level LSH algorithm to perform approximate k-nearest neighbor search in high dimensional spaces that maps well to current GPU architectures and can improve the quality of approximate KNN queries as compared to prior LSH-based algorithms.
Abstract: We present a new Bi-level LSH algorithm to perform approximate $k$-nearest neighbor search in high dimensional spaces. Our formulation is based on a two-level scheme. In the first level, we use an RP-tree that divides the dataset into sub-groups with bounded aspect ratios and is used to distinguish well-separated clusters. During the second level, we compute a single LSH hash table for each sub-group along with a hierarchical structure based on space-filling curves. Given a query, we first determine the sub-group that it belongs to and perform $k$-nearest neighbor search within the suitable buckets in the LSH hash table corresponding to the sub-group. Our algorithm also maps well to current GPU architectures and can improve the quality of approximate KNN queries as compared to prior LSH-based algorithms. We highlight its performance on two large, high-dimensional image datasets. Given a runtime budget, Bi-level LSH can provide better accuracy in terms of recall or error ratio. Moreover, our formulation reduces the variation in runtime cost or the quality of results.

Journal ArticleDOI
TL;DR: On a fingerprint database with more than 1,000 videos, it is demonstrated that EWH is more than 10 times faster than LSH and has a significantly better detection accuracy with a 15 times lower error rate.
Abstract: A fast approximate nearest neighbor search algorithm for the (binary) Hamming space is proposed. The proposed Error Weighted Hashing (EWH) algorithm is up to 20 times faster than the popular locality sensitive hashing (LSH) algorithm and works well even for large nearest neighbor distances where LSH fails. EWH significantly reduces the number of candidate nearest neighbors by weighing them based on the difference between their hash vectors. EWH can be used for multimedia retrieval and copy detection systems that are based on binary fingerprinting. On a fingerprint database with more than 1,000 videos, for a specific detection accuracy, we demonstrate that EWH is more than 10 times faster than LSH. For the same retrieval time, we show that EWH has a significantly better detection accuracy with a 15 times lower error rate.

Proceedings ArticleDOI
29 Oct 2012
TL;DR: The authors propose the distributed Layered LSH scheme and prove that it exponentially decreases the network cost while maintaining a good load balance between different machines.
Abstract: Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring a network call per hash bucket lookup, also a big network load. Panigrahy's Entropy LSH scheme significantly reduces the space requirement but does not help with (and in fact worsens) the search network efficiency. In this paper, focusing on the Euclidean space under the ℓ2 norm and building on Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost, while maintaining a good load balance between different machines. Our experiments also verify that our theoretical results

Book ChapterDOI
24 Sep 2012
TL;DR: A very simple and effective strategy for sub-linear time near neighbor search is developed, by creating hash tables directly using the bits generated by b-bit minwise hashing.
Abstract: Numerous applications in search, databases, machine learning, and computer vision, can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.
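
The table-building step described above can be sketched as follows: compute k minwise hashes of a binary vector (represented as a set of non-negative integer feature ids), keep only the lowest b bits of each, and concatenate them into a bucket key. This is a sketch under assumptions rather than the paper's code: random affine maps modulo a prime stand in for random permutations, a single table is shown, and every name and parameter is hypothetical.

```python
import random
from collections import defaultdict

_PRIME = (1 << 61) - 1  # modulus for the stand-in "permutations"


def make_hash_params(k, seed=0):
    """Draw k affine maps x -> (a * x + c) mod _PRIME (assumed hash family)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(k)]


def bbit_key(item_set, params, b):
    """Concatenate the lowest b bits of each minwise hash into one integer key."""
    mask = (1 << b) - 1
    key = 0
    for a, c in params:
        min_val = min((a * x + c) % _PRIME for x in item_set)
        key = (key << b) | (min_val & mask)
    return key


def build_table(sets, params, b):
    """Bucket every set by its b-bit minwise key; the key space has 2**(b*k) slots."""
    table = defaultdict(list)
    for idx, item_set in enumerate(sets):
        table[bbit_key(item_set, params, b)].append(idx)
    return table


def query(table, item_set, params, b):
    """Return the indices stored in the query's bucket (candidate near neighbors)."""
    return table.get(bbit_key(item_set, params, b), [])
```

Each lookup touches only one bucket per table, which is where the sub-linear query time comes from. Candidates returned by `query` would normally be re-ranked by their actual similarity to the query, and several tables with independent hash parameters would be used to raise recall.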

Book ChapterDOI
11 Nov 2012
TL;DR: This paper proposes the use of state-of-the-art locality-sensitive hashing techniques to vastly improve the scalability of instance matching across multiple types, and describes how these techniques can be used to estimate containment or equivalence relations between two type systems.
Abstract: In this paper, we describe a mechanism for ontology alignment using instance-based matching of types (or classes). Instance-based matching is known to be a useful technique for matching ontologies that have different names and different structures. A key problem in instance matching of types, however, is scaling the matching algorithm to (a) handle types with a large number of instances, and (b) efficiently match a large number of type pairs. We propose the use of state-of-the-art locality-sensitive hashing (LSH) techniques to vastly improve the scalability of instance matching across multiple types. We show the feasibility of our approach with DBpedia and Freebase, two different type systems with hundreds and thousands of types, respectively. We describe how these techniques can be used to estimate containment or equivalence relations between two type systems, and we compare two different LSH techniques for computing instance similarity.

01 Jan 2012
TL;DR: Receiver operating characteristics (ROC) comparisons between the proposed hashing and singular value decomposition (SVD) based hashing, also called SVD-SVD hashing, presented by Kozat et al. at the 11th International Conference on Image Processing (ICIP'04) are conducted, and the results indicate that the proposed hashing performs better in robustness and discriminative capability than the SVD-SVD hashing.
Abstract: Image hashing is a new technology in multimedia security. It maps visually identical images to the same or similar short strings called image hashes, and finds applications in image retrieval, image authentication, digital watermarking, image indexing, and image copy detection. This paper presents a perceptual hashing scheme for color images. The input image in RGB color space is first converted into a normalized image by interpolation and filtering. Color space conversions from RGB to YCbCr and HSI are then performed. Next, invariant moments of each component of the above two color spaces are calculated. The image hash is finally obtained by concatenating the invariant moments of these components. Similarity between image hashes is evaluated by the L2 norm. Experiments show that the proposed hashing is robust against normal digital processing, such as JPEG compression, watermark embedding, gamma correction, Gaussian low-pass filtering, adjustments of brightness and contrast, image scaling, and image rotation. Receiver operating characteristics (ROC) comparisons between the proposed hashing and singular value decomposition (SVD) based hashing, also called SVD-SVD hashing, presented by Kozat et al. at the 11th International Conference on Image Processing (ICIP'04) are conducted, and the results indicate that the proposed hashing performs better in robustness and discriminative capability than the SVD-SVD hashing.

Proceedings ArticleDOI
12 Aug 2012
TL;DR: A novel Boosting Multi-Kernel Locality-Sensitive Hashing (BMKLSH) scheme is proposed that significantly boosts the retrieval performance of KLSH by making use of multiple kernels and outperforms the state-of-the-art techniques.
Abstract: Similarity search is a key challenge for multimedia retrieval applications where data are usually represented in high-dimensional space. Among various algorithms proposed for similarity search in high-dimensional space, Locality-Sensitive Hashing (LSH) is the most popular one, which recently has been extended to Kernelized Locality-Sensitive Hashing (KLSH) by exploiting kernel similarity for better retrieval efficacy. Typically, KLSH works only with a single kernel, which is often limited in real-world multimedia applications, where data may originate from multiple resources or can be represented in several different forms. For example, in content-based multimedia retrieval, a variety of features can be extracted to represent contents of an image. To overcome the limitation of regular KLSH, we propose a novel Boosting Multi-Kernel Locality-Sensitive Hashing (BMKLSH) scheme that significantly boosts the retrieval performance of KLSH by making use of multiple kernels. We conduct extensive experiments for large-scale content-based image retrieval, in which encouraging results show that the proposed method outperforms the state-of-the-art techniques.