
Showing papers on "Locality-sensitive hashing published in 2015"


Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work proposes a new supervised hashing framework in which the learning objective is to generate optimal binary hash codes for linear classification; an auxiliary variable is introduced to reformulate the objective so that it can be solved efficiently by a regularization algorithm.
Abstract: Recently, learning based hashing techniques have attracted broad research interests because they can support efficient storage and retrieval for high-dimensional data such as images, videos, documents, etc. However, a major difficulty of learning to hash lies in handling the discrete constraints imposed on the pursued hash codes, which typically makes hash optimizations very challenging (NP-hard in general). In this work, we propose a new supervised hashing framework, where the learning objective is to generate the optimal binary hash codes for linear classification. By introducing an auxiliary variable, we reformulate the objective such that it can be solved efficiently by employing a regularization algorithm. One of the key steps in this algorithm is to solve a regularization sub-problem associated with the NP-hard binary optimization. We show that the sub-problem admits an analytical solution via cyclic coordinate descent. As such, a high-quality discrete solution can eventually be obtained efficiently, enabling the method to tackle massive datasets. We evaluate the proposed approach, dubbed Supervised Discrete Hashing (SDH), on four large image datasets and demonstrate its superiority to state-of-the-art hashing methods in large-scale image retrieval.

923 citations
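The cyclic coordinate descent step is the heart of SDH: with all other bit-rows of the code matrix fixed, each row has a closed-form sign solution. Below is a minimal numpy sketch of this discrete cyclic coordinate style of update, assuming the linear-classification formulation min_B ||Y - B^T W||^2 - 2*nu*tr(B^T F) with B in {-1,+1}^(L x n); all variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def dcc_update_codes(B, W, Y, F, nu=1.0, sweeps=3):
    """Cyclic coordinate descent over bit-rows of B (L x n, entries in {-1, +1}).

    Approximately solves min_B ||Y - B^T W||^2 - 2*nu*tr(B^T F) one row at a
    time; each row has the closed-form solution b = sign(q - B'^T W' w), where
    primes denote all rows except the one being updated.
    """
    L = B.shape[0]
    Q = W @ Y.T + nu * F                           # L x n linear-term coefficients
    for _ in range(sweeps):
        for l in range(L):
            w = W[l]                               # classifier row for bit l
            rest = np.delete(np.arange(L), l)
            z = Q[l] - B[rest].T @ (W[rest] @ w)   # n-dim score for bit-row l
            b = np.sign(z)
            b[b == 0] = 1                          # break ties deterministically
            B[l] = b
    return B
```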


Proceedings ArticleDOI
07 Jun 2015
TL;DR: Extensive evaluations on several benchmark image datasets show that the proposed simultaneous feature learning and hash coding pipeline brings substantial improvements over other state-of-the-art supervised or unsupervised hashing methods.
Abstract: Similarity-preserving hashing is a widely-used method for nearest neighbour search in large-scale image retrieval tasks. For most existing hashing methods, an image is first encoded as a vector of hand-engineered visual features, followed by a separate projection or quantization step that generates binary codes. However, such visual feature vectors may not be optimally compatible with the coding process, thus producing sub-optimal hashing codes. In this paper, we propose a deep architecture for supervised hashing, in which images are mapped into binary codes via carefully designed deep neural networks. The pipeline of the proposed deep architecture consists of three building blocks: 1) a sub-network with a stack of convolution layers to produce the effective intermediate image features; 2) a divide-and-encode module to divide the intermediate image features into multiple branches, each encoded into one hash bit; and 3) a triplet ranking loss designed to characterize that one image is more similar to the second image than to the third one. Extensive evaluations on several benchmark image datasets show that the proposed simultaneous feature learning and hash coding pipeline brings substantial improvements over other state-of-the-art supervised or unsupervised hashing methods.

822 citations
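The triplet ranking loss in building block 3) can be sketched directly. The following numpy snippet shows a margin-based triplet ranking loss of the general form described, using squared Euclidean distance on relaxed (real-valued) codes as a differentiable surrogate for Hamming distance; the margin value is an illustrative assumption.

```python
import numpy as np

def triplet_ranking_loss(h_q, h_pos, h_neg, margin=1.0):
    """Penalize cases where the query code is not closer to the similar image
    than to the dissimilar one by at least `margin`. Squared L2 on relaxed
    codes stands in for the Hamming distance during training."""
    d_pos = np.sum((h_q - h_pos) ** 2, axis=-1)
    d_neg = np.sum((h_q - h_neg) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg)
```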


Proceedings ArticleDOI
07 Jun 2015
TL;DR: This paper proposes an effective Semantics-Preserving Hashing method, termed SePH, which transforms semantic affinities of training data as supervised information into a probability distribution and approximates it with to-be-learnt hash codes in Hamming space via minimizing the Kullback-Leibler divergence.
Abstract: With benefits of low storage costs and high query speeds, hashing methods are widely researched for efficiently retrieving large-scale data, which commonly contain multiple views, e.g. a news report with images, videos and texts. In this paper, we study the problem of cross-view retrieval and propose an effective Semantics-Preserving Hashing method, termed SePH. Given semantic affinities of training data as supervised information, SePH transforms them into a probability distribution and approximates it with to-be-learnt hash codes in Hamming space via minimizing the Kullback-Leibler divergence. Then kernel logistic regression with a sampling strategy is utilized to learn the nonlinear projections from features in each view to the learnt hash codes. For any unseen instance, the predicted hash codes and their corresponding output probabilities from observed views are utilized to determine its unified hash code, using a novel probabilistic approach. Extensive experiments conducted on three benchmark datasets demonstrate the effectiveness and soundness of SePH.

463 citations
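The core objective can be sketched as follows: convert pairwise semantic affinities into a probability distribution P, convert Hamming distances between (relaxed) codes into a distribution Q, and minimize KL(P || Q). In this numpy sketch the heavy-tailed 1/(1+d) form for Q is an assumption for illustration; SePH's exact parameterization may differ.

```python
import numpy as np

def seph_style_objective(affinity, codes, eps=1e-12):
    """KL(P || Q) between an affinity distribution and a code distribution.

    affinity: n x n nonnegative semantic affinities (supervised information).
    codes:    n x L relaxed hash codes with entries in [-1, 1].
    """
    n = affinity.shape[0]
    mask = ~np.eye(n, dtype=bool)                 # ignore self-pairs
    P = affinity / (affinity[mask].sum() + eps)
    L = codes.shape[1]
    ham = (L - codes @ codes.T) / 2.0             # Hamming distance on relaxed codes
    Q_unnorm = 1.0 / (1.0 + ham)                  # heavy-tailed form (assumed)
    Q = Q_unnorm / Q_unnorm[mask].sum()
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))
```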


Journal ArticleDOI
TL;DR: Zhang et al. propose a supervised learning framework that generates compact and bit-scalable hashing codes directly from raw images, posing hashing learning as a problem of regularized similarity learning.
Abstract: Extracting informative image features and learning effective approximate hashing functions are two crucial steps in image retrieval. Conventional methods often study these two steps separately, e.g., learning hash functions from a predefined hand-crafted feature space. Meanwhile, the bit lengths of output hashing codes are preset in most previous methods, neglecting the significance level of different bits and restricting their practical flexibility. To address these issues, we propose a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images. We pose hashing learning as a problem of regularized similarity learning. In particular, we organize the training images into a batch of triplet samples, each sample containing two images with the same label and one with a different label. With these triplet samples, we maximize the margin between the matched pairs and the mismatched pairs in the Hamming space. In addition, a regularization term is introduced to enforce adjacency consistency, i.e., images of similar appearance should have similar codes. A deep convolutional neural network is utilized to train the model in an end-to-end fashion, where discriminative image features and hash functions are simultaneously optimized. Furthermore, each bit of our hashing codes is unequally weighted, so that we can manipulate the code length by truncating the insignificant bits. Our framework outperforms state-of-the-art methods on public benchmarks of similar image search and also achieves promising results in the application of person re-identification in surveillance. It is also shown that the generated bit-scalable hashing codes preserve discriminative power well at shorter code lengths.

457 citations
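Two operations fall out of the unequal bit weighting: a weighted Hamming distance for search, and code truncation that drops the least significant bits. A minimal numpy sketch of both, assuming bits in {0, 1} and a learned per-bit weight vector (names are illustrative):

```python
import numpy as np

def weighted_hamming(a, b, w):
    """Weighted Hamming distance: sum of bit weights where codes disagree."""
    return np.sum(w * (a != b), axis=-1)

def truncate_codes(codes, w, k):
    """Shorten codes by keeping only the k most significant (heaviest) bits."""
    top = np.argsort(w)[::-1][:k]
    return codes[:, top], w[top]
```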


Posted Content
TL;DR: Experiments show that the proposed deep pairwise-supervised hashing method (DPSH), which performs simultaneous feature learning and hash-code learning for applications with pairwise labels, outperforms other methods and achieves state-of-the-art performance in image retrieval applications.
Abstract: Recent years have witnessed wide application of hashing for large-scale image retrieval. However, most existing hashing methods are based on hand-crafted features which might not be optimally compatible with the hashing procedure. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, and have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervised information given as triplet labels. For the other common application scenario with pairwise labels, no method had existed for simultaneous feature learning and hash-code learning. In this paper, we propose a novel deep hashing method, called deep pairwise-supervised hashing (DPSH), to perform simultaneous feature learning and hash-code learning for applications with pairwise labels. Experiments on real datasets show that our DPSH method outperforms other methods, achieving state-of-the-art performance in image retrieval applications.

412 citations
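The pairwise supervision is typically expressed as a negative log-likelihood over pairs with theta_ij = (1/2) * u_i^T u_j on the network's real-valued outputs. A simplified numpy sketch of this standard pairwise likelihood loss (without DPSH's quantization regularizer, and including diagonal pairs for brevity):

```python
import numpy as np

def pairwise_likelihood_loss(U, S):
    """Negative log-likelihood of pairwise labels.

    U: n x L real-valued network outputs (relaxed codes).
    S: n x n binary similarity matrix (1 = similar pair, 0 = dissimilar).
    """
    theta = 0.5 * (U @ U.T)
    # log(1 + exp(theta)) computed stably via logaddexp
    return -np.sum(S * theta - np.logaddexp(0.0, theta))
```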


Proceedings Article
25 Jul 2015
TL;DR: A novel method, called scalable graph hashing with feature transformation (SGH), for large-scale graph hashing that can effectively approximate the whole graph without explicitly computing the similarity graph matrix, based on which a sequential learning method is proposed to learn the hash functions in a bitwise manner.
Abstract: Hashing has been widely used for approximate nearest neighbor (ANN) search in big data applications because of its low storage cost and fast retrieval speed. The goal of hashing is to map the data points from the original space into a binary-code space where the similarity (neighborhood structure) in the original space is preserved. By directly exploiting the similarity to guide the hashing code learning procedure, graph hashing has attracted much attention. However, most existing graph hashing methods cannot achieve satisfactory performance in real applications due to the high complexity for graph modeling. In this paper, we propose a novel method, called scalable graph hashing with feature transformation (SGH), for large-scale graph hashing. Through feature transformation, we can effectively approximate the whole graph without explicitly computing the similarity graph matrix, based on which a sequential learning method is proposed to learn the hash functions in a bitwise manner. Experiments on two datasets with one million data points show that our SGH method can outperform the state-of-the-art methods in terms of both accuracy and scalability.

216 citations


Posted Content
TL;DR: In this article, the authors give an optimal data-dependent hashing scheme for the approximate near neighbor problem, achieving query time $O(d n^{\rho+o(1)})$ and space $O(n^{1+\rho+o(1)} + dn)$, where $\rho=\tfrac{1}{2c^2-1}$ for the Euclidean space and $\rho=\tfrac{1}{2c-1}$ for the Hamming space, with approximation factor $c>1$.
Abstract: We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an $n$-point data set in a $d$-dimensional space our data structure achieves query time $O(d n^{\rho+o(1)})$ and space $O(n^{1+\rho+o(1)} + dn)$, where $\rho=\tfrac{1}{2c^2-1}$ for the Euclidean space and approximation $c>1$. For the Hamming space, we obtain an exponent of $\rho=\tfrac{1}{2c-1}$. Our result completes the direction set forth in [AINR14] who gave a proof-of-concept that data-dependent hashing can outperform classical Locality Sensitive Hashing (LSH). In contrast to [AINR14], the new bound is not only optimal, but in fact improves over the best (optimal) LSH data structures [IM98,AI06] for all approximation factors $c>1$. From the technical perspective, we proceed by decomposing an arbitrary dataset into several subsets that are, in a certain sense, pseudo-random.

211 citations
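The gain over classical LSH is easy to see numerically: the best data-independent Euclidean LSH achieves exponent rho = 1/c^2 [AI06], whereas the data-dependent scheme here achieves rho = 1/(2c^2 - 1). A quick comparison in Python:

```python
def rho_lsh(c):          # optimal data-independent Euclidean LSH [AI06]
    return 1.0 / c**2

def rho_data_dependent(c):  # data-dependent scheme in this paper
    return 1.0 / (2.0 * c**2 - 1.0)

for c in (1.5, 2.0, 3.0):
    print(f"c={c}: LSH rho={rho_lsh(c):.3f}, "
          f"data-dependent rho={rho_data_dependent(c):.3f}")
```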


Proceedings Article
07 Dec 2015
TL;DR: This work shows the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent and establishes a fine-grained lower bound for the quality of any LSH family for angular distance.
Abstract: We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [1, 2]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [3] in practice. We also introduce a multiprobe version of this algorithm and conduct an experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.

179 citations
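The LSH family in question is the cross-polytope hash: pseudo-randomly rotate the input and hash it to the nearest signed standard basis vector. A minimal sketch, using a plain Gaussian matrix as a stand-in for the fast structured rotations the paper uses for practicality:

```python
import numpy as np

class CrossPolytopeLSH:
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((dim, dim))  # surrogate random rotation

    def hash(self, x):
        y = self.A @ x
        i = int(np.argmax(np.abs(y)))
        # bucket = one of the 2*dim signed basis vectors +/- e_i
        return (i, int(np.sign(y[i])))
```

Vectors at small angular distance tend to map to the same signed basis vector after rotation, which is exactly the collision behavior an angular LSH family needs.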


Journal ArticleDOI
TL;DR: This paper proposes a novel hashing method named neighborhood discriminant hashing (NDH) to implement approximate similarity search by exploiting local discriminative information, i.e., the labels of a sample can be inherited from the neighbor samples it selects.
Abstract: With the proliferation of large-scale community-contributed images, hashing-based approximate nearest neighbor search in huge databases has aroused considerable interest from the fields of computer vision and multimedia in recent years because of its computational and memory efficiency. In this paper, we propose a novel hashing method named neighborhood discriminant hashing (NDH) to implement approximate similarity search. Different from previous work, we propose to learn a discriminant hashing function by exploiting local discriminative information, i.e., the labels of a sample can be inherited from the neighbor samples it selects. The hashing function is expected to be orthogonal to avoid redundancy in the learned hashing bits as much as possible, while an information-theoretic regularization is jointly exploited using the maximum entropy principle. As a consequence, the learned hashing function is compact and nonredundant among bits, while each bit is highly informative. Extensive experiments are carried out on four publicly available data sets, and the comparison results demonstrate that the proposed NDH method outperforms state-of-the-art hashing techniques.

171 citations


Proceedings Article
25 Jul 2015
TL;DR: A novel Semantic Topic Multimodal Hashing (STMH) is developed by considering latent semantic information in the coding procedure, and experiments demonstrate that the proposed method outperforms several state-of-the-art methods.
Abstract: Multimodal hashing is essential to cross-media similarity search for its low storage cost and fast query speed. Most existing multimodal hashing methods embed heterogeneous data into a common low-dimensional Hamming space and then round the continuous embeddings to obtain binary codes. Yet they usually neglect the inherent discrete nature of hashing by relaxing the discrete constraints, which causes degraded retrieval performance, especially for long codes. To address this, a novel Semantic Topic Multimodal Hashing (STMH) is developed by considering latent semantic information in the coding procedure. It first discovers clustering patterns of texts and robustly factorizes the image matrix to obtain multiple semantic topics of texts and concepts of images. Then the learned multimodal semantic features are transformed into a common subspace by their correlations. Finally, each bit of the unified hash code can be generated directly by figuring out whether a topic or concept is contained in a text or an image. Therefore, the model obtained by STMH is more suitable for hashing schemes as it directly learns discrete hash codes in the coding process. Experimental results demonstrate that the proposed method outperforms several state-of-the-art methods.

170 citations


Journal ArticleDOI
TL;DR: This paper presents a novel unsupervised multiview alignment hashing approach based on regularized kernel nonnegative matrix factorization, which can find a compact representation uncovering the hidden semantics and simultaneously respecting the joint probability distribution of data.
Abstract: Hashing is a popular and efficient method for nearest neighbor search in large-scale data spaces by embedding high-dimensional feature descriptors into a similarity-preserving Hamming space with a low dimension. For most hashing methods, the performance of retrieval heavily depends on the choice of the high-dimensional feature descriptor. Furthermore, a single type of feature cannot be descriptive enough for different images when it is used for hashing. Thus, how to combine multiple representations to learn effective hashing functions is a pressing task. In this paper, we present a novel unsupervised multiview alignment hashing approach based on regularized kernel nonnegative matrix factorization, which can find a compact representation uncovering the hidden semantics and simultaneously respecting the joint probability distribution of data. In particular, we aim to seek a matrix factorization that effectively fuses the multiple information sources while discarding the feature redundancy. Since the resulting problem is nonconvex and discrete, our objective function is optimized in an alternating fashion with relaxation, converging to a locally optimal solution. After finding the low-dimensional representation, the hashing functions are finally obtained through multivariable logistic regression. The proposed method is systematically evaluated on three data sets: 1) Caltech-256; 2) CIFAR-10; and 3) CIFAR-20, and the results show that our method significantly outperforms the state-of-the-art multiview hashing techniques.

Proceedings ArticleDOI
14 Jun 2015
TL;DR: The new bound is not only optimal, but in fact improves over the best LSH data structures (Indyk, Motwani 1998) (Andoni, Indyk 2006) for all approximation factors c>1.
Abstract: We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an n-point dataset in a d-dimensional space our data structure achieves query time O(d · n^{ρ+o(1)}) and space O(n^{1+ρ+o(1)} + d · n), where ρ = 1/(2c²−1) for the Euclidean space and approximation c > 1. For the Hamming space, we obtain an exponent of ρ = 1/(2c−1). Our result completes the direction set forth in (Andoni, Indyk, Nguyen, Razenshteyn 2014), who gave a proof-of-concept that data-dependent hashing can outperform classic Locality Sensitive Hashing (LSH). In contrast to (Andoni, Indyk, Nguyen, Razenshteyn 2014), the new bound is not only optimal, but in fact improves over the best (optimal) LSH data structures (Indyk, Motwani 1998; Andoni, Indyk 2006) for all approximation factors c > 1. From the technical perspective, we proceed by decomposing an arbitrary dataset into several subsets that are, in a certain sense, pseudo-random.

Proceedings Article
25 Jul 2015
TL;DR: This work presents a cross-modal hashing approach, called quantized correlation hashing (QCH), which takes into consideration the quantization loss over domains and the relation between domains, and outperforms the state-of-the-art multi- modal hashing methods.
Abstract: Cross-modal hashing is designed to facilitate fast search across domains. In this work, we present a cross-modal hashing approach, called quantized correlation hashing (QCH), which takes into consideration the quantization loss over domains and the relation between domains. Unlike previous approaches that optimize the quantizer independently of maximizing the domain correlation, our approach optimizes both processes simultaneously. The underlying relation between the domains that describe the same objects is established via maximizing the correlation between the hash codes across the domains. The resulting multi-modal objective function is transformed into a unimodal formulation, which is optimized through an alternating procedure. Experimental results on three real-world datasets demonstrate that our approach outperforms the state-of-the-art multi-modal hashing methods.

Journal ArticleDOI
TL;DR: It is shown that hashing based on t-distributed stochastic neighbor embedding outperforms state-of-the-art hashing methods on large-scale benchmark data sets and is very effective for image classification with very short code lengths, and that the proposed framework can be further improved, e.g., by minimizing the quantization error with learned orthogonal rotations.
Abstract: Learning-based hashing methods have attracted considerable attention due to their ability to greatly increase the scale at which existing algorithms may operate. Most of these methods are designed to generate binary codes preserving the Euclidean similarity in the original space. Manifold learning techniques, in contrast, are better able to model the intrinsic structure embedded in the original high-dimensional data. The complexities of these models, and the problems with out-of-sample data, have previously rendered them unsuitable for application to large-scale embedding, however. In this paper, how to learn compact binary embeddings on their intrinsic manifolds is considered. In order to address the above-mentioned difficulties, an efficient, inductive solution to the out-of-sample data problem, and a process by which nonparametric manifold learning may be used as the basis of a hashing method are proposed. The proposed approach thus allows the development of a range of new hashing techniques exploiting the flexibility of the wide variety of manifold learning approaches available. It is particularly shown that hashing on the basis of t-distributed stochastic neighbor embedding outperforms state-of-the-art hashing methods on large-scale benchmark data sets, and is very effective for image classification with very short code lengths. It is shown that the proposed framework can be further improved, for example, by minimizing the quantization error with learned orthogonal rotations without much computation overhead. In addition, a supervised inductive manifold hashing framework is developed by incorporating the label information, which is shown to greatly advance the semantic retrieval performance.

Journal ArticleDOI
01 Sep 2015
TL;DR: A novel concept of query-aware bucket partition is introduced, which uses a given query as the "anchor" for bucket partition and thus removes the random shift required by traditional query-oblivious LSH functions.
Abstract: Locality-Sensitive Hashing (LSH) and its variants are the well-known indexing schemes for the c-Approximate Nearest Neighbor (c-ANN) search problem in high-dimensional Euclidean space. Traditionally, LSH functions are constructed in a query-oblivious manner in the sense that buckets are partitioned before any query arrives. However, objects closer to a query may be partitioned into different buckets, which is undesirable. Due to the use of query-oblivious bucket partition, the state-of-the-art LSH schemes for external memory, namely C2LSH and LSB-Forest, only work with approximation ratio of integer c ≥ 2. In this paper, we introduce a novel concept of query-aware bucket partition which uses a given query as the "anchor" for bucket partition. Accordingly, a query-aware LSH function is a random projection coupled with query-aware bucket partition, which removes the random shift required by traditional query-oblivious LSH functions. Notably, query-aware bucket partition can be easily implemented so that query performance is guaranteed. We propose a novel query-aware LSH scheme named QALSH for c-ANN search over external memory. Our theoretical studies show that QALSH enjoys a guarantee on query quality. The use of query-aware LSH functions enables QALSH to work with any approximation ratio c > 1. Extensive experiments show that QALSH outperforms C2LSH and LSB-Forest, especially in high-dimensional space. Specifically, by using a ratio c
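The query-aware idea for a single hash function can be sketched as follows: project all points onto a random line at indexing time; at query time, center a bucket of width w on the query's own projection and collect the points falling within w/2. This is a simplified single-projection rendering; QALSH itself combines many such functions with collision counting.

```python
import bisect
import numpy as np

class QueryAwareProjection:
    def __init__(self, data, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(data.shape[1])   # random projection line
        proj = data @ self.a
        order = np.argsort(proj)
        self.proj_sorted = proj[order]                # sorted projections
        self.ids_sorted = order                       # matching point ids

    def candidates(self, q, w):
        """Points whose projections lie in [a.q - w/2, a.q + w/2]."""
        center = float(self.a @ q)
        lo = bisect.bisect_left(self.proj_sorted, center - w / 2)
        hi = bisect.bisect_right(self.proj_sorted, center + w / 2)
        return self.ids_sorted[lo:hi]
```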

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A novel approach handles these two problems simultaneously based on the idea of data sketching; it can learn hash functions in an online fashion while requiring rather low computational complexity and storage space.
Abstract: Recently, hashing based approximate nearest neighbor (ANN) search has attracted much attention. Many new algorithms have been developed and successfully applied to different applications. However, two critical problems are rarely mentioned. First, in real-world applications, the data often come in a streaming fashion, but most existing hashing methods are batch-based models. Second, when the dataset becomes huge, it is almost impossible to load all the data into memory to train hashing models. In this paper, we propose a novel approach to handle these two problems simultaneously based on the idea of data sketching. A sketch of a dataset preserves its major characteristics but with a significantly smaller size. With a small-size sketch, our method can learn hash functions in an online fashion with rather low computational complexity and storage space. Extensive experiments on two large-scale benchmarks and one synthetic dataset demonstrate the efficacy of the proposed method.
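The enabling primitive is a matrix sketch that preserves a stream's second-moment structure in a small buffer, from which hash projections can then be trained. Below is a minimal frequent-directions-style sketch, a common choice for this kind of data sketching; whether it matches the paper's exact sketching routine is an assumption.

```python
import numpy as np

def frequent_directions(stream, dim, ell):
    """Maintain a small buffer B with B^T B approximating X^T X over streamed rows.

    Assumes dim >= ell. Uses the doubled-buffer variant: shrink when full.
    """
    B = np.zeros((2 * ell, dim))
    nxt = 0
    for x in stream:
        if nxt == 2 * ell:                        # buffer full: shrink it
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell - 1] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = np.zeros((2 * ell, dim))
            B[: len(s)] = s[:, None] * Vt         # bottom rows shrink to zero
            nxt = ell
        B[nxt] = x
        nxt += 1
    return B
```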

Journal ArticleDOI
TL;DR: Extensive experiments show that the spherical hashing technique significantly outperforms state-of-the-art techniques based on hyperplanes across various benchmarks with sizes ranging from one million to 75 million GIST, BoW and VLAD descriptors, and is intuitive and easy to implement.
Abstract: Many binary code embedding schemes have been actively studied recently, since they can provide efficient similarity search, and compact data representations suitable for handling large scale image databases. Existing binary code embedding techniques encode high-dimensional data by using hyperplane-based hashing functions. In this paper we propose a novel hypersphere-based hashing function, spherical hashing, to map more spatially coherent data points into a binary code compared to hyperplane-based hashing functions. We also propose a new binary code distance function, spherical Hamming distance, tailored for our hypersphere-based binary coding scheme, and design an efficient iterative optimization process to achieve both balanced partitioning for each hash function and independence between hashing functions. Furthermore, we generalize spherical hashing to support various similarity measures defined by kernel functions. Our extensive experiments show that our spherical hashing technique significantly outperforms state-of-the-art techniques based on hyperplanes across various benchmarks with sizes ranging from one million to 75 million GIST, BoW and VLAD descriptors. The performance gains are consistent and large, up to 100 percent improvement over the second-best tested method. These results confirm the unique merits of using hyperspheres to encode proximity regions in high-dimensional spaces. Finally, our method is intuitive and easy to implement.
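The two core definitions are compact enough to sketch: hypersphere-based bits (bit k is set when a point falls inside pivot k's ball) and the spherical Hamming distance, which normalizes the XOR count by the number of commonly set bits. Pivots and radii are assumed to be given; in the paper they come from the iterative optimization.

```python
import numpy as np

def spherical_bits(X, pivots, radii):
    """Bit k of x is 1 iff ||x - p_k|| <= r_k (hypersphere hashing)."""
    d = np.linalg.norm(X[:, None, :] - pivots[None, :, :], axis=2)
    return (d <= radii).astype(np.uint8)

def spherical_hamming(a, b):
    """Spherical Hamming distance: |a XOR b| / |a AND b|."""
    return np.sum(a ^ b) / max(np.sum(a & b), 1)   # guard against empty overlap
```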

Journal ArticleDOI
TL;DR: The proposed framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods, and decomposes the hashing learning problem into two steps: binary code (hash bit) learning and hash function learning.
Abstract: To build large-scale query-by-example image retrieval systems, embedding image features into a binary Hamming space provides great benefits. Supervised hashing aims to map the original features to compact binary codes that are able to preserve label based similarity in the binary Hamming space. Most existing approaches apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of those methods, and can result in complex optimization problems that are difficult to solve. In this work we proffer a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. The proposed framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods. Our framework decomposes the hashing learning problem into two steps: binary code (hash bit) learning and hash function learning. The first step can typically be formulated as binary quadratic problems, and the second step can be accomplished by training a standard binary classifier. For solving large-scale binary code inference, we show how it is possible to ensure that the binary quadratic problems are submodular such that efficient graph cut methods may be used. To achieve efficiency as well as efficacy on large-scale high-dimensional data, we propose to use boosted decision trees as the hash functions, which are nonlinear, highly descriptive, and are very fast to train and evaluate. Experiments demonstrate that the proposed method significantly outperforms most state-of-the-art methods, especially on high-dimensional data.
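The decomposition makes the second step a routine supervised task: once binary codes are inferred in Step 1, each bit becomes an ordinary binary classification target. A sketch of Step 2 using scikit-learn's gradient-boosted trees as stand-ins for the paper's boosted decision trees, assuming the Step 1 bit matrix is already available:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_hash_functions(X, codes, n_trees=50, depth=4):
    """Train one boosted-tree classifier per bit.

    codes: n x L matrix in {0, 1}; each bit column must contain both classes.
    """
    return [
        GradientBoostingClassifier(n_estimators=n_trees, max_depth=depth)
        .fit(X, codes[:, l])
        for l in range(codes.shape[1])
    ]

def hash_points(models, X):
    """Apply the learned per-bit classifiers to produce binary codes."""
    return np.stack([m.predict(X) for m in models], axis=1).astype(np.uint8)
```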

Book ChapterDOI
16 Aug 2015
TL;DR: By replacing the brute-force list search in sieving algorithms with Charikar’s angular locality-sensitive hashing method, the proposed HashSieve algorithm already outperforms the GaussSieve, and the practical increase in the space complexity is much smaller than the asymptotic bounds suggest.
Abstract: By replacing the brute-force list search in sieving algorithms with Charikar’s angular locality-sensitive hashing (LSH) method, we get both theoretical and practical speedups for solving the shortest vector problem (SVP) on lattices. Combining angular LSH with a variant of Nguyen and Vidick’s heuristic sieve algorithm, we obtain heuristic time and space complexities for solving SVP of \(2^{0.3366n + o(n)}\) and \(2^{0.2075n + o(n)}\) respectively, while combining the same hash family with Micciancio and Voulgaris’ GaussSieve algorithm leads to an algorithm with (conjectured) heuristic time and space complexities of \(2^{0.3366n + o(n)}\). Experiments with the GaussSieve-variant show that in moderate dimensions the proposed HashSieve algorithm already outperforms the GaussSieve, and the practical increase in the space complexity is much smaller than the asymptotic bounds suggest, and can be further reduced with probing. Extrapolating to higher dimensions, we estimate that a fully optimized and parallelized implementation of the GaussSieve-based HashSieve algorithm might need a few core years to solve SVP in dimension 130 or even 140.
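Charikar's angular LSH replaces the brute-force list search with bucket lookups: each hash is the sign pattern of a few random hyperplane projections, so vectors at small angles collide with high probability. A minimal sketch of one such hash table follows; HashSieve's tuned parameters (number of tables and bits per table) are not shown.

```python
import numpy as np
from collections import defaultdict

class AngularLSHTable:
    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.H = rng.standard_normal((n_bits, dim))   # random hyperplanes
        self.buckets = defaultdict(list)

    def key(self, v):
        """Charikar-style sign sketch of v."""
        return tuple((self.H @ v > 0).astype(int))

    def insert(self, v):
        self.buckets[self.key(v)].append(v)

    def nearby(self, v):
        """Candidate vectors likely at small angle to v (same bucket)."""
        return self.buckets[self.key(v)]
```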

Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel method for learning a bridging mapping for cross-modal hashing, named LBMCH, is proposed to characterize the cross-modal semantic correspondence by seamlessly connecting distinct Hamming spaces, each preserving the local structure of data objects from an individual modality.
Abstract: Hashing has gained considerable attention in large-scale similarity search, due to its efficiency and low storage cost. In this paper, we study the problem of learning hash functions in the context of multi-modal data for cross-modal similarity search. Notwithstanding the progress achieved by existing methods, they essentially learn only one common Hamming space, where data objects from all modalities are mapped to conduct similarity search. However, such methods are unable to characterize the flexible and discriminative local (neighborhood) structure in all modalities simultaneously, which hinders better performance. To overcome this limitation, we propose to learn heterogeneous Hamming spaces, each preserving the local structure of data objects from an individual modality. Then, a novel method for learning a bridging mapping for cross-modal hashing, named LBMCH, is proposed to characterize the cross-modal semantic correspondence by seamlessly connecting these distinct Hamming spaces. Meanwhile, the local structure of each data object in a modality is preserved by constructing an anchor-based representation, enabling LBMCH to achieve linear complexity w.r.t. the size of the training set. The efficacy of LBMCH is experimentally validated against real-world cross-modal datasets.

Posted Content
TL;DR: A comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised, is provided and recent hashing approaches utilizing the deep learning models are summarized.
Abstract: The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although they have elegant theoretical guarantees on search quality in certain metric spaces, the performance of randomized hashing has been shown to be insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in the development of advanced hash functions have emerged. Such learning-to-hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.

Proceedings Article
12 Jul 2015
TL;DR: In this paper, the authors showed that the quantization used in L2-LSH is suboptimal for MIPS compared to signed random projections (SRP), which is another popular hashing scheme for cosine similarity.
Abstract: Recently we showed that the problem of Maximum Inner Product Search (MIPS) is efficient and admits provably sub-linear hashing algorithms. In [23], we used asymmetric transformations to convert the problem of approximate MIPS into the problem of approximate near neighbor search, which can be efficiently solved using L2-LSH. In this paper, we revisit the problem of MIPS and argue that the quantization used in L2-LSH is suboptimal for MIPS compared to signed random projections (SRP), which is another popular hashing scheme for cosine similarity (or correlations). Based on this observation, we provide different asymmetric transformations which convert the problem of approximate MIPS into a problem amenable to SRP instead of L2-LSH. An additional advantage of our scheme is that we also obtain an LSH-type space partitioning which is not possible with the existing scheme. Our theoretical analysis shows that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings. In addition, we also provide the first empirical comparison showing the superiority of hashing over tree-based methods [21] for MIPS.
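The general recipe is to transform data and query asymmetrically so that inner products become monotone in cosine similarity, then hash with signed random projections. The sketch below uses one simple transformation of this kind (scale data into the unit ball and append sqrt(1 - ||x||^2); append 0 to the normalized query); the paper's actual transformations differ in form but serve the same purpose.

```python
import numpy as np

def transform_data(X):
    """Scale rows into the unit ball, then append sqrt(1 - ||x||^2).

    Makes every transformed data vector unit-norm, so cosine similarity to
    the transformed query is monotone in the original inner product.
    """
    X = X / np.max(np.linalg.norm(X, axis=1))
    extra = np.sqrt(np.maximum(1.0 - np.sum(X ** 2, axis=1), 0.0))
    return np.hstack([X, extra[:, None]])

def transform_query(q):
    """Normalize the query and append 0."""
    return np.append(q / np.linalg.norm(q), 0.0)

def srp_hash(V, n_bits=16, seed=0):
    """Signed random projections: n_bits-bit sign pattern per row of V."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((V.shape[1], n_bits))
    return (V @ R > 0).astype(np.uint8)
```

At query time one hashes srp_hash(transform_query(q)[None, :]) with the same seed, so data and query share the projection matrix.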

Proceedings ArticleDOI
07 Dec 2015
TL;DR: In experiments with three image retrieval benchmarks, the proposed online algorithm attains retrieval accuracy comparable to competing state-of-the-art batch-learning solutions, while the formulation is orders of magnitude faster and, being online, adapts to variations in the data.
Abstract: With the staggering growth in image and video datasets, algorithms that provide fast similarity search and compact storage are crucial. Hashing methods that map the data into Hamming space have shown promise; however, many of these methods employ a batch-learning strategy in which the computational cost and memory requirements may become intractable and infeasible with larger and larger datasets. To overcome these challenges, we propose an online learning algorithm based on stochastic gradient descent in which the hash functions are updated iteratively with streaming data. In experiments with three image retrieval benchmarks, our online algorithm attains retrieval accuracy that is comparable to competing state-of-the-art batch-learning solutions, while our formulation is orders of magnitude faster and, being online, adapts to variations in the data. Moreover, our formulation yields improved retrieval performance over a recently reported online hashing technique, Online Kernel Hashing.

Journal ArticleDOI
27 Jul 2015
TL;DR: A data structure that reduces approximate nearest neighbor query times for image patches in large datasets by up to 9× over k-coherence, up to 12× over TreeCANN, and up to 200× over PatchMatch is presented.
Abstract: This paper presents a data structure that reduces approximate nearest neighbor query times for image patches in large datasets. Previous work in texture synthesis has demonstrated real-time synthesis from small exemplar textures. However, high performance has proved elusive for modern patch-based optimization techniques which frequently use many exemplar images in the tens of megapixels or above. Our new algorithm, PatchTable, offloads as much of the computation as possible to a pre-computation stage that takes modest time, so patch queries can be as efficient as possible. There are three key insights behind our algorithm: (1) a lookup table similar to locality sensitive hashing can be precomputed, and used to seed sufficiently good initial patch correspondences during querying, (2) missing entries in the table can be filled during pre-computation with our fast Voronoi transform, and (3) the initially seeded correspondences can be improved with a precomputed k-nearest neighbors mapping. We show experimentally that this accelerates the patch query operation by up to 9× over k-coherence, up to 12× over TreeCANN, and up to 200× over PatchMatch. Our fast algorithm allows us to explore efficient and practical imaging and computational photography applications. We show results for artistic video stylization, light field super-resolution, and multi-image editing.

Journal ArticleDOI
TL;DR: A novel methodology is proposed for detecting trending events from tweet clusters, which are discovered using the locality-sensitive hashing (LSH) technique to speed up the cluster-discovery process while retaining cluster quality.

Journal ArticleDOI
TL;DR: A path-based similarity join (PS-join) method to return the top k similar pairs of objects based on any user specified join path in a heterogeneous information network is proposed and can derive various similarity semantics.
Abstract: As a newly emerging network model, heterogeneous information networks (HINs) have received growing attention. Many data mining tasks have been explored in HINs, including clustering, classification, and similarity search. Similarity join is a fundamental operation required for many problems. It is attracting attention from various applications on network data, such as friend recommendation, link prediction, and online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet been studied in heterogeneous networks. In particular, none of the existing research on similarity join takes the different semantic meanings behind paths into consideration, and almost all of it ignores the heterogeneity and diversity of the HINs. In this paper, we propose a path-based similarity join (PS-join) method to return the top $k$ similar pairs of objects based on any user-specified join path in a heterogeneous information network. We study how to prune expensive similarity computation by introducing bucket pruning based locality sensitive hashing (BPLSH) indexing. Compared with the existing Link-based Similarity join (LS-join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the efficiency and effectiveness of the proposed approach.

Journal ArticleDOI
01 Nov 2015
TL;DR: It is shown that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance, and that hashing should be considered a white box before blindly using it in applications, such as query processing.
Abstract: Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4) read/write-ratio, and (5) un/successful-ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representative hashing schemes (which includes an improved variant of Robin Hood hashing), (7) four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method.
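One of the five schemes studied is (an improved variant of) Robin Hood hashing, an open-addressing scheme in which a probing key steals the slot of any resident key that sits closer to its home bucket, tightening the variance of probe lengths. A minimal sketch of the classic scheme, not the paper's improved variant (no resizing; assumes the table never fills):

```python
class RobinHoodTable:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity              # entries: (key, value, dist)

    def _home(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        dist = 0                                    # distance from home bucket
        pos = self._home(key)
        while True:
            slot = self.slots[pos]
            if slot is None:
                self.slots[pos] = (key, value, dist)
                return
            if slot[0] == key:                      # update in place
                self.slots[pos] = (key, value, slot[2])
                return
            if slot[2] < dist:                      # "rich" resident: steal slot,
                self.slots[pos], (key, value, dist) = (key, value, dist), slot
            pos = (pos + 1) % len(self.slots)       # keep probing the displaced key
            dist += 1

    def get(self, key):
        dist = 0
        pos = self._home(key)
        # Robin Hood invariant lets lookups stop early: once the resident's
        # displacement is smaller than ours, the key cannot be present.
        while (slot := self.slots[pos]) is not None and slot[2] >= dist:
            if slot[0] == key:
                return slot[1]
            pos = (pos + 1) % len(self.slots)
            dist += 1
        return None
```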

Journal ArticleDOI
TL;DR: A Λ-fold Redundant Blocking Framework is presented that relies on the Locality-Sensitive Hashing technique for identifying candidate record pairs which have undergone an anonymization transformation, and it is illustrated that the performance attained is highly correlated with the distance-preserving properties of the anonymization format used.
Abstract: We present a $\Lambda$-fold Redundant Blocking Framework that relies on the Locality-Sensitive Hashing technique for identifying candidate record pairs which have undergone an anonymization transformation. In this context, we demonstrate the usage and evaluate the performance of a variety of families of hash functions used for blocking. We illustrate that the performance attained is highly correlated with the distance-preserving properties of the anonymization format used. The parameters of the blocking scheme are optimally selected so that we achieve the highest possible accuracy in the least possible running time. We also introduce an SMC-based protocol in order to compare the formulated record pairs homomorphically, without running the risk of breaching the privacy of the underlying records.

Proceedings ArticleDOI
13 Oct 2015
TL;DR: This paper proposes a novel unsupervised hashing approach, dubbed multi-view latent hashing (MVLH), to effectively incorporate multi-view data into hash code learning, and provides a novel scheme to directly learn the codes without resorting to continuous relaxations.
Abstract: Hashing techniques have attracted broad research interest in recent multimedia studies. However, most existing hashing methods focus on learning binary codes from data with only one single view, and thus cannot fully utilize the rich information from multiple views of the data. In this paper, we propose a novel unsupervised hashing approach, dubbed multi-view latent hashing (MVLH), to effectively incorporate multi-view data into hash code learning. Specifically, the binary codes are learned from the latent factors shared by multiple views in a unified kernel feature space, where the weights of different views are adaptively learned according to the reconstruction error of each view. We then propose to solve the associated optimization problem with an efficient alternating algorithm. To obtain high-quality binary codes, we provide a novel scheme to directly learn the codes without resorting to continuous relaxations, where each bit is efficiently computed in closed form. We evaluate the proposed method on several large-scale datasets and the results demonstrate the superiority of our method over several other state-of-the-art methods.

Proceedings Article
25 Jul 2015
TL;DR: This paper proposes a novel Ranking Preserving Hashing (RPH) approach that directly optimizes a popular ranking measure, Normalized Discounted Cumulative Gain (NDCG), to obtain effective hashing codes with high ranking accuracy.
Abstract: Hashing methods have become popular for large-scale similarity search due to their storage and computational efficiency. Many machine learning techniques, ranging from unsupervised to supervised, have been proposed to design compact hashing codes. Most existing hashing methods generate binary codes to efficiently find similar data examples to a query. However, the ranking accuracy among the retrieved data examples is not modeled. Yet in many real-world applications, ranking measures are important for evaluating the quality of hashing codes. In this paper, we propose a novel Ranking Preserving Hashing (RPH) approach that directly optimizes a popular ranking measure, Normalized Discounted Cumulative Gain (NDCG), to obtain effective hashing codes with high ranking accuracy. The main difficulty in the direct optimization of the NDCG measure is that it depends on the ranking order of data examples, which forms a non-convex, nonsmooth optimization problem. We address this challenge by optimizing the expectation of the NDCG measure calculated based on a linear hashing function. A gradient descent method is designed to achieve the goal. An extensive set of experiments on two large-scale datasets demonstrates the superior ranking performance of the proposed approach over several state-of-the-art hashing methods.
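Since RPH optimizes NDCG directly, it helps to recall the measure: DCG discounts graded relevance by log rank, and NDCG normalizes by the ideal ordering. A small sketch computing NDCG@k for the ranking induced by Hamming distances (the exponential-gain variant of DCG is an assumed convention):

```python
import numpy as np

def dcg_at_k(rels, k):
    """Discounted cumulative gain of the first k relevance grades."""
    rels = np.asarray(rels, dtype=float)[:k]
    return np.sum((2.0 ** rels - 1.0) / np.log2(np.arange(2, len(rels) + 2)))

def ndcg_at_k(relevance, hamming_dists, k=10):
    """NDCG@k of ranking database items by ascending Hamming distance."""
    order = np.argsort(hamming_dists)            # retrieved ranking
    ideal = np.sort(relevance)[::-1]             # best possible ranking
    denom = dcg_at_k(ideal, k)
    return dcg_at_k(np.asarray(relevance)[order], k) / denom if denom > 0 else 0.0
```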