Showing papers on "Locality-sensitive hashing published in 2014"

PDF

Open Access

Proceedings Article•

Supervised hashing for image retrieval via image representation learning

[...]

Rongkai Xia¹, Yan Pan¹, Hanjiang Lai¹, Cong Liu¹, Shuicheng Yan² - Show less +1 more•Institutions (2)

Sun Yat-sen University¹, National University of Singapore²

27 Jul 2014

TL;DR: Extensive empirical evaluations on three benchmark datasets with different kinds of images show that the proposed method has superior performance gains over several state-of-the-art supervised and unsupervised hashing methods.

...read moreread less

Abstract: Hashing is a popular approximate nearest neighbor search approach for large-scale image retrieval. Supervised hashing, which incorporates similarity/ dissimilarity information on entity pairs to improve the quality of hashing function learning, has recently received increasing attention. However, in the existing supervised hashing methods for images, an input image is usually encoded by a vector of handcrafted visual features. Such hand-crafted feature vectors do not necessarily preserve the accurate semantic similarities of images pairs, which may often degrade the performance of hashing function learning. In this paper, we propose a supervised hashing method for image retrieval, in which we automatically learn a good image representation tailored to hashing as well as a set of hash functions. The proposed method has two stages. In the first stage, given the pairwise similarity matrix S over training images, we propose a scalable coordinate descent method to decompose S into a product of HHT where H is a matrix with each of its rows being the approximate hash code associated to a training image. In the second stage, we propose to simultaneously learn a good feature representation for the input images as well as a set of hash functions, via a deep convolutional network tailored to the learned hash codes in H and optionally the discrete class labels of the images. Extensive empirical evaluations on three benchmark datasets with different kinds of images show that the proposed method has superior performance gains over several state-of-the-art supervised and unsupervised hashing methods.

...read moreread less

925 citations

Posted Content•

Hashing for Similarity Search: A Survey

[...]

Jingdong Wang, Heng Tao Shen, Jingkuan Song, Jianqiu Ji

13 Aug 2014-arXiv: Data Structures and Algorithms

TL;DR: This paper presents a survey on one of the main solutions to approximate search, hashing, which has been widely studied since the pioneering work locality sensitive hashing, and divides the hashing algorithms into two main categories.

...read moreread less

Abstract: Similarity search (nearest neighbor search) is a problem of pursuing the data items whose distances to a query item are the smallest from a large database. Various methods have been developed to address this problem, and recently a lot of efforts have been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work locality sensitive hashing. We divide the hashing algorithms two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution and learning to hash, which learns hash functions according the data distribution, and review them from various aspects, including hash function design and distance measure and search scheme in the hash coding space.

...read moreread less

531 citations

Proceedings Article•

Discrete Graph Hashing

[...]

Wei Liu¹, Cun Mu², Sanjiv Kumar³, Shih-Fu Chang²•Institutions (3)

IBM¹, Columbia University², Google³

08 Dec 2014

TL;DR: Extensive experiments performed on four large datasets with up to one million samples show that the discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art un-supervised hashing methods, especially for longer codes.

...read moreread less

Abstract: Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes to well capture the local neighborhoods. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art un-supervised hashing methods, especially for longer codes.

...read moreread less

512 citations

Proceedings Article•DOI•

Collective Matrix Factorization Hashing for Multimodal Data

[...]

Guiguang Ding¹, Yuchen Guo¹, Jile Zhou¹•Institutions (1)

Tsinghua University¹

23 Jun 2014

TL;DR: This paper puts forward a novel hashing method, which is referred to Collective Matrix Factorization Hashing (CMFH), which learns unified hash codes by collective matrix factorization with latent factor model from different modalities of one instance, which can not only supports cross-view search but also increases the search accuracy by merging multiple view information sources.

...read moreread less

Abstract: Nearest neighbor search methods based on hashing have attracted considerable attention for effective and efficient large-scale similarity search in computer vision and information retrieval community. In this paper, we study the problems of learning hash functions in the context of multimodal data for cross-view similarity search. We put forward a novel hashing method, which is referred to Collective Matrix Factorization Hashing (CMFH). CMFH learns unified hash codes by collective matrix factorization with latent factor model from different modalities of one instance, which can not only supports cross-view search but also increases the search accuracy by merging multiple view information sources. We also prove that CMFH, a similarity-preserving hashing learning method, has upper and lower boundaries. Extensive experiments verify that CMFH significantly outperforms several state-of-the-art methods on three different datasets.

...read moreread less

475 citations

Proceedings Article•DOI•

Fast Supervised Hashing with Decision Trees for High-Dimensional Data

[...]

Guosheng Lin¹, Chunhua Shen¹, Qinfeng Shi¹, Anton van den Hengel¹, David Suter¹ - Show less +1 more•Institutions (1)

University of Adelaide¹

23 Jun 2014

TL;DR: Experiments demonstrate that the proposed method significantly outperforms most state-of-the-art methods in retrieval precision and training time, and is orders of magnitude faster than many methods in terms of training time.

...read moreread less

Abstract: Supervised hashing aims to map the original features to compact binary codes that are able to preserve label based similarity in the Hamming space. Non-linear hash functions have demonstrated their advantage over linear ones due to their powerful generalization capability. In the literature, kernel functions are typically used to achieve non-linearity in hashing, which achieve encouraging retrieval perfor- mance at the price of slow evaluation and training time. Here we propose to use boosted decision trees for achieving non-linearity in hashing, which are fast to train and evalu- ate, hence more suitable for hashing with high dimensional data. In our approach, we first propose sub-modular for- mulations for the hashing binary code inference problem and an efficient GraphCut based block search method for solving large-scale inference. Then we learn hash func- tions by training boosted decision trees to fit the binary codes. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods in retrieval precision and training time. Especially for high- dimensional data, our method is orders of magnitude faster than many methods in terms of training time.

...read moreread less

418 citations

Posted Content•

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

[...]

Anshumali Shrivastava¹, Ping Li²•Institutions (2)

Cornell University¹, Rutgers University²

22 May 2014-arXiv: Machine Learning

TL;DR: This work presents the first provably sublinear time algorithm for approximateMaximum Inner Product Search (MIPS), and is also the first hashing algorithm for searching with (un-normalized) inner product as the underlying similarity measure.

...read moreread less

Abstract: We present the first provably sublinear time algorithm for approximate \emph{Maximum Inner Product Search} (MIPS). Our proposal is also the first hashing algorithm for searching with (un-normalized) inner product as the underlying similarity measure. Finding hashing schemes for MIPS was considered hard. We formally show that the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, and then we extend the existing LSH framework to allow asymmetric hashing schemes. Our proposal is based on an interesting mathematical phenomenon in which inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search. This key observation makes efficient sublinear hashing scheme for MIPS possible. In the extended asymmetric LSH (ALSH) framework, we provide an explicit construction of provably fast hashing scheme for MIPS. The proposed construction and the extended LSH framework could be of independent theoretical interest. Our proposed algorithm is simple and easy to implement. We evaluate the method, for retrieving inner products, in the collaborative filtering task of item recommendations on Netflix and Movielens datasets.

...read moreread less

299 citations

Proceedings Article•DOI•

Supervised hashing with latent factor models

[...]

Peichao Zhang¹, Wei Zhang¹, Wu-Jun Li², Minyi Guo¹•Institutions (2)

Shanghai Jiao Tong University¹, Nanjing University²

03 Jul 2014

TL;DR: Experimental results show that LFH can achieve superior accuracy than state-of-the-art methods with comparable training time, and a linear-time variant with stochastic learning is proposed for training LFH on large-scale datasets.

...read moreread less

Abstract: Due to its low storage cost and fast query speed, hashing has been widely adopted for approximate nearest neighbor search in large-scale datasets. Traditional hashing methods try to learn the hash codes in an unsupervised way where the metric (Euclidean) structure of the training data is preserved. Very recently, supervised hashing methods, which try to preserve the semantic structure constructed from the semantic labels of the training points, have exhibited higher accuracy than unsupervised methods. In this paper, we propose a novel supervised hashing method, called latent factor hashing(LFH), to learn similarity-preserving binary codes based on latent factor models. An algorithm with convergence guarantee is proposed to learn the parameters of LFH. Furthermore, a linear-time variant with stochastic learning is proposed for training LFH on large-scale datasets. Experimental results on two large datasets with semantic labels show that LFH can achieve superior accuracy than state-of-the-art methods with comparable training time.

...read moreread less

274 citations

Proceedings Article•

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

[...]

Anshumali Shrivastava¹, Ping Li²•Institutions (2)

Cornell University¹, Rutgers University²

08 Dec 2014

TL;DR: In this paper, the authors presented the first provably sublinear time hashing algorithm for approximate maximum inner product search (MIPS), which is based on a key observation that the problem of finding maximum inner products, after independent asymmetric transformations, can be converted into a problem of approximate near neighbor search in classical settings.

...read moreread less

Abstract: We present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we extend the LSH framework to allow asymmetric hashing schemes. Our proposal is based on a key observation that the problem of finding maximum inner products, after independent asymmetric transformations, can be converted into the problem of approximate near neighbor search in classical settings. This key observation makes efficient sublinear hashing scheme for MIPS possible. Under the extended asymmetric LSH (ALSH) framework, this paper provides an example of explicit construction of provably fast hashing scheme for MIPS. Our proposed algorithm is simple and easy to implement. The proposed hashing scheme leads to significant computational savings over the two popular conventional LSH schemes: (i) Sign Random Projection (SRP) and (ii) hashing based on p-stable distributions for L2 norm (L2LSH), in the collaborative filtering task of item recommendations on Netflix and Movielens (10M) datasets.

...read moreread less

230 citations

Journal Article•DOI•

Density Sensitive Hashing

[...]

Zhongming Jin¹, Cheng Li¹, Yue Lin¹, Deng Cai¹•Institutions (1)

Zhejiang University¹

01 Aug 2014-IEEE Transactions on Systems, Man, and Cybernetics

TL;DR: Density Sensitive Hashing (DSH) as discussed by the authors is an extension of locality sensitive hashing (LSH) which avoids the purely random projections selection and uses those projective functions which best agree with the distribution of the data.

...read moreread less

Abstract: Nearest neighbor search is a fundamental problem in various research fields like machine learning, data mining and pattern recognition. Recently, hashing-based approaches, for example, locality sensitive hashing (LSH), are proved to be effective for scalable high dimensional nearest neighbor search. Many hashing algorithms found their theoretic root in random projection. Since these algorithms generate the hash tables (projections) randomly, a large number of hash tables (i.e., long codewords) are required in order to achieve both high precision and recall. To address this limitation, we propose a novel hashing algorithm called density sensitive hashing (DSH) in this paper. DSH can be regarded as an extension of LSH. By exploring the geometric structure of the data, DSH avoids the purely random projections selection and uses those projective functions which best agree with the distribution of the data. Extensive experimental results on real-world data sets have shown that the proposed method achieves better performance compared to the state-of-the-art hashing approaches.

...read moreread less

206 citations

Journal Article•DOI•

[...]

Jonathan Masci¹, Michael M. Bronstein¹, Alexander M. Bronstein², Jürgen Schmidhuber¹•Institutions (2)

University of Lugano¹, Tel Aviv University²

01 Apr 2014-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: An efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable, based on a novel coupled siamese neural network architecture.

...read moreread less

Abstract: We introduce an efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable. The proposed approach is based on a novel coupled siamese neural network architecture and allows unified treatment of intra- and inter-modality similarity learning. Unlike existing cross-modality similarity learning approaches, our hashing functions are not limited to binarized linear projections and can assume arbitrarily complex forms. We show experimentally that our method significantly outperforms state-of-the-art hashing approaches on multimedia retrieval tasks.

...read moreread less

203 citations

Journal Article•DOI•

Robust Perceptual Image Hashing Based on Ring Partition and NMF

[...]

Zhenjun Tang¹, Xianquan Zhang¹, Shichao Zhang¹•Institutions (1)

Guangxi Normal University¹

01 Mar 2014-IEEE Transactions on Knowledge and Data Engineering

TL;DR: An efficient image hashing with a ring partition and a nonnegative matrix factorization (NMF) is designed, which has both the rotation robustness and good discriminative capability.

...read moreread less

Abstract: This paper designs an efficient image hashing with a ring partition and a nonnegative matrix factorization (NMF), which has both the rotation robustness and good discriminative capability. The key contribution is a novel construction of rotation-invariant secondary image, which is used for the first time in image hashing and helps to make image hash resistant to rotation. In addition, NMF coefficients are approximately linearly changed by content-preserving manipulations, so as to measure hash similarity with correlation coefficient. We conduct experiments for illustrating the efficiency with 346 images. Our experiments show that the proposed hashing is robust against content-preserving operations, such as image rotation, JPEG compression, watermark embedding, Gaussian low-pass filtering, gamma correction, brightness adjustment, contrast adjustment, and image scaling. Receiver operating characteristics (ROC) curve comparisons are also conducted with the state-of-the-art algorithms, and demonstrate that the proposed hashing is much better than all these algorithms in classification performances with respect to robustness and discrimination.

...read moreread less

Journal Article•DOI•

Asymmetric Distances for Binary Embeddings

[...]

Albert Gordo¹, Florent Perronnin², Yunchao Gong³, Svetlana Lazebnik⁴•Institutions (4)

French Institute for Research in Computer Science and Automation¹, Xerox², University of North Carolina at Chapel Hill³, University of Illinois at Urbana–Champaign⁴

01 Jan 2014-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work proposes two general asymmetric distances that are applicable to a wide variety of embedding techniques including locality sensitive hashing (LSH), locality sensitive binary codes (LSBC), spectral hashing (SH), PCA embedding (PCA), PCAE with random rotations (PCAE-RR), and PCA with iterative quantization (PCae-ITQ).

...read moreread less

Abstract: In large-scale query-by-example retrieval, embedding image signatures in a binary space offers two benefits: data compression and search efficiency. While most embedding algorithms binarize both query and database signatures, it has been noted that this is not strictly a requirement. Indeed, asymmetric schemes that binarize the database signatures but not the query still enjoy the same two benefits but may provide superior accuracy. In this work, we propose two general asymmetric distances that are applicable to a wide variety of embedding techniques including locality sensitive hashing (LSH), locality sensitive binary codes (LSBC), spectral hashing (SH), PCA embedding (PCAE), PCAE with random rotations (PCAE-RR), and PCAE with iterative quantization (PCAE-ITQ). We experiment on four public benchmarks containing up to 1M images and show that the proposed asymmetric distances consistently lead to large improvements over the symmetric Hamming distance for all binary embedding techniques.

...read moreread less

Journal Article•DOI•

SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index

[...]

Yifang Sun¹, Wei Wang¹, Jianbin Qin¹, Ying Zhang², Xuemin Lin¹ - Show less +1 more•Institutions (2)

University of New South Wales¹, University of Technology, Sydney²

01 Sep 2014

TL;DR: Several surprisingly simple methods to answer c-ANN queries with theoretical guarantees requiring only a single tiny index are proposed and demonstrate superior performance against the state-of-the-art LSH-based methods, and scale up well to 1 billion high-dimensional points on a single commodity PC.

...read moreread less

Abstract: Nearest neighbor searches in high-dimensional space have many important applications in domains such as data mining, and multimedia databases. The problem is challenging due to the phenomenon called "curse of dimensionality". An alternative solution is to consider algorithms that returns a c-approximate nearest neighbor (c-ANN) with guaranteed probabilities. Locality Sensitive Hashing (LSH) is among the most widely adopted method, and it achieves high efficiency both in theory and practice. However, it is known to require an extremely high amount of space for indexing, hence limiting its scalability.In this paper, we propose several surprisingly simple methods to answer c-ANN queries with theoretical guarantees requiring only a single tiny index. Our methods are highly flexible and support a variety of functionalities, such as finding the exact nearest neighbor with any given probability. In the experiment, our methods demonstrate superior performance against the state-of-the-art LSH-based methods, and scale up well to 1 billion high-dimensional points on a single commodity PC.

...read moreread less

Proceedings Article•DOI•

Discriminative coupled dictionary hashing for fast cross-media retrieval

[...]

Zhou Yu¹, Fei Wu¹, Yi Yang², Qi Tian³, Jiebo Luo⁴, Yueting Zhuang¹ - Show less +2 more•Institutions (4)

Zhejiang University¹, University of Queensland², University of Texas at San Antonio³, University of Rochester⁴

03 Jul 2014

TL;DR: The experiments show that the DCDH and MV-DCDH outperform the state-of-the-art methods significantly on cross-media retrieval and conjecture that a balanced representation is crucial in cross- media retrieval.

...read moreread less

Abstract: Cross-media hashing, which conducts cross-media retrieval by embedding data from different modalities into a common low-dimensional Hamming space, has attracted intensive attention in recent years. The existing cross-media hashing approaches only aim at learning hash functions to preserve the intra-modality and inter-modality correlations, but do not directly capture the underlying semantic information of the multi-modal data. We propose a discriminative coupled dictionary hashing (DCDH) method in this paper. In DCDH, the coupled dictionary for each modality is learned with side information (e.g., categories). As a result, the coupled dictionaries not only preserve the intra-similarity and inter-correlation among multi-modal data, but also contain dictionary atoms that are semantically discriminative (i.e., the data from the same category is reconstructed by the similar dictionary atoms). To perform fast cross-media retrieval, we learn hash functions which map data from the dictionary space to a low-dimensional Hamming space. Besides, we conjecture that a balanced representation is crucial in cross-media retrieval. We introduce multi-view features on the relatively ``weak'' modalities into DCDH and extend it to multi-view DCDH (MV-DCDH) in order to enhance their representation capability. The experiments on two real-world data sets show that our DCDH and MV-DCDH outperform the state-of-the-art methods significantly on cross-media retrieval.

...read moreread less

Proceedings Article•

Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

[...]

Anshumali Shrivastava¹, Ping Li²•Institutions (2)

Cornell University¹, Rutgers University²

21 Jun 2014

TL;DR: The heart of the proposed hash function is a "rotation" scheme which densifies the sparse sketches of one permutation hashing in an unbiased fashion thereby maintaining the LSH property, which makes the obtained sketches suitable for hash table construction.

...read moreread less

Abstract: The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutations on the data. This is costly in computation and energy-consumption. In this paper, we propose a hashing technique which generates all the necessary hash evaluations needed for similarity search, using one single permutation. The heart of the proposed hash function is a "rotation" scheme which densifies the sparse sketches of one permutation hashing (Li et al., 2012) in an unbiased fashion thereby maintaining the LSH property. This makes the obtained sketches suitable for hash table construction. This idea of rotation presented in this paper could be of independent interest for densifying other types of sparse sketches. Using our proposed hashing method, the query time of a (K,L)-parameterized LSH is reduced from the typical O(dKL) complexity to merely O(KL + dL), where d is the number of nonzeros of the data vector, K is the number of hashes in each hash table, and L is the number of hash tables. Our experimental evaluation on real data confirms that the proposed scheme significantly reduces the query processing time over minwise hashing without loss in retrieval accuracies.

...read moreread less

Proceedings Article•DOI•

Collaborative Hashing

[...]

Xianglong Liu¹, Junfeng He², Cheng Deng³, Bo Lang¹•Institutions (3)

Beihang University¹, Facebook², Xidian University³

23 Jun 2014

TL;DR: This work proposes a collaborative hashing scheme for the data in matrix form to enable fast search in various applications such as image search using bag of words and recommendation using user-item ratings, and demonstrates that the proposed method outperforms state-of-the-art baselines.

...read moreread less

Abstract: Hashing technique has become a promising approach for fast similarity search. Most of existing hashing research pursue the binary codes for the same type of entities by preserving their similarities. In practice, there are many scenarios involving nearest neighbor search on the data given in matrix form, where two different types of, yet naturally associated entities respectively correspond to its two dimensions or views. To fully explore the duality between the two views, we propose a collaborative hashing scheme for the data in matrix form to enable fast search in various applications such as image search using bag of words and recommendation using user-item ratings. By simultaneously preserving both the entity similarities in each view and the interrelationship between views, our collaborative hashing effectively learns the compact binary codes and the explicit hash functions for out-of-sample extension in an alternating optimization way. Extensive evaluations are conducted on three well-known datasets for search inside a single view and search across different views, demonstrating that our proposed method outperforms state-of-the-art baselines, with significant accuracy gains ranging from 7.67% to 45.87% relatively.

...read moreread less

Journal Article•DOI•

SK-LSH: an efficient index structure for approximate nearest neighbor search

[...]

Yingfan Liu¹, Jiangtao Cui¹, Zi Huang², Hui Li¹, Heng Tao Shen² - Show less +1 more•Institutions (2)

Xidian University¹, University of Queensland²

01 May 2014

TL;DR: An exhaustive empirical study over several real-world data sets demonstrates the superior efficiency and accuracy of SK-LSH for the ANN search, compared with state-of-the-art methods, including LSB, C2LSH and CK-Means.

...read moreread less

Abstract: Approximate Nearest Neighbor (ANN) search in high dimensional space has become a fundamental paradigm in many applications. Recently, Locality Sensitive Hashing (LSH) and its variants are acknowledged as the most promising solutions to ANN search. However, state-of-the-art LSH approaches suffer from a drawback: accesses to candidate objects require a large number of random I/O operations. In order to guarantee the quality of returned results, sufficient objects should be verified, which would consume enormous I/O cost.To address this issue, we propose a novel method, called SortingKeys-LSH (SK-LSH), which reduces the number of page accesses through locally arranging candidate objects. We firstly define a new measure to evaluate the distance between the compound hash keys of two points. A linear order relationship on the set of compound hash keys is then created, and the corresponding data points can be sorted accordingly. Hence, data points that are close to each other according to the distance measure can be stored locally in an index file. During the ANN search, only a limited number of disk pages among few index files are necessary to be accessed for sufficient candidate generation and verification, which not only significantly reduces the response time but also improves the accuracy of the returned results. Our exhaustive empirical study over several real-world data sets demonstrates the superior efficiency and accuracy of SK-LSH for the ANN search, compared with state-of-the-art methods, including LSB, C2LSH and CK-Means.

...read moreread less

Proceedings Article•

In Defense of MinHash Over SimHash

[...]

Anshumali Shrivastava, Ping Li

01 Jan 2014

TL;DR: A theoretical answer is provided (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search.

...read moreread less

Abstract: MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S 2

...read moreread less

Journal Article•DOI•

Robust Hashing With Local Models for Approximate Similarity Search

[...]

Jingkuan Song¹, Yi Yang¹, Xuelong Li², Zi Huang¹, Yang Yang³ - Show less +1 more•Institutions (3)

University of Queensland¹, Chinese Academy of Sciences², National University of Singapore³

01 Jul 2014-IEEE Transactions on Systems, Man, and Cybernetics

TL;DR: This paper proposes a novel hashing method, namely, robust hashing with local models (RHLM), which learns a set of robust hash functions to map the high-dimensional data points into binary hash codes by effectively utilizing local structural information.

...read moreread less

Abstract: Similarity search plays an important role in many applications involving high-dimensional data. Due to the known dimensionality curse, the performance of most existing indexing structures degrades quickly as the feature dimensionality increases. Hashing methods, such as locality sensitive hashing (LSH) and its variants, have been widely used to achieve fast approximate similarity search by trading search quality for efficiency. However, most existing hashing methods make use of randomized algorithms to generate hash codes without considering the specific structural information in the data. In this paper, we propose a novel hashing method, namely, robust hashing with local models (RHLM), which learns a set of robust hash functions to map the high-dimensional data points into binary hash codes by effectively utilizing local structural information. In RHLM, for each individual data point in the training dataset, a local hashing model is learned and used to predict the hash codes of its neighboring data points. The local models from all the data points are globally aligned so that an optimal hash code can be assigned to each data point. After obtaining the hash codes of all the training data points, we design a robust method by employing $\ell_{2,1}$ -norm minimization on the loss function to learn effective hash functions, which are then used to map each database point into its hash code. Given a query data point, the search process first maps it into the query hash code by the hash functions and then explores the buckets, which have similar hash codes to the query hash code. Extensive experimental results conducted on real-life datasets show that the proposed RHLM outperforms the state-of-the-art methods in terms of search quality and efficiency.

...read moreread less

Journal Article•DOI•

Multiple feature kernel hashing for large-scale visual search

[...]

Xianglong Liu¹, Junfeng He², Bo Lang¹•Institutions (2)

Beihang University¹, Facebook²

01 Feb 2014-Pattern Recognition

TL;DR: Experimental results show that the proposed multiple feature kernel hashing framework can achieve superior accuracy and efficiency over state-of-the-art methods, and alternating optimization ways efficiently learn hashing functions and the kernel space.

...read moreread less

Proceedings Article•DOI•

DSH: data sensitive hashing for high-dimensional k-nnsearch

[...]

Jinyang Gao¹, H. V. Jagadish², Wei Lu¹, Beng Chin Ooi¹•Institutions (2)

National University of Singapore¹, University of Michigan²

18 Jun 2014

TL;DR: Data Sensitive Hashing improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies.

...read moreread less

Abstract: The need to locate the k-nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.

...read moreread less

Proceedings Article•DOI•

Preference preserving hashing for efficient recommendation

[...]

Zhiwei Zhang¹, Qifan Wang¹, Lingyun Ruan¹, Luo Si¹•Institutions (1)

Purdue University¹

03 Jul 2014

TL;DR: Experiments show that the recommendation speed of the proposed PPH algorithm can be hundreds of times faster than original MF with real-valued features, and the recommendation accuracy is significantly better than previous work of hashing for recommendation.

...read moreread less

Abstract: Recommender systems usually need to compare a large number of items before users' most preferred ones can be found This process can be very costly if recommendations are frequently made on large scale datasets. In this paper, a novel hashing algorithm, named Preference Preserving Hashing (PPH), is proposed to speed up recommendation. Hashing has been widely utilized in large scale similarity search (e.g. similar image search), and the search speed with binary hashing code is significantly faster than that with real-valued features. However, one challenge of applying hashing to recommendation is that, recommendation concerns users' preferences over items rather than their similarities. To address this challenge, PPH contains two novel components that work with the popular matrix factorization (MF) algorithm. In MF, users' preferences over items are calculated as the inner product between the learned real-valued user/item features. The first component of PPH constrains the learning process, so that users' preferences can be well approximated by user-item similarities. The second component, which is a novel quantization algorithm,generates the binary hashing code from the learned real-valued user/item features. Finally, recommendation can be achieved efficiently via fast hashing code search. Experiments on three real world datasets show that the recommendation speed of the proposed PPH algorithm can be hundreds of times faster than original MF with real-valued features, and the recommendation accuracy is significantly better than previous work of hashing for recommendation.

...read moreread less

Journal Article•DOI•

Robust image hashing with dominant DCT coefficients

[...]

Zhenjun Tang¹, Fan Yang¹, Liyan Huang¹, Xianquan Zhang¹, Xianquan Zhang² - Show less +1 more•Institutions (2)

Guangxi Normal University¹, Guilin University of Electronic Technology²

01 Sep 2014-Optik

TL;DR: A robust image hashing with dominant discrete cosine transform (DCT) coefficients is proposed that converts the input image to a normalized image, divides it into non-overlapping blocks, extracts dominant DCT coefficients in the first row/column of each block to construct feature matrices, and finally conducts matrix compression by calculating and quantifying column distances.

...read moreread less

Journal Article•DOI•

Discrete Cosine Transform Locality-Sensitive Hashes for Face Retrieval

[...]

Mehran Kafai¹, Kave Eshghi², Bir Bhanu³•Institutions (3)

Hewlett-Packard¹, Google², University of California, Riverside³

01 Jun 2014-IEEE Transactions on Multimedia

TL;DR: It is shown that DCT hashing has significantly better retrieval accuracy and it is more efficient compared to other popular state-of-the-art hash algorithms.

...read moreread less

Abstract: Descriptors such as local binary patterns perform well for face recognition. Searching large databases using such descriptors has been problematic due to the cost of the linear search, and the inadequate performance of existing indexing methods. We present Discrete Cosine Transform (DCT) hashing for creating index structures for face descriptors. Hashes play the role of keywords: an index is created, and queried to find the images most similar to the query image. Common hash suppression is used to improve retrieval efficiency and accuracy. Results are shown on a combination of six publicly available face databases (LFW, FERET, FEI, BioID, Multi-PIE, and RaFD). It is shown that DCT hashing has significantly better retrieval accuracy and it is more efficient compared to other popular state-of-the-art hash algorithms.

...read moreread less

Journal Article•DOI•

Data-Dependent Hashing Based on p-Stable Distribution

[...]

Xiao Bai¹, Haichuan Yang¹, Jun Zhou², Peng Ren³, Jian Cheng⁴ - Show less +1 more•Institutions (4)

Beihang University¹, Griffith University², China University of Petroleum³, Chinese Academy of Sciences⁴

27 Aug 2014-IEEE Transactions on Image Processing

TL;DR: This paper begins by formulating the Euclidean distance preserving property in terms of variance estimation, and develops a projection method, which maps the original data to arbitrary dimensional vectors, which results in a supervised hashing scheme, which preserves semantic similarity of data.

...read moreread less

Abstract: The p-stable distribution is traditionally used for data-independent hashing. In this paper, we describe how to perform data-dependent hashing based on p-stable distribution. We commence by formulating the Euclidean distance preserving property in terms of variance estimation. Based on this property, we develop a projection method, which maps the original data to arbitrary dimensional vectors. Each projection vector is a linear combination of multiple random vectors subject to p-stable distribution, in which the weights for the linear combination are learned based on the training data. An orthogonal matrix is then learned data-dependently for minimizing the thresholding error in quantization. Combining the projection method and orthogonal matrix, we develop an unsupervised hashing scheme, which preserves the Euclidean distance. Compared with data-independent hashing methods, our method takes the data distribution into consideration and gives more accurate hashing results with compact hash codes. Different from many data-dependent hashing methods, our method accommodates multiple hash tables and is not restricted by the number of hash functions. To extend our method to a supervised scenario, we incorporate a supervised label propagation scheme into the proposed projection method. This results in a supervised hashing scheme, which preserves semantic similarity of data. Experimental results show that our methods have outperformed several state-of-the-art hashing approaches in both effectiveness and efficiency.

...read moreread less

Journal Article•DOI•

[...]

Lei Zhang¹, Yongdong Zhang¹, Xiaoguang Gu¹, Jinhui Tang², Qi Tian³ - Show less +1 more•Institutions (3)

Chinese Academy of Sciences¹, Nanjing University of Science and Technology², University of Texas at San Antonio³

21 May 2014-IEEE Transactions on Image Processing

TL;DR: This paper proposes a novel hashing method, referred to as topology preserving hashing (TPH), which is distinct from prior works by also preserving the neighborhood ranking and is capable of mining semantic relationship between unlabeled data without supervised information.

...read moreread less

Abstract: Hashing-based similarity search techniques is becoming increasingly popular in large data sets To capture meaningful neighbors, the topology of a data set, which represents the neighborhood relationships between its subregions and the relative proximities between the neighbors of each subregion, eg, the relative neighborhood ranking of each subregion, should be exploited However, most existing hashing methods are developed to preserve neighborhood relationships while ignoring the relative neighborhood proximities Moreover, most hashing methods lack in providing a good result ranking, since there are often lots of results sharing the same Hamming distance to a query In this paper, we propose a novel hashing method to solve these two issues jointly The proposed method is referred to as topology preserving hashing (TPH) TPH is distinct from prior works by also preserving the neighborhood ranking Based on this framework, we present three different TPH methods, including linear unsupervised TPH, semisupervised TPH, and kernelized TPH Particularly, our unsupervised TPH is capable of mining semantic relationship between unlabeled data without supervised information Extensive experiments on four large data sets demonstrate the superior performances of the proposed methods over several state-of-the-art unsupervised and semisupervised hashing techniques

...read moreread less

Journal Article•DOI•

Fast Low-Rank Subspace Segmentation

[...]

Xin Zhang¹, Fuchun Sun¹, Guangcan Liu², Yi Ma³•Institutions (3)

Tsinghua University¹, University of Illinois at Urbana–Champaign², Microsoft³

01 May 2014-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper devise a fast subspace segmentation algorithm with complexity of O(n log (n), achieved by first using partial Singular Value Decomposition (SVD) to approximate the solution of LRSS, utilizing Locality Sensitive Hashing (LSH) to build a sparse affinity graph that encodes the subspace memberships, and adopting a fast Normalized Cut (NCut) algorithm to produce the final segmentation results.

...read moreread less

Abstract: Subspace segmentation is the problem of segmenting (or grouping) a set of $n$ data points into a number of clusters, with each cluster being a (linear) subspace. The recently established algorithms such as Sparse Subspace Clustering (SSC), Low-Rank Representation (LRR) and Low-Rank Subspace Segmentation (LRSS) are effective in terms of segmentation accuracy, but computationally inefficient as they possess a complexity of $O(n^{3})$ , which is too high to afford for the case where $n$ is very large. In this paper we devise a fast subspace segmentation algorithm with complexity of $O(n\log (n))$ . This is achieved by firstly using partial Singular Value Decomposition (SVD) to approximate the solution of LRSS, secondly utilizing Locality Sensitive Hashing (LSH) to build a sparse affinity graph that encodes the subspace memberships, and finally adopting a fast Normalized Cut (NCut) algorithm to produce the final segmentation results. Besides of high efficiency, our algorithm also has comparable effectiveness as the original LRSS method.

...read moreread less

Proceedings Article•

Locality preserving hashing

[...]

Kang Zhao¹, Hongtao Lu¹, Jincheng Mei¹•Institutions (1)

Shanghai Jiao Tong University¹

27 Jul 2014

TL;DR: A novel hashing algorithm called Locality Preserving Hashing is proposed, which learns a set of locality preserving projections with a joint optimization framework, which minimizes the average projection distance and quantization loss simultaneously.

...read moreread less

Abstract: Hashing has recently attracted considerable attention for large scale similarity search However, learning compact codes with good performance is still a challenge In many cases, the real-world data lies on a low-dimensional manifold embedded in high-dimensional ambient space To capture meaningful neighbors, a compact hashing representation should be able to uncover the intrinsic geometric structure of the manifold, eg, the neighborhood relationships between subregions Most existing hashing methods only consider this issue during mapping data points into certain projected dimensions When getting the binary codes, they either directly quantize the projected values with a threshold, or use an orthogonal matrix to refine the initial projection matrix, which both consider projection and quantization separately, and will not well preserve the locality structure in the whole learning process In this paper, we propose a novel hashing algorithm called Locality Preserving Hashing to effectively solve the above problems Specifically, we learn a set of locality preserving projections with a joint optimization framework, which minimizes the average projection distance and quantization loss simultaneously Experimental comparisons with other state-of-the-art methods on two large scale datasets demonstrate the effectiveness and efficiency of our method

...read moreread less

Proceedings Article•DOI•

Scalable heterogeneous translated hashing

[...]

Ying Wei¹, Yangqiu Song², Yi Zhen³, Bo Liu¹, Qiang Yang¹ - Show less +1 more•Institutions (3)

Hong Kong University of Science and Technology¹, University of Illinois at Urbana–Champaign², Duke University³

24 Aug 2014

TL;DR: This paper proposes a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence.

...read moreread less

Abstract: Hashing has enjoyed a great success in large-scale similarity search. Recently, researchers have studied the multi-modal hashing to meet the need of similarity search across different types of media. However, most of the existing methods are applied to search across multi-views among which explicit bridge information is provided. Given a heterogeneous media search task, we observe that abundant multi-view data can be found on the Web which can serve as an auxiliary bridge. In this paper, we propose a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence. HTH simultaneously learns hash functions embedding heterogeneous media into different Hamming spaces, and translators aligning these spaces. Unlike almost all existing methods that map heterogeneous data in a common Hamming space, mapping to different spaces provides more flexible and discriminative ability. We empirically verify the effectiveness and efficiency of our algorithm on two real world large datasets, one publicly available dataset of Flickr and the other MIRFLICKR-Yahoo Answers dataset.

...read moreread less

Posted Content•

Improved Densification of One Permutation Hashing

[...]

Anshumali Shrivastava¹, Ping Li²•Institutions (2)

Cornell University¹, Rutgers University²

18 Jun 2014-arXiv: Methodology

TL;DR: A new densification procedure is provided which is provably better than the existing scheme and has the same cost of $O(d + KL)$ for query processing, thereby making it strictly preferable over the existing procedure.

...read moreread less

Abstract: The existing work on densification of one permutation hashing reduces the query processing cost of the $(K,L)$-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing, from $O(dKL)$ to merely $O(d + KL)$, where $d$ is the number of nonzeros of the data vector, $K$ is the number of hashes in each hash table, and $L$ is the number of hash tables. While that is a substantial improvement, our analysis reveals that the existing densification scheme is sub-optimal. In particular, there is no enough randomness in that procedure, which affects its accuracy on very sparse datasets. In this paper, we provide a new densification procedure which is provably better than the existing scheme. This improvement is more significant for very sparse datasets which are common over the web. The improved technique has the same cost of $O(d + KL)$ for query processing, thereby making it strictly preferable over the existing procedure. Experimental evaluations on public datasets, in the task of hashing based near neighbor search, support our theoretical findings.

...read moreread less

Collapse