
Showing papers on "Locality-sensitive hashing published in 2017"


Journal ArticleDOI
TL;DR: The proposed hashing method achieves efficient similarity search, effective hashing performance, and high generalization ability (simultaneously preserving two kinds of complementary similarity structures, i.e., local structures via manifold learning and global structures via PCA).
Abstract: This paper proposes a new hashing framework to conduct similarity search via the following steps: first, employing linear clustering methods to obtain a set of representative data points and a set of landmarks of the big dataset; second, using the landmarks to generate a probability representation for each data point. The proposed probability representation method is further proved to preserve the neighborhood of each data point. Third, PCA is integrated with manifold learning to learn the hash functions using the probability representations of all representative data points. As a consequence, the proposed hashing method achieves efficient similarity search (with linear time complexity), effective hashing performance, and high generalization ability (simultaneously preserving two kinds of complementary similarity structures, i.e., local structures via manifold learning and global structures via PCA). Experimental results on four public datasets clearly demonstrate the advantages of our proposed method in terms of similarity search, compared to the state-of-the-art hashing methods.
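
A minimal Python/numpy sketch of the pipeline described above, assuming a Gaussian-kernel probability representation over each point's s nearest landmarks and plain PCA for the hash projections; the paper's exact kernel and its manifold-learning term are not reproduced here, and all names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def probability_representation(X, landmarks, sigma=1.0, s=5):
    """Represent each point by normalized kernel affinities to its s nearest landmarks."""
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)  # squared distances to landmarks
    P = np.exp(-d2 / (2 * sigma ** 2))
    far = np.argsort(d2, axis=1)[:, s:]                          # zero out all but the s closest
    np.put_along_axis(P, far, 0.0, axis=1)
    return P / P.sum(axis=1, keepdims=True)                      # each row is a distribution

def learn_hash_functions(P, n_bits=16):
    """PCA on the probability representations; signs of the projections give hash bits."""
    mean = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)
    return mean, Vt[:n_bits].T                                   # top principal directions

def hash_codes(P, mean, W):
    return ((P - mean) @ W > 0).astype(np.uint8)

# toy usage: landmarks from a linear clustering method, then 16-bit codes
X = np.random.randn(1000, 32)
landmarks = KMeans(n_clusters=64, n_init=10).fit(X).cluster_centers_
P = probability_representation(X, landmarks)
mean, W = learn_hash_functions(P)
codes = hash_codes(P, mean, W)
print(codes.shape)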

183 citations


Journal ArticleDOI
TL;DR: This paper proposes an effective probability-based semantics-preserving hashing (SePH) method to tackle the problem of cross-view retrieval, and conducts extensive experiments on diverse benchmark datasets to evaluate the proposed SePH.
Abstract: For efficiently retrieving nearest neighbors from large-scale multiview data, recently hashing methods are widely investigated, which can substantially improve query speeds. In this paper, we propose an effective probability-based semantics-preserving hashing (SePH) method to tackle the problem of cross-view retrieval. Considering the semantic consistency between views, SePH generates one unified hash code for all observed views of any instance. For training, SePH first transforms the given semantic affinities of training data into a probability distribution, and aims to approximate it with another one in Hamming space, via minimizing their Kullback–Leibler divergence. Specifically, the latter probability distribution is derived from all pair-wise Hamming distances between to-be-learnt hash codes of the training data. Then with learnt hash codes, any kind of predictive models like linear ridge regression, logistic regression, or kernel logistic regression, can be learnt as hash functions in each view for projecting the corresponding view-specific features into hash codes. As for out-of-sample extension, given any unseen instance, the learnt hash functions in its observed views can predict view-specific hash codes. Then by deriving or estimating the corresponding output probabilities with respect to the predicted view-specific hash codes, a novel probabilistic approach is further proposed to utilize them for determining a unified hash code. To evaluate the proposed SePH, we conduct extensive experiments on diverse benchmark datasets, and the experimental results demonstrate that SePH is reasonable and effective.
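
As a rough illustration of the training objective, the sketch below computes a SePH-style Kullback-Leibler divergence between a distribution derived from the semantic affinity matrix and one derived from relaxed Hamming distances of to-be-learnt codes; the Student-t form of the latter distribution is an assumption made here for illustration, and minimizing this quantity with respect to H by gradient descent is what pulls semantically similar pairs together in Hamming space.

import numpy as np

def seph_style_kl(A, H, eps=1e-12):
    """
    A : (n, n) semantic affinity matrix (e.g., fraction of shared labels).
    H : (n, b) relaxed real-valued hash codes in [-1, 1].
    Returns KL(p || q) between the affinity-derived distribution p and the
    Hamming-distance-derived distribution q over instance pairs.
    """
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    p = A[iu] + eps
    p /= p.sum()                                                  # normalize affinities over all pairs
    ham = 0.25 * ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)   # relaxed Hamming distance
    w = 1.0 / (1.0 + ham[iu])                                     # heavier weight for closer code pairs
    q = w / w.sum()
    return float(np.sum(p * np.log(p / (q + eps))))

# toy usage: gradient descent on H would decrease this divergence
A = np.random.rand(50, 50); A = (A + A.T) / 2
H = np.tanh(np.random.randn(50, 16))
print(seph_style_kl(A, H))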

169 citations


Posted Content
TL;DR: This paper proposes a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification, addressing limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited).
Abstract: With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results show that our method outperforms current state-of-the-art methods on benchmark datasets.

152 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel ranking-based hashing framework that maps data from different modalities into a common Hamming space where the cross-modal similarity can be measured using Hamming distance and shows that the ranking-based hash function has a natural probabilistic approximation which transforms the original highly discontinuous optimization problem into one that can be efficiently solved using simple gradient descent algorithms.
Abstract: Hashing has attracted a great deal of research in recent years due to its effectiveness for the retrieval and indexing of large-scale high-dimensional multimedia data. In this paper, we propose a novel ranking-based hashing framework that maps data from different modalities into a common Hamming space where the cross-modal similarity can be measured using Hamming distance. Unlike existing cross-modal hashing algorithms where the learned hash functions are binary space partitioning functions, such as the sign and threshold function, the proposed hashing scheme takes advantage of a new class of hash functions closely related to rank correlation measures which are known to be scale-invariant, numerically stable, and highly nonlinear. Specifically, we jointly learn two groups of linear subspaces, one for each modality, so that features’ ranking orders in different linear subspaces maximally preserve the cross-modal similarities. We show that the ranking-based hash function has a natural probabilistic approximation which transforms the original highly discontinuous optimization problem into one that can be efficiently solved using simple gradient descent algorithms. The proposed hashing framework is also flexible in the sense that the optimization procedures are not tied up to any specific form of loss function, which is typical for existing cross-modal hashing methods, but rather we can flexibly accommodate different loss functions with minimal changes to the learning steps. We demonstrate through extensive experiments on four widely-used real-world multimodal datasets that the proposed cross-modal hashing method can achieve competitive performance against several state-of-the-art methods with only moderate training and testing time.

117 citations


Journal ArticleDOI
TL;DR: This work presents a novel ER system, called DeepER, that achieves good accuracy, high efficiency, and ease of use, requiring much less human-labeled data and no feature engineering compared with traditional machine learning based approaches.
Abstract: Entity resolution (ER) is a key data integration problem. Despite efforts over 70+ years in all aspects of ER, there is still a high demand for democratizing ER - humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations. For efficiency, we propose a locality sensitive hashing (LSH) based blocking approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. For ease-of-use, DeepER requires much less human-labeled data and does not need feature engineering, compared with traditional machine learning based approaches which require handcrafted features, and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.
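
The LSH-based blocking step can be sketched as follows: hash each tuple's distributed representation with random hyperplanes and draw candidate pairs only from colliding buckets. The tuple embeddings are stubbed here with random vectors; in DeepER they would come from the composed word embeddings of all attributes, so the helper names below are illustrative.

import numpy as np
from collections import defaultdict

def simhash_signature(vecs, n_bits, seed=0):
    """Random-hyperplane (cosine) LSH: one sign bit per hyperplane."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vecs.shape[1], n_bits))
    return vecs @ planes > 0

def lsh_blocks(tuple_vecs, n_bits=8, n_tables=4):
    """Group tuple ids whose signatures collide in at least one table."""
    buckets = defaultdict(set)
    for t in range(n_tables):
        sig = simhash_signature(tuple_vecs, n_bits, seed=t)
        for i, bits in enumerate(sig):
            buckets[(t, bits.tobytes())].add(i)
    pairs = set()                                   # candidate pairs for matching
    for ids in buckets.values():
        ids = sorted(ids)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pairs.add((ids[a], ids[b]))
    return pairs

# toy usage: far fewer candidates than the full quadratic comparison
tuple_vecs = np.random.randn(200, 100)
print(len(lsh_blocks(tuple_vecs)), "candidate pairs instead of", 200 * 199 // 2)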

106 citations


Proceedings ArticleDOI
23 Oct 2017
TL;DR: This work proposes a novel Deep Asymmetric Pairwise Hashing approach (DAPH) for supervised hashing, and devise an efficient alternating algorithm to optimize the asymmetric deep hash functions and high-quality binary code jointly.
Abstract: Recently, deep neural networks based hashing methods have greatly improved the multimedia retrieval performance by simultaneously learning feature representations and binary hash functions. Inspired by the latest advance in the asymmetric hashing scheme, in this work, we propose a novel Deep Asymmetric Pairwise Hashing approach (DAPH) for supervised hashing. The core idea is that two deep convolutional models are jointly trained such that their output codes for a pair of images can well reveal the similarity indicated by their semantic labels. A pairwise loss is elaborately designed to preserve the pairwise similarities between images as well as incorporating the independence and balance hash code learning criteria. By taking advantage of the flexibility of asymmetric hash functions, we devise an efficient alternating algorithm to optimize the asymmetric deep hash functions and high-quality binary code jointly. Experiments on three image benchmarks show that DAPH achieves the state-of-the-art performance on large-scale image retrieval.

105 citations


Proceedings ArticleDOI
04 Aug 2017
TL;DR: This work presents a novel hashing-based technique to drastically reduce the amount of computation needed to train and test neural networks, and demonstrates the scalability and sustainability (energy efficiency) of the proposed algorithm via rigorous experimental evaluations on several datasets.
Abstract: Current deep learning architectures are growing larger in order to learn from complex datasets. These architectures require giant matrix multiplication operations to train millions of parameters. Conversely, there is another growing trend to bring deep learning to low-power, embedded devices. The matrix operations, associated with the training and testing of deep networks, are very expensive from a computational and energy standpoint. We present a novel hashing-based technique to drastically reduce the amount of computation needed to train and test neural networks. Our approach combines two recent ideas, Adaptive Dropout and Randomized Hashing for Maximum Inner Product Search (MIPS), to select the nodes with the highest activations efficiently. Our new algorithm for deep learning reduces the overall computational cost of the forward and backward propagation steps by operating on significantly fewer nodes. As a consequence, our algorithm uses only 5% of the total multiplications, while keeping within 1% of the accuracy of the original model on average. A unique property of the proposed hashing-based back-propagation is that the updates are always sparse. Due to the sparse gradient updates, our algorithm is ideally suited for asynchronous, parallel training, leading to near-linear speedup, as the number of cores increases. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations on several datasets.
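
A simplified sketch of the idea: hash each neuron's weight vector into LSH tables once, and for every input evaluate only the neurons whose signatures collide with the input's signature. Plain SimHash collisions stand in here for the asymmetric MIPS hashing used in the paper, and the class below is an illustration rather than the authors' implementation.

import numpy as np
from collections import defaultdict

class HashedLayer:
    """Dense ReLU layer that only evaluates neurons colliding with the input's hash."""

    def __init__(self, W, b, n_bits=6, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W, self.b = W, b                                    # (d_in, d_out), (d_out,)
        self.planes = rng.standard_normal((n_tables, W.shape[0], n_bits))
        self.tables = []
        for t in range(n_tables):
            table = defaultdict(list)
            sig = W.T @ self.planes[t] > 0                       # hash each neuron's weights
            for j, bits in enumerate(sig):
                table[bits.tobytes()].append(j)
            self.tables.append(table)

    def forward(self, x):
        active = set()
        for t, table in enumerate(self.tables):
            key = (x @ self.planes[t] > 0).tobytes()
            active.update(table.get(key, []))
        active = sorted(active)                                  # neurons likely to have large activation
        out = np.zeros(self.W.shape[1])
        if active:
            out[active] = np.maximum(x @ self.W[:, active] + self.b[active], 0.0)
        return out, active                                       # gradients would touch only `active`

# toy usage
W = np.random.randn(128, 1024) / np.sqrt(128)
layer = HashedLayer(W, np.zeros(1024))
y, active = layer.forward(np.random.randn(128))
print("evaluated", len(active), "of 1024 neurons")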

102 citations


Proceedings ArticleDOI
07 Aug 2017
TL;DR: This paper studies the exploration of generating synthetic data through semi-supervised generative adversarial networks (GANs), which leverages largely unlabeled and limited labeled training data to produce highly compelling data with intrinsic invariance and global coherence, for better understanding statistical structures of natural data.
Abstract: Hashing has been a widely-adopted technique for nearest neighbor search in large-scale image retrieval tasks. Recent research has shown that leveraging supervised information can lead to high quality hashing. However, the cost of annotating data is often an obstacle when applying supervised hashing to a new domain. Moreover, the results can suffer from the robustness problem as the data at training and test stage may come from different distributions. This paper studies the exploration of generating synthetic data through semi-supervised generative adversarial networks (GANs), which leverages largely unlabeled and limited labeled training data to produce highly compelling data with intrinsic invariance and global coherence, for better understanding statistical structures of natural data. We demonstrate that the above two limitations can be well mitigated by applying the synthetic data for hashing. Specifically, a novel deep semantic hashing with GANs (DSH-GANs) is presented, which mainly consists of four components: a deep convolutional neural network (CNN) for learning image representations, an adversary stream to distinguish synthetic images from real ones, a hash stream for encoding image representations to hash codes and a classification stream. The whole architecture is trained end-to-end by jointly optimizing three losses, i.e., adversarial loss for correctly labeling each sample as synthetic or real, triplet ranking loss to preserve the relative similarity ordering in the input real-synthetic triplets and classification loss to classify each sample accurately. Extensive experiments conducted on both CIFAR-10 and NUS-WIDE image benchmarks validate the capability of exploiting synthetic images for hashing. Our framework also achieves superior results when compared to state-of-the-art deep hash models.

97 citations


Proceedings ArticleDOI
27 Mar 2017
TL;DR: This paper proposes an efficient quality measure for hash functions, based on an information-theoretic quantity, mutual information, and uses it successfully as a criterion to eliminate unnecessary hash table updates, and develops a novel hashing method, MIHash, that can be used in both online and batch settings.
Abstract: Learning-based hashing methods are widely used for nearest neighbor retrieval, and recently, online hashing methods have demonstrated good performance-complexity trade-offs by learning hash functions from streaming data. In this paper, we first address a key challenge for online hashing: the binary codes for indexed data must be recomputed to keep pace with updates to the hash functions. We propose an efficient quality measure for hash functions, based on an information-theoretic quantity, mutual information, and use it successfully as a criterion to eliminate unnecessary hash table updates. Next, we also show how to optimize the mutual information objective using stochastic gradient descent. We thus develop a novel hashing method, MIHash, that can be used in both online and batch settings. Experiments on image retrieval benchmarks (including a 2.5M image dataset) confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions.
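
The quality measure can be read as the mutual information between the Hamming distance to a query and the indicator of whether a retrieved point is a true neighbor; the histogram-based sketch below estimates that quantity for one query (the paper additionally derives a differentiable relaxation for SGD), and all names are illustrative.

import numpy as np

def mutual_information_quality(dists, is_neighbor, n_bits):
    """I(D; C) with D the Hamming distance (0..n_bits) and C the neighbor indicator."""
    bins = np.arange(n_bits + 2)

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    p_d, _ = np.histogram(dists, bins=bins)
    h_d = entropy(p_d / len(dists))
    h_d_given_c = 0.0
    for c in (True, False):
        mask = is_neighbor == c
        if mask.any():
            p_dc, _ = np.histogram(dists[mask], bins=bins)
            h_d_given_c += mask.mean() * entropy(p_dc / mask.sum())
    return h_d - h_d_given_c                       # higher = neighbors better separated

# toy usage: neighbors concentrated at small distances give high MI
n_bits = 32
dists = np.concatenate([np.random.binomial(n_bits, 0.2, 500),    # neighbors
                        np.random.binomial(n_bits, 0.5, 500)])   # non-neighbors
labels = np.array([True] * 500 + [False] * 500)
print(mutual_information_quality(dists, labels, n_bits))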

91 citations


Journal ArticleDOI
TL;DR: This work proposes an adaptive binary quantization (ABQ) method that learns a discriminative hash function with prototypes associated with small unique binary codes, and devises a distributed framework for large-scale learning, which can significantly speed up the training of ABQ in the distributed environment.
Abstract: Hashing has been proved an attractive technique for fast nearest neighbor search over big data. Compared with the projection based hashing methods, prototype-based ones possess stronger power to generate discriminative binary codes for the data with complex intrinsic structure. However, existing prototype-based methods, such as spherical hashing and K-means hashing, still suffer from the ineffective coding that utilizes the complete binary codes in a hypercube. To address this problem, we propose an adaptive binary quantization (ABQ) method that learns a discriminative hash function with prototypes associated with small unique binary codes. Our alternating optimization adaptively discovers the prototype set and the code set of a varying size in an efficient way, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes, and enjoys fast training that is linear in the number of training data. We further devise a distributed framework for the large-scale learning, which can significantly speed up the training of ABQ in the distributed environment that has been widely deployed in many areas nowadays. The extensive experiments on four large-scale (up to 80 million) data sets demonstrate that our method significantly outperforms state-of-the-art hashing methods, with relative performance gains of up to 58.84%.

82 citations


Journal ArticleDOI
TL;DR: This paper introduces a novel supervised cross-modality hashing framework, which can generate unified binary codes for instances represented in different modalities and significantly outperforms the state-of-the-art multimodality hashing techniques.
Abstract: With the dramatic development of the Internet, how to exploit large-scale retrieval techniques for multimodal web data has become one of the most popular but challenging problems in computer vision and multimedia. Recently, hashing methods have been used for fast nearest neighbor search in large-scale data spaces, by embedding high-dimensional feature descriptors into a similarity preserving Hamming space with a low dimension. Inspired by this, in this paper, we introduce a novel supervised cross-modality hashing framework, which can generate unified binary codes for instances represented in different modalities. Particularly, in the learning phase, each bit of a code can be sequentially learned with a discrete optimization scheme that jointly minimizes its empirical loss based on a boosting strategy. In a bitwise manner, hash functions are then learned for each modality, mapping the corresponding representations into unified hash codes. We regard this approach as cross-modality sequential discrete hashing (CSDH), which can effectively reduce the quantization errors arising in the oversimplified rounding-off step and thus lead to high-quality binary codes. In the test phase, a simple fusion scheme is utilized to generate a unified hash code for final retrieval by merging the predicted hashing results of an unseen instance from different modalities. The proposed CSDH has been systematically evaluated on three standard data sets: Wiki, MIRFlickr, and NUS-WIDE, and the results show that our method significantly outperforms the state-of-the-art multimodality hashing techniques.

Journal ArticleDOI
TL;DR: This study investigates the use of MDS in image hashing and proposes an MDS-based hashing algorithm resistant to any-angle rotation that outperforms some state-of-the-art algorithms in classification with respect to robustness and discrimination.

Proceedings ArticleDOI
19 Apr 2017
TL;DR: This paper proposes a generic framework for fast tree-isolation-based ensemble anomaly analysis using a Locality-Sensitive Hashing (LSH) forest; it can be instantiated with a diverse range of LSH families, and the fast isolation mechanism can be extended to any distance measures, data types and data spaces where an LSH family is defined.
Abstract: Anomaly or outlier detection is a major challenge in big data analytics because anomaly patterns provide valuable insights for decision-making in a wide range of applications. Recently proposed anomaly detection methods based on the tree isolation mechanism are very fast due to their logarithmic time complexity, making them capable of handling big data sets efficiently. However, the underlying similarity or distance measures in these methods have not been well understood. Contrary to the claims that these methods never rely on any distance measure, we find that they have close relationships with certain distance measures. This implies that the current use of this fast isolation mechanism is only limited to these distance measures and fails to generalise to other commonly used measures. In this paper, we propose a generic framework named LSHiForest for fast tree isolation based ensemble anomaly analysis with the use of a Locality-Sensitive Hashing (LSH) forest. Being generic, the proposed framework can be instantiated with a diverse range of LSH families, and the fast isolation mechanism can be extended to any distance measures, data types and data spaces where an LSH family is defined. In particular, the instances of our framework with kernelised LSH families or learning based hashing schemes can detect complicated anomalies like local or surrounded anomalies. We also formally show that the existing tree isolation based detection methods are special cases of our framework with the corresponding distance measures. Extensive experiments on both synthetic and real-world benchmark data sets show that the framework can achieve both high time efficiency and anomaly detection quality.
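
A much-simplified, single-family sketch of the isolation idea: recursively bucket points with a p-stable (Euclidean) LSH function and score each point by how shallow its average isolation depth is across trees. The real LSHiForest framework admits arbitrary LSH families and a more careful score normalization; the parameters below are illustrative.

import numpy as np

def lsh_split(X, idx, depth, rng, max_depth, bin_width=1.0):
    """Recursively partition points by an LSH bucket id until they isolate."""
    if len(idx) <= 1 or depth >= max_depth:
        return {i: depth for i in idx}
    w = rng.standard_normal(X.shape[1])
    b = rng.uniform(0, bin_width)
    keys = np.floor((X[idx] @ w + b) / bin_width).astype(int)    # multi-way LSH split
    if len(np.unique(keys)) == 1:                                # no progress, stop early
        return {i: depth for i in idx}
    out = {}
    for k in np.unique(keys):
        out.update(lsh_split(X, idx[keys == k], depth + 1, rng, max_depth, bin_width))
    return out

def lsh_iforest_scores(X, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    max_depth = int(np.ceil(np.log2(len(X)))) + 4
    depths = np.zeros(len(X))
    for _ in range(n_trees):
        d = lsh_split(X, np.arange(len(X)), 0, rng, max_depth)
        depths += np.array([d[i] for i in range(len(X))])
    return -depths / n_trees                       # shallower average isolation => more anomalous

# toy usage: the far-away point should usually receive the highest anomaly score
X = np.vstack([np.random.randn(200, 5), [[8.0] * 5]])
print(np.argmax(lsh_iforest_scores(X)))            # expected: 200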

Journal ArticleDOI
TL;DR: This paper develops a robust unsupervised training strategy for a deep hashing network by adapting the corresponding optimization objective and constructing the hash mapping function via a deep neural network.
Abstract: In this paper, a novel unsupervised hashing algorithm, referred to as t-USMVH, and its extension to unsupervised deep hashing, referred to as t-UDH, are proposed to support large-scale video-to-video retrieval. To improve robustness of the unsupervised learning, the t-USMVH combines multiple types of feature representations and effectively fuses them by examining a continuous relevance score based on a Gaussian estimation over pairwise distances, and also a discrete neighbor score based on the cardinality of reciprocal neighbors. To reduce sensitivity to scale changes for mapping objects that are far apart from each other, Student t-distribution is used to estimate the similarity between the relaxed hash code vectors for keyframes. This results in more accurate preservation of the desired unsupervised similarity structure in the hash code space. By adapting the corresponding optimization objective and constructing the hash mapping function via a deep neural network, we develop a robust unsupervised training strategy for a deep hashing network. The efficiency and effectiveness of the proposed methods are evaluated on two public video collections via comparisons against multiple classical and the state-of-the-art methods.

Proceedings Article
13 Feb 2017
TL;DR: This paper proposes to impose a so-called spectral rotation technique to the spectral hashing objective, which could transform the candidate solution into a new one that better approximates the discrete one and shows the method outperforms state-of-the-art hashing methods, especially the spectral graph ones.
Abstract: Faced with today's requirements for processing huge amounts of data, hashing techniques have attracted much attention due to their efficient storage and search ability. Among these techniques, the ones based on spectral graphs show remarkable performance, as they embed the data on a low-dimensional manifold and maintain the neighborhood structure via a non-linear spectral eigenmap. However, the real-valued spectral solution of such methods may deviate from the discrete solution. The common practice is simply to perform a rounding operation to obtain the final binary codes, which can violate the constraints and even degrade the solution. In this paper, we propose to apply a so-called spectral rotation technique to the spectral hashing objective, which transforms the candidate solution into a new one that better approximates the discrete solution. Moreover, the binary codes are obtained from the modified solution by minimizing the Euclidean distance, which yields stronger semantic correlation within the manifold, while the constraints on the codes always hold. We provide an efficient alternating algorithm to solve the above problems, and a manifold learning perspective motivating the proposed method is also given. Extensive experiments are conducted on three large-scale benchmark datasets, and the results show our method outperforms state-of-the-art hashing methods, especially the spectral graph ones.
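
The rotation step can be illustrated with the familiar alternation below: fix the binary codes as the signs of the rotated spectral embedding, then refit the rotation by orthogonal Procrustes. This ITQ-style sketch only conveys the idea of rotating the real-valued spectral solution toward a binary one; the paper's spectral rotation objective and its constraints differ in the details.

import numpy as np

def rotate_toward_binary(Y, n_iter=50):
    """
    Y : (n, b) real-valued spectral embedding (e.g., graph Laplacian eigenvectors).
    Alternates B = sign(Y R) with R = argmin over rotations of ||Y R - B||_F,
    the latter solved in closed form via SVD (orthogonal Procrustes).
    """
    R = np.eye(Y.shape[1])
    for _ in range(n_iter):
        B = np.sign(Y @ R)
        B[B == 0] = 1
        U, _, Vt = np.linalg.svd(Y.T @ B)
        R = U @ Vt                                 # closest rotation mapping Y onto B
    return B, R

# toy usage with an orthonormal stand-in for a spectral embedding
Y = np.linalg.qr(np.random.randn(500, 8))[0]
B, R = rotate_toward_binary(Y)
print(B.shape, np.abs(R @ R.T - np.eye(8)).max())  # binary codes and an orthogonality check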

Proceedings ArticleDOI
01 Jun 2017
TL;DR: In this paper, the first locality-sensitive hashing schemes for the discrete Frechet distance and the dynamic time warping distance were proposed, where the distance measures internally optimize the alignment between the curves.
Abstract: We study data structures for storing a set of polygonal curves in R^d such that, given a query curve, we can efficiently retrieve similar curves from the set, where similarity is measured using the discrete Frechet distance or the dynamic time warping distance. To this end we devise the first locality-sensitive hashing schemes for these distance measures. A major challenge is posed by the fact that these distance measures internally optimize the alignment between the curves. We give solutions for different types of alignments including constrained and unconstrained versions. For unconstrained alignments, we improve over a result by Indyk [SoCG 2002] for short curves. Let n be the number of input curves and let m be the maximum complexity of a curve in the input. In the particular case where m ≤ (α/(4d)) log n, for some fixed α > 0, our solutions imply an approximate near-neighbor data structure for the discrete Frechet distance that uses space in O(n^(1+α) log n) and achieves query time in O(n^α log^2 n) and constant approximation factor. Furthermore, our solutions provide a trade-off between approximation quality and computational performance: for any parameter k in [m], we can give a data structure that uses space in O(2^(2k) m^(k-1) n log n + nm), answers queries in O(2^(2k) m^k log n) time and achieves approximation factor in O(m/k).
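
The core hashing idea for curves can be sketched by snapping every vertex to a randomly shifted grid and collapsing consecutive duplicates, so that curves that stay close under the discrete Frechet distance tend to produce the same snapped sequence. This is a bare-bones simplification of the paper's constructions and carries none of their guarantees; the cell size and names are illustrative.

import numpy as np

def curve_hash(curve, cell, shift):
    """Snap vertices of a polygonal curve to a shifted grid and drop consecutive repeats."""
    snapped = np.floor((curve + shift) / cell).astype(int)
    key = [tuple(snapped[0])]
    for v in snapped[1:]:
        if tuple(v) != key[-1]:
            key.append(tuple(v))
    return tuple(key)                              # the hash key is the reduced vertex sequence

def build_lsh_table(curves, cell, seed=0):
    rng = np.random.default_rng(seed)
    shift = rng.uniform(0, cell, size=curves[0].shape[1])
    table = {}
    for i, c in enumerate(curves):
        table.setdefault(curve_hash(c, cell, shift), []).append(i)
    return table, shift

# toy usage: two near-identical curves usually land in the same bucket
base = np.cumsum(np.random.randn(20, 2), axis=0)
curves = [base, base + 0.005, np.cumsum(np.random.randn(20, 2), axis=0)]
table, shift = build_lsh_table(curves, cell=1.0)
print(list(table.values()))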

Journal ArticleDOI
TL;DR: The analyses justify its resilience to existing and newly introduced security and privacy attacks, and show that it satisfies the revocability and unlinkability criteria of cancelable biometrics.
Abstract: In this paper, we propose a ranking-based locality sensitive hashing inspired two-factor cancelable biometric scheme, dubbed "Index-of-Max" (IoM) hashing, for biometric template protection. With externally generated random parameters, IoM hashing transforms a real-valued biometric feature vector into a discrete index (max ranked) hashed code. We demonstrate two realizations of the IoM hashing notion, namely Gaussian Random Projection based and Uniformly Random Permutation based hashing schemes. The discrete-index representation of IoM hashed codes enjoys several merits. Firstly, IoM hashing offers strong concealment of the biometric information. This contributes to the solid ground of non-invertibility guarantee. Secondly, IoM hashing is insensitive to the feature magnitude, and hence is more robust against biometric feature variation. Thirdly, the magnitude-independence trait of IoM hashing makes the hash codes scale-invariant, which is critical for matching and feature alignment. The experimental results demonstrate favorable accuracy performance on benchmark FVC2002 and FVC2004 fingerprint databases. The analyses justify its resilience to existing and newly introduced security and privacy attacks, and show that it satisfies the revocability and unlinkability criteria of cancelable biometrics.
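
A small sketch of the Gaussian Random Projection realization: each hashed digit is the index of the maximum response among a group of random projections, which makes the code discrete, magnitude-independent and, under the same external random parameters, exactly scale-invariant. The group size and digit count below are illustrative, not the paper's settings.

import numpy as np

def iom_grp_hash(v, n_digits=64, group_size=16, seed=0):
    """Index-of-Max hashing, Gaussian Random Projection realization."""
    rng = np.random.default_rng(seed)              # externally generated random parameters
    codes = np.empty(n_digits, dtype=int)
    for i in range(n_digits):
        W = rng.standard_normal((group_size, v.shape[0]))
        codes[i] = int(np.argmax(W @ v))           # record only the winning index
    return codes

def iom_similarity(c1, c2):
    """Matching score: fraction of digit positions that agree."""
    return float(np.mean(c1 == c2))

# toy usage: small feature noise keeps most digits; pure rescaling keeps all of them
v = np.random.randn(300)
print(iom_similarity(iom_grp_hash(v), iom_grp_hash(v + 0.05 * np.random.randn(300))))
print(iom_similarity(iom_grp_hash(v), iom_grp_hash(2.0 * v)))   # scale-invariant -> 1.0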

Journal ArticleDOI
TL;DR: This paper proposes OSH, an Online Supervised Hashing technique based on Error Correcting Output Codes; it considers a stochastic setting where the data arrives sequentially, learns and adapts its hashing functions in a discriminative manner, and yields state-of-the-art retrieval performance.

Proceedings ArticleDOI
19 Jun 2017
TL;DR: In this article, the authors considered the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|) and presented a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x e P|x) where n = |P| and ρ = log( 1/b1)/log(1/b2).
Abstract: We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x ∈ P with B(q, x) ≥ b1, then we can efficiently return x′ ∈ P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n^(1+ρ) log n + Σ_{x ∈ P} |x|) and query time O(|q| n^ρ log n), where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014), we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size, our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
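
For concreteness, the similarity measure and the exponent ρ that governs the space and query bounds can be computed as follows (a trivial sketch of the definitions only, not of the data structure itself):

import math

def braun_blanquet(x, y):
    """B(x, y) = |x ∩ y| / max(|x|, |y|) for finite sets."""
    x, y = set(x), set(y)
    return len(x & y) / max(len(x), len(y))

def rho(b1, b2):
    """Exponent in the space O(n^(1+rho) log n) and query time O(|q| n^rho log n) bounds."""
    assert 0 < b2 < b1 < 1
    return math.log(1 / b1) / math.log(1 / b2)

print(braun_blanquet({1, 2, 3, 4}, {3, 4, 5}))     # 2 / 4 = 0.5
print(rho(0.5, 0.25))                              # 0.5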

Posted Content
TL;DR: This paper proposes a new sampling scheme and an unbiased estimator that estimates the partition function accurately in sub-linear time and demonstrates the effectiveness of the proposed approach against other state-of-the-art estimation techniques including IS and the efficient variant of Gumbel-Max sampling.
Abstract: Log-linear models are arguably the most successful class of graphical models for large-scale applications because of their simplicity and tractability. Learning and inference with these models require calculating the partition function, which is a major bottleneck and intractable for large state spaces. Importance Sampling (IS) and MCMC-based approaches are lucrative. However, the condition of having a "good" proposal distribution is often not satisfied in practice. In this paper, we add a new dimension to efficient estimation via sampling. We propose a new sampling scheme and an unbiased estimator that estimates the partition function accurately in sub-linear time. Our samples are generated in near-constant time using locality sensitive hashing (LSH), and so are correlated and unnormalized. We demonstrate the effectiveness of our proposed approach by comparing the accuracy and speed of estimating the partition function against other state-of-the-art estimation techniques including IS and the efficient variant of Gumbel-Max sampling. With our efficient sampling scheme, we accurately train real-world language models using only 1-2% of computations.

Posted Content
TL;DR: This paper develops learning to rank formulations for hashing, aimed at directly optimizing ranking-based evaluation metrics such as Average Precision and Normalized Discounted Cumulative Gain.
Abstract: Hashing, or learning binary embeddings of data, is frequently used in nearest neighbor retrieval. In this paper, we develop learning to rank formulations for hashing, aimed at directly optimizing ranking-based evaluation metrics such as Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). We first observe that the integer-valued Hamming distance often leads to tied rankings, and propose to use tie-aware versions of AP and NDCG to evaluate hashing for retrieval. Then, to optimize tie-aware ranking metrics, we derive their continuous relaxations, and perform gradient-based optimization with deep neural networks. Our results establish the new state-of-the-art for image retrieval by Hamming ranking in common benchmarks.
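
What "tie-aware AP" measures can be illustrated with a Monte-Carlo stand-in: the average precision expected under uniformly random breaking of the ties induced by integer Hamming distances. The paper derives closed-form tie-aware metrics and continuous relaxations for training; the sketch below only illustrates the quantity being optimized.

import numpy as np

def tie_aware_ap(hamming_dists, relevant, n_samples=200, seed=0):
    """Expected Average Precision over random tie-breaking of integer Hamming distances."""
    rng = np.random.default_rng(seed)
    aps = []
    for _ in range(n_samples):
        noise = rng.uniform(0, 1, size=len(hamming_dists))   # break ties uniformly at random
        order = np.lexsort((noise, hamming_dists))           # sort by distance, then by noise
        rel = np.asarray(relevant)[order]
        hits = np.cumsum(rel)
        precisions = hits[rel] / (np.flatnonzero(rel) + 1)   # precision at each relevant rank
        aps.append(precisions.mean() if rel.any() else 0.0)
    return float(np.mean(aps))

# toy usage: three items tied at distance 1, only one of them relevant
dists = np.array([0, 1, 1, 1, 2, 2])
rel = np.array([True, True, False, False, True, False])
print(tie_aware_ap(dists, rel))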

Proceedings ArticleDOI
23 Oct 2017
TL;DR: This paper proposes a novel CNN-based unsupervised hashing method, namely Unsupervised Triplet Hashing (UTH), designed based on three principles: 1) maximizing the discrimination among image representations; 2) minimizing the quantization loss between the original real-valued feature descriptors and the learned hash codes; and 3) maximizing the information entropy of the learned hash codes.
Abstract: The explosive growth of multimedia contents has made hashing an indispensable component in image retrieval. In particular, learning-based hashing has recently shown great promising with the advance of Convolutional Neural Network (CNN). However, the existing hashing methods are mostly tuned for classification. Learning hash functions for retrieval tasks, especially for instance-level retrieval, still faces many challenges. Considering the difficulty in obtaining labeled datasets for image retrieval task in large scale, we propose a novel CNN-based unsupervised hashing method, namely Unsupervised Triplet Hashing (UTH). The unsupervised hashing network is designed based on the following three principles: 1) maximizing the discrimination among image representations; 2) minimizing the quantization loss between the original real-valued feature descriptors and the learned hash codes; 3) maximizing the information entropy for the learned hash codes to improve their representation ability. Extensive experiments on CIFAR-10, MNIST and In-shop datasets have shown that UTH outperforms several state-of-the-art unsupervised hashing methods in terms of retrieval accuracy.

Proceedings ArticleDOI
23 Oct 2017
TL;DR: Experiments demonstrate that the proposed pseudo label based unsupervised deep discriminative hashing method outperforms state-of-the-art unsupervised hashing methods.
Abstract: Hashing methods play an important role in large scale image retrieval. Traditional hashing methods use hand-crafted features to learn hash functions, which cannot capture the high level semantic information. Deep hashing algorithms use deep neural networks to learn feature representation and hash functions simultaneously. Most of these algorithms exploit supervised information to train the deep network. However, supervised information is expensive to obtain. In this paper, we propose a pseudo label based unsupervised deep discriminative hashing algorithm. First, we cluster images via K-means and the cluster labels are treated as pseudo labels. Then we train a deep hashing network with pseudo labels by minimizing the classification loss and quantization loss. Experiments on two datasets demonstrate that our unsupervised deep discriminative hashing method outperforms state-of-the-art unsupervised hashing methods.

Journal ArticleDOI
TL;DR: This paper contributes a more generalized multi-layer LSPH (ML-LSPH) framework, in which hierarchical representations can be effectively learned by a multiplicative up-propagation algorithm; experimental results on three large-scale retrieval datasets show that ML-LSPH achieves better performance than the single-layer LSPH and both of them outperform existing hashing techniques on large-scale data.
Abstract: Aiming at efficient similarity search, hash functions are designed to embed high-dimensional feature descriptors to low-dimensional binary codes such that similar descriptors will lead to binary codes with a short distance in the Hamming space. It is critical to effectively maintain the intrinsic structure and preserve the original information of data in a hashing algorithm. In this paper, we propose a novel hashing algorithm called Latent Structure Preserving Hashing (LSPH), with the target of finding a well-structured low-dimensional data representation from the original high-dimensional data through a novel objective function based on Nonnegative Matrix Factorization (NMF) with their corresponding Kullback-Leibler divergence of data distribution as the regularization term. Via exploiting the joint probabilistic distribution of data, LSPH can automatically learn the latent information and successfully preserve the structure of high-dimensional data. To further achieve robust performance with complex and nonlinear data, in this paper, we also contribute a more generalized multi-layer LSPH (ML-LSPH) framework, in which hierarchical representations can be effectively learned by a multiplicative up-propagation algorithm. Once obtaining the latent representations, the hash functions can be easily acquired through multi-variable logistic regression. Experimental results on three large-scale retrieval datasets, i.e., SIFT 1M, GIST 1M and 500 K TinyImage, show that ML-LSPH can achieve better performance than the single-layer LSPH and both of them outperform existing hashing techniques on large-scale data.

Proceedings ArticleDOI
Peng-Fei Zhang, Chuan-Xiang Li, Meng-Yuan Liu, Liqiang Nie, Xin-Shun Xu
23 Oct 2017
TL;DR: This paper proposes a novel supervised cross-modal hashing method---Semi-Relaxation Supervised Hashing (SRSH), which can learn the hash functions and the binary codes simultaneously and relaxes a part of binary constraints, instead of all of them, by introducing an intermediate representation variable.
Abstract: Recently, some cross-modal hashing methods have been devised for the cross-modal search task. Essentially, given a similarity matrix, most of these methods tackle a discrete optimization problem by separating it into two stages, i.e., first relaxing the binary constraints and finding a solution of the relaxed optimization problem, then quantizing the solution to obtain the binary codes. This scheme will generate large quantization error. Some discrete optimization methods have been proposed to tackle this; however, the generation of the binary codes is independent of the features in the original space, which makes it not robust to noise. To address these problems, in this paper, we propose a novel supervised cross-modal hashing method---Semi-Relaxation Supervised Hashing (SRSH). It can learn the hash functions and the binary codes simultaneously. At the same time, to tackle the optimization problem, it relaxes a part of the binary constraints, instead of all of them, by introducing an intermediate representation variable. By doing this, the quantization error can be reduced and the optimization problem can also be easily solved by an iterative algorithm proposed in this paper. Extensive experimental results on three benchmark datasets demonstrate that SRSH can obtain competitive results and outperform state-of-the-art unsupervised and supervised cross-modal hashing methods.

Journal ArticleDOI
TL;DR: Experimental results show that the fast niching versions of the multimodal algorithms can exhibit similar or even better performance than their original ones and the execution time of the algorithms is significantly reduced.
Abstract: Niching techniques have recently been incorporated into evolutionary algorithms (EAs) for multisolution optimization in multimodal landscapes. However, existing niching techniques inevitably increase the time complexity of basic EAs due to the computation of the distance matrix of individuals. In this paper, we propose a fast niching technique. The technique avoids pairwise distance calculations by introducing locality sensitive hashing, an efficient algorithm for approximately retrieving nearest neighbors. Individuals are projected to a number of buckets by hash functions. Similar individuals have a higher probability of being hashed into the same bucket than dissimilar ones. Then, interactions between individuals are limited to the candidates that fall in the same bucket to achieve local evolution. It is proved that the complexity of the proposed fast niching is linear in the population size. In addition, this mechanism induces stable niching behavior and it inherently keeps a balance between the exploration and exploitation of multiple optima. The theoretical analysis conducted in this paper suggests that the proposed technique is able to provide bounds for the exploration and exploitation probabilities. Experimental results show that the fast niching versions of the multimodal algorithms can exhibit similar or even better performance than their original ones. More importantly, the execution time of the algorithms is significantly reduced.
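
The bucketing step can be sketched as follows: project individuals with a p-stable LSH function, group them by hash key, and restrict interactions to individuals in the same bucket, so that no pairwise distance matrix is ever formed. The hash parameters and the fallback rule below are illustrative assumptions, not the paper's exact configuration.

import numpy as np
from collections import defaultdict

def lsh_buckets(pop, n_hashes=4, bin_width=0.5, seed=0):
    """Assign each individual to a bucket via p-stable (Euclidean) LSH; linear in the population size."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((pop.shape[1], n_hashes))
    b = rng.uniform(0, bin_width, size=n_hashes)
    keys = np.floor((pop @ W + b) / bin_width).astype(int)
    buckets = defaultdict(list)
    for i, k in enumerate(keys):
        buckets[tuple(k)].append(i)
    return buckets

def local_mates(pop, seed=1):
    """Pick each individual's interaction partner from its own bucket (fall back to anyone)."""
    rng = np.random.default_rng(seed)
    buckets = lsh_buckets(pop)
    mates = np.empty(len(pop), dtype=int)
    for ids in buckets.values():
        for i in ids:
            choices = [j for j in ids if j != i] or [j for j in range(len(pop)) if j != i]
            mates[i] = rng.choice(choices)
    return mates                                   # interactions confined to similar individuals (niches)

# toy usage: a bimodal population splits into (mostly) two groups of buckets
pop = np.vstack([np.random.randn(50, 3), np.random.randn(50, 3) + 6.0])
print(len(lsh_buckets(pop)), "buckets for 100 individuals")
print(local_mates(pop)[:10])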

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper proposes a novel zero-shot hashing approach, called Discrete Similarity Transfer Network (SitNet), to preserve the semantic similarity between images from both “seen” concepts and new “unseen” concepts; experiments validate the superiority of SitNet over state-of-the-art methods.
Abstract: Hashing has been widely utilized for fast image retrieval recently. With semantic information as supervision, hashing approaches perform much better, especially when combined with deep convolutional neural networks (CNN). However, in practice, new concepts emerge every day, making it infeasible to collect supervised information for re-training the hashing model. In this paper, we propose a novel zero-shot hashing approach, called Discrete Similarity Transfer Network (SitNet), to preserve the semantic similarity between images from both “seen” concepts and new “unseen” concepts. Motivated by zero-shot learning, the semantic vectors of concepts are adopted to capture the similarity structures among classes, making the model trained with seen concepts generalize well for unseen ones. We adopt a multi-task architecture to exploit the supervised information for seen concepts and the semantic vectors simultaneously. Moreover, a discrete hashing layer is integrated into the network for hash code generation to avoid the information loss caused by real-value relaxation in the training phase, which is a critical problem in existing works. Experiments on three benchmarks validate the superiority of SitNet over state-of-the-art methods.

Proceedings ArticleDOI
16 Jan 2017
TL;DR: The end result is the first instance of a simple, practical algorithm that provably leverages data-dependent hashing to improve upon data-oblivious LSH, and is provably better than the best LSH algorithm for the Hamming space.
Abstract: We analyze LSH Forest [BCG05]---a popular heuristic for the nearest neighbor search---and show that a careful yet simple modification of it outperforms "vanilla" LSH algorithms. The end result is the first instance of a simple, practical algorithm that provably leverages data-dependent hashing to improve upon data-oblivious LSH. Here is the entire algorithm for the d-dimensional Hamming space. The LSH Forest, for a given dataset, applies a random permutation to all the d coordinates, and builds a trie on the resulting strings. In our modification, we further augment this trie: for each node, we store a constant number of points close to the mean of the corresponding subset of the dataset, which are compared to any query point reaching that node. The overall data structure is simply several such tries sampled independently. While the new algorithm does not quantitatively improve upon the best data-dependent hashing algorithms from [AR15] (which are known to be optimal), it is significantly simpler, being based on a practical heuristic, and is provably better than the best LSH algorithm for the Hamming space [IM98, HIM12].

Posted Content
TL;DR: This paper proposes a novel generative approach to learn hash functions through Minimum Description Length principle such that the learned hash codes maximally compress the dataset and can also be used to regenerate the inputs.
Abstract: Learning-based binary hashing has become a powerful paradigm for fast search and retrieval in massive databases. However, due to the requirement of discrete outputs for the hash functions, learning such functions is known to be very challenging. In addition, the objective functions adopted by existing hashing techniques are mostly chosen heuristically. In this paper, we propose a novel generative approach to learn hash functions through Minimum Description Length principle such that the learned hash codes maximally compress the dataset and can also be used to regenerate the inputs. We also develop an efficient learning algorithm based on the stochastic distributional gradient, which avoids the notorious difficulty caused by binary output constraints, to jointly optimize the parameters of the hash function and the associated generative model. Extensive experiments on a variety of large-scale datasets show that the proposed method achieves better retrieval results than the existing state-of-the-art methods.

Proceedings ArticleDOI
16 Jan 2017
TL;DR: This paper presents a parameter-free way of using multi-probing for LSH families that support it, and shows that for many such families this approach allows us to get expected query time close to O(n^ρ + t), which is the best one can hope to achieve using LSH.
Abstract: We present a data structure for spherical range reporting on a point set S, i.e., reporting all points in S that lie within radius r of a given query point q (with a small probability of error). Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from q to the points of S (and in particular on the number of points to report), our data structure is essentially parameter-free and only takes as parameter the space the user is willing to allocate. Nevertheless, its expected query time basically matches that of an LSH data structure whose parameters have been optimally chosen for the data and query in question under the given space constraints. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH parameter settings) and easy queries such as those where the number of points to report is a constant fraction of S, or where almost all points in S are far away from the query point. In contrast, known data structures fix LSH parameters based on certain parameters of the input alone. The algorithm has expected query time bounded by O(t(n/t)^ρ), where t is the number of points to report and ρ ∈ (0, 1) depends on the data distribution and the strength of the LSH family used. The previously best running time in high dimensions was Ω(t n^ρ), achieved by traditional LSH-based data structures where parameters are tuned for outputting a single point within distance r. Further, for many data distributions where the intrinsic dimensionality of the point set close to q is low, we can give improved upper bounds on the expected query time. We finally present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get expected query time close to O(n^ρ + t), which is the best we can hope to achieve using LSH.