Topic

Feature hashing

About: Feature hashing is a research topic. Over its lifetime, 993 publications have been published within this topic, receiving 51,462 citations.
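For orientation, the core trick is easy to state: every feature name is hashed to an index of a fixed-size vector, so an unbounded feature space (e.g., all possible tokens) collapses into a bounded one. Below is a minimal Python sketch of the signed variant popularized by Weinberger et al.; the hash function, dimension, and sign convention are illustrative choices, not a reference implementation.

import hashlib

def feature_hash(features, dim=1024):
    """Map {feature_name: value} into a fixed-size vector (the hashing trick).

    Each feature name is hashed to a bucket in [0, dim); a second bit of the
    same hash chooses a sign, so colliding features tend to cancel rather
    than bias the estimate (the signed scheme of Weinberger et al.).
    MD5 is used only as a convenient deterministic hash, not for security.
    """
    vec = [0.0] * dim
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")
        index = h % dim                        # low bits pick the bucket
        sign = 1.0 if (h >> 63) & 1 else -1.0  # top bit picks the sign
        vec[index] += sign * value
    return vec

# An unbounded bag-of-words vocabulary collapses into 1024 dimensions:
x = feature_hash({"the": 2.0, "hashing": 1.0, "trick": 1.0})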


Papers
Proceedings ArticleDOI
13 Apr 2015
TL;DR: A thorough experimental comparison between ART, Judy, two variants of hashing via quadratic probing, and three variants of Cuckoo hashing is presented, which indicates that neither ART nor Judy is competitive with the aforementioned hashing schemes in terms of performance, and, in the case of ART, sometimes not even in terms of space.
Abstract: With prices of main memory constantly decreasing, people nowadays are more interested in performing their computations in main memory, leaving the high I/O costs of traditional disk-based systems out of the equation. This change of paradigm, however, presents new challenges to the way data should be stored and indexed in main memory in order to be processed efficiently. Traditional data structures, like the venerable B-tree, were designed to work on disk-based systems, but they are no longer the way to go in main-memory systems, at least not in their original form, due to poor cache utilization. Because of this, during the last decade there has been a considerable amount of research on index data structures for main-memory systems. Among the most recent and most interesting data structures for main-memory systems is the recently proposed adaptive radix tree ARTful (ART for short). The authors of ART presented experiments indicating that ART was clearly a better choice than other recent tree-based data structures like FAST and B+-trees. However, ART was not the first adaptive radix tree. To the best of our knowledge, the first was the Judy Array (Judy for short), and a comparison between ART and Judy was not shown. Moreover, the same set of experiments indicated that only a hash table was competitive with ART. The hash table used by the authors of ART in their study was a chained hash table, but this kind of hash table can be suboptimal in terms of space and performance due to its potentially high use of pointers. In this paper we present a thorough experimental comparison between ART, Judy, two variants of hashing via quadratic probing, and three variants of Cuckoo hashing. These hashing schemes are known to be very efficient. For our study we consider whether the data structures are to be used as a non-covering index (relying on an additional store) or as a covering index (covering key-value pairs). We consider both OLAP and OLTP scenarios. Our experiments strongly indicate that neither ART nor Judy is competitive with the aforementioned hashing schemes in terms of performance, and, in the case of ART, sometimes not even in terms of space.
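The pointer-chasing overhead that makes chained hash tables suboptimal is exactly what open-addressing schemes avoid: all entries live in one flat array, and collisions are resolved by probing further slots. The following Python sketch shows quadratic probing with triangular-number offsets, one common variant; the paper's two exact variants are not specified in this abstract, so treat this as a generic illustration.

class QuadraticProbingTable:
    """Open-addressing hash table using triangular-number quadratic probing.

    With a power-of-two capacity, the probe offsets 0, 1, 3, 6, 10, ...
    (i*(i+1)/2) visit every slot, so a lookup or insert always terminates.
    For brevity this sketch never resizes, so it assumes the table does not
    fill up.
    """

    _EMPTY = object()

    def __init__(self, capacity=1024):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self._mask = capacity - 1
        self._keys = [self._EMPTY] * capacity
        self._vals = [None] * capacity

    def _probe(self, key):
        slot = hash(key) & self._mask
        i = 0
        while True:
            yield slot
            i += 1
            slot = (slot + i) & self._mask  # cumulative +1,+2,+3,... = triangular offsets

    def put(self, key, value):
        for slot in self._probe(key):
            if self._keys[slot] is self._EMPTY or self._keys[slot] == key:
                self._keys[slot] = key
                self._vals[slot] = value
                return

    def get(self, key, default=None):
        for slot in self._probe(key):
            if self._keys[slot] is self._EMPTY:
                return default
            if self._keys[slot] == key:
                return self._vals[slot]

t = QuadraticProbingTable()
t.put("key", 42)
assert t.get("key") == 42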

28 citations

Journal ArticleDOI
TL;DR: To enable fast fingerprint search over a very large database, a new kernelized multiple feature hashing method is proposed to convert real-valued fingerprints into compact binary-valued fingerprints.
Abstract: Image fingerprinting is regarded as an alternative to watermarking for near-duplicate detection. It consists of feature extraction and feature indexing. Generally, the former is mainly concerned with discrimination, robustness, and security, while the latter focuses on the efficiency of fingerprint search. To enable fast fingerprint search over a very large database, we propose a new kernelized multiple feature hashing method to convert real-valued fingerprints into compact binary-valued fingerprints. During the conversion, the proposed hashing method jointly uses the kernel trick and a multiple-feature-fusion strategy to map an image represented by multiple features into a compact binary code. With the help of the kernel function, the hashing method can be applied to any data format (such as strings, graphs, sets, and so on) as long as an associated kernel function is available for similarity measurement. In addition, taking multiple features into account improves discriminability, since these multiple evidences are complementary to each other. Extensive experimental results show that the proposed algorithm outperforms state-of-the-art kernelized hashing methods by up to 10 percent.
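The indexing half of this pipeline, turning real-valued fingerprint vectors into compact binary codes whose Hamming distance tracks the original similarity, can be illustrated with plain random-hyperplane (sign) hashing. The paper's kernelized, multiple-feature formulation is more elaborate, so the following Python sketch is a generic stand-in, not the proposed method.

import numpy as np

def make_binary_hasher(input_dim, n_bits=64, seed=0):
    """Return a function mapping real vectors to n_bits-bit binary codes.

    Each bit is the sign of a projection onto a random hyperplane, so the
    Hamming distance between two codes approximates the angle between the
    original vectors (classic sign hashing, not the kernelized method above).
    """
    planes = np.random.default_rng(seed).standard_normal((n_bits, input_dim))
    def hasher(x):
        return (planes @ x > 0).astype(np.uint8)
    return hasher

hasher = make_binary_hasher(input_dim=128)
rng = np.random.default_rng(1)
a = rng.standard_normal(128)
b = a + 0.1 * rng.standard_normal(128)                   # a near-duplicate of a
hamming = int(np.count_nonzero(hasher(a) != hasher(b)))  # small for near-duplicates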

28 citations

Proceedings Article
01 Jan 2017
TL;DR: An experimental comparison of different hashing schemes when used inside FH, OPH, and LSH finds that mixed tabulation hashing is almost as fast as the multiply-mod-prime scheme ax+b mod p, and that multiply-mod-prime, while guaranteed to work well on sufficiently random data, can lead to bias and poor concentration on both real-world and synthetic data.
Abstract: Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit-cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world when faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM. We consider the recent mixed tabulation hash function of Dahlgaard et al. [FOCS'15], which was proved theoretically to perform like a truly random hash function in many applications, including the above OPH. Here we first show improved concentration bounds for FH with truly random hashing and then argue that mixed tabulation performs similarly when the input vectors are sparse. Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH. We find that mixed tabulation hashing is almost as fast as the classic multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the very popular MurmurHash3, which has no proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation was 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible input, making it more reliable.
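The multiply-mod-prime scheme the paper benchmarks is the textbook 2-universal hash h(x) = ((ax + b) mod p) mod m, with p prime and a, b drawn at random. A minimal Python sketch follows; the Mersenne prime 2^61 - 1 and the fixed seed are common illustrative choices, not parameters from the paper.

import random

P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1, a common choice for fast reduction

def make_multiply_mod_prime(m, seed=42):
    """2-universal hash x -> ((a*x + b) mod p) mod m, a in [1, p), b in [0, p)."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = make_multiply_mod_prime(m=1 << 20)  # hash integer keys into 2^20 buckets
bucket = h(123456789)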

28 citations

Journal ArticleDOI
TL;DR: A Boosting-style algorithm is developed for simultaneously optimizing the optimal label subsets and hashing functions in a unified formulation, together with a query-adaptive retrieval mechanism based on hash bit selection for mixed queries that works whether or not the query words exist in the training data.
Abstract: This article defines a new hashing task motivated by real-world applications in content-based image retrieval, that is, effective data indexing and retrieval given a mixed query (a query image together with user-provided keywords). Our work is distinguished from state-of-the-art hashing research by two unique features: (1) unlike in conventional image retrieval systems, the input query is a combination of an exemplar image and several descriptive keywords, and (2) the input image data are often associated with multiple labels. These assumptions are more consistent with realistic scenarios. The mixed image-keyword query significantly extends the traditional image-based query and better explicates the user's intention, while complicating semantics-based indexing on the multilabel data. Though several existing hashing methods can be adapted to solve the indexing task, they all prove to suffer from low effectiveness. To enhance hashing efficiency, we propose a novel scheme, "boosted shared hashing". Unlike prior works that learn the hashing functions on either all image labels or a single label, we observe that a hashing function can be more effective if it is designed to index over an optimal label subset; in other words, the association between labels and hash bits is moderately sparse. The sparsity of the bit-label association greatly reduces the computation and storage complexity of indexing a new sample, since only a limited number of hashing functions become active for that sample. We develop a Boosting-style algorithm for simultaneously optimizing the label subsets and hashing functions in a unified formulation, and further propose a query-adaptive retrieval mechanism based on hash bit selection for mixed queries, regardless of whether the query words exist in the training data. Moreover, we show that the proposed method can be easily extended to the case where data similarity is gauged by nonlinear kernel functions. Extensive experiments are conducted on standard image benchmarks like CIFAR-10, NUS-WIDE and a-TRECVID. The results validate both the sparsity of the bit-label association and the convergence of the proposed algorithm, and demonstrate that the proposed hashing scheme achieves substantially superior performance over state-of-the-art methods under the same hash bit budget.
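The query-adaptive retrieval step can be pictured as a masked Hamming similarity: only the hash bits whose label subsets overlap the query keywords contribute to the score. The Python sketch below assumes the per-bit label subsets are already given; learning them is the paper's contribution and is not reproduced here.

import numpy as np

def masked_hamming_similarity(query_code, db_codes, bit_labels, query_labels):
    """Score database codes against a query using only label-relevant bits.

    bit_labels[i] is the (assumed known) set of labels hash bit i indexes;
    a bit is active for this query if it shares any label with the query.
    The similarity is the number of agreeing bits among the active ones.
    """
    active = np.array([bool(labels & query_labels) for labels in bit_labels])
    if not active.any():                      # no keyword overlap: use all bits
        active[:] = True
    agree = db_codes == query_code            # (n_items, n_bits) boolean
    return agree[:, active].sum(axis=1)       # per-item similarity scores

# Hypothetical 4-bit example: bits index label subsets {0,1}, {1}, {2}, {2,3}.
bit_labels = [{0, 1}, {1}, {2}, {2, 3}]
db = np.array([[1, 0, 1, 1], [0, 1, 0, 0]], dtype=np.uint8)
scores = masked_hamming_similarity(np.array([1, 0, 1, 0], dtype=np.uint8),
                                   db, bit_labels, query_labels={1})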

28 citations

Proceedings ArticleDOI
10 Aug 2015
TL;DR: A similarity measure, called Multi-Component Similarity, is proposed that aggregates Hamming similarities in multiple hash tables to capture the non-transitive property of semantic similarity; the proposed MuCH method significantly outperforms state-of-the-art hashing methods, especially in search efficiency.
Abstract: Approximating the semantic similarity between entities in the learned Hamming space is the key to supervised hashing techniques. The semantic similarities between entities are often non-transitive, since entities can share different latent similarity components. For example, in social networks we connect with people for various reasons, such as sharing common interests, working in the same company, being alumni and so on. Obviously, these social connections are non-transitive when people are connected for different reasons. However, existing supervised hashing methods treat the pairwise similarity relationships in a simple and unified way and project data into a single Hamming space, neglecting that the non-transitive property cannot be adequately captured by a single Hamming space. In this paper, we propose a non-transitive hashing method, namely Multi-Component Hashing (MuCH), to identify the latent similarity components and cope with the non-transitive similarity relationships. MuCH generates multiple hash tables, with each hash table corresponding to a similarity component, and preserves the non-transitive similarities in the different hash tables respectively. Moreover, we propose a similarity measure, called Multi-Component Similarity, that aggregates Hamming similarities in multiple hash tables to capture the non-transitive property of semantic similarity. We conduct extensive experiments on one synthetic dataset and two public real-world datasets (i.e. DBLP and NUS-WIDE). The results clearly demonstrate that the proposed MuCH method significantly outperforms state-of-the-art hashing methods, especially in search efficiency.
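The Multi-Component Similarity can be pictured as scoring a pair in each component's Hamming space and letting the best-matching component dominate, so two entities only need to agree on one latent component. A minimal Python sketch under the assumption of a max aggregation over K tables; the paper's exact aggregation function may differ.

import numpy as np

def multi_component_similarity(codes_a, codes_b):
    """Aggregate Hamming similarities over K component hash tables.

    codes_a, codes_b: (K, n_bits) binary codes, one row per hash table.
    Taking the max over components (an assumption made for this sketch) lets
    a pair count as similar if ANY latent component matches, modeling
    non-transitivity: A~B via shared interests and B~C via a shared employer
    need not imply A~C.
    """
    per_table = (codes_a == codes_b).sum(axis=1)  # agreeing bits per table
    return int(per_table.max())

# Two entities that match only on the second latent component (K=2, 4 bits):
a = np.array([[1, 1, 0, 0], [0, 1, 1, 0]], dtype=np.uint8)
b = np.array([[0, 0, 1, 1], [0, 1, 1, 0]], dtype=np.uint8)
sim = multi_component_similarity(a, b)  # 4: perfect agreement in table 2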

27 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 84% related
Feature (computer vision): 128.2K papers, 1.7M citations, 84% related
Deep learning: 79.8K papers, 2.1M citations, 83% related
Support vector machine: 73.6K papers, 1.7M citations, 83% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    33
2022    89
2021    11
2020    16
2019    16
2018    38