scispace - formally typeset
Search or ask a question
Topic

Feature hashing

About: Feature hashing is a research topic. Over the lifetime, 993 publications have been published within this topic receiving 51462 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: This paper compares several families of space hashing functions in a real setup and reveals that unstructured quantizer significantly improves the accuracy of LSH, as it closely fits the data in the feature space.

327 citations

Proceedings ArticleDOI
28 Nov 2011
TL;DR: This paper presents a novel approach - Multiple Feature Hashing (MFH) to tackle both the accuracy and the scalability issues of NDVR and shows that the proposed method outperforms the state-of-the-art techniques in both accuracy and efficiency.
Abstract: Near-duplicate video retrieval (NDVR) has recently attracted lots of research attention due to the exponential growth of online videos. It helps in many areas, such as copyright protection, video tagging, online video usage monitoring, etc. Most of existing approaches use only a single feature to represent a video for NDVR. However, a single feature is often insufficient to characterize the video content. Besides, while the accuracy is the main concern in previous literatures, the scalability of NDVR algorithms for large scale video datasets has been rarely addressed. In this paper, we present a novel approach - Multiple Feature Hashing (MFH) to tackle both the accuracy and the scalability issues of NDVR. MFH preserves the local structure information of each individual feature and also globally consider the local structures for all the features to learn a group of hash functions which map the video keyframes into the Hamming space and generate a series of binary codes to represent the video dataset. We evaluate our approach on a public video dataset and a large scale video dataset consisting of 132,647 videos, which was collected from YouTube by ourselves. The experiment results show that the proposed method outperforms the state-of-the-art techniques in both accuracy and efficiency.

324 citations

Proceedings ArticleDOI
19 Jul 2010
TL;DR: Self-Taught Hashing (STH) as discussed by the authors is a self-taught hashing method that finds the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then trains l classifiers via supervised learning to predict the lbit code for any query document unseen before.
Abstract: The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly.

322 citations

Proceedings ArticleDOI
17 Oct 2011
TL;DR: The key idea behind BitShred is using feature hashing to dramatically reduce the high-dimensional feature spaces that are common in malware analysis, and to mine correlated features between malware families and samples using co-clustering techniques.
Abstract: The sheer volume of new malware found each day is growing at an exponential pace This growth has created a need for automatic malware triage techniques that determine what malware is similar, what malware is unique, and why In this paper, we present BitShred, a system for large-scale malware similarity analysis and clustering, and for automatically uncovering semantic inter- and intra-family relationships within clusters The key idea behind BitShred is using feature hashing to dramatically reduce the high-dimensional feature spaces that are common in malware analysis Feature hashing also allows us to mine correlated features between malware families and samples using co-clustering techniques Our evaluation shows that BitShred speeds up typical malware triage tasks by up to 2,365x and uses up to 82x less memory on a single CPU, all with comparable accuracy to previous approaches We also develop a parallelized version of BitShred, and demonstrate scalability within the Hadoop framework

314 citations

Proceedings ArticleDOI
22 Apr 2001
TL;DR: This work describes an approach for obtaining good hash tables based on using multiple hashes of each input key (which is an IP address), which proves extremely suitable in instances where the goal is to have one hash bucket fit into a cache line.
Abstract: High performance Internet routers require a mechanism for very efficient IP address lookups. Some techniques used to this end, such as binary search on levels, need to construct quickly a good hash table for the appropriate IP prefixes. We describe an approach for obtaining good hash tables based on using multiple hashes of each input key (which is an IP address). The methods we describe are fast, simple, scalable, parallelizable, and flexible. In particular, in instances where the goal is to have one hash bucket fit into a cache line, using multiple hashes proves extremely suitable. We provide a general analysis of this hashing technique and specifically discuss its application to binary search on levels.

294 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
84% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Feature (computer vision)
128.2K papers, 1.7M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Support vector machine
73.6K papers, 1.7M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202333
202289
202111
202016
201916
201838