Topic

Feature hashing

About: Feature hashing is a research topic. Over its lifetime, 993 publications have appeared within this topic, receiving 51,462 citations.


Papers
Journal Article
TL;DR: The major finding is that the performance of these functions with the Nigerian names is comparable to those for the other data sets, and the superiority of the random and division methods over others is confirmed, even though the division method will often be preferred for its ease of computation.
Abstract: The best hash function for a particular data set can often be found by empirical studies. The studies reported here are aimed at discovering the most appropriate function for hashing Nigerian names. Five common hash functions (the division, multiplication, midsquare, radix conversion and random methods), along with two collision-handling techniques (linear probing and chaining), were initially tried out on three data sets, each consisting of about 1000 words. The first data set consists of Nigerian names, the second of English names, and the third of words with computing associations. The major finding is that the performance of these functions with the Nigerian names is comparable to that with the other data sets. Also, the superiority of the random and division methods over the others is confirmed, even though the division method will often be preferred for its ease of computation. It is also demonstrated that chaining is to be preferred as a collision-handling technique. The hash methods and collision-handling methods were further tested by using much larger data sets and long multiple-word strings. These further tests confirmed the previous findings.

4 citations
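
The division method and the two collision-handling techniques named in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the study's code: the way a string is turned into an integer (its UTF-8 bytes), the table size of 11, and the sample names are all assumptions made for the example.

```python
def division_hash(key: str, table_size: int) -> int:
    """Division method: treat the key as an integer and reduce it modulo the table size."""
    # Assumption: the key's UTF-8 bytes are read as one big-endian integer.
    as_int = int.from_bytes(key.encode("utf-8"), "big")
    return as_int % table_size


def insert_chaining(table: list, key: str) -> None:
    """Chaining: each slot holds a list (chain) of all keys that hash to it."""
    chain = table[division_hash(key, len(table))]
    if key not in chain:
        chain.append(key)


def insert_linear_probing(table: list, key: str) -> int:
    """Linear probing: on a collision, step to the next slot until a free one is found.

    Returns the number of extra probes that were needed.
    """
    size = len(table)
    start = division_hash(key, size)
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return step
    raise RuntimeError("hash table is full")


# Hypothetical sample data; a prime table size is a common choice for the division method.
names = ["Adebayo", "Chinwe", "Emeka", "Ngozi", "Tunde"]
chained = [[] for _ in range(11)]
probed = [None] * 11
for name in names:
    insert_chaining(chained, name)
    insert_linear_probing(probed, name)

print("longest chain:", max(len(chain) for chain in chained))
```

With chaining, a collision only lengthens one chain, whereas with linear probing it can displace keys into neighbouring slots, which is one intuition for why chaining might compare well in experiments like those described.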

Book Chapter
02 Oct 2009
TL;DR: In this letter, a new speech hashing scheme based on short-time stability is presented and the characteristic of natural speech that the principal components of linear prediction coefficients among neighboring frames tend to be very similar is utilized to generate the hash sequence.
Abstract: The performance of a perceptual hashing system, which is often measured by discrimination and robustness, is directly related to the features that the system extracts. In this letter, a new speech hashing scheme based on short-time stability is presented. The characteristic of natural speech that the principal components of linear prediction coefficients among neighboring frames tend to be very similar is utilized to generate the hash sequence. Experimental results demonstrate the effectiveness of the proposed scheme in terms of discrimination and robustness.

4 citations

Proceedings Article
11 Jul 2016
TL;DR: A novel hash learning framework that maps high-dimensional multimodal data into a common Hamming space where the cross-modal similarity can be measured using Hamming distance is proposed.
Abstract: Hashing has been widely used for approximate nearest neighbor search of high-dimensional multimedia data. In this paper, we propose a novel hash learning framework that maps high-dimensional multimodal data into a common Hamming space where the cross-modal similarity can be measured using Hamming distance. Unlike existing cross-modal hashing methods that learn hash functions in the form of numeric quantization of linear projections, the proposed hash learning algorithm encodes features' ranking properties and takes advantage of rank correlations, which are known to be scale-invariant, numerically stable and highly nonlinear. Specifically, we learn two groups of subspaces jointly, one for each modality, so that the ranking orders in those subspaces maximally preserve the cross-modal similarity. Extensive experiments on real-world datasets demonstrate the superiority of the proposed methods compared to the state of the art.

4 citations
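
As a small illustration of the retrieval step mentioned in the abstract above, the sketch below compares binary codes by Hamming distance. It does not reproduce the paper's rank-based hash learning; the 8-bit codes, the item names, and the helper `nearest_by_hamming` are hypothetical and stand in for codes that some learned hash function would have produced.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(code_a ^ code_b).count("1")


def nearest_by_hamming(query_code: int, database: dict) -> str:
    """Return the key of the database item whose code is closest to the query code."""
    return min(database, key=lambda name: hamming_distance(query_code, database[name]))


# Hypothetical 8-bit codes for items of one modality (e.g. images).
image_codes = {
    "img_a": 0b10110010,
    "img_b": 0b01001101,
    "img_c": 0b10110110,
}

# Hypothetical code for a query from another modality (e.g. a text caption),
# assumed to live in the same Hamming space as the image codes.
text_query = 0b10110011

print(nearest_by_hamming(text_query, image_codes))
```

Because both modalities share one code space, cross-modal search reduces to XOR and bit counting, which is what makes Hamming-space retrieval cheap.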

Journal Article
01 Sep 1992
TL;DR: Some heuristics for computing the character weights in a Cichelli-style, minimal perfect hashing function are given and an example using the names of the fifty United States is given to illustrate how the weights are determined.
Abstract: Some heuristics for computing the character weights in a Cichelli-style, minimal perfect hashing function are given. These ideas should perform best when applied to relatively small, static sets of character strings and they can be used as the foundation for a large programming assignment. An example using the names of the fifty United States is given to illustrate how the weights are determined.

4 citations
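
A Cichelli-style hash, as it is commonly described, adds the word length to weights assigned to the word's first and last characters. The sketch below searches for such weights by plain backtracking on a handful of state names. It is an illustration under stated assumptions, not the heuristics from the paper above: the function names, the five-word subset, and the weight bound of 4 are all chosen for the example.

```python
def cichelli_hash(word: str, weights: dict) -> int:
    """Cichelli-style hash: word length + weight(first char) + weight(last char)."""
    return len(word) + weights[word[0]] + weights[word[-1]]


def find_weights(words: list, max_weight: int):
    """Backtracking search for character weights giving a minimal perfect hash.

    'Minimal perfect' here means the hash values of the n words are distinct
    and fall within a span of n consecutive integers.
    Returns a dict of weights, or None if no assignment exists within the bound.
    """
    n = len(words)
    letters = sorted({w[0] for w in words} | {w[-1] for w in words})
    weights = {}

    def consistent() -> bool:
        # Only check words whose first and last characters both have weights so far.
        values = [cichelli_hash(w, weights) for w in words
                  if w[0] in weights and w[-1] in weights]
        if len(values) != len(set(values)):
            return False
        return not values or max(values) - min(values) < n

    def assign(index: int) -> bool:
        if index == len(letters):
            return True
        for candidate in range(max_weight + 1):
            weights[letters[index]] = candidate
            if consistent() and assign(index + 1):
                return True
        del weights[letters[index]]
        return False

    return dict(weights) if assign(0) else None


# A small illustrative subset of state names (not the paper's full example).
words = ["ohio", "iowa", "utah", "texas", "maine"]
weights = find_weights(words, max_weight=4)
if weights is not None:
    for word in words:
        print(word, cichelli_hash(word, weights))
```

Exhaustive search of this kind only suits small, static word sets; for larger sets, ordering the words and characters well matters, which is the sort of problem weight-selection heuristics address.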

01 Jan 2009
TL;DR: This research introduces an original anomaly detection approach for application-level content, built on a sublexical unit hash model and motivated by the split fovea theory in human recognition; it is an advance over previous arbitrarily defined payload keyword and 1-gram frequency analysis approaches.
Abstract: This research introduces an original anomaly detection approach based on a sublexical unit hash model for application-level content. This approach is an advance over previous arbitrarily defined payload keyword and 1-gram frequency analysis approaches. Based on the split fovea theory in human recognition, this new approach uses a special hash function to identify groups of neighboring words. The hash frequency distribution is calculated to build the profile for a specific content type. Examples of utilizing the algorithm for detecting spam and phishing emails are illustrated in this dissertation. A brief review of network intrusion and anomaly detection is presented first, followed by a discussion of recent research initiatives on application-level anomaly detection. Previous research results for payload keyword and byte frequency based anomaly detection are also presented. The drawback of N-gram analysis, which has been applied in most related research efforts, is discussed at the end of chapter 2. The importance of text content analysis to application-level anomaly detection is also explained. After a background introduction to the split fovea theory in psychological research, the proposed method based on the sublexical unit hash frequency distribution is presented. How human recognition theory serves as the fundamental element of the proposed hashing algorithm is examined, followed by a demonstration of how the hashing algorithm is applied to anomaly detection. Spam email is used as the major example in this discussion. Spam and phishing emails are used in the experiments because detailed experimental data are available and an in-depth analysis of the test data is possible. A comparison between the proposed algorithm and several popular commercial spam email filters used by Google and Yahoo is also presented. The outcome shows the benefits of the proposed approach. The last chapter reviews the research, explains how the previous payload keyword approach evolved into the hash model solution, and discusses the possibility of extending hash model based anomaly detection to other areas, including Unicode applications.

4 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
84% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Feature (computer vision)
128.2K papers, 1.7M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Support vector machine
73.6K papers, 1.7M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    33
2022    89
2021    11
2020    16
2019    16
2018    38