Topic

Feature hashing

About: Feature hashing (the "hashing trick") is a technique that vectorizes features by hashing them directly to indices in a fixed-size vector, avoiding the need for a precomputed dictionary. Over the lifetime, 993 publications have been published within this topic, receiving 51,462 citations.


Papers
Patent
18 Apr 2019
TL;DR: In this paper, the authors make novel use of random data structures to facilitate streaming inference for a Latent Dirichlet Allocation (LDA) model, using a count-min sketch to track sufficient statistics for the inference procedure.
Abstract: Embodiments make novel use of random data structures to facilitate streaming inference for a Latent Dirichlet Allocation (LDA) model. Utilizing random data structures facilitates streaming inference by entirely avoiding the need for pre-computation, which is generally an obstacle to many current "streaming" variants of LDA. Specifically, streaming inference, based on an inference algorithm such as Stochastic Cellular Automata (SCA), Gibbs sampling, and/or Stochastic Expectation Maximization (SEM), is implemented using a count-min sketch to track sufficient statistics for the inference procedure. Use of a count-min sketch avoids the need to know the vocabulary size V a priori. Also, use of a count-min sketch directly enables feature hashing, which addresses the problem of effectively encoding words into indices without the need for pre-computation. Approximate counters are also used within the count-min sketch to avoid bit overflow issues with the counts in the sketch.
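The count-min sketch that the patent builds on can be sketched in a few lines. This is a generic illustration, not the patent's implementation: the width and depth are illustrative, and the md5-based row hashing stands in for whatever hash family a real system would use. Note how tokens are hashed straight to column indices with no vocabulary, which is the feature-hashing property the abstract highlights.

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: a depth x width grid of counters, one hash
    function per row. Width and depth here are illustrative."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Hashing the raw token directly to a column is the
        # feature-hashing idea: no vocabulary is needed a priori.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Taking the minimum over rows bounds collision error: the
        # estimate is never below the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

Because every counter only ever overcounts, point queries never underestimate; widening the table reduces the overestimate, deepening it reduces the chance of a bad row.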

2 citations

Proceedings ArticleDOI
26 Mar 2012
TL;DR: This paper proposes to reduce remote accesses by assigning hash buckets smartly to the nodes: hash buckets from different hash tables that store the same points are placed on the same node, so that multiple buckets needed to process a query can be fetched with a single remote access.
Abstract: Locality-Sensitive Hashing (LSH) is a well-known approximate nearest-neighbor search algorithm for high-dimensional data. Though LSH searches nearest-neighbor points for a query very fast, LSH has a drawback that the space complexity is very large. For this reason, so as to apply LSH to a large dataset, it is crucial to implement LSH in distributed environments which consist of multiple nodes. One simple and natural method to implement LSH in the distributed environment is to have every node keep the same number of hash tables. However, this method increases remote accesses, because many nodes are accessed to access all the hash tables. Thus, this simple method will suffer from the long query response time, if the communication delay is the bottleneck. This paper proposes to reduce remote accesses by assigning hash buckets smartly to the nodes. In particular, our method assigns hash buckets from different hash tables to the same node, if the hash buckets store the same points. Due to this strategy, our method can access multiple hash buckets that should be accessed in processing a query with a single remote access, thereby decreasing remote accesses.
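As a rough illustration of the multi-table LSH scheme the paper starts from, here is a minimal random-hyperplane (cosine-similarity) LSH in Python. The function names, k, and L are illustrative assumptions, and the paper's bucket-placement strategy itself is not reproduced; the point is that a query touches one bucket per table, and each of those L lookups can be a remote access when tables are spread across nodes.

```python
import random

random.seed(0)  # illustrative; a real system persists its plane sets

def random_hyperplanes(dim, k):
    """k random Gaussian hyperplanes for cosine-similarity LSH."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def lsh_key(point, planes):
    # One bit per hyperplane: the sign of the dot product. Nearby
    # points tend to share bits and hence land in the same bucket.
    return tuple(1 if sum(x * w for x, w in zip(point, plane)) >= 0 else 0
                 for plane in planes)

def build_tables(points, dim, k=8, L=4):
    """L independent hash tables, each keyed by a k-bit LSH signature."""
    tables = []
    for _ in range(L):
        planes = random_hyperplanes(dim, k)
        buckets = {}
        for i, p in enumerate(points):
            buckets.setdefault(lsh_key(p, planes), []).append(i)
        tables.append((planes, buckets))
    return tables

def query(q, tables):
    # Candidate set: union of the query's bucket in every table. When
    # tables live on different nodes, each of these L lookups can be a
    # remote access -- the cost the paper's bucket placement reduces.
    candidates = set()
    for planes, buckets in tables:
        candidates.update(buckets.get(lsh_key(q, planes), []))
    return candidates
```

Co-locating buckets (from different tables) that hold the same points lets several of those per-table lookups be served by one node in one round trip.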

2 citations

Proceedings ArticleDOI
06 Jul 2011
TL;DR: This work compares several feature selection strategies on different writing contents and demonstrates their effectiveness in experimental evaluation, showing that the best feature selection strategy improves reproduction/collision rates, on average, to approx. 40%.
Abstract: Biometric hashing has the objective to robustly generate stable values from the variable biometric data of each particular user while at the same time generating different values for different users. The quality of hash generation is therefore determined by reproduction and collision rates, which represent the probabilities of hash reproduction in genuine and impostor trials, respectively. In our work, hash vectors are created from a statistical feature set extracted from dynamic handwritten data. Since the choice of features was made rather intuitively, it can be observed that some features have very high intra-class variance and cannot be reproduced for some users, while other features have very low inter-class variance and are always reproduced in impostor trials. Thus, feature selection is required to eliminate all irrelevant features and to allow reliable hash generation. This work compares several feature selection strategies on different writing contents and demonstrates their effectiveness in experimental evaluation. Our experiments show that the best feature selection strategy improves reproduction/collision rates, on average, to approx. 40%. This makes robust biometric hash generation practical, with a reproduction rate of 93.40% and a collision rate of 6.67%.
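A selection criterion of the kind the abstract describes, rewarding low intra-class and high inter-class variance, can be sketched as a Fisher-style score. This is a hypothetical helper for illustration, not one of the paper's actual selection strategies:

```python
def fisher_score(values_by_class):
    """Fisher-style score for one feature: between-class variance over
    within-class variance. Higher means more discriminative."""
    sizes = {c: len(v) for c, v in values_by_class.items()}
    means = {c: sum(v) / len(v) for c, v in values_by_class.items()}
    total = sum(sizes.values())
    overall = sum(sum(v) for v in values_by_class.values()) / total
    between = sum(n * (means[c] - overall) ** 2
                  for c, n in sizes.items())
    within = sum((x - means[c]) ** 2
                 for c, v in values_by_class.items() for x in v)
    return between / within if within else float("inf")
```

Features whose values cluster tightly within each user but differ across users score high; features that vary wildly within a user (hurting reproduction) or overlap across users (causing collisions) score low and would be dropped.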

2 citations

Proceedings ArticleDOI
03 Feb 1987
TL;DR: A straightforward modification of linear hashing is presented which, according to experimental results, significantly reduces the average number of retrieval probes in almost all cases when compared with standard linear hashing.
Abstract: Linear hashing is a technique for constructing dynamic files for direct access. It has an advantage over other dynamic methods in that it lacks a directory. However, a weakness of the method is that at high packing factors it requires more probes on average to access a record than do many of the static methods. This paper presents a straightforward modification of linear hashing which, according to experimental results, significantly reduces the average number of retrieval probes in almost all cases when compared with standard linear hashing. The overflow page size is an important parameter for adjusting performance: by choosing an appropriate overflow page size, the user may obtain results that are comparable or superior to those of other variants of linear hashing. In addition, the paper analyzes the effects of varying the primary page size, the overflow page size, and the packing factor on retrieval performance.
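The base scheme being modified can be sketched as follows: a minimal in-memory linear hashing table, assuming illustrative page-size and load-threshold values. Overflow chaining is implicit in the unbounded Python lists here, so the page size only drives the split trigger; the paper's overflow-page tuning is not modeled.

```python
class LinearHash:
    """Minimal linear hashing: buckets split one at a time in
    round-robin order, so the file grows without a directory."""

    def __init__(self, page_size=4, max_load=0.8):
        self.page_size = page_size
        self.max_load = max_load
        self.initial = 2      # buckets at level 0
        self.level = 0        # current doubling round
        self.next_split = 0   # next bucket to split this round
        self.buckets = [[] for _ in range(self.initial)]
        self.count = 0

    def _bucket(self, key):
        h = hash(key)
        b = h % (self.initial * 2 ** self.level)
        if b < self.next_split:  # bucket already split this round
            b = h % (self.initial * 2 ** (self.level + 1))
        return b

    def insert(self, key):
        self.buckets[self._bucket(key)].append(key)
        self.count += 1
        # Split when the overall packing factor exceeds the threshold.
        if self.count / (len(self.buckets) * self.page_size) > self.max_load:
            self._split()

    def _split(self):
        old = self.next_split
        new_mod = self.initial * 2 ** (self.level + 1)
        self.buckets.append([])
        self.next_split += 1
        if self.next_split == self.initial * 2 ** self.level:
            self.level += 1   # round complete: table has doubled
            self.next_split = 0
        keys, self.buckets[old] = self.buckets[old], []
        for k in keys:        # redistribute between old and new bucket
            self.buckets[hash(k) % new_mod].append(k)

    def lookup(self, key):
        return key in self.buckets[self._bucket(key)]
```

At high packing factors, records pile into overflow chains and retrieval costs extra probes, which is exactly the regime the paper's modification targets.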

2 citations

Posted Content
07 Nov 2017
TL;DR: A new sub-linear space data structure is introduced that captures the most heavily weighted features in linear classifiers trained over data streams, enabling memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information.
Abstract: We introduce a new sub-linear space data structure, the Weight-Median Sketch, that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.
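The feature-hashing baseline this paper compares against, with the Count-Sketch-style signed hash, can be sketched in a few lines. The 16-slot vector and zlib.crc32 are illustrative choices, not anything from the paper:

```python
import zlib

def hash_features(tokens, dim=16):
    """The hashing trick: map raw tokens into a fixed-length vector
    with no precomputed vocabulary. A second hash bit supplies a +/-1
    sign, as in Count-Sketch-style structures, so hash collisions
    cancel in expectation instead of always adding up."""
    vec = [0.0] * dim
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        idx = h % dim                      # which slot the token lands in
        sign = 1.0 if ((h >> 16) & 1) == 0 else -1.0
        vec[idx] += sign
    return vec
```

Where plain feature hashing sketches token occurrences this way, the Weight-Median Sketch applies the same hash-and-sign machinery to gradient updates of a linear model's weights.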

2 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
84% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Feature (computer vision)
128.2K papers, 1.7M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Support vector machine
73.6K papers, 1.7M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year	Papers
2023	33
2022	89
2021	11
2020	16
2019	16
2018	38