Topic

Feature hashing

About: Feature hashing (the "hashing trick") is a technique that vectorizes features by hashing them directly to indices in a fixed-size vector, avoiding the need for a precomputed dictionary. Over the lifetime, 993 publications have been published within this topic, receiving 51,462 citations.


Papers
Patent
18 Apr 2019
TL;DR: In this paper, the authors make novel use of random data structures to facilitate streaming inference for a Latent Dirichlet Allocation (LDA) model, using a count-min sketch to track sufficient statistics for the inference procedure.
Abstract: Embodiments make novel use of random data structures to facilitate streaming inference for a Latent Dirichlet Allocation (LDA) model. Utilizing random data structures facilitates streaming inference by entirely avoiding the need for pre-computation, which is generally an obstacle to many current "streaming" variants of LDA. Specifically, streaming inference, based on an inference algorithm such as Stochastic Cellular Automata (SCA), Gibbs sampling, and/or Stochastic Expectation Maximization (SEM), is implemented using a count-min sketch to track sufficient statistics for the inference procedure. Use of a count-min sketch avoids the need to know the vocabulary size V a priori. Also, use of a count-min sketch directly enables feature hashing, which addresses the problem of effectively encoding words into indices without the need for pre-computation. Approximate counters are also used within the count-min sketch to avoid bit overflow issues with the counts in the sketch.
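The count-min sketch that the patent builds on can be sketched in a few lines. This is a generic illustration, not the patent's implementation: the width and depth are illustrative, and the md5-based row hashing stands in for whatever hash family a real system would use. Note how tokens are hashed straight to column indices with no vocabulary, which is the feature-hashing property the abstract highlights.

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: a depth x width grid of counters, one hash
    function per row. Width and depth here are illustrative."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Hashing the raw token directly to a column is the
        # feature-hashing idea: no vocabulary is needed a priori.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Taking the minimum over rows bounds collision error: the
        # estimate is never below the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

Because every counter only ever overcounts, point queries never underestimate; widening the table reduces the overestimate, deepening it reduces the chance of a bad row.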

2 citations

Proceedings ArticleDOI
26 Mar 2012
TL;DR: This paper proposes to reduce remote accesses by assigning hash buckets smartly to the nodes: hash buckets from different hash tables that store the same points are placed on the same node, so that multiple buckets needed to process a query can be fetched with a single remote access.
Abstract: Locality-Sensitive Hashing (LSH) is a well-known approximate nearest-neighbor search algorithm for high-dimensional data. Though LSH searches nearest-neighbor points for a query very fast, LSH has a drawback that the space complexity is very large. For this reason, so as to apply LSH to a large dataset, it is crucial to implement LSH in distributed environments which consist of multiple nodes. One simple and natural method to implement LSH in the distributed environment is to have every node keep the same number of hash tables. However, this method increases remote accesses, because many nodes are accessed to access all the hash tables. Thus, this simple method will suffer from the long query response time, if the communication delay is the bottleneck. This paper proposes to reduce remote accesses by assigning hash buckets smartly to the nodes. In particular, our method assigns hash buckets from different hash tables to the same node, if the hash buckets store the same points. Due to this strategy, our method can access multiple hash buckets that should be accessed in processing a query with a single remote access, thereby decreasing remote accesses.
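As a rough illustration of the multi-table LSH scheme the paper starts from, here is a minimal random-hyperplane (cosine-similarity) LSH in Python. The function names, k, and L are illustrative assumptions, and the paper's bucket-placement strategy itself is not reproduced; the point is that a query touches one bucket per table, and each of those L lookups can be a remote access when tables are spread across nodes.

```python
import random

random.seed(0)  # illustrative; a real system persists its plane sets

def random_hyperplanes(dim, k):
    """k random Gaussian hyperplanes for cosine-similarity LSH."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def lsh_key(point, planes):
    # One bit per hyperplane: the sign of the dot product. Nearby
    # points tend to share bits and hence land in the same bucket.
    return tuple(1 if sum(x * w for x, w in zip(point, plane)) >= 0 else 0
                 for plane in planes)

def build_tables(points, dim, k=8, L=4):
    """L independent hash tables, each keyed by a k-bit LSH signature."""
    tables = []
    for _ in range(L):
        planes = random_hyperplanes(dim, k)
        buckets = {}
        for i, p in enumerate(points):
            buckets.setdefault(lsh_key(p, planes), []).append(i)
        tables.append((planes, buckets))
    return tables

def query(q, tables):
    # Candidate set: union of the query's bucket in every table. When
    # tables live on different nodes, each of these L lookups can be a
    # remote access -- the cost the paper's bucket placement reduces.
    candidates = set()
    for planes, buckets in tables:
        candidates.update(buckets.get(lsh_key(q, planes), []))
    return candidates
```

Co-locating buckets (from different tables) that hold the same points lets several of those per-table lookups be served by one node in one round trip.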

2 citations

Proceedings ArticleDOI
06 Jul 2011
TL;DR: This work compares several feature selection strategies on different writing contents and demonstrates their effectiveness in experimental evaluation, showing that the best feature selection strategy improves reproduction/collision rates, on average, to approx. 40%.
Abstract: Biometric hashing has the objective to robustly generate stable values from the variable biometric data of each particular user while at the same time generating different values for different users. The quality of hash generation is therefore determined by reproduction and collision rates, which represent the probabilities of hash reproduction in genuine and impostor trials, respectively. In our work, hash vectors are created from a statistical feature set extracted from dynamic handwritten data. Since the choice of features was made rather intuitively, it can be observed that some features have very high intra-class variance and cannot be reproduced for some users, while other features have very low inter-class variance and are always reproduced in impostor trials. Thus, feature selection is required to eliminate all irrelevant features and to allow reliable hash generation. This work compares several feature selection strategies on different writing contents and demonstrates their effectiveness in experimental evaluation. Our experiments show that the best feature selection strategy improves reproduction/collision rates, on average, to approx. 40%. This makes robust biometric hash generation practical, with a reproduction rate of 93.40% and a collision rate of 6.67%.
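A selection criterion of the kind the abstract describes, rewarding low intra-class and high inter-class variance, can be sketched as a Fisher-style score. This is a hypothetical helper for illustration, not one of the paper's actual selection strategies:

```python
def fisher_score(values_by_class):
    """Fisher-style score for one feature: between-class variance over
    within-class variance. Higher means more discriminative."""
    sizes = {c: len(v) for c, v in values_by_class.items()}
    means = {c: sum(v) / len(v) for c, v in values_by_class.items()}
    total = sum(sizes.values())
    overall = sum(sum(v) for v in values_by_class.values()) / total
    between = sum(n * (means[c] - overall) ** 2
                  for c, n in sizes.items())
    within = sum((x - means[c]) ** 2
                 for c, v in values_by_class.items() for x in v)
    return between / within if within else float("inf")
```

Features whose values cluster tightly within each user but differ across users score high; features that vary wildly within a user (hurting reproduction) or overlap across users (causing collisions) score low and would be dropped.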

2 citations

Proceedings ArticleDOI
03 Feb 1987
TL;DR: A straightforward modification of linear hashing is presented which, according to experimental results, significantly reduces the average number of retrieval probes in almost all cases when compared with standard linear hashing.
Abstract: Linear hashing is a technique for constructing dynamic files for direct access. It has an advantage over other dynamic methods in that it lacks a directory. However, a weakness of the method is that at high packing factors it requires more probes on average to access a record than do many of the static methods. This paper presents a straightforward modification of linear hashing which, according to experimental results, significantly reduces the average number of retrieval probes in almost all cases when compared with standard linear hashing. The overflow page size is an important parameter for adjusting performance: by choosing an appropriate overflow page size, the user may obtain results that are comparable or superior to those of other variants of linear hashing. In addition, the paper analyzes the effects of varying the primary page size, the overflow page size, and the packing factor on retrieval performance.
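The base scheme being modified can be sketched as follows: a minimal in-memory linear hashing table, assuming illustrative page-size and load-threshold values. Overflow chaining is implicit in the unbounded Python lists here, so the page size only drives the split trigger; the paper's overflow-page tuning is not modeled.

```python
class LinearHash:
    """Minimal linear hashing: buckets split one at a time in
    round-robin order, so the file grows without a directory."""

    def __init__(self, page_size=4, max_load=0.8):
        self.page_size = page_size
        self.max_load = max_load
        self.initial = 2      # buckets at level 0
        self.level = 0        # current doubling round
        self.next_split = 0   # next bucket to split this round
        self.buckets = [[] for _ in range(self.initial)]
        self.count = 0

    def _bucket(self, key):
        h = hash(key)
        b = h % (self.initial * 2 ** self.level)
        if b < self.next_split:  # bucket already split this round
            b = h % (self.initial * 2 ** (self.level + 1))
        return b

    def insert(self, key):
        self.buckets[self._bucket(key)].append(key)
        self.count += 1
        # Split when the overall packing factor exceeds the threshold.
        if self.count / (len(self.buckets) * self.page_size) > self.max_load:
            self._split()

    def _split(self):
        old = self.next_split
        new_mod = self.initial * 2 ** (self.level + 1)
        self.buckets.append([])
        self.next_split += 1
        if self.next_split == self.initial * 2 ** self.level:
            self.level += 1   # round complete: table has doubled
            self.next_split = 0
        keys, self.buckets[old] = self.buckets[old], []
        for k in keys:        # redistribute between old and new bucket
            self.buckets[hash(k) % new_mod].append(k)

    def lookup(self, key):
        return key in self.buckets[self._bucket(key)]
```

At high packing factors, records pile into overflow chains and retrieval costs extra probes, which is exactly the regime the paper's modification targets.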

2 citations

Posted Content
07 Nov 2017
TL;DR: A new sub-linear space data structure is introduced that captures the most heavily weighted features in linear classifiers trained over data streams, enabling memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information.
Abstract: We introduce a new sub-linear space data structure, the Weight-Median Sketch, that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.
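The feature-hashing baseline this paper compares against, with the Count-Sketch-style signed hash, can be sketched in a few lines. The 16-slot vector and zlib.crc32 are illustrative choices, not anything from the paper:

```python
import zlib

def hash_features(tokens, dim=16):
    """The hashing trick: map raw tokens into a fixed-length vector
    with no precomputed vocabulary. A second hash bit supplies a +/-1
    sign, as in Count-Sketch-style structures, so hash collisions
    cancel in expectation instead of always adding up."""
    vec = [0.0] * dim
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        idx = h % dim                      # which slot the token lands in
        sign = 1.0 if ((h >> 16) & 1) == 0 else -1.0
        vec[idx] += sign
    return vec
```

Where plain feature hashing sketches token occurrences this way, the Weight-Median Sketch applies the same hash-and-sign machinery to gradient updates of a linear model's weights.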

2 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
84% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Feature (computer vision)
128.2K papers, 1.7M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Support vector machine
73.6K papers, 1.7M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year	Papers
2023	33
2022	89
2021	11
2020	16
2019	16
2018	38