Showing papers on "Feature hashing" published in 2020


Proceedings ArticleDOI
23 Aug 2020
TL;DR: This work proposes a novel approach for reducing the embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without explicit definition.
Abstract: Modern deep learning-based recommendation systems exploit hundreds to thousands of different categorical features, each with millions of different categories ranging from clicks to posts. To respect the natural diversity within the categorical data, embeddings map each category to a unique dense representation within an embedded space. Since each categorical feature could take on as many as tens of millions of different possible categories, the embedding tables form the primary memory bottleneck during both training and inference. We propose a novel approach for reducing the embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without explicit definition. By storing multiple smaller embedding tables based on each complementary partition and combining embeddings from each table, we define a unique embedding for each category at smaller cost. This approach may be interpreted as using a specific fixed codebook to ensure uniqueness of each category's representation. Our experimental results demonstrate the effectiveness of our approach over the hashing trick for reducing the size of the embedding tables in terms of model loss and accuracy, while retaining a similar reduction in the number of parameters.
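
For intuition, here is a minimal sketch of the complementary-partition idea using one quotient-remainder pair of tables; the table sizes, random initialization, and the element-wise-product combiner are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

# Minimal sketch of a compositional embedding built from one complementary
# partition pair: a "quotient" table and a "remainder" table whose rows are
# combined into a unique vector per category.
num_categories, dim = 10_000_000, 16
m = int(np.ceil(np.sqrt(num_categories)))            # rows in the remainder table

rng = np.random.default_rng(0)
quotient_table = rng.normal(size=(int(np.ceil(num_categories / m)), dim))
remainder_table = rng.normal(size=(m, dim))

def embed(category_id: int) -> np.ndarray:
    """Look up both partial embeddings and combine them element-wise."""
    return quotient_table[category_id // m] * remainder_table[category_id % m]

print(embed(1_234_567).shape)   # (16,), stored in roughly 2*sqrt(N) rows instead of N
```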

72 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This work proposes a novel, low-cost feature extraction approach and an effective deep neural network architecture for accurate and fast malware detection; valuable insights about feature engineering and architecture design are derived from the ablation study.
Abstract: Dynamic malware analysis executes the program in an isolated environment and monitors its run-time behaviour (e.g. system API calls) for malware detection. This technique has been proven to be effective against various code obfuscation techniques and newly released (“zero-day”) malware. However, existing works typically only consider the API name while ignoring the arguments, or require complex feature engineering operations and expert knowledge to process the arguments. In this paper, we propose a novel and low-cost feature extraction approach, and an effective deep neural network architecture for accurate and fast malware detection. Specifically, the feature representation approach utilizes a feature hashing trick to encode the API call arguments associated with the API name. The deep neural network architecture applies multiple Gated-CNNs (convolutional neural networks) to transform the extracted features of each API call. The outputs are further processed through a bidirectional LSTM (long short-term memory network) to learn the sequential correlation among API calls. Experiments show that our solution outperforms baselines significantly on a large real dataset. Valuable insights about feature engineering and architecture design are derived from the ablation study.
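
As a rough illustration of the feature-representation step, the sketch below hashes an API name together with its arguments into a fixed-length count vector; the hash function, bucket count, and name-argument pairing are assumptions for illustration, not the paper's exact encoding.

```python
import hashlib
import numpy as np

def hash_api_call(api_name: str, args: list, n_buckets: int = 128) -> np.ndarray:
    """Encode an API call as a fixed-length count vector via the hashing trick:
    the name and each (name, argument) pair are hashed into n_buckets bins."""
    vec = np.zeros(n_buckets, dtype=np.float32)
    for token in [api_name] + [f"{api_name}|{a}" for a in map(str, args)]:
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % n_buckets] += 1.0
    return vec

x = hash_api_call("CreateFileW", ["C:\\temp\\payload.exe", "GENERIC_WRITE"])
print(x.shape, int(x.sum()))    # (128,) 3: one count per hashed token
```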

57 citations


Journal ArticleDOI
TL;DR: The proposed DSMHN method is a generic and scalable deep hashing framework for both image-text and video-text retrieval, which can flexibly integrate different types of loss functions simultaneously.
Abstract: Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages a 2-D convolutional neural network (CNN) as the backbone network to capture spatial information for image-text retrieval, and a 3-D CNN as the backbone network to capture spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, with the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels on the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by exploiting feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions simultaneously. The proposed DSMHN method is a generic and scalable deep hashing framework for both image-text and video-text retrieval, which can be flexibly integrated with different types of loss functions. We conduct extensive experiments for both single-modal- and cross-modal-retrieval tasks on four widely used multimodal-retrieval data sets. Experimental results on both image-text- and video-text-retrieval tasks demonstrate that the DSMHN significantly outperforms the state-of-the-art methods.

53 citations


Journal ArticleDOI
TL;DR: A novel end-to-end Deep Cross-Modal Hashing with Hashing Functions and Unified Hash Codes Jointly Learning (DCHUC) is proposed, which outperforms the state-of-the-art cross-modal hashing methods.
Abstract: Due to their high retrieval efficiency and low storage cost, cross-modal hashing methods have attracted considerable attention. Generally, compared with shallow cross-modal hashing methods, deep cross-modal hashing methods can achieve more satisfactory performance by integrating feature learning and hash-code optimization into the same framework. However, most existing deep cross-modal hashing methods either cannot learn a unified hash code for the two correlated data points of different modalities in a database instance, or cannot use feedback from the hash-function learning procedure to guide the learning of unified hash codes, which limits retrieval accuracy. To address these issues, in this paper we propose a novel end-to-end method, Deep Cross-Modal Hashing with Hashing Functions and Unified Hash Codes Jointly Learning (DCHUC). Specifically, through an iterative optimization algorithm, DCHUC jointly learns unified hash codes for image-text pairs in a database and a pair of hash functions for unseen query image-text pairs. With this algorithm, the learned unified hash codes can be used to guide the hash-function learning procedure; meanwhile, the learned hash functions can provide feedback to guide the optimization of the unified hash codes. Extensive experiments on three public datasets demonstrate that the proposed method outperforms state-of-the-art cross-modal hashing methods.

26 citations


Dissertation
05 Jun 2020
TL;DR: This thesis aims to improve the performance (in terms of memory and time) of existing data mining algorithms on streams by incorporating an internal preprocessing step that incrementally reduces the dimensionality of input data before feeding it to the learning stage.
Abstract: With the evolution of technology, the use of smart Internet-of-Things (IoT) devices, sensors, and social networks results in an overwhelming volume of IoT data streams, generated daily from several applications, that can be transformed into valuable information through machine learning tasks. In practice, multiple critical issues arise when extracting useful knowledge from these evolving data streams, chiefly that the stream needs to be handled and processed efficiently. In this context, this thesis aims to improve the performance (in terms of memory and time) of existing data mining algorithms on streams. We focus on the classification task in the streaming framework. The task is challenging on streams, principally due to the high -- and increasing -- data dimensionality, in addition to the potentially infinite amount of data; both aspects make classification harder. The first part of the thesis surveys the current state of the art in classification and dimensionality reduction techniques as applied to the stream setting, providing an updated view of the most recent works in this vibrant area. In the second part, we detail our contributions to the field of classification in streams, developing novel approaches based on summarization techniques that aim to reduce the computational resources of existing classifiers with no -- or minor -- loss of classification accuracy. To address high-dimensional data streams and make classifiers efficient, we incorporate an internal preprocessing step that incrementally reduces the dimensionality of input data before feeding it to the learning stage. We present several approaches applied to several classification tasks: Naive Bayes, enhanced with sketches and the hashing trick; k-NN, using compressed sensing and UMAP; and we also integrate them into ensemble methods.
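
A rough illustration of this preprocessing-before-learning idea, assuming scikit-learn's HashingVectorizer and MultinomialNB as stand-ins for the thesis's summarization techniques and classifiers:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Each incoming mini-batch is reduced to a fixed 2**12-dimensional space with
# the hashing trick (stateless, constant memory) before updating an online
# Naive Bayes model. The toy stream and dimensions are illustrative.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = MultinomialNB()

stream = [(["free money now", "meeting at noon"], [1, 0]),
          (["cheap pills online", "project status update"], [1, 0])]

for texts, labels in stream:                       # one mini-batch at a time
    X = vectorizer.transform(texts)                # hashing-trick reduction
    clf.partial_fit(X, labels, classes=[0, 1])     # incremental learning

print(clf.predict(vectorizer.transform(["free cheap pills"])))
```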

5 citations


Posted Content
21 Oct 2020
TL;DR: This paper proposes an alternative embedding framework Deep Hash Embedding (DHE), with non-one-hot encodings and a deep neural network (embedding network) to compute embeddings on the fly without having to store them.
Abstract: Embedding learning for large-vocabulary categorical features (e.g. user/item IDs and words) is crucial for deep learning, especially for neural models in recommendation systems and natural language understanding tasks. Typically, the model creates a huge embedding table in which each row represents a dedicated embedding vector for a feature value. In practice, to handle new (i.e., out-of-vocab) feature values and reduce the storage cost, the hashing trick is often adopted, which randomly maps feature values to a smaller number of hashing buckets. Essentially, these embedding methods can be viewed as 1-layer wide neural networks with one-hot encodings. In this paper, we propose an alternative embedding framework, Deep Hash Embedding (DHE), with non-one-hot encodings and a deep neural network (embedding network) to compute embeddings on the fly without having to store them. DHE first encodes the feature value to a dense vector with multiple hashing functions and then applies a DNN to generate the embedding. DHE is collision-free, as the dense hashing encodings are unique identifiers for both in-vocab and out-of-vocab feature values. The encoding module is deterministic, non-learnable, and storage-free, while the embedding network is updated during training to memorize embedding information. Empirical results show that DHE outperforms state-of-the-art hashing-based embedding learning algorithms and achieves AUC comparable to standard one-hot encoding, with significantly smaller model sizes. Our work sheds light on the design of DNN-based alternative embedding schemes for categorical features that do not use an embedding table lookup.
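
A minimal sketch of the two DHE stages, with the number of hash functions, the normalization, and the network sizes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, dim = 32, 10**6, 16                # number of hashes, hash range, output size

def dense_hash_encoding(feature_id: int) -> np.ndarray:
    """Deterministic, non-learnable encoding: k hash values scaled to [-1, 1]."""
    hashes = np.array([hash((feature_id, i)) % m for i in range(k)])
    return hashes / (m - 1) * 2.0 - 1.0

# Tiny stand-in for the learnable embedding network (one hidden ReLU layer).
W1, b1 = 0.1 * rng.normal(size=(k, 64)), np.zeros(64)
W2, b2 = 0.1 * rng.normal(size=(64, dim)), np.zeros(dim)

def dhe_embedding(feature_id: int) -> np.ndarray:
    h = np.maximum(dense_hash_encoding(feature_id) @ W1 + b1, 0.0)
    return h @ W2 + b2

print(dhe_embedding(42).shape)           # (16,), computed on the fly, no table stored
```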

4 citations


Patent
14 Apr 2020
TL;DR: In this patent, feature hashing is used to detect malware: features in a feature set generated from a sample (which includes at least a portion of a file) are hashed, and the hashed features are indexed to generate an index vector in which each hashed feature corresponds to an index.
Abstract: Data is analyzed using feature hashing to detect malware. A plurality of features in a feature set is hashed. The feature set is generated from a sample. The sample includes at least a portion of a file. Based on the hashing, one or more hashed features are indexed to generate an index vector. Each hashed feature corresponds to an index in the index vector. Using the index vector, a training dataset is generated. Using the training dataset, a machine learning model for identifying at least one file having malicious code is trained.

3 citations


Book ChapterDOI
01 Jan 2020
TL;DR: This chapter presents the vector space model and some ways to further process such a representation of semantic data, and points to a few of the methods and datasets used to evaluate the many different algorithms that create a semantic representation.
Abstract: In this chapter, we present the vector space model and some ways to further process such a representation: With feature hashing, random indexing, latent semantic analysis, non-negative matrix factorization, explicit semantic analysis and word embedding, a word or a text may be associated with a distributed semantic representation. Deep learning, explicit semantic networks and auxiliary non-linguistic information provide further means for creating distributed representations from linguistic data. We point to a few of the methods and datasets used to evaluate the many different algorithms that create a semantic representation, and we also point to some of the problems associated with distributed representations.
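
As a concrete example of the first of these techniques, a hashed bag-of-words vector can be built as follows; the dimensionality, hash function, and signed-hashing variant are illustrative choices rather than a prescription from the chapter.

```python
import hashlib
import numpy as np

def hashed_bow(text: str, d: int = 64) -> np.ndarray:
    """Feature hashing of a bag of words: each word lands in one of d dimensions,
    with a second hash bit giving a +/-1 sign to reduce collision bias."""
    v = np.zeros(d)
    for word in text.lower().split():
        h = int(hashlib.sha1(word.encode()).hexdigest(), 16)
        v[h % d] += 1.0 if (h >> 64) % 2 == 0 else -1.0
    return v

doc1, doc2 = hashed_bow("the cat sat on the mat"), hashed_bow("the cat sat on a mat")
cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(float(cosine), 2))   # high cosine similarity for near-identical texts
```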

3 citations


Journal ArticleDOI
TL;DR: This work develops an effective learning-based hashing model, local feature hashing with binary auto-encoder (LFH-BAE), that directly learns local binary descriptors in the Hamming space so that the face image can be well reconstructed from the binary codes.
Abstract: Learning-based hashing has recently made encouraging progress in face recognition. However, most existing hashing methods disregard the discrete constraint during optimization, inducing accumulated quantization errors. In this work, we develop an effective learning-based hashing model, namely local feature hashing with binary auto-encoder (LFH-BAE), to directly learn local binary descriptors in the Hamming space. It exploits structure factors so that the face image can be well reconstructed from the binary codes. Specifically, we first introduce a binary auto-encoder to learn a hashing function that projects each face region into high-quality binary codes. Since the original problem is an intractable combinatorial one, we then present a softened version that decomposes it into separate tractable sub-problems. Next, we propose an effective alternating algorithm based on the augmented Lagrange method (ALM) to solve these sub-problems, which helps to generate highly discriminative and robust binary codes. Moreover, we utilize the discrete cyclic coordinate descent (DCC) method to optimize the binary codes and reduce the loss of useful information. Lastly, we cluster and pool the obtained binary codes, and construct a histogram feature as the final face representation for each image. Extensive experimental results on four public datasets including FERET, CAS-PEAL-R1, LFW and PaSC show that our LFH-BAE is superior to most state-of-the-art face recognition algorithms.

2 citations


Proceedings ArticleDOI
06 Jul 2020
TL;DR: This paper proposes dual-graph regularized robust hashing (DGRH), based on both manifold smoothness and robust estimators, which recovers a low-rank representation from corrupted data via an l1 loss while preserving neighborhood relationships among samples with dual-graph regularization.
Abstract: Unsupervised hashing easily deteriorates in the case of grossly corrupted data. Motivated by robust optimization, this paper proposes dual-graph regularized robust hashing (DGRH), based on both manifold smoothness and robust estimators, in a more intuitive manner. Orthogonal to existing robust hashing methods, DGRH directly removes the outliers of a dataset with an M-estimator to achieve robustness. Specifically, it recovers a low-rank representation from corrupted data via an l1 loss while preserving neighborhood relationships among samples with dual-graph regularization. Although DGRH may seem a simple extension of robust PCA on graphs with the hashing trick, it is easy to implement yet effective. A theoretical analysis is provided to support our claims. Image-retrieval experiments on three popular benchmark datasets show the efficacy of DGRH compared with several well-performing representative counterparts.

2 citations


Proceedings ArticleDOI
23 Apr 2020
TL;DR: This paper proposes a method to classify sentiment polarities and mine customer opinions in order to recommend product selections to each user in online markets.
Abstract: Sentiment analysis and classification on social networks have become popular in recent years. Companies have realized the value of big data in creating a competitive advantage and attracting more customers. User-generated content in online reviews for online shops and on social media provides a wealth of brand-related information for marketing. In this paper, we propose a method to classify sentiment polarities and mine customer opinions in order to recommend product selections to each user in online markets. Our qualitative and quantitative experiments show the usefulness of positive, neutral, and negative customer opinions for product recommendation in online markets. By considering different combinations of techniques such as feature hashing, bag of words, and lexicons, together with the extensive results described in the literature, we report the accuracy and precision of our method for online-market users.
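
A hedged sketch of how such feature combinations might be assembled, assuming a toy lexicon and scikit-learn's FeatureHasher and LogisticRegression rather than the paper's actual pipeline:

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Hashed bag-of-words features combined with a tiny lexicon score, then fed to
# a linear sentiment classifier. Lexicon, reviews, and labels are toy examples.
LEXICON = {"great": 1, "love": 1, "terrible": -1, "awful": -1}

def featurize(review: str) -> dict:
    words = review.lower().split()
    feats = {f"w={w}": 1.0 for w in words}                        # bag of words
    feats["lexicon_score"] = float(sum(LEXICON.get(w, 0) for w in words))
    return feats

reviews = ["great phone love it", "terrible battery awful screen",
           "love this product", "awful support terrible quality"]
labels = [1, 0, 1, 0]

hasher = FeatureHasher(n_features=2**10)
X = hasher.transform(featurize(r) for r in reviews)               # hashing trick
clf = LogisticRegression().fit(X, labels)
print(clf.predict(hasher.transform([featurize("love the great screen")])))
```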

Proceedings ArticleDOI
05 Jan 2020
TL;DR: This paper observes that using feature hashing on the Adult Dataset leads to a 5.4x improvement in the fairness metric score while losing 6.1% accuracy compared to when the data is used as is.
Abstract: Learning new representations of data to reduce correlation with sensitive attributes is one method of tackling algorithmic bias. In this paper, we explore the possibility of using feature hashing as a method for learning new representations of data for fair classification. Using Difference of Equal Odds as our fairness metric, we observe that using feature hashing on the Adult Dataset leads to a 5.4x improvement in the metric score while losing 6.1% accuracy compared to when the data is used as is.

Proceedings Article
25 Sep 2020
TL;DR: These algorithms are efficient, offer a compact sketch of the dataset, can be deployed efficiently in a distributed setting, and require significantly fewer random bits for sketching: logarithmic in the dimension, whereas other competitive algorithms require a number of bits linear in the dimension.
Abstract: We present sketching algorithms for sparse binary datasets that maintain a binary version of the dataset after sketching, while simultaneously preserving multiple similarity measures such as Jaccard similarity, cosine similarity, inner product, and Hamming distance on the same sketch. A major advantage of our algorithms is that they are randomness-efficient, requiring significantly fewer random bits for sketching: logarithmic in the dimension, whereas other competitive algorithms require a number of bits linear in the dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be deployed efficiently in a distributed setting. We present a theoretical analysis of our approach and complement it with extensive experiments on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset; we empirically verify the assumption and observe that it holds on several real-world datasets.
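
For flavour only, the simplified sketch below compresses a sparse binary vector into a much shorter binary sketch by hashing dimensions onto sketch bits; it is not the paper's construction and preserves similarities only loosely, but it shows what a binary-valued, hashing-based sketch looks like.

```python
import numpy as np

# Each of the d original dimensions is randomly assigned to one of k sketch bits,
# and a sketch bit is set if any of its assigned dimensions is set (a bitwise OR).
rng = np.random.default_rng(1)
d, k = 10_000, 256
assignment = rng.integers(0, k, size=d)        # dimension -> sketch-bit mapping

def sketch(x: np.ndarray) -> np.ndarray:
    s = np.zeros(k, dtype=np.uint8)
    s[assignment[x > 0]] = 1                   # OR together the mapped dimensions
    return s

x = np.zeros(d)
x[rng.choice(d, 50, replace=False)] = 1        # a sparse binary vector
y = x.copy()
y[rng.choice(d, 5, replace=False)] = 1         # a close neighbour of x
print(np.sum(sketch(x) != sketch(y)), "differing sketch bits for similar inputs")
```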

Book ChapterDOI
12 Oct 2020
TL;DR: A neural network model with automatically constructed features is used to select personalized advertisements for Internet users in a targeted advertising system; testing shows that the model performs on par with logistic regression using manually selected feature combinations for hashing.
Abstract: We consider the task of selecting personalized advertisements for Internet users in a targeted advertising system. This leads to a regression problem: for an arbitrary user U, we must predict the click probability for a set of banners B1…Bn in order to select the most suitable ones. The actual values of the predicted probabilities also matter, because they may be used in an auction between different advertising systems on many sites. Since the users’ interests and the set of banners change frequently, the model must be trained online. In addition, large advertising systems have to deal with a large amount of data that needs to be processed in real time, which limits the complexity of the applicable models. Therefore, linear models, which are well suited to dynamic learning, remain popular for this task. However, data are rarely linearly separable, so such models require derived features to be constructed, for example by hashing combinations of the original features. A serious drawback is that these combinations need to be selected manually. In this paper, we propose a neural network with a specialized architecture to avoid this problem. Special attention is paid to the analysis of the results on the test set, for which a specialized statistical testing technique is used. Testing showed that the neural network model with automatically constructed features performs on par with logistic regression using manually selected feature combinations for hashing.
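
A hedged sketch of the linear baseline described here, namely online logistic regression over hashed features with a manually chosen feature cross, assuming scikit-learn's FeatureHasher and SGDClassifier and invented feature names:

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Online logistic regression over hashed features, with one manually chosen
# feature cross (site x topic). Feature names and data are illustrative.
hasher = FeatureHasher(n_features=2**18, input_type="string")
model = SGDClassifier(loss="log_loss")                 # online logistic regression

def to_tokens(user: dict, banner: dict) -> list:
    tokens = [f"age={user['age_bucket']}", f"site={user['site']}",
              f"banner={banner['id']}", f"topic={banner['topic']}"]
    tokens.append(f"site_x_topic={user['site']}|{banner['topic']}")  # manual cross
    return tokens

batch = [({"age_bucket": "25-34", "site": "news"}, {"id": "b1", "topic": "cars"}, 1),
         ({"age_bucket": "18-24", "site": "games"}, {"id": "b2", "topic": "loans"}, 0)]

X = hasher.transform(to_tokens(u, b) for u, b, _ in batch)
model.partial_fit(X, [y for *_, y in batch], classes=[0, 1])   # one online update
print(model.predict_proba(X)[:, 1])                    # predicted click probabilities
```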

Posted Content
TL;DR: This work designs a highly scalable alternative approach, called Chromatic Learning (CL), that leverages the low degree of feature co-occurrence present in many practical settings and obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features.
Abstract: Learning over sparse, high-dimensional data frequently necessitates the use of specialized methods such as the hashing trick. In this work, we design a highly scalable alternative approach that leverages the low degree of feature co-occurrences present in many practical settings. This approach, which we call Chromatic Learning (CL), obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features---an approach previously used as a runtime performance optimization for GBDT training. This color-based dense representation can be combined with additional dense categorical encoding approaches, e.g., submodular feature compression, to further reduce dimensionality. CL exhibits linear parallelizability and consumes memory linear in the size of the co-occurrence graph. By leveraging the structural properties of the co-occurrence graph, CL can compress sparse datasets, such as KDD Cup 2012, that contain over 50M features down to 1024, using an order of magnitude fewer features than frequency-based truncation and the hashing trick while maintaining the same test error for linear models. This compression further enables the use of deep networks in this wide, sparse setting, where CL similarly has favorable performance compared to existing baselines for budgeted input dimension.
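
A toy illustration of the colour-based densification step, assuming networkx's greedy colouring and a handful of synthetic samples; the full CL pipeline, including submodular feature compression, is not reproduced here:

```python
import networkx as nx

# Build the feature co-occurrence graph, greedily colour it, and use the colour
# as the new dense column index, so features that never co-occur share a column.
samples = [{"f1", "f7"}, {"f2", "f7"}, {"f3"}, {"f1", "f4"}]

G = nx.Graph()
for feats in samples:
    G.add_nodes_from(feats)
    G.add_edges_from((a, b) for a in feats for b in feats if a < b)

coloring = nx.greedy_color(G)                  # feature -> colour (column index)
n_columns = max(coloring.values()) + 1

def densify(feats: set) -> list:
    row = [0.0] * n_columns
    for f in feats:
        row[coloring[f]] = 1.0
    return row

print(n_columns, [densify(s) for s in samples])
```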

Patent
03 Dec 2020
TL;DR: In this paper, a method of creating word-level differential privacy with the hashing trick is proposed to protect the confidentiality of textual data: a probability distribution is fitted to the weights of a received list of hashes, the list is extended with new hashes that lie within the range of allowable hash values but were not included, new and adjusted weights are generated by sampling from the distribution, and the list is updated with a new weight for every hash that is missing one and an adjusted weight if the hash is a newer version.
Abstract: A method of creating word-level differential privacy with the hashing trick to protect the confidentiality of textual data, preferably a corpus of textual data, the method comprising: receiving a list of a plurality of hashes with a weight (or weights) associated with each of the plurality of hashes 301; fitting a probability distribution to the list of weights of the plurality of hashes 302; updating the list with new hashes that are within the range of allowable hash values but not included in the received list of hashes 303; generating new weights and adjusted weights based on sampling of said probability distribution 304; and updating the list with the new weight for each of the plurality of hashes that is missing a weight, and with the adjusted weight if the hash is a newer version 305. The probability distribution can be continuous when the weights are continuous and discrete when the weights are discrete, and may also be multi-dimensional if there are multiple types of weights.
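
A hedged sketch of the completion step the abstract describes, assuming a one-dimensional normal fit and a small illustrative hash range (the patent leaves the distribution family and range open):

```python
import numpy as np

# Fit a distribution to the observed hash weights, then give every unseen hash in
# the allowable range a weight sampled from it, so padded hashes look like real ones.
rng = np.random.default_rng(0)
hash_range = 2**10                                     # illustrative allowable range
observed = {17: 0.8, 402: 1.3, 731: 0.4, 900: 1.1}     # hash -> weight

weights = list(observed.values())
mu, sigma = float(np.mean(weights)), float(np.std(weights))

completed = dict(observed)
for h in range(hash_range):
    if h not in completed:
        completed[h] = float(rng.normal(mu, sigma))    # sampled weight for a new hash

print(len(completed), completed[17])                   # 1024 entries, originals kept
```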