Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published on this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Journal Article
TL;DR: Content Tree Word Embedding is introduced to mitigate the risk of word ambiguity and inject a local context into globally pre-trained word vectors and shows an improvement in F-score and accuracy measures when using two deep learning-based word embedding approaches, namely GloVe and Word2Vec.
Abstract: Only humans can understand and comprehend the actual meaning that underlies natural written language, whereas machines can form semantic relationships only after humans have provided the parameters necessary to model that meaning. To enable computer models to access the underlying meaning in written language, accurate and sufficient document representation is crucial. Recently, word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is the use of global corpora for the generation of pre-trained word vectors. Although very effective, these approaches have their disadvantages. Relying only on pre-trained word vectors may neglect the local context and increase word ambiguity. In this study, a new approach, Content Tree Word Embedding (CTWE), is introduced to mitigate the risk of word ambiguity and inject a local context into globally pre-trained word vectors. CTWE is essentially a framework for document representation that uses word embedding feature learning. The CTWE structure is learned locally from training data and thus represents the local context. As the content tree is constructed, each word vector is updated based on its location in the tree. For the task of classification, the results show an improvement in F-score and accuracy measures when using two deep learning-based word embedding approaches, namely GloVe and Word2Vec.
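To make the setting concrete, here is a minimal Python sketch (not the authors' CTWE code) of the baseline that CTWE refines: representing a document as the average of globally pre-trained word vectors such as GloVe or Word2Vec. The embedding dictionary and dimension are illustrative stand-ins for a real pre-trained lookup table.

```python
# Minimal sketch: document representation as the mean of pre-trained
# word vectors. The random embedding dict is a stand-in for a real
# GloVe/Word2Vec table; CTWE would additionally adjust each vector
# according to its position in a locally learned content tree.
import numpy as np

def document_vector(tokens, embeddings, dim=300):
    """Average the vectors of in-vocabulary tokens; zero vector if none."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for w in ["bank", "river", "money"]}
doc = document_vector(["money", "in", "the", "bank"], embeddings)
print(doc.shape)  # (300,)
```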

35 citations

Proceedings Article
30 Jan 2019
TL;DR: The strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information.
Abstract: In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii) the proposal of a new TF-IDF-based strategy, particularly developed to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we exceed the baselines (with a few ties) in almost all cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
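As a rough illustration of the meta-word idea, the sketch below groups a word with its nearest neighbours in a pre-trained embedding space using cosine similarity. The toy vocabulary, random vectors, and the 0.4 threshold are illustrative assumptions, not the authors' exact CluWords implementation (which also adds a dedicated TF-IDF weighting scheme).

```python
# Hedged sketch of a CluWord: a word expanded into the set of words
# whose cosine similarity to it exceeds a threshold in embedding space.
import numpy as np

def cluword(word, vocab, emb_matrix, threshold=0.4):
    """Return the words whose cosine similarity to `word` exceeds threshold."""
    v = emb_matrix[vocab.index(word)]
    norms = np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(v)
    sims = emb_matrix @ v / norms
    return [w for w, s in zip(vocab, sims) if s >= threshold]

vocab = ["car", "automobile", "vehicle", "banana"]
rng = np.random.default_rng(1)
E = rng.normal(size=(4, 50))
E[1] = E[0] + 0.1 * rng.normal(size=50)  # make "automobile" close to "car"
print(cluword("car", vocab, E))          # e.g. ['car', 'automobile']
```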

35 citations

Journal Article
TL;DR: A novel and efficient text learning framework, named Latent Topic Text Representation Learning, that aims to provide an effective text representation and text measurement with latent topics and is able to effectively measure text distance to perform text categorization tasks by leveraging statistical manifolds.
Abstract: The explosive growth of text data requires effective methods to represent and classify these texts. Many text learning methods have been proposed, such as statistics-based methods, semantic similarity methods, and deep learning methods. The statistics-based methods focus on comparing the substructure of text, which ignores the semantic similarity between different words. Semantic similarity methods learn a text representation by training word embeddings and representing a text as the average vector of all its words. However, these methods cannot clearly capture the topic diversity of words and texts. Recently, deep learning methods such as CNNs and RNNs have been studied. However, the vanishing gradient problem and the time complexity of parameter selection limit their applications. In this paper, we propose a novel and efficient text learning framework, named Latent Topic Text Representation Learning. Our method aims to provide an effective text representation and text measurement with latent topics. Under the assumption that words on the same topic follow a Gaussian distribution, texts are represented as a mixture of topics, i.e., a Gaussian mixture model. Our framework is able to effectively measure text distance to perform text categorization tasks by leveraging statistical manifolds. Experimental results on text representation, classification, and topic coherence demonstrate the effectiveness of the proposed method.
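The central modelling assumption, that words on the same topic follow a Gaussian in embedding space, can be sketched with an off-the-shelf Gaussian mixture model, as below. This only illustrates the representation step; the random vectors are stand-ins for real word embeddings, and the paper's distance measurement via statistical manifolds is not reproduced here.

```python
# Minimal sketch: fit a Gaussian mixture over word vectors, then
# represent a text by its average topic-posterior vector.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
word_vectors = rng.normal(size=(500, 50))   # stand-in for real embeddings
gmm = GaussianMixture(n_components=5, covariance_type="diag").fit(word_vectors)

def text_representation(token_vectors):
    """Represent a text as the mean of its tokens' topic posteriors."""
    return gmm.predict_proba(token_vectors).mean(axis=0)

print(text_representation(word_vectors[:10]))  # 5-dim topic mixture
```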

35 citations

Proceedings Article
01 Jan 2020
TL;DR: A smart deep learning approach for the automatic detection of cyber hate speech on Twitter in the Arab region is proposed, which achieved good results in classifying tweets as Hate or Normal in terms of accuracy, precision, recall, and F1 measure.
Abstract: Hate speech over online social networks is a worldwide problem that diminishes the cohesion of civil societies. The rapid spread of social media websites is accompanied by an increasing number of social media users and, with it, a higher rate of hate speech. The objective of this paper is to propose a smart deep learning approach for the automatic detection of cyber hate speech, particularly hate speech on Twitter in the Arab region. Hence, a dataset is collected from Twitter that captures hate expressions on different topics in the Arab region. A set of features is extracted from the dataset based on a word embedding mechanism, and the word embeddings are fed into a deep learning framework. The implemented deep learning approach is a hybrid of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The proposed approach achieved good results in classifying tweets as Hate or Normal in terms of accuracy, precision, recall, and F1 measure.
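A hybrid CNN + LSTM text classifier of the kind described can be sketched in a few lines of Keras: an embedding layer, a 1-D convolution for local n-gram features, an LSTM over the pooled feature maps, and a binary Hate/Normal output. The layer sizes, sequence length, and vocabulary size below are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of a hybrid CNN-LSTM binary text classifier.
from tensorflow.keras import layers, models

vocab_size, seq_len, emb_dim = 20000, 100, 300

model = models.Sequential([
    layers.Input(shape=(seq_len,)),             # token-id sequences
    layers.Embedding(vocab_size, emb_dim),      # word embedding layer
    layers.Conv1D(128, 5, activation="relu"),   # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(64),                            # sequence modelling
    layers.Dense(1, activation="sigmoid"),      # Hate vs. Normal
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```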

35 citations

Proceedings Article
20 Jan 2020
TL;DR: This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data.
Abstract: Accurately learning from user data while providing quantifiable privacy guarantees provides an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space as defined by word embedding models. We present a privacy proof that satisfies d_χ-privacy, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible deniability statistics, backed up by large-scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results demonstrate practical utility.
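The mechanism described can be sketched as follows: draw multivariate noise whose density is proportional to exp(-ε‖z‖), add it to the word's embedding, and snap the result back to the nearest vocabulary word. Sampling a uniform direction with a Gamma-distributed magnitude is one standard way to draw such noise; the toy vocabulary and embeddings below are illustrative, and this is not the paper's exact code.

```python
# Hedged sketch of word perturbation under a d_χ-privacy-style mechanism:
# noise with density proportional to exp(-epsilon * ||z||) is added to a word
# vector, and the noisy vector is mapped back to the nearest real word.
import numpy as np

def perturb(word, vocab, emb, epsilon, rng):
    """Noise a word's embedding and return the nearest vocabulary word."""
    n = emb.shape[1]
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)          # uniform on the sphere
    magnitude = rng.gamma(shape=n, scale=1.0 / epsilon)
    noisy = emb[vocab.index(word)] + magnitude * direction
    dists = np.linalg.norm(emb - noisy, axis=1)     # snap to nearest word
    return vocab[int(np.argmin(dists))]

vocab = ["london", "paris", "tokyo", "berlin"]
rng = np.random.default_rng(3)
emb = rng.normal(size=(4, 50))                      # stand-in embeddings
print(perturb("london", vocab, emb, epsilon=10.0, rng=rng))
```

Larger ε shrinks the expected noise magnitude, so the perturbed word is more likely to remain the original; smaller ε pushes it toward other vocabulary words, which is the privacy-utility tradeoff the paper quantifies.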

35 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 87% related
Unsupervised learning: 22.7K papers, 1M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Reinforcement learning: 46K papers, 1M citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    317
2022    716
2021    736
2020    1,025
2019    1,078
2018    788