Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published on this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Journal Article
TL;DR: Content Tree Word Embedding is introduced to mitigate the risk of word ambiguity and inject a local context into globally pre-trained word vectors and shows an improvement in F-score and accuracy measures when using two deep learning-based word embedding approaches, namely GloVe and Word2Vec.
Abstract: Only humans can understand and comprehend the actual meaning that underlies natural written language, whereas machines can form semantic relationships only after humans have provided the parameters necessary to model that meaning. To enable computer models to access the underlying meaning in written language, accurate and sufficient document representation is crucial. Recently, word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is the use of global corpora for the generation of pre-trained word vectors. Although very effective, these approaches have their disadvantages. Relying only on pre-trained word vectors may neglect the local context and increase word ambiguity. In this study, a new approach, Content Tree Word Embedding (CTWE), is introduced to mitigate the risk of word ambiguity and inject a local context into globally pre-trained word vectors. CTWE is essentially a framework for document representation that uses word embedding feature learning. The CTWE structure is learned locally from training data and thus represents the local context. As the content tree is constructed, each word vector is updated based on its location in the tree. For the task of classification, the results show an improvement in F-score and accuracy measures when using two deep learning-based word embedding approaches, namely GloVe and Word2Vec.
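To make the setting concrete, here is a minimal Python sketch (not the authors' CTWE code) of the baseline that CTWE refines: representing a document as the average of globally pre-trained word vectors such as GloVe or Word2Vec. The embedding dictionary and dimension are illustrative stand-ins for a real pre-trained lookup table.

```python
# Minimal sketch: document representation as the mean of pre-trained
# word vectors. The random embedding dict is a stand-in for a real
# GloVe/Word2Vec table; CTWE would additionally adjust each vector
# according to its position in a locally learned content tree.
import numpy as np

def document_vector(tokens, embeddings, dim=300):
    """Average the vectors of in-vocabulary tokens; zero vector if none."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for w in ["bank", "river", "money"]}
doc = document_vector(["money", "in", "the", "bank"], embeddings)
print(doc.shape)  # (300,)
```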

35 citations

Proceedings Article
30 Jan 2019
TL;DR: The strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information.
Abstract: In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii) the proposal of a new TF-IDF-based strategy, particularly developed to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we exceed the baselines (with a few ties) in almost all cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
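As a rough illustration of the meta-word idea, the sketch below groups a word with its nearest neighbours in a pre-trained embedding space using cosine similarity. The toy vocabulary, random vectors, and the 0.4 threshold are illustrative assumptions, not the authors' exact CluWords implementation (which also adds a dedicated TF-IDF weighting scheme).

```python
# Hedged sketch of a CluWord: a word expanded into the set of words
# whose cosine similarity to it exceeds a threshold in embedding space.
import numpy as np

def cluword(word, vocab, emb_matrix, threshold=0.4):
    """Return the words whose cosine similarity to `word` exceeds threshold."""
    v = emb_matrix[vocab.index(word)]
    norms = np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(v)
    sims = emb_matrix @ v / norms
    return [w for w, s in zip(vocab, sims) if s >= threshold]

vocab = ["car", "automobile", "vehicle", "banana"]
rng = np.random.default_rng(1)
E = rng.normal(size=(4, 50))
E[1] = E[0] + 0.1 * rng.normal(size=50)  # make "automobile" close to "car"
print(cluword("car", vocab, E))          # e.g. ['car', 'automobile']
```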

35 citations

Journal Article
TL;DR: A novel and efficient text learning framework, named Latent Topic Text Representation Learning, that aims to provide an effective text representation and text measurement with latent topics and is able to effectively measure text distance to perform text categorization tasks by leveraging statistical manifolds.
Abstract: The explosive growth of text data requires effective methods to represent and classify these texts. Many text learning methods have been proposed, such as statistics-based methods, semantic similarity methods, and deep learning methods. The statistics-based methods focus on comparing the substructure of text, which ignores the semantic similarity between different words. Semantic similarity methods learn a text representation by training word embeddings and representing a text as the average vector of all its words. However, these methods cannot clearly capture the topic diversity of words and texts. Recently, deep learning methods such as CNNs and RNNs have been studied. However, the vanishing gradient problem and the time complexity of parameter selection limit their applications. In this paper, we propose a novel and efficient text learning framework, named Latent Topic Text Representation Learning. Our method aims to provide an effective text representation and text measurement with latent topics. Under the assumption that words on the same topic follow a Gaussian distribution, texts are represented as a mixture of topics, i.e., a Gaussian mixture model. Our framework is able to effectively measure text distance to perform text categorization tasks by leveraging statistical manifolds. Experimental results on text representation, classification, and topic coherence demonstrate the effectiveness of the proposed method.
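The central modelling assumption, that words on the same topic follow a Gaussian in embedding space, can be sketched with an off-the-shelf Gaussian mixture model, as below. This only illustrates the representation step; the random vectors are stand-ins for real word embeddings, and the paper's distance measurement via statistical manifolds is not reproduced here.

```python
# Minimal sketch: fit a Gaussian mixture over word vectors, then
# represent a text by its average topic-posterior vector.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
word_vectors = rng.normal(size=(500, 50))   # stand-in for real embeddings
gmm = GaussianMixture(n_components=5, covariance_type="diag").fit(word_vectors)

def text_representation(token_vectors):
    """Represent a text as the mean of its tokens' topic posteriors."""
    return gmm.predict_proba(token_vectors).mean(axis=0)

print(text_representation(word_vectors[:10]))  # 5-dim topic mixture
```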

35 citations

Proceedings Article
01 Jan 2020
TL;DR: A smart deep learning approach for the automatic detection of cyber hate speech on Twitter in the Arab region is proposed, which achieved good results in classifying tweets as Hate or Normal in terms of accuracy, precision, recall, and F1 measure.
Abstract: Hate speech over online social networks is a worldwide problem that diminishes the cohesion of civil societies. The rapid spread of social media websites is accompanied by an increasing number of social media users and, with it, a higher rate of hate speech. The objective of this paper is to propose a smart deep learning approach for the automatic detection of cyber hate speech, particularly hate speech on Twitter in the Arab region. Hence, a dataset is collected from Twitter that captures hate expressions on different topics in the Arab region. A set of features is extracted from the dataset based on a word embedding mechanism, and the word embeddings are fed into a deep learning framework. The implemented deep learning approach is a hybrid of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The proposed approach achieved good results in classifying tweets as Hate or Normal in terms of accuracy, precision, recall, and F1 measure.
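A hybrid CNN + LSTM text classifier of the kind described can be sketched in a few lines of Keras: an embedding layer, a 1-D convolution for local n-gram features, an LSTM over the pooled feature maps, and a binary Hate/Normal output. The layer sizes, sequence length, and vocabulary size below are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of a hybrid CNN-LSTM binary text classifier.
from tensorflow.keras import layers, models

vocab_size, seq_len, emb_dim = 20000, 100, 300

model = models.Sequential([
    layers.Input(shape=(seq_len,)),             # token-id sequences
    layers.Embedding(vocab_size, emb_dim),      # word embedding layer
    layers.Conv1D(128, 5, activation="relu"),   # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(64),                            # sequence modelling
    layers.Dense(1, activation="sigmoid"),      # Hate vs. Normal
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```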

35 citations

Proceedings Article
20 Jan 2020
TL;DR: This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data.
Abstract: Accurately learning from user data while providing quantifiable privacy guarantees provides an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space as defined by word embedding models. We present a privacy proof that satisfies d_χ-privacy, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible deniability statistics, backed up by large-scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results demonstrate practical utility.
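The mechanism described can be sketched as follows: draw multivariate noise whose density is proportional to exp(-ε‖z‖), add it to the word's embedding, and snap the result back to the nearest vocabulary word. Sampling a uniform direction with a Gamma-distributed magnitude is one standard way to draw such noise; the toy vocabulary and embeddings below are illustrative, and this is not the paper's exact code.

```python
# Hedged sketch of word perturbation under a d_χ-privacy-style mechanism:
# noise with density proportional to exp(-epsilon * ||z||) is added to a word
# vector, and the noisy vector is mapped back to the nearest real word.
import numpy as np

def perturb(word, vocab, emb, epsilon, rng):
    """Noise a word's embedding and return the nearest vocabulary word."""
    n = emb.shape[1]
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)          # uniform on the sphere
    magnitude = rng.gamma(shape=n, scale=1.0 / epsilon)
    noisy = emb[vocab.index(word)] + magnitude * direction
    dists = np.linalg.norm(emb - noisy, axis=1)     # snap to nearest word
    return vocab[int(np.argmin(dists))]

vocab = ["london", "paris", "tokyo", "berlin"]
rng = np.random.default_rng(3)
emb = rng.normal(size=(4, 50))                      # stand-in embeddings
print(perturb("london", vocab, emb, epsilon=10.0, rng=rng))
```

Larger ε shrinks the expected noise magnitude, so the perturbed word is more likely to remain the original; smaller ε pushes it toward other vocabulary words, which is the privacy-utility tradeoff the paper quantifies.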

35 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 87% related
Unsupervised learning: 22.7K papers, 1M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Reinforcement learning: 46K papers, 1M citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    317
2022    716
2021    736
2020    1,025
2019    1,078
2018    788