scispace - formally typeset
Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
07 Jul 2016
TL;DR: Using two Twitter datasets, the results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.
Abstract: Scholars often seek to understand topics discussed on Twitter using topic modelling approaches. Several coherence metrics have been proposed for evaluating the coherence of the topics generated by these approaches, including the pre-calculated Pointwise Mutual Information (PMI) of word pairs and the Latent Semantic Analysis (LSA) word representation vectors. As Twitter data contains abbreviations and a number of peculiarities (e.g. hashtags), it can be challenging to train effective PMI data or LSA word representation. Recently, Word Embedding (WE) has emerged as a particularly effective approach for capturing the similarity among words. Hence, in this paper, we propose new Word Embedding-based topic coherence metrics. To determine the usefulness of these new metrics, we compare them with the previous PMI/LSA-based metrics. We also conduct a large-scale crowdsourced user study to determine whether the new Word Embedding-based metrics better align with human preferences. Using two Twitter datasets, our results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.

47 citations
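The WE-based coherence idea above can be sketched as the average pairwise cosine similarity between embedding vectors of a topic's top words. This is a minimal illustration with made-up 3-dimensional vectors; the paper's actual metrics use embeddings trained on real tweet corpora.

```python
import itertools
import math

# Toy embeddings for illustration only; real metrics would use vectors
# trained on the target corpus (e.g. tweets).
embeddings = {
    "music":  [0.90, 0.10, 0.00],
    "guitar": [0.80, 0.20, 0.10],
    "song":   [0.85, 0.15, 0.05],
    "tax":    [0.10, 0.90, 0.20],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def we_coherence(top_words):
    """Average pairwise cosine similarity of a topic's top words."""
    pairs = list(itertools.combinations(top_words, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

# A semantically tight topic scores higher than one with an intruder word.
coherent = we_coherence(["music", "guitar", "song"])
mixed = we_coherence(["music", "guitar", "tax"])
assert coherent > mixed
```

The key design point is that embedding similarity is pre-trained once and reused, avoiding the need to estimate PMI statistics or an LSA model from noisy tweet text.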

Proceedings ArticleDOI
24 Jul 2016
TL;DR: In this article, two CNN based methods, cascaded CNN and multitask CNN, are proposed to address aspect extraction and sentiment classification for aspect-based opinion summarization of reviews on particular products.
Abstract: This paper studies Aspect-based Opinion Summarization (AOS) of reviews on particular products. In practice, an AOS system needs to address two core subtasks, aspect extraction and sentiment classification. Most existing approaches to aspect extraction, using linguistic analysis or topic modeling, are general across different products but not precise enough or suitable for particular products. Instead we take a less general but more precise scheme, which directly maps each review sentence into pre-defined aspects. To tackle aspect mapping and sentiment classification, we propose two Convolutional Neural Network (CNN) based methods, cascaded CNN and multitask CNN. Cascaded CNN contains two levels of convolutional networks. Multiple CNNs at level 1 deal with aspect mapping task, and a single CNN at level 2 deals with sentiment classification. Multitask CNN also contains multiple aspect CNNs and a sentiment CNN, but different networks share the same word embeddings. Experimental results show that both cascaded and multitask CNNs with pre-trained word embedding outperform linear classifiers, and multitask CNN generally performs better than cascaded CNN.

47 citations

Proceedings ArticleDOI
03 Apr 2017
TL;DR: This paper introduces new cross-language similarity detection methods based on distributed representation of words and combines the different methods proposed to verify their complementarity, obtaining an overall F1 score of 89.15% for English-French similarity detection at chunk level.
Abstract: This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

47 citations
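One common embedding-based approach to cross-language similarity, in the spirit of the methods above, is to average word vectors from a shared bilingual space and compare the resulting sentence vectors by cosine similarity. The vectors below are hypothetical; the paper's actual methods and training data differ.

```python
import math

# Hypothetical vectors in a shared English-French embedding space.
bilingual = {
    "cat":    [0.90, 0.10],
    "chat":   [0.88, 0.12],
    "sleeps": [0.20, 0.80],
    "dort":   [0.22, 0.78],
}

def sentence_vector(tokens):
    """Average the embedding vectors of a sentence's tokens."""
    dims = len(next(iter(bilingual.values())))
    acc = [0.0] * dims
    for t in tokens:
        for i, x in enumerate(bilingual[t]):
            acc[i] += x
    return [x / len(tokens) for x in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# An English chunk and its French translation land close together.
sim = cosine(sentence_vector(["cat", "sleeps"]),
             sentence_vector(["chat", "dort"]))
assert sim > 0.99
```

A similarity threshold over such scores then yields a chunk- or sentence-level detection decision.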

Posted Content
TL;DR: In this paper, the authors present a formal approach to carry out privacy preserving text perturbation using the notion of dx-privacy designed to achieve geo-indistinguishability in location data.
Abstract: Accurately learning from user data while providing quantifiable privacy guarantees provides an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy preserving text perturbation using the notion of dx-privacy designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to vector representation of words in a high dimension space as defined by word embedding models. We present a privacy proof that satisfies dx-privacy where the privacy parameter epsilon provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how epsilon can be selected by analyzing plausible deniability statistics backed up by large scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of epsilon on different task types. Our results demonstrate practical utility (< 2% utility loss for training binary classifiers) while providing better privacy guarantees than baseline models.

47 citations
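The perturbation mechanism described above can be sketched as: add noise (scaled roughly as 1/epsilon) to a word's embedding vector, then snap the noisy vector back to the nearest vocabulary word. This toy sketch uses Gaussian noise and a three-word 2-d vocabulary purely for illustration; the actual mechanism calibrates a multivariate distribution to satisfy dx-privacy over GloVe/fastText spaces.

```python
import math
import random

random.seed(0)

# Toy 2-d vocabulary; real systems use high-dimensional GloVe/fastText vectors.
vocab = {
    "good":  [1.00, 0.00],
    "great": [0.95, 0.10],
    "bad":   [-1.00, 0.00],
}

def nearest_word(vec):
    """Return the vocabulary word whose vector is closest to vec."""
    return min(vocab, key=lambda w: math.dist(vec, vocab[w]))

def perturb(word, epsilon):
    """Noise the word's vector (scale ~ 1/epsilon), then map back to
    the nearest word. Gaussian noise is an illustrative stand-in for
    the calibrated dx-privacy distribution."""
    noisy = [x + random.gauss(0.0, 1.0 / epsilon) for x in vocab[word]]
    return nearest_word(noisy)

# Large epsilon -> little noise -> the output almost surely equals the input;
# small epsilon -> heavy noise -> the word is often replaced by a neighbor.
assert perturb("good", epsilon=1000.0) == "good"
```

The privacy/utility trade-off in the abstract corresponds directly to this epsilon knob: smaller epsilon means more perturbation and stronger plausible deniability.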

Journal ArticleDOI
TL;DR: A novel extraction based method for multi-document summarization that covers three important features of a good summary: coverage, non-redundancy, and relevancy is proposed.
Abstract: In this paper, we propose a novel extraction based method for multi-document summarization that covers three important features of a good summary: coverage, non-redundancy, and relevancy. The coverage and non-redundancy features are modeled to generate a single document from the multiple documents. These features are explored by the weighted combination of word embedding and Google based similarity methods. To accommodate the relevancy feature in the system generated summaries, the text summarization task is modeled as an optimization problem, where various text features with their optimized weights are used to score the sentences to find the relevant sentences. For features’ weight optimization, we use the meta-heuristic approach, Shark Smell Optimization (SSO). The experiments are performed on six benchmark datasets (DUC04, DUC06, DUC07, TAC08, TAC11, and MultiLing13) with the co-selection and content based performance parameters. The experimental results show that the proposed approach is viable and effective for multi-document summarization.

47 citations
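The relevancy-scoring step described above amounts to ranking sentences by a weighted combination of feature values, with the weights tuned by an optimizer (SSO in the paper). This is a minimal sketch with invented feature names and hand-set weights; the paper's actual features and SSO-optimized weights differ.

```python
# Hypothetical per-sentence feature values (all invented for illustration).
features = {
    "s1": {"coverage": 0.8, "position": 0.9, "length": 0.5},
    "s2": {"coverage": 0.3, "position": 0.2, "length": 0.7},
}

# In the paper these weights come from Shark Smell Optimization;
# here they are fixed by hand.
weights = {"coverage": 0.5, "position": 0.3, "length": 0.2}

def score(sentence_id):
    """Weighted sum of a sentence's feature values."""
    return sum(weights[f] * v for f, v in features[sentence_id].items())

# Rank sentences by score; the top-ranked ones form the summary.
ranked = sorted(features, key=score, reverse=True)
assert ranked[0] == "s1"
```

Swapping in optimized weights changes only the `weights` dict, which is what makes the weight search cleanly separable from sentence scoring.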


Network Information
Related Topics (5)
Recurrent neural network
29.2K papers, 890K citations
87% related
Unsupervised learning
22.7K papers, 1M citations
86% related
Deep learning
79.8K papers, 2.1M citations
85% related
Reinforcement learning
46K papers, 1M citations
84% related
Graph (abstract data type)
69.9K papers, 1.2M citations
84% related
Performance Metrics
No. of papers in the topic in previous years:
2023: 317
2022: 716
2021: 736
2020: 1,025
2019: 1,078
2018: 788