scispace - formally typeset
Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
07 Jul 2016
TL;DR: Using two Twitter datasets, the results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.
Abstract: Scholars often seek to understand topics discussed on Twitter using topic modelling approaches. Several coherence metrics have been proposed for evaluating the coherence of the topics generated by these approaches, including the pre-calculated Pointwise Mutual Information (PMI) of word pairs and the Latent Semantic Analysis (LSA) word representation vectors. As Twitter data contains abbreviations and a number of peculiarities (e.g. hashtags), it can be challenging to train effective PMI data or LSA word representation. Recently, Word Embedding (WE) has emerged as a particularly effective approach for capturing the similarity among words. Hence, in this paper, we propose new Word Embedding-based topic coherence metrics. To determine the usefulness of these new metrics, we compare them with the previous PMI/LSA-based metrics. We also conduct a large-scale crowdsourced user study to determine whether the new Word Embedding-based metrics better align with human preferences. Using two Twitter datasets, our results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.

47 citations
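The WE-based coherence idea above can be sketched as the average pairwise cosine similarity between embedding vectors of a topic's top words. This is a minimal illustration with made-up 3-dimensional vectors; the paper's actual metrics use embeddings trained on real tweet corpora.

```python
import itertools
import math

# Toy embeddings for illustration only; real metrics would use vectors
# trained on the target corpus (e.g. tweets).
embeddings = {
    "music":  [0.90, 0.10, 0.00],
    "guitar": [0.80, 0.20, 0.10],
    "song":   [0.85, 0.15, 0.05],
    "tax":    [0.10, 0.90, 0.20],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def we_coherence(top_words):
    """Average pairwise cosine similarity of a topic's top words."""
    pairs = list(itertools.combinations(top_words, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

# A semantically tight topic scores higher than one with an intruder word.
coherent = we_coherence(["music", "guitar", "song"])
mixed = we_coherence(["music", "guitar", "tax"])
assert coherent > mixed
```

The key design point is that embedding similarity is pre-trained once and reused, avoiding the need to estimate PMI statistics or an LSA model from noisy tweet text.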

Proceedings ArticleDOI
24 Jul 2016
TL;DR: In this article, two CNN based methods, cascaded CNN and multitask CNN, are proposed to address aspect extraction and sentiment classification for aspect-based opinion summarization of reviews on particular products.
Abstract: This paper studies Aspect-based Opinion Summarization (AOS) of reviews on particular products. In practice, an AOS system needs to address two core subtasks, aspect extraction and sentiment classification. Most existing approaches to aspect extraction, using linguistic analysis or topic modeling, are general across different products but not precise enough or suitable for particular products. Instead we take a less general but more precise scheme, which directly maps each review sentence into pre-defined aspects. To tackle aspect mapping and sentiment classification, we propose two Convolutional Neural Network (CNN) based methods, cascaded CNN and multitask CNN. Cascaded CNN contains two levels of convolutional networks. Multiple CNNs at level 1 deal with aspect mapping task, and a single CNN at level 2 deals with sentiment classification. Multitask CNN also contains multiple aspect CNNs and a sentiment CNN, but different networks share the same word embeddings. Experimental results show that both cascaded and multitask CNNs with pre-trained word embedding outperform linear classifiers, and multitask CNN generally performs better than cascaded CNN.

47 citations

Proceedings ArticleDOI
03 Apr 2017
TL;DR: This paper introduces new cross-language similarity detection methods based on distributed representation of words and combines the different methods proposed to verify their complementarity, obtaining an overall F1 score of 89.15% for English-French similarity detection at chunk level.
Abstract: This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

47 citations
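One common embedding-based approach to cross-language similarity, in the spirit of the methods above, is to average word vectors from a shared bilingual space and compare the resulting sentence vectors by cosine similarity. The vectors below are hypothetical; the paper's actual methods and training data differ.

```python
import math

# Hypothetical vectors in a shared English-French embedding space.
bilingual = {
    "cat":    [0.90, 0.10],
    "chat":   [0.88, 0.12],
    "sleeps": [0.20, 0.80],
    "dort":   [0.22, 0.78],
}

def sentence_vector(tokens):
    """Average the embedding vectors of a sentence's tokens."""
    dims = len(next(iter(bilingual.values())))
    acc = [0.0] * dims
    for t in tokens:
        for i, x in enumerate(bilingual[t]):
            acc[i] += x
    return [x / len(tokens) for x in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# An English chunk and its French translation land close together.
sim = cosine(sentence_vector(["cat", "sleeps"]),
             sentence_vector(["chat", "dort"]))
assert sim > 0.99
```

A similarity threshold over such scores then yields a chunk- or sentence-level detection decision.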

Posted Content
TL;DR: In this paper, the authors present a formal approach to carry out privacy preserving text perturbation using the notion of dx-privacy designed to achieve geo-indistinguishability in location data.
Abstract: Accurately learning from user data while providing quantifiable privacy guarantees provides an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy preserving text perturbation using the notion of dx-privacy designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to vector representation of words in a high dimension space as defined by word embedding models. We present a privacy proof that satisfies dx-privacy where the privacy parameter epsilon provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how epsilon can be selected by analyzing plausible deniability statistics backed up by large scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of epsilon on different task types. Our results demonstrate practical utility (< 2% utility loss for training binary classifiers) while providing better privacy guarantees than baseline models.

47 citations
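The perturbation mechanism described above can be sketched as: add noise (scaled roughly as 1/epsilon) to a word's embedding vector, then snap the noisy vector back to the nearest vocabulary word. This toy sketch uses Gaussian noise and a three-word 2-d vocabulary purely for illustration; the actual mechanism calibrates a multivariate distribution to satisfy dx-privacy over GloVe/fastText spaces.

```python
import math
import random

random.seed(0)

# Toy 2-d vocabulary; real systems use high-dimensional GloVe/fastText vectors.
vocab = {
    "good":  [1.00, 0.00],
    "great": [0.95, 0.10],
    "bad":   [-1.00, 0.00],
}

def nearest_word(vec):
    """Return the vocabulary word whose vector is closest to vec."""
    return min(vocab, key=lambda w: math.dist(vec, vocab[w]))

def perturb(word, epsilon):
    """Noise the word's vector (scale ~ 1/epsilon), then map back to
    the nearest word. Gaussian noise is an illustrative stand-in for
    the calibrated dx-privacy distribution."""
    noisy = [x + random.gauss(0.0, 1.0 / epsilon) for x in vocab[word]]
    return nearest_word(noisy)

# Large epsilon -> little noise -> the output almost surely equals the input;
# small epsilon -> heavy noise -> the word is often replaced by a neighbor.
assert perturb("good", epsilon=1000.0) == "good"
```

The privacy/utility trade-off in the abstract corresponds directly to this epsilon knob: smaller epsilon means more perturbation and stronger plausible deniability.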

Journal ArticleDOI
TL;DR: A novel extraction based method for multi-document summarization that covers three important features of a good summary: coverage, non-redundancy, and relevancy is proposed.
Abstract: In this paper, we propose a novel extraction based method for multi-document summarization that covers three important features of a good summary: coverage, non-redundancy, and relevancy. The coverage and non-redundancy features are modeled to generate a single document from the multiple documents. These features are explored by the weighted combination of word embedding and Google based similarity methods. To accommodate the relevancy feature in the system generated summaries, the text summarization task is modeled as an optimization problem, where various text features with their optimized weights are used to score the sentences to find the relevant sentences. For features’ weight optimization, we use the meta-heuristic approach, Shark Smell Optimization (SSO). The experiments are performed on six benchmark datasets (DUC04, DUC06, DUC07, TAC08, TAC11, and MultiLing13) with the co-selection and content based performance parameters. The experimental results show that the proposed approach is viable and effective for multi-document summarization.

47 citations
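The relevancy-scoring step described above amounts to ranking sentences by a weighted combination of feature values, with the weights tuned by an optimizer (SSO in the paper). This is a minimal sketch with invented feature names and hand-set weights; the paper's actual features and SSO-optimized weights differ.

```python
# Hypothetical per-sentence feature values (all invented for illustration).
features = {
    "s1": {"coverage": 0.8, "position": 0.9, "length": 0.5},
    "s2": {"coverage": 0.3, "position": 0.2, "length": 0.7},
}

# In the paper these weights come from Shark Smell Optimization;
# here they are fixed by hand.
weights = {"coverage": 0.5, "position": 0.3, "length": 0.2}

def score(sentence_id):
    """Weighted sum of a sentence's feature values."""
    return sum(weights[f] * v for f, v in features[sentence_id].items())

# Rank sentences by score; the top-ranked ones form the summary.
ranked = sorted(features, key=score, reverse=True)
assert ranked[0] == "s1"
```

Swapping in optimized weights changes only the `weights` dict, which is what makes the weight search cleanly separable from sentence scoring.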


Network Information
Related Topics (5)
Recurrent neural network
29.2K papers, 890K citations
87% related
Unsupervised learning
22.7K papers, 1M citations
86% related
Deep learning
79.8K papers, 2.1M citations
85% related
Reinforcement learning
46K papers, 1M citations
84% related
Graph (abstract data type)
69.9K papers, 1.2M citations
84% related
Performance Metrics
No. of papers in the topic in previous years:
2023: 317
2022: 716
2021: 736
2020: 1,025
2019: 1,078
2018: 788