Topic

Word embedding

About: Word embedding is a research topic. Over the lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
14 Oct 2019
TL;DR: Proposes an online algorithm that discovers topics by incrementally grouping short texts, combining their textual content with latent feature vector representations of the words they contain; embeddings trained on very large corpora improve the check-in topic mapping learnt on a smaller corpus.
Abstract: Social media are playing an increasingly important role in reporting major events happening in the world. However, detecting events and topics of interest from social media is a challenging task due to the huge magnitude of the data and the complex semantics of the language being processed. The paper proposes an online algorithm that discovers topics by incrementally grouping short texts, combining their textual content with latent feature vector representations of the words appearing in the text; these representations are trained on very large corpora to improve the check-in topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, the approach obtains significant improvements over classical topic detection methods. CCS Concepts: • Information systems → Clustering; Data stream mining; Data extraction and integration; • Computing methodologies → Neural networks.

29 citations
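The representation trick in this paper, mapping each short text into the embedding space of its words so texts can be grouped there, can be illustrated with a minimal sketch. This is not the paper's incremental online algorithm: it assumes a pretrained dict-like `vectors` lookup (word to NumPy array), and substitutes plain batch k-means for the online grouping.

```python
# Minimal sketch: represent each short text by the mean of its words'
# pretrained embeddings, then cluster the text vectors into topics.
# `vectors` (word -> np.ndarray) is an assumed pretrained lookup, e.g.
# loaded from GloVe; k-means stands in for the paper's online grouping.
import numpy as np
from sklearn.cluster import KMeans

def text_vector(text, vectors, dim=300):
    """Average the embeddings of in-vocabulary tokens (zeros if none)."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)

def cluster_short_texts(texts, vectors, n_topics=10, dim=300):
    X = np.stack([text_vector(t, vectors, dim) for t in texts])
    return KMeans(n_clusters=n_topics, n_init=10).fit_predict(X)
```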

Posted Content
TL;DR: Proposes a novel method to train domain-specific word embeddings from sparse texts, developing a Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations into word embedding.
Abstract: Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse, as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerabilities and Exposures (CVE) corpus. Our evaluation results demonstrate the effectiveness of our method in learning domain-specific word embeddings.

29 citations
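As background for the abstract's opening definition, here is a minimal sketch of classic word embedding training with gensim's Word2Vec (one of the baseline methods the abstract names); the toy corpus and parameters are illustrative, and this is not the paper's WAE algorithm.

```python
# Minimal sketch of classic word embedding training with gensim's
# Word2Vec: each vocabulary word is mapped to a dense vector of real
# numbers learned from co-occurrence. Toy corpus; not the WAE algorithm.
from gensim.models import Word2Vec

sentences = [
    ["the", "malware", "encrypts", "files", "and", "demands", "ransom"],
    ["the", "vulnerability", "allows", "remote", "code", "execution"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["malware"]                        # 100-dimensional vector
print(model.wv.most_similar("malware", topn=3))  # nearest neighbours
```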

Journal ArticleDOI
TL;DR: Proposes an automatic unsupervised approach to build a thesaurus of software-specific terms and commonly used morphological forms, and verifies the generality of the approach by constructing thesauruses from data sources in other domains.
Abstract: Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-cumulated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.

29 citations
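One step of the abstract's pipeline, inferring morphological forms by combining distributed word semantics with lexical rules, can be sketched roughly as below. The subsequence abbreviation rule, the 0.6 threshold, and the function names are illustrative assumptions; `wv` is assumed to be a trained gensim `KeyedVectors`, and the corpus-contrast and graph-analysis stages are omitted.

```python
# Rough sketch: propose morphological forms of a term by requiring both
# embedding-space similarity (distributed semantics) and a lexical rule.
# The rule and threshold are illustrative; `wv` is assumed to be a
# trained gensim KeyedVectors over a software-specific corpus.
def is_abbreviation(short, full):
    """Rule: `short` is an in-order subsequence of `full` ('cfg' ~ 'config')."""
    rest = iter(full.lower())
    return all(ch in rest for ch in short.lower())

def candidate_forms(term, wv, topn=50, threshold=0.6):
    forms = []
    for word, sim in wv.most_similar(term, topn=topn):
        if sim >= threshold and (is_abbreviation(word, term)
                                 or is_abbreviation(term, word)):
            forms.append((word, sim))
    return forms
```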

Proceedings ArticleDOI
01 Jun 2016
TL;DR: On SemEval 2016 Task 4 subtasks A and B, the ensemble performs better than any single classifier, and the method of including topic information achieves a substantial performance gain.
Abstract: This paper describes our sentiment classification system for microblog-sized documents, and documents where a topic is present. The system consists of a soft-voting ensemble of a word2vec language model adapted to classification, a convolutional neural network (CNN), and a long short-term memory network (LSTM). Our main contribution consists of a way to introduce topic information into this model, by concatenating a topic embedding, consisting of the averaged word embedding for that topic, to each word embedding vector in our neural networks. When we apply our models to SemEval 2016 Task 4 subtasks A and B, we demonstrate that the ensemble performs better than any single classifier, and our method of including topic information achieves a substantial performance gain. According to results on the official test sets, our model ranked 3rd for F^PN in the message-only subtask A (among 34 teams) and 1st for accuracy on the topic-dependent subtask B (among 19 teams).

29 citations
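The contribution described in the abstract, concatenating a topic embedding (the average of the topic words' vectors) to each word embedding before it enters the CNN/LSTM, is concrete enough to sketch. The `vectors` lookup and function names below are illustrative assumptions, not the authors' code.

```python
# Sketch of the topic-augmentation idea: the topic embedding is the mean
# of the topic words' vectors, concatenated to every token vector. The
# `vectors` lookup (word -> np.ndarray) is an assumed pretrained table.
import numpy as np

def topic_embedding(topic_words, vectors, dim=300):
    vecs = [vectors[w] for w in topic_words if w in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def augment_tokens(tokens, topic_words, vectors, dim=300):
    t_vec = topic_embedding(topic_words, vectors, dim)
    rows = [np.concatenate([vectors.get(tok, np.zeros(dim)), t_vec])
            for tok in tokens]
    return np.stack(rows)   # shape: (len(tokens), 2 * dim)
```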

Proceedings ArticleDOI
01 Jun 2018
TL;DR: Demonstrates the usefulness of context embeddings in predicting asymmetric association between words from a recently published dataset of production norms, and suggests that when asked to generate thematically related words, humans respond with words closer to the cue within the context embedding space rather than the word embedding space.
Abstract: Word embeddings obtained from neural network models such as Word2Vec Skip-gram have become popular representations of word meaning and have been evaluated on a variety of word similarity and relatedness norming data. Skip-gram generates a set of word and context embeddings, the latter typically discarded after training. We demonstrate the usefulness of context embeddings in predicting asymmetric association between words from a recently published dataset of production norms (Jouravlev & McRae, 2016). Our findings suggest that humans respond with words closer to the cue within the context embedding space (rather than the word embedding space), when asked to generate thematically related words.

29 citations
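The paper's central idea, scoring cue-to-response association in the context embedding space that Skip-gram training normally discards, can be sketched against gensim's Word2Vec. The output-side matrix for negative sampling is exposed as the internal attribute `model.syn1neg`; relying on that internal, and the cosine scoring below, are assumptions of this sketch.

```python
# Sketch: cosine similarity between a cue's *word* (input) vector and a
# response's *context* (output) vector, giving an asymmetric score.
# `model.syn1neg` is gensim's internal negative-sampling output matrix;
# treating it as the context embeddings is an assumption of this sketch.
import numpy as np

def context_similarity(model, cue, response):
    w = model.wv[cue]                                   # input-side vector of cue
    c = model.syn1neg[model.wv.key_to_index[response]]  # context vector of response
    return float(np.dot(w, c) / (np.linalg.norm(w) * np.linalg.norm(c)))
```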


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 87% related
Unsupervised learning: 22.7K papers, 1M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Reinforcement learning: 46K papers, 1M citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Performance Metrics
No. of papers in the topic in previous years

Year: Papers
2023: 317
2022: 716
2021: 736
2020: 1,025
2019: 1,078
2018: 788