Topic

Word embedding

About: Word embedding is a research topic. Over the lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
14 Oct 2019
TL;DR: Proposes an online algorithm that discovers topics by incrementally grouping short texts, combining their textual content with latent feature vector representations of the words they contain; embeddings trained on very large corpora improve the check-in topic mapping learnt on a smaller corpus.
Abstract: Social media are playing an increasingly important role in reporting major events happening in the world. However, detecting events and topics of interest from social media is a challenging task due to the huge magnitude of the data and the complex semantics of the language being processed. The paper proposes an online algorithm that discovers topics by incrementally grouping short texts, combining their textual content with latent feature vector representations of the words appearing in the text; these representations are trained on very large corpora to improve the check-in topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, the approach obtains significant improvements over classical topic detection methods. CCS Concepts: • Information systems → Clustering; Data stream mining; Data extraction and integration; • Computing methodologies → Neural networks.

29 citations
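The representation trick in this paper, mapping each short text into the embedding space of its words so texts can be grouped there, can be illustrated with a minimal sketch. This is not the paper's incremental online algorithm: it assumes a pretrained dict-like `vectors` lookup (word to NumPy array), and substitutes plain batch k-means for the online grouping.

```python
# Minimal sketch: represent each short text by the mean of its words'
# pretrained embeddings, then cluster the text vectors into topics.
# `vectors` (word -> np.ndarray) is an assumed pretrained lookup, e.g.
# loaded from GloVe; k-means stands in for the paper's online grouping.
import numpy as np
from sklearn.cluster import KMeans

def text_vector(text, vectors, dim=300):
    """Average the embeddings of in-vocabulary tokens (zeros if none)."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)

def cluster_short_texts(texts, vectors, n_topics=10, dim=300):
    X = np.stack([text_vector(t, vectors, dim) for t in texts])
    return KMeans(n_clusters=n_topics, n_init=10).fit_predict(X)
```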

Posted Content
TL;DR: Proposes a novel method to train domain-specific word embeddings from sparse texts, developing a Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations into word embedding.
Abstract: Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse, as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerabilities and Exposures (CVE) corpus. Our evaluation results demonstrate the effectiveness of our method in learning domain-specific word embeddings.

29 citations
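As background for the abstract's opening definition, here is a minimal sketch of classic word embedding training with gensim's Word2Vec (one of the baseline methods the abstract names); the toy corpus and parameters are illustrative, and this is not the paper's WAE algorithm.

```python
# Minimal sketch of classic word embedding training with gensim's
# Word2Vec: each vocabulary word is mapped to a dense vector of real
# numbers learned from co-occurrence. Toy corpus; not the WAE algorithm.
from gensim.models import Word2Vec

sentences = [
    ["the", "malware", "encrypts", "files", "and", "demands", "ransom"],
    ["the", "vulnerability", "allows", "remote", "code", "execution"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["malware"]                        # 100-dimensional vector
print(model.wv.most_similar("malware", topn=3))  # nearest neighbours
```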

Journal ArticleDOI
TL;DR: Proposes an automatic unsupervised approach to build a thesaurus of software-specific terms and commonly used morphological forms, and verifies the generality of the approach by constructing thesauruses from data sources in other domains.
Abstract: Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, consistent vocabulary for a concept is essential to make effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-cumulated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.

29 citations
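One step of the abstract's pipeline, inferring morphological forms by combining distributed word semantics with lexical rules, can be sketched roughly as below. The subsequence abbreviation rule, the 0.6 threshold, and the function names are illustrative assumptions; `wv` is assumed to be a trained gensim `KeyedVectors`, and the corpus-contrast and graph-analysis stages are omitted.

```python
# Rough sketch: propose morphological forms of a term by requiring both
# embedding-space similarity (distributed semantics) and a lexical rule.
# The rule and threshold are illustrative; `wv` is assumed to be a
# trained gensim KeyedVectors over a software-specific corpus.
def is_abbreviation(short, full):
    """Rule: `short` is an in-order subsequence of `full` ('cfg' ~ 'config')."""
    rest = iter(full.lower())
    return all(ch in rest for ch in short.lower())

def candidate_forms(term, wv, topn=50, threshold=0.6):
    forms = []
    for word, sim in wv.most_similar(term, topn=topn):
        if sim >= threshold and (is_abbreviation(word, term)
                                 or is_abbreviation(term, word)):
            forms.append((word, sim))
    return forms
```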

Proceedings ArticleDOI
01 Jun 2016
TL;DR: On SemEval 2016 Task 4 subtasks A and B, the ensemble performs better than any single classifier, and the method of including topic information achieves a substantial performance gain.
Abstract: This paper describes our sentiment classification system for microblog-sized documents, and documents where a topic is present. The system consists of a soft-voting ensemble of a word2vec language model adapted to classification, a convolutional neural network (CNN), and a long short-term memory network (LSTM). Our main contribution consists of a way to introduce topic information into this model, by concatenating a topic embedding, consisting of the averaged word embedding for that topic, to each word embedding vector in our neural networks. When we apply our models to SemEval 2016 Task 4 subtasks A and B, we demonstrate that the ensemble performs better than any single classifier, and our method of including topic information achieves a substantial performance gain. According to results on the official test sets, our model ranked 3rd for F^PN in the message-only subtask A (among 34 teams) and 1st for accuracy on the topic-dependent subtask B (among 19 teams).

29 citations
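The contribution described in the abstract, concatenating a topic embedding (the average of the topic words' vectors) to each word embedding before it enters the CNN/LSTM, is concrete enough to sketch. The `vectors` lookup and function names below are illustrative assumptions, not the authors' code.

```python
# Sketch of the topic-augmentation idea: the topic embedding is the mean
# of the topic words' vectors, concatenated to every token vector. The
# `vectors` lookup (word -> np.ndarray) is an assumed pretrained table.
import numpy as np

def topic_embedding(topic_words, vectors, dim=300):
    vecs = [vectors[w] for w in topic_words if w in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def augment_tokens(tokens, topic_words, vectors, dim=300):
    t_vec = topic_embedding(topic_words, vectors, dim)
    rows = [np.concatenate([vectors.get(tok, np.zeros(dim)), t_vec])
            for tok in tokens]
    return np.stack(rows)   # shape: (len(tokens), 2 * dim)
```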

Proceedings ArticleDOI
01 Jun 2018
TL;DR: Demonstrates the usefulness of context embeddings in predicting asymmetric association between words from a recently published dataset of production norms, and suggests that when asked to generate thematically related words, humans respond with words closer to the cue within the context embedding space rather than the word embedding space.
Abstract: Word embeddings obtained from neural network models such as Word2Vec Skip-gram have become popular representations of word meaning and have been evaluated on a variety of word similarity and relatedness norming data. Skip-gram generates a set of word and context embeddings, the latter typically discarded after training. We demonstrate the usefulness of context embeddings in predicting asymmetric association between words from a recently published dataset of production norms (Jouravlev & McRae, 2016). Our findings suggest that humans respond with words closer to the cue within the context embedding space (rather than the word embedding space), when asked to generate thematically related words.

29 citations
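The paper's central idea, scoring cue-to-response association in the context embedding space that Skip-gram training normally discards, can be sketched against gensim's Word2Vec. The output-side matrix for negative sampling is exposed as the internal attribute `model.syn1neg`; relying on that internal, and the cosine scoring below, are assumptions of this sketch.

```python
# Sketch: cosine similarity between a cue's *word* (input) vector and a
# response's *context* (output) vector, giving an asymmetric score.
# `model.syn1neg` is gensim's internal negative-sampling output matrix;
# treating it as the context embeddings is an assumption of this sketch.
import numpy as np

def context_similarity(model, cue, response):
    w = model.wv[cue]                                   # input-side vector of cue
    c = model.syn1neg[model.wv.key_to_index[response]]  # context vector of response
    return float(np.dot(w, c) / (np.linalg.norm(w) * np.linalg.norm(c)))
```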


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 87% related
Unsupervised learning: 22.7K papers, 1M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Reinforcement learning: 46K papers, 1M citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Performance Metrics
No. of papers in the topic in previous years

Year: Papers
2023: 317
2022: 716
2021: 736
2020: 1,025
2019: 1,078
2018: 788