Proceedings Article

Topical word embeddings

25 Jan 2015-Vol. 29, Iss: 1, pp 2418-2424
TL;DR: The experimental results show that the TWE models outperform typical word embedding models including the multi-prototype version on contextual word similarity, and also exceed latent topic models and other representative document models on text classification.
Abstract: Most word embedding models typically represent each word using a single vector, which makes these models indiscriminative for ubiquitous homonymy and polysemy. In order to enhance discriminativeness, we employ latent topic models to assign topics for each word in the text corpus, and learn topical word embeddings (TWE) based on both words and their topics. In this way, contextual word embeddings can be flexibly obtained to measure contextual word similarity. We can also build document representations, which are more expressive than some widely-used document models such as latent topic models. In the experiments, we evaluate the TWE models on two tasks, contextual word similarity and text classification. The experimental results show that our models outperform typical word embedding models including the multi-prototype version on contextual word similarity, and also exceed latent topic models and other representative document models on text classification. The source code of this paper can be obtained from https://github.com/largelymfs/topical_word_embeddings.
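
A rough illustration of the pairing step the abstract describes: one of the TWE variants treats each word token together with its LDA-assigned topic as a unit. The sketch below is not the authors' code; the toy corpus, the per-token topic ids, and the "word_topic" pseudo-word naming are assumptions made for illustration, with gensim standing in for the Skip-Gram implementation.

    # Illustrative sketch only: pair each token with its topic and train Skip-Gram
    # over the resulting pseudo-words, giving one vector per (word, topic) pair.
    from gensim.models import Word2Vec

    tokenized_docs = [["apple", "released", "phone"], ["apple", "pie", "recipe"]]
    topic_assignments = [[0, 0, 0], [1, 1, 1]]     # toy per-token topic ids (LDA would supply these)

    topical_docs = [
        [f"{w}_{z}" for w, z in zip(doc, zs)]      # e.g. "apple_0", "apple_1"
        for doc, zs in zip(tokenized_docs, topic_assignments)
    ]

    model = Word2Vec(topical_docs, vector_size=50, window=2, sg=1, min_count=1)
    print(model.wv["apple_0"])                     # "apple" under the tech-flavoured topic
    print(model.wv["apple_1"])                     # "apple" under the food-flavoured topic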


Citations
Proceedings ArticleDOI
10 Aug 2015
TL;DR: A semi-supervised representation learning method for text data, called predictive text embedding (PTE), which is comparable to or more effective than CNN-based supervised approaches while being much more efficient and having fewer parameters to tune.
Abstract: Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without leveraging the labeled information available for the task. Although the low dimensional representations learned are applicable to many different tasks, they are not particularly tuned for any task. In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call the predictive text embedding (PTE). Predictive text embedding utilizes both labeled and unlabeled data to learn the embedding of text. The labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network, which is then embedded into a low dimensional space through a principled and efficient algorithm. This low dimensional embedding not only preserves the semantic closeness of words and documents, but also has a strong predictive power for the particular task. Compared to recent supervised approaches based on convolutional neural networks, predictive text embedding is comparable or more effective, much more efficient, and has fewer parameters to tune.
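
As a concrete reading of the network described above, the sketch below builds the three bipartite edge sets (word-word co-occurrence, word-document, word-label) from a toy labeled corpus. Only the construction is shown; the joint low-dimensional embedding of the network is omitted, and the corpus, labels, and window size are illustrative assumptions, not the paper's setup.

    # Construct the heterogeneous text network used by PTE-style methods (sketch only).
    from collections import Counter

    docs   = [["cheap", "flights", "online"], ["book", "cheap", "flights"]]
    labels = ["travel", "travel"]                       # toy task labels

    ww_edges, wd_edges, wl_edges = Counter(), Counter(), Counter()
    window = 2                                          # co-occurrence window (assumed)
    for d, (doc, label) in enumerate(zip(docs, labels)):
        for i, w in enumerate(doc):
            wd_edges[(w, f"doc{d}")] += 1               # word-document edges
            wl_edges[(w, label)] += 1                   # word-label edges (the supervision)
            for j in range(i + 1, min(i + 1 + window, len(doc))):
                ww_edges[(w, doc[j])] += 1              # word-word co-occurrence edges

    for name, edges in (("word-word", ww_edges), ("word-doc", wd_edges), ("word-label", wl_edges)):
        print(name, dict(edges))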

703 citations


Cites result from "Topical word embeddings"

  • ...Similar results to ours are also reported in [14]....


Journal ArticleDOI
TL;DR: This article extended two Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
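
As a rough, hedged sketch of the combination the abstract describes (the notation here is illustrative and reconstructed, not quoted from the article), the per-topic word distribution can be read as interpolating the usual Dirichlet-multinomial component with a latent-feature component built from a topic vector \tau_t and pre-trained word vectors \omega_w:

    P(w \mid t) \;=\; (1-\lambda)\,\mathrm{Mult}(w \mid \phi_t)
                 \;+\; \lambda\,\frac{\exp(\tau_t \cdot \omega_w)}{\sum_{w'} \exp(\tau_t \cdot \omega_{w'})}

Here \lambda balances the multinomial learnt on the small target corpus against the softmax term that injects information from word vectors trained on the large external corpora.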

276 citations

Proceedings ArticleDOI
04 Jun 2016
TL;DR: It is demonstrated that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks, suggesting that other tasks benefiting from global embeddings may also benefit from local embeddings.
Abstract: Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus and query specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.
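
The "local embeddings" idea can be sketched as follows: train word2vec only on the top documents returned by a first-pass retriever for a given query, then expand the query with nearest neighbours of its terms. The retriever, corpus, and hyperparameters below are assumptions for illustration, not the authors' exact setup.

    # Query-specific ("local") embedding sketch for query expansion.
    from gensim.models import Word2Vec

    def expand_query(query_terms, top_k_docs, n_expansions=5):
        """top_k_docs: tokenized documents from a first-pass retriever (not shown here)."""
        local_model = Word2Vec(top_k_docs, vector_size=100, window=5, min_count=2, sg=1)
        expansions = []
        for term in query_terms:
            if term in local_model.wv:                  # term may be too rare to survive min_count
                expansions += [w for w, _ in local_model.wv.most_similar(term, topn=n_expansions)]
        return list(dict.fromkeys(query_terms + expansions))   # de-duplicated expanded query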

265 citations


Cites background from "Topical word embeddings"

  • ...The problem can be addressed by training a global model with multiple vector embeddings per word (Reisinger and Mooney, 2010a; Huang et al., 2012) or topic-specific embeddings (Liu et al., 2015)....



Posted Content
TL;DR: Two different Dirichlet multinomial topic models are extended by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

251 citations


Cites background from "Topical word embeddings"

  • ...Recent approaches based on deep neural networks learn vectors by predicting words given their window-based context (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014; Liu et al., 2015)....


  • ... counts. Recent approaches based on deep neural networks learn vectors by predicting words given their window-based context (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014; Liu et al., 2015). Mikolov et al. (2013)’s method maximizes the log likelihood of each word given its context. Pennington et al. (2014) used back-propagation to minimize the squared error of a prediction of the logfre...


Journal ArticleDOI
TL;DR: BERTopic is presented, a topic model that extends the process of topic modeling by extracting coherent topic representations through the development of a class-based variation of TF-IDF.
Abstract: Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.
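
The pipeline described above (transformer-based document embeddings, clustering, then class-based TF-IDF) is packaged in the bertopic library; a minimal usage sketch, assuming the package is installed and docs is a reasonably large list of raw text strings, might look like this:

    # Minimal BERTopic sketch (pip install bertopic); `docs` here is only a placeholder list.
    from bertopic import BERTopic

    docs = ["the stock market fell sharply", "the team won the championship game", "..."]
    topic_model = BERTopic()                     # embeds, clusters, then applies class-based TF-IDF
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info().head())   # one row per discovered topic with its top terms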

222 citations

References
More filters
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
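
For orientation, a quick LDA run with gensim is sketched below (gensim uses variational inference; the citing snippet further down notes that the "Topical word embeddings" pipeline instead assigns topics with collapsed Gibbs sampling). The toy corpus and topic count are illustrative.

    # Toy LDA run: learn per-topic word distributions and per-document topic mixtures.
    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["apple", "phone", "screen"], ["apple", "pie", "sugar"], ["phone", "screen", "battery"]]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
    print(lda.print_topics())                    # per-topic word distributions
    print(lda.get_document_topics(bow[0]))       # topic mixture for the first document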

30,570 citations


"Topical word embeddings" refers methods in this paper

  • ...We employ the widely used latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan 2003) to obtain word topics, and perform collapsed Gibbs sampling (Griffiths and Steyvers 2004) to iteratively assign latent topics for each word token....


Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Proceedings Article
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
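
The ingredients listed in this abstract map directly onto standard word2vec settings; a hedged gensim sketch (toy sentences and thresholds, not the paper's configuration) is shown below: Skip-gram, negative sampling, subsampling of frequent words, and a phrase-detection pass so that collocations such as "air canada" become single tokens.

    # Skip-gram with negative sampling, frequent-word subsampling, and phrase detection (sketch).
    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    sentences = [["air", "canada", "delayed", "the", "flight"],
                 ["she", "booked", "an", "air", "canada", "ticket"]]

    bigrams = Phraser(Phrases(sentences, min_count=1, threshold=0.1))   # toy thresholds
    phrased = [bigrams[s] for s in sentences]                           # e.g. "air_canada"

    model = Word2Vec(phrased, sg=1, negative=5, sample=1e-5,
                     vector_size=100, window=5, min_count=1)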

24,012 citations


"Topical word embeddings" refers methods in this paper

  • ...The training objective of CBOW is to combine the embeddings of context words to predict the target word; while Skip-Gram is to use the embedding of each target word to predict its context words (Mikolov et al. 2013)....


  • ...We extend Skip-Gram (Mikolov et al. 2013), the state-of-the-art word embedding model, to implement our TWE models....


  • ...In previous work, the task of word similarity is always used to evaluate the performance of word embedding methods (Mikolov et al. 2013; Baroni, Dinu, and Kruszewski 2014)....


  • ...In order to make the model efficient for learning, the techniques of hierarchical softmax and negative sampling are used when learning Skip-Gram (Mikolov et al. 2013)....


  • ...Skip-Gram is a well-known framework for learning word vectors (Mikolov et al. 2013), as shown in Fig....


Journal ArticleDOI
01 Jan 1988-Nature
TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
Abstract: We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure1.
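
A tiny numpy illustration of the procedure described above is given below (toy XOR task; the architecture, seed, and learning rate are arbitrary choices, not taken from the paper): the weights are repeatedly adjusted to shrink the squared difference between the actual and desired output vectors, with the error propagated back through a hidden layer.

    # Back-propagation on a one-hidden-layer network (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

    W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))       # input -> hidden
    W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))       # hidden -> output
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    lr = 0.5

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)                  # hidden units
        out = sigmoid(h @ W2 + b2)                # actual output vector
        err = out - y                             # difference from the desired output
        grad_out = err * out * (1 - out)          # gradient through the output sigmoid
        grad_h = (grad_out @ W2.T) * h * (1 - h)  # propagated back through the hidden layer
        W2 -= lr * (h.T @ grad_out); b2 -= lr * grad_out.sum(axis=0, keepdims=True)
        W1 -= lr * (X.T @ grad_h);   b1 -= lr * grad_h.sum(axis=0, keepdims=True)

    print(out.round(2))   # typically close to [[0], [1], [1], [0]], depending on the random init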

23,814 citations


"Topical word embeddings" refers methods in this paper

  • ...Word embeddings, first proposed in (Rumelhart, Hinton, and Williams 1986), have been successfully used in language models (Bengio et al. 2006; Mnih and Hinton 2008) and many NLP tasks, such as named entity recognition (Turian, Ratinov, and Bengio 2010), disambiguation (Collobert et al. 2011) and…...


Journal ArticleDOI
TL;DR: WordNet provides a more effective combination of traditional lexicographic information and modern computing, and is an online lexical database designed for use under program control.
Abstract: Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
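
The homonymy and polysemy knowledge the citing snippet below refers to can be queried through NLTK's WordNet interface; a minimal lookup sketch, assuming nltk is installed and the wordnet corpus has been downloaded, is:

    # WordNet lookup via NLTK (assumes: pip install nltk; nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("bank"):             # one synset per sense of "bank"
        print(synset.name(), "-", synset.definition())
    print(wn.synsets("bank")[0].hypernyms())      # semantic relations between synsets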

15,068 citations


"Topical word embeddings" refers background in this paper

  • ...• There are many knowledge bases available, such as WordNet (Miller 1995), containing rich linguistic knowledge of homonymy and polysemy....
