
Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published on this topic, receiving 212,555 citations. The topic is also known as LDA.


Papers
Proceedings Article
03 Dec 2007
TL;DR: This paper starts with the PLSA framework and uses an entropic prior in a maximum a posteriori formulation to enforce sparsity and shows that this allows the extraction of overcomplete sets of latent components which better characterize the data.
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the "expressiveness" of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

84 citations
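To make the mechanics concrete, here is a minimal NumPy sketch of PLSA fit by EM, with a simple exponent-based sharpening step in the M-step standing in for the paper's entropic-prior MAP update (the exact update involves the Lambert W function and is omitted). The function name, the `sparsity` knob, and the toy data are illustrative assumptions, not the authors' formulation.

```python
# Minimal PLSA via EM on a document-word count matrix. The exponent
# applied in the M-step is a crude sparsity-encouraging stand-in for
# the paper's entropic prior, not the exact MAP update.
import numpy as np

rng = np.random.default_rng(0)

def plsa(counts, n_topics, n_iter=50, sparsity=1.05):
    n_docs, n_words = counts.shape
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|z)
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(z|d)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (docs, words, topics)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        weighted = counts[:, :, None] * resp
        # M-step: exponent > 1 sharpens (sparsifies) the distributions
        p_w_z = (weighted.sum(axis=0).T + 1e-12) ** sparsity
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = (weighted.sum(axis=1) + 1e-12) ** sparsity
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d

counts = rng.poisson(1.0, size=(20, 50)).astype(float)  # toy counts
topics, mixtures = plsa(counts, n_topics=8)
```

With the prior in place, the number of topics can exceed the usual rank limit of plain PLSA, which is the "overcomplete" regime the abstract refers to.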

Proceedings ArticleDOI
Bing Xiang, Liang Zhou
01 Jun 2014
TL;DR: The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.
Abstract: In this paper, we present multiple approaches to improve sentiment analysis on Twitter data. We first establish a state-of-the-art baseline with a rich feature set. Then we build a topic-based sentiment mixture model with topic-specific data in a semi-supervised training framework. The topic information is generated through topic modeling based on an efficient implementation of Latent Dirichlet Allocation (LDA). The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.

83 citations
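A hedged sketch of the general recipe, not the authors' system: derive per-tweet topic proportions with LDA and concatenate them with bag-of-words features for an SVM sentiment classifier. The toy tweets and labels are placeholders, and scikit-learn's LatentDirichletAllocation and LinearSVC stand in for the paper's efficient LDA implementation and their classifier.

```python
# Topic proportions from LDA used as extra features for an SVM
# sentiment classifier (illustrative pipeline with toy data).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

tweets = ["great phone battery", "terrible service today",
          "love this camera", "worst update ever"]
labels = [1, 0, 1, 0]  # toy sentiment labels

vec = CountVectorizer()
X = vec.fit_transform(tweets)

# Per-tweet topic mixtures act as topic-specific features.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_feats = lda.fit_transform(X)

features = np.hstack([X.toarray(), topic_feats])
clf = LinearSVC().fit(features, labels)
print(clf.predict(features))
```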

Journal ArticleDOI
01 Dec 2004
TL;DR: A new method, parametric embedding (PE), is proposed that embeds objects together with their class structure into a low-dimensional visualization space, providing insight into a classifier's behavior in supervised, semisupervised, and unsupervised settings.
Abstract: We propose a new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.

83 citations
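The abstract fully specifies the PE objective, so a small reimplementation sketch is possible: given class posteriors p(c|x) for each point, place point embeddings and class centers in 2-D so that equal-covariance Gaussian-mixture posteriors in the embedding match them, by gradient descent on the summed KL divergence. The step size, iteration count, and random posteriors below are illustrative assumptions, not the paper's settings.

```python
# Parametric embedding sketch: match embedding-space posteriors Q to
# given posteriors P by minimizing sum_n KL(P_n || Q_n).
import numpy as np

rng = np.random.default_rng(0)
N, C, D = 100, 3, 2                    # points, classes, embedding dims
P = rng.dirichlet(np.ones(C), size=N)  # given class posteriors p(c|x_n)

E = rng.normal(size=(N, D))            # point embeddings
Phi = rng.normal(size=(C, D))          # class centers
lr = 0.1

for _ in range(500):
    # Posteriors under an equal-covariance Gaussian mixture.
    d2 = ((E[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)  # (N, C)
    logits = -0.5 * d2
    Q = np.exp(logits - logits.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)
    # Analytic gradients of the summed KL divergence.
    diff = E[:, None, :] - Phi[None, :, :]                 # (N, C, D)
    grad_E = ((P - Q)[:, :, None] * diff).sum(axis=1)
    grad_Phi = ((Q - P)[:, :, None] * diff).sum(axis=0)
    E -= lr * grad_E
    Phi -= lr * grad_Phi
```

Each iteration costs O(N x C x D), consistent with the abstract's claim that complexity scales with the number of objects times the number of classes rather than with pairwise object relations.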

Journal ArticleDOI
TL;DR: A topic detection method based on paragraph vectors is proposed to accelerate citation screening for clinical and public health reviews, a task that otherwise requires expert reviewers to manually screen thousands of citations to identify all articles relevant to a review.

83 citations
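A hedged sketch of how paragraph vectors could accelerate screening (an illustrative pipeline, not the paper's exact method): embed each citation's title or abstract with gensim's Doc2Vec, then rank unscreened citations by similarity to one already judged relevant, so reviewers see the most promising citations first. The corpus is a toy placeholder.

```python
# Rank citations by paragraph-vector similarity to a known-relevant
# citation (doc 0). Toy abstracts; parameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

abstracts = ["randomized trial of drug a", "cohort study of exposure b",
             "trial of drug a in adults", "unrelated engineering paper"]
docs = [TaggedDocument(a.split(), [i]) for i, a in enumerate(abstracts)]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

# Most similar paragraph vectors come first, prioritizing the
# citations a reviewer should screen next.
ranked = model.dv.most_similar(0)
print(ranked)
```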

Proceedings ArticleDOI
22 Feb 2012
TL;DR: Three approaches to automating bug report categorization are investigated: an approach similar to previous ones based on an SVM classifier and Term Frequency-Inverse Document Frequency (svm-tf-idf), an approach using Latent Dirichlet Allocation (LDA) with SVM (svm-lda), and an approach using LDA and Kullback-Leibler divergence (lda-kl).
Abstract: Software developers, particularly in open-source projects, rely on bug repositories to organize their work. On a bug report, the component field is used to indicate to which team of developers a bug should be routed. Researchers have shown that incorrect categorization of newly received bug reports to components can delay the resolution of those reports. Approaches have been developed that use machine learning, specifically Support Vector Machines (SVM), to automatically categorize bug reports into the appropriate component and help streamline the process of solving a bug. One drawback of an SVM-based approach is that the results of categorization can be uneven across components in the system if some components receive fewer reports than others. In this paper, we consider improving the consistency of the recommendations produced by an automatic approach by investigating three approaches to automating bug report categorization: an approach similar to previous ones based on an SVM classifier and Term Frequency-Inverse Document Frequency (svm-tf-idf), an approach using Latent Dirichlet Allocation (LDA) with SVM (svm-lda), and an approach using LDA and Kullback-Leibler divergence (lda-kl). We found that lda-kl produced recalls similar to those found previously but with better consistency across all components for which bugs must be categorized.

83 citations
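A minimal sketch of the lda-kl idea as described in the abstract: fit LDA over bug reports, represent each component by the mean topic distribution of its past reports, and route a new report to the component whose distribution has the smallest KL divergence from the report's. The toy data, vectorization, and averaging choice are assumptions, not the paper's exact pipeline.

```python
# Route bug reports to components by KL divergence between LDA topic
# mixtures (illustrative sketch with toy data).
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reports = ["ui button misaligned", "crash in network layer",
           "timeout connecting to server", "font rendering glitch"]
components = ["ui", "net", "net", "ui"]

vec = CountVectorizer()
X = vec.fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)  # per-report topic mixtures

# Mean topic distribution per component as its profile.
profiles = {c: theta[[i for i, k in enumerate(components) if k == c]].mean(0)
            for c in set(components)}

def route(text):
    t = lda.transform(vec.transform([text]))[0]
    return min(profiles, key=lambda c: entropy(t, profiles[c]))

print(route("server connection drops"))
```

Because every component gets a profile regardless of how many reports it has received, this kind of divergence-based routing is one plausible reason the paper reports more consistent recall across components than the SVM baselines.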


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446