Topic
Latent Dirichlet allocation
About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications on this topic have received 212,555 citations. The topic is also known as: LDA.
Papers published on a yearly basis
Papers
03 Dec 2007
TL;DR: This paper starts with the PLSA framework and uses an entropic prior in a maximum a posteriori formulation to enforce sparsity and shows that this allows the extraction of overcomplete sets of latent components which better characterize the data.
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the "expressiveness" of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
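The entropic-prior idea in the abstract above can be illustrated on a single multinomial: adding a prior that favors low entropy, p(theta) proportional to exp(-alpha*H(theta)), yields a sparser MAP estimate than plain maximum likelihood. Below is a minimal numpy sketch using brute-force grid search over a three-outcome simplex; the counts, the value of alpha, and the grid-search approach are illustrative assumptions, not the paper's actual experiments or optimization method.

```python
import numpy as np

def entropy(theta):
    # Shannon entropy in nats; theta is assumed strictly positive here
    return float(-np.sum(theta * np.log(theta)))

def map_estimate(counts, alpha, step=0.005):
    # Maximize  sum_i counts_i*log(theta_i) + alpha*sum_i theta_i*log(theta_i)
    # over the 3-outcome probability simplex by brute-force grid search.
    # alpha > 0 implements an entropic prior p(theta) ~ exp(-alpha*H(theta)),
    # which biases the MAP estimate toward low-entropy (sparse) solutions.
    best, best_val = None, -np.inf
    for t1 in np.arange(step, 1.0, step):
        for t2 in np.arange(step, 1.0 - t1, step):
            t3 = 1.0 - t1 - t2
            if t3 < step:
                continue
            th = np.array([t1, t2, t3])
            val = counts @ np.log(th) + alpha * np.sum(th * np.log(th))
            if val > best_val:
                best_val, best = val, th
    return best

counts = np.array([8.0, 1.0, 1.0])           # toy observation counts
theta_ml = map_estimate(counts, alpha=0.0)   # maximum likelihood: ~[0.8, 0.1, 0.1]
theta_map = map_estimate(counts, alpha=3.0)  # entropic MAP: sparser, lower entropy
```

With the entropic prior switched on, the estimate concentrates more mass on the dominant outcome, which is the sparsity effect the paper exploits at the scale of full PLSA component matrices.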
84 citations
IBM
TL;DR: The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.
Abstract: In this paper, we present multiple approaches to improve sentiment analysis on Twitter data. We first establish a state-of-the-art baseline with a rich feature set. Then we build a topic-based sentiment mixture model with topic-specific data in a semi-supervised training framework. The topic information is generated through topic modeling based on an efficient implementation of Latent Dirichlet Allocation (LDA). The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.
83 citations
01 Dec 2004
TL;DR: A new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings is proposed.
Abstract: We propose a new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.
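The PE objective described above (match the given class posteriors with those induced by an equal-covariance Gaussian mixture in the embedding, minimizing a sum of KL divergences) can be sketched with plain gradient descent; the gradients take an SNE-like (p - q)(y - phi) form. Everything below, including the toy class posteriors, learning rate, and iteration count, is an illustrative assumption rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input: class posteriors p(c|x_n) for 6 objects and 2 classes (toy data)
P = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
N, C = P.shape

Y = rng.normal(scale=0.1, size=(N, 2))    # object coordinates in the embedding
Phi = rng.normal(scale=0.1, size=(C, 2))  # class-center coordinates

def q_probs(Y, Phi):
    # q(c|x_n) induced by a unit-covariance Gaussian mixture in the embedding
    d = ((Y[:, None, :] - Phi[None, :, :]) ** 2).sum(-1) / 2.0
    w = np.exp(-d)
    return w / w.sum(axis=1, keepdims=True)

def objective(Y, Phi):
    # Sum over objects of KL(p(.|x_n) || q(.|x_n))
    return float(np.sum(P * np.log(P / q_probs(Y, Phi))))

lr = 0.05
kl_start = objective(Y, Phi)
for _ in range(1000):
    D = P - q_probs(Y, Phi)                  # (N, C) residuals
    diff = Y[:, None, :] - Phi[None, :, :]   # (N, C, 2) pairwise offsets
    Y -= lr * np.einsum('nc,ncd->nd', D, diff)    # dL/dy_n = sum_c (p-q)(y_n - phi_c)
    Phi -= lr * np.einsum('nc,ncd->cd', D, -diff)  # dL/dphi_c = sum_n (p-q)(phi_c - y_n)
kl_end = objective(Y, Phi)
```

Each iteration touches an N-by-C array, matching the abstract's claim that the cost scales with the product of the number of objects and the number of classes rather than with pairwise object relations.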
83 citations
TL;DR: A topic detection method based on paragraph vectors is proposed to accelerate citation screening in clinical and public health reviews, a task in which expert reviewers must otherwise manually screen thousands of citations to biomedical journal articles to identify all articles relevant to the review.
83 citations
22 Feb 2012
TL;DR: Three approaches to automating bug report categorization are investigated: an approach similar to previous ones based on an SVM classifier and Term Frequency-Inverse Document Frequency (svm-tf-idf), an approach using Latent Dirichlet Allocation (LDA) with SVM (svm-lda), and an approach using LDA and Kullback-Leibler divergence (lda-kl).
Abstract: Software developers, particularly in open-source projects, rely on bug repositories to organize their work. On a bug report, the component field is used to indicate to which team of developers a bug should be routed. Researchers have shown that incorrect categorization of newly received bug reports to components can cause potential delays in the resolution of bug reports. Approaches have been developed that consider the use of machine learning approaches, specifically Support Vector Machines (svm), to automatically categorize bug reports into the appropriate component to help streamline the process of solving a bug. One drawback of an SVM-based approach is that the results of categorization can be uneven across various components in the system if some components receive fewer reports than others. In this paper, we consider broadening the consistency of the recommendations produced by an automatic approach by investigating three approaches to automating bug report categorization: an approach similar to previous ones based on an SVM classifier and Term Frequency-Inverse Document Frequency (svm-tf-idf), an approach using Latent Dirichlet Allocation (LDA) with SVM (svm-lda), and an approach using LDA and Kullback-Leibler divergence (lda-kl). We found that lda-kl produced recalls similar to those found previously but with better consistency across all components for which bugs must be categorized.
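The lda-kl route described above can be sketched as follows, assuming the LDA topic distributions for each bug report have already been inferred: build one mean topic profile per component from labelled reports, then route a new report to the component whose profile is closest in KL divergence. The component names, topic vectors, and the use of centroid profiles are hypothetical illustrations, not details taken from the paper.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between two discrete topic distributions;
    # eps guards against zeros in either distribution
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Hypothetical LDA topic distributions for labelled training bug reports,
# grouped by the component they were routed to.
train = {
    "ui":   np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
    "core": np.array([[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]),
}

# One topic profile per component: the mean of its reports' distributions.
centroids = {c: v.mean(axis=0) for c, v in train.items()}

def categorize(report_topics):
    # Route the report to the component whose profile is nearest in KL divergence.
    return min(centroids, key=lambda c: kl(report_topics, centroids[c]))

new_report = np.array([0.65, 0.25, 0.10])  # topic mix of an incoming bug report
component = categorize(new_report)         # -> "ui"
```

Because every report is compared against a fixed per-component profile rather than a learned decision boundary, components with few training reports are not systematically disadvantaged, which is one plausible reading of the consistency advantage the abstract reports for lda-kl.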
83 citations