Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published on this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Posted Content
TL;DR: Parallel runs of MCMC, variational, or mode-based inference are used to hit as many modes or separated regions as possible, and these are then combined using importance-sampling-based Bayesian stacking, a scalable method for constructing a weighted average of distributions that maximizes cross-validated prediction utility.
Abstract: When working with multimodal Bayesian posterior distributions, Markov chain Monte Carlo (MCMC) algorithms can have difficulty moving between modes, and default variational or mode-based approximate inferences will understate posterior uncertainty. And, even if the most important modes can be found, it is difficult to evaluate their relative weights in the posterior. Here we propose an alternative approach, using parallel runs of MCMC, variational, or mode-based inference to hit as many modes or separated regions as possible, and then combining these using importance-sampling-based Bayesian stacking, a scalable method for constructing a weighted average of distributions so as to maximize cross-validated prediction utility. The result from stacking is not necessarily equivalent, even asymptotically, to fully Bayesian inference, but it serves many of the same goals. Under misspecified models, stacking can give better predictive performance than full Bayesian inference, so the multimodality can be considered a blessing rather than a curse. We explore an example where the stacked inference approximates the true data-generating process from a misspecified model, an example of inconsistent inference, and non-mixing samplers. We elaborate the practical implementation in the context of latent Dirichlet allocation, Gaussian process regression, hierarchical models, variational inference in horseshoe regression, and neural networks.
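The stacking step described above can be illustrated compactly. The sketch below fits simplex weights that maximize the cross-validated log predictive density of the combined distribution; the array names, the softmax parameterization and the toy data are assumptions for illustration, not code from the paper.

```python
# Minimal sketch of stacking weights over K parallel inference runs.
# Assumes lpd[k, i] holds the (cross-validated) log predictive density of
# data point i under the posterior approximation from run k; these names
# and the softmax parameterization are illustrative, not from the paper.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def stacking_weights(lpd: np.ndarray) -> np.ndarray:
    """Return simplex weights w maximizing sum_i log sum_k w_k exp(lpd[k, i])."""
    K, n = lpd.shape

    def neg_obj(z):                      # unconstrained parameterization
        w = softmax(np.append(z, 0.0))   # last weight pinned for identifiability
        # log of the stacked predictive density at each held-out point
        log_mix = logsumexp(lpd + np.log(w)[:, None], axis=0)
        return -log_mix.sum()

    res = minimize(neg_obj, np.zeros(K - 1), method="L-BFGS-B")
    return softmax(np.append(res.x, 0.0))

# Toy usage: three runs, 100 points; run 0 fits the data best.
rng = np.random.default_rng(0)
lpd = np.stack([rng.normal(-1.0, 0.3, 100),
                rng.normal(-1.5, 0.3, 100),
                rng.normal(-2.0, 0.3, 100)])
print(stacking_weights(lpd))             # weights summing to 1, largest on run 0
```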

28 citations

Journal ArticleDOI
TL;DR: The proposed DPLSA model provides a novel method to initialize the word distributions for different topics, using previously assigned bug reports from the same component during model training to increase the discriminative power of the topics, which is useful for the recommendation task.
Abstract: Context: The component field in a bug report provides important location information required by developers during bug fixes. Research has shown that incorrect component assignment for a bug report often causes problems and delays in bug fixes. A topic model technique, Latent Dirichlet Allocation (LDA), has been used to create a component recommender for bug reports. Objective: We seek to investigate a better way to use topic modeling in creating a component recommender. Method: This paper presents a component recommender that uses the proposed Discriminative Probability Latent Semantic Analysis (DPLSA) model and Jensen–Shannon divergence (DPLSA-JS). The proposed DPLSA model provides a novel method to initialize the word distributions for different topics: it uses previously assigned bug reports from the same component in the model training step, which results in a correlation between the learned topics and the components. Results: We evaluate the proposed approach on five open source projects: Mylyn, Gcc, Platform, Bugzilla and Firefox. The results show that the proposed approach on average outperforms the LDA-KL method by 30.08%, 19.60% and 14.13%, and the LDA-SVM method by 31.56%, 17.80% and 8.78%, for recall@1, recall@3 and recall@5, respectively. Conclusion: We find that using comments in the DPLSA-JS recommender does not always contribute to performance. The vocabulary size matters in DPLSA-JS, and different projects need to set it adaptively based on experimentation. In addition, the correspondence between the learned topics and components in DPLSA increases the discriminative power of the topics, which is useful for the recommendation task.
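The core idea in the abstract, component-specific word distributions estimated from previously assigned bug reports and ranked by Jensen–Shannon divergence, can be sketched as follows. This is a simplified illustration on invented toy data, not the authors' DPLSA implementation.

```python
# Simplified sketch of the idea in the abstract: estimate one word distribution
# per component from its past bug reports, then rank components for a new report
# by Jensen-Shannon divergence. This is an illustration, not the DPLSA model itself.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import jensenshannon

past_reports = ["ui button rendering broken", "ui layout overlaps toolbar",
                "compiler crash on template code", "compiler segfault optimizing loop"]
components = ["UI", "UI", "Compiler", "Compiler"]

vec = CountVectorizer()
X = vec.fit_transform(past_reports).toarray().astype(float)

# One smoothed word distribution per component (this plays the role of the
# component-specific topic initialization described above).
names = sorted(set(components))
dists = {c: X[[i for i, comp in enumerate(components) if comp == c]].sum(0) + 1.0
         for c in names}
dists = {c: v / v.sum() for c, v in dists.items()}

def recommend(report: str, top_k: int = 2):
    q = vec.transform([report]).toarray().ravel() + 1.0
    q /= q.sum()
    ranked = sorted(names, key=lambda c: jensenshannon(q, dists[c]))
    return ranked[:top_k]

print(recommend("toolbar button not rendered"))   # 'UI' should rank first
```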

28 citations

Journal ArticleDOI
01 Jan 2018
TL;DR: Latent Semantic Analysis and Latent Dirichlet Allocation were used to identify themes in a database of text about railroad equipment accidents maintained by the Federal Railroad Administration in the United States, and it was found that the use of the two techniques was complementary, with more accident topics identified than with a single method.
Abstract: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) were used to identify themes in a database of text about railroad equipment accidents maintained by the Federal Railroad Administration in the United States. These text mining techniques use different mechanisms to identify topics. LDA and LSA identified switching accidents, hump yard accidents and grade crossing accidents as major accident type topics. LSA identified accidents with track maintenance equipment as a topic. Both text mining models identified accidents with tractor-trailer highway trucks as a particular problem at grade crossings. It was found that the use of the two techniques was complementary, with more accident topics identified than with the use of a single method.
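A minimal sketch of running both techniques side by side with scikit-learn is shown below; the three toy narratives and the two-topic setting are placeholders, not the FRA accident database used in the study.

```python
# Hedged sketch: LSA (truncated SVD on TF-IDF) and LDA (on raw counts) over
# accident-style narratives; corpus and topic count are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

narratives = [
    "car derailed during switching move in yard",
    "cut of cars rolled out of hump yard and struck standing equipment",
    "tractor trailer truck struck at highway grade crossing",
]

def top_terms(components, terms, n=5):
    return [[terms[i] for i in row.argsort()[::-1][:n]] for row in components]

# LSA: TF-IDF matrix factorized with truncated SVD
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(narratives)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X_tfidf)
print("LSA topics:", top_terms(lsa.components_, tfidf.get_feature_names_out()))

# LDA: raw counts modeled with latent Dirichlet allocation
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(narratives)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)
print("LDA topics:", top_terms(lda.components_, counts.get_feature_names_out()))
```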

28 citations

Book ChapterDOI
Liang Yao, Yin Zhang, Baogang Wei, Hongze Qian, Yibing Wang
19 May 2015
TL;DR: This paper combines latent Dirichlet allocation, a widely used topic model, with Probase, a large-scale probabilistic knowledge base, to significantly improve semantic coherence; evaluation results demonstrate the effectiveness of the method.
Abstract: Probabilistic topic models can be used to extract low-dimensional aspects from document collections. However, such models without any human knowledge often produce aspects that are not interpretable. In recent years, a number of knowledge-based models have been proposed, which allow the user to input prior knowledge of the domain to produce more coherent and meaningful topics. In this paper, we incorporate human knowledge in the form of a probabilistic knowledge base into topic models. By combining latent Dirichlet allocation, a widely used topic model, with Probase, a large-scale probabilistic knowledge base, we improve semantic coherence significantly. Our evaluation results demonstrate the effectiveness of our method.
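The abstract does not spell out how Probase is injected into LDA, so the sketch below shows one plausible way to encode knowledge-base concept-word weights as an asymmetric topic-word prior using gensim's eta matrix; the tiny corpus, concepts and weight values are invented for illustration and are not the paper's method.

```python
# One possible way (not the paper's exact method) to push knowledge-base
# information into LDA: boost the topic-word prior for words that the KB
# groups under the same concept (gensim's `eta` accepts a topics x terms matrix).
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["apple", "banana", "fruit", "vitamin"],
        ["python", "java", "compiler", "code"],
        ["fruit", "apple", "juice"],
        ["code", "python", "bug"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

num_topics, base = 2, 0.01
eta = np.full((num_topics, len(dictionary)), base)

# Hypothetical KB output: words sharing a concept get extra prior mass in the
# topic reserved for that concept.
kb_concepts = {0: ["apple", "banana", "fruit", "juice"],      # "food" concept
               1: ["python", "java", "compiler", "code"]}     # "programming" concept
for topic, words in kb_concepts.items():
    for w in words:
        eta[topic, dictionary.token2id[w]] = 1.0

lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
               eta=eta, passes=50, random_state=0)
for t in range(num_topics):
    print(lda.print_topic(t, topn=4))
```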

28 citations

Journal ArticleDOI
TL;DR: A machine learning for language toolkit was used to obtain posterior topic-word distributions and word compositions, identifying the distribution of topics and themes, the trends of topics and themes, journal distribution trends, and comparative topic, theme and journal distribution trends.
Abstract: The purpose of this study is to discover the distribution and trends of the existing offsite construction (OSC) literature, with the intention of highlighting research niches and proposing a future outline. The paper adopted a literature review methodology involving 1,057 relevant documents published from 2008 to 2017 in 15 journals. The selected documents were empirically analyzed through a topic-modeling technique. A latent Dirichlet allocation model was applied to each document to infer 50 key topics, and a machine learning for language toolkit was used to obtain posterior topic-word distributions and word compositions. This is an exploratory study that identifies the distribution of topics and themes; the trends of topics and themes; journal distribution trends; and comparative topic, theme and journal distribution trends. The distributions and trends show an increase in researchers' interest in, and journals' prioritization of, OSC research. Nevertheless, the existing OSC literature contains under-researched topics such as building information modeling, smart construction and marketing; the under-researched themes include organizational management, supply chain and context. The authors also found an overload of similar information in the prefabrication and concrete topics. Furthermore, the innovative methods and constraints themes were found to be overloaded with similar information. The naming of the themes was based on our own interpretation; hence, the research results may lack generalizability. Therefore, a comparative study using different data processing is proposed. The study also provides a future research outline: studying OSC topics from a dynamic evolution perspective and identifying newly emerging topics; searching for effective strategies to enhance OSC research; identifying the contributions of countries, affiliations and funding agencies; and studying the impact of these themes on the adoption of OSC. This study is of value to scholars, as it could stimulate research in under-researched areas. This paper justifies the need for a broad understanding of the nature and structure of the existing OSC literature.
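The topic-trend part of the analysis can be sketched briefly: fit LDA on abstracts, then average each document's topic mixture by publication year. The corpus, years and three-topic setting below are placeholders, not the 1,057-document OSC corpus or the 50-topic model used in the paper.

```python
# Minimal sketch of a topic-trend analysis: per-document topic mixtures from
# LDA, averaged by publication year. All data here are invented placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "prefabricated concrete module assembly on site",
    "building information modeling for offsite design coordination",
    "supply chain constraints in modular construction logistics",
    "smart construction sensors for prefabrication quality",
]
years = np.array([2010, 2013, 2015, 2017])

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)          # rows sum to 1: per-document topic mixture

# Average topic share per year -> a simple topic-trend table
for year in np.unique(years):
    share = doc_topics[years == year].mean(axis=0)
    print(year, np.round(share, 2))
```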

28 citations


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations (86% related)
Support vector machine: 73.6K papers, 1.7M citations (86% related)
Deep learning: 79.8K papers, 2.1M citations (85% related)
Feature extraction: 111.8K papers, 2.1M citations (84% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446