Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. To date, 5351 publications have appeared within this topic, receiving 212555 citations. The topic is also known as: LDA.


Papers
Proceedings Article
TL;DR: In this paper, the authors propose a method that combines replicated LDA runs, clustering, and a stability metric for the topics. The method makes LDA stability transparent and is complementary rather than alternative to many prior works that focus on LDA parameter tuning.
Abstract: Background: Unstructured and textual data is increasing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular data analysis method for it. Past work suggests that instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs, clustering, and providing a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. Then we use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics, showing the ten most probable words. For the clusters, we try multiple stability metrics, out of which we recommend Rank-Biased Overlap, showing the stability of the topics inside the clusters. Results: We provide an initial validation where our method is used for 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent and is also complementary rather than alternative to many prior works that focus on LDA parameter tuning.

29 citations
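
The stability step described in the abstract above can be illustrated with a minimal sketch. It assumes each topic in a cluster is represented by its ranked list of most probable words and scores the cluster by the mean pairwise Rank-Biased Overlap (RBO). The extrapolated RBO form for equal-length rankings, the persistence parameter p=0.9, and the function names rbo_ext and cluster_stability are assumptions made for illustration; the abstract does not state which RBO variant or parameters the authors use.

```python
from itertools import combinations


def rbo_ext(s, t, p=0.9):
    """Extrapolated Rank-Biased Overlap of two equal-length rankings.

    Assumes neither ranking contains duplicate items. Returns a value
    in [0, 1]: 1.0 for identical rankings, 0.0 for disjoint ones.
    """
    if len(s) != len(t) or not s:
        raise ValueError("rankings must be non-empty and of equal length")
    k = len(s)
    seen_s, seen_t = set(), set()
    overlap = 0      # size of the intersection of the two prefixes
    weighted = 0.0   # running sum of (overlap / depth) * p**depth
    for depth in range(1, k + 1):
        a, b = s[depth - 1], t[depth - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if the other prefix already holds it.
            overlap += (a in seen_t) + (b in seen_s)
        seen_s.add(a)
        seen_t.add(b)
        weighted += overlap / depth * p ** depth
    return overlap / k * p ** k + (1 - p) / p * weighted


def cluster_stability(topics, p=0.9):
    """Mean pairwise RBO over the ranked word lists in one cluster."""
    if len(topics) < 2:
        raise ValueError("need at least two topics to compare")
    pairs = list(combinations(topics, 2))
    return sum(rbo_ext(s, t, p) for s, t in pairs) / len(pairs)


# Hypothetical top-word lists from replicated runs (illustrative only).
stable = [["bug", "fix", "crash"], ["bug", "fix", "crash"], ["bug", "fix", "crash"]]
mixed = [["bug", "fix", "crash"], ["fix", "bug", "test"], ["ui", "theme", "icon"]]
print(cluster_stability(stable), cluster_stability(mixed))
```

A cluster whose member topics agree on their highest-ranked words scores close to 1, while a cluster mixing unrelated topics scores lower, because RBO weights agreement near the top of the ranking most heavily.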

Journal Article
TL;DR: This paper obtains the most general family of prior-posterior distributions that is conjugate to a Dirichlet likelihood, identifies those hyperparameters that are influenced by data values, and describes some methods to assess the prior hyperparameters.
Abstract: In this paper we analyze the problem of learning and updating of uncertainty in Dirichlet models, where updating refers to determining the conditional distribution of a single variable when some evidence is known. We first obtain the most general family of prior-posterior distributions which is conjugate to a Dirichlet likelihood and we identify those hyperparameters that are influenced by data values. Next, we describe some methods to assess the prior hyperparameters and we give a numerical method to estimate the Dirichlet parameters in a Bayesian context, based on the posterior mode. We also give formulas for updating uncertainty by determining the conditional probabilities of single variables when the values of other variables are known. A time series approach is presented for dealing with the cases in which samples are not identically distributed, that is, the Dirichlet parameters change from sample to sample. This typically occurs when the population is observed at different times. Finally, two examples are given that illustrate the learning and updating processes and the time series approach.

29 citations

Journal Article
TL;DR: In this paper, a new prior process, called a beta-Dirichlet process, is introduced for the cumulative intensity functions and is proved to be conjugate; the prior is then applied to a Bayesian semiparametric regression model.
Abstract: Bayesian analysis of a finite state Markov process, which is popularly used to model multistate event history data, is considered. A new prior process, called a beta-Dirichlet process, is introduced for the cumulative intensity functions and is proved to be conjugate. In addition, the beta-Dirichlet prior is applied to a Bayesian semiparametric regression model. To illustrate the application of the proposed model, we analyse a dataset of credit histories. Copyright 2012, Oxford University Press.

29 citations

Journal Article
TL;DR: A new probabilistic model is presented that exploits healthcare chat logs to find hidden topics and changes in these topics over time and shows that the performance of the proposed model exceeds that of the benchmark models.

29 citations

Journal Article
TL;DR: This article considers a Bayesian semiparametric SEM with covariates, and mixed continuous and unordered categorical variables, in which the explanatory latent variables in the structural equation are modeled via an appropriate truncated Dirichlet process with a stick-breaking procedure.
Abstract: Recently, structural equation models (SEMs) have been applied for analyzing interrelationships among observed and latent variables in biological and medical research. Latent variables in these models are typically assumed to have a normal distribution. This article considers a Bayesian semiparametric SEM with covariates, and mixed continuous and unordered categorical variables, in which the explanatory latent variables in the structural equation are modeled via an appropriate truncated Dirichlet process with a stick-breaking procedure. Results obtained from a simulation study and an analysis of a real medical data set are presented to illustrate the methodology.

29 citations
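
For readers unfamiliar with the stick-breaking procedure mentioned in the abstract above, the following is a minimal sketch of the standard truncated construction of Dirichlet process mixture weights. It is not taken from the article: the concentration parameter alpha, the truncation level num_components, and the function name are assumptions for illustration. Each break fraction is drawn from Beta(1, alpha), and the final component takes whatever remains of the stick so that the weights sum to one.

```python
import random


def truncated_stick_breaking(alpha, num_components, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha: concentration parameter (> 0); smaller values concentrate
        mass on the first few components.
    num_components: truncation level (>= 1).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if alpha <= 0 or num_components < 1:
        raise ValueError("alpha must be > 0 and num_components >= 1")
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(num_components - 1):
        fraction = rng.betavariate(1.0, alpha)  # v_k ~ Beta(1, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last component absorbs the remainder
    return weights


weights = truncated_stick_breaking(alpha=2.0, num_components=10, rng=random.Random(0))
print(len(weights), sum(weights))
```

Giving the last component the leftover mass is the usual way to make the truncated weights a proper probability vector; the truncation level controls how closely the finite mixture approximates the full process.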


Network Information
Related Topics (5)
Cluster analysis
146.5K papers, 2.9M citations
86% related
Support vector machine
73.6K papers, 1.7M citations
86% related
Deep learning
79.8K papers, 2.1M citations
85% related
Feature extraction
111.8K papers, 2.1M citations
84% related
Convolutional neural network
74.7K papers, 2M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446