scispace - formally typeset
Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Journal ArticleDOI
TL;DR: This work considers the application of the minimum message length (MML) principle to determine the number of clusters in a finite mixture model based on the generalized Dirichlet distribution.
Abstract: We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model that best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of Web pages, and texture database summarization for efficient retrieval.
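The generalized Dirichlet distribution described above can be sampled by stick-breaking with independent Beta draws, which is also what gives it a more flexible covariance structure than the standard Dirichlet (each coordinate gets its own Beta pair). A minimal numpy sketch, not the paper's implementation; parameter values here are arbitrary illustrations:

```python
import numpy as np

def sample_generalized_dirichlet(alpha, beta, size, rng=None):
    """Draw samples from a generalized Dirichlet distribution.

    A (d+1)-dimensional generalized Dirichlet variate is built from d
    independent Beta(alpha[i], beta[i]) draws: p_1 = v_1 and
    p_i = v_i * prod_{j<i} (1 - v_j), with the last coordinate taking the
    remaining mass. Each coordinate has its own (alpha, beta) pair, which
    gives a more general covariance structure than the standard Dirichlet.
    """
    rng = np.random.default_rng(rng)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    v = rng.beta(alpha, beta, size=(size, len(alpha)))   # independent Betas
    remaining = np.cumprod(1.0 - v, axis=1)              # mass left after each split
    p = np.empty((size, len(alpha) + 1))
    p[:, 0] = v[:, 0]
    p[:, 1:-1] = v[:, 1:] * remaining[:, :-1]
    p[:, -1] = remaining[:, -1]                          # leftover mass
    return p

samples = sample_generalized_dirichlet([2.0, 3.0, 1.5], [5.0, 2.0, 4.0],
                                       size=1000, rng=0)
# each row is a point on the 4-dimensional probability simplex
```

Setting beta[i] = alpha[i+1] + beta[i+1] recovers the standard Dirichlet as a special case, which is why the generalized form is strictly more expressive.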

156 citations

Journal ArticleDOI
TL;DR: A novel hierarchical LDA-RF (latent Dirichlet allocation-random forest) model is proposed to predict human protein-protein interactions directly from protein primary sequences; it achieves a high success rate and handles large-scale data sets well by uncovering the hidden internal structures buried in noisy amino acid sequences in a low-dimensional latent semantic space.
Abstract: Protein−protein interaction (PPI) is at the core of the entire interactomic system of any living organism. Although many human protein−protein interaction links have been experimentally determined, their number is still small compared to the estimated ∼300 000 protein−protein interactions in human beings. Hence, it remains urgent and challenging to develop automated computational methods to accurately and efficiently predict protein−protein interactions. In this paper, we propose a novel hierarchical LDA-RF (latent Dirichlet allocation-random forest) model to predict human protein−protein interactions directly from protein primary sequences; it achieves a high success rate and handles large-scale data sets well by uncovering the hidden internal structures buried in noisy amino acid sequences in a low-dimensional latent semantic space. First, the local sequential features represented by conjoint triads are constructed from sequences. Then the gene...
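The conjoint triad features mentioned in the abstract map each amino acid to one of seven classes and count every 3-gram of classes, giving a fixed 7³ = 343-dimensional vector per sequence. A small sketch of the idea; the 7-class grouping below is one commonly used scheme (by dipole and side-chain volume) and may differ in detail from the paper's:

```python
from itertools import product

# One commonly used 7-class grouping of the 20 amino acids (by dipole and
# side-chain volume); the exact grouping used in the paper may differ.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: i for i, g in enumerate(GROUPS) for aa in g}

# All 7^3 = 343 possible class triads, in a fixed order.
TRIADS = list(product(range(7), repeat=3))
TRIAD_INDEX = {t: i for i, t in enumerate(TRIADS)}

def conjoint_triad(seq):
    """Encode a protein sequence as normalized frequencies of its 343 class triads."""
    classes = [AA_TO_CLASS[aa] for aa in seq if aa in AA_TO_CLASS]
    counts = [0] * 343
    for i in range(len(classes) - 2):        # slide a window of 3 residues
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical example sequence for illustration only.
features = conjoint_triad("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Vectors like this are what a downstream model (here, LDA followed by a random forest) consumes, which is why the encoding must be fixed-length regardless of sequence length.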

156 citations

Journal ArticleDOI
TL;DR: The aim of the paper is to enable the use of topic modelling for researchers by presenting a step-by-step framework on a case and sharing a code template, which enables huge amounts of papers to be reviewed in a transparent, reliable, faster, and reproducible way.
Abstract: Manual exploratory literature reviews should be a thing of the past, as technology and machine learning methods have matured. The learning curve for using machine learning methods is rapidly declining, enabling new possibilities for all researchers. A framework is presented on how to use topic modelling on a large collection of papers for an exploratory literature review and how that can be used for a full literature review. The aim of the paper is to enable the use of topic modelling by researchers, by presenting a step-by-step framework on a case and sharing a code template. The framework consists of three steps: pre-processing, topic modelling, and post-processing, where the topic model Latent Dirichlet Allocation is used. The framework enables large numbers of papers to be reviewed in a transparent, reliable, faster, and reproducible way.
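The three-step framework in this abstract (pre-processing, topic modelling, post-processing) can be illustrated end to end with a tiny collapsed Gibbs sampler for LDA. This is a minimal self-contained sketch on toy documents, not the paper's code template; the documents and hyperparameters are invented for illustration:

```python
import numpy as np

# --- Step 1: pre-processing — tokenize and build a vocabulary ---
docs = [
    "topic models find latent topics in text",
    "gibbs sampling infers topic assignments",
    "neural networks learn image features",
    "convolution layers extract image features",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
word_id = {w: i for i, w in enumerate(vocab)}

# --- Step 2: topic modelling — a tiny collapsed Gibbs sampler for LDA ---
K, V, D = 2, len(vocab), len(tokens)
alpha, beta = 0.1, 0.01
rng = np.random.default_rng(0)

z = [rng.integers(K, size=len(doc)) for doc in tokens]   # topic assignments
ndk = np.zeros((D, K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
for d, doc in enumerate(tokens):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, word_id[w]] += 1; nk[k] += 1

for _ in range(200):                                     # Gibbs sweeps
    for d, doc in enumerate(tokens):
        for i, w in enumerate(doc):
            k, wid = z[d][i], word_id[w]
            ndk[d, k] -= 1; nkw[k, wid] -= 1; nk[k] -= 1
            # conditional posterior over topics for this token
            p = (ndk[d] + alpha) * (nkw[:, wid] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, wid] += 1; nk[k] += 1

# --- Step 3: post-processing — inspect top words per topic ---
phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
top_words = [[vocab[i] for i in np.argsort(-phi[k])[:3]] for k in range(K)]
```

In practice one would use a library implementation (e.g. gensim or scikit-learn) rather than hand-rolled Gibbs sampling, but the three stages map one-to-one onto the framework described in the abstract.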

156 citations

Journal ArticleDOI
TL;DR: An interactive visual analytics system for document clustering, called iVisClustering, is proposed based on a widely‐used topic modeling method, latent Dirichlet allocation (LDA), which provides a summary of each cluster in terms of its most representative keywords and visualizes soft clustering results in parallel coordinates.
Abstract: Clustering plays an important role in many large-scale data analyses providing users with an overall understanding of their data. Nonetheless, clustering is not an easy task due to noisy features and outliers existing in the data, and thus the clustering results obtained from automatic algorithms often do not make clear sense. To remedy this problem, automatic clustering should be complemented with interactive visualization strategies. This paper proposes an interactive visual analytics system for document clustering, called iVisClustering, based on a widely-used topic modeling method, latent Dirichlet allocation (LDA). iVisClustering provides a summary of each cluster in terms of its most representative keywords and visualizes soft clustering results in parallel coordinates. The main view of the system provides a 2D plot that visualizes cluster similarities and the relation among data items with a graph-based representation. iVisClustering provides several other views, which contain useful interaction methods. With help of these visualization modules, we can interactively refine the clustering results in various ways. Keywords can be adjusted so that they characterize each cluster better. In addition, our system can filter out noisy data and re-cluster the data accordingly. Cluster hierarchy can be constructed using a tree structure and for this purpose, the system supports cluster-level interactions such as sub-clustering, removing unimportant clusters, merging the clusters that have similar meanings, and moving certain clusters to any other node in the tree structure. Furthermore, the system provides document-level interactions such as moving mis-clustered documents to another cluster and removing useless documents. Finally, we present how interactive clustering is performed via iVisClustering by using real-world document data sets. © 2012 Wiley Periodicals, Inc.
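The cluster summaries that iVisClustering displays ("most representative keywords" per cluster) come down to ranking each row of an LDA topic-word matrix. A minimal, generic sketch of that post-processing step, using an invented toy matrix rather than the system's actual data:

```python
import numpy as np

def top_keywords(topic_word, vocab, k=5):
    """Return the k highest-probability words for each topic/cluster.

    topic_word: (num_topics, vocab_size) matrix of word probabilities, as
    produced by any LDA implementation; vocab maps column index -> word.
    """
    order = np.argsort(-topic_word, axis=1)[:, :k]   # descending per row
    return [[vocab[i] for i in row] for row in order]

# Toy 2-topic, 6-word example (each row sums to 1).
vocab = ["cluster", "graph", "pixel", "image", "node", "edge"]
topic_word = np.array([
    [0.40, 0.25, 0.05, 0.05, 0.15, 0.10],   # a "graphs/clustering" topic
    [0.05, 0.05, 0.45, 0.35, 0.05, 0.05],   # an "images" topic
])
print(top_keywords(topic_word, vocab, k=2))
# [['cluster', 'graph'], ['pixel', 'image']]
```

Interactive refinement of the kind the paper describes (re-weighting keywords, merging clusters) amounts to editing this matrix and re-ranking.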

155 citations

Proceedings Article
01 Jan 2013
TL;DR: The Structural Topic Model (STM), a general way to incorporate corpus structure or document metadata into the standard topic model, is developed which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content.
Abstract: We develop the Structural Topic Model which provides a general way to incorporate corpus structure or document metadata into the standard topic model. Document-level covariates enter the model through a simple generalized linear model framework in the prior distributions controlling either topical prevalence or topical content. We demonstrate the model's use in two applied problems: the analysis of open-ended responses in a survey experiment about immigration policy, and understanding differing media coverage of China's rise.

1 Topic Models and Social Science

Over the last decade probabilistic topic models, such as Latent Dirichlet Allocation (LDA), have become a common tool for understanding large text corpora [1]. Although originally developed for descriptive and exploratory purposes, social scientists are increasingly seeing the value of topic models as a tool for measurement of latent linguistic, political and psychological variables [2]. The defining element of this work is the presence of additional document-level information (e.g. author, partisan affiliation, date) on which variation in either topical prevalence or topical content is of theoretic interest. As a practical matter, this generally involves running an off-the-shelf implementation of LDA and then performing a post-hoc evaluation of variation with a covariate of interest. A better alternative to post-hoc comparisons is to build the additional information about the structure of the corpus into the model itself by altering the prior distributions to partially pool information amongst similar documents. Numerous special cases of this framework have been developed for particular types of corpus structure affecting both topic prevalence (e.g. time [3], author [4]) and topical content (e.g. ideology [5], geography [6]). Applied users have been slow to adopt these models because it is often difficult to find a model that exactly fits their specific corpus.
We develop the Structural Topic Model (STM) which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content. The central idea is to

∗ Prepared for the NIPS 2013 Workshop on Topic Models: Computation, Application, and Evaluation. A forthcoming R package implements the methods described here.
† These authors contributed equally.

We assume a general familiarity with LDA throughout (see [1] for a review). By "topical prevalence" we mean the proportion of a document devoted to a given topic. By "topical content" we mean the rate of word use within a given topic.
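The STM's "generalized linear model framework in the prior" for topical prevalence can be sketched as a covariate-dependent logistic-normal prior: document covariates set the prior mean of the topic logits, and a softmax maps them to proportions. A hedged numpy illustration of that idea only (the covariates, coefficients, and noise scale below are invented; the real model is estimated, not fixed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document-level covariates: intercept + a binary "source" flag.
X = np.array([[1.0, 0.0],     # document from source A
              [1.0, 1.0]])    # document from source B
K = 3                         # number of topics

# Coefficients Gamma map covariates to the prior mean of (K-1) topic logits
# (one topic serves as the reference category in logistic-normal models).
Gamma = np.array([[0.0, 0.0],     # intercept row: no baseline preference
                  [1.5, -1.0]])   # the source-B flag shifts the topic logits

def prevalence_prior_draw(X, Gamma, sigma=0.5):
    """Draw document-topic proportions from a covariate-dependent
    logistic-normal prior: eta ~ N(X @ Gamma, sigma^2 I), theta = softmax."""
    mu = X @ Gamma                                      # (D, K-1) prior means
    eta = mu + sigma * rng.standard_normal(mu.shape)
    eta_full = np.hstack([eta, np.zeros((len(X), 1))])  # reference logit = 0
    expeta = np.exp(eta_full - eta_full.max(axis=1, keepdims=True))
    return expeta / expeta.sum(axis=1, keepdims=True)

theta = prevalence_prior_draw(X, Gamma)
# In expectation, source-B documents (row 1) put more prior mass on the
# first topic than source-A documents do, because the flag raises its logit.
```

This is exactly the "partial pooling amongst similar documents" the abstract describes: documents sharing covariate values share a prior mean, rather than being compared post hoc.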

155 citations


Network Information
Related Topics (5)

Cluster analysis: 146.5K papers, 2.9M citations (86% related)
Support vector machine: 73.6K papers, 1.7M citations (86% related)
Deep learning: 79.8K papers, 2.1M citations (85% related)
Feature extraction: 111.8K papers, 2.1M citations (84% related)
Convolutional neural network: 74.7K papers, 2M citations (83% related)
Performance
Metrics
No. of papers in the topic in previous years

Year   Papers
2023   323
2022   842
2021   418
2020   429
2019   473
2018   446