
Latent Dirichlet allocation

About: Latent Dirichlet allocation (LDA) is a research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.
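For readers new to the topic, below is a minimal sketch of fitting an LDA model, assuming scikit-learn is available; the tiny corpus and parameter values are purely illustrative.

```python
# Minimal LDA sketch (illustrative corpus and parameters, not from any
# paper on this page). LDA is fit on raw term counts, not tf-idf weights,
# since its generative model draws word occurrences from multinomials.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "topic models infer latent themes from documents",
    "dirichlet priors govern topic and word distributions",
    "clustering groups similar documents together",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of components_ is an (unnormalized) topic-word distribution.
print(lda.transform(counts))  # per-document topic proportions
```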


Papers
Journal ArticleDOI
TL;DR: This study evaluates several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit, and shows that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures.
Abstract: Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study, we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term frequency-inverse document frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.

149 citations
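As a rough illustration of the kind of pipeline the study above benchmarks, the hedged sketch below clusters tf-idf features with k-means and scores the result with an extrinsic measure (normalized mutual information); the toy documents and labels are stand-ins, not the authors' Twitter/Reddit data.

```python
# Toy tf-idf + k-means pipeline with extrinsic (NMI) evaluation,
# assuming scikit-learn; documents and ground-truth labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

docs = ["short noisy tweet about sports", "reddit post on politics",
        "another sports tweet", "a political discussion thread"]
true_labels = [0, 1, 0, 1]  # hypothetical ground-truth categories

X = TfidfVectorizer().fit_transform(docs)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Extrinsic evaluation compares predicted clusters to known labels.
print(normalized_mutual_info_score(true_labels, pred))
```

In the paper's best-performing setting, the tf-idf features above would be replaced by neural embedding representations; the clustering and evaluation steps stay the same.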

Journal ArticleDOI
TL;DR: In this article, a bipartite network of documents and words is used to detect the number of topics and hierarchically cluster both the words and documents, which leads to better topic models than LDA.
Abstract: One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.

148 citations
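The paper's starting point is representing a corpus as a bipartite document-word network. The sketch below builds such a network with networkx (an assumed library choice); the community-detection step, which the authors perform with a nonparametric stochastic block model, typically via a dedicated library such as graph-tool, is not reproduced here.

```python
# Bipartite document-word graph construction, assuming networkx.
# Edge weights count word occurrences within each document.
import networkx as nx

docs = {"d1": ["topic", "model", "text"],
        "d2": ["network", "community", "text"]}

G = nx.Graph()
for doc, words in docs.items():
    G.add_node(doc, bipartite=0)       # document node
    for w in words:
        G.add_node(w, bipartite=1)     # word node
        if G.has_edge(doc, w):
            G[doc][w]["weight"] += 1
        else:
            G.add_edge(doc, w, weight=1)

print(G.number_of_nodes(), G.number_of_edges())
```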

Proceedings Article
01 Dec 2007
TL;DR: This paper analyzes three batch topic models recently proposed in the machine learning and data mining community – Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures and von Mises-Fisher (vMF) mixture models – and proposes a practical heuristic for hybrid topic modeling.
Abstract: Automated unsupervised learning of topic-based clusters is used in various text data mining applications, e.g., document organization in content management, information retrieval and filtering in news aggregation services. Typically, batch models are used for this purpose, which perform clustering on the document collection in aggregate. In this paper, we first analyze three batch topic models that have been recently proposed in the machine learning and data mining community – Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures and von Mises-Fisher (vMF) mixture models. Our discussion uses a common framework based on the particular assumptions made regarding the conditional distributions corresponding to each component and the topic priors. Our experiments on large real-world document collections demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. In cases where offline clustering on complete document collections is infeasible due to resource constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on real-world streaming text illustrate the speed and performance benefits of online vMF. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text data and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for applications (e.g., dynamic topic-based aggregation of consumer-generated content in social networking sites) that need a good tradeoff between the performance of batch offline algorithms and efficiency of incremental online algorithms.

148 citations
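To illustrate the batch-versus-online distinction the paper studies, the sketch below uses scikit-learn's mini-batch (online) LDA as a stand-in; the authors' own online vMF, EDCM, and LDA variants are not reproduced, and the two-batch stream is illustrative.

```python
# Online (incremental) LDA via partial_fit, assuming scikit-learn.
# The "stream" here is a made-up two-batch stand-in for real streaming text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

stream = [
    ["first batch of streaming documents arrives", "more short text in batch one"],
    ["a later mini-batch brings new documents", "incremental updates continue here"],
]

vec = CountVectorizer()
vec.fit([doc for batch in stream for doc in batch])  # fix vocabulary up front

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                random_state=0)
for batch in stream:
    lda.partial_fit(vec.transform(batch))  # one incremental update per mini-batch

print(lda.transform(vec.transform(stream[-1])))  # topic mix of the latest batch
```

A batch model would instead call fit once on the aggregated collection; the hybrid heuristic the paper proposes alternates between the two regimes.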

Book ChapterDOI
01 Jan 2015
TL;DR: In this article, the authors provide an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic analysis method today, and illustrate how to employ LDA on a textual data set.
Abstract: Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories. First, this chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Second, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, it discusses the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.

148 citations
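In the spirit of the chapter's tutorial introduction, the sketch below applies LDA to a small textual data set with gensim (an assumed library choice; the chapter's own tooling may differ). The token lists stand in for preprocessed software-engineering text such as commit messages or issue reports.

```python
# Tutorial-style LDA run, assuming gensim; token lists are illustrative
# stand-ins for preprocessed commit messages / issue reports.
from gensim import corpora
from gensim.models import LdaModel

texts = [["bug", "fix", "crash", "null"],
         ["add", "feature", "ui", "button"],
         ["fix", "memory", "leak", "crash"]]

dictionary = corpora.Dictionary(texts)            # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words per document

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # top words per extracted topic
```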

Journal ArticleDOI
TL;DR: A Bayesian framework for modeling individual differences is introduced, in which subjects are assumed to belong to one of a potentially infinite number of groups; this allows flexible parameter distributions to be learned without overfitting the data or performing the complex computations typically required to determine a model's dimensionality.

147 citations
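The "potentially infinite number of groups" suggests a Dirichlet process prior over group assignments. Below is a minimal truncated stick-breaking sketch of such a prior in numpy; it illustrates the construction only and is not the paper's full hierarchical model.

```python
# Truncated stick-breaking construction of Dirichlet process weights
# (an interpretation of the paper's "infinite groups", not its model).
import numpy as np

rng = np.random.default_rng(0)
alpha, truncation = 1.0, 20  # concentration parameter and truncation level

# w_k = b_k * prod_{j<k} (1 - b_j), with b_k ~ Beta(1, alpha)
betas = rng.beta(1.0, alpha, size=truncation)
remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
weights = betas * remaining  # mixture weights over (truncated) groups

print(weights.round(3), weights.sum())  # sum approaches 1 as truncation grows
```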


Network Information
Related Topics (5)

Topic                          Papers    Citations   Related
Cluster analysis               146.5K    2.9M        86%
Support vector machine         73.6K     1.7M        86%
Deep learning                  79.8K     2.1M        85%
Feature extraction             111.8K    2.1M        84%
Convolutional neural network   74.7K     2M          83%
Performance Metrics

No. of papers in the topic in previous years:

Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446