Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation (LDA) is a generative probabilistic topic model and an active research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.
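
For orientation (not tied to any particular paper listed below), the following is a minimal sketch of fitting an LDA topic model with scikit-learn; the toy corpus, the two-topic setting, and the other parameters are arbitrary placeholders.

```python
# Minimal LDA sketch using scikit-learn; corpus and hyperparameters are
# arbitrary placeholders, shown only to illustrate the basic workflow.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "latent dirichlet allocation is a generative probabilistic topic model",
    "topic models discover latent themes in large text corpora",
    "web pages can be clustered using page text anchor text and tags",
    "material recognition combines low level and mid level image features",
]

# LDA operates on raw term counts, so CountVectorizer is used rather than TF-IDF.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# n_components is the number of topics; 2 is a placeholder choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic proportions

# Print the top terms of each inferred topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])
```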


Papers
Proceedings Article (DOI)
09 Feb 2009
TL;DR: It is demonstrated how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages.
Abstract: Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.

250 citations
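
A rough sketch of the first approach in the paper above, K-means clustering in a vector space extended with tag features, is shown below; the toy pages, tag strings, and unweighted concatenation are hypothetical stand-ins rather than the authors' del.icio.us data or tuned setup.

```python
# Sketch of K-means over a vector space that appends tag features to page-text
# features (toy data; not the paper's corpus, weighting, or evaluation).
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "python tutorial for machine learning beginners",
    "deep learning frameworks and neural networks",
    "best chocolate cake recipe with frosting",
    "easy weeknight pasta dinner recipes",
]
# Space-separated tag sets, e.g. as harvested from a social bookmarking site.
tags = [
    "programming python ml",
    "programming deeplearning ai",
    "baking dessert recipe",
    "cooking recipe dinner",
]

text_features = TfidfVectorizer().fit_transform(pages)
tag_features = TfidfVectorizer().fit_transform(tags)

# Naive inclusion of tagging data: append tag dimensions to page-text dimensions.
X = hstack([text_features, tag_features])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```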

Proceedings Article (DOI)
13 Jun 2010
TL;DR: In this article, an augmented Latent Dirichlet Allocation (aLDA) model was proposed to combine low and mid-level features under a Bayesian generative framework and learn an optimal combination of features.
Abstract: We are interested in identifying the material category, e.g. glass, metal, fabric, plastic or wood, from a single image of a surface. Unlike other visual recognition tasks in computer vision, it is difficult to find good, reliable features that can tell material categories apart. Our strategy is to use a rich set of low and mid-level features that capture various aspects of material appearance. We propose an augmented Latent Dirichlet Allocation (aLDA) model to combine these features under a Bayesian generative framework and learn an optimal combination of features. Experimental results show that our system performs material recognition reasonably well on a challenging material database, outperforming state-of-the-art material/texture recognition systems.

249 citations
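
The authors' aLDA model is not reproduced here; the sketch below only illustrates the generic bag-of-visual-words plus topic-model pipeline that such approaches build on, with random arrays standing in for material photographs and arbitrary patch and vocabulary sizes.

```python
# Generic bag-of-visual-words + LDA sketch (NOT the paper's aLDA model); the
# "images" are random arrays used purely as placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
images = rng.random((10, 64, 64))        # 10 synthetic grayscale "images"

def patches(img, size=8):
    """Cut an image into non-overlapping size x size patches, flattened."""
    h, w = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(0, h - size + 1, size)
                     for j in range(0, w - size + 1, size)])

all_patches = np.vstack([patches(im) for im in images])

# Quantize patches into a small visual vocabulary (16 "words" is a placeholder).
vocab = KMeans(n_clusters=16, n_init=10, random_state=0).fit(all_patches)

# Represent each image as a histogram of visual-word counts.
hist = np.array([np.bincount(vocab.predict(patches(im)), minlength=16)
                 for im in images])

# Per-image topic proportions, which could feed a downstream classifier.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
topic_features = lda.fit_transform(hist)
print(topic_features.shape)              # (10, 4)
```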

Journal Article (DOI)
TL;DR: Two out of three coherence measures find NMF to regularly produce more coherent topics, while higher levels of generality and redundancy are observed with the LDA topic descriptors; this suggests that NMF may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
Abstract: We evaluate the coherence and generality of topic descriptors found by LDA and NMF. Six new and existing corpora were specifically compiled for this evaluation. A new coherence measure using word2vec-modeled term vector similarity is proposed. NMF regularly produces more coherent topics, where term weighting is influential. NMF may be more suitable for topic modeling of niche or non-mainstream corpora. In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, it might be argued that adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. Nevertheless, a number of studies have proposed measures for analyzing such coherence, where these have been largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) being somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that this may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.

247 citations
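
The distributional coherence idea from the paper above, scoring a topic by how similar its top terms are under a word2vec model, can be sketched roughly as follows; the toy training corpus and the simple mean pairwise-similarity formula are illustrative assumptions rather than the paper's exact measure (gensim 4.x API assumed).

```python
# Rough sketch of a word2vec-based topic coherence score: mean pairwise
# similarity of a topic's top terms (illustrative; not the paper's measure).
from itertools import combinations
from gensim.models import Word2Vec

# Tiny placeholder corpus; in practice the embedding would be trained on (or
# loaded for) the background collection being modeled.
sentences = [
    ["topic", "model", "discovers", "latent", "themes"],
    ["latent", "dirichlet", "allocation", "is", "a", "topic", "model"],
    ["nonnegative", "matrix", "factorization", "also", "finds", "topics"],
    ["coherence", "measures", "score", "topic", "descriptors"],
] * 50   # repeated so the toy model has enough examples to train on

w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=0)

def coherence(top_terms, model):
    """Mean pairwise cosine similarity of a topic's top terms."""
    pairs = list(combinations(top_terms, 2))
    return sum(model.wv.similarity(a, b) for a, b in pairs) / len(pairs)

print(coherence(["topic", "model", "latent", "themes"], w2v))
```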

Journal Article
TL;DR: The maximum entropy discrimination latent Dirichlet allocation (MedLDA) model is proposed, which integrates the mechanism behind the max-margin prediction models with the mechanism behind the hierarchical Bayesian topic models under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression.
Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification.

247 citations
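
MedLDA couples the max-margin objective with topic inference in a single constrained optimization; for contrast, the common two-stage baseline it is designed to improve on, unsupervised LDA features followed by a separately trained SVM, looks roughly like the sketch below (toy corpus and labels).

```python
# Two-stage baseline: learn LDA topic proportions first, then train a
# max-margin classifier (SVM) on them separately. MedLDA instead optimizes
# both objectives jointly; this sketch only shows the decoupled pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "the team won the championship game last night",
    "the striker scored twice in the second half",
    "the new budget raises taxes on imported goods",
    "parliament debated the trade policy reform bill",
]
labels = ["sports", "sports", "politics", "politics"]

clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LinearSVC(),                     # max-margin classifier on topic features
)
clf.fit(docs, labels)
print(clf.predict(["the team scored in the final game"]))
```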

Journal Article (DOI)
TL;DR: A Dirichlet-derived multiple topic model (DMTM) is proposed to fuse heterogeneous features at a topic level for HSR imagery scene classification and is able to reduce the dimension of the features representing the HSR images, to fuse the different types of features efficiently, and to improve the performance of the scene classification over that of other scene classification algorithms based on spatial pyramid matching, probabilistic latent semantic analysis, and latent Dirichlet allocation.
Abstract: Due to the complex arrangements of the ground objects in high spatial resolution (HSR) imagery scenes, HSR imagery scene classification is a challenging task, which is aimed at bridging the semantic gap between the low-level features and the high-level semantic concepts. A combination of multiple complementary features for HSR imagery scene classification is considered a potential way to improve the performance. However, the different types of features have different characteristics, and how to fuse the different types of features is a classic problem. In this paper, a Dirichlet-derived multiple topic model (DMTM) is proposed to fuse heterogeneous features at a topic level for HSR imagery scene classification. An efficient algorithm based on a variational expectation–maximization framework is developed to infer the DMTM and estimate the parameters of the DMTM. The proposed DMTM scene classification method is able to incorporate different types of features with different characteristics, no matter whether these features are local or global, discrete or continuous. Meanwhile, the proposed DMTM can also reduce the dimension of the features representing the HSR images. In our experiments, three types of heterogeneous features, i.e., the local spectral feature, the local structural feature, and the global textural feature, were employed. The experimental results with three different HSR imagery data sets show that the three types of features are complementary. In addition, the proposed DMTM is able to reduce the dimension of the features representing the HSR images, to fuse the different types of features efficiently, and to improve the performance of the scene classification over that of other scene classification algorithms based on spatial pyramid matching, probabilistic latent semantic analysis, and latent Dirichlet allocation.

245 citations
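
The proposed DMTM is not sketched here; the snippet below only illustrates a much simpler early-fusion alternative, concatenating quantized-word histograms of several feature types into one count vector per image before fitting a single LDA model, with synthetic histograms standing in for spectral, structural, and textural features.

```python
# Naive early-fusion baseline (NOT the proposed DMTM): concatenate histograms
# over separate visual vocabularies and fit one LDA model on the result.
# The histograms below are synthetic placeholders.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_images = 30

# Placeholder word-count histograms for three heterogeneous feature types.
spectral = rng.integers(0, 10, size=(n_images, 50))
structural = rng.integers(0, 10, size=(n_images, 100))
textural = rng.integers(0, 10, size=(n_images, 20))

# Early fusion: one combined "document" of counts per scene image.
fused = np.hstack([spectral, structural, textural])

lda = LatentDirichletAllocation(n_components=8, random_state=0)
scene_topics = lda.fit_transform(fused)   # low-dimensional scene representation
print(scene_topics.shape)                 # (30, 8)
```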


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  323
2022  842
2021  418
2020  429
2019  473
2018  446