Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Journal ArticleDOI
TL;DR: This paper employs Latent Dirichlet Allocation (LDA) to build user profile signatures, assuming that any significant, unexplainable deviation from an individual user's normal activity is strongly correlated with fraudulent activity.

85 citations
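As a rough illustration of the approach above, here is a minimal sketch of LDA-based user-profile signatures for anomaly scoring, assuming gensim and scipy are available; the event vocabulary, threshold, and helper names are illustrative and not taken from the paper.

```python
# Hedged sketch: LDA user-profile signatures for anomaly scoring.
# Event tokens, threshold, and structure are illustrative, not from the paper.
import numpy as np
from gensim import corpora, models
from scipy.spatial.distance import jensenshannon

NUM_TOPICS = 5

def topic_signature(lda, dictionary, events):
    """Dense topic distribution for one bag of activity tokens."""
    bow = dictionary.doc2bow(events)
    probs = np.zeros(NUM_TOPICS)
    for topic_id, p in lda.get_document_topics(bow, minimum_probability=0.0):
        probs[topic_id] = p
    return probs

# Each "document" is one user-day of activity tokens (actions, resources, ...).
history = [["login", "read_mail", "read_mail"],
           ["login", "upload", "read_mail"]]
dictionary = corpora.Dictionary(history)
corpus = [dictionary.doc2bow(doc) for doc in history]
lda = models.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=10)

# Profile signature = mean topic mixture over the user's own past days.
baseline = np.mean([topic_signature(lda, dictionary, d) for d in history], axis=0)
today = topic_signature(lda, dictionary, ["login", "upload", "upload"])

# A large divergence from the user's own baseline is flagged as suspicious.
if jensenshannon(baseline, today) > 0.5:  # threshold is illustrative
    print("possible fraudulent activity")
```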

Posted Content
TL;DR: The Structural Topic Model (STM) as mentioned in this paper provides a general way to incorporate corpus structure or document metadata (e.g., author, partisan affiliation, date) into the standard topic model.
Abstract: We develop the Structural Topic Model, which provides a general way to incorporate corpus structure or document metadata into the standard topic model. Document-level covariates enter the model through a simple generalized linear model framework in the prior distributions controlling either topical prevalence or topical content. We demonstrate the model's use in two applied problems: the analysis of open-ended responses in a survey experiment about immigration policy, and understanding differing media coverage of China's rise.

1 Topic Models and Social Science

Over the last decade probabilistic topic models, such as Latent Dirichlet Allocation (LDA), have become a common tool for understanding large text corpora [1]. Although originally developed for descriptive and exploratory purposes, social scientists are increasingly seeing the value of topic models as a tool for measurement of latent linguistic, political and psychological variables [2]. The defining element of this work is the presence of additional document-level information (e.g. author, partisan affiliation, date) on which variation in either topical prevalence or topical content is of theoretic interest. As a practical matter, this generally involves running an off-the-shelf implementation of LDA and then performing a post-hoc evaluation of variation with a covariate of interest. A better alternative to post-hoc comparisons is to build the additional information about the structure of the corpus into the model itself by altering the prior distributions to partially pool information amongst similar documents. Numerous special cases of this framework have been developed for particular types of corpus structure affecting both topical prevalence (e.g. time [3], author [4]) and topical content (e.g. ideology [5], geography [6]). Applied users have been slow to adopt these models because it is often difficult to find a model that exactly fits their specific corpus. We develop the Structural Topic Model (STM), which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content. The central idea is to …

Footnotes: Prepared for the NIPS 2013 Workshop on Topic Models: Computation, Application, and Evaluation; a forthcoming R package implements the methods described here. The authors contributed equally. We assume a general familiarity with LDA throughout (see [1] for a review). By "topical prevalence" we mean the proportion of a document devoted to a given topic; by "topical content" we mean the rate of word use within a given topic.

85 citations
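A minimal sketch of the STM's topical-prevalence idea as stated in the abstract: document-level covariates shift the mean of a logistic-normal prior over topic proportions through a linear model. The dimensions, covariate values, and parameters below are invented for illustration.

```python
# Hedged sketch of covariate-driven topical prevalence (STM-style prior).
import numpy as np

rng = np.random.default_rng(0)
K, P = 4, 2                          # number of topics, number of covariates
X_d = np.array([1.0, 1.0])           # e.g. [intercept, partisan affiliation]
Gamma = rng.normal(size=(P, K - 1))  # prevalence coefficients (K-1 free dims)
Sigma = 0.5 * np.eye(K - 1)          # covariance between topics

# Logistic-normal draw whose mean is a linear function of the covariates.
eta = rng.multivariate_normal(X_d @ Gamma, Sigma)
eta = np.append(eta, 0.0)            # last topic is the reference category
theta = np.exp(eta) / np.exp(eta).sum()  # map onto the simplex
print(theta)                         # this document's topic proportions
```

Changing X_d moves the prior mean, which is how similar documents get partially pooled instead of being compared post hoc.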

Journal ArticleDOI
TL;DR: Correlation Explanation (CorEx) is introduced as an alternative approach to topic modeling that does not assume an underlying generative model and instead learns maximally informative topics through an information-theoretic framework, which generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions.
Abstract: While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.

85 citations
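A minimal usage sketch, assuming the open-source corextopic Python package (pip install corextopic); the toy corpus and anchor words are invented for illustration.

```python
# Hedged sketch: anchored CorEx topics on a tiny invented corpus.
import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

docs = [
    "the senate debated the immigration bill",
    "the team won the championship game",
    "new visa rules tighten border policy",
    "the striker scored in the final match",
]
vectorizer = CountVectorizer(binary=True)      # CorEx works on binary counts
X = ss.csr_matrix(vectorizer.fit_transform(docs))
words = list(vectorizer.get_feature_names_out())

# Anchor words nudge topics toward domain concepts with minimal supervision.
model = ct.Corex(n_hidden=2, seed=1)
model.fit(X, words=words,
          anchors=[["immigration", "visa"], ["game", "match"]],
          anchor_strength=3)

for i, topic in enumerate(model.get_topics()):
    print(f"topic {i}:", [w for w, *_ in topic])
```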

Journal ArticleDOI
TL;DR: Through applications involving real-data classification and image database categorization using visual words, it is shown that clustering via infinite mixture models offers more powerful and robust performance than classic finite mixtures.
Abstract: In this paper, we propose a clustering algorithm based on both Dirichlet processes and the generalized Dirichlet distribution, which has been shown to be very flexible for proportional data modeling. Our approach can be viewed as an extension of the finite generalized Dirichlet mixture model to the infinite case. The extension is based on nonparametric Bayesian analysis. This clustering algorithm does not require the number of mixture components to be specified in advance and estimates it in a principled manner. Our approach is Bayesian and relies on the estimation of the posterior distribution of clusterings using a Gibbs sampler. Through applications involving real-data classification and image database categorization using visual words, we show that clustering via infinite mixture models offers more powerful and robust performance than classic finite mixtures.

85 citations
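The paper's model is an infinite generalized Dirichlet mixture fit by Gibbs sampling, for which no off-the-shelf implementation is assumed here. The sketch below instead uses scikit-learn's variational Dirichlet-process Gaussian mixture, a different component distribution and inference method, only to illustrate the key practical property: the effective number of clusters is inferred rather than fixed in advance.

```python
# Hedged sketch: a Dirichlet-process mixture infers the number of clusters.
# (Gaussian components and variational inference, unlike the paper's model.)
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.3, size=(100, 2)),
    rng.normal(loc=[0, 3], scale=0.3, size=(100, 2)),
])

# n_components is only an upper bound; unused components get ~zero weight.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

effective = np.sum(dpgmm.weights_ > 0.01)
print(f"effective clusters: {effective}")   # typically 3 for this data
```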

Journal ArticleDOI
TL;DR: The Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections is introduced, and it is shown how "hub images" (images representative of an object or landmark) can easily be extracted from the authors' matching-graph representation.
Abstract: Given a large-scale collection of images our aim is to efficiently associate images which contain the same entity, for example a building or object, and to discover the significant entities. To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections. This explicitly represents images as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked. Additionally, to reduce the computational cost of applying the gLDA model to large datasets, we propose a scalable method that first computes a matching graph over all the images in a dataset. This matching graph connects images that contain the same object, and rough image groups can be mined from this graph using standard clustering techniques. The gLDA model can then be applied to generate a more nuanced representation of the data. We also discuss how "hub images" (images representative of an object or landmark) can easily be extracted from our matching graph representation. We evaluate our techniques on the publicly available Oxford buildings dataset (5K images) and show examples of automatically mined objects. The methods are evaluated quantitatively on this dataset using a ground truth labeling for a number of Oxford landmarks. To demonstrate the scalability of the matching graph method, we show qualitative results on two larger datasets of images taken of the Statue of Liberty (37K images) and Rome (1M+ images).

85 citations
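A minimal sketch of the matching-graph stage described above, using networkx; the pairwise verified-match counts are toy values standing in for the paper's visual-word matching with a spatial-consistency check, and the threshold is illustrative.

```python
# Hedged sketch: images as nodes, edges between pairs with enough
# geometrically verified matches; rough groups = connected components.
import networkx as nx

# Toy counts of spatially verified matches between 5 images (stand-ins for
# SIFT/visual-word matching plus a geometric verification step).
verified = {(0, 1): 45, (1, 2): 30, (3, 4): 55, (0, 2): 5}

def build_matching_graph(num_images, verified, min_matches=20):
    g = nx.Graph()
    g.add_nodes_from(range(num_images))
    for (i, j), n in verified.items():
        if n >= min_matches:
            g.add_edge(i, j)
    return g

graph = build_matching_graph(5, verified)
print(list(nx.connected_components(graph)))  # rough object groups: {0,1,2}, {3,4}

# A "hub image" is simply a high-degree node in the matching graph.
hub = max(graph.degree, key=lambda kv: kv[1])[0]
print("hub image:", hub)
```

Each connected component can then be refined with the gLDA model, which is what keeps the method tractable on the larger Statue of Liberty and Rome datasets.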


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance
Metrics
Number of papers in the topic in previous years:

Year  Papers
2023  323
2022  842
2021  418
2020  429
2019  473
2018  446