
Showing papers on "Latent Dirichlet allocation published in 2003"


Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

30,570 citations
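The three-level generative process the abstract describes can be sketched in a few lines of NumPy. All sizes and hyperparameters below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K topics, V-word vocabulary, document length.
K, V, doc_len = 4, 50, 20
alpha = np.full(K, 0.1)                         # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # K topic-word distributions

def generate_document(n_words):
    """Sample one document from the three-level LDA generative process."""
    theta = rng.dirichlet(alpha)                # per-document topic mixture
    z = rng.choice(K, size=n_words, p=theta)    # topic assignment for each word
    words = np.array([rng.choice(V, p=beta[k]) for k in z])
    return theta, z, words

theta, z, words = generate_document(doc_len)
```

Inference (recovering `theta` and `beta` from `words` alone) is the hard part the paper addresses with variational EM; the sketch above only shows the forward model.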


Journal ArticleDOI
TL;DR: A new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text, is presented, and a number of models for the joint distribution of image regions and words are developed, including several which explicitly learn the correspondence between regions and words.
Abstract: We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to mixture of latent Dirichlet allocation (MoM-LDA). All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held out data. We present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. We can use annotation performance as a proxy measure, but accurate measurement requires hand labeled data, and thus must occur on a smaller scale. We show results using both an annotation proxy, and manually labeled data.

1,726 citations


Proceedings ArticleDOI
28 Jul 2003
TL;DR: Three hierarchical probabilistic mixture models which aim to describe annotated data with multiple types, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type.
Abstract: We consider the problem of modeling annotated data---data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). We describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. We conduct experiments on the Corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval.

1,199 citations
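The correspondence idea can be sketched generatively: topics are assigned to image regions first, and each caption word then picks a region and draws from that region's topic. This is a simplified illustration of the structure (region features are omitted; all sizes and distributions are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: K topics, V caption-word vocabulary.
K, V = 3, 10
n_regions, n_caption = 5, 4
alpha = np.full(K, 1.0)
topic_word = rng.dirichlet(np.full(V, 0.5), size=K)  # word dist. per topic

theta = rng.dirichlet(alpha)                 # image-level topic mixture
z = rng.choice(K, size=n_regions, p=theta)   # one topic per image region
# Correspondence step: each caption word selects a region, then draws a
# word from that region's topic -- so every word annotates a region.
y = rng.integers(0, n_regions, size=n_caption)
caption = np.array([rng.choice(V, p=topic_word[z[r]]) for r in y])
```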


Proceedings Article
09 Dec 2003
TL;DR: A Bayesian approach is taken to generate an appropriate prior via a distribution on partitions that allows arbitrarily large branching factors and readily accommodates growing data collections.
Abstract: We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting—which of the large collection of possible trees to use? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts.

1,055 citations
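The nested Chinese restaurant process prior can be sketched directly: each document walks down the tree, at each level either following an existing branch (with probability proportional to how many earlier documents took it) or opening a new one. A minimal sketch, with an illustrative concentration parameter:

```python
import numpy as np

rng = np.random.default_rng(2)

def crp_choice(counts, gamma):
    """Pick an existing branch w.p. prop. to its count, a new one w.p. prop. to gamma."""
    probs = np.append(counts, gamma).astype(float)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

def nested_crp_paths(n_docs, depth, gamma=1.0):
    """Sample a depth-level tree path per document; branching is unbounded."""
    tree = {}   # node (path tuple) -> list of child visit counts
    paths = []
    for _ in range(n_docs):
        node, path = (), []
        for _ in range(depth):
            counts = tree.setdefault(node, [])
            choice = crp_choice(np.array(counts, dtype=float), gamma)
            if choice == len(counts):
                counts.append(0)    # open a brand-new branch
            counts[choice] += 1
            path.append(choice)
            node = node + (choice,)
        paths.append(tuple(path))
    return paths

paths = nested_crp_paths(n_docs=20, depth=3)
```

In the full model each tree node carries a topic, and a document's words are drawn from the topics along its path; the sketch shows only the prior over paths.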


Proceedings ArticleDOI
28 Jul 2003
TL;DR: PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior, so the perceived shortcomings of PLSI can be resolved and elucidated within the LDA framework.
Abstract: Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior; the perceived shortcomings of PLSI can therefore be resolved and elucidated within the LDA framework.

230 citations
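The paper's equivalence rests on a standard fact that is easy to check numerically: under a Dirichlet(α) prior, the MAP estimate of a multinomial with counts n is (nᵢ + αᵢ − 1)/(N + Σα − K), which collapses to the relative-frequency (maximum-likelihood) estimate when the prior is uniform (all αᵢ = 1). The counts below are illustrative:

```python
import numpy as np

counts = np.array([5, 3, 2])      # illustrative observation counts
alpha = np.ones_like(counts)      # uniform Dirichlet prior

# MAP estimate of a multinomial under a Dirichlet(alpha) prior:
map_est = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(counts))
mle = counts / counts.sum()       # maximum-likelihood estimate

assert np.allclose(map_est, mle)  # uniform prior => MAP coincides with MLE
```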


Journal ArticleDOI
TL;DR: The Dirichlet assumption implies that lazy methods can perform as well as eager discretization methods, and the lazy method is extended to classify set-valued and multi-interval data with a naive Bayesian classifier.
Abstract: In a naive Bayesian classifier, discrete variables as well as discretized continuous variables are assumed to have Dirichlet priors. This paper describes the implications and applications of this model selection choice. We start by reviewing key properties of Dirichlet distributions. Among these properties, the most important one is “perfect aggregation,” which allows us to explain why discretization works for a naive Bayesian classifier. Since perfect aggregation holds for Dirichlets, we can explain that in general, discretization can outperform parameter estimation assuming a normal distribution. In addition, we can explain why a wide variety of well-known discretization methods, such as entropy-based, ten-bin, and bin-log l, can perform well with insignificant difference. We designed experiments to verify our explanation using synthesized and real data sets and showed that in addition to well-known methods, a wide variety of discretization methods all perform similarly. Our analysis leads to a lazy discretization method, which discretizes continuous variables according to test data. The Dirichlet assumption implies that lazy methods can perform as well as eager discretization methods. We empirically confirmed this implication and extended the lazy method to classify set-valued and multi-interval data with a naive Bayesian classifier.

42 citations
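The "perfect aggregation" property the abstract leans on says that summing disjoint cells of a Dirichlet vector yields another Dirichlet vector whose parameters are the corresponding sums. A Monte Carlo check of the implied marginal moments, with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([2.0, 3.0, 5.0])            # illustrative Dirichlet parameters

samples = rng.dirichlet(a, size=200_000)
merged = samples[:, 0] + samples[:, 1]   # aggregate the first two cells

# Perfect aggregation: the merged cell behaves as a single Dirichlet cell
# with parameter a[0] + a[1], i.e. marginally Beta(a0 + a1, a2).
expected_mean = (a[0] + a[1]) / a.sum()
expected_var = expected_mean * (1 - expected_mean) / (a.sum() + 1)

assert abs(merged.mean() - expected_mean) < 1e-2
assert abs(merged.var() - expected_var) < 1e-2
```

This is why merging discretization bins does not change the form of the posterior, which underlies the paper's explanation of why many binning schemes perform similarly.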


Proceedings Article
01 Jan 2003
TL;DR: This work considers the statistical problem of analyzing the association between two categorical variables from cross-classified data and proposes measures which enable one to study the dependencies at a local level and to assess whether the data support some more or less strong association model.
Abstract: We consider the statistical problem of analyzing the association between two categorical variables from cross-classified data. The focus is put on measures which enable one to study the dependencies at a local level and to assess whether the data support some more or less strong association model. Statistical inference is envisaged using an imprecise Dirichlet model.

30 citations


Journal ArticleDOI
TL;DR: A latent class logit model with parameter constraints is considered and a method for determining an appropriate number of latent classes within a Bayesian framework is proposed.
Abstract: Latent class models have recently drawn considerable attention among many researchers and practitioners as a class of useful tools for capturing heterogeneity across different segments in a target market or population. In this paper, we consider a latent class logit model with parameter constraints and deal with two important issues in the latent class models--parameter estimation and selection of an appropriate number of classes--within a Bayesian framework. A simple Gibbs sampling algorithm is proposed for sample generation from the posterior distribution of unknown parameters. Using the Gibbs output, we propose a method for determining an appropriate number of latent classes. A real-world marketing example as an application for market segmentation is provided to illustrate the proposed method.

12 citations


Journal ArticleDOI
TL;DR: In this article, a prior distribution for multinomial parameters is constructed by modifying the prior that posits independent Dirichlet distributions for the multinomial parameters across time.
Abstract: Studies producing longitudinal multinomial data arise in several subject areas. This article suggests a Bayesian approach to the analysis of such data. Rather than infusing a latent model structure, we develop a prior distribution for the multinomial parameters which reflects the longitudinal nature of the observations. This distribution is constructed by modifying the prior that posits independent Dirichlet distributions for the multinomial parameters across time. Posterior analysis, which is implemented using Monte Carlo methods, can then be used to assess the temporal behaviour of the multinomial parameters underlying the observed data. We test this methodology on simulated data, opinion polling data, and data from a study concerning the development of moral reasoning.

6 citations
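The baseline the article modifies -- independent Dirichlet priors at each time point -- gives simple conjugate posterior updates, which is worth seeing concretely. Counts and hyperparameters below are illustrative:

```python
import numpy as np

# Observed multinomial counts at three time points (illustrative data).
counts = np.array([[30, 50, 20],
                   [25, 55, 20],
                   [20, 60, 20]])
alpha = np.ones(3)   # symmetric Dirichlet prior at each time point

# Under independent Dirichlet priors, the posterior at each time point is
# again Dirichlet with parameters alpha + counts (conjugacy); no information
# is shared across time -- which is exactly what the article's prior adds.
post_alpha = alpha + counts
post_mean = post_alpha / post_alpha.sum(axis=1, keepdims=True)
```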


Proceedings ArticleDOI
17 Nov 2003
TL;DR: The Bayes optimal solutions for parameter estimation and for selecting the dimension of the hidden latent class in these models are formulated, and their asymptotic properties are analyzed.
Abstract: In this paper, we consider the Bayesian approach for representation of a set of documents. In this field, many models have been proposed, such as latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), the semantic aggregate model (SAM), and Bayesian latent semantic analysis (BLSA). In this paper, we formulate the Bayes optimal solutions for parameter estimation and for selecting the dimension of the hidden latent class in these models, and we analyze their asymptotic properties.

5 citations


Journal ArticleDOI
TL;DR: Large sample properties of the posterior distribution with a mixture of Dirichlet process priors are studied and it is shown that the posterior distribution of the survival function is consistent with right censored data.
Abstract: Mixtures of Dirichlet process priors offer a reasonable compromise between purely parametric and purely non-parametric models, and are popularly used in survival analysis and for testing problems with non-parametric alternatives. In this paper, we study large sample properties of the posterior distribution with a mixture of Dirichlet process priors. We show that the posterior distribution of the survival function is consistent with right censored data.
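A draw from a Dirichlet process can be sketched via the stick-breaking construction, truncated here for illustration (the concentration parameter and base measure are arbitrary choices; mixtures of DP priors, as in the paper, additionally place a prior over these):

```python
import numpy as np

rng = np.random.default_rng(5)

def stick_breaking(concentration, n_atoms):
    """Truncated stick-breaking weights of a Dirichlet process draw."""
    betas = rng.beta(1.0, concentration, size=n_atoms)
    # Each weight is the current break times the stick remaining so far.
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas[:-1])])
    return betas * remaining

weights = stick_breaking(concentration=2.0, n_atoms=100)
atoms = rng.normal(size=100)   # atom locations from a base measure, here N(0, 1)
```

The random measure places mass `weights[i]` at `atoms[i]`; truncation leaves a small unassigned remainder, which is why the weights sum to slightly less than one.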

01 Jan 2003
TL;DR: An algorithm for learning HLC models is developed and the feasibility of learning HLC models that are large enough to be of practical interest is demonstrated.
Abstract: Hierarchical latent class (HLC) models generalize latent class models. As models for cluster analysis, they suit more applications than the latter because they relax the often untrue conditional independence assumption. They also facilitate the discovery of latent causal structures and the induction of probabilistic models that capture complex dependencies and yet have low inferential complexity. In this paper, we investigate the problem of inducing HLC models from data. Two fundamental issues of general latent structure discovery are identified and methods to address those issues for HLC models are proposed. Based on the proposals, we develop an algorithm for learning HLC models and demonstrate the feasibility of learning HLC models that are large enough to be of practical interest.

Book ChapterDOI
15 Sep 2003
TL;DR: In this paper, the authors present a distance-based discriminant analysis (DDA) method that defines the design of a basic building block classifier for distinguishing among a selected number of semantic categories and demonstrate how a set of DDA classifiers can be grouped into a hierarchical ensemble for prediction of an arbitrary set of semantic classes.
Abstract: Ever-increasing amount of multimedia available online necessitates the development of new techniques and methods that can overcome the semantic gap problem. The said problem, encountered due to major disparities between inherent representational characteristics of multimedia and its semantic content sought by the user, has been a prominent research direction addressed by a great number of semantic augmentation approaches originating from such areas as machine learning, statistics, natural language processing, etc. In this paper, we review several of these recently developed techniques that bring together low-level representation of multimedia and its semantics in order to improve the efficiency of access and retrieval. We also present a distance-based discriminant analysis (DDA) method that defines the design of a basic building block classifier for distinguishing among a selected number of semantic categories. In addition to that, we demonstrate how a set of DDA classifiers can be grouped into a hierarchical ensemble for prediction of an arbitrary set of semantic classes.