
Showing papers on "Latent Dirichlet allocation published in 2004"


Journal ArticleDOI
TL;DR: The generative model for documents introduced by Blei, Ng, and Jordan is described, a Markov chain Monte Carlo algorithm is presented for inference in this model, and the algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Abstract: A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying “hot topics” by examining temporal dynamics and tagging abstracts to illustrate semantic content.

5,680 citations
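The Markov chain Monte Carlo algorithm referred to here is a collapsed Gibbs sampler: topic proportions and topic-word distributions are integrated out, and each word's topic assignment is resampled from its full conditional given all other assignments. A minimal sketch in Python, assuming symmetric Dirichlet priors; function and variable names are illustrative, not from the paper:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch).

    docs: list of word-id lists; V: vocabulary size; K: number of topics.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]  # random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]          # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

Point estimates of the topics follow by normalising the counts, e.g. row-normalising (nkw + beta).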


Proceedings ArticleDOI
07 Jul 2004
TL;DR: The author-topic model is introduced, a generative model for documents that extends Latent Dirichlet Allocation to include authorship information, and applications to computing similarity between authors and entropy of author output are demonstrated.
Abstract: We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.

1,554 citations
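The generative process the abstract describes is easy to state in code: each word picks one of the document's authors uniformly at random, draws a topic from that author's topic distribution, then draws the word from that topic. A hedged sketch, assuming the multinomial parameters theta (author-topic) and phi (topic-word) have already been estimated, e.g. by the Gibbs sampler the paper uses:

```python
import numpy as np

def generate_doc(author_ids, theta, phi, length, seed=0):
    """Sample one document from the author-topic process (sketch).

    author_ids: the document's authors
    theta: (num_authors, K) per-author topic distributions
    phi:   (K, V) per-topic word distributions
    """
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(length):
        a = rng.choice(author_ids)                    # author, uniform
        k = rng.choice(theta.shape[1], p=theta[a])    # author -> topic
        w = rng.choice(phi.shape[1], p=phi[k])        # topic -> word
        words.append(w)
    return words
```

Giving each document a single document-specific pseudo-author recovers LDA, and tying each author to a single topic recovers the simple author model the abstract compares against.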


Proceedings Article
01 Dec 2004
TL;DR: The hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data, is proposed and experimental results are reported showing the effective and superior performance of the HDP over previous models.
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

474 citations
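The sharing mechanism in the HDP can be illustrated with a truncated stick-breaking construction: global component weights are drawn once, and each group then draws its own weights centred on the global ones, so all groups reuse the same components. A sketch under a finite truncation; approximating the group-level DP draw with a Dirichlet is an assumption made for brevity, not the paper's inference scheme:

```python
import numpy as np

def hdp_weights(num_groups, trunc=50, gamma=1.0, alpha0=1.0, seed=0):
    """Truncated two-level stick-breaking sketch of the HDP."""
    rng = np.random.default_rng(seed)
    # global stick-breaking: beta_k = v_k * prod_{l<k} (1 - v_l)
    v = rng.beta(1.0, gamma, size=trunc)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()                      # renormalise the truncation
    # each group's weights are centred on the shared global weights
    pis = rng.dirichlet(alpha0 * beta + 1e-12, size=num_groups)
    return beta, pis                        # shared atoms, group weights
```

Because every group's weights put mass on the same set of atoms, clusters discovered in one group are available to all others, which is what makes the HDP suitable for topic discovery across a corpus.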


Proceedings ArticleDOI
10 Oct 2004
TL;DR: A new way of modeling multi-modal co-occurrences is proposed, constraining the definition of the latent space to ensure its consistency in semantic terms (words), while retaining the ability to jointly model visual information.
Abstract: We address the problem of unsupervised image auto-annotation with probabilistic latent space models. Unlike most previous works, which build latent space representations assuming equal relevance for the text and visual modalities, we propose a new way of modeling multi-modal co-occurrences, constraining the definition of the latent space to ensure its consistency in semantic terms (words), while retaining the ability to jointly model visual information. The concept is implemented by a linked pair of Probabilistic Latent Semantic Analysis (PLSA) models. On a 16000-image collection, we show with extensive experiments that our approach significantly outperforms previous joint models.

258 citations
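The building block here is PLSA; the paper's contribution is constraining a linked pair of PLSA models so the latent space stays consistent with the word modality while still modelling visual features. For orientation, a minimal EM sketch of one standard PLSA model; the linked-pair constraint itself is not shown:

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """EM for standard PLSA (sketch).

    N: (D, W) document-word count matrix (numpy array); K: number of aspects.
    """
    rng = np.random.default_rng(seed)
    p_z_d = rng.dirichlet(np.ones(K), size=N.shape[0])  # P(z|d)
    p_w_z = rng.dirichlet(np.ones(N.shape[1]), size=K)  # P(w|z)
    for _ in range(iters):
        # E-step: P(z|d,w) ∝ P(z|d) P(w|z)
        q = p_z_d[:, :, None] * p_w_z[None, :, :]       # (D, K, W)
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        nq = N[:, None, :] * q
        p_z_d = nq.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = nq.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

Roughly, the linked variant lets the aspect structure learned from the text modality constrain the fit of the visual modality, which is what anchors the latent space to words.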


Journal ArticleDOI
TL;DR: An unsupervised algorithm for learning, from multivariate data, a finite mixture model based on the Dirichlet distribution, which offers high flexibility for modeling data.
Abstract: This paper presents an unsupervised algorithm for learning a finite mixture model from multivariate data. This mixture model is based on the Dirichlet distribution, which offers high flexibility for modeling data. The proposed approach for estimating the parameters of a Dirichlet mixture is based on the maximum likelihood (ML) and Fisher scoring methods. Experimental results are presented for the following applications: estimation of artificial histograms, summarization of image databases for efficient retrieval, and human skin color modeling and its application to skin detection in multimedia databases.

196 citations
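For context, the E-step of fitting such a mixture computes each point's posterior responsibility for every component; the ML/Fisher-scoring updates the abstract mentions then re-estimate the Dirichlet parameters from these responsibilities (that M-step is not shown). A minimal sketch using SciPy's Dirichlet density:

```python
import numpy as np
from scipy.stats import dirichlet

def responsibilities(X, weights, alphas):
    """E-step of a Dirichlet mixture (sketch).

    X: (n, d) rows strictly inside the simplex
    weights: (m,) mixing proportions; alphas: m Dirichlet parameter vectors
    """
    # log p(x | component j) + log weight_j, for all points at once
    log_r = np.stack([np.log(w) + dirichlet.logpdf(X.T, a)
                      for w, a in zip(weights, alphas)], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # numerical stabilisation
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # (n, m) responsibilities
```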


Proceedings Article
01 Dec 2004
TL;DR: A probabilistic model for online document clustering that uses a non-parametric Dirichlet process prior to model the growing number of clusters and a general English language model as the base distribution to handle the generation of novel clusters.
Abstract: In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.

144 citations
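The clustering decision the abstract describes can be sketched as a Chinese-restaurant-process step: an incoming document joins an existing cluster with probability proportional to the cluster's size times the document's likelihood under it, or starts a new cluster with probability proportional to the DP concentration times its likelihood under the base (general English) model. A sketch with illustrative names; the paper's exact likelihoods and empirical-Bayes hyperparameter estimates are not shown:

```python
import numpy as np

def crp_assign(cluster_loglik, cluster_sizes, base_loglik, alpha=1.0, seed=0):
    """One online DP assignment step (sketch).

    cluster_loglik: log-likelihood of the new document under each cluster
    base_loglik: log-likelihood under the base language model
    Returns a cluster index; len(cluster_sizes) means "new cluster".
    """
    rng = np.random.default_rng(seed)
    sizes = np.asarray(cluster_sizes, dtype=float)
    log_prior = np.log(np.append(sizes, alpha))          # CRP prior
    log_post = log_prior + np.append(cluster_loglik, base_loglik)
    log_post -= log_post.max()                           # stabilise
    p = np.exp(log_post)
    return int(rng.choice(len(p), p=p / p.sum()))
```

For novelty detection in TDT, the "new cluster" outcome is the natural signal that a document discusses a previously unseen topic.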


Book ChapterDOI
21 Jul 2004
TL;DR: A model based on the Inference Network framework from information retrieval that employs a powerful query language that allows structured query operators, term weighting, and the combination of text and images within a query is proposed.
Abstract: Most image retrieval systems only allow a fragment of text or an example image as a query. Most users have more complex information needs that are not easily expressed in either of these forms. This paper proposes a model based on the Inference Network framework from information retrieval that employs a powerful query language that allows structured query operators, term weighting, and the combination of text and images within a query. The model uses non-parametric methods to estimate probabilities within the inference network. Image annotation and retrieval results are reported and compared against other published systems and illustrative structured and weighted query results are given to show the power of the query language. The resulting system both performs well and is robust compared to existing approaches.

112 citations


Journal ArticleDOI
01 Dec 2004
TL;DR: A new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings is proposed.
Abstract: We propose a new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.

83 citations
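The objective sketched in the abstract — matching the given class posteriors with the posteriors induced by a Gaussian mixture in the embedding space — can be written in a few lines. A sketch assuming unit-variance, equal-weight Gaussian classes, a simplification of the paper's equal-covariance assumption:

```python
import numpy as np
from scipy.special import logsumexp

def pe_objective(P, X, Phi):
    """Parametric-embedding objective (sketch): sum_n KL(p_n || q_n).

    P:   (n, c) given class-conditional probabilities p(c|x_n)
    X:   (n, d) embedded data points
    Phi: (c, d) embedded class centres
    """
    # squared distances between every point and every class centre
    d2 = ((X[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
    logq = -0.5 * d2
    logq -= logsumexp(logq, axis=1, keepdims=True)   # normalise q(c|x_n)
    return float((P * (np.log(P + 1e-12) - logq)).sum())
```

Minimising this over X and Phi (e.g. by gradient descent) yields the embedding; each evaluation scales with objects times classes, the computational advantage the abstract notes.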


01 Jan 2004
TL;DR: A suite of probabilistic models of information collections for which the above problems can be cast as statistical queries are described, and directed graphical models are used as a flexible, modular framework for describing appropriate modeling assumptions about the data.
Abstract: Managing large and growing collections of information is a central goal of modern computer science. Data repositories of texts, images, sounds, and genetic information have become widely accessible, thus necessitating good methods of retrieval, organization, and exploration. In this thesis, we describe a suite of probabilistic models of information collections for which the above problems can be cast as statistical queries. We use directed graphical models as a flexible, modular framework for describing appropriate modeling assumptions about the data. Fast approximate posterior inference algorithms based on variational methods free us from having to specify tractable models, and further allow us to take the Bayesian perspective, even in the face of large datasets. With this framework in hand, we describe latent Dirichlet allocation (LDA), a graphical model particularly suited to analyzing text collections. LDA posits a finite index of hidden topics which describe the underlying documents. New documents are situated into the collection via approximate posterior inference of their associated index terms. Extensions to LDA can index a set of images, or multimedia collections of interrelated text and images. Finally, we describe nonparametric Bayesian methods for relaxing the assumption of a fixed number of topics, and develop models based on the natural assumption that the size of the index can grow with the collection. This idea is extended to trees, and to models which represent the hidden structure and content of a topic hierarchy that underlies a collection.

74 citations
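The variational inference this thesis builds on alternates two closed-form updates per document: word-level topic responsibilities phi and a Dirichlet variational posterior gamma over topic proportions. A minimal sketch of that per-document E-step, following the standard LDA updates:

```python
import numpy as np
from scipy.special import digamma

def lda_e_step(doc, beta, alpha=0.1, iters=50):
    """Variational E-step for one document in LDA (sketch).

    doc: list of word ids; beta: (K, V) topic-word probabilities
    """
    K = beta.shape[0]
    gamma = np.full(K, alpha + len(doc) / K)      # initial topic posterior
    for _ in range(iters):
        # phi_nk ∝ beta_{k, w_n} * exp(digamma(gamma_k))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)           # gamma_k = alpha + Σ_n phi_nk
    return gamma, phi
```

Because each document is processed independently, approximate posterior inference stays tractable even for large collections, as the abstract emphasises.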


Proceedings ArticleDOI
29 Sep 2004
TL;DR: A robust probabilistic mixture model based on the multinomial and the Dirichlet distributions is presented, together with an unsupervised algorithm for learning this mixture.
Abstract: The performance of a statistical signal processing system depends in large part on the accuracy of the probabilistic model used. This paper presents a robust probabilistic mixture model based on the multinomial and the Dirichlet distributions. An unsupervised algorithm for learning this mixture is also given. The proposed approach for estimating the parameters of the multinomial Dirichlet mixture is based on the maximum likelihood (ML) and Newton-Raphson methods. Experimental results involve improving content-based image retrieval systems by integrating semantic features and by image database categorization.

52 citations


Proceedings ArticleDOI
23 Aug 2004
TL;DR: A new finite mixture model based on a generalization of the Dirichlet distribution is presented, which involves the comparison of the performance of Gaussian and generalized Dirichlet mixtures in the classification of several pattern-recognition data sets.
Abstract: This paper presents a new finite mixture model based on a generalization of the Dirichlet distribution. For the estimation of the parameters of this mixture we use a GEM (generalized expectation maximization) algorithm based on a Newton-Raphson step. The experimental results involve the comparison of the performance of Gaussian and generalized Dirichlet mixtures in the classification of several pattern-recognition data sets.

Journal ArticleDOI
TL;DR: The baseline distribution in the AFT regression model is modeled as a mixture of Dirichlet processes, viewed as a simple extension of existing parametric models, and a novel MCMC scheme is introduced for making posterior inferences.
Abstract: We model the baseline distribution in the accelerated failure-time (AFT) model as a mixture of Dirichlet processes for interval-censored data. This mixture is distinct from Dirichlet process mixtures, and can be viewed as a simple extension of existing parametric models, which we believe is an advantage in the practical modeling of data. We introduce a novel MCMC scheme for the purpose of making posterior inferences for the AFT regression model and illustrate our methods with several real examples.

Journal ArticleDOI
TL;DR: In this paper, a general formulation based on a Dirichlet process prior is presented, which yields the number and composition of mixing components a posteriori, obviating the need for post hoc test procedures, and is capable of approximating any target heterogeneity distribution.
Abstract: The finite normal mixture model has emerged as a dominant methodology for assessing heterogeneity in choice models. Although it extends the classic mixture models by allowing within-component variability, it requires that a relatively large number of models be separately estimated and that fairly difficult test procedures be used to determine the correct number of mixing components. We present a very general formulation, based on a Dirichlet process prior, which yields the number and composition of mixing components a posteriori, obviating the need for post hoc test procedures, and is capable of approximating any target heterogeneity distribution. Adapting Stephens's (2000) algorithm allows the determination of substantively different clusters, as well as a way to sidestep problems arising from label-switching and overlapping mixtures. These methods are illustrated both on simulated data and on A.C. Nielsen scanner panel data for liquid detergents. We find that the large number of mixing components required to adequately represent the heterogeneity distribution can be reduced in practice to a far smaller number of segments of managerial relevance.

01 Jan 2004
TL;DR: In this article, a nonparametric Bayesian treatment for analyzing records containing occurrences of items is described, which retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records.
Abstract: This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.

01 Jan 2004
TL;DR: This paper defines the TTMM, compares it to the related Latent Dirichlet Allocation (LDA) model (Blei et al., 2003), and reports some interesting empirical results.
Abstract: Documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a graphical model for representing documents: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distributions over the topics. This paper defines the TTMM, compares it to the related Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) and reports some interesting empirical results.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions; the resulting method naturally includes a number of features found in other popular methods.
Abstract: An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions. The resulting method naturally includes a number of features found in other popular methods. Specifically, tf.idf-like term weighting and document length normalisation are recovered. The new method is compared with Okapi BM-25 [3] and the Twenty-One model [1] on TREC data and is shown to give better performance.
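The paper derives its weighting from a hierarchical Dirichlet process prior; as a rough point of reference, the closely related Dirichlet-smoothed query likelihood also recovers tf.idf-like weighting and document-length normalisation. A sketch of that standard scorer — not the paper's exact model, and the fallback count for unseen words is an assumption to avoid log 0:

```python
import math
from collections import Counter

def dirichlet_score(query, doc, collection_tf, collection_len, mu=2000.0):
    """Dirichlet-smoothed query-likelihood score (sketch).

    query, doc: lists of terms; collection_tf: corpus term counts
    """
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_bg = collection_tf.get(w, 0.5) / collection_len  # background model
        # smoothed document model: (tf + mu * p_bg) / (|doc| + mu)
        score += math.log((tf[w] + mu * p_bg) / (len(doc) + mu))
    return score
```

Rare terms (small p_bg) contribute more when they do occur in a document, giving the idf-like effect, while the mu term in the denominator normalises for document length.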

Proceedings Article
01 Jan 2004
TL;DR: An unsupervised algorithm based on the Maximum Likelihood (ML) and Fisher scoring methods is proposed for the estimation of the parameters of this mixture, which is used to produce a new texture model.
Abstract: This paper presents a new finite mixture model based on the Multinomial Dirichlet distribution (MDD). For the estimation of the parameters of this mixture we propose an unsupervised algorithm based on the Maximum Likelihood (ML) and Fisher scoring methods. This mixture is used to produce a new texture model. Experimental results concern texture image summarizing and are reported on the Vistex texture image database from the MIT Media Lab.

01 Jan 2004
TL;DR: A model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population, called an attribute ensemble, may depend on the cluster being considered.
Abstract: We propose a model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population, called an attribute ensemble, may depend on the cluster being considered. The model is based on a Pólya urn cluster model, which is equivalent to a Dirichlet process mixture of multivariate normal distributions. This model-based approach allows for the incorporation of application-specific data features into the clustering scheme. For example, in an analysis of genetic CGH array data we account for spatial correlation of genetic abnormalities along the genome.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: It turns out that even sophisticated unsupervised methods like multinomial PCA (or latent Dirichlet allocation) cannot help much; by contrast, feature extraction supervised by relevant auxiliary data may help.
Abstract: This work is part of a proactive information retrieval project that aims at estimating relevance from implicit user feedback. The noisy feedback signal needs to be complemented with all available information, and textual content is one of the natural sources. Here we take the first steps by investigating whether this source is at all useful in the challenging setting of estimating the relevance of a new document based on only few samples with known relevance. It turns out that even sophisticated unsupervised methods like multinomial PCA (or latent Dirichlet allocation) cannot help much. By contrast, feature extraction supervised by relevant auxiliary data may help.

Book ChapterDOI
22 Aug 2004
TL;DR: A topographic version of two LCMs for collaborative filtering is presented and the models are applied to a large collection of user ratings for films.
Abstract: Latent class models (LCM) represent the high dimensional data in a smaller dimensional space in terms of latent variables. They are able to automatically discover the patterns from the data. We present a topographic version of two LCMs for collaborative filtering and apply the models to a large collection of user ratings for films. Latent classes are topologically organized on a “star-like” structure. This makes orientation in rating patterns captured by latent classes easier and more systematic. The variation in film rating patterns is modelled by multinomial and binomial distributions with varying independence assumptions.