
Showing papers on "Latent Dirichlet allocation published in 2004"


Journal ArticleDOI
TL;DR: The generative model for documents introduced by Blei, Ng, and Jordan is described, a Markov chain Monte Carlo algorithm is presented for inference in this model, and the algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Abstract: A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying “hot topics” by examining temporal dynamics and tagging abstracts to illustrate semantic content.

5,680 citations
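The Markov chain Monte Carlo algorithm referred to here is a collapsed Gibbs sampler: topic proportions and topic-word distributions are integrated out, and each word's topic assignment is resampled from its full conditional given all other assignments. A minimal sketch in Python, assuming symmetric Dirichlet priors; function and variable names are illustrative, not from the paper:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch).

    docs: list of word-id lists; V: vocabulary size; K: number of topics.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]  # random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]          # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

Point estimates of the topics follow by normalising the counts, e.g. row-normalising (nkw + beta).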


Proceedings ArticleDOI
07 Jul 2004
TL;DR: The author-topic model is introduced, a generative model for documents that extends Latent Dirichlet Allocation to include authorship information, and applications to computing similarity between authors and entropy of author output are demonstrated.
Abstract: We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.

1,554 citations
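The generative process the abstract describes is easy to state in code: each word picks one of the document's authors uniformly at random, draws a topic from that author's topic distribution, then draws the word from that topic. A hedged sketch, assuming the multinomial parameters theta (author-topic) and phi (topic-word) have already been estimated, e.g. by the Gibbs sampler the paper uses:

```python
import numpy as np

def generate_doc(author_ids, theta, phi, length, seed=0):
    """Sample one document from the author-topic process (sketch).

    author_ids: the document's authors
    theta: (num_authors, K) per-author topic distributions
    phi:   (K, V) per-topic word distributions
    """
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(length):
        a = rng.choice(author_ids)                    # author, uniform
        k = rng.choice(theta.shape[1], p=theta[a])    # author -> topic
        w = rng.choice(phi.shape[1], p=phi[k])        # topic -> word
        words.append(w)
    return words
```

Giving each document a single document-specific pseudo-author recovers LDA, and tying each author to a single topic recovers the simple author model the abstract compares against.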


Proceedings Article
01 Dec 2004
TL;DR: The hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data, is proposed and experimental results are reported showing the effective and superior performance of the HDP over previous models.
Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

474 citations
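The sharing mechanism in the HDP can be illustrated with a truncated stick-breaking construction: global component weights are drawn once, and each group then draws its own weights centred on the global ones, so all groups reuse the same components. A sketch under a finite truncation; approximating the group-level DP draw with a Dirichlet is an assumption made for brevity, not the paper's inference scheme:

```python
import numpy as np

def hdp_weights(num_groups, trunc=50, gamma=1.0, alpha0=1.0, seed=0):
    """Truncated two-level stick-breaking sketch of the HDP."""
    rng = np.random.default_rng(seed)
    # global stick-breaking: beta_k = v_k * prod_{l<k} (1 - v_l)
    v = rng.beta(1.0, gamma, size=trunc)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()                      # renormalise the truncation
    # each group's weights are centred on the shared global weights
    pis = rng.dirichlet(alpha0 * beta + 1e-12, size=num_groups)
    return beta, pis                        # shared atoms, group weights
```

Because every group's weights put mass on the same set of atoms, clusters discovered in one group are available to all others, which is what makes the HDP suitable for topic discovery across a corpus.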


Proceedings ArticleDOI
10 Oct 2004
TL;DR: A new way of modeling multi-modal co-occurrences is proposed, constraining the definition of the latent space to ensure its consistency in semantic terms (words), while retaining the ability to jointly model visual information.
Abstract: We address the problem of unsupervised image auto-annotation with probabilistic latent space models. Unlike most previous works, which build latent space representations assuming equal relevance for the text and visual modalities, we propose a new way of modeling multi-modal co-occurrences, constraining the definition of the latent space to ensure its consistency in semantic terms (words), while retaining the ability to jointly model visual information. The concept is implemented by a linked pair of Probabilistic Latent Semantic Analysis (PLSA) models. On a 16000-image collection, we show with extensive experiments that our approach significantly outperforms previous joint models.

258 citations
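The building block here is PLSA; the paper's contribution is constraining a linked pair of PLSA models so the latent space stays consistent with the word modality while still modelling visual features. For orientation, a minimal EM sketch of one standard PLSA model; the linked-pair constraint itself is not shown:

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """EM for standard PLSA (sketch).

    N: (D, W) document-word count matrix (numpy array); K: number of aspects.
    """
    rng = np.random.default_rng(seed)
    p_z_d = rng.dirichlet(np.ones(K), size=N.shape[0])  # P(z|d)
    p_w_z = rng.dirichlet(np.ones(N.shape[1]), size=K)  # P(w|z)
    for _ in range(iters):
        # E-step: P(z|d,w) ∝ P(z|d) P(w|z)
        q = p_z_d[:, :, None] * p_w_z[None, :, :]       # (D, K, W)
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        nq = N[:, None, :] * q
        p_z_d = nq.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = nq.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

Roughly, the linked variant lets the aspect structure learned from the text modality constrain the fit of the visual modality, which is what anchors the latent space to words.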


Journal ArticleDOI
TL;DR: An unsupervised algorithm for learning, from multivariate data, a finite mixture model based on the Dirichlet distribution, which offers high flexibility for modeling data.
Abstract: This paper presents an unsupervised algorithm for learning a finite mixture model from multivariate data. This mixture model is based on the Dirichlet distribution, which offers high flexibility for modeling data. The proposed approach for estimating the parameters of a Dirichlet mixture is based on the maximum likelihood (ML) and Fisher scoring methods. Experimental results are presented for the following applications: estimation of artificial histograms, summarization of image databases for efficient retrieval, and human skin color modeling and its application to skin detection in multimedia databases.

196 citations
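For context, the E-step of fitting such a mixture computes each point's posterior responsibility for every component; the ML/Fisher-scoring updates the abstract mentions then re-estimate the Dirichlet parameters from these responsibilities (that M-step is not shown). A minimal sketch using SciPy's Dirichlet density:

```python
import numpy as np
from scipy.stats import dirichlet

def responsibilities(X, weights, alphas):
    """E-step of a Dirichlet mixture (sketch).

    X: (n, d) rows strictly inside the simplex
    weights: (m,) mixing proportions; alphas: m Dirichlet parameter vectors
    """
    # log p(x | component j) + log weight_j, for all points at once
    log_r = np.stack([np.log(w) + dirichlet.logpdf(X.T, a)
                      for w, a in zip(weights, alphas)], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # numerical stabilisation
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # (n, m) responsibilities
```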


Proceedings Article
01 Dec 2004
TL;DR: A probabilistic model for online document clustering that uses a non-parametric Dirichlet process prior to model the growing number of clusters and a general English language model as the base distribution to handle the generation of novel clusters.
Abstract: In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.

144 citations
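The clustering decision the abstract describes can be sketched as a Chinese-restaurant-process step: an incoming document joins an existing cluster with probability proportional to the cluster's size times the document's likelihood under it, or starts a new cluster with probability proportional to the DP concentration times its likelihood under the base (general English) model. A sketch with illustrative names; the paper's exact likelihoods and empirical-Bayes hyperparameter estimates are not shown:

```python
import numpy as np

def crp_assign(cluster_loglik, cluster_sizes, base_loglik, alpha=1.0, seed=0):
    """One online DP assignment step (sketch).

    cluster_loglik: log-likelihood of the new document under each cluster
    base_loglik: log-likelihood under the base language model
    Returns a cluster index; len(cluster_sizes) means "new cluster".
    """
    rng = np.random.default_rng(seed)
    sizes = np.asarray(cluster_sizes, dtype=float)
    log_prior = np.log(np.append(sizes, alpha))          # CRP prior
    log_post = log_prior + np.append(cluster_loglik, base_loglik)
    log_post -= log_post.max()                           # stabilise
    p = np.exp(log_post)
    return int(rng.choice(len(p), p=p / p.sum()))
```

For novelty detection in TDT, the "new cluster" outcome is the natural signal that a document discusses a previously unseen topic.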


Book ChapterDOI
21 Jul 2004
TL;DR: A model based on the Inference Network framework from information retrieval that employs a powerful query language that allows structured query operators, term weighting, and the combination of text and images within a query is proposed.
Abstract: Most image retrieval systems only allow a fragment of text or an example image as a query. Most users have more complex information needs that are not easily expressed in either of these forms. This paper proposes a model based on the Inference Network framework from information retrieval that employs a powerful query language that allows structured query operators, term weighting, and the combination of text and images within a query. The model uses non-parametric methods to estimate probabilities within the inference network. Image annotation and retrieval results are reported and compared against other published systems and illustrative structured and weighted query results are given to show the power of the query language. The resulting system both performs well and is robust compared to existing approaches.

112 citations


Journal ArticleDOI
01 Dec 2004
TL;DR: A new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings is proposed.
Abstract: We propose a new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.

83 citations
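The objective sketched in the abstract — matching the given class posteriors with the posteriors induced by a Gaussian mixture in the embedding space — can be written in a few lines. A sketch assuming unit-variance, equal-weight Gaussian classes, a simplification of the paper's equal-covariance assumption:

```python
import numpy as np
from scipy.special import logsumexp

def pe_objective(P, X, Phi):
    """Parametric-embedding objective (sketch): sum_n KL(p_n || q_n).

    P:   (n, c) given class-conditional probabilities p(c|x_n)
    X:   (n, d) embedded data points
    Phi: (c, d) embedded class centres
    """
    # squared distances between every point and every class centre
    d2 = ((X[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
    logq = -0.5 * d2
    logq -= logsumexp(logq, axis=1, keepdims=True)   # normalise q(c|x_n)
    return float((P * (np.log(P + 1e-12) - logq)).sum())
```

Minimising this over X and Phi (e.g. by gradient descent) yields the embedding; each evaluation scales with objects times classes, the computational advantage the abstract notes.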


01 Jan 2004
TL;DR: A suite of probabilistic models of information collections for which the above problems can be cast as statistical queries are described, and directed graphical models are used as a flexible, modular framework for describing appropriate modeling assumptions about the data.
Abstract: Managing large and growing collections of information is a central goal of modern computer science. Data repositories of texts, images, sounds, and genetic information have become widely accessible, thus necessitating good methods of retrieval, organization, and exploration. In this thesis, we describe a suite of probabilistic models of information collections for which the above problems can be cast as statistical queries. We use directed graphical models as a flexible, modular framework for describing appropriate modeling assumptions about the data. Fast approximate posterior inference algorithms based on variational methods free us from having to specify tractable models, and further allow us to take the Bayesian perspective, even in the face of large datasets. With this framework in hand, we describe latent Dirichlet allocation (LDA), a graphical model particularly suited to analyzing text collections. LDA posits a finite index of hidden topics which describe the underlying documents. New documents are situated into the collection via approximate posterior inference of their associated index terms. Extensions to LDA can index a set of images, or multimedia collections of interrelated text and images. Finally, we describe nonparametric Bayesian methods for relaxing the assumption of a fixed number of topics, and develop models based on the natural assumption that the size of the index can grow with the collection. This idea is extended to trees, and to models which represent the hidden structure and content of a topic hierarchy that underlies a collection.

74 citations
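The variational inference this thesis builds on alternates two closed-form updates per document: word-level topic responsibilities phi and a Dirichlet variational posterior gamma over topic proportions. A minimal sketch of that per-document E-step, following the standard LDA updates:

```python
import numpy as np
from scipy.special import digamma

def lda_e_step(doc, beta, alpha=0.1, iters=50):
    """Variational E-step for one document in LDA (sketch).

    doc: list of word ids; beta: (K, V) topic-word probabilities
    """
    K = beta.shape[0]
    gamma = np.full(K, alpha + len(doc) / K)      # initial topic posterior
    for _ in range(iters):
        # phi_nk ∝ beta_{k, w_n} * exp(digamma(gamma_k))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)           # gamma_k = alpha + Σ_n phi_nk
    return gamma, phi
```

Because each document is processed independently, approximate posterior inference stays tractable even for large collections, as the abstract emphasises.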


Proceedings ArticleDOI
29 Sep 2004
TL;DR: A robust probabilistic mixture model based on the multinomial and the Dirichlet distributions is presented, together with an unsupervised algorithm for learning this mixture.
Abstract: The performance of a statistical signal processing system depends in large part on the accuracy of the probabilistic model used. This paper presents a robust probabilistic mixture model based on the multinomial and the Dirichlet distributions. An unsupervised algorithm for learning this mixture is also given. The proposed approach for estimating the parameters of the multinomial Dirichlet mixture is based on the maximum likelihood (ML) and Newton-Raphson methods. Experimental results involve improving content-based image retrieval systems by integrating semantic features and by image database categorization.

52 citations


Proceedings ArticleDOI
23 Aug 2004
TL;DR: A new finite mixture model based on a generalization of the Dirichlet distribution is presented, which involves the comparison of the performance of Gaussian and generalized Dirichlet mixtures in the classification of several pattern-recognition data sets.
Abstract: This paper presents a new finite mixture model based on a generalization of the Dirichlet distribution. For the estimation of the parameters of this mixture we use a GEM (generalized expectation maximization) algorithm based on a Newton-Raphson step. The experimental results involve the comparison of the performance of Gaussian and generalized Dirichlet mixtures in the classification of several pattern-recognition data sets.

Journal ArticleDOI
TL;DR: The baseline distribution in the AFT regression model is modeled as a mixture of Dirichlet processes, viewed as a simple extension of existing parametric models, and a novel MCMC scheme is introduced for making posterior inferences.
Abstract: We model the baseline distribution in the accelerated failure-time (AFT) model as a mixture of Dirichlet processes for interval-censored data. This mixture is distinct from Dirichlet process mixtures, and can be viewed as a simple extension of existing parametric models, which we believe is an advantage in the practical modeling of data. We introduce a novel MCMC scheme for the purpose of making posterior inferences for the AFT regression model and illustrate our methods with several real examples.

Journal ArticleDOI
TL;DR: In this paper, a general formulation based on a Dirichlet process prior is presented, which yields the number and composition of mixing components a posteriori, obviating the need for post hoc test procedures, and is capable of approximating any target heterogeneity distribution.
Abstract: The finite normal mixture model has emerged as a dominant methodology for assessing heterogeneity in choice models. Although it extends the classic mixture models by allowing within-component variability, it requires that a relatively large number of models be separately estimated and that fairly difficult test procedures be used to determine the correct number of mixing components. We present a very general formulation, based on a Dirichlet process prior, which yields the number and composition of mixing components a posteriori, obviating the need for post hoc test procedures, and is capable of approximating any target heterogeneity distribution. Adapting Stephens's (2000) algorithm allows the determination of substantively different clusters, as well as a way to sidestep problems arising from label-switching and overlapping mixtures. These methods are illustrated both on simulated data and on A.C. Nielsen scanner panel data for liquid detergents. We find that the large number of mixing components required to adequately represent the heterogeneity distribution can be reduced in practice to a far smaller number of segments of managerial relevance.

01 Jan 2004
TL;DR: In this article, a nonparametric Bayesian treatment for analyzing records containing occurrences of items is described, which retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records.
Abstract: This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.

01 Jan 2004
TL;DR: This paper defines the TTMM, compares it to the related Latent Dirichlet Allocation (LDA) model (Blei et al., 2003), and reports some interesting empirical results.
Abstract: Documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a graphical model for representing documents: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distributions over the topics. This paper defines the TTMM, compares it to the related Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) and reports some interesting empirical results.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions; the resulting method naturally includes a number of features found in other popular methods.
Abstract: An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions. The resulting method naturally includes a number of features found in other popular methods. Specifically, tf.idf-like term weighting and document length normalisation are recovered. The new method is compared with Okapi BM-25 [3] and the Twenty-One model [1] on TREC data and is shown to give better performance.
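The paper derives its weighting from a hierarchical Dirichlet process prior; as a rough point of reference, the closely related Dirichlet-smoothed query likelihood also recovers tf.idf-like weighting and document-length normalisation. A sketch of that standard scorer — not the paper's exact model, and the fallback count for unseen words is an assumption to avoid log 0:

```python
import math
from collections import Counter

def dirichlet_score(query, doc, collection_tf, collection_len, mu=2000.0):
    """Dirichlet-smoothed query-likelihood score (sketch).

    query, doc: lists of terms; collection_tf: corpus term counts
    """
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_bg = collection_tf.get(w, 0.5) / collection_len  # background model
        # smoothed document model: (tf + mu * p_bg) / (|doc| + mu)
        score += math.log((tf[w] + mu * p_bg) / (len(doc) + mu))
    return score
```

Rare terms (small p_bg) contribute more when they do occur in a document, giving the idf-like effect, while the mu term in the denominator normalises for document length.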

Proceedings Article
01 Jan 2004
TL;DR: An unsupervised algorithm based on the Maximum Likelihood (ML) and Fisher scoring methods is proposed for the estimation of the parameters of this mixture, which is used to produce a new texture model.
Abstract: This paper presents a new finite mixture model based on the Multinomial Dirichlet distribution (MDD). For the estimation of the parameters of this mixture we propose an unsupervised algorithm based on the Maximum Likelihood (ML) and Fisher scoring methods. This mixture is used to produce a new texture model. Experimental results concern texture image summarizing and are reported on the Vistex texture image database from the MIT Media Lab.

01 Jan 2004
TL;DR: A model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population, called an attribute ensemble, may depend on the cluster being considered.
Abstract: We propose a model-based approach to identifying clusters of objects based on subsets of attributes, so that the attributes that distinguish a cluster from the rest of the population, called an attribute ensemble, may depend on the cluster being considered. The model is based on a Pólya urn cluster model, which is equivalent to a Dirichlet process mixture of multivariate normal distributions. This model-based approach allows for the incorporation of application-specific data features into the clustering scheme. For example, in an analysis of genetic CGH array data we account for spatial correlation of genetic abnormalities along the genome.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: It turns out that even sophisticated unsupervised methods like multinomial PCA (or latent Dirichlet allocation) cannot help much; by contrast, feature extraction supervised by relevant auxiliary data may help.
Abstract: This work is part of a proactive information retrieval project that aims at estimating relevance from implicit user feedback. The noisy feedback signal needs to be complemented with all available information, and textual content is one of the natural sources. Here we take the first steps by investigating whether this source is at all useful in the challenging setting of estimating the relevance of a new document based on only few samples with known relevance. It turns out that even sophisticated unsupervised methods like multinomial PCA (or latent Dirichlet allocation) cannot help much. By contrast, feature extraction supervised by relevant auxiliary data may help.

Book ChapterDOI
22 Aug 2004
TL;DR: A topographic version of two LCMs for collaborative filtering is presented and the models are applied to a large collection of user ratings for films.
Abstract: Latent class models (LCM) represent the high dimensional data in a smaller dimensional space in terms of latent variables. They are able to automatically discover the patterns from the data. We present a topographic version of two LCMs for collaborative filtering and apply the models to a large collection of user ratings for films. Latent classes are topologically organized on a “star-like” structure. This makes orientation in rating patterns captured by latent classes easier and more systematic. The variation in film rating patterns is modelled by multinomial and binomial distributions with varying independence assumptions.