Showing papers on "Latent Dirichlet allocation published in 2007"

PDF

Open Access

Proceedings Article•

[...]

David M. Blei¹, Jon McAuliffe²•Institutions (2)

Princeton University¹, University of Pennsylvania²

03 Dec 2007

TL;DR: The supervised latent Dirichlet allocation (sLDA) model, a statistical model of labelled documents, is introduced, which derives a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations.

...read moreread less

Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

...read moreread less

1,383 citations

Journal Article•DOI•

A correlated topic model of Science

[...]

David M. Blei, John Lafferty

27 Aug 2007-arXiv: Applications

TL;DR: The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and it is demonstrated its use as an exploratory tool of large document collections.

...read moreread less

Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139--177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990--1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.

...read moreread less

1,100 citations

Journal Article•DOI•

A correlated topic model of Science

[...]

David M. Blei, John Lafferty

01 Jun 2007-The Annals of Applied Statistics

TL;DR: The correlated topic model (CTM) as mentioned in this paper uses the logistic normal distribution to model the topic proportions, which is a variant of the Dirichlet distribution used in LDA.

...read moreread less

Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.

...read moreread less

1,053 citations

Proceedings Article•DOI•

Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

[...]

Xuerui Wang¹, Andrew McCallum¹, Xing Wei¹•Institutions (1)

University of Massachusetts Amherst¹

28 Oct 2007

TL;DR: Topical n-grams as discussed by the authors is a probabilistic model that generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigrams or bigrams distribution.

...read moreread less

Abstract: Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can model "white house" as a special meaning phrase in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.

...read moreread less

510 citations

Journal Article•DOI•

Topic and role discovery in social networks with experiments on enron and academic email

[...]

Andrew McCallum¹, Xuerui Wang¹, Andres Corrada-Emmanuel¹•Institutions (1)

University of Massachusetts Amherst¹

01 Sep 2007-Journal of Artificial Intelligence Research

TL;DR: The Author-Recipient-Topic model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities, is presented and results are given, providing evidence not only that clearly relevant topics are discovered, but that the ART model better predicts people's roles and gives lower perplexity on previously unseen messages.

...read moreread less

Abstract: Previous work in social network analysis (SNA) has modeled the existence of links from one entity to another, but not the attributes such as language content or topics on those links. We present the Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. The model builds on Latent Dirichlet Allocation (LDA) and the Author-Topic (AT) model, adding the key attribute that distribution over topics is conditioned distinctly on both the sender and recipient--steering the discovery of topics according to the relationships between people. We give results on both the Enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the ART model better predicts people's roles and gives lower perplexity on previously unseen messages. We also present the Role-Author-Recipient-Topic (RART) model, an extension to ART that explicitly represents people's roles.

...read moreread less

484 citations

Journal Article•DOI•

Sampling the Dirichlet Mixture Model with Slices

[...]

Stephen G. Walker¹•Institutions (1)

University of Kent¹

29 May 2007-Communications in Statistics - Simulation and Computation

TL;DR: The key to the algorithm detailed in this article, which also keeps the random distribution functions, is the introduction of a latent variable which allows a finite number of objects to be sampled within each iteration of a Gibbs sampler.

...read moreread less

Abstract: We provide a new approach to the sampling of the well known mixture of Dirichlet process model. Recent attention has focused on retention of the random distribution function in the model, but sampling algorithms have then suffered from the countably infinite representation these distributions have. The key to the algorithm detailed in this article, which also keeps the random distribution functions, is the introduction of a latent variable which allows a finite number, which is known, of objects to be sampled within each iteration of a Gibbs sampler.

...read moreread less

482 citations

Proceedings Article•DOI•

Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes

[...]

Liangliang Cao¹, Li Fei-Fei²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Princeton University²

26 Dec 2007

TL;DR: Spatial-LTM represents an image containing objects in a hierarchical way by over-segmented image regions of homogeneous appearances and the salient image patches within the regions, enforcing the spatial coherency of the model.

...read moreread less

Abstract: We present a novel generative model for simultaneously recognizing and segmenting object and scene classes. Our model is inspired by the traditional bag of words representation of texts and images as well as a number of related generative models, including probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA). A major drawback of the pLSA and LDA models is the assumption that each patch in the image is independently generated given its corresponding latent topic. While such representation provides an efficient computational method, it lacks the power to describe the visually coherent images and scenes. Instead, we propose a spatially coherent latent topic model (spatial-LTM). Spatial-LTM represents an image containing objects in a hierarchical way by over-segmented image regions of homogeneous appearances and the salient image patches within the regions. Only one single latent topic is assigned to the image patches within each region, enforcing the spatial coherency of the model. This idea gives rise to the following merits of spatial-LTM: (1) spatial-LTM provides a unified representation for spatially coherent bag of words topic models; (2) spatial-LTM can simultaneously segment and classify objects, even in the case of occlusion and multiple instances; and (3) spatial-LTM can be trained either unsupervised or supervised, as well as when partial object labels are provided. We verify the success of our model in a number of segmentation and classification experiments.

...read moreread less

392 citations

Proceedings Article•DOI•

Unsupervised Activity Perception by Hierarchical Bayesian Models

[...]

Xiaogang Wang¹, Xiaoxu Ma¹, Eric Grimson¹•Institutions (1)

Massachusetts Institute of Technology¹

17 Jun 2007

TL;DR: A novel unsupervised learning framework for activity perception to understand activities in complicated scenes from visual data using a hierarchical Bayesian model to connect three elements: low-level visual features, simple "atomic" activities, and multi-agent interactions.

...read moreread less

Abstract: We propose a novel unsupervised learning framework for activity perception. To understand activities in complicated scenes from visual data, we propose a hierarchical Bayesian model to connect three elements: low-level visual features, simple "atomic" activities, and multi-agent interactions. Atomic activities are modeled as distributions over low-level visual features, and interactions are modeled as distributions over atomic activities. Our models improve existing language models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) by modeling interactions without supervision. Our data sets are challenging video sequences from crowded traffic scenes with many kinds of activities co-occurring. Our approach provides a summary of typical atomic activities and interactions in the scene. Unusual activities and interactions are found, with natural probabilistic explanations. Our method supports flexible high-level queries on activities and interactions using atomic activities as components.

...read moreread less

350 citations

Book Chapter•DOI•

Data mining for web personalization

[...]

Bamshad Mobasher¹•Institutions (1)

DePaul University¹

01 Jan 2007

TL;DR: An overview of Web personalization process viewed as an application of data mining requiring support for all the phases of a typical data mining cycle, including data collection and pre-processing, pattern discovery and evaluation, and finally applying the discovered knowledge in real-time to mediate between the user and the Web.

...read moreread less

Abstract: In this chapter we present an overview of Web personalization process viewed as an application of data mining requiring support for all the phases of a typical data mining cycle. These phases include data collection and pre-processing, pattern discovery and evaluation, and finally applying the discovered knowledge in real-time to mediate between the user and the Web. This view of the personalization process provides added flexibility in leveraging multiple data sources and in effectively using the discovered models in an automatic personalization system. The chapter provides a detailed discussion of a host of activities and techniques used at different stages of this cycle, including the preprocessing and integration of data from multiple sources, as well as pattern discovery techniques that are typically applied to this data. We consider a number of classes of data mining algorithms used particularly forWeb personalization, including techniques based on clustering, association rule discovery, sequential pattern mining, Markov models, and probabilistic mixture and hidden (latent) variable models. Finally, we discuss hybrid data mining frameworks that leverage data from a variety of channels to provide more effective personalization solutions.

...read moreread less

344 citations

Proceedings Article•

Hidden Topic Markov Models

[...]

Amit Gruber¹, Yair Weiss, Michal Rosen-Zvi•Institutions (1)

Hebrew University of Jerusalem¹

11 Mar 2007

TL;DR: This paper proposes modeling the topics of words in the document as a Markov chain, and shows that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics.

...read moreread less

Abstract: Algorithms such as Latent Dirichlet Allocation (LDA) have achieved significant progress in modeling word document relationships. These algorithms assume each word in the document was generated by a hidden topic and explicitly model the word distribution of each topic as well as the prior distribution over topics in the document. Given these parameters, the topics of all words in the same document are assumed to be independent. In this paper, we propose modeling the topics of words in the document as a Markov chain. Specifically, we assume that all words in the same sentence have the same topic, and successive sentences are more likely to have the same topics. Since the topics are hidden, this leads to using the well-known tools of Hidden Markov Models for learning and inference. We show that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics. Quantitatively, we show that we obtain better perplexity in modeling documents with only a modest increase in learning and inference complexity.

...read moreread less

290 citations

Proceedings Article•

Spatial Latent Dirichlet Allocation

[...]

Xiaogang Wang¹, Eric Grimson¹•Institutions (1)

Massachusetts Institute of Technology¹

03 Dec 2007

TL;DR: A topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems, is proposed and used to discover objects from a collection of images.

...read moreread less

Abstract: In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a "bag-of-words". It is also critical to properly design "words" and "documents" when using a language model to solve vision problems. In this paper, we propose a topic model Spatial Latent Dirichlet Allocation (SLDA), which better encodes spatial structures among visual words that are essential for solving many vision problems. The spatial information is not encoded in the values of visual words but in the design of documents. Instead of knowing the partition of words into documents a priori, the word-document assignment becomes a random hidden variable in SLDA. There is a generative procedure, where knowledge of spatial structure can be flexibly added as a prior, grouping visual words which are close in space into the same document. We use SLDA to discover objects from a collection of images, and show it achieves better performance than LDA.

...read moreread less

Proceedings Article•

Distributed Inference for Latent Dirichlet Allocation

[...]

David Newman¹, Padhraic Smyth¹, Max Welling¹, Arthur U. Asuncion¹•Institutions (1)

University of California, Irvine¹

03 Dec 2007

TL;DR: Using five real-world text corpora, it is shown that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning.

...read moreread less

Abstract: We investigate the problem of learning a widely-used latent-variable model - the Latent Dirichlet Allocation (LDA) or "topic" model - using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.

...read moreread less

Proceedings Article•

A Topic Model for Word Sense Disambiguation

[...]

Jordan Boyd-Graber¹, David M. Blei¹, Xiaojin Zhu²•Institutions (2)

Princeton University¹, University of Wisconsin-Madison²

01 Jun 2007

TL;DR: A probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word is developed.

...read moreread less

Abstract: We develop latent Dirichlet allocation with WORDNET (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. We develop a probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word. Using the WORDNET hierarchy, we embed the construction of Abney and Light (1999) in the topic model and show that automatically learned domains improve WSD accuracy compared to alternative contexts.

...read moreread less

Journal Article•DOI•

Controlling the reinforcement in Bayesian non-parametric mixture models

[...]

Antonio Lijoi¹, Ramsés H. Mena², Igor Prünster³•Institutions (3)

University of Pavia¹, National Autonomous University of Mexico², University of Turin³

01 Sep 2007-Journal of The Royal Statistical Society Series B-statistical Methodology

TL;DR: A Bayesian non‐parametric approach is taken and adopt a hierarchical model with a suitable non-parametric prior obtained from a generalized gamma process to solve the problem of determining the number of components in a mixture model.

...read moreread less

Abstract: Summary. The paper deals with the problem of determining the number of components in a mixture model. We take a Bayesian non-parametric approach and adopt a hierarchical model with a suitable non-parametric prior for the latent structure. A commonly used model for such a problem is the mixture of Dirichlet process model. Here, we replace the Dirichlet process with a more general non-parametric prior obtained from a generalized gamma process. The basic feature of this model is that it yields a partition structure for the latent variables which is of Gibbs type. This relates to the well-known (exchangeable) product partition models. If compared with the usual mixture of Dirichlet process model the advantage of the generalization that we are examining relies on the availability of an additional parameter σ belonging to the interval (0,1): it is shown that such a parameter greatly influences the clustering behaviour of the model. A value of σ that is close to 1 generates a large number of clusters, most of which are of small size. Then, a reinforcement mechanism which is driven by σ acts on the mass allocation by penalizing clusters of small size and favouring those few groups containing a large number of elements. These features turn out to be very useful in the context of mixture modelling. Since it is difficult to specify a priori the reinforcement rate, it is reasonable to specify a prior for σ. Hence, the strength of the reinforcement mechanism is controlled by the data.

...read moreread less

Proceedings Article•

The Infinite PCFG Using Hierarchical Dirichlet Processes

[...]

Percy Liang, Slav Petrov, Michael I. Jordan, Dan Klein¹•Institutions (1)

University of California, Berkeley¹

01 Jun 2007

TL;DR: This work presents a nonparametric Bayesian model of tree structures based on the hierarchical Dirichlet process (HDP) and develops an efficient variational inference procedure that can be applied to full-scale parsing applications.

...read moreread less

Abstract: We present a nonparametric Bayesian model of tree structures based on the hierarchical Dirichlet process (HDP). Our HDP-PCFG model allows the complexity of the grammar to grow as more training data is available. In addition to presenting a fully Bayesian model for the PCFG, we also develop an efficient variational inference procedure. On synthetic data, we recover the correct grammar without having to specify its complexity in advance. We also show that our techniques can be applied to full-scale parsing applications by demonstrating its effectiveness in learning state-split grammars.

...read moreread less

Journal Article•DOI•

Generalized spatial dirichlet process models

[...]

Jason A. Duan¹, Michele Guindani², Alan E. Gelfand³•Institutions (3)

Yale University¹, University of New Mexico², Duke University³

01 Dec 2007-Biometrika

TL;DR: In this article, a generalized spatial Dirichlet process is proposed for point-referenced data, which allows different surface selection at different sites, and the marginal distribution of the effect at each site still comes from a Gaussian process.

...read moreread less

Abstract: SUMMARY Many models for the study of point-referenced data explicitly introduce spatial random effects to capture residual spatial association. These spatial effects are customarily modelled as a zeromean stationary Gaussian process. The spatial Dirichlet process introduced by Gelfand et al. (2005) produces a random spatial process which is neither Gaussian nor stationary. Rather, it varies about a process that is assumed to be stationary and Gaussian. The spatial Dirichlet process arises as a probability-weighted collection of random surfaces. This can be limiting for modelling and inferential purposes since it insists that a process realization must be one of these surfaces. We introduce a random distribution for the spatial effects that allows different surface selection at different sites. Moreover, we can specify the model so that the marginal distribution of the effect at each site still comes from a Dirichlet process. The development is offered constructively, providing a multivariate extension of the stick-breaking representation of the weights. We then introduce mixing using this generalized spatial Dirichlet process. We illustrate with a simulated dataset of independent replications and note that we can embed the generalized process within a dynamic model specification to eliminate the independence assumption.

...read moreread less

Proceedings Article•

Collapsed Variational Inference for HDP

[...]

Yee Whye Teh¹, Kenichi Kurihara², Max Welling³•Institutions (3)

University College London¹, Tokyo Institute of Technology², University of California, Irvine³

03 Dec 2007

TL;DR: This work obtains the first variational algorithm to deal with the hierarchical Dirichlet process and with hyperparameters ofDirichlet variables, and shows a significant improvement in accuracy.

...read moreread less

Abstract: A wide variety of Dirichlet-multinomial 'topic' models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational latent Dirichlet allocation, did not deal with model selection nor did it include inference for hyperparameters. We address both issues by generalizing the technique, obtaining the first variational algorithm to deal with the hierarchical Dirichlet process and to deal with hyperparameters of Dirichlet variables. Experiments show a significant improvement in accuracy.

...read moreread less

Proceedings Article•DOI•

Efficient topic-based unsupervised name disambiguation

[...]

Yang Song¹, Jian Huang¹, Isaac G. Councill¹, Jia Li¹, C. Lee Giles¹ - Show less +1 more•Institutions (1)

Pennsylvania State University¹

18 Jun 2007

TL;DR: This paper presents an efficient and effective two-stage approach to disambiguate person names within web pages and scientific documents and empirically addressed the issue of scalability bydisambiguating authors in over 750,000 papers from the entire CiteSeer dataset.

...read moreread less

Abstract: Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.

...read moreread less

Journal Article•DOI•

High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length

[...]

Nizar Bouguila¹, Djemel Ziou²•Institutions (2)

Concordia University¹, Université de Sherbrooke²

01 Oct 2007-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work considers the application of the minimum message length (MML) principle to determine the number of clusters in a finite mixture model based on the generalized Dirichlet distribution.

...read moreread less

Abstract: We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model that best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of Web pages, and texture database summarization for efficient retrieval.

...read moreread less

Proceedings Article•

Topic models over text streams: A study of batch arid online unsupervised learning

[...]

Arindam Banerjee¹, Sugato Basu²•Institutions (2)

University of Minnesota¹, SRI International²

01 Dec 2007

TL;DR: This paper analyzes three batch topic models that have been recently proposed in the machine learning and data mining community – Latent Dirichlet Allocation (LDA),Dirichlet Compound Multinomial (DCM) mixtures and von-Mises Fisher (vMF) mixture models and proposes a practical heuristic for hybrid topic modeling.

...read moreread less

Abstract: Automated unsupervised learning of topic-based clusters is used in various text data mining applications, e.g., document organization in content management, information retrieval and filtering in news aggregation services. Typically batch models are used for this purpose, which perform clustering on the document collection in aggregate. In this paper, we first analyze three batch topic models that have been recently proposed in the machine learning and data mining community – Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures and von-Mises Fisher (vMF) mixture models. Our discussion uses a common framework based on the particular assumptions made regarding the conditional distributions corresponding to each component and the topic priors. Our experiments on large real-world document collections demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. In cases where offline clustering on complete document collections is infeasible due to resource constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on real-world streaming text illustrate the speed and performance benefits of online vMF. Finally, we propose a practical heuristic for hybrid topic modeling, which learns online topic models on streaming text data and intermittently runs batch topic models on aggregated documents offline. Such a hybrid model is useful for applications (e.g., dynamic topic-based aggregation of consumer-generated content in social networking sites) that need a good tradeoff between the performance of batch offline algorithms and efficiency of incremental online algorithms.

...read moreread less

Proceedings Article•

A Bayesian LDA-based model for semi-supervised part-of-speech tagging

[...]

Kristina Toutanova¹, Mark Johnson²•Institutions (2)

Microsoft¹, Brown University²

03 Dec 2007

TL;DR: A novel Bayesian model for semi-supervised part-of-speech tagging that outperforms the best previously proposed model for this task on a standard dataset and introduces a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words.

...read moreread less

Abstract: We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words' distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset.

...read moreread less

Proceedings Article•DOI•

An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks

[...]

Haizheng Zhang¹, Baojun Qiu¹, C.L. Giles¹, Henry C. Foley¹, John Yen¹ - Show less +1 more•Institutions (1)

Pennsylvania State University¹

23 May 2007

TL;DR: An LDA(latent Dirichlet Allocation)-based hierarchical Bayesian algorithm, namely SSN-LDA (simple social network LDA), which is promising for discovering community structures in large-scale networks.

...read moreread less

Abstract: Community discovery has drawn significant research interests among researchers from many disciplines for its increasing application in multiple, disparate areas, including computer science, biology, social science and so on. This paper describes an LDA(latent Dirichlet Allocation)-based hierarchical Bayesian algorithm, namely SSN-LDA (simple social network LDA). In SSN-LDA, communities are modeled as latent variables in the graphical model and defined as distributions over the social actor space. The advantage of SSN-LDA is that it only requires topological information as input. This model is evaluated on two research collaborative networkst: CtteSeer and NanoSCI. The experimental results demonstrate that this approach is promising for discovering community structures in large-scale networks.

...read moreread less

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

[...]

Bernhard Schölkopf, John Platt, Thomas Hofmann

01 Jan 2007

TL;DR: In this article, collapsed variational Bayes and Gibbs sampling have been used for LDA, and showed that it is computationally efficient, easy to implement and significantly more accurate than standard variational bayesian inference.

...read moreread less

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA

...read moreread less

Proceedings Article•

Dynamic mixture models for multiple time series

[...]

Xing Wei, Jimeng Sun¹, Xuerui Wang•Institutions (1)

Carnegie Mellon University¹

06 Jan 2007

TL;DR: This paper applies Dynamic Mixture Models (DMMs) for online pattern discovery in multiple time series to two real-world datasets, and achieves significantly better results with intuitive interpretation.

...read moreread less

Abstract: Traditional probabilistic mixture models such as Latent Dirichlet Allocation imply that data records (such as documents) are fully exchangeable. However, data are naturally collected along time, thus obey some order in time. In this paper, we present Dynamic Mixture Models (DMMs) for online pattern discovery in multiple time series. DMMs do not have the noticeable drawback of the SVD-based methods for data streams: negative values in hidden variables are often produced even with all non-negative inputs. We apply DMM models to two real-world datasets, and achieve significantly better results with intuitive interpretation.

...read moreread less

Proceedings Article•DOI•

Image retrieval on large-scale image databases

[...]

Eva Hörster¹, Rainer Lienhart¹, Malcolm Slaney²•Institutions (2)

University of Augsburg¹, Yahoo!²

09 Jul 2007

TL;DR: This work studies the representation of images by Latent Dirichlet Allocation (LDA) models for content-based image retrieval, and shows the suitability of the approach for large-scale databases.

...read moreread less

Abstract: Online image repositories such as Flickr contain hundreds of millions of images and are growing quickly. Along with that the needs for supporting indexing, searching and browsing is becoming more and more pressing. In this work we will employ the image content as a source of information to retrieve images. We study the representation of images by Latent Dirichlet Allocation (LDA) models for content-based image retrieval. Image representations are learned in an unsupervised fashion, and each image is modeled as the mixture of topics/object parts depicted in the image. This allows us to put images into subspaces for higher-level reasoning which in turn can be used to find similar images. Different similarity measures based on the described image representation are studied. The presented approach is evaluated on a real world image database consisting of more than 246,000 images and compared to image models based on probabilistic Latent Semantic Analysis (pLSA). Results show the suitability of the approach for large-scale databases. Finally we incorporate active learning with user relevance feedback in our framework, which further boosts the retrieval performance.

...read moreread less

Book Chapter•DOI•

Semi-latent Dirichlet allocation: a hierarchical model for human action recognition

[...]

Yang Wang¹, Payam Sabzmeydani¹, Greg Mori¹•Institutions (1)

Simon Fraser University¹

20 Oct 2007

TL;DR: This work proposes a new method for human action recognition from video sequences using latent topic models, which achieves much better performance by utilizing the information provided by the class labels in the training set.

...read moreread less

Abstract: We propose a new method for human action recognition from video sequences using latent topic models. Video sequences are represented by a novel "bag-of-words" representation, where each frame corresponds to a "word". The major difference between our model and previous latent topic models for recognition problems in computer vision is that, our model is trained in a "semi-supervised" way. Our model has several advantages over other similar models. First of all, the training is much easier due to the decoupling of the model parameters. Secondly, it naturally solves the problem of how to choose the appropriate number of latent topics. Thirdly, it achieves much better performance by utilizing the information provided by the class labels in the training set. We present action classification and irregularity detection results, and show improvement over previous methods.

...read moreread less

Book Chapter•DOI•

Statistical Debugging Using Latent Topic Models

[...]

David Andrzejewski¹, Anne Mulhern¹, Ben Liblit¹, Xiaojin Zhu¹•Institutions (1)

University of Wisconsin-Madison¹

17 Sep 2007

TL;DR: Qualitative evaluation by domain experts suggests that the novel Delta-Latent-Dirichlet-Allocation model outperforms existing statistical methods for bug cause identification, and may help support other software tasks not addressed by earlier models.

...read moreread less

Abstract: Statistical debugging uses machine learning to model program failures and help identify root causes of bugs. We approach this task using a novel Delta-Latent-Dirichlet-Allocation model. We model execution traces attributed to failed runs of a program as being generated by two types of latent topics: normal usage topics and bug topics. Execution traces attributed to successful runs of the same program, however, are modeled by usage topics only. Joint modeling of both kinds of traces allows us to identify weak bug topics that would otherwise remain undetected. We perform model inference with collapsed Gibbs sampling. In quantitative evaluations on four real programs, our model produces bug topics highly correlated to the true bugs, as measured by the Rand index. Qualitative evaluation by domain experts suggests that our model outperforms existing statistical methods for bug cause identification, and may help support other software tasks not addressed by earlier models.

...read moreread less

Journal Article•DOI•

Employing Latent Dirichlet Allocation for fraud detection in telecommunications

[...]

Dongshan Xing¹, Mark Girolami¹•Institutions (1)

University of Glasgow¹

01 Oct 2007-Pattern Recognition Letters

TL;DR: This paper employs Latent Dirichlet Allocation (LDA) to build user profile signatures and assumes that any significant unexplainable deviations from the normal activity of an individual user is strongly correlated with fraudulent activity.

...read moreread less

Proceedings Article•

Sparse Overcomplete Latent Variable Decomposition of Counts Data

[...]

Madhusudana Shashanka, Bhiksha Raj¹, Paris Smaragdis²•Institutions (2)

Mitsubishi Electric Research Laboratories¹, Adobe Systems²

03 Dec 2007

TL;DR: This paper starts with the PLSA framework and uses an entropic prior in a maximum a posteriori formulation to enforce sparsity and shows that this allows the extraction of overcomplete sets of latent components which better characterize the data.

...read moreread less

Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the "expressiveness" of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

...read moreread less

Proceedings Article•

Nonparametric Bayes pachinko allocation

[...]

Wei Li¹, David M. Blei², Andrew McCallum¹•Institutions (2)

University of Massachusetts Amherst¹, Princeton University²

19 Jul 2007

TL;DR: This paper proposed a nonparametric Bayesian prior for PAM based on a variant of the hierarchical Dirichlet process (HDP), which can capture topic correlations defined by nested data structure, but it does not automatically discover such correlations from unstructured data.

...read moreread less

Abstract: Recent advances in topic models have explored complicated structured distributions to represent topic correlation. For example, the pachinko allocation model (PAM) captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). While PAM provides more flexibility and greater expressive power than previous models like latent Dirichlet allocation (LDA), it is also more difficult to determine the appropriate topic structure for a specific dataset. In this paper, we propose a nonparametric Bayesian prior for PAM based on a variant of the hierarchical Dirichlet process (HDP). Although the HDP can capture topic correlations defined by nested data structure, it does not automatically discover such correlations from unstructured data. By assuming an HDP-based prior for PAM, we are able to learn both the number of topics and how the topics are correlated. We evaluate our model on synthetic and real-world text datasets, and show that nonparametric PAM achieves performance matching the best of PAM without manually tuning the number of topics.

...read moreread less