Showing papers on "Latent Dirichlet allocation" published in 2005


Proceedings Article
05 Dec 2005
TL;DR: The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and a mean-field variational inference algorithm is derived for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial.
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
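
For illustration only (not from the paper): a minimal NumPy sketch contrasting how LDA and the CTM draw per-document topic proportions; the topic count, covariance values, and the use of a full K-dimensional softmax are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of topics (illustrative)

# LDA: proportions drawn from a Dirichlet, which cannot express positive topic correlations
theta_lda = rng.dirichlet(np.full(K, 0.5))

# CTM: proportions from a logistic normal; the covariance lets topics co-occur
mu = np.zeros(K)
Sigma = np.eye(K)
Sigma[0, 1] = Sigma[1, 0] = 0.8               # e.g. a "genetics"/"disease" correlation
eta = rng.multivariate_normal(mu, Sigma)
theta_ctm = np.exp(eta) / np.exp(eta).sum()   # softmax maps R^K onto the simplex

print(theta_lda.round(3), theta_ctm.round(3))
```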

1,046 citations


25 Feb 2005
TL;DR: Given a set of images containing multiple object categories, this work seeks to discover those categories and their image locations without supervision using generative models from the statistical text literature: probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
Abstract: Given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. We achieve this using generative models from the statistical text literature: probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA). In text analysis these are used to discover topics in a corpus using the bag-of-words document representation. Here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. The models are applied to images by using a visual analogue of a word, formed by vector quantizing SIFT-like region descriptors. We investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. The object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. We also demonstrate classification of unseen images and images containing multiple objects. Performance of the proposed unsupervised method is compared to the semi-supervised approach of [7]. (This work was sponsored in part by the EU Project CogViSys, the University of Oxford, Shell Oil, and the National Geospatial-Intelligence Agency.)
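
A hedged sketch of the bag-of-visual-words pipeline described above, using scikit-learn stand-ins rather than the authors' implementation; the random "descriptors", vocabulary size, and topic count are placeholders for real SIFT-like features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Placeholder for SIFT-like region descriptors extracted from each image
descriptors_per_image = [rng.normal(size=(200, 128)) for _ in range(10)]

V = 50  # visual vocabulary size (assumption)
kmeans = KMeans(n_clusters=V, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))   # vector-quantise descriptors into "visual words"

# Each image becomes a bag-of-visual-words count vector (its "document")
counts = np.stack([np.bincount(kmeans.predict(d), minlength=V)
                   for d in descriptors_per_image])

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
topic_mix = lda.transform(counts)   # per-image mixture over discovered "object category" topics
```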

524 citations


Journal ArticleDOI
TL;DR: This work develops a spatial Dirichlet process model for spatial data and discusses its properties, and introduces mixing by convolving this process with a pure error process to produce a random spatial process that is neither Gaussian nor stationary.
Abstract: Customary modeling for continuous point-referenced data assumes a Gaussian process that is often taken to be stationary. When such models are fitted within a Bayesian framework, the unknown parameters of the process are assumed to be random, so a random Gaussian process results. Here we propose a novel spatial Dirichlet process mixture model to produce a random spatial process that is neither Gaussian nor stationary. We first develop a spatial Dirichlet process model for spatial data and discuss its properties. Because of familiar limitations associated with direct use of Dirichlet process models, we introduce mixing by convolving this process with a pure error process. We then examine properties of models created through such Dirichlet process mixing. In the Bayesian framework, we implement posterior inference using Gibbs sampling. Spatial prediction raises interesting questions, but these can be handled. Finally, we illustrate the approach using simulated data, as well as a dataset involving precipitati...
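
As background only, a minimal NumPy sketch of the (truncated) stick-breaking construction of an ordinary Dirichlet process draw, which is the building block the paper extends to the spatial setting; the concentration, truncation level, and Gaussian base measure are illustrative, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, T = 1.0, 100                      # concentration and truncation level (assumed)

v = rng.beta(1.0, alpha, size=T)                             # stick-breaking fractions
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))    # mixture weights
atoms = rng.normal(0.0, 1.0, size=T)                         # atoms from a base measure N(0, 1)

# G = sum_k w_k * delta_{atoms_k}; sampling from G picks atoms by weight,
# which produces the ties / clustering behaviour characteristic of DP models
samples = rng.choice(atoms, size=5, p=w / w.sum())
```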

384 citations


Proceedings Article
30 Jul 2005
TL;DR: The Author-Recipient-Topic (ART) model for social network analysis is presented, which learns topic distributions based on the direction-sensitive messages sent between entities, adding the key attribute that distribution over topics is conditioned distinctly on both the sender and recipient.
Abstract: Previous work in social network analysis (SNA) has modeled the existence of links from one entity to another, but not the language content or topics on those links. We present the Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. The model builds on Latent Dirichlet Allocation (LDA) and the Author-Topic (AT) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient--steering the discovery of topics according to the relationships between people. We give results on both the Enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the ART model better predicts people's roles.
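
A toy generative sketch of the idea in the abstract, namely that the topic distribution is indexed by the (sender, recipient) pair rather than by the document alone; the people, dimensions, and Dirichlet hyperparameters are made up, and this is not the released ART implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000                                  # topics and vocabulary size (illustrative)
people = ["alice", "bob", "carol"]

phi = rng.dirichlet(np.full(V, 0.01), size=K)   # topic -> word distributions
theta = {(a, r): rng.dirichlet(np.full(K, 0.1)) # (sender, recipient) -> topic distribution
         for a in people for r in people if a != r}

def generate_message(sender, recipients, length=20):
    words = []
    for _ in range(length):
        r = str(rng.choice(recipients))              # recipient responsible for this word
        z = rng.choice(K, p=theta[(sender, r)])      # topic conditioned on the (sender, recipient) pair
        words.append(rng.choice(V, p=phi[z]))        # word drawn from that topic
    return words

msg = generate_message("alice", ["bob", "carol"])
```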

342 citations


Proceedings Article
19 Aug 2005
TL;DR: This work proposes a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account, and demonstrates the utility and practicality of the relational entity resolution approach for author resolution in two real-world bibliographic datasets.
Abstract: Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it is a) generative, b) does not make pair-wise decisions and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.

293 citations


Journal ArticleDOI
TL;DR: In this article, the normalized inverse-Gaussian (N-IG) prior is proposed as an alternative to the Dirichlet process to be used in Bayesian hierarchical models.
Abstract: In recent years the Dirichlet process prior has experienced a great success in the context of Bayesian mixture modeling. The idea of overcoming discreteness of its realizations by exploiting it in hierarchical models, combined with the development of suitable sampling techniques, represents one of the reasons for its popularity. In this article we propose the normalized inverse-Gaussian (N–IG) process as an alternative to the Dirichlet process to be used in Bayesian hierarchical models. The N–IG prior is constructed via its finite-dimensional distributions. This prior, although sharing the discreteness property of the Dirichlet prior, is characterized by a more elaborate and sensible clustering which makes use of all the information contained in the data. Whereas in the Dirichlet case the mass assigned to each observation depends solely on the number of times that it occurred, for the N–IG prior the weight of a single observation depends heavily on the whole number of ties in the sample. Moreover, expressio...

211 citations


01 Jan 2005
TL;DR: The authors proposed the Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people.
Abstract: Previous work in social network analysis (SNA) has modeled the existence of links from one entity to another, but not the language content or topics on those links. We present the Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. The model builds on Latent Dirichlet Allocation and the Author-Topic (AT) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient—steering the discovery of topics according to the relationships between people. We give results on both the Enron email corpus and a researcher’s email archive, providing evidence not only that clearly relevant topics are discovered, but that the ART model better predicts people’s roles.

169 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This work proposes a novel Web recommendation system in which collaborative features such as navigation or rating data as well as the content features accessed by the users are seamlessly integrated under the maximum entropy principle.
Abstract: Web users display their preferences implicitly by navigating through a sequence of pages or by providing numeric ratings to some items. Web usage mining techniques are used to extract useful knowledge about user interests from such data. The discovered user models are then used for a variety of applications such as personalized recommendations. Web site content or semantic features of objects provide another source of knowledge for deciphering users' needs or interests. We propose a novel Web recommendation system in which collaborative features such as navigation or rating data as well as the content features accessed by the users are seamlessly integrated under the maximum entropy principle. Both the discovered user patterns and the semantic relationships among Web objects are represented as sets of constraints that are integrated to fit the model. In the case of content features, we use a new approach based on Latent Dirichlet Allocation (LDA) to discover the hidden semantic relationships among items and derive constraints used in the model. Experiments on real Web site usage data sets show that this approach can achieve better recommendation accuracy, when compared to systems using only usage information. The integration of semantic information also allows for better interpretation of the generated recommendations.

116 citations


Book ChapterDOI
TL;DR: In this article, a unified theory for analysis of components in discrete data is presented, and the main families of algorithms discussed are a variational approximation, Gibbs sampling, and Rao-Blackwellised Gibbs sampling.
Abstract: This article presents a unified theory for analysis of components in discrete data, and compares the methods with techniques such as independent component analysis, non-negative matrix factorisation and latent Dirichlet allocation. The main families of algorithms discussed are a variational approximation, Gibbs sampling, and Rao-Blackwellised Gibbs sampling. Applications are presented for voting records from the United States Senate for 2003, and for the Reuters-21578 newswire collection.

111 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: A new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications is proposed, based on the combination of dynamic algorithms in the social network field and semantic content classification methods in the natural language processing and machine learning literatures.
Abstract: In this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. A personal profile, called CommunityNet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. It can be used for personal social capital management. Clusters of CommunityNets provide a view of informal networks for organization management. Our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. We tested CommunityNets on the Enron Email corpus and report experimental results including filtering, prediction, and recommendation capabilities. We show that the personal behavior and intention are somewhat predictable based on these models. For instance, "to whom a person is going to send a specific email" can be predicted by one's personal social network and content analysis. Experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on Latent Dirichlet Allocation with social network enhancement. Two online demo systems we developed that allow interactive exploration of CommunityNet are also discussed.

95 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: An unsupervised dynamic language model (LM) adaptation framework using long-distance latent topic mixtures and the LDA model is combined with the trigram language model using linear interpolation to reduce the perplexity and character error rate.
Abstract: We propose an unsupervised dynamic language model (LM) adaptation framework using long-distance latent topic mixtures. The framework employs the Latent Dirichlet Allocation model (LDA) which models the latent topics of a document collection in an unsupervised and Bayesian fashion. In the LDA model, each word is modeled as a mixture of latent topics. Varying topics within a context can be modeled by re-sampling the mixture weights of the latent topics from a prior Dirichlet distribution. The model can be trained using the variational Bayes Expectation Maximization algorithm. During decoding, mixture weights of the latent topics are adapted dynamically using the hypotheses of previously decoded utterances. In our work, the LDA model is combined with the trigram language model using linear interpolation. We evaluated the approach on the CCTV episode of the RT04 Mandarin Broadcast News test set. Results show that the proposed approach reduces the perplexity by up to 15.4% relative and the character error rate by 4.9% relative depending on the size and setup of the training set.
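
A minimal sketch of the interpolation step mentioned above; the weight, probabilities, and function name are illustrative, and the trigram and topic-adapted LDA components are assumed to be computed elsewhere.

```python
def interpolate(p_trigram, p_lda, lam=0.8):
    """P(w | h) = lam * P_trigram(w | h) + (1 - lam) * P_LDA(w | adapted topic mixture)."""
    return lam * p_trigram + (1.0 - lam) * p_lda

# e.g. the background trigram gives 0.002, the topic-adapted LDA unigram gives 0.01
print(interpolate(0.002, 0.01))   # 0.0036
```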

16 Sep 2005
TL;DR: Time-Sensitive Dirichlet Process Mixture models for clustering are introduced that allow infinite mixture components and have the ability to model time correlations between instances.
Abstract: We introduce Time-Sensitive Dirichlet Process Mixture models for clustering. The models allow infinite mixture components just like standard Dirichlet process mixture models. However, they also have the ability to model time correlations between instances.

01 Jan 2005
TL;DR: Dirichlet prior smoothing’s performance advantage appears to come more from an implicit prior favoring longer documents than from better estimation of the document model.
Abstract: In the language modeling approach to information retrieval, Dirichlet prior smoothing frequently outperforms fixed linear interpolated (aka Jelinek-Mercer) smoothing. The only difference between Dirichlet prior and fixed linear interpolated smoothing is that Dirichlet prior determines the amount of smoothing based on a document’s length. Our hypothesis was that Dirichlet prior smoothing has an implicit document prior that favors longer documents. We tested our hypothesis by first calculating a prior for a given document length from the known relevant documents. We then determined the performance of each smoothing method with and without the document prior. We discovered that when given the document prior, fixed linear interpolated smoothing matches or exceeds the performance of Dirichlet prior smoothing. Dirichlet prior smoothing’s performance advantage appears to come more from an implicit prior favoring longer documents than from better estimation of the document model.
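
For reference, the two standard smoothing formulas being compared (textbook definitions, not the paper's experimental code); mu and lam are typical but arbitrary values.

```python
def dirichlet_prior(count_w_d, doc_len, p_w_collection, mu=2000.0):
    # Smoothing strength shrinks as the document grows: (c(w,d) + mu*p(w|C)) / (|d| + mu)
    return (count_w_d + mu * p_w_collection) / (doc_len + mu)

def jelinek_mercer(count_w_d, doc_len, p_w_collection, lam=0.7):
    # Fixed linear interpolation, independent of document length
    p_ml = count_w_d / doc_len if doc_len else 0.0
    return lam * p_ml + (1.0 - lam) * p_w_collection

# Longer documents are smoothed less under the Dirichlet prior:
print(dirichlet_prior(5, 100, 1e-4), dirichlet_prior(5, 10000, 1e-4))
```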

24 Dec 2005
TL;DR: A new probabilistic generative model is proposed that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics and presents very interesting results on large text corpora.
Abstract: Most of the popular topic models (such as Latent Dirichlet Allocation) have an underlying assumption: bag of words. However, text is indeed a sequence of discrete word tokens, and without considering the order of words (in other words, the nearby context in which a word is located), the exact meaning of language cannot be captured by word co-occurrences alone. In this sense, collocations of words (phrases) have to be considered. However, like individual words, phrases sometimes show polysemy as well depending on the context. More noticeably, a composition of two (or more) words is a phrase in some contexts, but not in other contexts. In this paper, the authors propose a new probabilistic generative model that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics. They present very interesting results on large text corpora.

Proceedings Article
01 Jan 2005
TL;DR: This paper applies efficient variational inference based on DMA, which replaces the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP.
Abstract: This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.
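
A minimal sketch of the Dirichlet-multinomial allocation idea referred to above: a finite symmetric Dirichlet with parameter alpha/N over N components, which behaves like a truncated DP for clustering; alpha, N, and the record count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 1.0, 50                                # concentration and number of components (assumed)

pi = rng.dirichlet(np.full(N, alpha / N))         # component weights under the DMA prior
assignments = rng.choice(N, size=200, p=pi)       # cluster assignment for each record
print(len(np.unique(assignments)), "of", N, "components actually used")
```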

Proceedings ArticleDOI
15 Aug 2005
TL;DR: Dirichlet priors are applied to the term frequency normalisation of the classical BM25 probabilistic model and the Divergence from Randomness PL2 model, and a novel theoretically-driven approach to automatic parameter tuning is proposed.
Abstract: In Information Retrieval (IR), the Dirichlet Priors have been applied to the smoothing technique of the language modeling approach. In this paper, we apply the Dirichlet Priors to the term frequency normalisation of the classical BM25 probabilistic model and the Divergence from Randomness PL2 model. The contributions of this paper are twofold. First, through extensive experiments on four TREC collections, we show that the newly generated models, to which the Dirichlet Priors normalisation is applied, provide robust and effective performance. Second, we propose a novel theoretically-driven approach to the automatic parameter tuning of the Dirichlet Priors normalisation. Experiments show that this tuning approach optimises the retrieval performance of the newly generated Dirichlet Priors-based weighting models.

Journal ArticleDOI
TL;DR: A linear-time algorithm is proposed that defines a distributed predictive model for finite state symbolic sequences which represent the traces of the activity of a number of individuals within a group.
Abstract: To provide a parsimonious generative representation of the sequential activity of a number of individuals within a population there is a necessary tradeoff between the definition of individual specific and global representations. A linear-time algorithm is proposed that defines a distributed predictive model for finite state symbolic sequences which represent the traces of the activity of a number of individuals within a group. The algorithm is based on a straightforward generalization of latent Dirichlet allocation to time-invariant Markov chains of arbitrary order. The modelling assumption made is that the possibly heterogeneous behavior of individuals may be represented by a relatively small number of simple and common behavioral traits which may interleave randomly according to an individual-specific distribution. The results of an empirical study on three different application domains indicate that this modelling approach provides an efficient low-complexity and intuitively interpretable representation scheme which is reflected by improved prediction performance over comparable models.
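
An illustrative first-order sketch of the modelling assumption described above (the paper handles arbitrary order): each "behavioral trait" is a Markov transition matrix and an individual's sequence interleaves traits according to a personal mixture; all parameters here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K = 4, 2                                       # symbols and behavioral traits (illustrative)

trans = rng.dirichlet(np.ones(S), size=(K, S))    # trans[k, s] = trait k's transition row from state s
theta = rng.dirichlet(np.ones(K))                 # this individual's mixture over traits

def generate(length=15, start=0):
    seq = [start]
    for _ in range(length - 1):
        k = rng.choice(K, p=theta)                            # trait responsible for this step
        seq.append(int(rng.choice(S, p=trans[k, seq[-1]])))   # next symbol from that trait's chain
    return seq

print(generate())
```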

Book ChapterDOI
18 May 2005
TL;DR: A regularized probabilistic latent semantic analysis model (RPLSA), which can properly adjust the amount of model flexibility so that not only the training data can be fit well but also the model is robust to avoid the overfitting problem.
Abstract: Mixture models, such as the Gaussian Mixture Model, have been widely used in many applications for modeling data. The Gaussian mixture model (GMM) assumes that data points are generated from a set of Gaussian models with the same set of mixture weights. A natural extension of GMM is the probabilistic latent semantic analysis (PLSA) model, which assigns different mixture weights for each data point. Thus, PLSA is more flexible than the GMM method. However, as a tradeoff, PLSA usually suffers from the overfitting problem. In this paper, we propose a regularized probabilistic latent semantic analysis model (RPLSA), which can properly adjust the amount of model flexibility so that not only can the training data be fit well but also the model is robust enough to avoid the overfitting problem. We conduct an empirical study on the application of speaker identification to show the effectiveness of the new model. The experimental results on the NIST speaker recognition dataset indicate that the RPLSA model outperforms both the GMM and PLSA models substantially. The principle of RPLSA of appropriately adjusting model flexibility can be naturally extended to other applications and other types of mixture models.

01 Jan 2005
TL;DR: This article presents a unified theory for analysis of components in discrete data, and compares the methods with techniques such as independent component analysis (ICA), non-negative matrix factorisation (NMF) and latent Dirichlet allocation (LDA).
Abstract: This article presents a unified theory for analysis of components in discrete data, and compares the methods with techniques such as independent component analysis (ICA), non-negative matrix factorisation (NMF) and latent Dirichlet allocation (LDA). The main families of algorithms discussed are mean field, Gibbs sampling, and Rao-Blackwellised Gibbs sampling. Applications are presented for voting records from the United States Senate for 2003, and the use of components in subsequent classification.


Book ChapterDOI
11 Sep 2005
TL;DR: This paper proposes the use of a neural network based, non-probabilistic, solution, which captures jointly a rich representation of words and documents, as compared to the classical TFIDF representations.
Abstract: Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, non-probabilistic solution, which jointly captures a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.

30 May 2005
TL;DR: The DM model achieves the lowest perplexity level despite its unitopic nature, with parameter estimation methods derived from Minka’s fixed-point methods and the EM algorithm.
Abstract: Word rates in text vary according to global factors such as genre, topic, author, and expected readership (Church and Gale 1995). Models that summarize such global factors in text or at the document level are called ‘text models.’ A finite mixture of Dirichlet distributions (Dirichlet Mixture, or DM for short) was investigated as a new text model. When parameters of a multinomial are drawn from a DM, the compound for discrete outcomes is a finite mixture of Dirichlet-multinomials. A Dirichlet-multinomial can be regarded as a multivariate version of the Poisson mixture, a reliable univariate model for global factors (Church and Gale 1995). In the present paper, the DM and its compounds are introduced, with parameter estimation methods derived from Minka’s fixed-point methods (Minka 2003) and the EM algorithm. The method can estimate a considerable number of parameters of a large DM, i.e., a few hundred thousand parameters. After discussing the relationship of the DM to probabilistic latent semantic analysis (PLSA) (Hofmann 1999), the mixture of unigrams (Nigam et al. 2000), and latent Dirichlet allocation (LDA) (Blei et al. 2001, 2003), statistical language modeling applications are discussed and their performance compared in terms of perplexity. The DM model achieves the lowest perplexity level despite its unitopic nature.
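
A small sketch of the compound described above, i.e. a finite mixture of Dirichlet-multinomials: choose a mixture component, draw multinomial parameters from its Dirichlet, then draw word counts; the weights and concentration values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
V, doc_len = 1000, 200                              # vocabulary size and document length (illustrative)
mix_weights = np.array([0.6, 0.4])
alphas = [np.full(V, 0.05), np.full(V, 0.5)]        # one Dirichlet parameter vector per component

k = rng.choice(len(mix_weights), p=mix_weights)     # pick a mixture component
p = rng.dirichlet(alphas[k])                        # multinomial parameters from that Dirichlet
counts = rng.multinomial(doc_len, p)                # observed word counts for one document
```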

Proceedings ArticleDOI
20 Jun 2005
TL;DR: The generalized Dirichlet distribution offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions, and allows high-dimensional modeling without requiring dimensionality reduction and thus without a loss of information.
Abstract: We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. In addition, the mathematical properties of this distribution allow high-dimensional modeling without requiring dimensionality reduction and thus without a loss of information. The number of clusters is determined using the Minimum Message Length (MML) principle. Parameter estimation is performed by a hybrid stochastic expectation-maximization (HSEM) algorithm. The model is compared with results obtained by other selection criteria (AIC, MDL and MMDL). The performance of our method is tested by real data clustering and by applying it to an image object recognition problem.
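
A sketch of sampling from a generalized Dirichlet distribution via its independent-Beta (Connor-Mosimann) construction, which is where the more general covariance structure mentioned above comes from; the shape parameters are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 3.0, 1.5])                       # alpha_1..alpha_d (illustrative)
b = np.array([5.0, 2.0, 4.0])                       # beta_1..beta_d  (illustrative)

z = rng.beta(a, b)                                  # independent Beta draws
stick = np.concatenate(([1.0], np.cumprod(1.0 - z)[:-1]))
x = z * stick                                       # first d proportions
x = np.append(x, 1.0 - x.sum())                     # last coordinate completes the simplex
print(x, x.sum())
```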

Book ChapterDOI
03 Oct 2005
TL;DR: A probabilistic clustering-projection (PCP) model for discrete data, where they are both represented in a unified framework and Iterating the two operations turns out to be exactly the variational EM algorithm under Bayesian model inference, and thus is guaranteed to improve the data likelihood.
Abstract: For discrete co-occurrence data like documents and words, calculating optimal projections and clustering are two different but related tasks. The goal of projection is to find a low-dimensional latent space for words, and clustering aims at grouping documents based on their feature representations. In general projection and clustering are studied independently, but they both represent the intrinsic structure of data and should reinforce each other. In this paper we introduce a probabilistic clustering-projection (PCP) model for discrete data, where they are both represented in a unified framework. Clustering is seen to be performed in the projected space, and projection explicitly considers clustering structure. Iterating the two operations turns out to be exactly the variational EM algorithm under Bayesian model inference, and thus is guaranteed to improve the data likelihood. The model is evaluated on two text data sets, both showing very encouraging results.

Journal ArticleDOI
TL;DR: In this article, a Bayesian method by assuming generalized Dirichlet priors is presented to calculate the probabilities of future yields in the production model studied by Jewell and Chou, since some of the sorting probabilities for different categories of microelectronic chips tend to be positively correlated.
Abstract: In the production model studied by Jewell and Chou, since some of the sorting probabilities for different categories of microelectronic chips tend to be positively correlated, a Dirichlet distribution is an inappropriate prior for that model. Jewell and Chou therefore propose an approximation approach to predict coproduct yields. Since a generalized Dirichlet distribution allows variables to be positively correlated, a Bayesian method assuming generalized Dirichlet priors is presented in this paper to calculate the probabilities of future yields. We consider not only the mean values, but also either the variances or the covariances of the sorting probabilities to construct generalized Dirichlet priors. The numerical results indicate that the generalized Dirichlet distribution should be a reasonable prior, and the computation in forecasting coproduct output is relatively straightforward with respect to the approximation approach.

Proceedings ArticleDOI
20 Jun 2005
TL;DR: This paper shows that models based on latent Dirichlet allocation (LDA) outperform traditional mixture models in likelihood comparison and can be used for both prediction and classification of new unseen data.
Abstract: In this paper we compare a variety of unsupervised probabilistic models used to represent a data set consisting of textual and image information. We show that those based on latent Dirichlet allocation (LDA) outperform traditional mixture models in likelihood comparison. The data set is taken from radiology: a combination of medical images and consultants' reports. The task of learning to classify individual tissue or disease types requires expert hand-labeled data. This is both expensive to produce and prone to inconsistencies in labeling. Here we present methods that require no hand labeling and also automatically discover sub-types of disease. The learnt models can be used for both prediction and classification of new unseen data.

Proceedings Article
01 Dec 2005
TL;DR: Experimental results show that a structured theme-based language modeling approach is effective in improving retrieval effectiveness for the ad hoc task and the Latent Dirichlet Allocation method is effective in dimension reduction for the categorization task.
Abstract: We report experiment results from the collaborative participation of UIUC and MUSC in the TREC 2005 Genomics Track. We participated in both the ad hoc task and the categorization task, and studied the use of some mixture language models in these tasks. Experiment results show that a structured theme-based language modeling approach is effective in improving retrieval effectiveness for the ad hoc task and the Latent Dirichlet Allocation method is effective in dimension reduction for the categorization task.

Proceedings ArticleDOI
01 Dec 2005
TL;DR: A novel Bayesian PLSA framework is presented that is capable of performing dynamic document indexing and classification and the maximum a posteriori PLSA for corrective training is presented.
Abstract: Probabilistic latent semantic analysis (PLSA) is a popular approach to text modeling where the semantics and statistics in documents can be effectively captured. In this paper, a novel Bayesian PLSA framework is presented. We focus on exploiting the incremental learning algorithm for solving the updating problem of new domain articles. This algorithm is developed to improve text modeling by incrementally extracting the up-to-date latent semantic information to match the changing domains at run time. The expectation-maximization (EM) algorithm is applied to resolve the quasi-Bayes (QB) estimate of PLSA parameters. The online PLSA is constructed to accomplish parameter estimation as well as hyperparameter updating. Compared to standard PLSA using the maximum likelihood estimate, the proposed QB approach is capable of performing dynamic document indexing and classification. Also, we present the maximum a posteriori PLSA for corrective training. Experiments on evaluating model perplexities and classification accuracies demonstrate the superiority of using Bayesian PLSA.


Book ChapterDOI
20 Jul 2005
TL;DR: This paper presents a novel method of automatic image semantic annotation based on the Image-Keyword Document Model with image feature discretization; it demonstrates a great improvement in annotation performance compared with known discretization-based image annotation models such as CMRM.
Abstract: This paper presents a novel method of automatic image semantic annotation. Our approach is based on the Image-Keyword Document Model (IKDM) with image feature discretization. According to IKDM, image keyword annotation is conducted using image similarity measurement based on a language model from the text information retrieval domain. Through experiments on a test set of 5000 annotated images, our approach demonstrates a great improvement in annotation performance compared with known discretization-based image annotation models such as CMRM. Our approach is also faster at annotation than continuous models such as CRM.