Showing papers on "Dynamic topic model published in 2010"

PDF

Open Access

Journal Article•DOI•

[...]

David M. Blei¹, Lawrence Carin², David B. Dunson²•Institutions (2)

18 Oct 2010-IEEE Signal Processing Magazine

TL;DR: In this paper, a review of probabilistic topic models can be found, which can be used to summarize a large collection of documents with a smaller number of distributions over words.

...read moreread less

Abstract: In this article, we review probabilistic topic models: graphical models that can be used to summarize a large collection of documents with a smaller number of distributions over words. Those distributions are called "topics" because, when fit to data, they capture the salient themes that run through the collection. We describe both finite-dimensional parametric topic models and their Bayesian nonparametric counterparts, which are based on the hierarchical Dirichlet process (HDP). We discuss two extensions of topic models to time-series data-one that lets the topics slowly change over time and one that lets the assumed prevalence of the topics change. Finally, we illustrate the application of topic models to nontext data, summarizing some recent research results in image analysis.

...read moreread less

1,429 citations

Proceedings Article•

Timeline: a dynamic hierarchical dirichlet process model for recovering birth/death and evolution of topics in text stream

[...]

Amr Ahmed¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

08 Jul 2010

TL;DR: In this paper, infinite dynamic topic models, iDTM, are introduced that can accommodate the evolution of all the aforementioned aspects of the latent structure such as the number of topics, the topics' distribution and popularity are time-evolving.

...read moreread less

Abstract: Topic models have proven to be a useful tool for discovering latent structures in document collections. However, most document collections often come as temporal streams and thus several aspects of the latent structure such as the number of topics, the topics' distribution and popularity are time-evolving. Several models exist that model the evolution of some but not all of the above aspects. In this paper we introduce infinite dynamic topic models, iDTM, that can accommodate the evolution of all the aforementioned aspects. Our model assumes that documents are organized into epochs, where the documents within each epoch are exchangeable but the order between the documents is maintained across epochs. iDTM allows for unbounded number of topics: topics can die or be born at any epoch, and the representation of each topic can evolve according to a Markovian dynamics. We use iDTM to analyze the birth and evolution of topics in the NIPS community and evaluated the efficacy of our model on both simulated and real datasets with favorable outcome.

...read moreread less

175 citations

Proceedings Article•

A Language-based Approach to Measuring Scholarly Impact

[...]

Sean Gerrish¹, David M. Blei¹•Institutions (1)

Princeton University¹

21 Jun 2010

TL;DR: This work proposes using changes in the thematic content of documents over time to measure the importance of individual documents within the collection, and describes a dynamic topic model for both quantifying and qualifying the impact of these documents.

...read moreread less

Abstract: Identifying the most influential documents in a corpus is an important problem in many fields, from information science and historiography to text summarization and news aggregation. Unfortunately, traditional bibliometrics such as citations are often not available. We propose using changes in the thematic content of documents over time to measure the importance of individual documents within the collection. We describe a dynamic topic model for both quantifying and qualifying the impact of these documents. We validate the model by analyzing three large corpora of scientific articles. Our measurement of a document's impact correlates significantly with its number of citations.

...read moreread less

145 citations

Proceedings Article•

Word Features for Latent Dirichlet Allocation

[...]

James Petterson¹, Alexander J. Smola², Tibério S. Caetano³, Wray Buntine¹, Shravan Narayanamurthy² - Show less +1 more•Institutions (3)

NICTA¹, Yahoo!², Australian National University³

01 Jan 2010

TL;DR: This work extends Latent Dirichlet Allocation by explicitly allowing for the encoding of side information in the distribution over words, which results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages.

...read moreread less

Abstract: We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model.

...read moreread less

115 citations

Proceedings Article•DOI•

Online multiscale dynamic topic models

[...]

Tomoharu Iwata¹, Takeshi Yamada¹, Yasushi Sakurai¹, Naonori Ueda¹•Institutions (1)

Nippon Telegraph and Telephone¹

25 Jul 2010

TL;DR: An online topic model for sequentially analyzing the time evolution of topics in document collections is proposed based on a stochastic EM algorithm, in which the model is sequentially updated using newly obtained data; this means that past data are not required to make the inference.

...read moreread less

Abstract: We propose an online topic model for sequentially analyzing the time evolution of topics in document collections. Topics naturally evolve with multiple timescales. For example, some words may be used consistently over one hundred years, while other words emerge and disappear over periods of a few days. Thus, in the proposed model, current topic-specific distributions over words are assumed to be generated based on the multiscale word distributions of the previous epoch. Considering both the long-timescale dependency as well as the short-timescale dependency yields a more robust model. We derive efficient online inference procedures based on a stochastic EM algorithm, in which the model is sequentially updated using newly obtained data; this means that past data are not required to make the inference. We demonstrate the effectiveness of the proposed method in terms of predictive performance and computational efficiency by examining collections of real documents with timestamps.

...read moreread less

114 citations

Proceedings Article•

Word Features for Latent Dirichlet Allocation

[...]

James Petterson¹, Wray Buntine¹, Shravan Narayanamurthy², Tibério S. Caetano¹, Alexander J. Smola² - Show less +1 more•Institutions (2)

NICTA¹, Yahoo!²

06 Dec 2010

TL;DR: The authors extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words, which results in improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages.

...read moreread less

91 citations

Proceedings Article•

Cross-Lingual Latent Topic Extraction

[...]

Duo Zhang¹, Qiaozhu Mei², ChengXiang Zhai¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of Michigan²

11 Jul 2010

TL;DR: A new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) is proposed which extends the probabilistic LatentSemantic Analysis model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.

...read moreread less

Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

...read moreread less

90 citations

Proceedings Article•

Term Weighting Schemes for Latent Dirichlet Allocation

[...]

Andrew T. Wilson¹, Peter A. Chew•Institutions (1)

Sandia National Laboratories¹

02 Jun 2010

TL;DR: It is shown that the 'problem' of high-frequency words can be dealt with more elegantly, and in a way that to the authors' knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI).

...read moreread less

Abstract: Many implementations of Latent Dirichlet Allocation (LDA), including those described in Blei et al. (2003), rely at some point on the removal of stopwords, words which are assumed to contribute little to the meaning of the text. This step is considered necessary because otherwise high-frequency words tend to end up scattered across many of the latent topics without much rhyme or reason. We show, however, that the 'problem' of high-frequency words can be dealt with more elegantly, and in a way that to our knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI). Our proposed weighting methods not only make theoretical sense, but can also be shown to improve precision significantly on a non-trivial cross-language retrieval task.

...read moreread less

83 citations

Proceedings Article•DOI•

Topic models with power-law using Pitman-Yor process

[...]

Issei Sato¹, Hiroshi Nakagawa¹•Institutions (1)

University of Tokyo¹

25 Jul 2010

TL;DR: A novel topic model using the Pitman-Yor(PY) process is proposed, called the PY topic model, which captures two properties of a document; a power-law word distribution and the presence of multiple topics.

...read moreread less

Abstract: One important approach for knowledge discovery and data mining is to estimate unobserved variables because latent variables can indicate hidden specific properties of observed data. The latent factor model assumes that each item in a record has a latent factor; the co-occurrence of items can then be modeled by latent factors. In document modeling, a record indicates a document represented as a "bag of words," meaning that the order of words is ignored, an item indicates a word and a latent factor indicates a topic. Latent Dirichlet allocation (LDA) is a widely used Bayesian topic model applying the Dirichlet distribution over the latent topic distribution of a document having multiple topics. LDA assumes that latent topics, i.e., discrete latent variables, are distributed according to a multinomial distribution whose parameters are generated from the Dirichlet distribution. LDA also models a word distribution by using a multinomial distribution whose parameters follows the Dirichlet distribution. This Dirichlet-multinomial setting, however, cannot capture the power-law phenomenon of a word distribution, which is known as Zipf's law in linguistics. We therefore propose a novel topic model using the Pitman-Yor(PY) process, called the PY topic model. The PY topic model captures two properties of a document; a power-law word distribution and the presence of multiple topics. In an experiment using real data, this model outperformed LDA in document modeling in terms of perplexity.

...read moreread less

77 citations

Proceedings Article•

Conditional Topic Random Fields

[...]

Jun Zhu¹, Eric P. Xing¹•Institutions (1)

Carnegie Mellon University¹

21 Jun 2010

TL;DR: An efficient variational inference algorithm that scales linearly in terms of topic numbers, and a maximum likelihood estimation (MLE) procedure for parameter estimation, for the supervised version of CTRF and an arguably more discriminative max-margin learning method.

...read moreread less

Abstract: Generative topic models such as LDA are limited by their inability to utilize nontrivial input features to enhance their performance, and many topic models assume that topic assignments of different words are conditionally independent. Some work exists to address the second limitation but no work exists to address both. This paper presents a conditional topic random field (CTRF) model, which can use arbitrary nonlocal features about words and documents and incorporate the Markov dependency between topic assignments of neighboring words. We develop an efficient variational inference algorithm that scales linearly in terms of topic numbers, and a maximum likelihood estimation (MLE) procedure for parameter estimation. For the supervised version of CTRF, we also develop an arguably more discriminative max-margin learning method. We evaluate CTRF on real review rating data and demonstrate the advantages of CTRF over generative competitors, and we show the advantages of max-margin learning over MLE.

...read moreread less

60 citations

Journal Article•DOI•

Hierarchical Bayesian Modeling of Topics in Time-Stamped Documents

[...]

Iulian Pruteanu-Malinici¹, Lu Ren¹, John Paisley¹, Eric Wang¹, Lawrence Carin¹ - Show less +1 more•Institutions (1)

Duke University¹

01 Jun 2010-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work considers the problem of inferring and modeling topics in a sequence of documents with known publication dates as well as the US Presidential State of the Union addresses from 1790 to 2008, and proposes a hierarchical model that infers the change in the topic mixture weights as a function of time.

...read moreread less

Abstract: We consider the problem of inferring and modeling topics in a sequence of documents with known publication dates. The documents at a given time are each characterized by a topic and the topics are drawn from a mixture model. The proposed model infers the change in the topic mixture weights as a function of time. The details of this general framework may take different forms, depending on the specifics of the model. For the examples considered here, we examine base measures based on independent multinomial-Dirichlet measures for representation of topic-dependent word counts. The form of the hierarchical model allows efficient variational Bayesian inference, of interest for large-scale problems. We demonstrate results and make comparisons to the model when the dynamic character is removed, and also compare to latent Dirichlet allocation (LDA) and Topics over Time (TOT). We consider a database of Neural Information Processing Systems papers as well as the US Presidential State of the Union addresses from 1790 to 2008.

...read moreread less

Patent•

Collapsed gibbs sampler for sparse topic models and discrete matrix factorization

[...]

Cédric Archambeau¹, Guillaume Bouchard¹•Institutions (1)

Xerox¹

19 Oct 2010

TL;DR: In this article, a topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dichlet prior probability distribution.

...read moreread less

Abstract: In an inference system for organizing a corpus of objects, feature representations are generated comprising distributions over a set of features corresponding to the objects. A topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. The inference is performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior. In some embodiments the inference is configured such that each inferred topic model is a clean topic model with topics defined as distributions over sub-sets of the set of features selected by the prior. In some embodiments the inference is configured such that the inferred topic model associates a focused sub-set of the set of topics to each object of the training corpus.

...read moreread less

Proceedings Article•DOI•

Sequential Latent Dirichlet Allocation: Discover Underlying Topic Structures within a Document

[...]

Lan Du¹, Wray Buntine, Huidong Jin²•Institutions (2)

Australian National University¹, Commonwealth Scientific and Industrial Research Organisation²

13 Dec 2010

TL;DR: By taking into account the sequential structure within a document, the SeqLDA model has a higher fidelity over LDA in terms of perplexity (a standard measure of dictionary-based compressibility) and yields a nicer sequential topic structure than LDA.

...read moreread less

Abstract: Understanding how topics within a document evolve over its structure is an interesting and important problem. In this paper, we address this problem by presenting a novel variant of Latent Dirichlet Allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, {\it i.e.}, a document consists of multiple segments ({\it e.g.}, chapters, paragraphs), each of which is correlated to its previous and subsequent segments. In our model, a document and its segments are modelled as random mixtures of the same set of latent topics, each of which is a distribution over words, and the topic distribution of each segment depends on that of its previous segment, the one for first segment will depend on the document topic distribution. The progressive dependency is captured by using the nested two-parameter Poisson Dirichlet process (PDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the PDP. Our experimental results on patent documents show that by taking into account the sequential structure within a document, our SeqLDA model has a higher fidelity over LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on books such as Melville's "The Whale''.

...read moreread less

Proceedings Article•DOI•

Expression microarray classification using topic models

[...]

Manuele Bicego¹, Pietro Lovato¹, Barbara Oliboni¹, Alessandro Perina¹•Institutions (1)

University of Verona¹

22 Mar 2010

TL;DR: The expression microarray classification task is cast into this probabilistic context, providing a parallelism with the text mining domain and an interpretation, and an experimental evaluation of the proposed methodologies on three standard datasets confirms their effectiveness.

...read moreread less

Abstract: Classification of samples in expression microarray experiments represents a crucial task in bioinformatics and biomedicine. In this paper this scenario is addressed by employing a particular class of statistical approaches, called Topic Models. These models, firstly introduced in the text mining community, permit to extract from a set of objects (typically documents) an interpretable and rich description, based on an intermediate representation called topics (or processes). In this paper the expression microarray classification task is cast into this probabilistic context, providing a parallelism with the text mining domain and an interpretation. Two different topic models are investigated, namely the Probabilistic Latent Semantic Analysis (PLSA) and the Latent Dirichlet Allocation (LDA). An experimental evaluation of the proposed methodologies on three standard datasets confirms their effectiveness, also in comparison with other classification methodologies.

...read moreread less

Book Chapter•DOI•

Biologically-aware latent dirichlet allocation (BaLDA) for the classification of expression microarray

[...]

Alessandro Perina¹, Pietro Lovato¹, Vittorio Murino², Manuele Bicego²•Institutions (2)

University of Verona¹, Istituto Italiano di Tecnologia²

22 Sep 2010

TL;DR: A novel topic model is proposed, which enriches and extends the Latent Dirichlet Allocation (LDA) model by integrating such dependencies, encoded in a categorization of genes, in a highly informative and discriminant representation for microarray experiments.

...read moreread less

Abstract: Topic models have recently shown to be really useful tools for the analysis of microarray experiments. In particular they have been successfully applied to gene clustering and, very recently, also to samples classification. In this latter case, nevertheless, the basic assumption of functional independence between genes is limiting, since many other a priori information about genes' interactions may be available (co-regulation, spatial proximity or other a priori knowledge). In this paper a novel topic model is proposed, which enriches and extends the Latent Dirichlet Allocation (LDA) model by integrating such dependencies, encoded in a categorization of genes. The proposed topic model is used to derive a highly informative and discriminant representation for microarray experiments. Its usefulness, in comparison with standard topic models, has been demonstrated in two different classification tests.

...read moreread less

Proceedings Article•DOI•

Discriminative topic modeling based on manifold learning

[...]

Seungil Huh¹, Stephen E. Fienberg¹•Institutions (1)

Carnegie Mellon University¹

25 Jul 2010

TL;DR: A Discriminative Topic Model (DTM) is proposed that separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving the local consistency.

...read moreread less

Abstract: Topic modeling has been popularly used for data analysis in various domains including text documents. Previous topic models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These models, however, do not take into account the manifold structure of data, which is generally informative for the non-linear dimensionality reduction mapping. More recent models, namely Laplacian PLSI (LapPLSI) and Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown the resulting benefits. But these approaches fall short of the full discriminating power of manifold learning as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this paper, we propose Discriminative Topic Model (DTM) that separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving the local consistency. We also present a novel model fitting algorithm based on the generalized EM and the concept of Pareto improvement. As a result, DTM achieves higher classification performance in a semi-supervised setting by effectively exposing the manifold structure of data. We provide empirical evidence on text corpora to demonstrate the success of DTM in terms of classification accuracy and robustness to parameters compared to state-of-the-art techniques.

...read moreread less

Proceedings Article•DOI•

Author interest topic model

[...]

Noriaki Kawamae¹•Institutions (1)

Nippon Telegraph and Telephone¹

19 Jul 2010

TL;DR: A hierarchical topic model that simultaneously captures topics and author's interests is presented, which introduces a latent variable with a separate probability distribution over topics into each document.

...read moreread less

Abstract: This paper presents a hierarchical topic model that simultaneously captures topics and author's interests. Our proposal, the Author Interest Topic model (AIT), introduces a latent variable with a separate probability distribution over topics into each document. Experiments on a research paper corpus show that the AIT is useful as a generative model.

...read moreread less

Journal Article•DOI•

Topic Tracking with Dynamic Topic Model and Topic-based Weighting Method

[...]

Xiaoyan Zhang¹, Ting Wang¹•Institutions (1)

National University of Defense Technology¹

05 Jan 2010-Journal of Software

TL;DR: An improved topic-based weighting method and a dynamic model based on the static model to overcome the topic drift problem and filter the noise existed in the tracked topic description are proposed.

...read moreread less

Abstract: In t opic t racking, a topic is usually described by several stories. How to represent a topic is always an issue and a difficult problem in the research on topic tracking. To emphasis the topic in stories, we provide an improved topic-based tf * idf weighting method to measure the topical importance of the features in the representation model. To overcome the topic drift problem and filter the noise existed in the tracked topic description , a dynamic topic model is proposed based on the static model. It extends the initial topic model with the information from the incoming related stories and filters the noise using the latest unrelated story. The topic tracking systems are implemented on the TDT4 Chinese corpus. The experimental results indicate that both the new weighting method and the dynamic model can improve the tracking performance.

...read moreread less

Book Chapter•DOI•

Embedding Semantics in LDA Topic Models

[...]

Loulwah AlSumait¹, Pu Wang², Carlotta Domeniconi², Daniel Barbará²•Institutions (2)

Kuwait University¹, George Mason University²

04 Mar 2010

Proceedings Article•

Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments

[...]

Jonathan Chang¹•Institutions (1)

Facebook¹

06 Jun 2010

TL;DR: A novel task, tag-and-cluster, which asks subjects to simultaneously annotate documents and cluster those annotations is presented, and it is demonstrated that these topic models have features which distinguish them from traditional topic models.

...read moreread less

Abstract: Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Recent studies have found that while there are suggestive connections between topic models and the way humans interpret data, these two often disagree. In this paper, we explore this disagreement from the perspective of the learning process rather than the output. We present a novel task, tag-and-cluster, which asks subjects to simultaneously annotate documents and cluster those annotations. We use these annotations as a novel approach for constructing a topic model, grounded in human interpretations of documents. We demonstrate that these topic models have features which distinguish them from traditional topic models.

...read moreread less

Proceedings Article•DOI•

Topic-weak-correlated Latent Dirichlet allocation

[...]

Yimin Tan¹, Zhijian Ou¹•Institutions (1)

Tsinghua University¹

01 Nov 2010

TL;DR: Experimental results on both synthetic and real-world corpus show the superiority of the TWC-LDA over the basic LDA for semantically meaningful topic discovery and document classification.

...read moreread less

Abstract: Latent Dirichlet allocation (LDA) has been widely used for analyzing large text corpora. In this paper we propose the topic-weak-correlated LDA (TWC-LDA) for topic modeling, which constrains different topics to be weak-correlated. This is technically achieved by placing a special prior over the topic-word distributions. Reducing the overlapping between the topic-word distributions makes the learned topics more interpretable in the sense that each topic word-distribution can be clearly associated to a distinctive semantic meaning. Experimental results on both synthetic and real-world corpus show the superiority of the TWC-LDA over the basic LDA for semantically meaningful topic discovery and document classification.

...read moreread less

Book Chapter•DOI•

Topic models conditioned on relations

[...]

Mirwaes Wahabzada, Zhao Xu, Kristian Kersting

20 Sep 2010

TL;DR: A Dirichlet-multinomial nonparametric regression topic model that includes a Gaussian process prior on joint document and topic distributions that is a function of document relations is presented.

...read moreread less

Abstract: Latent Dirichlet allocation is a fully generative statistical language model that has been proven to be successful in capturing both the content and the topics of a corpus of documents. Recently, it was even shown that relations among documents such as hyper-links or citations allow one to share information between documents and in turn to improve topic generation. Although fully generative, in many situations we are actually not interested in predicting relations among documents. In this paper, we therefore present a Dirichlet-multinomial nonparametric regression topic model that includes a Gaussian process prior on joint document and topic distributions that is a function of document relations. On networks of scientific abstracts and of Wikipedia documents we show that this approach meets or exceeds the performance of several baseline topic models.

...read moreread less

Proceedings Article•DOI•

An efficient block model for clustering sparse graphs

[...]

Ádám Gyenge¹, Janne Sinkkonen, András A. Benczúr¹•Institutions (1)

Hungarian Academy of Sciences¹

24 Jul 2010

TL;DR: A new generative model that combines rich block structure and simple, efficient estimation by collapsed Gibbs sampling is developed, which outperforms earlier Latent Dirichlet Allocation based models as well as spectral heuristics.

...read moreread less

Abstract: Models for large, sparse graphs are found in many applications and are an active topic in machine learning research. We develop a new generative model that combines rich block structure and simple, efficient estimation by collapsed Gibbs sampling. Novel in our method is that we may learn the strength of assortative and disassortative mixing schemes of communities. Most earlier approaches, both based on low-dimensional projections and Latent Dirichlet Allocation implicitely rely on one of the two assumptions: some algorithms define similarity based solely on connectedness while others solely on the similarity of the neighborhood, leading to undesired results for example in near-bipartite subgraphs. In our experiments we cluster both small and large graphs, involving real and generated graphs that are known to be hard to partition. Our method outperforms earlier Latent Dirichlet Allocation based models as well as spectral heuristics.

...read moreread less

Proceedings Article•DOI•

Trend detection model

[...]

Noriaki Kawamae, Ryuichiro Higashinaka

26 Apr 2010

TL;DR: A topic model that detects topic distributions over time by introducing a latent trend class variable into each document that has a probability distribution over topics and a continuous distribution over time is presented.

...read moreread less

Abstract: This paper presents a topic model that detects topic distributions over time. Our proposed model, Trend Detection Model (TDM) introduces a latent trend class variable into each document. The trend class has a probability distribution over topics and a continuous distribution over time. Experiments using our data set show that TDM is useful as a generative model in the analysis of the evolution of trends.

...read moreread less

Proceedings Article•

Continuous Time Group Discovery in Dynamic Graphs

[...]

Kurt T. Miller¹, Tina Eliassi-Rad²•Institutions (2)

University of California, Berkeley¹, Lawrence Livermore National Laboratory²

04 Nov 2010

TL;DR: A scalable Bayesian approach for community discovery in dynamic graphs is proposed based on extensions of Latent Dirichlet Allocation, which was extended to deal with topic changes in discrete time and later in continuous time.

...read moreread less

Abstract: With the rise in availability and importance of graphs and networks, it has become increasingly important to have good models to describe their behavior. While much work has focused on modeling static graphs, we focus on group discovery in dynamic graphs. We adapt a dynamic extension of Latent Dirichlet Allocation to this task and demonstrate good performance on two datasets. Modeling relational data has become increasingly important in recent years. Much work has focused on static graphs - that is fixed graphs at a single point in time. Here we focus on the problem of modeling dynamic (i.e. time-evolving) graphs. We propose a scalable Bayesian approach for community discovery in dynamic graphs. Our approach is based on extensions of Latent Dirichlet Allocation (LDA). LDA is a latent variable model for topic modeling in text corpora. It was extended to deal with topic changes in discrete time and later in continuous time. These models were referred to as the discrete Dynamic Topic Model (dDTM) and the continuous Dynamic Topic Model (cDTM), respectively. When adapting these models to graphs, we take our inspiration from LDA-G and SSN-LDA, applications of LDA to static graphs that have been shown to effectively factor out communitymore » structure to explain link patterns in graphs. In this paper, we demonstrate how to adapt and apply the cDTM to the task of finding communities in dynamic networks. We use link prediction to measure the quality of the discovered community structure and apply it to two different relational datasets - DBLP author-keyword and CAIDA autonomous systems relationships. We also discuss a parallel implementation of this approach using Hadoop. In Section 2, we review LDA and LDA-G. In Section 3, we review the cDTM and introduce cDTMG, its adaptation to modeling dynamic graphs. We discuss inference for the cDTM-G and details of our parallel implementation in Section 4 and present its performance on two datasets in Section 5 before concluding in Section 6.« less

...read moreread less

Journal Article•

LDA-based Model for Online Topic Evolution Mining

[...]

Liang Zheng¹•Institutions (1)

National University of Defense Technology¹

01 Jan 2010-Computer Science

TL;DR: Latent Dirichlet Allocation was extended to the context of online text streams, and an online LDA model was proposed and implemented as well, which can discover meaningful topical evolution trends both on English and Chinese corpus.

...read moreread less

Abstract: A computational model for online topic evolution mining was established through a latent semantic analysis process on textual data.Topical evolutionary analysis was achieved by tracking the topic trends in different time-slices.In this paper,Latent Dirichlet Allocation(LDA)was extended to the context of online text streams,and an online LDA model was proposed and implemented as well.The main idea is to use the posterior of topic-word distribution of each time-slice to influence the inference of the next time-slice,which also maintains the relevance between the topics.The topic-word and document-topic distributions are inferenced by incremental Gibbs algorithm.Kullback Leibler(KL)relative entropy is uesd to measure the similarity between topics in order to identify topic genetic and topic mutation.Experiments show that the proposed model can discover meaningful topical evolution trends both on English and Chinese corpus.

...read moreread less

Proceedings Article•

Novel weighting scheme for unsupervised language model adaptation using latent dirichlet allocation.

[...]

Md. Akmal Haidar¹, Douglas D. O'Shaughnessy²•Institutions (2)

Institut national de la recherche scientifique¹, Université du Québec²

01 Jan 2010

TL;DR: This work formed topic clusters by a hard-clustering method assigning one topic to one document based on the maximum number of words chosen from a topic for that document in Latent Dirichlet Allocation (LDA) analysis.

...read moreread less

Abstract: A new approach for computing weights of topic models in language model (LM) adaptation is introduced. We formed topic clusters by a hard-clustering method assigning one topic to one document based on the maximum number of words chosen from a topic for that document in Latent Dirichlet Allocation (LDA) analysis. The new weighting idea is that the unigram count of the topic generated by hard-clustering is used to compute the mixture weights instead of using an LDA latent topic word count used in the literature. Our approach shows significant perplexity and word error rate (WER) reduction against the existing approach.

...read moreread less

Journal Article•

Topic Evolution Based on LDA and Topic Association

[...]

Chu Ke

01 Jan 2010-Journal of Shanghai Jiaotong University

TL;DR: Experiments show that the method can detect new topics and describe topic's evolution over time effectively, and shows that the topics evolve with time and that the content of topics change with time.

...read moreread less

Abstract: Topic evolution will help people to learn information quickly.In this paper,a method was proposed to discover topic's evolution over time by topic detection and relating topics in different time periods.The method applies LDA model on temporal documents to extract topics.The number of topics in different time periods is different.Relating topics in consecutive time periods is based on Jensen-Shannon divergence and features similarity.Experiments show that the method can detect new topics and describe topic's evolution over time effectively.It not only shows that the topics evolve with time,but also that the content of topics change with time.

...read moreread less

Book Chapter•DOI•

A temporal latent topic model for facial expression recognition

[...]

Lifeng Shang¹, Kwok-Ping Chan¹•Institutions (1)

University of Hong Kong¹

08 Nov 2010

TL;DR: The latent Dirichlet allocation (LDA) topic model is extended to model facial expression dynamics and the resulting temporal latent topic model (TLTM) is described and shown how it can be applied to facial expression recognition.

...read moreread less

Abstract: In this paper we extend the latent Dirichlet allocation (LDA) topic model to model facial expression dynamics. Our topic model integrates the temporal information of image sequences through redefining topic generation probability without involving new latent variables or increasing inference difficulties. A collapsed Gibbs sampler is derived for batch learning with labeled training dataset and an efficient learning method for testing data is also discussed. We describe the resulting temporal latent topic model (TLTM) in detail and show how it can be applied to facial expression recognition. Experiments on CMU expression database illustrate that the proposed TLTM is very efficient in facial expression recognition.

...read moreread less

Journal Article•

Topic Words Extraction Method Based on LDA Model

[...]

LI Wan-long¹•Institutions (1)

Jilin University¹

01 Jan 2010-Computer Engineering

TL;DR: Latent Dirichlet Allocation (LDA) is presented to express the distributed probability of words with the help of word clustering of background and topic words association.

...read moreread less

Abstract: Latent Dirichlet Allocation(LDA) is presented to express the distributed probability of words.The topic keywords are extracted according to Shannon information.Words which are not distinctly in the analyzed text can be included to express the topics with the help of word clustering of background and topic words association.The topic meaning is attempted to dig out.Fast Gibbs is used to estimate the parameters.Experiments show that Fast Gibbs is 5 times faster than Gibbs and the precision is satisfactory,which shows the approach is efficient.

...read moreread less