
Showing papers by "Padhraic Smyth" published in 2009


Proceedings Article
18 Jun 2009
TL;DR: In this article, the authors compare the performance of topic models with collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and find that the main differences are attributable to the amount of smoothing applied to the counts.
Abstract: Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.

495 citations


Journal Article
TL;DR: This work describes distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model, and proposes a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data.
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.
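The "global update of topic counts" in the first LDA algorithm can be sketched as follows. This is a minimal illustration, assuming each processor starts a sweep from a copy of the shared word-topic count table and that the merged table is the old table plus the sum of every processor's changes. The function name, the list-of-lists layout, and the toy numbers are hypothetical; the local Gibbs sampling sweep itself is not shown.

```python
def merge_topic_counts(global_counts, local_counts_per_proc):
    """Combine per-processor word-topic counts after a local sampling sweep.

    global_counts[w][k] is the count of word w assigned to topic k before
    the sweep. Each entry of local_counts_per_proc is one processor's copy
    of that table after it resampled its own documents.
    """
    merged = [row[:] for row in global_counts]
    for local in local_counts_per_proc:
        for w, row in enumerate(local):
            for k, value in enumerate(row):
                # Add only the change this processor made to the shared table.
                merged[w][k] += value - global_counts[w][k]
    return merged


# Toy example: 2 words, 2 topics, 2 processors (illustrative values only).
global_counts = [[3, 1], [0, 2]]
proc_a = [[2, 2], [0, 2]]  # moved one token of word 0 from topic 0 to topic 1
proc_b = [[3, 1], [1, 1]]  # moved one token of word 1 from topic 1 to topic 0
merged = merge_topic_counts(global_counts, [proc_a, proc_b])
```

Because each processor samples against a slightly stale copy of the table, this kind of merge is an approximation to single-processor Gibbs sampling, which matches the abstract's description of the first algorithm.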

438 citations


Journal Article
TL;DR: Findings indicate that circadian clock genes may be utilized to modulate the progression of non-diurnal cyclic processes through their effect on the cell cycle.
Abstract: Hair follicles undergo recurrent cycling of controlled growth (anagen), regression (catagen), and relative quiescence (telogen) with a defined periodicity. Taking a genomics approach to study gene expression during synchronized mouse hair follicle cycling, we discovered that, in addition to circadian fluctuation, CLOCK-regulated genes are also modulated in phase with the hair growth cycle. During telogen and early anagen, circadian clock genes are prominently expressed in the secondary hair germ, which contains precursor cells for the growing follicle. Analysis of Clock and Bmal1 mutant mice reveals a delay in anagen progression, and the secondary hair germ cells show decreased levels of phosphorylated Rb and lack mitotic cells, suggesting that circadian clock genes regulate anagen progression via their effect on the cell cycle. Consistent with a block at the G1 phase of the cell cycle, we show a significant upregulation of p21 in Bmal1 mutant skin. While circadian clock mechanisms have been implicated in a variety of diurnal biological processes, our findings indicate that circadian clock genes may be utilized to modulate the progression of non-diurnal cyclic processes.

160 citations


Journal Article
TL;DR: This work presents an analysis of variance (ANOVA) periodicity detector and its Bayesian extension, which can be used to discover periodic transcripts of arbitrary shapes from replicated gene expression profiles, and applies quantitative real-time PCR to several highly ranked non-sinusoidal transcripts in liver tissue found by the model.
Abstract: Motivation: Cyclical biological processes such as cell division and circadian regulation produce coordinated periodic expression of thousands of genes. Identification of such genes and their expression patterns is a crucial step in discovering underlying regulatory mechanisms. Existing computational methods are biased toward discovering genes that follow sine-wave patterns. Results: We present an analysis of variance (ANOVA) periodicity detector and its Bayesian extension that can be used to discover periodic transcripts of arbitrary shapes from replicated gene expression profiles. The models are applicable when the profiles are collected at comparable time points for at least two cycles. We provide an empirical Bayes procedure for estimating parameters of the prior distributions and derive closed-form expressions for the posterior probability of periodicity, enabling efficient computation. The model is applied to two datasets profiling circadian regulation in murine liver and skeletal muscle, revealing a substantial number of previously undetected non-sinusoidal periodic transcripts in each. We also apply quantitative real-time PCR to several highly ranked non-sinusoidal transcripts in liver tissue found by the model, providing independent evidence of circadian regulation of these genes. Availability: Matlab software for estimating prior distributions and performing inference is available for download from http://www.datalab.uci.edu/resources/periodicity/. Contact: dchudova@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
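One plausible reading of the ANOVA idea is sketched below: observations taken at the same time point within the cycle are grouped together, and a one-way ANOVA F statistic compares the variation between time points with the variation within them. This is a generic textbook F statistic under that assumption, not the authors' Matlab software or their Bayesian extension; the function names and the toy profiles are hypothetical.

```python
def group_by_phase(values, points_per_cycle):
    """Group a time series by position within the cycle (its phase)."""
    groups = [[] for _ in range(points_per_cycle)]
    for i, value in enumerate(values):
        groups[i % points_per_cycle].append(value)
    return groups


def anova_f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Two cycles sampled at 4 time points each (illustrative values only).
spiky = [1.0, 1.1, 5.0, 0.9, 1.2, 0.9, 5.2, 1.0]  # non-sinusoidal but periodic
flat = [1.0, 1.2, 0.9, 1.1, 1.1, 0.9, 1.2, 1.0]   # noise around a constant

f_spiky = anova_f_statistic(group_by_phase(spiky, 4))
f_flat = anova_f_statistic(group_by_phase(flat, 4))
```

The spiky profile repeats the same shape in both cycles, so its F statistic is much larger than that of the flat profile, even though the shape is far from a sine wave. Grouping by phase requires at least two cycles, consistent with the applicability condition stated in the abstract.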

20 citations


Proceedings Article
07 Dec 2009
TL;DR: A recently proposed particle-based belief propagation algorithm is extended to provide a general framework for adapting discrete message-passing algorithms to inference in continuous systems, and the resulting algorithms behave similarly to their purely discrete counterparts.
Abstract: Since the development of loopy belief propagation, there has been considerable work on advancing the state of the art for approximate inference over distributions defined on discrete random variables. Improvements include guarantees of convergence, approximations that are provably more accurate, and bounds on the results of exact inference. However, extending these methods to continuous-valued systems has lagged behind. While several methods have been developed to use belief propagation on systems with continuous values, recent advances for discrete variables have not as yet been incorporated. In this context we extend a recently proposed particle-based belief propagation algorithm to provide a general framework for adapting discrete message-passing algorithms to inference in continuous systems. The resulting algorithms behave similarly to their purely discrete counterparts, extending the benefits of these more advanced inference techniques to the continuous domain.

14 citations


01 Jan 2009
TL;DR: A new strategy for computing relative log-likelihood (or perplexity) scores of topic models, based on annealed importance sampling, is proposed; it has smaller Monte Carlo error than previous approaches, leading to marked improvements in both accuracy and computation time.
Abstract: Despite recent advances in learning and inference algorithms, evaluating the predictive performance of topic models is still painfully slow and unreliable. We propose a new strategy for computing relative log-likelihood (or perplexity) scores of topic models, based on annealed importance sampling. The proposed method has smaller Monte Carlo error than previous approaches, leading to marked improvements in both accuracy and computation time.
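For readers unfamiliar with annealed importance sampling, the sketch below shows the generic technique on a one-dimensional toy problem. It is not the paper's topic-model estimator: the Gaussian prior and target, the linear temperature schedule, the single Metropolis step per temperature, and all function names are assumptions chosen for illustration.

```python
import math
import random


def log_prior(x):
    # Normalized standard normal density (the easy starting distribution).
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_target(x):
    # Unnormalized Gaussian bump centred at 1 with standard deviation 0.5.
    return -((x - 1.0) ** 2) / (2 * 0.5 ** 2)


def log_annealed(x, beta):
    # Geometric bridge between prior (beta=0) and target (beta=1).
    return (1 - beta) * log_prior(x) + beta * log_target(x)


def ais_log_weight(rng, n_temps=100, step=0.5):
    x = rng.gauss(0.0, 1.0)  # exact sample from the prior
    log_w = 0.0
    for j in range(1, n_temps + 1):
        beta_prev, beta = (j - 1) / n_temps, j / n_temps
        # Importance weight update for moving from beta_prev to beta.
        log_w += (beta - beta_prev) * (log_target(x) - log_prior(x))
        # One Metropolis step that leaves the annealed density at beta invariant.
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_annealed(proposal, beta) - log_annealed(x, beta)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
    return log_w


def ais_estimate(n_chains=500, seed=0):
    """Average the weights over independent chains to estimate the target's normalizer."""
    rng = random.Random(seed)
    weights = [math.exp(ais_log_weight(rng)) for _ in range(n_chains)]
    return sum(weights) / n_chains


estimate = ais_estimate()
true_normalizer = 0.5 * math.sqrt(2 * math.pi)  # known in closed form for this toy
```

Because the prior is normalized, the average weight estimates the target's normalizing constant, which is known in closed form here and can be compared with `estimate`. Using more temperatures reduces the Monte Carlo error of the weights at the cost of extra computation.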

3 citations