
Showing papers by "Padhraic Smyth published in 2012"


Posted Content
TL;DR: Using the insights gained from this comparative study, it is shown how accurate topic models can be learned in several seconds on text corpora with thousands of documents.
Abstract: Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.
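The collapsed Gibbs sampler discussed in this abstract resamples each token's topic from smoothed count ratios, which is where the hyperparameters alpha and beta enter. A minimal sketch on a toy corpus (the corpus, sizes, and hyperparameter values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a small vocabulary.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 1, 5, 3, 2]]
K, V = 2, 6              # number of topics, vocabulary size
alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters (the smoothing amounts)

# Count tables: doc-topic, topic-word, topic totals.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)

# Random initialization of topic assignments z.
z = [[rng.integers(K) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Collapsed Gibbs sweeps: resample each z_{d,i} from its full conditional.
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # p(z = k | rest) proportional to (ndk + alpha)(nkw + beta)/(nk + V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Smoothed topic-word estimates.
phi = (nkw + beta) / (nk[:, None] + V * beta)
```

The smoothing terms alpha and beta are exactly the quantities whose settings, per the abstract, account for most of the performance differences between inference algorithms.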

496 citations


Journal ArticleDOI
TL;DR: The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.
Abstract: Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.

259 citations


Journal ArticleDOI
TL;DR: It is speculated that in humans the circadian clock imposes regulation of epidermal cell proliferation so that skin is at a particularly vulnerable stage during times of maximum UV exposure, thus contributing to the high incidence of human skin cancers.
Abstract: The role of the circadian clock in skin and the identity of genes participating in its chronobiology remain largely unknown, leading us to define the circadian transcriptome of mouse skin at two different stages of the hair cycle, telogen and anagen. The circadian transcriptomes of telogen and anagen skin are largely distinct, with the former dominated by genes involved in cell proliferation and metabolism. The expression of many metabolic genes is antiphasic to cell cycle-related genes, the former peaking during the day and the latter at night. Consistently, accumulation of reactive oxygen species, a byproduct of oxidative phosphorylation, and S-phase are antiphasic to each other in telogen skin. Furthermore, the circadian variation in S-phase is controlled by BMAL1 intrinsic to keratinocytes, because keratinocyte-specific deletion of Bmal1 obliterates time-of-day–dependent synchronicity of cell division in the epidermis leading to a constitutively elevated cell proliferation. In agreement with higher cellular susceptibility to UV-induced DNA damage during S-phase, we found that mice are most sensitive to UVB-induced DNA damage in the epidermis at night. Because in the human epidermis maximum numbers of keratinocytes go through S-phase in the late afternoon, we speculate that in humans the circadian clock imposes regulation of epidermal cell proliferation so that skin is at a particularly vulnerable stage during times of maximum UV exposure, thus contributing to the high incidence of human skin cancers.

204 citations


Journal ArticleDOI
TL;DR: A discussion of the design and implementation choices for each visual analysis technique is presented, followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find.
Abstract: We present TopicNets, a Web-based system for visual and interactive analysis of large sets of documents using statistical topic models. A range of visualization types and control mechanisms to support knowledge discovery are presented. These include corpus- and document-specific views, iterative topic modeling, search, and visual filtering. Drill-down functionality is provided to allow analysts to visualize individual document sections and their relations within the global topic space. Analysts can search across a dataset through a set of expansion techniques on selected document and topic nodes. Furthermore, analysts can select relevant subsets of documents and perform real-time topic modeling on these subsets to interactively visualize topics at various levels of granularity, allowing for a better understanding of the documents. A discussion of the design and implementation choices for each visual analysis technique is presented. This is followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find. These include a corpus of 50,000 successful NSF grant proposals, 10,000 publications from a large research center, and single documents including a grant proposal and a PhD thesis.

163 citations


Posted Content
TL;DR: The author-topic model is a generative model for documents that extends Latent Dirichlet Allocation (LDA) to include authorship information; each document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with its authors.
Abstract: We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.

94 citations


Posted Content
TL;DR: This work considers the problem of modeling discrete-valued vector time series data using extensions of Chow-Liu tree models to capture both dependencies across time and dependencies across variables, and describes learning algorithms for such models and how they can be used to learn parsimonious representations for the output distributions in hidden Markov models.
Abstract: We consider the problem of modeling discrete-valued vector time series data using extensions of Chow-Liu tree models to capture both dependencies across time and dependencies across variables. Conditional Chow-Liu tree models are introduced, as an extension to standard Chow-Liu trees, for modeling conditional rather than joint densities. We describe learning algorithms for such models and show how they can be used to learn parsimonious representations for the output distributions in hidden Markov models. These models are applied to the important problem of simulating and forecasting daily precipitation occurrence for networks of rain stations. To demonstrate the effectiveness of the models, we compare their performance versus a number of alternatives using historical precipitation data from Southwestern Australia and the Western United States. We illustrate how the structure and parameters of the models can be used to provide an improved meteorological interpretation of such data.
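A standard Chow-Liu tree, the building block this paper extends, is the maximum-weight spanning tree over variables with edges weighted by empirical mutual information. A minimal sketch on synthetic binary data (the data and variable names are illustrative, and this covers only the standard tree, not the conditional extension):

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(data):
    """Kruskal's algorithm on mutual-information edge weights."""
    n_vars = data.shape[1]
    edges = sorted(
        ((mutual_info(data[:, i], data[:, j]), i, j)
         for i, j in combinations(range(n_vars), 2)),
        reverse=True)
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree

rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 500)
flip = (rng.random(500) < 0.05).astype(int)
x1 = x0 ^ flip                        # strongly coupled to x0
x2 = rng.integers(0, 2, 500)          # independent of both
data = np.column_stack([x0, x1, x2])
tree = chow_liu_tree(data)
```

Because x0 and x1 are nearly identical, the learned tree always retains the (0, 1) edge; the conditional variant in the paper applies the same idea to conditional rather than joint densities.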

57 citations


Posted Content
TL;DR: In this paper, a general probabilistic framework for modeling waveforms such as heartbeats from ECG data is described, which is based on segmental hidden Markov models with the addition of random effects to the generative model.
Abstract: In this paper we describe a general probabilistic framework for modeling waveforms such as heartbeats from ECG data. The model is based on segmental hidden Markov models (as used in speech recognition) with the addition of random effects to the generative model. The random effects component of the model handles shape variability across different waveforms within a general class of waveforms of similar shape. We show that this probabilistic model provides a unified framework for learning these models from sets of waveform data as well as parsing, classification, and prediction of new waveforms. We derive a computationally efficient EM algorithm to fit the model on multiple waveforms, and introduce a scoring method that evaluates a test waveform based on its shape. Results on two real-world data sets demonstrate that the random effects methodology leads to improved accuracy (compared to alternative approaches) on classification and segmentation of real-world waveforms.

23 citations


Journal ArticleDOI
TL;DR: In this article, the authors presented new methods for an automated analysis of the double InterTropical Convergence Zone (dITCZ) phenomena on a daily time scale over the east Pacific.

19 citations


Posted Content
TL;DR: A family of models and learning algorithms that can simultaneously align and cluster sets of multidimensional curves measured on a discrete time grid are presented and it is shown that the Bayesian network models provide systematic improvements in predictive power over more conventional clustering approaches.
Abstract: In this paper we present a family of algorithms that can simultaneously align and cluster sets of multidimensional curves measured on a discrete time grid. Our approach is based on a generative mixture model that allows non-linear time warping of the observed curves relative to the mean curves within the clusters. We also allow for arbitrary discrete-valued translation of the time axis, random real-valued offsets of the measured curves, and additive measurement noise. The resulting model can be viewed as a dynamic Bayesian network with a special transition structure that allows effective inference and learning. The Expectation-Maximization (EM) algorithm can be used to simultaneously recover both the curve models for each cluster, and the most likely time warping, translation, offset, and cluster membership for each curve. We demonstrate how Bayesian estimation methods improve the results for smaller sample sizes by enforcing smoothness in the cluster mean curves. We evaluate the methodology on two real-world data sets, and show that the DBN models provide systematic improvements in predictive power over competing approaches.
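The joint alignment-and-clustering idea can be illustrated with a stripped-down alternation: assign each curve its best (cluster, integer shift) pair, then refit cluster means from the aligned curves. This is a simplified k-means-style sketch with only discrete translation, not the paper's full generative model with warping, offsets, and Bayesian smoothing; all curve shapes and sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, max_shift = 20, 3
t = np.arange(T)
prototypes = [np.sin(2 * np.pi * t / T), np.exp(-(t - 10.0) ** 2 / 8.0)]

# Simulate curves: each is a prototype shifted by a random integer plus noise.
curves, truth = [], []
for c, proto in enumerate(prototypes):
    for _ in range(10):
        s = rng.integers(-max_shift, max_shift + 1)
        curves.append(np.roll(proto, s) + 0.05 * rng.standard_normal(T))
        truth.append(c)

def best_fit(curve, means):
    """Jointly pick the cluster and integer shift minimizing squared error."""
    best = (np.inf, 0, 0)
    for c, m in enumerate(means):
        for s in range(-max_shift, max_shift + 1):
            err = np.sum((np.roll(curve, -s) - m) ** 2)
            if err < best[0]:
                best = (err, c, s)
    return best

# EM-style alternation: assign (cluster, shift), then refit the cluster means.
means = [curves[0], curves[10]]     # crude initialization, one seed per group
for _ in range(5):
    fits = [best_fit(cv, means) for cv in curves]
    for c in range(2):
        aligned = [np.roll(cv, -s) for cv, (_, cc, s) in zip(curves, fits) if cc == c]
        if aligned:
            means[c] = np.mean(aligned, axis=0)
labels = [c for _, c, _ in fits]
```

On this easy synthetic problem the alternation recovers the two generating groups; the paper's dynamic Bayesian network handles the harder case of non-linear warping and real-valued offsets within the same E-step.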

15 citations


Posted Content
TL;DR: In this paper, a hierarchical extension for modeling multiple such sequences, facilitating inferences about event-level dynamics and their variation across sequences, is presented, and the efficacy of such sharing is illustrated with an analysis of high school classroom dynamics.
Abstract: Interaction within small groups can often be represented as a sequence of events, where each event involves a sender and a recipient. Recent methods for modeling network data in continuous time model the rate at which individuals interact conditioned on the previous history of events as well as actor covariates. We present a hierarchical extension for modeling multiple such sequences, facilitating inferences about event-level dynamics and their variation across sequences. The hierarchical approach allows one to share information across sequences in a principled manner---we illustrate the efficacy of such sharing through a set of prediction experiments. After discussing methods for adequacy checking and model selection for this class of models, the method is illustrated with an analysis of high school classroom dynamics.

13 citations


Proceedings ArticleDOI
04 Oct 2012
TL;DR: The track-oriented multiple hypothesis tracker is formulated as a graphical model and it is shown that belief propagation can be used to approximate the track marginals and enable an online parameter estimation scheme that improves tracker performance in the presence of parameter misspecification.
Abstract: The track-oriented multiple hypothesis tracker is currently the preferred method for tracking multiple targets in clutter with medium to high computational resources. This method maintains a structured representation of the track posterior distribution, which it repeatedly extends and optimizes over. This representation of the posterior admits probabilistic inference tasks beyond MAP estimation that have yet to be explored. To this end we formulate the posterior as a graphical model and show that belief propagation can be used to approximate the track marginals. These approximate marginals enable an online parameter estimation scheme that improves tracker performance in the presence of parameter misspecification.
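Belief propagation computes (approximate) marginals by passing messages over the graphical model; on tree-structured graphs it is exact. A minimal sum-product sketch on a three-variable chain, checked against brute-force enumeration (the potentials are arbitrary, and this is a generic illustration, not the tracker's graph):

```python
import numpy as np

# A 3-variable chain x0 - x1 - x2 with binary states.
rng = np.random.default_rng(3)
unary = rng.random((3, 2)) + 0.1                      # node potentials
pair = [rng.random((2, 2)) + 0.1 for _ in range(2)]   # potentials on (0,1), (1,2)

# Forward and backward messages (sum-product on a chain).
fwd = [np.ones(2) for _ in range(3)]
for i in range(1, 3):
    fwd[i] = (fwd[i - 1] * unary[i - 1]) @ pair[i - 1]
bwd = [np.ones(2) for _ in range(3)]
for i in range(1, 3):
    j = 2 - i
    bwd[j] = pair[j] @ (bwd[j + 1] * unary[j + 1])

# Node marginal: product of incoming messages and the local potential.
marginals = np.array([fwd[i] * unary[i] * bwd[i] for i in range(3)])
marginals /= marginals.sum(axis=1, keepdims=True)

# Brute-force check: enumerate the full joint distribution.
joint = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            joint[a, b, c] = (unary[0, a] * unary[1, b] * unary[2, c]
                              * pair[0][a, b] * pair[1][b, c])
joint /= joint.sum()
exact = np.array([joint.sum(axis=(1, 2)),
                  joint.sum(axis=(0, 2)),
                  joint.sum(axis=(0, 1))])
```

The tracker's posterior graph contains loops, so the paper's marginals are approximate rather than exact; the message-passing mechanics, however, are the same.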

Book ChapterDOI
01 Jan 2012
TL;DR: In this article, the authors present a comprehensive overview of distributed inference algorithms for topic models and extend the general ideas to a broader class of Bayesian networks, and discuss practical guidelines for running their algorithms within various parallel computing frameworks.
Abstract: In this chapter, we address distributed learning algorithms for statistical latent variable models, with a focus on topic models. Many high-dimensional datasets, such as text corpora and image databases, are too large to allow one to learn topic models on a single computer. Moreover, a growing number of applications require that inference be fast or in real time, motivating the exploration of parallel and distributed learning algorithms. We begin by reviewing topic models such as Latent Dirichlet Allocation and Hierarchical Dirichlet Processes. We discuss parallel and distributed algorithms for learning these models and show that these algorithms can achieve substantial speedups without sacrificing model quality. Next we discuss practical guidelines for running our algorithms within various parallel computing frameworks and highlight complementary speedup techniques. Finally, we generalize our distributed approach to handle Bayesian networks. Several of the results in this chapter have appeared in previous papers in the specific context of topic modeling. The goal of this chapter is to present a comprehensive overview of distributed inference algorithms and to extend the general ideas to a broader class of Bayesian networks. Latent Variable Models: Latent variable models are a class of statistical models that explain observed data with latent (or hidden) variables. Topic models and hidden Markov models are two examples of such models, where the latent variables are the topic assignment variables and the hidden states, respectively. Given observed data, the goal is to perform Bayesian inference over the latent variables and use the learned model to make inferences or predictions.

Proceedings Article
17 Nov 2012
TL;DR: A latent variable modeling approach for extracting information from individual email histories, focusing in particular on understanding how an individual communicates over time with recipients in their social network is investigated.
Abstract: As digital communication devices play an increasingly prominent role in our daily lives, the ability to analyze and understand our communication patterns becomes more important. In this paper, we investigate a latent variable modeling approach for extracting information from individual email histories, focusing in particular on understanding how an individual communicates over time with recipients in their social network. The proposed model consists of latent groups of recipients, each of which is associated with a piecewise-constant Poisson rate over time. Inference of group memberships, temporal changepoints, and rate parameters is carried out via Markov Chain Monte Carlo (MCMC) methods. We illustrate the utility of the model by applying it to both simulated and real-world email data sets.
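The piecewise-constant Poisson rate at the heart of this model can be illustrated with a much simpler estimator than the paper's MCMC scheme: a profile-likelihood grid search for a single changepoint, with plug-in maximum-likelihood rates on each side. The simulated counts and rate values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated weekly email counts: the rate jumps from 2 to 8 at week 30.
true_cp = 30
counts = np.concatenate([rng.poisson(2.0, true_cp),
                         rng.poisson(8.0, 70 - true_cp)])

def poisson_loglik(x, lam):
    # Up to an additive constant: the x! term cancels when comparing changepoints.
    return np.sum(x) * np.log(lam) - len(x) * lam

def best_changepoint(x):
    """Profile log-likelihood over one changepoint with plug-in ML rates."""
    best_t, best_ll = None, -np.inf
    for t in range(1, len(x)):
        l1 = x[:t].mean() + 1e-9   # small floor avoids log(0) on all-zero segments
        l2 = x[t:].mean() + 1e-9
        ll = poisson_loglik(x[:t], l1) + poisson_loglik(x[t:], l2)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t

cp = best_changepoint(counts)
```

The paper's MCMC approach generalizes this in two directions that the grid search cannot: an unknown number of changepoints, and uncertainty over which latent recipient group each rate belongs to.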


Posted Content
TL;DR: In this paper, Gibbs samplers for infinite complexity mixture models in the stick breaking representation are explored to improve mixing over cluster labels and to bring clusters into correspondence for modeling of storm trajectories.
Abstract: Nonparametric Bayesian approaches to clustering, information retrieval, language modeling and object recognition have recently shown great promise as a new paradigm for unsupervised data analysis. Most contributions have focused on the Dirichlet process mixture models or extensions thereof for which efficient Gibbs samplers exist. In this paper we explore Gibbs samplers for infinite complexity mixture models in the stick breaking representation. The advantage of this representation is improved modeling flexibility. For instance, one can design the prior distribution over cluster sizes or couple multiple infinite mixture models (e.g., over time) at the level of their parameters (i.e., the dependent Dirichlet process model). However, Gibbs samplers for infinite mixture models (as recently introduced in the statistics literature) seem to mix poorly over cluster labels. Among other issues, this can have the adverse effect that labels for the same cluster in coupled mixture models are mixed up. We introduce additional moves in these samplers to improve mixing over cluster labels and to bring clusters into correspondence. An application to modeling of storm trajectories is used to illustrate these ideas.
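The stick-breaking representation referred to here constructs Dirichlet process mixture weights by repeatedly breaking off Beta-distributed fractions of a unit stick: w_k = v_k * prod_{j<k} (1 - v_j) with v_k ~ Beta(1, alpha). A minimal truncated sketch (the truncation level and concentration value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def stick_breaking(alpha, n_atoms):
    """Truncated stick-breaking weights for a Dirichlet process.

    w_k = v_k * prod_{j<k} (1 - v_j),  v_k ~ Beta(1, alpha).
    """
    v = rng.beta(1.0, alpha, size=n_atoms)
    # Length of stick remaining before each break.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

weights = stick_breaking(alpha=2.0, n_atoms=100)
```

With 100 atoms the unbroken remainder is negligible, so the weights sum to essentially one; samplers in this representation work with the v_k directly, which is what makes it easy to redesign the prior over cluster sizes or couple multiple mixtures, as the abstract describes.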

Book ChapterDOI
24 Sep 2012
TL;DR: An overview of recent work using probabilistic latent variable models to analyze large text and social network data sets, specifically data in the form of time-stamped events between nodes (such as emails exchanged among individuals over time).
Abstract: Exploring and understanding large text and social network data sets is of increasing interest across multiple fields, in computer science, social science, history, medicine, and more. This talk will present an overview of recent work using probabilistic latent variable models to analyze such data. Latent variable models have a long tradition in data analysis and typically hypothesize the existence of simple unobserved phenomena to explain relatively complex observed data. In the past decade there has been substantial work on extending the scope of these approaches from relatively small simple data sets to much more complex text and network data. We will discuss the basic concepts behind these developments, reviewing key ideas, recent advances, and open issues. In addition we will highlight common ideas that lie beneath the surface of different approaches including links (for example) to work in matrix factorization. The concluding part of the talk will focus more specifically on recent work with temporal social networks, specifically data in the form of time-stamped events between nodes (such as emails exchanged among individuals over time).

Posted Content
TL;DR: In this paper, the problem of analyzing social network data sets in which the edges of the network have timestamps is considered, and the subgraphs formed from edges in contiguous subintervals of these time-stamps are analyzed.
Abstract: We consider the problem of analyzing social network data sets in which the edges of the network have timestamps, and we wish to analyze the subgraphs formed from edges in contiguous subintervals of these timestamps. We provide data structures for these problems that use near-linear preprocessing time, linear space, and sublogarithmic query time to handle queries that ask for the number of connected components, number of components that contain cycles, number of vertices whose degree equals or is at most some predetermined value, number of vertices that can be reached from a starting set of vertices by time-increasing paths, and related queries.
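For context, the connected-components query the abstract mentions has an obvious naive baseline: rebuild a union-find structure from the edges falling in the query window. The sketch below implements that O(edges)-per-query baseline on invented data; the point of the paper is precisely to beat this with near-linear preprocessing and sublogarithmic query time:

```python
class DSU:
    """Union-find with path halving, tracking the component count."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.components = n

    def find(self, u):
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]
            u = self.parent[u]
        return u

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv
            self.components -= 1

def components_in_window(n_vertices, timed_edges, t_start, t_end):
    """Connected components of the subgraph of edges timestamped in
    [t_start, t_end], rebuilt from scratch for each query."""
    dsu = DSU(n_vertices)
    for t, u, v in timed_edges:
        if t_start <= t <= t_end:
            dsu.union(u, v)
    return dsu.components

# Illustrative timestamped edge list: (timestamp, u, v).
edges = [(1, 0, 1), (2, 1, 2), (5, 3, 4), (7, 0, 4)]
```

Shrinking the window from [1, 7] down to [6, 7] fragments the graph from one component into four, which is the kind of interval query the paper's data structures answer without the per-query rebuild.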

01 Jan 2012
TL;DR: This work introduces a continuous-time regression modeling framework for network event data that can incorporate both time-varying network statistics and time- varying regression coefficients, and develops an efficient inference scheme that allows the approach to scale to large networks.
Abstract: The analysis of the formation and evolution of networks over time is of fundamental importance to social science, biology, and many other fields. While longitudinal network data sets are increasingly being recorded at the granularity of individual time-stamped events, most studies only focus on collapsed cross-sectional snapshots of the network. Leveraging ideas from survival and event history analysis, we introduce a continuous-time regression modeling framework for network event data that can incorporate both time-varying network statistics and time-varying regression coefficients. This framework can apply to both egocentric processes defined for individual nodes and relational processes defined for pairs of nodes. We also develop an efficient inference scheme that allows our approach to scale to large networks. We apply our techniques to various synthetic and real-world datasets, such as citation networks and social networks, and show that the proposed inference approach can accurately estimate the model coefficients, which is useful for interpreting the evolution of the network; furthermore, the learned model has systematically better predictive performance compared to standard baseline methods.