
Showing papers by "Padhraic Smyth published in 2006"


Proceedings ArticleDOI
20 Aug 2006
TL;DR: The experimental results indicate that the proposed time-varying Poisson model provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
Abstract: Time-series of count data are generated in many different contexts, such as web access logging, freeway traffic monitoring, and security logs associated with buildings. Since this data measures the aggregated behavior of individual human beings, it typically exhibits a periodicity in time on a number of scales (daily, weekly, etc.) that reflects the rhythms of the underlying human activity and makes the data appear non-homogeneous. At the same time, the data is often corrupted by a number of bursty periods of unusual behavior such as building events, traffic accidents, and so forth. The data mining problem of finding and extracting these anomalous events is made difficult by both of these elements. In this paper we describe a framework for unsupervised learning in this context, based on a time-varying Poisson process model that can also account for anomalous events. We show how the parameters of this model can be learned from count time series using statistical estimation techniques. We demonstrate the utility of this model on two datasets for which we have partial ground truth in the form of known events, one from freeway traffic data and another from building access data, and show that the model performs significantly better than a non-probabilistic, threshold-based technique. We also describe how the model can be used to investigate different degrees of periodicity in the data, including systematic day-of-week and time-of-day effects, and make inferences about the detected events (e.g., popularity or level of attendance). Our experimental results indicate that the proposed time-varying Poisson model provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
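For intuition, here is a minimal runnable sketch of the baseline idea, assuming hourly counts with a weekly cycle. The (day-of-week, hour) rate table, the median estimator, and the tail-probability threshold below are our illustrative simplifications, not the paper's actual model, which learns the periodic rate and the anomalous component jointly:

```python
# Sketch only: estimate a periodic Poisson rate per (day-of-week, hour),
# then flag time slots whose counts are improbable under that rate.
import numpy as np
from scipy.stats import poisson

def fit_periodic_rate(counts, dow, hour):
    """Estimate lambda(day-of-week, hour) with a median, robust to bursts."""
    lam = np.ones((7, 24))
    for d in range(7):
        for h in range(24):
            slot = counts[(dow == d) & (hour == h)]
            if slot.size:
                lam[d, h] = max(float(np.median(slot)), 1e-3)
    return lam

def flag_events(counts, dow, hour, lam, alpha=1e-3):
    """Flag slots whose counts sit in the extreme upper Poisson tail."""
    p_tail = poisson.sf(counts - 1, lam[dow, hour])  # P(X >= observed count)
    return p_tail < alpha

# Synthetic demo: eight weeks of hourly counts with one injected "event".
rng = np.random.default_rng(0)
t = np.arange(7 * 24 * 8)
dow, hour = (t // 24) % 7, t % 24
counts = rng.poisson(20 + 15 * np.sin(2 * np.pi * hour / 24))
counts[500:505] += 80
lam = fit_periodic_rate(counts, dow, hour)
print(np.flatnonzero(flag_events(counts, dow, hour, lam)))
```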

262 citations


Proceedings Article
04 Dec 2006
TL;DR: A new probabilistic model is proposed that tempers this approach by representing each document as a combination of a background distribution over common words, a mixture distribution over general topics, and a distribution over words that are treated as being specific to that document.
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-semantic indexing) or that only match documents at the specific word level (such as TF-IDF).
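In symbols, the three-part combination can be written as a per-document mixture; the notation below is our paraphrase, not necessarily the paper's exact parameterization:

$$p(w \mid d) \;=\; \lambda_{d,B}\, p(w \mid \Omega) \;+\; \lambda_{d,T} \sum_{t=1}^{T} p(t \mid \theta_d)\, p(w \mid \phi_t) \;+\; \lambda_{d,S}\, p(w \mid \psi_d)$$

where $\Omega$ is the corpus-wide background distribution over common words, $\phi_t$ are the shared general topics weighted by the document's mixture $\theta_d$, $\psi_d$ is the document-specific word distribution, and $\lambda_d$ gives the per-document probabilities of the three routes.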

203 citations


Proceedings ArticleDOI
20 Aug 2006
TL;DR: New graphical models are presented that directly learn the relationship between topics discussed in news articles and entities mentioned in each article and it is shown how these entity-topic models, through a better understanding of the entity-topic relationships, are better at making predictions about entities.
Abstract: The primary purpose of news articles is to convey information about who, what, when and where. But learning and summarizing these relationships for collections of thousands to millions of articles is difficult. While statistical topic models have been highly successful at topically summarizing huge collections of text documents, they do not explicitly address the textual interactions between who/where, i.e. named entities (persons, organizations, locations) and what, i.e. the topics. We present new graphical models that directly learn the relationship between topics discussed in news articles and entities mentioned in each article. We show how these entity-topic models, through a better understanding of the entity-topic relationships, are better at making predictions about entities.
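As a purely schematic illustration of how topics can couple words and entities (our own toy generative sketch, not necessarily the paper's graphical model), an article's topic mixture can drive both per-topic word distributions and per-topic entity distributions:

```python
# Schematic generative sketch: topics link "what" (words) with
# "who/where" (entities) through shared per-article topic mixtures.
import numpy as np

rng = np.random.default_rng(1)
T, W, E = 4, 1000, 200                            # topics, word vocab, entity vocab
phi = rng.dirichlet(np.full(W, 0.05), size=T)     # topic -> word distributions
psi = rng.dirichlet(np.full(E, 0.05), size=T)     # topic -> entity distributions

def generate_article(n_words=100, n_entities=5, alpha=0.1):
    theta = rng.dirichlet(np.full(T, alpha))      # article's topic mixture
    z_w = rng.choice(T, size=n_words, p=theta)
    words = np.array([rng.choice(W, p=phi[z]) for z in z_w])
    z_e = rng.choice(T, size=n_entities, p=theta)
    entities = np.array([rng.choice(E, p=psi[z]) for z in z_e])
    return words, entities

words, entities = generate_article()
print(words[:10], entities)
```

Inference reverses this process: given articles, it recovers per-topic entity distributions that can then be used to make predictions about which entities an article is likely to mention.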

169 citations


Book ChapterDOI
23 May 2006
TL;DR: A novel combination of statistical topic models and named-entity recognizers is presented to jointly analyze entities mentioned and topics discussed in a collection of 330,000 New York Times news articles.
Abstract: Statistical language models can learn relationships between topics discussed in a document collection and persons, organizations and places mentioned in each document. We present a novel combination of statistical topic models and named-entity recognizers to jointly analyze entities mentioned (persons, organizations and places) and topics discussed in a collection of 330,000 New York Times news articles. We demonstrate an analytic framework which automatically extracts from a large collection: topics; topic trends; and topics that relate entities.

114 citations


Journal Article
TL;DR: This paper presents a combination of statistical topic models and named-entity recognizers to jointly analyze entities mentioned (persons, organizations and places) and topics discussed in a collection of 330,000 New York Times news articles.
Abstract: Statistical language models can learn relationships between topics discussed in a document collection and persons, organizations and places mentioned in each document. We present a novel combination of statistical topic models and named-entity recognizers to jointly analyze entities mentioned (persons, organizations and places) and topics discussed in a collection of 330,000 New York Times news articles. We demonstrate an analytic framework which automatically extracts from a large collection: topics; topic trends; and topics that relate entities.

106 citations


Journal ArticleDOI
TL;DR: In this paper, daily rainfall occurrence and amount at 11 stations over North Queensland were examined for summers 1958-1998, using a Hidden Markov Model (HMM), and daily rainfall variability was described in terms of the occurrence of five discrete weather states, identified by the HMM.
Abstract: Daily rainfall occurrence and amount at 11 stations over North Queensland are examined for summers 1958–1998, using a Hidden Markov Model (HMM). Daily rainfall variability is described in terms of the occurrence of five discrete ‘weather states’, identified by the HMM. Three states are characterized respectively by very wet, moderately wet, and dry conditions at most stations; two states have enhanced rainfall along the coast and dry conditions inland. Each HMM rainfall state is associated with a distinct atmospheric circulation regime. The two wet states are accompanied by monsoonal circulation patterns with large-scale ascent, low-level inflow from the north-west, and a phase reversal with height; the dry state is characterized by circulation anomalies of the opposite sense. Two of the states show significant associations with midlatitude synoptic waves. Variability of the monsoon on time-scales from subseasonal to interdecadal is interpreted in terms of changes in the frequency of occurrence of the five HMM rainfall states. Large subseasonal variability is identified in terms of active and break phases, and a highly variable monsoon onset date. The occurrence of the very wet and dry states is somewhat modulated by the Madden–Julian oscillation. On interannual time-scales, there are clear relationships with the El Niño–Southern Oscillation and Indian Ocean sea surface temperatures (SSTs). Interdecadal monsoonal variability is characterized by stronger monsoons during the 1970s, and weaker monsoons plus an increased prevalence of drier states in the later part of the record. Stochastic simulations of daily rainfall occurrence and amount at the 11 stations are generated by introducing predictors based on large-scale precipitation from (a) reanalysis data, (b) an atmospheric general circulation model (GCM) run with observed SST forcing and (c) antecedent June–August Pacific SST anomalies. The reanalysis large-scale precipitation yields relatively accurate station-level simulations of the interannual variability of daily rainfall amount and occurrence, with rainfall intensity less well simulated. At some stations, interannual variations in 10-day dry-spell frequency are also simulated reasonably well. The interannual quality of the simulations is markedly degraded when the GCM simulations are used as inputs, while antecedent Pacific SST inputs yield an anomaly correlation skill comparable to that of the GCM.
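As a rough illustration of the modeling idea (not the paper's fitting procedure, which also handles rainfall amounts and large-scale predictors), the sketch below runs the forward pass of an HMM whose hidden daily weather states emit independent per-station rain/no-rain outcomes; all parameter values here are made up:

```python
# Minimal from-scratch HMM forward filter: hidden daily "weather states"
# with independent per-station Bernoulli rain-occurrence probabilities.
import numpy as np

def forward_filter(X, pi, A, p_rain):
    """X: (days, stations) binary occurrence; pi: (K,) initial state probs;
    A: (K, K) transition matrix; p_rain: (K, stations) rain probabilities."""
    days, _ = X.shape
    K = len(pi)
    # emission likelihoods: product over stations of Bernoulli terms
    lik = np.exp(X @ np.log(p_rain).T + (1 - X) @ np.log(1 - p_rain).T)
    alpha = np.zeros((days, K))
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, days):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()         # normalize to avoid underflow
    return alpha                           # filtered P(state_t | X_{1:t})

rng = np.random.default_rng(2)
K, S = 3, 11                               # e.g. 3 states, 11 stations
pi = np.full(K, 1 / K)
A = np.full((K, K), 0.1) + 0.7 * np.eye(K) # sticky transitions, rows sum to 1
p_rain = rng.uniform(0.05, 0.9, (K, S))
X = rng.integers(0, 2, (60, S))            # stand-in for real occurrence data
print(forward_filter(X, pi, A, p_rain)[:5].round(2))
```

In practice the transition matrix and per-state rain probabilities would be learned with EM rather than fixed by hand.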

70 citations


Journal ArticleDOI
TL;DR: Some of the key findings in imaging phenotyping and genotyping of schizophrenia are reviewed, along with initial endeavors at combining them into more meaningful and predictive patterns, or endophenotypes, that identify the relationships among clinical symptoms, course, genes, and the underlying pathophysiology.
Abstract: Schizophrenia is associated with subtle structural and functional brain abnormalities. Both recent and classical data suggest that it is a heterogeneous disorder that is clearly heritable. The cause and course of schizophrenia are poorly understood, and classical categories of clinical symptoms have not been particularly useful in identifying its pathophysiology or predicting its treatment. The possible genetic risk factors for schizophrenia are numerous; however, the connection between the genotype and the time-course, or the multifaceted symptoms of the disease, has yet to be established. Brain imaging methods that study the structure or function of the cortical and subcortical regions have also identified distinct patterns that distinguish schizophrenics from controls, and that may identify meaningful subtypes of schizophrenia. The predictive relationship between these imaging phenotypes and disease characteristics such as treatment response is only beginning to be revealed. The emergence of the field of imaging genetics, combining genetic and neuroimaging data, holds much promise for the deeper understanding and improved treatment of diseases such as schizophrenia. In this article we review some of the key findings in imaging phenotyping and genotyping of schizophrenia, and the initial endeavors at their combination into more meaningful and predictive patterns, or endophenotypes identifying the relationships among clinical symptoms, course, genes, and the underlying pathophysiology.

40 citations


Journal Article
TL;DR: The proposed general probabilistic framework for shape-based modeling and classification of waveform data leads to improved accuracy in classification and segmentation when compared to alternatives such as Euclidean distance matching, dynamic time warping, and segmental HMMs without random effects.
Abstract: This paper proposes a general probabilistic framework for shape-based modeling and classification of waveform data. A segmental hidden Markov model (HMM) is used to characterize waveform shape and shape variation is captured by adding random effects to the segmental model. The resulting probabilistic framework provides a basis for learning of waveform models from data as well as parsing and recognition of new waveforms. Expectation-maximization (EM) algorithms are derived and investigated for fitting such models to data. In particular, the "expectation conditional maximization either" (ECME) algorithm is shown to provide significantly faster convergence than a standard EM procedure. Experimental results on two real-world data sets demonstrate that the proposed approach leads to improved accuracy in classification and segmentation when compared to alternatives such as Euclidean distance matching, dynamic time warping, and segmental HMMs without random effects.
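As a rough sketch of how random effects capture shape variation in such a model (our notation; the paper's exact parameterization may differ), each segmental HMM state $k$ emits an entire segment from a template curve, and a per-waveform random effect perturbs that template:

$$y_t \;=\; (\beta_k + b_k)^{\top} x_t + \epsilon_t, \qquad b_k \sim \mathcal{N}(0, \Sigma_k), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_k^2)$$

Here $\beta_k$ is the shared shape of segment $k$, $x_t$ is a regression input such as time within the segment, and $b_k$ is the waveform-specific deviation that EM/ECME fitting integrates out, which is what lets one template accommodate individual shape variation.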

37 citations


Proceedings Article
04 Dec 2006
TL;DR: A Markov Chain Monte Carlo (MCMC) sampling algorithm is presented and demonstrated by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI).
Abstract: Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group-level variability in both the mixing proportions and the component parameters. Variability in mixing proportions across groups is handled using hierarchical Dirichlet processes, also allowing for automatic determination of the number of components. In addition, each group is allowed to have its own component parameters coming from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI).
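Schematically (our notation), the construction layers a random-effects step on top of a standard hierarchical Dirichlet process:

$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_j \sim \mathrm{DP}(\alpha, G_0) \;\; \text{for each group } j,$$

with each group's copy of a shared template component $k$ drawn as $\theta_{jk} \sim p(\theta \mid \theta_k)$, so groups share components and mixing structure while retaining group-level variability in the component parameters themselves.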

32 citations


01 Jan 2006
TL;DR: This work presents a parallel algorithm for the topic model with linear speedup and high parallel efficiency on shared-memory symmetric multiprocessors (SMPs); using this algorithm, topic model computations on an 8-processor system took 1/7 the time of the same computation on a single processor.
Abstract: The topic model is a popular probabilistic model for text and document modeling. It can be used for topic indexing, document classification, corpus summarization and information retrieval. In the past, topic models have been applied to corpora containing thousands to hundreds of thousands of documents. Now there is an increasing need to model collections with millions to billions of documents. We present a parallel algorithm for the topic model that has linear speedup and high parallel efficiency for shared-memory symmetric multiprocessors (SMPs). Using this parallel algorithm, topic model computations on an 8-processor system took 1/7 the time of the same computation on a single processor.
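The following is a compact sketch of the partition-sample-merge pattern such a parallel sampler can follow (a sequential simulation of the workers, with our own simplifications; the paper's SMP implementation will differ in detail):

```python
# Sketch: each of P "workers" Gibbs-samples its document shard against a
# local copy of the word-topic counts; copies are reconciled after each sweep.
import numpy as np

rng = np.random.default_rng(3)
T, W, D, P = 5, 50, 40, 4                 # topics, vocab, docs, "processors"
alpha, beta = 0.1, 0.01
docs = [rng.integers(0, W, rng.integers(20, 40)) for _ in range(D)]
z = [rng.integers(0, T, len(d)) for d in docs]

def counts(doc_ids):
    nwt = np.zeros((W, T)); ndt = np.zeros((D, T))
    for d in doc_ids:
        for w, t in zip(docs[d], z[d]):
            nwt[w, t] += 1; ndt[d, t] += 1
    return nwt, ndt

shards = np.array_split(np.arange(D), P)
nwt_global, _ = counts(range(D))
for sweep in range(10):
    deltas = []
    for shard in shards:                   # conceptually in parallel
        nwt = nwt_global.copy()            # local copy of word-topic counts
        _, ndt = counts(range(D))          # doc-topic counts are shard-local
        for d in shard:
            for i, w in enumerate(docs[d]):
                t_old = z[d][i]
                nwt[w, t_old] -= 1; ndt[d, t_old] -= 1
                p = (ndt[d] + alpha) * (nwt[w] + beta) / (nwt.sum(0) + W * beta)
                t_new = rng.choice(T, p=p / p.sum())
                z[d][i] = t_new
                nwt[w, t_new] += 1; ndt[d, t_new] += 1
        deltas.append(nwt - nwt_global)
    nwt_global += sum(deltas)              # merge each shard's net change
print(nwt_global.sum(), sum(len(d) for d in docs))  # totals stay consistent
```

The merge step adds each worker's net change to the global word-topic counts, so the counts stay consistent with the token assignments even though each worker samples against a slightly stale copy; accepting that staleness is what makes the parallel scheme efficient.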

22 citations


Book ChapterDOI
01 Oct 2006
TL;DR: This paper describes an MCMC sampling method that jointly estimates the shape parameters and the number of local activations, and illustrates the application of the algorithm to a number of different fMRI brain images.
Abstract: Traditional techniques for statistical fMRI analysis are often based on thresholding of individual voxel values or averaging voxel values over a region of interest. In this paper we present a mixture-based response-surface technique for extracting and characterizing spatial clusters of activation patterns from fMRI data. Each mixture component models a local cluster of activated voxels with a parametric surface function. A novel aspect of our approach is the use of Bayesian nonparametric methods to automatically select the number of activation clusters in an image. We describe an MCMC sampling method to estimate both parameters for shape features and the number of local activations at the same time, and illustrate the application of the algorithm to a number of different fMRI brain images.
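One way to write the response-surface idea (our notation, not necessarily the paper's exact form): the activation map is a superposition of $K$ parametric bumps,

$$\beta(\mathbf{x}) \;=\; \sum_{k=1}^{K} w_k \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\top}\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\Big),$$

where the MCMC sampler draws both the bump parameters $(w_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ and, via the Bayesian nonparametric prior, the number of bumps $K$ itself.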

Proceedings Article
04 Dec 2006
TL;DR: A non-parametric Bayesian framework is proposed for modeling collections of time-stamped events or counts, using a Dirichlet process to learn a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods.
Abstract: Data sets that characterize human activity over time through collections of time-stamped events or counts are of increasing interest in application areas such as human-computer interaction, video surveillance, and Web data analysis. We propose a non-parametric Bayesian framework for modeling collections of such data. In particular, we use a Dirichlet process framework for learning a set of intensity functions corresponding to different categories, which form a basis set for representing individual time-periods (e.g., several days) depending on which categories the time-periods are assigned to. This allows the model to learn in a data-driven fashion what "factors" are generating the observations on a particular day, including (for example) weekday versus weekend effects or day-specific effects corresponding to unique (single-day) occurrences of unusual behavior, sharing information where appropriate to obtain improved estimates of the behavior associated with each category. Applications to real-world data sets of count data involving both vehicles and people are used to illustrate the technique.
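Schematically (our notation), each time period $d$ is assigned to a category $c_d$ under a Dirichlet-process prior, and its counts follow that category's intensity function:

$$c_d \sim \mathrm{CRP}(\alpha), \qquad N_d(t) \sim \mathrm{Poisson}\big(\lambda_{c_d}(t)\big),$$

so recurring structure such as weekday-versus-weekend behavior maps onto heavily reused categories, while a one-off day of unusual behavior can claim a category of its own.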

Proceedings Article
13 Jul 2006
TL;DR: In this paper, Gibbs samplers for infinite complexity mixture models in the stick breaking representation are explored to improve mixing over cluster labels and to bring clusters into correspondence, and an application to modeling of storm trajectories is used to illustrate these ideas.
Abstract: Nonparametric Bayesian approaches to clustering, information retrieval, language modeling and object recognition have recently shown great promise as a new paradigm for unsupervised data analysis. Most contributions have focused on Dirichlet process mixture models or extensions thereof for which efficient Gibbs samplers exist. In this paper we explore Gibbs samplers for infinite complexity mixture models in the stick breaking representation. The advantage of this representation is improved modeling flexibility. For instance, one can design the prior distribution over cluster sizes or couple multiple infinite mixture models (e.g., over time) at the level of their parameters (i.e., the dependent Dirichlet process model). However, Gibbs samplers for infinite mixture models (as recently introduced in the statistics literature) seem to mix poorly over cluster labels. Among other issues, this can have the adverse effect that labels for the same cluster in coupled mixture models are mixed up. We introduce additional moves in these samplers to improve mixing over cluster labels and to bring clusters into correspondence. An application to modeling of storm trajectories is used to illustrate these ideas.
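For reference, a minimal sketch of the stick-breaking construction itself, truncated for illustration (variable names are ours):

```python
# Stick breaking: pi_k = v_k * prod_{j<k} (1 - v_j) with v_k ~ Beta(1, alpha)
# yields the mixture weights of a Dirichlet process draw.
import numpy as np

def stick_breaking(alpha, K, rng):
    v = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1 - v)[:-1]])
    return v * remaining                    # pi_k; sums to < 1 under truncation

rng = np.random.default_rng(4)
pi = stick_breaking(alpha=2.0, K=20, rng=rng)
print(pi.round(3), pi.sum())
```

Replacing the Beta(1, alpha) draws with other Beta priors changes the implied distribution over cluster sizes, which is the modeling flexibility the abstract refers to.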