
Showing papers by "Padhraic Smyth published in 2007"


Journal ArticleDOI
TL;DR: In this article, a probabilistic clustering method based on a regression mixture model was used to describe tropical cyclone propagation in the western North Pacific (WNP) and seven clusters were obtained and described in Part I of this two-part study.
Abstract: A new probabilistic clustering method, based on a regression mixture model, is used to describe tropical cyclone (TC) propagation in the western North Pacific (WNP). Seven clusters were obtained and described in Part I of this two-part study. In Part II, the present paper, the large-scale patterns of atmospheric circulation and sea surface temperature associated with each of the clusters are investigated, as well as associations with the phase of the El Nino–Southern Oscillation (ENSO). Composite wind field maps over the WNP provide a physically consistent picture of each TC type, and of its seasonality. Anomalous vorticity and outgoing longwave radiation indicate changes in the monsoon trough associated with different types of TC genesis and trajectory. The steering winds at 500 hPa are more zonal in the straight-moving clusters, with larger meridional components in the recurving ones. Higher values of vertical wind shear in the midlatitudes also accompany the straight-moving tracks, compared to...

277 citations


Proceedings Article
03 Dec 2007
TL;DR: Using five real-world text corpora, it is shown that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning.
Abstract: We investigate the problem of learning a widely-used latent-variable model - the Latent Dirichlet Allocation (LDA) or "topic" model - using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.

264 citations
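The first of the two schemes (local collapsed Gibbs sampling against a stale copy of the global topic-word counts, with periodic merges) can be sketched as below. This is a minimal single-process simulation of the P processors; the shard layout, hyperparameter values, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def distributed_lda_round(docs, z, n_kw, n_dk, P, K, V, alpha=0.1, beta=0.01, rng=None):
    """One round of approximate distributed LDA: each of P (simulated)
    processors runs collapsed Gibbs sampling over its shard of documents
    against a stale copy of the global topic-word counts n_kw, then all
    local count changes are merged back into the global counts."""
    rng = rng or np.random.default_rng(0)
    shards = np.array_split(np.arange(len(docs)), P)
    deltas = []
    for shard in shards:
        local = n_kw.copy()                      # stale global copy
        for d in shard:
            for i, w in enumerate(docs[d]):
                k = z[d][i]
                local[k, w] -= 1
                n_dk[d, k] -= 1
                # collapsed Gibbs conditional for topic of token (d, i)
                p = (local[:, w] + beta) * (n_dk[d] + alpha) / (local.sum(axis=1) + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                local[k, w] += 1
                n_dk[d, k] += 1
        deltas.append(local - n_kw)
    n_kw += sum(deltas)                          # periodic global merge
    return z, n_kw, n_dk
```

Because each shard's delta sums to zero per token, the merged global counts stay consistent with the total number of tokens, which is what makes the periodic-update approximation cheap to implement.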


Journal ArticleDOI
TL;DR: In this article, a probabilistic clustering technique based on a regression mixture model was used to describe tropical cyclone trajectories in the western North Pacific, where each component of the mixture model consists of a quadratic regression curve of cyclone position against time.
Abstract: A new probabilistic clustering technique, based on a regression mixture model, is used to describe tropical cyclone trajectories in the western North Pacific. Each component of the mixture model consists of a quadratic regression curve of cyclone position against time. The best-track 1950–2002 dataset is described by seven distinct clusters. These clusters are then analyzed in terms of genesis location, trajectory, landfall, intensity, and seasonality. Both genesis location and trajectory play important roles in defining the clusters. Several distinct types of straight-moving, as well as recurving, trajectories are identified, thus enriching this main distinction found in previous studies. Intensity and seasonality of cyclones, though not used by the clustering algorithm, are both highly stratified from cluster to cluster. Three straight-moving trajectory types have very small within-cluster spread, while the recurving types are more diffuse. Tropical cyclone landfalls over East and Southeast Asia are found to be strongly cluster dependent, both in terms of frequency and region of impact. The relationships of each cluster type with the large-scale circulation, sea surface temperatures, and the phase of the El Nino–Southern Oscillation are studied in a companion paper.

263 citations
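The core of the clustering technique, EM over a mixture of quadratic regression curves of position against time, can be sketched as follows. The sketch uses a 1-D position variable and a simple spread-out initialization heuristic as simplifying assumptions; the paper regresses longitude and latitude jointly:

```python
import numpy as np

def fit_regression_mixture(tracks, K=2, n_iter=30):
    """EM for a mixture of quadratic regression curves: cluster k has
    coefficients beta[k] with position ~ beta[k,0] + beta[k,1]*t + beta[k,2]*t**2
    plus Gaussian noise.  `tracks` is a list of (t, y) arrays."""
    N = len(tracks)
    X = [np.column_stack([np.ones_like(t), t, t**2]) for t, _ in tracks]
    Y = [y for _, y in tracks]
    # initialise components from tracks spread across the data set
    init = np.linspace(0, N - 1, K).astype(int)
    beta = np.array([np.linalg.lstsq(X[i], Y[i], rcond=None)[0] for i in init])
    sigma2 = np.ones(K)
    mix = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: log-posterior responsibility of each cluster for each track
        logr = np.array([[np.log(mix[k])
                          - 0.5 * np.sum((Y[i] - X[i] @ beta[k])**2) / sigma2[k]
                          - 0.5 * len(Y[i]) * np.log(2 * np.pi * sigma2[k])
                          for k in range(K)] for i in range(N)])
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted least squares per cluster
        mix = r.mean(axis=0)
        for k in range(K):
            A = sum(r[i, k] * X[i].T @ X[i] for i in range(N))
            b = sum(r[i, k] * X[i].T @ Y[i] for i in range(N))
            beta[k] = np.linalg.solve(A, b)
            sse = sum(r[i, k] * np.sum((Y[i] - X[i] @ beta[k])**2) for i in range(N))
            sigma2[k] = sse / sum(r[i, k] * len(Y[i]) for i in range(N))
    return r.argmax(axis=1), beta
```

A key design point, which also applies to the extratropical-cyclone study below, is that whole trajectories (not individual points) are the units being clustered, so tracks of different lengths are handled naturally by the per-track likelihood.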


Journal ArticleDOI
TL;DR: The KDD Cup itself in 2007 consisted of a prediction competition using Netflix movie rating data, with tasks that were different and separate from those being used in the Netflix Prize itself.
Abstract: The KDD Cup is the oldest of the many data mining competitions that are now popular [1]. It is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). In 2007, the traditional KDD Cup competition was augmented with a workshop with a focus on the concurrently active Netflix Prize competition [2]. The KDD Cup itself in 2007 consisted of a prediction competition using Netflix movie rating data, with tasks that were different and separate from those being used in the Netflix Prize itself. At the workshop, participants in both the KDD Cup and the Netflix Prize competition presented their results and analyses, and exchanged ideas.

176 citations


Journal ArticleDOI
TL;DR: In this paper, a probabilistic clustering technique is developed for classification of wintertime extratropical cyclone (ETC) tracks over the North Atlantic, using a regression mixture model to describe the longitude-time and latitude-time propagation of the ETCs.
Abstract: A probabilistic clustering technique is developed for classification of wintertime extratropical cyclone (ETC) tracks over the North Atlantic. We use a regression mixture model to describe the longitude-time and latitude-time propagation of the ETCs. A simple tracking algorithm is applied to 6-hourly mean sea-level pressure fields to obtain the tracks from either a general circulation model (GCM) or a reanalysis data set. Quadratic curves are found to provide the best description of the data. We select a three-cluster classification for both data sets, based on a mix of objective and subjective criteria. The track orientations in each of the clusters are broadly similar for the GCM and reanalyzed data; they are characterized by predominantly south-to-north (S–N), west-to-east (W–E), and southwest-to-northeast (SW–NE) tracking cyclones, respectively. The reanalysis cyclone tracks, however, are found to be much more tightly clustered geographically than those of the GCM. For the reanalysis data, a link is found between the occurrence of cyclones belonging to different clusters of trajectory-shape, and the phase of the North Atlantic Oscillation (NAO). The positive phase of the NAO is associated with the SW–NE oriented cluster, whose tracks are relatively straight and smooth (with cyclones that are typically faster, more intense, and of longer duration). The negative NAO phase is associated with more-erratic W–E tracks, with typically weaker and slower-moving cyclones. The S–N cluster is accompanied by a more transient geopotential trough over the western North Atlantic. No clear associations are found in the case of the GCM composites. The GCM is able to capture cyclone tracks of quite realistic orientation, as well as subtle associated features of cyclone intensity, speed and lifetimes. The clustering clearly highlights, though, the presence of serious systematic errors in the GCM’s simulation of ETC behavior.

173 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: A probabilistic model is described for predicting the occupancy of a building using networks of people-counting sensors; the model provides robust predictions given typical sensor noise as well as missing and corrupted data from malfunctioning sensors.
Abstract: Knowledge of the number of people in a building at a given time is crucial for applications such as emergency response. Sensors can be used to gather noisy measurements which when combined, can be used to make inferences about the location, movement and density of people. In this paper we describe a probabilistic model for predicting the occupancy of a building using networks of people-counting sensors. This model provides robust predictions given typical sensor noise as well as missing and corrupted data from malfunctioning sensors. We experimentally validate the model by comparing it to a baseline method using real data from a network of optical counting sensors in a campus building.

61 citations
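To illustrate the kind of data such a model consumes, here is a hypothetical deterministic baseline of the sort the paper compares against: fuse noisy per-sensor entry/exit counts into a running occupancy estimate. The weighting scheme and function names are assumptions, not the paper's probabilistic model:

```python
import numpy as np

def estimate_occupancy(sensor_in, sensor_out, reliability):
    """Fuse noisy per-sensor entry/exit counts (arrays of shape
    time x sensors) into an occupancy estimate: net flow at each step is a
    reliability-weighted average across sensors, and occupancy is its
    running, non-negative cumulative sum."""
    w = np.asarray(reliability, dtype=float)
    w = w / w.sum()
    net = np.asarray(sensor_in, dtype=float) - np.asarray(sensor_out, dtype=float)
    fused = net @ w                          # weighted net flow per time step
    occ = np.maximum(np.cumsum(fused), 0.0)  # occupancy cannot go negative
    return occ
```

A probabilistic model improves on a baseline like this precisely where the abstract says: when a sensor goes missing or produces corrupted counts, fixed weights have no principled way to discount it.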


Journal ArticleDOI
TL;DR: Several additional examples are given of how graphical models can be applied to climate dynamics, specifically estimation using multi-resolution models of large-scale data sets such as satellite imagery, and learning hidden Markov models to capture rainfall patterns in space and time.

55 citations


Journal ArticleDOI
TL;DR: The results indicate that the Markov-modulated Poisson framework provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
Abstract: Time-series of count data occur in many different contexts, including Internet navigation logs, freeway traffic monitoring, and security logs associated with buildings. In this article we describe a framework for detecting anomalous events in such data using an unsupervised learning approach. Normal periodic behavior is modeled via a time-varying Poisson process model, which in turn is modulated by a hidden Markov process that accounts for bursty events. We outline a Bayesian framework for learning the parameters of this model from count time-series. Two large real-world datasets of time-series counts are used as testbeds to validate the approach, consisting of freeway traffic data and logs of people entering and exiting a building. We show that the proposed model is significantly more accurate at detecting known events than a more traditional threshold-based technique. We also describe how the model can be used to investigate different degrees of periodicity in the data, including systematic day-of-week and time-of-day effects, and to make inferences about different aspects of events such as number of vehicles or people involved. The results indicate that the Markov-modulated Poisson framework provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.

55 citations
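The generative model described above, a normal Poisson rate modulated by a hidden Markov event state, admits standard forward-backward smoothing once the rates are known. A sketch for a two-state version is below; the fixed rates and transition probabilities are illustrative assumptions (the paper learns them in a Bayesian framework):

```python
import numpy as np
from math import lgamma, exp, log

def event_posterior(counts, base_rate, event_rate, p01=0.05, p10=0.5):
    """Forward-backward smoothing for a two-state Markov-modulated Poisson
    model: state 0 emits counts at the normal (periodic) rate base_rate[t];
    state 1 ('event') adds event_rate on top.  Returns P(event at t | counts)."""
    T = len(counts)
    A = np.array([[1 - p01, p01], [p10, 1 - p10]])  # hidden-state transitions
    def pois(k, lam):                               # Poisson pmf in log space
        return exp(k * log(lam) - lam - lgamma(k + 1))
    lik = np.array([[pois(counts[t], base_rate[t]),
                     pois(counts[t], base_rate[t] + event_rate)]
                    for t in range(T)])
    alpha = np.zeros((T, 2))                        # scaled forward pass
    alpha[0] = np.array([1 - p01, p01]) * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    beta = np.ones((T, 2))                          # scaled backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 1]
```

Smoothing (using counts both before and after each time step) is what lets the model distinguish a sustained bursty event from a single noisy spike.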


Proceedings ArticleDOI
18 Jun 2007
TL;DR: This work describes some of the challenges of metadata enrichment on a huge scale when the metadata is highly heterogeneous, and shows how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques.
Abstract: Creating a collection of metadata records from disparate and diverse sources often results in uneven, unreliable and variable quality subject metadata. Having uniform, consistent and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, and both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment, and demonstrate the value of the enriched metadata in a prototype portal.

43 citations
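Once a topic model has been fitted to the record text, the enrichment step reduces to scoring each record against the topics and attaching the labels of the best-scoring ones. A hypothetical sketch of that scoring step; `topic_word`, `vocab`, and `topic_labels` are assumed inputs from an already-fitted and manually labeled model:

```python
import numpy as np

def enrich_subjects(record_tokens, topic_word, vocab, topic_labels, top_n=2):
    """Assign subject headings to a metadata record by scoring its tokens
    against a fitted topic model: sum log word probabilities per topic
    (rows of topic_word) and return the labels of the top-scoring topics."""
    idx = [vocab.index(t) for t in record_tokens if t in vocab]
    scores = np.log(topic_word[:, idx] + 1e-12).sum(axis=1)
    best = np.argsort(scores)[::-1][:top_n]
    return [topic_labels[k] for k in best]
```

The manual side of the pipeline described in the paper corresponds to curating `topic_labels`: mapping statistically discovered topics onto human-readable subject headings.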


Proceedings ArticleDOI
12 Aug 2007
TL;DR: The KDD Cup itself in 2007 consisted of a prediction competition using Netflix movie rating data, with tasks that were different and separate from those being used in the Netflix Prize itself.
Abstract: INTRODUCTION The KDD Cup is the oldest of the many data mining competitions that are now popular [1]. It is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). In 2007, the traditional KDD Cup competition was augmented with a workshop with a focus on the concurrently active Netflix Prize competition [2]. The KDD Cup itself in 2007 consisted of a prediction competition using Netflix movie rating data, with tasks that were different and separate from those being used in the Netflix Prize itself. At the workshop, participants in both the KDD Cup and the Netflix Prize competition presented their results and analyses, and exchanged ideas.

38 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: Finite mixtures of tree-structured distributions are extended, via Dirichlet processes, to countably many components; the resulting Bayesian framework deals with the problem of selecting the number of mixture components, and is applied to identify the number and properties of predominant precipitation patterns in historical archives of climate data.
Abstract: Finite mixtures of tree-structured distributions have been shown to be efficient and effective in modeling multivariate distributions. Using Dirichlet processes, we extend this approach to allow countably many tree-structured mixture components. The resulting Bayesian framework allows us to deal with the problem of selecting the number of mixture components by computing the posterior distribution over the number of components and integrating out the components by Bayesian model averaging. We apply the proposed framework to identify the number and the properties of predominant precipitation patterns in historical archives of climate data.
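Each tree-structured component of such a mixture is classically fit with the Chow-Liu algorithm: a maximum-weight spanning tree over pairwise mutual information. A sketch for binary data, offered as one plausible building block rather than the paper's Dirichlet-process machinery:

```python
import numpy as np
from itertools import combinations

def chow_liu_tree(X):
    """Chow-Liu: build the maximum-weight spanning tree over pairwise
    mutual information, giving the best tree-structured approximation to
    the joint distribution of the binary columns of X (rows = samples)."""
    n, d = X.shape
    def mi(i, j):                    # empirical mutual information (nats)
        m = 0.0
        for a in (0, 1):
            for b in (0, 1):
                pij = np.mean((X[:, i] == a) & (X[:, j] == b))
                pi_, pj = np.mean(X[:, i] == a), np.mean(X[:, j] == b)
                if pij > 0:
                    m += pij * np.log(pij / (pi_ * pj))
        return m
    edges = sorted(combinations(range(d), 2), key=lambda e: -mi(*e))
    parent = list(range(d))          # union-find for Kruskal's algorithm
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for u, v in edges:               # add highest-MI edges without cycles
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree
```

In a mixture setting, the same tree fit is run per component on responsibility-weighted counts; the Dirichlet-process extension in the paper then lets the number of such components be inferred rather than fixed.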

Dissertation
01 Jan 2007
TL;DR: This dissertation considers the problem of modeling grouped data where individual group members share common characteristic patterns but with some random variation, and incorporates random effects models to handle variability in shape parameters across different waveforms or images.
Abstract: In this dissertation, we consider the problem of modeling grouped data where individual group members share common characteristic patterns but with some random variation. The first problem of this type we investigate is waveform modeling and classification, where waveforms consist of sets of measurements of a phenomenon over time (such as an ECG trace of a heartbeat). The second problem we investigate involves analyzing spatial patterns in groups of images, such as brain activation images obtained from functional magnetic resonance imaging (fMRI). In both cases we model the shape of waveforms and activation patterns explicitly using probabilistic models that parameterize the shape information directly. We incorporate random effects models to handle variability in shape parameters across different waveforms or images. The random effects model allows each group to have its own shape parameters that are modeled as coming from a common prior distribution that can be viewed as a shape template. We develop learning algorithms for these models that can learn both the template and group-specific shapes. We use real-world waveform and imaging data to demonstrate the advantages of the proposed models.
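The random effects idea here, group-specific parameters drawn from a common template prior, can be illustrated in one dimension: the posterior estimate of each group's parameter shrinks its sample mean toward the shared template. The known-variance setting below is a simplifying assumption:

```python
import numpy as np

def random_effects_shrinkage(groups, tau2, sigma2):
    """1-D analogue of the random effects model: each group's parameter
    theta_g ~ N(mu, tau2) and its observations y ~ N(theta_g, sigma2).
    The posterior mean of theta_g is a precision-weighted compromise
    between the group's sample mean and the template mean mu."""
    mu = np.mean([np.mean(g) for g in groups])       # template estimate
    estimates = []
    for g in groups:
        n = len(g)
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)   # precision weighting
        estimates.append(w * np.mean(g) + (1 - w) * mu)
    return mu, estimates
```

The dissertation's models apply the same shrinkage logic to full shape-parameter vectors of waveforms and fMRI activation patterns, learning both the template and the group-specific shapes jointly.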