
Showing papers in "arXiv: Applications in 2015"


Journal Article
TL;DR: In this paper, a generative model called Bayesian rule lists (BRL) is proposed to predict the risk of stroke in patients with atrial fibrillation, which can be used to produce highly accurate and interpretable medical scoring systems.
Abstract: We aim to produce predictive models that are not only accurate, but are also interpretable to human experts. Our models are decision lists, which consist of a series of if...then... statements (e.g., if high blood pressure, then stroke) that discretize a high-dimensional, multivariate feature space into a series of simple, readily interpretable decision statements. We introduce a generative model called Bayesian Rule Lists that yields a posterior distribution over possible decision lists. It employs a novel prior structure to encourage sparsity. Our experiments show that Bayesian Rule Lists has predictive accuracy on par with the current top algorithms for prediction in machine learning. Our method is motivated by recent developments in personalized medicine, and can be used to produce highly accurate and interpretable medical scoring systems. We demonstrate this by producing an alternative to the CHADS$_2$ score, actively used in clinical practice for estimating the risk of stroke in patients that have atrial fibrillation. Our model is as interpretable as CHADS$_2$, but more accurate.

532 citations
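
To make the decision-list form described in the Bayesian Rule Lists entry above concrete, here is a minimal Python sketch of how an ordered series of if...then... statements is evaluated. The rule conditions, feature names, and risk values are hypothetical placeholders chosen for illustration; they are not the fitted model from the paper and not the CHADS$_2$ score.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, bool]
Rule = Tuple[Callable[[Patient], bool], float]

# Hypothetical rules as (condition, predicted stroke risk); values are illustrative only.
RULES: List[Rule] = [
    (lambda p: p["prior_stroke"], 0.50),
    (lambda p: p["hypertension"] and p["age_over_75"], 0.30),
    (lambda p: p["hypertension"], 0.15),
]
DEFAULT_RISK = 0.05  # the final "else" clause of the list


def predict_risk(patient: Patient) -> float:
    """Return the risk attached to the first rule whose condition holds."""
    for condition, risk in RULES:
        if condition(patient):
            return risk
    return DEFAULT_RISK
```

Because the rules are checked in order, a patient is always explained by exactly one statement, which is what keeps this kind of model readable.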


Journal Article
TL;DR: The authors showed that for typical psychological and psycholinguistic data, higher power is achieved without inflating Type I error rate if a model selection criterion is used to select a random effect structure that is supported by the data.
Abstract: Linear mixed-effects models (LMMs) have increasingly replaced mixed-model analyses of variance for statistical inference in factorial psycholinguistic experiments. Although LMMs have many advantages over ANOVA, setting them up for data analysis, as with ANOVA, requires some care. One simple option, when numerically possible, is to fit the full variance-covariance structure of random effects (the maximal model; Barr et al. 2013), presumably to keep Type I error down to the nominal alpha in the presence of random effects. Although it is true that fitting a model with only random intercepts may lead to higher Type I error, fitting a maximal model also has a cost: it can lead to a significant loss of power. We demonstrate this with simulations and suggest that for typical psychological and psycholinguistic data, higher power is achieved without inflating Type I error rate if a model selection criterion is used to select a random effect structure that is supported by the data.

330 citations


Posted Content
TL;DR: It is shown that likelihood ratios are invariant under a specific class of dimensionality reduction maps, and that discriminative classifiers can be used to approximate the generalized likelihood ratio statistic when only a generative model for the data is available.
Abstract: In many fields of science, generalized likelihood ratio tests are established tools for statistical inference. At the same time, it has become increasingly common that a simulator (or generative model) is used to describe complex processes that tie parameters $\theta$ of an underlying theory and measurement apparatus to high-dimensional observations $\mathbf{x}\in \mathbb{R}^p$. However, simulators often do not provide a way to evaluate the likelihood function for a given observation $\mathbf{x}$, which motivates a new class of likelihood-free inference algorithms. In this paper, we show that likelihood ratios are invariant under a specific class of dimensionality reduction maps $\mathbb{R}^p \mapsto \mathbb{R}$. As a direct consequence, we show that discriminative classifiers can be used to approximate the generalized likelihood ratio statistic when only a generative model for the data is available. This leads to a new machine learning-based approach to likelihood-free inference that is complementary to Approximate Bayesian Computation, and which does not require a prior on the model parameters. Experimental results on artificial problems with known exact likelihoods illustrate the potential of the proposed method.

153 citations
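
The classifier-based idea in the likelihood-ratio entry above can be sketched on a toy problem. For balanced classes, a well-fitted classifier output $s(\mathbf{x})$ satisfies $s/(1-s) \approx p(\mathbf{x}|\theta_1)/p(\mathbf{x}|\theta_0)$. The example below is a sketch under stated assumptions rather than the authors' procedure: the two unit-variance Gaussian "simulators", the logistic model, and all function names are assumptions made for illustration. With these Gaussians the exact log ratio is $x - 0.5$, so the fitted slope and intercept can be checked against it.

```python
import math
import random


def fit_logistic(xs, ys, lr=1.0, steps=200):
    """Fit s(x) = sigmoid(w*x + b) by batch gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            s = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (y - s) * x
            grad_b += y - s
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def estimate_log_ratio(seed=0, n_per_class=2000):
    """Train on samples from two toy simulators and return (w, b).

    Label 0: x ~ N(0, 1) (theta_0); label 1: x ~ N(1, 1) (theta_1).
    With balanced classes, log(s / (1 - s)) = w*x + b estimates the
    log likelihood ratio log p(x|theta_1) - log p(x|theta_0).
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    xs += [rng.gauss(1.0, 1.0) for _ in range(n_per_class)]
    ys = [0] * n_per_class + [1] * n_per_class
    return fit_logistic(xs, ys)


def log_ratio(x, w, b):
    """Approximate log likelihood ratio at x from the fitted classifier."""
    return w * x + b
```

Only samples from the simulators are used during training; the likelihoods themselves are never evaluated, which is the point of the likelihood-free setting.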


Posted Content
TL;DR: The results show that either an interpolation with seasonal Kalman filter from the zoo package or a linear interpolation on seasonal loess decomposed data from the forecast package was the most effective method for dealing with missing data in most of the scenarios assessed in this paper.
Abstract: Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is that most standard imputation techniques cannot be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of univariate time series imputation in general and an in-detail insight into the respective implementations within R packages. Furthermore, we experimentally compare the R functions on different time series using four different ratios of missing data. Our results show that either an interpolation with seasonal Kalman filter from the zoo package or a linear interpolation on seasonal loess decomposed data from the forecast package was the most effective method for dealing with missing data in most of the scenarios assessed in this paper.

120 citations
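
As a small illustration of imputation that relies on time dependencies rather than inter-attribute correlations (the point made in the time series imputation entry above), here is a minimal Python sketch of plain linear interpolation over gaps. It is an assumption-laden baseline: it does not reproduce the R functions compared in the paper and it ignores seasonality, which the best-performing methods in the abstract explicitly handle.

```python
from typing import List, Optional


def interpolate_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) by linear interpolation between the nearest
    observed neighbours. Leading and trailing gaps are left as None."""
    result = list(series)
    observed = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(observed, observed[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            result[i] = series[left] + weight * (series[right] - series[left])
    return result
```

For example, `interpolate_linear([1.0, None, None, None, 5.0])` fills the gap with evenly spaced values between the two observed endpoints.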


Posted Content
TL;DR: In this article, generalized additive mixed models are introduced as an extension of the generalized linear mixed model which makes it possible to deal with temporal autocorrelational structure in experimental data.
Abstract: Generalized additive mixed models are introduced as an extension of the generalized linear mixed model which makes it possible to deal with temporal autocorrelational structure in experimental data. This autocorrelational structure is likely to be a consequence of learning, fatigue, or the ebb and flow of attention within an experiment (the `human factor'). Unlike molecules or plots of barley, subjects in psycholinguistic experiments are intelligent beings that depend for their survival on constant adaptation to their environment, including the environment of an experiment. Three data sets illustrate that the human factor may interact with predictors of interest, both factorial and metric. We also show that, especially within the framework of the generalized additive model, in the nonlinear world, fitting maximally complex models that take every possible contingency into account is ill-advised as a modeling strategy. Alternative modeling strategies are discussed for both confirmatory and exploratory data analysis.

103 citations


Journal Article
TL;DR: In this paper, a generic Bayesian framework for inference in distributional regression models is proposed, in which each parameter of a potentially complex response distribution and not only the mean is related to a structured additive predictor.
Abstract: We propose a generic Bayesian framework for inference in distributional regression models in which each parameter of a potentially complex response distribution and not only the mean is related to a structured additive predictor. The latter is composed additively of a variety of different functional effect types such as nonlinear effects, spatial effects, random coefficients, interaction surfaces or other (possibly nonstandard) basis function representations. To enforce specific properties of the functional effects such as smoothness, informative multivariate Gaussian priors are assigned to the basis function coefficients. Inference can then be based on computationally efficient Markov chain Monte Carlo simulation techniques where a generic procedure makes use of distribution-specific iteratively weighted least squares approximations to the full conditionals. The framework of distributional regression encompasses many special cases relevant for treating nonstandard response structures such as highly skewed nonnegative responses, overdispersed and zero-inflated counts or shares including the possibility for zero- and one-inflation. We discuss distributional regression along a study on determinants of labour incomes for full-time working males in Germany with a particular focus on regional differences after the German reunification. Controlling for age, education, work experience and local disparities, we estimate full conditional income distributions allowing us to study various distributional quantities such as moments, quantiles or inequality measures in a consistent manner in one joint model. Detailed guidance on practical aspects of model choice including the selection of several competing distributions for labour incomes and the consideration of different covariate effects on the income distribution complete the distributional regression analysis. 
We find that next to a lower expected income, full-time working men in East Germany also face a more unequal income distribution than men in the West, ceteris paribus.

99 citations


Journal Article
TL;DR: A new Bayesian model and algorithm used for depth and reflectivity profiling using full waveforms from the time-correlated single-photon counting measurement in the limit of very low photon counts is presented.
Abstract: This paper presents a new Bayesian model and algorithm used for depth and intensity profiling using full waveforms from the time-correlated single photon counting (TCSPC) measurement in the limit of very low photon counts. The model proposed represents each Lidar waveform as a combination of a known impulse response, weighted by the target intensity, and an unknown constant background, corrupted by Poisson noise. Prior knowledge about the problem is embedded in a hierarchical model that describes the dependence structure between the model parameters and their constraints. In particular, a gamma Markov random field (MRF) is used to model the joint distribution of the target intensity, and a second MRF is used to model the distribution of the target depth, which are both expected to exhibit significant spatial correlations. An adaptive Markov chain Monte Carlo algorithm is then proposed to compute the Bayesian estimates of interest and perform Bayesian inference. This algorithm is equipped with a stochastic optimization adaptation mechanism that automatically adjusts the parameters of the MRFs by maximum marginal likelihood estimation. Finally, the benefits of the proposed methodology are demonstrated through a series of experiments using real data.

98 citations


Posted Content
TL;DR: In this article, an approach to estimating causal/structural parameters in the presence of many instruments and controls based on methods for estimating sparse high-dimensional models is presented. It extends earlier work on instrument selection with few controls, and on control selection when the variable of interest is exogenous conditional on observables, to accommodate both a large number of controls and a large number of instruments.
Abstract: In this note, we offer an approach to estimating causal/structural parameters in the presence of many instruments and controls based on methods for estimating sparse high-dimensional models. We use these high-dimensional methods to select both which instruments and which control variables to use. The approach we take extends BCCH2012, which covers selection of instruments for IV models with a small number of controls, and extends BCH2014, which covers selection of controls in models where the variable of interest is exogenous conditional on observables, to accommodate both a large number of controls and a large number of instruments. We illustrate the approach with a simulation and an empirical example. Technical supporting material is available in a supplementary online appendix.

83 citations


Posted Content
TL;DR: In this article, three sampling methods are compared for efficiency on a number of test problems of various complexity for which analytic quadratures are available: Monte Carlo with pseudo-random numbers, Latin Hypercube Sampling, and Quasi Monte Carlo based on Sobol sequences.
Abstract: Three sampling methods are compared for efficiency on a number of test problems of various complexity for which analytic quadratures are available. The methods compared are Monte Carlo with pseudo-random numbers, Latin Hypercube Sampling, and Quasi Monte Carlo with sampling based on Sobol sequences. In general, the results show superior performance of the Quasi Monte Carlo approach based on Sobol sequences, in line with theoretical predictions. Latin Hypercube Sampling can be more efficient than both the Monte Carlo method and the Quasi Monte Carlo method, but its advantage over Quasi Monte Carlo holds only for a reduced set of function types and at small numbers of sampled points. In conclusion, the Quasi Monte Carlo method would appear to be the safest bet when integrating functions of unknown type.

77 citations
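
Two of the three samplers compared in the entry above are easy to sketch with the Python standard library. The code below is an illustrative assumption, not the authors' test suite: it estimates an integral over the unit hypercube with plain Monte Carlo and with Latin Hypercube Sampling. The Sobol-based Quasi Monte Carlo method is deliberately left out, because generating Sobol sequences needs direction numbers that a short self-contained sketch cannot reasonably include.

```python
import random


def monte_carlo(f, n, dim, rng):
    """Plain Monte Carlo estimate of the integral of f over the unit hypercube."""
    return sum(f([rng.random() for _ in range(dim)]) for _ in range(n)) / n


def latin_hypercube(f, n, dim, rng):
    """Latin Hypercube estimate: each axis is cut into n equal strata and every
    stratum is used exactly once per axis, in an independently shuffled order."""
    columns = []
    for _ in range(dim):
        strata = list(range(n))
        rng.shuffle(strata)
        # One uniformly placed point inside each stratum [k/n, (k+1)/n).
        columns.append([(k + rng.random()) / n for k in strata])
    return sum(f([columns[d][i] for d in range(dim)]) for i in range(n)) / n
```

For an additive integrand such as $f(x, y) = x^2 + y^2$ (exact integral $2/3$), the per-axis stratification bounds the Latin Hypercube error by roughly $4/n$, whereas the plain Monte Carlo error shrinks only like $1/\sqrt{n}$. This is the kind of function type on which Latin Hypercube Sampling does particularly well.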


Posted Content
TL;DR: In this paper, a Dynamic Linear Model (DLM) that treats search query volume (SQV) data as a representation of an unobservable process is proposed and used to forecast the number of hotel nonresident registrations in Puerto Rico from SQV data downloaded on 11 different occasions.
Abstract: Recently, studies have used search query volume (SQV) data to forecast a given process of interest. However, Google Trends SQV data comes from a periodic sample of queries. As a result, Google Trends data is different every week. We propose a Dynamic Linear Model that treats SQV data as a representation of an unobservable process. We apply our model to forecast the number of hotel nonresident registrations in Puerto Rico using SQV data downloaded on 11 different occasions. The model provides better inference on the association between the number of hotel nonresident registrations and SQV than using Google Trends data retrieved only on one occasion. Furthermore, our model results in more realistic prediction intervals of forecasts. However, compared to simpler models we only find evidence of better performance for our model when making forecasts on a horizon of over 6 months.

74 citations


Posted Content
TL;DR: The VARX-L framework adapts several prominent scalar regression regularization techniques to a vector time series context in order to reduce the parameter space of VAR and VARX models.
Abstract: The vector autoregression (VAR) has long proven to be an effective method for modeling the joint dynamics of macroeconomic time series as well as forecasting. A major shortcoming of the VAR that has hindered its applicability is its heavy parameterization: the parameter space grows quadratically with the number of series included, quickly exhausting the available degrees of freedom. Consequently, forecasting using VARs is intractable for low-frequency, high-dimensional macroeconomic data. However, empirical evidence suggests that VARs that incorporate more component series tend to result in more accurate forecasts. Conventional methods that allow for the estimation of large VARs either tend to require ad hoc subjective specifications or are computationally infeasible. Moreover, as global economies become more intricately intertwined, there has been substantial interest in incorporating the impact of stochastic, unmodeled exogenous variables. Vector autoregression with exogenous variables (VARX) extends the VAR to allow for the inclusion of unmodeled variables, but it similarly faces dimensionality challenges. We introduce the VARX-L framework, a structured family of VARX models, and provide methodology that allows for both efficient estimation and accurate forecasting in high-dimensional analysis. VARX-L adapts several prominent scalar regression regularization techniques to a vector time series context in order to greatly reduce the parameter space of VAR and VARX models. We also highlight a compelling extension that allows for shrinking toward reference models, such as a vector random walk. We demonstrate the efficacy of VARX-L in both low- and high-dimensional macroeconomic forecasting applications and simulated data examples. Our methodology is easily reproducible in a publicly available R package.
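
The remark in the VARX-L abstract above that the parameter space grows quadratically with the number of series can be checked with simple arithmetic. The helper below is a hypothetical illustration (its name and argument names are not from the paper or its R package): it counts the regression coefficients of a VARX model with `k` endogenous series and `p` lags, `m` exogenous series with `s` lags, and one intercept per equation.

```python
def varx_coefficient_count(k: int, p: int, m: int = 0, s: int = 0) -> int:
    """Coefficients in a VARX model: each of the k equations has k*p lagged
    endogenous terms, m*s lagged exogenous terms, and one intercept."""
    return k * (k * p + m * s + 1)
```

With 4 lags and no exogenous series, going from 10 to 20 series raises the count from 410 to 1620, roughly a fourfold increase, which is why regularization becomes attractive for low-frequency macroeconomic data with few observations.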

Posted Content
TL;DR: Superheat enhances the traditional heatmap by providing a platform to visualize a wide range of data types simultaneously, adding to the heatmap a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and text information.
Abstract: The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics for visualizing large gene expression datasets, they remain a severely underutilized visualization tool in modern data analysis. In this paper we introduce superheat, a new R package that provides an extremely flexible and customizable platform for visualizing large datasets using extendable heatmaps. Superheat enhances the traditional heatmap by providing a platform to visualize a wide range of data types simultaneously, adding to the heatmap a response variable as a scatterplot, model results as boxplots, correlation information as barplots, text information, and more. Superheat allows the user to explore their data to greater depths and to take advantage of the heterogeneity present in the data to inform analysis decisions. The goal of this paper is two-fold: (1) to demonstrate the potential of the heatmap as a default visualization method for a wide range of data types using reproducible examples, and (2) to highlight the customizability and ease of implementation of the superheat package in R for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three case studies, each based on publicly available data sources and accompanied by a file outlining the step-by-step analytic pipeline (with code).

Journal Article
TL;DR: A nonparametric approach and a model‐based approach are introduced and their epicenter estimation capabilities are studied by means of data from simulated earthquakes and data from 49 earthquakes detected by the earthquake early warning system in Chile.
Abstract: The Earthquake Network research project implements a crowdsourced earthquake early warning system based on smartphones. Smartphones, which are made available by the global population, exploit the Internet connection to report a signal to a central server every time a vibration is detected by the on-board accelerometer sensor. This paper introduces a statistical approach for the detection of earthquakes from the data coming from the network of smartphones. The approach makes it possible to handle a dynamic network in which the number of active nodes constantly changes and where nodes are heterogeneous in terms of sensor sensitivity and transmission delay. Additionally, the approach keeps the probability of false alarm under control. The statistical approach is applied to the data collected by three subnetworks related to the cities of Santiago de Chile, Iquique (Chile) and Kathmandu (Nepal). The detection capabilities of the approach are discussed in terms of earthquake magnitude and detection delay. A simulation study is carried out in order to link the probability of detection and the detection delay to the behaviour of the network under an earthquake event.

Posted Content
TL;DR: In this paper, the authors show that the inverse-variance weighted method as originally proposed (equivalent to a two-stage least squares or allele score analysis using individual-level data) can lead to over-rejection of the null, particularly when there is heterogeneity between the causal estimates from different genetic variants.
Abstract: Mendelian randomization is the use of genetic variants as instrumental variables to assess whether a risk factor is a cause of a disease outcome. Increasingly, Mendelian randomization investigations are conducted on the basis of summarized data, rather than individual-level data. These summarized data comprise the coefficients and standard errors from univariate regression models of the risk factor on each genetic variant, and of the outcome on each genetic variant. A causal estimate can be derived from these associations for each individual genetic variant, and a combined estimate can be obtained by inverse-variance weighted meta-analysis of these causal estimates. Various proposals have been made for how to calculate this inverse-variance weighted estimate. In this paper, we show that the inverse-variance weighted method as originally proposed (equivalent to a two-stage least squares or allele score analysis using individual-level data) can lead to over-rejection of the null, particularly when there is heterogeneity between the causal estimates from different genetic variants. Random-effects models should be routinely employed to allow for this possible heterogeneity. Additionally, over-rejection of the null is observed when associations with the risk factor and the outcome are obtained in overlapping participants. The use of weights including second-order terms from the delta method is recommended in this case.
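
The inverse-variance weighted combination described in the Mendelian randomization abstract above can be sketched from summarized data alone. The code below is a minimal sketch under stated assumptions, not the paper's implementation: it uses first-order weights, and for the random-effects analysis it uses a multiplicative model in which the fixed-effect standard error is scaled by the square root of Cochran's Q divided by its degrees of freedom (never scaled below 1). The second-order delta-method weights recommended for overlapping samples are not included, and the function and argument names are hypothetical.

```python
import math
from typing import List, Tuple


def ivw_estimate(bx: List[float], by: List[float],
                 se_by: List[float]) -> Tuple[float, float, float]:
    """Return (estimate, fixed-effect SE, multiplicative random-effects SE).

    bx:    variant-risk factor association estimates
    by:    variant-outcome association estimates
    se_by: standard errors of the variant-outcome associations
    """
    ratios = [y / x for x, y in zip(bx, by)]             # per-variant causal estimates
    weights = [(x / s) ** 2 for x, s in zip(bx, se_by)]  # first-order inverse variances
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se_fixed = math.sqrt(1.0 / total)
    # Cochran's Q measures heterogeneity between the per-variant estimates.
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    scale = max(1.0, math.sqrt(q / (len(bx) - 1)))
    return estimate, se_fixed, se_fixed * scale
```

When all variants give the same ratio estimate, Q is zero and the two standard errors coincide. When the estimates disagree, the random-effects standard error widens, which is the protection against over-rejection of the null that the abstract argues for.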

Journal Article
TL;DR: In this paper, the authors present a model for cortical activity based on nearest-neighbor autoregression that incorporates local spatio-temporal interactions between distributed sources in a manner consistent with neurophysiology and neuroanatomy.
Abstract: MEG/EEG are non-invasive imaging techniques that record brain activity with high temporal resolution. However, estimation of brain source currents from surface recordings requires solving an ill-posed inverse problem. Converging lines of evidence in neuroscience, from neuronal network models to resting-state imaging and neurophysiology, suggest that cortical activation is a distributed spatiotemporal dynamic process, supported by both local and long-distance neuroanatomic connections. Because spatiotemporal dynamics of this kind are central to brain physiology, inverse solutions could be improved by incorporating models of these dynamics. In this article, we present a model for cortical activity based on nearest-neighbor autoregression that incorporates local spatiotemporal interactions between distributed sources in a manner consistent with neurophysiology and neuroanatomy. We develop a dynamic Maximum a Posteriori Expectation-Maximization (dMAP-EM) source localization algorithm for estimation of cortical sources and model parameters based on the Kalman Filter, the Fixed Interval Smoother, and the EM algorithms. We apply the dMAP-EM algorithm to simulated experiments as well as to human experimental data. Furthermore, we derive expressions to relate our dynamic estimation formulas to those of standard static models, and show how dynamic methods optimally assimilate past and future data. Our results establish the feasibility of spatiotemporal dynamic estimation in large-scale distributed source spaces with several thousand source locations and hundreds of sensors, with resulting inverse solutions that provide substantial performance improvements over static methods.

Posted Content
TL;DR: In this article, a scalable Dynamic Nearest Neighbor Gaussian Process (DNNGP) model is proposed to provide a sparse approximation to any spatio-temporal Gaussian process (e.g., with non-separable covariance structures).
Abstract: Particulate matter (PM) is a class of malicious environmental pollutants known to be detrimental to human health. Regulatory efforts aimed at curbing PM levels in different countries often require high resolution space-time maps that can identify red-flag regions exceeding statutory concentration limits. Continuous spatio-temporal Gaussian Process (GP) models can deliver maps depicting predicted PM levels and quantify predictive uncertainty. However, GP based approaches are usually thwarted by computational challenges posed by large datasets. We construct a novel class of scalable Dynamic Nearest Neighbor Gaussian Process (DNNGP) models that can provide a sparse approximation to any spatio-temporal GP (e.g., with non-separable covariance structures). The DNNGP we develop here can be used as a sparsity-inducing prior for spatio-temporal random effects in any Bayesian hierarchical model to deliver full posterior inference. Storage and memory requirements for a DNNGP model are linear in the size of the dataset thereby delivering massive scalability without sacrificing inferential richness. Extensive numerical studies reveal that the DNNGP provides substantially superior approximations to the underlying process than low rank approximations. Finally, we use the DNNGP to analyze a massive air quality dataset to substantially improve predictions of PM levels across Europe in conjunction with the LOTOS-EUROS chemistry transport models (CTMs).

Journal Article
TL;DR: A high-dimensional varying-coefficient model is proposed for incorporating functional aspects of phenotypic traits into GWAS; the resulting fGWAS model, equipped with the Bayesian group lasso, provides a useful tool for genetic and developmental analysis of complex traits or diseases.
Abstract: Although genome-wide association studies (GWAS) have proven powerful for comprehending the genetic architecture of complex traits, they are challenged by a high dimension of single-nucleotide polymorphisms (SNPs) as predictors, the presence of complex environmental factors, and longitudinal or functional natures of many complex traits or diseases. To address these challenges, we propose a high-dimensional varying-coefficient model for incorporating functional aspects of phenotypic traits into GWAS to formulate a so-called functional GWAS or fGWAS. The Bayesian group lasso and the associated MCMC algorithms are developed to identify significant SNPs and estimate how they affect longitudinal traits through time-varying genetic actions. The model is generalized to analyze the genetic control of complex traits using subject-specific sparse longitudinal data. The statistical properties of the new model are investigated through simulation studies. We use the new model to analyze a real GWAS data set from the Framingham Heart Study, leading to the identification of several significant SNPs associated with age-specific changes of body mass index. The fGWAS model, equipped with the Bayesian group lasso, will provide a useful tool for genetic and developmental analysis of complex traits or diseases.

Posted Content
TL;DR: In this paper, a new probabilistic forecasting technique is proposed based on a multiparametric programming formulation that partitions the uncertainty parameter space into critical regions from which the conditional probability distribution of the real-time LMP/congestion is obtained.
Abstract: The short-term forecasting of real-time locational marginal price (LMP) and network congestion is considered from a system operator perspective. A new probabilistic forecasting technique is proposed based on a multiparametric programming formulation that partitions the uncertainty parameter space into critical regions from which the conditional probability distribution of the real-time LMP/congestion is obtained. The proposed method incorporates load/generation forecast, time varying operation constraints, and contingency models. By shifting the computation cost associated with multiparametric programs offline, the online computation cost is significantly reduced. An online simulation technique by generating critical regions dynamically is also proposed, which results in several orders of magnitude improvement in the computational cost over standard Monte Carlo methods.

Journal Article
TL;DR: A statistical method is presented that provides locally calibrated, probabilistic wind speed forecasts at any desired place within the forecast domain based on the output of a numerical weather prediction (NWP) model.
Abstract: Probabilistic forecasts of wind speed are important for a wide range of applications, ranging from operational decision making in connection with wind power generation to storm warnings, ship routing and aviation. We present a statistical method that provides locally calibrated, probabilistic wind speed forecasts at any desired place within the forecast domain based on the output of a numerical weather prediction (NWP) model. Three approaches for wind speed post-processing are proposed, which use either truncated normal, gamma or truncated logistic distributions to make probabilistic predictions about future observations conditional on the forecasts of an ensemble prediction system (EPS). In order to provide probabilistic forecasts on a grid, predictive distributions that were calibrated with local wind speed observations need to be interpolated. We study several interpolation schemes that combine geostatistical methods with local information on annual mean wind speeds, and evaluate the proposed methodology with surface wind speed forecasts over Germany from the COSMO-DE (Consortium for Small-scale Modelling) ensemble prediction system.

Journal Article
TL;DR: A novel calibration method for computer models whose output is in the form of binary spatial data that helps rigorously characterize the parameter uncertainty even in the presence of systematic data-model discrepancies and dependence in the errors is presented.
Abstract: Rapid retreat of ice in the Amundsen Sea sector of West Antarctica may cause drastic sea level rise, posing significant risks to populations in low-lying coastal regions. Calibration of computer models representing the behavior of the West Antarctic Ice Sheet is key for informative projections of future sea level rise. However, both the relevant observations and the model output are high-dimensional binary spatial data; existing computer model calibration methods are unable to handle such data. Here we present a novel calibration method for computer models whose output is in the form of binary spatial data. To mitigate the computational and inferential challenges posed by our approach, we apply a generalized principal component based dimension reduction method. To demonstrate the utility of our method, we calibrate the PSU3D-ICE model by comparing the output from a 499-member perturbed-parameter ensemble with observations from the Amundsen Sea sector of the ice sheet. Our methods help rigorously characterize the parameter uncertainty even in the presence of systematic data-model discrepancies and dependence in the errors. Our method also helps inform environmental risk analyses by contributing to improved projections of sea level rise from the ice sheets.

Journal Article
TL;DR: In this paper, the authors proposed an ensemble model output statistics (EMOS) model for calibrating wind speed forecasts based on weighted mixtures of truncated normal (TN) and log-normal (LN) distributions where model parameters and component weights are estimated by optimizing the values of proper scoring rules over a rolling training period.
Abstract: Ensemble model output statistics (EMOS) is a statistical tool for post-processing forecast ensembles of weather variables obtained from multiple runs of numerical weather prediction models in order to produce calibrated predictive probability density functions (PDFs). The EMOS predictive PDF is given by a parametric distribution with parameters depending on the ensemble forecasts. We propose an EMOS model for calibrating wind speed forecasts based on weighted mixtures of truncated normal (TN) and log-normal (LN) distributions where model parameters and component weights are estimated by optimizing the values of proper scoring rules over a rolling training period. The new model is tested on wind speed forecasts of the 50 member European Centre for Medium-Range Weather Forecasts ensemble, the 11 member Aire Limitee Adaptation dynamique Developpement International-Hungary Ensemble Prediction System ensemble of the Hungarian Meteorological Service and the eight-member University of Washington mesoscale ensemble, and its predictive performance is compared to that of various benchmark EMOS models based on single parametric families and combinations thereof. The results indicate improved calibration of probabilistic and accuracy of point forecasts in comparison with the raw ensemble and climatological forecasts. The mixture EMOS model significantly outperforms the TN and LN EMOS methods, moreover, it provides better calibrated forecasts than the TN-LN combination model and offers an increased flexibility while avoiding covariate selection problems.

Journal ArticleDOI
TL;DR: This article proposes a joint Ising and Dirichlet Process (Ising-DP) prior within the framework of Bayesian stochastic search variable selection for selecting brain voxels in high-dimensional SI regressions and proposes a new analytic approach to derive bounds for the hyperparameters.
Abstract: Multi-subject functional magnetic resonance imaging (fMRI) data has been increasingly used to study the population-wide relationship between human brain activity and individual biological or behavioral traits. A common method is to regress the scalar individual response on imaging predictors, known as a scalar-on-image (SI) regression. Analysis and computation of such massive and noisy data with complex spatio-temporal correlation structure is challenging. In this article, motivated by a psychological study on human affective feelings using fMRI, we propose a joint Ising and Dirichlet Process (Ising-DP) prior within the framework of Bayesian stochastic search variable selection for selecting brain voxels in high-dimensional SI regressions. The Ising component of the prior makes use of the spatial information between voxels, and the DP component groups the coefficients of the large number of voxels to a small set of values and thus greatly reduces the posterior computational burden. To address the phase transition phenomenon of the Ising prior, we propose a new analytic approach to derive bounds for the hyperparameters, illustrated on 2- and 3-dimensional lattices. The proposed method is compared with several alternative methods via simulations, and is applied to the fMRI data collected from the KLIFF hand-holding experiment.

Posted Content
TL;DR: This work combines a new class of mixed graphical models with a structure estimation approach based on generalized covariance matrices to close the methodological gap in undirected graphical model estimation.
Abstract: Undirected graphical models are a key component in the analysis of complex observational data in a large variety of disciplines. In many of these applications one is interested in estimating the undirected graphical model underlying a distribution over variables with different domains. Despite the pervasive need for such an estimation method, to date there is no such method that models all variables on their proper domain. We close this methodological gap by combining a new class of mixed graphical models with a structure estimation approach based on generalized covariance matrices. We report the performance of our methods using simulations, illustrate the method with a dataset on Autism Spectrum Disorder (ASD) and provide an implementation as an R-package.

Posted Content
TL;DR: In this paper, a probit stick-breaking process (PSBP) mixture model is proposed for flexible estimation of the conditional density function of transport risk, which provides a tool for the forwarder to offer customized price and service quotes.
Abstract: In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for customers and freight forwarders. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Interestingly, our preliminary data analysis shows a strong multimodal feature in the transport risks, driven by unobserved events, such as cargo missing flights. To accommodate this feature, we introduce a Bayesian nonparametric model -- the probit stick-breaking process (PSBP) mixture model -- for flexible estimation of the conditional (i.e., state-dependent) density function of transport risk. We demonstrate that using simpler methods, such as OLS linear regression, can lead to misleading inferences. Our model provides a tool for the forwarder to offer customized price and service quotes. It can also generate baseline airline performance to enable fair supplier evaluation. Furthermore, the method allows us to separate recurrent risks from disruption risks. This is important, because hedging strategies for these two kinds of risks are often drastically different.

Journal ArticleDOI
TL;DR: A flexible framework for modeling high-dimensional imaging data observed longitudinally is developed that is very fast, scalable to studies including ultra-high dimensional data, and can easily be adapted to and executed on modest computing infrastructures.
Abstract: We develop a flexible framework for modeling high-dimensional imaging data observed longitudinally. The approach decomposes the observed variability of repeatedly measured high-dimensional observations into three additive components: a subject-specific imaging random intercept that quantifies the cross-sectional variability, a subject-specific imaging slope that quantifies the dynamic irreversible deformation over multiple realizations, and a subject-visit-specific imaging deviation that quantifies exchangeable effects between visits. The proposed method is very fast, scalable to studies including ultrahigh-dimensional data, and can easily be adapted to and executed on modest computing infrastructures. The method is applied to the longitudinal analysis of diffusion tensor imaging (DTI) data of the corpus callosum of multiple sclerosis (MS) subjects. The study includes 176 subjects observed at 466 visits. For each subject and visit, the study contains a registered DTI scan of the corpus callosum at roughly 30,000 voxels.
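
One way to write the three-component decomposition described above is sketched below. The notation is an assumption introduced here for illustration and does not come from the abstract: $i$ indexes subjects, $j$ visits and $v$ voxels, $\eta(v)$ is a fixed mean image, and $T_{ij}$ is the time of visit $j$ for subject $i$.

```latex
Y_{ij}(v) = \eta(v) + X_{i,0}(v) + X_{i,1}(v)\, T_{ij} + W_{ij}(v)
```

Here $X_{i,0}$ plays the role of the subject-specific random intercept, $X_{i,1}$ the subject-specific slope, and $W_{ij}$ the subject-visit-specific deviation.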

Posted Content
TL;DR: This work systematically analyzes the TDS strategy in a rigorous statistical sense and proves that with a slight modification to the commonly used formula for FDR estimation, the peptide-level FDR can be rigorously controlled based on the concatenated TDS.
Abstract: Motivation: Target-decoy search (TDS) is currently the most popular strategy for estimating and controlling the false discovery rate (FDR) of peptide identifications in mass spectrometry-based shotgun proteomics. While this strategy is very useful in practice and has been intensively studied empirically, its theoretical foundation has not yet been well established. Result: In this work, we systematically analyze the TDS strategy in a rigorous statistical sense. We prove that the commonly used concatenated TDS provides a conservative estimate of the FDR for any given score threshold, but it cannot rigorously control the FDR. We prove that with a slight modification to the commonly used formula for FDR estimation, the peptide-level FDR can be rigorously controlled based on the concatenated TDS. We show that the spectrum-level FDR control is difficult. We verify the theoretical conclusions with real mass spectrometry data.
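
As an illustration of the quantities involved, the sketch below estimates the FDR at a score threshold from a concatenated target-decoy search as the ratio of decoy hits to target hits. The abstract does not spell out either the common formula or the modification, so both are assumptions here: the plain ratio D/T stands in for the common estimate, and the `add_one` option, which uses (D + 1)/T, is shown only as one plausible form of a slight modification. The function name and arguments are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold, add_one=False):
    """Estimate the FDR among target hits scoring at or above `threshold`.

    target_scores / decoy_scores: scores of identifications that matched
    target and decoy sequences in a concatenated search.
    add_one: if True, use (D + 1) / T instead of D / T (assumed correction).
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        # Convention chosen for this sketch: nothing accepted, so report 0.
        return 0.0
    numerator = n_decoy + 1 if add_one else n_decoy
    # An FDR estimate above 1 is not meaningful, so cap it.
    return min(1.0, numerator / n_target)
```

For example, with four target hits and one decoy hit above the threshold, the plain estimate is 0.25 and the corrected one is 0.5; the gap shrinks as the number of accepted hits grows.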

Posted Content
TL;DR: This work constructs spatio-temporal weight functions to incorporate various temporal and spatial patterns in ambulance demand, including location-specific seasonalities and short-term serial dependence, and provides spatial density predictions for ambulance demand in Toronto, Canada as it varies over hourly intervals.
Abstract: Predicting ambulance demand accurately at fine time and location scales is critical for ambulance fleet management and dynamic deployment. Large-scale datasets in this setting typically exhibit complex spatio-temporal dynamics and sparsity at high resolutions. We propose a predictive method using spatio-temporal kernel density estimation (stKDE) to address these challenges, and provide spatial density predictions for ambulance demand in Toronto, Canada as it varies over hourly intervals. Specifically, we weight the spatial kernel of each historical observation by its informativeness to the current predictive task. We construct spatio-temporal weight functions to incorporate various temporal and spatial patterns in ambulance demand, including location-specific seasonalities and short-term serial dependence. This allows us to draw out the most helpful historical data, and exploit spatio-temporal patterns in the data for accurate and fast predictions. We further provide efficient estimation and customizable prediction procedures. stKDE is easy to use and interpret by non-specialized personnel from the emergency medical service industry. It also has significantly higher statistical accuracy than the current industry practice, with a comparable amount of computational expense.
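
The idea of weighting each historical observation's spatial kernel can be sketched as follows. This is not the authors' estimator: the Gaussian kernel, the exponential recency decay and the weekly cosine term in `temporal_weight` are assumptions chosen for illustration, as are all names and default values.

```python
import numpy as np


def temporal_weight(lag_hours, decay=0.01, period=168.0, seasonal_strength=0.5):
    """Hypothetical weight: recent observations and those at the same
    point in the weekly cycle (period = 168 hours) count more."""
    lag = np.asarray(lag_hours, dtype=float)
    recency = np.exp(-decay * lag)
    seasonal = 1.0 + seasonal_strength * np.cos(2.0 * np.pi * lag / period)
    return recency * seasonal


def weighted_kde(query_xy, obs_xy, weights, bandwidth=1.0):
    """Weighted 2-D Gaussian kernel density estimate at the query points."""
    query_xy = np.atleast_2d(np.asarray(query_xy, dtype=float))
    obs_xy = np.atleast_2d(np.asarray(obs_xy, dtype=float))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the estimate integrates to one
    # Pairwise squared distances, shape (n_query, n_obs)
    diff = query_xy[:, None, :] - obs_xy[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    return kernel @ w
```

A prediction for a given hour would then call `weighted_kde` with `weights=temporal_weight(lags)`, where `lags` holds the hours elapsed since each historical call.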

Journal ArticleDOI
TL;DR: In this article, a semi-local method for estimating the EMOS coefficients where the training data for a specific observation station are augmented with corresponding forecast cases from stations with similar characteristics is proposed.
Abstract: Weather forecasts are typically given in the form of forecast ensembles obtained from multiple runs of numerical weather prediction models with varying initial conditions and physics parameterizations. Such ensemble predictions tend to be biased and underdispersive and thus require statistical postprocessing. In the ensemble model output statistics (EMOS) approach, a probabilistic forecast is given by a single parametric distribution with parameters depending on the ensemble members. This article proposes two semi-local methods for estimating the EMOS coefficients where the training data for a specific observation station are augmented with corresponding forecast cases from stations with similar characteristics. Similarities between stations are determined using either distance functions or clustering based on various features of the climatology, forecast errors, ensemble predictions and locations of the observation stations. In a case study on wind speed over Europe with forecasts from the Grand Limited Area Model Ensemble Prediction System, the proposed similarity-based semi-local models show significant improvement in predictive performance compared to standard regional and local estimation methods. They further allow for estimating complex models without numerical stability issues and are computationally more efficient than local parameter estimation.

Posted Content
TL;DR: In this article, the authors derive ensembles of decision trees through a nonparametric Bayesian model, allowing them to view random forests as samples from a posterior distribution, which provides large gains in interpretability, and motivates a class of Bayesian forest (BF) algorithms that yield small but reliable performance gains.
Abstract: We derive ensembles of decision trees through a nonparametric Bayesian model, allowing us to view random forests as samples from a posterior distribution. This insight provides large gains in interpretability, and motivates a class of Bayesian forest (BF) algorithms that yield small but reliable performance gains. Based on the BF framework, we are able to show that high-level tree hierarchy is stable in large samples. This leads to an empirical Bayesian forest (EBF) algorithm for building approximate BFs on massive distributed datasets and we show that EBFs outperform sub-sampling based alternatives by a large margin.

Journal ArticleDOI
TL;DR: An efficient MCMC algorithm for posterior inference along with tractable procedures for online updating and forecasting of future networks is developed for Locally Adaptive DYnamic (LADY) network inference.
Abstract: Our focus is on realistically modeling and forecasting dynamic networks of face-to-face contacts among individuals. Important aspects of such data that lead to problems with current methods include the tendency of the contacts to move between periods of slow and rapid changes, and the dynamic heterogeneity in the actors' connectivity behaviors. Motivated by this application, we develop a novel method for Locally Adaptive DYnamic (LADY) network inference. The proposed model relies on a dynamic latent space representation in which each actor's position evolves in time via stochastic differential equations. Using a state space representation for these stochastic processes and Polya-gamma data augmentation, we develop an efficient MCMC algorithm for posterior inference along with tractable procedures for online updating and forecasting of future networks. We evaluate performance in simulation studies, and consider an application to face-to-face contacts among individuals in a primary school.