
Showing papers in "The Annals of Applied Statistics in 2022"


Journal ArticleDOI
TL;DR: In this article, the authors develop a model for estimating the true mortality burden of COVID-19 for every country in the world, built on a relatively simple overdispersed Poisson count framework within which the various data types can be modeled.
Abstract: Estimating the true mortality burden of COVID-19 for every country in the world is a difficult, but crucial, public health endeavor. Attributing deaths, direct or indirect, to COVID-19 is problematic. A more attainable target is the "excess deaths," the number of deaths in a particular period, relative to that expected during "normal times," and we develop a model for this endeavor. The excess mortality requires two numbers, the total deaths and the expected deaths, but the former is unavailable for many countries, and so modeling is required for such countries. The expected deaths are based on historic data, and we develop a model for producing estimates of these deaths for all countries. We allow for uncertainty in the modeled expected numbers when calculating the excess. The methods we describe were used to produce the World Health Organization (WHO) excess death estimates. To achieve both interpretability and transparency we developed a relatively simple overdispersed Poisson count framework within which the various data types can be modeled. We use data from countries with national monthly data to build a predictive log-linear regression model with time-varying coefficients for countries without data. For a number of countries, subnational data only are available, and we construct a multinomial model for such data, based on the assumption that the fractions of deaths in subregions remain approximately constant over time. Our inferential approach is Bayesian, with the covariate predictive model being implemented in the fast and accurate INLA software. The subnational modeling was carried out using MCMC in Stan. Based on our modeling, the point estimate for global excess mortality during 2020-2021 is 14.8 million, with a 95% credible interval of (13.2, 16.6) million.
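For concreteness, the count framework described above can be sketched as follows; the notation here is illustrative rather than the paper's own, and the WHO model's exact likelihood and covariate structure differ in detail:

\[
Y_{c,t} \mid \lambda_{c,t} \sim \text{Poisson}(\lambda_{c,t}) \ \text{(with overdispersion)}, \qquad \log \lambda_{c,t} = \log E_{c,t} + \mathbf{x}_{c,t}^{\top} \boldsymbol{\beta}_{t},
\]
\[
\text{Excess}_{c,t} = Y_{c,t} - E_{c,t},
\]

where Y_{c,t} is the (observed or modeled) total death count for country c in month t, E_{c,t} the expected deaths derived from historical data, x_{c,t} the covariates used for countries without national monthly data, and beta_t the time-varying coefficients.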

20 citations


Journal ArticleDOI
TL;DR: The authors introduce ALPHA, a supermartingale test that generalizes BRAVO for risk-limiting election audits and accommodates sampling with or without replacement, stratified and batch-level designs, and all social choice functions for which an RLA method is known.
Abstract: BRAVO, the most widely fielded method for risk-limiting election audits (RLAs), is based on Wald's sequential probability ratio test for the Bernoulli parameter. It cannot accommodate sampling without replacement or stratified sampling. It applies only to ballot-polling, an inefficient auditing approach. It does not apply to many social choice functions for which there are RLAs, including approval voting, STAR-voting, Borda count, and general scoring rules. ALPHA, a supermartingale test that generalizes BRAVO, (i) works for ballot polling, Bernoulli ballot polling, ballot-level comparison, batch-polling, and batch-level comparison audits, sampling with or without replacement, uniformly or with probability proportional to size; (ii) requires smaller samples than BRAVO when the reported vote shares are wrong but the outcome is correct; (iii) works for all social choice functions for which an RLA method is known; and (iv) in stratified audits, obviates the need to use a P-value combining function and to maximize P-values over nuisance parameters within strata, and allows adaptive sampling across strata. ALPHA includes the betting martingale tests in RiLACS, but parametrizes the betting strategy as an estimator of the population mean and explicitly accommodates sampling weights and population bounds that vary by draw.
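To illustrate the family of tests ALPHA belongs to, here is a minimal sketch of a betting-supermartingale test of the null hypothesis that the mean of a [0, 1]-bounded population is at most mu0, for sampling with replacement and a fixed bet. This is a generic illustration, not the paper's parametrization: ALPHA sets the bet adaptively via an estimator of the population mean and handles sampling without replacement, weights, and bounds that vary by draw. All names below are hypothetical.

import numpy as np

def betting_martingale(draws, mu0=0.5, lam=0.75, alpha=0.05):
    """Test H0: population mean <= mu0 using a nonnegative supermartingale.

    Each draw must lie in [0, 1] and lam must satisfy 0 <= lam <= 1/mu0 so the
    wealth process stays nonnegative. By Ville's inequality, rejecting when the
    wealth reaches 1/alpha controls the type I error at level alpha.
    """
    wealth = 1.0
    for i, x in enumerate(draws, start=1):
        wealth *= 1.0 + lam * (x - mu0)
        if wealth >= 1.0 / alpha:
            return {"reject": True, "n_used": i, "wealth": wealth}
    return {"reject": False, "n_used": len(draws), "wealth": wealth}

# Hypothetical ballot-polling example: 1 = ballot for the reported winner.
rng = np.random.default_rng(0)
ballots = rng.binomial(1, 0.55, size=5000)  # true winner share 55%
print(betting_martingale(ballots))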

12 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian framework is proposed for predicting extreme floods using the generalized extreme-value (GEV) distribution and a multivariate link function designed to separate the interpretation of the parameters at the latent level and to avoid unreasonable estimates of shape and time trend parameters.
Abstract: Extreme floods cause casualties and widespread damage to property and vital civil infrastructure. Predictions of extreme floods, within gauged and ungauged catchments, are crucial to mitigate these disasters. In this paper a Bayesian framework is proposed for predicting extreme floods, using the generalized extreme-value (GEV) distribution. A major methodological challenge is to find a suitable parametrization for the GEV distribution when multiple covariates and/or latent spatial effects are involved and a time trend is present. Other challenges involve balancing model complexity and parsimony, using an appropriate model selection procedure and making inference based on a reliable and computationally efficient approach. We here propose a latent Gaussian modeling framework with a novel multivariate link function designed to separate the interpretation of the parameters at the latent level and to avoid unreasonable estimates of the shape and time trend parameters. Structured additive regression models, which include catchment descriptors as covariates and spatially correlated model components, are proposed for the four parameters at the latent level. To achieve computational efficiency with large datasets and richly parametrized models, we exploit a highly accurate and fast approximate Bayesian inference approach which can also be used to efficiently select models separately for each of the four regression models at the latent level. We applied our proposed methodology to annual peak river flow data from 554 catchments across the United Kingdom. The framework performed well in terms of flood predictions for both ungauged catchments and future observations at gauged catchments. The results show that the spatial model components for the transformed location and scale parameters, as well as the time trend, are all important, and none of these should be ignored. Posterior estimates of the time trend parameters correspond to an average increase of about 1.5% per decade, with a range of 0.1% to 2.8%, and reveal a spatial structure across the United Kingdom. When the interest lies in estimating return levels for spatial aggregates, we further develop a novel copula-based postprocessing approach of posterior predictive samples in order to mitigate the effect of the conditional independence assumption at the data level, and we demonstrate that our approach indeed provides accurate results.
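As a reminder of the data-level model (standard notation, not specific to the paper), the GEV distribution function with location mu, scale sigma and shape xi is

\[
F(y;\mu,\sigma,\xi) = \exp\left\{ -\left[ 1 + \xi \, \frac{y-\mu}{\sigma} \right]_{+}^{-1/\xi} \right\},
\]

and in a latent Gaussian framework each (suitably link-transformed) parameter receives a structured additive predictor, schematically g(\mu(s)) = \mathbf{x}(s)^{\top}\boldsymbol{\beta} + u(s), with catchment descriptors x(s) and a spatially correlated effect u(s); the paper's multivariate link function is designed precisely so that such predictors for the location, scale, shape and trend parameters remain interpretable.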

8 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian kernel machine regression distributed lag model (BKMR-DLM) is proposed to capture non-linear and interaction effects of the multivariate exposure on the outcome.
Abstract: Exposures to environmental chemicals during gestation can alter health status later in life. Most studies of maternal exposure to chemicals during pregnancy have focused on a single chemical exposure observed at high temporal resolution. Recent research has turned to focus on exposure to mixtures of multiple chemicals, generally observed at a single time point. We consider statistical methods for analyzing data on chemical mixtures that are observed at a high temporal resolution. As motivation, we analyze the association between exposure to four ambient air pollutants observed weekly throughout gestation and birth weight in a Boston-area prospective birth cohort. To explore patterns in the data, we first apply methods for analyzing data on (1) a single chemical observed at high temporal resolution, and (2) a mixture measured at a single point in time. We highlight the shortcomings of these approaches for temporally-resolved data on exposure to chemical mixtures. Second, we propose a novel method, a Bayesian kernel machine regression distributed lag model (BKMR-DLM), that simultaneously accounts for nonlinear associations and interactions among time-varying measures of exposure to mixtures. BKMR-DLM uses a functional weight for each exposure that parameterizes the window of susceptibility corresponding to that exposure within a kernel machine framework that captures non-linear and interaction effects of the multivariate exposure on the outcome. In a simulation study, we show that the proposed method can better estimate the exposure-response function and, in high signal settings, can identify critical windows in time during which exposure has an increased association with the outcome. Applying the proposed method to the Boston birth cohort data, we find evidence of a negative association between organic carbon and birth weight and that nitrate modifies the organic carbon, elemental carbon, and sulfate exposure-response functions.
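Schematically, and with notation that is mine rather than the paper's, the model couples distributed-lag weighting with a kernel machine:

\[
y_i = h\big(E_{i1},\ldots,E_{iM}\big) + \mathbf{z}_i^{\top}\boldsymbol{\gamma} + \varepsilon_i, \qquad E_{im} = \int_0^T w_m(t)\, x_{im}(t)\, dt,
\]

where x_{im}(t) is the i-th subject's trajectory of exposure m over gestation, the functional weight w_m encodes that exposure's window of susceptibility, and h is an unknown surface modeled within a kernel machine framework, which is what permits nonlinearity and interactions among the M exposures.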

8 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose BIDIFAC+, a flexible approach to the simultaneous factorization and decomposition of variation across bidimensionally linked matrices, which decomposes variation into a series of low-rank components that may be shared across any number of row sets or column sets.
Abstract: Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer, pan-omics pan-cancer analysis, have extended our knowledge of molecular heterogeneity beyond what was observed in single tumor and single platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices, BIDIFAC+. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature for the factorization and decomposition of linked matrices which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives a unique decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across four different omics platforms and 29 different cancer types.
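A schematic of the decomposition, in generic notation (the paper's objective differs in its exact scaling and constraints): writing the full bidimensionally linked data as a single block matrix X, the method seeks

\[
\mathbf{X} \approx \sum_{q} \mathbf{S}^{(q)},
\]

where each module S^{(q)} is low rank and is nonzero only on a particular subset of row sets (omics platforms) and column sets (cancer types), with the modules estimated by minimizing a squared-error loss plus nuclear-norm penalties of the form \sum_q \lambda_q \| \mathbf{S}^{(q)} \|_{*}, the extension of nuclear norm penalization referred to in the abstract.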

7 citations


Journal ArticleDOI
TL;DR: In this paper, the authors explicitly model uncertainties in a Bayesian manner and jointly infer unknown locations together with all parameters of a reasonably flexible spatiotemporal Hawkes model, obtaining results that are practically and statistically distinct from those obtained while ignoring spatial coarsening.
Abstract: Self-exciting spatiotemporal Hawkes processes have found increasing use in the study of large-scale public health threats, ranging from gun violence and earthquakes to wildfires and viral contagion. Whereas many such applications feature locational uncertainty, that is, the exact spatial positions of individual events are unknown, most Hawkes model analyses to date have ignored spatial coarsening present in the data. Three particular 21st century public health crises-urban gun violence, rural wildfires and global viral spread-present qualitatively and quantitatively varying uncertainty regimes that exhibit: (a) different collective magnitudes of spatial coarsening, (b) uniform and mixed magnitude coarsening, (c) differently shaped uncertainty regions and-less orthodox-(d) locational data distributed within the "wrong" effective space. We explicitly model such uncertainties in a Bayesian manner and jointly infer unknown locations together with all parameters of a reasonably flexible Hawkes model, obtaining results that are practically and statistically distinct from those obtained while ignoring spatial coarsening. This work also features two different secondary contributions: first, to facilitate Bayesian inference of locations and background rate parameters, we make a subtle yet crucial change to an established kernel-based rate model, and second, to facilitate the same Bayesian inference at scale, we develop a massively parallel implementation of the model's log-likelihood gradient with respect to locations and thus avoid its quadratic computational cost in the context of Hamiltonian Monte Carlo. Our examples involve thousands of observations and allow us to demonstrate practicality at moderate scales.
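The data-level object is the usual self-exciting spatiotemporal intensity (standard notation, not tied to the paper's specific kernel choices):

\[
\lambda(\mathbf{s}, t) = \mu(\mathbf{s}, t) + \sum_{i:\, t_i < t} g\big(t - t_i,\ \mathbf{s} - \mathbf{s}_i\big),
\]

with background rate mu and triggering kernel g; the point of the paper is that the event locations s_i entering both terms are themselves uncertain, so they are given priors reflecting the spatial coarsening and are sampled jointly with the Hawkes parameters.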

6 citations


Journal ArticleDOI
TL;DR: In this article, the authors developed new methods for providing instantaneous in-game win probabilities for the National Rugby League (NRL) using a conditional probability formulation, the components of which are evaluated from the perspective of functional data analysis.
Abstract: This paper develops new methods for providing instantaneous in-game win probabilities for the National Rugby League. Besides the score differential, betting odds and real-time features extracted from the match event data are also used as inputs to inform the in-game win probabilities. Rugby matches evolve continuously in time, and the circumstances change over the duration of the match. Therefore, the match data are considered as functional data, and the in-game win probability is a function of the time of the match. We express the in-game win probability using a conditional probability formulation, the components of which are evaluated from the perspective of functional data analysis. Specifically, we model the score differential process and the functional features extracted from the match event data as sums of mean functions and noises. The mean functions are approximated by B-spline basis expansions with functional parameters. Since each match is conditional on a unique kickoff win probability of the home team obtained from the betting odds (i.e., the functional data are not independent and identically distributed), we propose a weighted least squares method to estimate the functional parameters by borrowing information from matches with similar kickoff win probabilities. The variance and covariance elements are obtained by maximum likelihood estimation. The proposed method is applicable to other sports when suitable match event data are available.

6 citations


Journal ArticleDOI
TL;DR: In this article, a hierarchical Bayesian method that combines smoothed variable selection and temporally correlated weight parameters is proposed to identify critical windows of exposure to mixtures of time-varying pollutants, estimate the relative importance of each individual pollutant and their first order interactions within the mixture, and quantify the impact of the mixtures on health.
Abstract: Understanding the role of time-varying pollution mixtures on human health is critical as people are simultaneously exposed to multiple pollutants during their lives. For vulnerable subpopulations who have well-defined exposure periods (e.g., pregnant women), questions regarding critical windows of exposure to these mixtures are important for mitigating harm. We extend critical window variable selection (CWVS) to the multipollutant setting by introducing CWVS for mixtures (CWVSmix), a hierarchical Bayesian method that combines smoothed variable selection and temporally correlated weight parameters to: (i) identify critical windows of exposure to mixtures of time-varying pollutants, (ii) estimate the time-varying relative importance of each individual pollutant and their first order interactions within the mixture, and (iii) quantify the impact of the mixtures on health. Through simulation we show that CWVSmix offers the best balance of performance in each of these categories in comparison to competing methods. Using these approaches, we investigate the impact of exposure to multiple ambient air pollutants on the risk of stillbirth in New Jersey, 2005-2014. We find consistent elevated risk in gestational weeks 2, 16-17, and 20 for non-Hispanic Black mothers, with pollution mixtures dominated by ammonium (weeks 2, 17, 20), nitrate (weeks 2, 17), nitrogen oxides (weeks 2, 16), PM2.5 (week 2), and sulfate (week 20). The method is available in the R package CWVSmix.

6 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose a flexible Monte Carlo sensitivity analysis approach for causal inference with multiple treatments and binary outcomes, embedding nested multiple imputation within the Bayesian framework, which allows for seamless integration of the uncertainty about the values of the sensitivity parameters and the sampling variability, as well as the use of Bayesian Additive Regression Trees for modeling flexibility.
Abstract: In the absence of a randomized experiment, a key assumption for drawing causal inference about treatment effects is ignorable treatment assignment. Violations of the ignorability assumption may lead to biased treatment effect estimates. Sensitivity analysis helps gauge how causal conclusions will be altered in response to the potential magnitude of departure from the ignorability assumption. However, sensitivity analysis approaches for unmeasured confounding in the context of multiple treatments and binary outcomes are scarce. We propose a flexible Monte Carlo sensitivity analysis approach for causal inference in such settings. We first derive the general form of the bias introduced by unmeasured confounding, with emphasis on theoretical properties uniquely relevant to multiple treatments. We then propose methods to encode the impact of unmeasured confounding on potential outcomes and adjust the estimates of causal effects in which the presumed unmeasured confounding is removed. Our proposed methods embed nested multiple imputation within the Bayesian framework, which allows for seamless integration of the uncertainty about the values of the sensitivity parameters and the sampling variability, as well as the use of Bayesian Additive Regression Trees for modeling flexibility. Expansive simulations validate our methods and provide insight into sensitivity analysis with multiple treatments. We use the SEER-Medicare data to demonstrate sensitivity analysis using three treatments for early stage non-small cell lung cancer. The methods developed in this work are readily available in the R package SAMTx.

5 citations


Journal ArticleDOI
TL;DR: Wang et al. propose a novel clustering technique to pursue homogeneity among multiple functional time series based on functional panel data modeling, which can lead to improvements in long-term forecasting.
Abstract: Modeling and forecasting homogeneous age-specific mortality rates of multiple countries could lead to improvements in long-term forecasting. Data fed into joint models are often grouped according to nominal attributes, such as geographic regions, ethnic groups, and socioeconomic status, which may still contain heterogeneity and deteriorate the forecast results. Our paper proposes a novel clustering technique to pursue homogeneity among multiple functional time series based on functional panel data modeling to address this issue. Using a functional panel data model with fixed effects, we can extract common functional time series features. These common features could be decomposed into two components: the functional time trend and the mode of variations of functions (functional pattern). The functional time trend reflects the dynamics across time, while the functional pattern captures the fluctuations within curves. The proposed clustering method searches for homogeneous age-specific mortality rates of multiple countries by accounting for both the modes of variations and the temporal dynamics among curves. We demonstrate that the proposed clustering technique outperforms other existing methods through a Monte Carlo simulation and could handle complicated cases with slow decaying eigenvalues. In empirical data analysis, we find that the clustering results of age-specific mortality rates can be explained by the combination of geographic region, ethnic groups, and socioeconomic status. We further show that our model produces more accurate forecasts than several benchmark methods in forecasting age-specific mortality rates.

5 citations


Journal ArticleDOI
TL;DR: Ziggy is a scalable approach to GP inference with integrated observations based on stochastic variational inference, and it reliably infers the spatial dust map with well-calibrated posterior uncertainties.
Abstract: Interstellar dust corrupts nearly every stellar observation, and accounting for it is crucial to measuring physical properties of stars. We model the dust distribution as a spatially varying latent field with a Gaussian process (GP) and develop a likelihood model and inference method that scales to millions of astronomical observations. Modeling interstellar dust is complicated by two factors. The first is integrated observations. The data come from a vantage point on Earth and each observation is an integral of the unobserved function along our line of sight, resulting in a complex likelihood and a more difficult inference problem than in classical GP inference. The second complication is scale; stellar catalogs have millions of observations. To address these challenges we develop ziggy, a scalable approach to GP inference with integrated observations based on stochastic variational inference. We study ziggy on synthetic data and the Ananke dataset, a high-fidelity mechanistic model of the Milky Way with millions of stars. ziggy reliably infers the spatial dust map with well-calibrated posterior uncertainties.
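The integrated-observation structure can be sketched as follows, with notation that is mine rather than the paper's: giving the latent dust density field rho a Gaussian process prior, a star i at distance d_i along unit direction omega_i contributes an observation of the form

\[
y_i \approx \int_0^{d_i} \rho\big(s\,\boldsymbol{\omega}_i\big)\, ds + \varepsilon_i, \qquad \rho \sim \mathcal{GP}(m, k),
\]

so the likelihood involves line-of-sight integrals of the latent field rather than pointwise evaluations; this, together with the millions of stars in the catalog, is what makes inference harder than classical GP regression and motivates the stochastic variational approach.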

Journal ArticleDOI
TL;DR: In this paper, a new class of extended stochastic block models (esbm) is proposed to infer groups of nodes having common connectivity patterns via Gibbs-type priors on the partition process.
Abstract: Reliably learning group structures among nodes in network data is challenging in several applications. We are particularly motivated by studying covert networks that encode relationships among criminals. These data are subject to measurement errors, and exhibit a complex combination of an unknown number of core-periphery, assortative and disassortative structures that may unveil key architectures of the criminal organization. The coexistence of these noisy block patterns limits the reliability of routinely-used community detection algorithms, and requires extensions of model-based solutions to realistically characterize the node partition process, incorporate information from node attributes, and provide improved strategies for estimation and uncertainty quantification. To cover these gaps, we develop a new class of extended stochastic block models (esbm) that infer groups of nodes having common connectivity patterns via Gibbs-type priors on the partition process. This choice encompasses many realistic priors for criminal networks, covering solutions with fixed, random and infinite number of possible groups, and facilitates the inclusion of node attributes in a principled manner. Among the new alternatives in our class, we focus on the Gnedin process as a realistic prior that allows the number of groups to be finite, random and subject to a reinforcement process coherent with criminal networks. A collapsed Gibbs sampler is proposed for the whole esbm class, and refined strategies for estimation, prediction, uncertainty quantification and model selection are outlined. The esbm performance is illustrated in realistic simulations and in an application to an Italian mafia network, where we unveil key complex block structures, mostly hidden from state-of-the-art alternatives.
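The core generative structure, written in generic notation, is that of a stochastic block model paired with a Gibbs-type partition prior:

\[
y_{uv} \mid z_u, z_v, \Theta \sim \mathrm{Bernoulli}\big(\theta_{z_u z_v}\big), \qquad (z_1,\ldots,z_n) \sim \text{Gibbs-type prior on partitions (e.g., the Gnedin process)},
\]

where z_u is the group label of node u and theta_{hk} the edge probability between groups h and k; leaving the block probabilities unconstrained lets the same model express core-periphery, assortative and disassortative patterns, while the Gibbs-type prior allows a finite but random number of groups and the principled inclusion of node attributes.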

Journal ArticleDOI
TL;DR: In this article, the authors employ a state-of-the-art Bayesian mixture model that allows the estimation of heterogeneous intrinsic dimension (ID) within a dataset, and propose some theoretical enhancements.
Abstract: Following the introduction of high-resolution player tracking technology, a new range of statistical analysis has emerged in sports, specifically in basketball. However, such high-dimensional data are often challenging for statistical inference and decision making. In this article we employ a state-of-the-art Bayesian mixture model that allows the estimation of heterogeneous intrinsic dimension (ID) within a dataset, and we propose some theoretical enhancements. Informally, the ID can be seen as an indicator of complexity and dependence of the data at hand, and it is usually assumed unique. Our method provides the capacity to reveal valuable insights about the hidden dynamics of sports interactions in space and time which helps to translate complex patterns into more coherent statistics. The application of this technique is illustrated using NBA basketball players’ tracking data, allowing effective classification and clustering. In movement data the analysis identified key stages of offensive actions, such as creating space for passing, preparation/shooting, and following through which are relevant for invasion sports. We found that the ID value spikes, reaching a peak between four and eight seconds in the offensive part of the court, after which it declines. In shot charts we obtained groups of shots that produce substantially higher and lower successes. Overall, game-winners tend to have a larger intrinsic dimension, indicative of greater unpredictability and unique shot placements. Similarly, we found higher ID values in plays when the score margin is smaller rather than larger. The exploitation of these results can bring clear strategic advantages in sports games.

Journal ArticleDOI
TL;DR: In this paper, a nonparametric test of association between a time series of images and a series of binary event labels is proposed to quantify whether, and if so how, spatio-temporal patterns in tropical cyclone satellite imagery signal an upcoming rapid intensity change event.
Abstract: Our goal is to quantify whether, and if so how, spatio-temporal patterns in tropical cyclone (TC) satellite imagery signal an upcoming rapid intensity change event. To address this question, we propose a new nonparametric test of association between a time series of images and a series of binary event labels. We ask whether there is a difference in distribution between (dependent but identically distributed) 24-h sequences of images preceding an event versus a non-event. By rewriting the statistical test as a regression problem, we leverage neural networks to infer modes of structural evolution of TC convection that are representative of the lead-up to rapid intensity change events. Dependencies between nearby sequences are handled by a bootstrap procedure that estimates the marginal distribution of the label series. We prove that type I error control is guaranteed as long as the distribution of the label series is well-estimated, which is made easier by the extensive historical data for binary TC event labels. We show empirical evidence that our proposed method identifies archetypes of infrared imagery associated with elevated rapid intensification risk, typically marked by deep or deepening core convection over time. Such results provide a foundation for improved forecasts of rapid intensification.

Journal ArticleDOI
TL;DR: This paper develops Bayesian spatial hierarchical models for point patterns of landslide occurrences using different types of log-Gaussian Cox processes and, starting from a competitive baseline model that captures the unobserved precipitation trigger through a spatial random effect at slope unit resolution, explores novel, more complex model structures that take into account clusters of events arising at small spatial scales, as well as nonlinear or spatially varying covariate effects.
Abstract: Statistical models for landslide hazard enable mapping of risk factors and landslide occurrence intensity by using geomorphological covariates available at high spatial resolution. However, the spatial distribution of the triggering event (e.g., precipitation or earthquakes) is often not directly observed. In this paper we develop Bayesian spatial hierarchical models for point patterns of landslide occurrences using different types of log-Gaussian Cox processes. Starting from a competitive baseline model that captures the unobserved precipitation trigger through a spatial random effect at slope unit resolution, we explore novel complex model structures that take clusters of events arising at small spatial scales into account as well as nonlinear or spatially-varying covariate effects. For a 2009 event of around 5000 precipitation-triggered landslides in Sicily, Italy, we show how to fit our proposed models efficiently, using the integrated nested Laplace approximation (INLA), and rigorously compare the performance of our models both from a statistical and applied perspective. In this context we argue that model comparison should not be based on a single criterion and that different models of various complexity may provide insights into complementary aspects of the same applied problem. In our application our models are found to have mostly the same spatial predictive performance, implying that key to successful prediction is the inclusion of a slope-unit resolved random effect capturing the precipitation trigger. Interestingly, a parsimonious formulation of space-varying slope effects reflects a physical interpretation of the precipitation trigger: in subareas with weak trigger, the slope steepness is shown to be mostly irrelevant.
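The building block is the log-Gaussian Cox process intensity (generic notation):

\[
\Lambda(\mathbf{s}) = \exp\big\{ \beta_0 + \mathbf{x}(\mathbf{s})^{\top}\boldsymbol{\beta} + W(\mathbf{s}) \big\},
\]

with geomorphological covariates x(s) available at high resolution and a Gaussian random effect W(s) that, in the baseline model, includes a slope-unit-level component standing in for the unobserved precipitation trigger; the richer variants add cluster effects at small spatial scales and nonlinear or spatially varying covariate effects.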

Journal ArticleDOI
TL;DR: In this article, the authors derive a scalable approach to detect anomalous mean structure in a subset of correlated multivariate time series and develop a new dynamic programming algorithm for solving the resulting binary quadratic program when the precision matrix of the time series at any given time point is banded.
Abstract: Motivated by a condition monitoring application arising from subsea engineering, we derive a novel, scalable approach to detecting anomalous mean structure in a subset of correlated multivariate time series. Given the need to analyse such series efficiently, we explore a computationally efficient approximation of the maximum likelihood solution to the resulting modelling framework and develop a new dynamic programming algorithm for solving the resulting binary quadratic programme when the precision matrix of the time series at any given time point is banded. Through a comprehensive simulation study we show that the resulting methods perform favorably compared to competing methods, both in the anomaly and change detection settings, even when the sparsity structure of the precision matrix estimate is misspecified. We also demonstrate its ability to correctly detect faulty time periods of a pump within the motivating application.

Journal ArticleDOI
TL;DR: In this paper, a model for high-resolution precipitation data is proposed from which realistic fields can be simulated and the behaviour of spatial aggregates explored, together with a novel framework for deriving aggregates that addresses edge effects and subregions without rain.
Abstract: Inference on the extremal behaviour of spatial aggregates of precipitation is important for quantifying river flood risk. There are two classes of previous approach, with one failing to ensure self-consistency in inference across different regions of aggregation and the other imposing highly restrictive assumptions. To overcome these issues, we propose a model for high-resolution precipitation data from which we can simulate realistic fields and explore the behaviour of spatial aggregates. Recent developments have seen spatial extensions of the Heffernan and Tawn (J. R. Stat. Soc. Ser. B. Stat. Methodol. 66 (2004) 497–546) model for conditional multivariate extremes which can handle a wide range of dependence structures. Our contribution is twofold: extensions and improvements of this approach and its model inference for high-dimensional data and a novel framework for deriving aggregates addressing edge effects and subregions without rain. We apply our modelling approach to gridded East Anglia, UK precipitation data. Return-level curves for spatial aggregates over different regions of various sizes are estimated and shown to fit very well to the data.
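For reference, the spatial conditional-extremes formulation referred to above takes the schematic form (on standardized margins, generic notation): given a large value x at a conditioning site s_0,

\[
\big( X(\mathbf{s}) \mid X(\mathbf{s}_0) = x \big) \;\approx\; \alpha(\mathbf{s}-\mathbf{s}_0)\, x + x^{\beta(\mathbf{s}-\mathbf{s}_0)}\, Z(\mathbf{s}),
\]

where the functions alpha and beta describe how dependence on the extreme conditioning value decays with distance and Z is a residual process; spatial aggregates and their return-level curves are then obtained by simulating precipitation fields from the fitted model and summing over the region of interest.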

Journal ArticleDOI
TL;DR: In this article, a reverse-Bayes approach is proposed for the analysis of replication studies, which is directly related to the relative effect size, the ratio of the replication to the original effect estimate.
Abstract: Replication studies are increasingly conducted in order to confirm original findings. However, there is no established standard how to assess replication success and in practice many different approaches are used. The purpose of this paper is to refine and extend a recently proposed reverse-Bayes approach for the analysis of replication studies. We show how this method is directly related to the relative effect size, the ratio of the replication to the original effect estimate. This perspective leads to a new proposal to recalibrate the assessment of replication success, the golden level. The recalibration ensures that for borderline significant original studies replication success can only be achieved if the replication effect estimate is larger than the original one. Conditional power for replication success can then take any desired value if the original study is significant and the replication sample size is large enough. Compared to the standard approach to require statistical significance of both the original and replication study, replication success at the golden level offers uniform gains in project power and controls the Type-I error rate if the replication sample size is not smaller than the original one. An application to data from four large replication projects shows that the new approach leads to more appropriate inferences, as it penalizes shrinkage of the replication estimate compared to the original one, while ensuring that both effect estimates are sufficiently convincing on their own.
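In symbols, the relative effect size mentioned above is simply

\[
d = \frac{\hat{\theta}_r}{\hat{\theta}_o},
\]

the ratio of the replication effect estimate to the original effect estimate; the golden level recalibrates the reverse-Bayes assessment so that a borderline significant original study can only achieve replication success when d exceeds one, i.e., when the replication estimate does not shrink relative to the original.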

Journal ArticleDOI
TL;DR: The authors introduce the Ordinal Probit Functional Outcome Regression model (OPFOR), which can be fit using one of several basis functions, including penalized B-splines, wavelets, and O'Sullivan splines.
Abstract: Research in functional regression has made great strides in expanding to non-Gaussian functional outcomes, but exploration of ordinal functional outcomes remains limited. Motivated by a study of computer-use behavior in rhesus macaques (Macaca mulatta), we introduce the Ordinal Probit Functional Outcome Regression model (OPFOR). OPFOR models can be fit using one of several basis functions, including penalized B-splines, wavelets, and O'Sullivan splines, the last of which typically performs best. Simulation using a variety of underlying covariance patterns shows that the model performs reasonably well in estimation under multiple basis functions, with near nominal coverage for joint credible intervals. Finally, in application, we use Bayesian model selection criteria adapted to functional outcome regression to best characterize the relation between several demographic factors of interest and the monkeys' computer use over the course of a year. In comparison with a standard ordinal longitudinal analysis, OPFOR outperforms a cumulative-link mixed-effects model in simulation and provides additional and more nuanced information on the nature of the monkeys' computer-use behavior.

Journal ArticleDOI
TL;DR: In this article, a flexible copula model is proposed to capture asymptotic dependence or independence in its lower and upper tails simultaneously, which is parsimonious and smoothly bridges (in each tail) both extremal dependence classes in the interior of the parameter space.
Abstract: Since the inception of Bitcoin in 2008, cryptocurrencies have played an increasing role in the world of e-commerce, but the recent turbulence in the cryptocurrency market in 2018 has raised some concerns about their stability and associated risks. For investors it is crucial to uncover the dependence relationships between cryptocurrencies for a more resilient portfolio diversification. Moreover, the stochastic behavior in both tails is important, as long positions are sensitive to a decrease in prices (lower tail), while short positions are sensitive to an increase in prices (upper tail). In order to assess both risk types, we develop in this paper a flexible copula model which is able to distinctively capture asymptotic dependence or independence in its lower and upper tails simultaneously. Our proposed model is parsimonious and smoothly bridges (in each tail) both extremal dependence classes in the interior of the parameter space. Inference is performed using a full or censored likelihood approach, and we investigate by simulation the estimators’ efficiency under three different censoring schemes which reduce the impact of nonextreme observations. We also develop a local likelihood approach to capture the temporal dynamics of extremal dependence among pairs of leading cryptocurrencies. We here apply our model to historical closing prices of five leading cryptocurrencies which share large cryptocurrency market capitalizations. The results show that our proposed copula model outperforms alternative copula models and that the lower-tail dependence level between most pairs of leading cryptocurrencies and, in particular, Bitcoin and Ethereum has become stronger over time, smoothly transitioning from an asymptotic independence regime to an asymptotic dependence regime in recent years, whilst the upper tail has been relatively more stable overall at a weaker dependence level.
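The two regimes the model distinguishes are usually summarized by the standard tail-dependence coefficients (textbook definitions, generic notation): for a pair (U_1, U_2) with uniform margins,

\[
\chi_L = \lim_{u \to 0^{+}} \Pr\big(U_2 \le u \mid U_1 \le u\big), \qquad \chi_U = \lim_{u \to 1^{-}} \Pr\big(U_2 > u \mid U_1 > u\big),
\]

where a strictly positive limit corresponds to asymptotic dependence in that tail and a zero limit to asymptotic independence; the proposed copula allows the lower and upper tails to fall in either regime separately and to move smoothly between them, which is what the time-varying analysis of Bitcoin and Ethereum exploits.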

Journal ArticleDOI
TL;DR: In this paper , the authors proposed a new family of gene-level association tests that integrate quantile rank score process to better accommodate complex associations, which are almost as efficient as the best existing tests when the associations are homogeneous across quantile levels and have improved efficiency for complex and heterogeneous associations.
Abstract: Gene-based testing is a commonly employed strategy in many genetic association studies. Gene-trait associations can be complex due to underlying population heterogeneity, gene-environment interactions, and various other reasons. Existing gene-based tests, such as burden and sequence kernel association tests (SKAT), are mean-based tests and may miss or underestimate higher-order associations that could be scientifically interesting. In this paper we propose a new family of gene-level association tests that integrate quantile rank score process to better accommodate complex associations. The resulting test statistics have multiple advantages: (1) they are almost as efficient as the best existing tests when the associations are homogeneous across quantile levels and have improved efficiency for complex and heterogeneous associations; (2) they provide useful insights into risk stratification; (3) the test statistics are distribution free and could hence accommodate a wide range of underlying distributions, and (4) they are computationally efficient. We established the asymptotic properties of the proposed tests under the null and alternative hypotheses and conducted large-scale simulation studies to investigate their finite sample performance. The performance of the proposed approach is compared with that of conventional mean-based tests, that is, the burden and SKAT tests, through simulation studies and applications to a metabochip dataset on lipid traits and to the genotype-tissue expression data in GTEx to identify eGenes, that is, genes whose expression levels are associated with cis-eQTLs.

Journal ArticleDOI
TL;DR: In this article, a Bayesian graphical model was developed to investigate longitudinal effects of ART drugs on a range of depressive symptoms while adjusting for participants' demographic, behavior, and clinical characteristics, and taking into account the heterogeneous population through a Bayesian nonparametric prior.
Abstract: Access and adherence to antiretroviral therapy (ART) has transformed the face of HIV infection from a fatal to a chronic disease. However, ART is also known for its side effects. Studies have reported that ART is associated with depressive symptomatology. Large-scale HIV clinical databases with individuals’ longitudinal depression records, ART medications, and clinical characteristics offer researchers unprecedented opportunities to study the effects of ART drugs on depression over time. We develop BAGEL, a Bayesian graphical model, to investigate longitudinal effects of ART drugs on a range of depressive symptoms while adjusting for participants’ demographic, behavior, and clinical characteristics, and taking into account the heterogeneous population through a Bayesian nonparametric prior. We evaluate BAGEL through simulation studies. Application to a dataset from the Women’s Interagency HIV Study yields interpretable and clinically useful results. BAGEL not only can improve our understanding of ART drugs’ effects on disparate depression symptoms but also has clinical utility in guiding informed and effective treatment selection to facilitate precision medicine in HIV.

Journal ArticleDOI
TL;DR: In this paper, the authors present a Bayesian hierarchical model for estimating the size of key populations that combines multiple estimates from different sources of information, and use the model to estimate the number of people who inject drugs in Ukraine.
Abstract: To combat the HIV/AIDS pandemic effectively, targeted interventions among certain key populations play a critical role. Examples of such key populations include sex workers, people who inject drugs, and men who have sex with men. While having accurate estimates for the size of these key populations is important, any attempt to directly contact or count members of these populations is difficult. As a result, indirect methods are used to produce size estimates. Multiple approaches for estimating the size of such populations have been suggested but often give conflicting results. It is, therefore, necessary to have a principled way to combine and reconcile these estimates. To this end, we present a Bayesian hierarchical model for estimating the size of key populations that combines multiple estimates from different sources of information. The proposed model makes use of multiple years of data and explicitly models the systematic error in the data sources used. We use the model to estimate the size of people who inject drugs in Ukraine. We evaluate the appropriateness of the model and compare the contribution of each data source to the final estimates.

Journal ArticleDOI
TL;DR: In this paper, a co-clustering model for multivariate functional data is defined based on a functional latent block model, which assumes for each co-cluster a probabilistic distribution for multivariate functional principal component scores; a stochastic EM algorithm, embedding a Gibbs sampler, is proposed for model inference, as well as a model selection criterion for choosing the number of co-clusters.
Abstract: Nowadays, air pollution is a major threat to public health, with clear relationships with many diseases, especially cardiovascular ones. The spatiotemporal study of pollution is of great interest to governments and local authorities when deciding on public alerts or new city policies against pollution increase. The aim of this work is to study spatiotemporal profiles of environmental data collected in the south of France (Région Sud) by the public agency AtmoSud. The idea is to better understand the exposure of inhabitants to pollutants across a large territory with important differences in terms of geography and urbanism. The data gather daily measurements of five environmental variables, namely three pollutants (PM10, NO2, O3) and two meteorological factors (pressure and temperature), over six years. These data can be seen as multivariate functional data: quantitative entities evolving over time, for which there is a growing need for methods to summarize and understand them. For this purpose a novel co-clustering model for multivariate functional data is defined. The model is based on a functional latent block model which assumes, for each co-cluster, a probabilistic distribution for multivariate functional principal component scores. A stochastic EM algorithm, embedding a Gibbs sampler, is proposed for model inference, as well as a model selection criterion for choosing the number of co-clusters. The application of the proposed co-clustering algorithm to the environmental data of the Région Sud allowed the region, composed of 357 zones, to be divided into six macroareas with common exposure to pollution. We showed that pollution profiles vary according to the seasons and that the patterns are similar across the six years studied. These results can be used by local authorities to develop specific programs to reduce pollution at the macroarea level and to identify specific periods of the year with high pollution peaks in order to set up specific health prevention programs. Overall, the proposed co-clustering approach is a powerful resource for analysing multivariate functional data in order to identify intrinsic data structure and to summarize variable profiles over long periods of time.

Journal ArticleDOI
TL;DR: In this article, the authors developed a multi-state model based on a state process operating in continuous time, which can be regarded as an analogue of the discrete-time Arnason-Schwarz model for irregularly sampled data.
Abstract: Multistate capture-recapture data comprise individual-specific sighting histories, together with information on individuals’ states related, for example, to breeding status, infection level, or geographical location. Such data are often analysed using the Arnason–Schwarz model, where transitions between states are modelled using a discrete-time Markov chain, making the model most easily applicable to regular time series. When time intervals between capture occasions are not of equal length, more complex time-dependent constructions may be required, increasing the number of parameters to estimate, decreasing interpretability, and potentially leading to reduced precision. Here we develop a multi-state model based on a state process operating in continuous time, which can be regarded as an analogue of the discrete-time Arnason–Schwarz model for irregularly sampled data. Statistical inference is carried out by regarding the capture-recapture data as realisations from a continuous-time hidden Markov model, which allows the associated efficient algorithms to be used for maximum likelihood estimation and state decoding. To illustrate the feasibility of the modelling framework, we use a long-term survey of bottlenose dolphins where capture occasions are not regularly spaced through time. Here, we are particularly interested in seasonal effects on the movement rates of the dolphins along the Scottish east coast. The results reveal seasonal movement patterns between two core areas of their range, providing information that will inform conservation management.
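A minimal sketch of the continuous-time state process idea, with hypothetical states and rates: given a generator matrix Q, the transition probabilities over a gap of any length are obtained from the matrix exponential, so irregular sampling intervals require no additional parameters.

import numpy as np
from scipy.linalg import expm

# Hypothetical generator for movement between two core areas; rows sum to zero.
Q = np.array([[-0.3,  0.3],
              [ 0.1, -0.1]])

def transition_matrix(Q, dt):
    """P(state at time t + dt | state at time t) for a continuous-time Markov chain."""
    return expm(Q * dt)

print(transition_matrix(Q, 0.5))  # short gap between capture occasions
print(transition_matrix(Q, 2.7))  # longer, irregular gap handled by the same two rates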

Journal ArticleDOI
TL;DR: Novel findings characterizing the presence and volatility of risk states in Dravet syndrome are uncovered, which may directly inform counseling to reduce the unpredictability of seizures for patients with this devastating cause of epilepsy.
Abstract: A major issue in the clinical management of epilepsy is the unpredictability of seizures. Yet, traditional approaches to seizure forecasting and risk assessment in epilepsy rely heavily on raw seizure frequencies, which are a stochastic measurement of seizure risk. We consider a Bayesian non-homogeneous hidden Markov model for unsupervised clustering of zero-inflated seizure count data. The proposed model allows for a probabilistic estimate of the sequence of seizure risk states at the individual level. It also offers significant improvement over prior approaches by incorporating a variable selection prior for the identification of clinical covariates that drive seizure risk changes and accommodating highly granular data. For inference, we implement an efficient sampler that employs stochastic search and data augmentation techniques. We evaluate model performance on simulated seizure count data. We then demonstrate the clinical utility of the proposed model by analyzing daily seizure count data from 133 patients with Dravet syndrome collected through the Seizure Tracker system, a patient-reported electronic seizure diary. We report on the dynamics of seizure risk cycling, including validation of several known pharmacologic relationships. We also uncover novel findings characterizing the presence and volatility of risk states in Dravet syndrome, which may directly inform counseling to reduce the unpredictability of seizures for patients with this devastating cause of epilepsy.
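For illustration, a generic zero-inflated Poisson emission of the kind used for such seizure counts is shown below in my own notation; the paper's exact emission specification may differ in its details. With hidden risk state S_t on day t, state-specific rate lambda_k and extra-zero probability pi_k,

\[
\Pr(Y_t = 0 \mid S_t = k) = \pi_k + (1-\pi_k)\, e^{-\lambda_k}, \qquad \Pr(Y_t = y \mid S_t = k) = (1-\pi_k)\, \frac{\lambda_k^{y} e^{-\lambda_k}}{y!}, \quad y \ge 1,
\]

and non-homogeneity enters by letting the transition probabilities of the hidden state depend on clinical covariates, with the variable selection prior determining which covariates actually drive changes in seizure risk.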

Journal ArticleDOI
TL;DR: In this paper, the authors propose a new algorithm that jointly assigns the individuals to the latent groups and estimates the parameters of the regression model inside each group, which can help answer important domain-specific questions.
Abstract: Extreme value applications commonly employ regression techniques to capture cross-sectional heterogeneity or time-variation in the data. Estimation of the parameters of an extreme value regression model is notoriously challenging due to the small number of observations that are usually available in applications. When repeated extreme measurements are collected on the same individuals, i.e., a panel of extremes is available, pooling the observations in groups can improve the statistical inference. We study three data sets related to risk assessment in finance, climate science, and hydrology. In all three cases, the problem can be formulated as an extreme value panel regression model with a latent group structure and group-specific parameters. We propose a new algorithm that jointly assigns the individuals to the latent groups and estimates the parameters of the regression model inside each group. Our method efficiently recovers the underlying group structure without prior information, and for the three data sets it provides improved return level estimates and helps answer important domain-specific questions.

Journal ArticleDOI
TL;DR: In this paper, a causal inference framework is proposed to estimate fire-contributed PM2.5 and PM2.5 from all other sources using a bias-adjusted chemical model representation.
Abstract: Wildland fire smoke contains hazardous levels of fine particulate matter (PM2.5), a pollutant shown to adversely affect health. Estimating fire-attributable PM2.5 concentrations is key to quantifying the impact on air quality and the subsequent health burden. This is a challenging problem, since only total PM2.5 is measured at monitoring stations and both fire-attributable PM2.5 and PM2.5 from all other sources are correlated in space and time. We propose a framework for estimating fire-contributed PM2.5 and PM2.5 from all other sources using a novel causal inference framework and bias-adjusted chemical model representations of PM2.5 under counterfactual scenarios. The chemical model representation of PM2.5 for this analysis is simulated using the Community Multiscale Air Quality Modeling System (CMAQ), run with and without fire emissions across the contiguous U.S. for the 2008–2012 wildfire seasons. The CMAQ output is calibrated with observations from monitoring sites for the same spatial domain and time period. We use a Bayesian model that accounts for spatial variation to estimate the effect of wildland fires on PM2.5 and state assumptions under which the estimate has a valid causal interpretation. Our results include estimates of the contributions of wildfire smoke to PM2.5 for the contiguous U.S. Additionally, we compute the health burden associated with the PM2.5 attributable to wildfire smoke.
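The estimand can be written as a simple counterfactual contrast (schematic notation): for location s and time t,

\[
\Delta(\mathbf{s}, t) = \mathrm{PM}_{2.5}^{\text{fire}}(\mathbf{s}, t) - \mathrm{PM}_{2.5}^{\text{no-fire}}(\mathbf{s}, t),
\]

where the two surfaces correspond to the bias-adjusted CMAQ runs with and without fire emissions, calibrated against monitoring-site observations; the Bayesian spatial model propagates the calibration uncertainty into Delta and the downstream health-burden estimates, and the stated assumptions are what give Delta a causal interpretation.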

Journal ArticleDOI
TL;DR: In this paper , a Bayesian additive vector autoregressive tree (BAVART) model is proposed to capture arbitrary nonlinear relations between the endogenous variables and the covariates without much input from the researcher.
Abstract: Vector autoregressive (VAR) models assume linearity between the endogenous variables and their lags. This assumption might be overly restrictive and could have a deleterious impact on forecasting accuracy. As a solution we propose combining VAR with Bayesian additive regression tree (BART) models. The resulting Bayesian additive vector autoregressive tree (BAVART) model is capable of capturing arbitrary nonlinear relations between the endogenous variables and the covariates without much input from the researcher. Since controlling for heteroscedasticity is key for producing precise density forecasts, our model allows for stochastic volatility in the errors. We apply our model to two datasets. The first application shows that the BAVART model yields highly competitive forecasts of the U.S. term structure of interest rates. In a second application we estimate our model using a moderately sized Eurozone dataset to investigate the dynamic effects of uncertainty on the economy.

Journal ArticleDOI
TL;DR: In this article, a Bayesian pseudolikelihood-based approach for non-Gaussian data collected under informative sampling designs is proposed and illustrated with an application to health insurance estimates from the American Community Survey.
Abstract: Statistical estimates from survey samples have traditionally been obtained via design-based estimators. In many cases these estimators tend to work well for quantities, such as population totals or means, but can fall short as sample sizes become small. In today’s “information age,” there is a strong demand for more granular estimates. To meet this demand, using a Bayesian pseudolikelihood, we propose a computationally efficient unit-level modeling approach for non-Gaussian data collected under informative sampling designs. Specifically, we focus on binary and multinomial data. Our approach is both multivariate and multiscale, incorporating spatial dependence at the area level. We illustrate our approach through an empirical simulation study and through a motivating application to health insurance estimates, using the American Community Survey.
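The pseudolikelihood device can be sketched in generic notation (the paper's exact specification, priors, and weight scaling differ): with a unit-level model p(y_i | theta) and survey weights w_i for the sampled units i in S, each unit's contribution is exponentiated by a (suitably scaled) weight,

\[
L_{ps}(\boldsymbol{\theta}) = \prod_{i \in \mathcal{S}} p\big(y_i \mid \boldsymbol{\theta}\big)^{\tilde{w}_i},
\]

so that units over- or under-represented by the informative design are re-weighted toward the population; combining this pseudolikelihood with a prior yields the pseudoposterior used for granular, small-area estimation of the binary and multinomial outcomes, with spatial dependence entering at the area level.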