
Showing papers in "The Annals of Applied Statistics in 2018"


Journal ArticleDOI
TL;DR: The covariate balancing generalized propensity score (CBGPS) methodology is proposed, which minimizes the association between covariates and a continuous treatment; both parametric and nonparametric versions outperform standard maximum likelihood estimation in a simulation study.
Abstract: Propensity score matching and weighting are popular methods when estimating causal effects in observational studies. Beyond the assumption of unconfoundedness, however, these methods also require the model for the propensity score to be correctly specified. The recently proposed covariate balancing propensity score (CBPS) methodology increases the robustness to model misspecification by directly optimizing sample covariate balance between the treatment and control groups. In this paper, we extend the CBPS to a continuous treatment. We propose the covariate balancing generalized propensity score (CBGPS) methodology, which minimizes the association between covariates and the treatment. We develop both parametric and nonparametric approaches and show their superior performance over the standard maximum likelihood estimation in a simulation study. The CBGPS methodology is applied to an observational study, whose goal is to estimate the causal effects of political advertisements on campaign contributions. We also provide open-source software that implements the proposed methods.
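
As a rough illustration of the balance condition the CBGPS targets (weights under which covariates are uncorrelated with the continuous treatment), the sketch below computes weighted covariate-treatment correlations with NumPy. It is not the authors' estimation algorithm or their software; the data, weights, and function names are invented for illustration.

```python
import numpy as np

def weighted_balance(X, t, w):
    """Weighted correlation between each covariate and a continuous treatment;
    values near zero indicate the kind of balance the CBGPS aims for.
    Illustrative diagnostic only, not the estimation procedure itself."""
    w = w / w.sum()
    def center(v):
        return v - np.sum(w * v)
    tc = center(t)
    out = []
    for j in range(X.shape[1]):
        xc = center(X[:, j])
        cov = np.sum(w * xc * tc)
        sd = np.sqrt(np.sum(w * xc**2) * np.sum(w * tc**2))
        out.append(cov / sd)
    return np.array(out)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
t = X @ np.array([0.5, -0.3, 0.0]) + rng.normal(size=500)
w = np.ones(500)                      # uniform weights: imbalance is visible
print(weighted_balance(X, t, w))
```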

206 citations


Journal ArticleDOI
TL;DR: This article suggests a framework to address a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?”
Abstract: Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size $n$, probabilistic or not, the difference between the sample average $\overline{X}_{n}$ and the population average $\overline{X}_{N}$ is the product of three terms: (1) a data quality measure, $\rho_{{R,X}}$, the correlation between $X_{j}$ and the response/recording indicator $R_{j}$; (2) a data quantity measure, $\sqrt{(N-n)/n}$, where $N$ is the population size; and (3) a problem difficulty measure, $\sigma_{X}$, the standard deviation of $X$. This decomposition provides multiple insights: (I) Probabilistic sampling ensures high data quality by controlling $\rho_{{R,X}}$ at the level of $N^{-1/2}$; (II) When we lose this control, the impact of $N$ is no longer canceled by $\rho_{{R,X}}$, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate $1/\sqrt{n}$, increases with $\sqrt{N}$; and (III) the “bigness” of such Big Data (for population inferences) should be measured by the relative size $f=n/N$, not the absolute size $n$; (IV) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes. Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a $\rho_{{R,X}}\approx-0.005$ for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from $1\%$ of the US eligible voters, that is, $n\approx2\mbox{,}300\mbox{,}000$, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size $n\approx400$, a $99.98\%$ reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual $95\%$ confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.
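
Written out, the identity summarized in the abstract is the three-factor product below (same notation as above).

```latex
\overline{X}_{n}-\overline{X}_{N}
  \;=\;
  \underbrace{\rho_{R,X}}_{\text{data quality}}
  \;\times\;
  \underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}}
  \;\times\;
  \underbrace{\sigma_{X}}_{\text{problem difficulty}}
```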

170 citations


Journal ArticleDOI
TL;DR: This article provides an overview of commonly-used graph distances, an explicit characterization of the structural changes that they are best able to capture, and some guidance on choosing one distance over another in different contexts.
Abstract: From longitudinal biomedical studies to social networks, graphs have emerged as essential objects for describing evolving interactions between agents in complex systems. In such studies, after pre-processing, the data are encoded by a set of graphs, each representing a system’s state at a different point in time or space. The analysis of the system’s dynamics depends on the selection of the appropriate analytical tools. In particular, after specifying properties characterizing similarities between states, a critical step lies in the choice of a distance between graphs capable of reflecting such similarities. While the literature offers a number of distances to choose from, their properties have been little investigated and no guidelines regarding the choice of such a distance have yet been provided. In particular, most graph distances consider that the nodes are exchangeable—ignoring node “identities.” Alignment of the graphs according to identified nodes enables us to enhance these distances’ sensitivity to perturbations in the network and detect important changes in graph dynamics. Thus the selection of an adequate metric is a decisive—yet delicate—practical matter. In the spirit of Goldenberg et al.’s seminal 2009 review [Found. Trends Mach. Learn. 2 (2010) 129–233], this article provides an overview of commonly-used graph distances and an explicit characterization of the structural changes that they are best able to capture. We show how these choices affect real-life situations, and we use these distances to analyze both a longitudinal microbiome dataset and a brain fMRI study. One contribution of the present study is a coordinated suite of data analytic techniques, displays and statistical tests using “metagraphs”: a graph of graphs based on a chosen metric. Permutation tests can uncover the effects of covariates on the graphs’ variability. Furthermore, synthetic examples provide intuition as to the qualities and drawbacks of the different distances. Above all, we provide some guidance on choosing one distance over another in different contexts. Finally, we extend the scope of our analyses from temporal to spatial dynamics and apply these different distances to a network created from worldwide recipes.
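
To make the notion of a graph distance concrete, here is a minimal sketch of two commonly used choices, a Hamming distance on labeled adjacency matrices and an adjacency-spectral distance, applied to toy graphs. It is illustrative only and does not reproduce the paper's full suite of distances or its metagraph machinery.

```python
import numpy as np

def hamming_distance(A, B):
    """Fraction of node pairs whose edge status differs (labeled graphs)."""
    n = A.shape[0]
    return np.sum(np.abs(A - B)) / (n * (n - 1))

def spectral_distance(A, B):
    """Euclidean distance between sorted adjacency spectra
    (invariant to node relabeling)."""
    ev_a = np.sort(np.linalg.eigvalsh(A))
    ev_b = np.sort(np.linalg.eigvalsh(B))
    return np.linalg.norm(ev_a - ev_b)

rng = np.random.default_rng(1)
A = (rng.random((20, 20)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric, no self-loops
B = A.copy()
B[0, 1] = B[1, 0] = 1 - B[0, 1]                # perturb a single edge
print(hamming_distance(A, B), spectral_distance(A, B))
```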

90 citations


Journal ArticleDOI
TL;DR: In this paper, the spatial partial identity model (SPIM) is proposed to solve the problem of partial identity capture histories in camera trapping surveys, which uses the spatial location where partial identity samples are captured to probabilistically resolve their complete identities.
Abstract: Camera trapping surveys frequently capture individuals whose identity is only known from a single flank. The most widely used methods for incorporating these partial identity individuals into density analyses discard some of the partial identity capture histories, reducing precision, and, while not previously recognized, introducing bias. Here, we present the spatial partial identity model (SPIM), which uses the spatial location where partial identity samples are captured to probabilistically resolve their complete identities, allowing all partial identity samples to be used in the analysis. We show that the SPIM outperforms other analytical alternatives. We then apply the SPIM to an ocelot data set collected on a trapping array with double-camera stations and a bobcat data set collected on a trapping array with single-camera stations. The SPIM improves inference in both cases and, in the ocelot example, individual sex is determined from photographs used to further resolve partial identities—one of which is resolved to near certainty. The SPIM opens the door for the investigation of trapping designs that deviate from the standard two camera design, the combination of other data types between which identities cannot be deterministically linked, and can be extended to the problem of partial genotypes.

72 citations


Journal ArticleDOI
TL;DR: In an application to a single-trial multichannel seizure EEG dataset, the proposed persistent homology procedure identified the left temporal region as consistently showing topological invariance, suggesting that the PH features of the Fourier decomposition during seizure are similar to those before seizure.
Abstract: Epilepsy is a neurological disorder that can negatively affect the visual, audial and motor functions of the human brain. Statistical analysis of neurophysiological recordings, such as electroencephalogram (EEG), facilitates the understanding and diagnosis of epileptic seizures. Standard statistical methods, however, do not account for topological features embedded in EEG signals. In the current study, we propose a persistent homology (PH) procedure to analyze single-trial EEG signals. The procedure denoises signals with a weighted Fourier series (WFS), and tests for topological difference between the denoised signals with a permutation test based on their PH features, persistence landscapes (PL). Simulation studies show that the test effectively identifies topological difference and invariance between two signals. In an application to a single-trial multichannel seizure EEG dataset, our proposed PH procedure identified the left temporal region as consistently showing topological invariance, suggesting that the PH features of the Fourier decomposition during seizure are similar to those of the process before seizure. This finding is important because it could not be identified from a mere visual inspection of the EEG data and was in fact missed by earlier analyses of the same dataset.

67 citations


Journal ArticleDOI
TL;DR: A Unified RNA-Sequencing Model is proposed for both single cell and bulk RNA-seq data, formulated as a hierarchical model that borrows strength from both data sources and carefully models the dropouts in single cell data, leading to more accurate estimation of cell type specific gene expression profiles.
Abstract: Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the "dropout" events. A "dropout" happens when the RNA for a gene fails to be amplified prior to sequencing, producing a "false" zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profiles. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data and in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.

65 citations


Journal ArticleDOI
TL;DR: A convex optimization problem involving an $\ell_{1}$ penalty was recently proposed for this task; this paper replaces the $\ell_{1}$ penalty with an $\ell_{0}$ penalty and shows that the resulting optimization problem can be solved for the global optimum using an extremely simple and efficient dynamic programming algorithm.
Abstract: In recent years new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons simultaneously in behaving animals. For each neuron a fluorescence trace is measured; this can be seen as a first-order approximation of the neuron’s activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an $\ell_{1}$ penalty was proposed for this task. In this paper we slightly modify that recent proposal by replacing the $\ell_{1}$ penalty with an $\ell_{0}$ penalty. In stark contrast to the conventional wisdom that $\ell_{0}$ optimization problems are computationally intractable, we show that the resulting optimization problem can be efficiently solved for the global optimum using an extremely simple and efficient dynamic programming algorithm. Our R-language implementation of the proposed algorithm runs in a few minutes on fluorescence traces of 100,000 timesteps. Furthermore, our proposal leads to substantial improvements over the previous $\ell_{1}$ proposal, in simulations as well as on two calcium imaging datasets. R-language software for our proposal is available on CRAN in the package LZeroSpikeInference. Instructions for running this software in python can be found at https://github.com/jewellsean/LZeroSpikeInference.
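
The claim that an $\ell_{0}$ problem can be solved exactly by dynamic programming can be illustrated with the classic O(T²) segmentation recursion below, run on a toy signal. This is only the generic flavor of such a DP; the published algorithm additionally models calcium decay within segments and uses a much faster pruned recursion (available in the LZeroSpikeInference package).

```python
import numpy as np

def l0_segment(y, lam):
    """Exact O(T^2) dynamic program for l0-penalized changepoint detection:
    minimize total within-segment squared error + lam * (#changepoints).
    Generic sketch; the paper's spike-inference DP also models calcium decay."""
    T = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y**2)))

    def seg_cost(a, b):               # SSE of y[a:b] around its mean
        n = b - a
        return s2[b] - s2[a] - (s1[b] - s1[a])**2 / n

    F = np.full(T + 1, np.inf)
    F[0] = -lam
    last = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        costs = [F[s] + seg_cost(s, t) + lam for s in range(t)]
        last[t] = int(np.argmin(costs))
        F[t] = costs[last[t]]

    cps, t = [], T                    # recover changepoint locations
    while t > 0:
        if last[t] > 0:
            cps.append(last[t])
        t = last[t]
    return sorted(cps)

rng = np.random.default_rng(2)
y = np.concatenate([np.zeros(50), np.full(50, 3.0)]) + rng.normal(0, 0.5, 100)
print(l0_segment(y, lam=4.0))         # expect a single changepoint near 50
```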

58 citations


Journal ArticleDOI
TL;DR: This study introduces new algorithms that use geographic information to estimate ancestry proportions and ancestral genotype frequencies from population genetic data and combines matrix factorization methods and spatial statistics to provide estimates of ancestry matrices based on least-squares approximation.
Abstract: Accurately evaluating the distribution of genetic ancestry across geographic space is one of the main questions addressed by evolutionary biologists. This question has been commonly addressed through the application of Bayesian estimation programs allowing their users to estimate individual admixture proportions and allele frequencies among putative ancestral populations. Following the explosion of high-throughput sequencing technologies, several algorithms have been proposed to cope with computational burden generated by the massive data in those studies. In this context, incorporating geographic proximity in ancestry estimation algorithms is an open statistical and computational challenge. In this study, we introduce new algorithms that use geographic information to estimate ancestry proportions and ancestral genotype frequencies from population genetic data. Our algorithms combine matrix factorization methods and spatial statistics to provide estimates of ancestry matrices based on least-squares approximation. We demonstrate the benefit of using spatial algorithms through extensive computer simulations, and we provide an example of application of our new algorithms to a set of spatially referenced samples for the plant species Arabidopsis thaliana. Without loss of statistical accuracy, the new algorithms exhibit runtimes that are much shorter than those observed for previously developed spatial methods. Our algorithms are implemented in the R package, tess3r.

55 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the multivariate exponential family framework with multivariate Gaussian latent variables and show that approximate maximum likelihood inference can be achieved via a variational algorithm for which gradient descent easily applies.
Abstract: Many application domains such as ecology or genomics have to deal with multivariate non Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aiming to understand the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general. We consider here the multivariate exponential family framework for which we introduce a generic model with multivariate Gaussian latent variables. We show that approximate maximum likelihood inference can be achieved via a variational algorithm for which gradient descent easily applies. We show that this setting enables us to account for covariates and offsets. We then focus on the case of the Poisson-lognormal model in the context of community ecology.
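
A minimal generative sketch of the Poisson-lognormal special case described above (a latent multivariate Gaussian layer driving conditionally independent Poisson counts). The paper's contribution, the variational inference procedure, is not reproduced here, and all numbers below are made up.

```python
import numpy as np

# Generative sketch of a Poisson-lognormal abundance table:
# latent multivariate Gaussian dependence, conditionally independent Poisson counts.
rng = np.random.default_rng(3)
n_sites, n_species = 100, 5
mu = np.log(np.array([5.0, 2.0, 8.0, 1.0, 3.0]))   # species baselines on the log scale
L = 0.5 * rng.normal(size=(n_species, n_species))
Sigma = L @ L.T + 0.1 * np.eye(n_species)          # latent covariance = species dependence
Z = rng.multivariate_normal(np.zeros(n_species), Sigma, size=n_sites)
Y = rng.poisson(np.exp(mu + Z))                    # observed site-by-species counts
print(Y[:5])
```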

52 citations


Journal ArticleDOI
TL;DR: In this paper, the Anderson-Darling test is applied to the sample of exceedances above a fixed threshold in order to automate threshold selection, in conjunction with a recently developed stopping rule that controls the false discovery rate in ordered hypothesis testing.
Abstract: Threshold selection is a critical issue for extreme value analysis with threshold-based approaches. Under suitable conditions, exceedances over a high threshold have been shown to follow the generalized Pareto distribution (GPD) asymptotically. In practice, however, the threshold must be chosen. If the chosen threshold is too low, the GPD approximation may not hold and bias can occur. If the threshold is chosen too high, reduced sample size increases the variance of parameter estimates. Commonly used selection methods such as graphical diagnostics are subjective and cannot be automated for batch analyses. We develop an efficient technique to evaluate and apply the Anderson–Darling test to the sample of exceedances above a fixed threshold. In order to automate threshold selection, this test is used in conjunction with a recently developed stopping rule that controls the false discovery rate in ordered hypothesis testing. Previous attempts in this setting do not account for the issue of ordered multiple testing. The performance of the method is assessed in a large scale simulation study that mimics practical return level estimation. This procedure was repeated at hundreds of sites in the western US to generate return level maps of extreme precipitation.
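
One widely used rule of the type referenced above is ForwardStop (G'Sell et al.). The sketch below assumes goodness-of-fit p-values at an increasing sequence of candidate thresholds are already in hand and reports how many leading hypotheses the rule rejects; it illustrates the stopping rule only, not the paper's Anderson-Darling computation.

```python
import numpy as np

def forward_stop(pvals, alpha=0.05):
    """ForwardStop rule for ordered hypothesis testing: reject the first k
    hypotheses, where k is the largest index with
    -(1/k) * sum_{i<=k} log(1 - p_i) <= alpha.
    Sketch only; in threshold selection the rejected hypotheses correspond to
    candidate thresholds where the GPD fit is deemed inadequate."""
    stats = -np.cumsum(np.log(1 - np.asarray(pvals))) / np.arange(1, len(pvals) + 1)
    passing = np.where(stats <= alpha)[0]
    return passing[-1] + 1 if passing.size else 0

# p-values ordered from the lowest candidate threshold (worst GPD fit) upward
print(forward_stop([1e-4, 1e-3, 0.02, 0.20, 0.64], alpha=0.05))
```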

50 citations


Journal ArticleDOI
TL;DR: $e$PCA (exponential family PCA), a new methodology for PCA on exponential family distributions, is developed; it is constructed in a simple and deterministic way using moment calculations, shrinkage, and random matrix theory, and is used for dimension reduction and denoising of large data matrices.
Abstract: Many applications involve large datasets with entries from exponential family distributions. Our main motivating application is photon-limited imaging, where we observe images with Poisson distributed pixels. We focus on X-ray Free Electron Lasers (XFEL), a quickly developing technology whose goal is to reconstruct molecular structure. In XFEL, estimating the principal components of the noiseless distribution is needed for denoising and for structure determination. However, the standard method, Principal Component Analysis (PCA), can be inefficient in non-Gaussian noise. Motivated by this application, we develop $e$PCA (exponential family PCA), a new methodology for PCA on exponential families. $e$PCA is a fast method that can be used very generally for dimension reduction and denoising of large data matrices with exponential family entries. We conduct a substantive XFEL data analysis using $e$PCA. We show that $e$PCA estimates the PCs of the distribution of images more accurately than PCA and alternatives. Importantly, it also leads to better denoising. We also provide theoretical justification for our estimator, including the convergence rate and the Marchenko–Pastur law in high dimensions. An open-source implementation is available.
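
A minimal sketch of the moment idea behind such an approach in the Poisson case: because Poisson noise variance equals its mean, subtracting a diagonal of means from the sample covariance debiases it toward the covariance of the noiseless intensities. The full $e$PCA method adds scaling and eigenvalue shrinkage steps that are omitted here, and the toy data below are made up.

```python
import numpy as np

def poisson_debiased_covariance(Y):
    """Moment sketch for Poisson-noise data: since Var(Y_ij) = E(Y_ij) under
    Poisson noise, subtracting diag(mean) from the sample covariance estimates
    the covariance of the underlying noiseless intensities.
    The full ePCA pipeline adds eigenvalue shrinkage (omitted here)."""
    mean = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)
    return S - np.diag(mean)

rng = np.random.default_rng(4)
n, p = 2000, 20
u = rng.random(p)                                    # one clean principal direction
intensities = 5 + np.outer(rng.normal(size=n), u)    # low-rank signal in the means
Y = rng.poisson(np.clip(intensities, 0.1, None))
S_d = poisson_debiased_covariance(Y)
print(np.linalg.eigvalsh(S_d)[-3:])                  # the largest eigenvalue should stand out
```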

Journal ArticleDOI
TL;DR: The analysis suggests that the proposed Bayesian selection model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
Abstract: An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides requiring an inferential step to obtain protein level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.

Journal ArticleDOI
TL;DR: This paper shows that the GP approach generally outperforms a more classical generalized linear (autoregressive) model (GLM) developed to utilize abundant covariate information, and illustrates variations of the method(s) on the two benchmark locales.
Abstract: In 2015 the US federal government sponsored a dengue forecasting competition using historical case data from Iquitos, Peru and San Juan, Puerto Rico. Competitors were evaluated on several aspects of out-of-sample forecasts including the targets of peak week, peak incidence during that week, and total season incidence across each of several seasons. Our team was one of the winners of that competition, outperforming other teams in multiple targets/locales. In this paper we report on our methodology, a large component of which, surprisingly, ignores the known biology of epidemics at large—for example, relationships between dengue transmission and environmental factors—and instead relies on flexible nonparametric nonlinear Gaussian process (GP) regression fits that “memorize” the trajectories of past seasons, and then “match” the dynamics of the unfolding season to past ones in real-time. Our phenomenological approach has advantages in situations where disease dynamics are less well understood, or where measurements and forecasts of ancillary covariates like precipitation are unavailable, and/or where the strength of association with cases is as yet unknown. In particular, we show that the GP approach generally outperforms a more classical generalized linear (autoregressive) model (GLM) that we developed to utilize abundant covariate information. We illustrate variations of our method(s) on the two benchmark locales alongside a full summary of results submitted by other contest competitors.

Journal ArticleDOI
TL;DR: The results suggest that a fast and useful strategy for achieving a combination of power against independence and equitability is to filter relationships by TICe and then to rank the remaining ones using MICe; this strategy is confirmed on a set of data collected by the World Health Organization.
Abstract: In exploratory data analysis, we are often interested in identifying promising pairwise associations for further analysis while filtering out weaker ones. This can be accomplished by computing a measure of dependence on all variable pairs and examining the highest-scoring pairs, provided the measure of dependence used assigns similar scores to equally noisy relationships of different types. This property, called equitability and previously formalized, can be used to assess measures of dependence along with the power of their corresponding independence tests and their runtime. Here we present an empirical evaluation of the equitability, power against independence, and runtime of several leading measures of dependence. These include the two recently introduced and simultaneously computable statistics ${\mbox{MIC}_{e}}$, whose goal is equitability, and ${\mbox{TIC}_{e}}$, whose goal is power against independence. Regarding equitability, our analysis finds that ${\mbox{MIC}_{e}}$ is the most equitable method on functional relationships in most of the settings we considered. Regarding power against independence, we find that ${\mbox{TIC}_{e}}$ and Heller and Gorfine’s ${S^{\mathrm{DDP}}}$ share state-of-the-art performance, with several other methods achieving excellent power as well. Our analyses also show evidence for a trade-off between power against independence and equitability consistent with recent theoretical work. Our results suggest that a fast and useful strategy for achieving a combination of power against independence and equitability is to filter relationships by ${\mbox{TIC}_{e}}$ and then to rank the remaining ones using ${\mbox{MIC}_{e}}$. We confirm our findings on a set of data collected by the World Health Organization.
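
The filter-then-rank strategy recommended above is mechanically simple. The sketch below assumes TICe and MICe scores for each variable pair have already been computed by some external estimator and only illustrates the two-stage screening; nothing here computes the statistics themselves.

```python
import numpy as np

def filter_then_rank(pairs, tic_scores, mic_scores, tic_cutoff):
    """Two-stage screening: keep only variable pairs whose TIC_e exceeds a
    cutoff (power against independence), then rank the survivors by MIC_e
    (equitability).  Score arrays are assumed to come from an external
    estimator; this is only the bookkeeping step."""
    keep = np.asarray(tic_scores) >= tic_cutoff
    kept_pairs = [p for p, k in zip(pairs, keep) if k]
    kept_mic = np.asarray(mic_scores)[keep]
    order = np.argsort(-kept_mic)
    return [(kept_pairs[i], float(kept_mic[i])) for i in order]

pairs = [("x1", "x2"), ("x1", "x3"), ("x2", "x3")]
print(filter_then_rank(pairs, tic_scores=[0.9, 0.1, 0.6],
                       mic_scores=[0.4, 0.3, 0.7], tic_cutoff=0.5))
```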

Journal ArticleDOI
TL;DR: In this paper, a new effect equivalence framework and a Bayesian method are developed to enable deviation model transfer across processes in an additive manufacturing (AM) system with limited experimental runs; transfer is performed via inference on the equivalent effects of lurking variables.
Abstract: Shape deviation models constitute an important component in quality control for additive manufacturing (AM) systems. However, specified models have a limited scope of application across the vast spectrum of processes in a system that are characterized by different settings of process variables, including lurking variables. We develop a new effect equivalence framework and Bayesian method that enables deviation model transfer across processes in an AM system with limited experimental runs. Model transfer is performed via inference on the equivalent effects of lurking variables in terms of an observed factor whose effect has been modeled under a previously learned process. Studies on stereolithography illustrate the ability of our framework to broaden both the scope of deviation models and the comprehensive understanding of AM systems.

Journal ArticleDOI
TL;DR: By combining semiparametric regression with flexible tree-based learning, T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs, as shown in the simulation studies.
Abstract: Dynamic treatment regimes (DTRs) are sequences of treatment decision rules, in which treatment may be adapted over time in response to the changing course of an individual. Motivated by the substance use disorder (SUD) study, we propose a tree-based reinforcement learning (T-RL) method to directly estimate optimal DTRs in a multi-stage multi-treatment setting. At each stage, T-RL builds an unsupervised decision tree that directly handles the problem of optimization with multiple treatment comparisons, through a purity measure constructed with augmented inverse probability weighted estimators. For the multiple stages, the algorithm is implemented recursively using backward induction. By combining semiparametric regression with flexible tree-based learning, T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs, as shown in the simulation studies. With the proposed method, we identify dynamic SUD treatment regimes for adolescents.

Journal ArticleDOI
TL;DR: In this article, the authors present a survey of recent methods for testing hypotheses about high-dimensional multinomials and show that, despite having non-normal null distributions, carefully designed tests can have high power.
Abstract: The statistical analysis of discrete data has been the subject of extensive statistical research dating back to the work of Pearson. In this survey we review some recently developed methods for testing hypotheses about high-dimensional multinomials. Traditional tests like the $\chi^{2}$-test and the likelihood ratio test can have poor power in the high-dimensional setting. Much of the research in this area has focused on finding tests with asymptotically normal limits and developing (stringent) conditions under which tests have normal limits. We argue that this perspective suffers from a significant deficiency: it can exclude many high-dimensional cases when—despite having non-normal null distributions—carefully designed tests can have high power. Finally, we illustrate that taking a minimax perspective and considering refinements of this perspective can lead naturally to powerful and practical tests.
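
For reference, the two classical statistics mentioned, for observed counts $X=(X_{1},\dots,X_{d})\sim\mathrm{Multinomial}(n,p)$ and null hypothesis $H_{0}\colon p=p_{0}$, are

```latex
\chi^{2}=\sum_{j=1}^{d}\frac{(X_{j}-np_{0j})^{2}}{np_{0j}},
\qquad
\mathrm{LRT}=2\sum_{j=1}^{d}X_{j}\log\frac{X_{j}}{np_{0j}},
```

with the convention $0\log 0=0$; the survey's point is that both can have poor power when the number of categories $d$ is large relative to $n$.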

Journal ArticleDOI
TL;DR: In this article, the authors developed a new framework for multivariate association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data using exponential family distributions and exploited a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two data sets.
Abstract: Multivariate association analysis is of primary interest in many applications. Despite the prevalence of high-dimensional and non-Gaussian data (such as count-valued or binary), most existing methods only apply to low-dimensional data with continuous measurements. Motivated by the Computer Audition Lab 500-song (CAL500) music annotation study, we develop a new framework for the association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data. We model heterogeneous random variables using exponential family distributions, and exploit a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two data sets. We also introduce a new measure of the strength of association, and a permutation-based procedure to test its significance. An alternating iteratively reweighted least squares algorithm is devised for model fitting, and several variants are developed to expedite computation and achieve variable selection. The application to the CAL500 data sheds light on the relationship between acoustic features and semantic annotations, and provides effective means for automatic music annotation and retrieval.

Journal ArticleDOI
TL;DR: Torus-PCA, as discussed by the authors, deforms tori into spheres with self-gluing and then applies a variant of principal nested spheres analysis, with a new test in the subsphere-fitting step to avoid overfitting.
Abstract: There are several cutting edge applications needing PCA methods for data on tori, and we propose a novel torus-PCA method that adaptively favors low-dimensional representations while preventing overfitting by a new test—both of which can be generally applied and address shortcomings in two previously proposed PCA methods. Unlike tangent space PCA, our torus-PCA features structure fidelity by honoring the cyclic topology of the data space and, unlike geodesic PCA, produces nonwinding, nondense descriptors. These features are achieved by deforming tori into spheres with self-gluing and then using a variant of the recently developed principal nested spheres analysis. This PCA analysis involves a step of subsphere fitting, and we provide a new test to avoid overfitting. We validate our torus-PCA by application to an RNA benchmark data set. Further, using a larger RNA data set, torus-PCA recovers previously found structure, now globally at the one-dimensional representation, which is not accessible via tangent space PCA.

Journal ArticleDOI
TL;DR: In this article, a regression model for the angular density of a bivariate extreme value distribution is introduced to assess how extremal dependence evolves over a covariate, and the authors apply the proposed model to assess the dynamics governing extreme value dependence of some leading European stock markets over the last three decades.
Abstract: Extremal dependence between international stock markets is of particular interest in today’s global financial landscape. However, previous studies have shown this dependence is not necessarily stationary over time. We concern ourselves with modeling extreme value dependence when that dependence is changing over time, or other suitable covariate. Working within a framework of asymptotic dependence, we introduce a regression model for the angular density of a bivariate extreme value distribution that allows us to assess how extremal dependence evolves over a covariate. We apply the proposed model to assess the dynamics governing extremal dependence of some leading European stock markets over the last three decades, and find evidence of an increase in extremal dependence over recent years.

Journal ArticleDOI
TL;DR: In this paper, a framework of high-dimensional regression models that extends the dimension-reduced ordination methods was proposed to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant.
Abstract: The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and a percent fat.
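
As a hint of how a non-Euclidean distance can enter a penalized regression, the sketch below double-centers a squared distance matrix into a kernel (Gower centering) and fits kernel ridge regression with scikit-learn. It is a simplified stand-in, not the authors' penalized framework, their phylogeny-informed kernels, or their significance test; the simulated abundances and phenotype are made up.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def kernel_from_distances(D):
    """Gower double-centering of a squared distance matrix into a kernel,
    the usual bridge from ecological distances to kernel machines.
    (Strongly non-Euclidean distances may need a PSD correction, omitted here.)"""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J

rng = np.random.default_rng(5)
X = rng.dirichlet(np.ones(30), size=80)           # relative abundances (rows sum to 1)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # stand-in for an ecological distance
y = X[:, 0] * 3 + rng.normal(0, 0.1, 80)          # phenotype linked to one taxon
K = kernel_from_distances(D)
model = KernelRidge(alpha=1.0, kernel="precomputed").fit(K, y)
print(np.corrcoef(model.predict(K), y)[0, 1])     # in-sample fit quality
```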

Journal ArticleDOI
TL;DR: In this article, modelling the discrepancy between the simulated and observed data with a Gaussian process (GP) is used to reduce the number of model evaluations required by approximate Bayesian computation, and the sensitivity of this approach to the GP formulation is investigated.
Abstract: Approximate Bayesian computation (ABC) can be used for model fitting when the likelihood function is intractable but simulating from the model is feasible. However, even a single evaluation of a complex model may take several hours, limiting the number of model evaluations available. Modelling the discrepancy between the simulated and observed data using a Gaussian process (GP) can be used to reduce the number of model evaluations required by ABC, but the sensitivity of this approach to a specific GP formulation has not yet been thoroughly investigated. We begin with a comprehensive empirical evaluation of using GPs in ABC, including various transformations of the discrepancies and two novel GP formulations. Our results indicate the choice of GP may significantly affect the accuracy of the estimated posterior distribution. Selection of an appropriate GP model is thus important. We formulate expected utility to measure the accuracy of classifying discrepancies below or above the ABC threshold, and show that it can be used to automate the GP model selection step. Finally, based on the understanding gained with toy examples, we fit a population genetic model for bacteria, providing insight into horizontal gene transfer events within the population and from external origins.

Journal ArticleDOI
TL;DR: In this article, a dictionary learning approach is proposed to identify the neurons in a calcium imaging video using a sparse group lasso optimization problem, which is implemented in the R package scalpel.
Abstract: In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called "calcium imaging" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions for which new statistical methods must be developed. In this paper we consider the first step in the analysis of calcium imaging data, namely identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We assess performance on simulated calcium imaging data and apply our proposal to three calcium imaging data sets. Our proposed approach is implemented in the R package scalpel, which is available on CRAN.

Journal ArticleDOI
TL;DR: A novel statistical emulation of the input-output dependence of tsunami simulators is introduced, with a coseismic representation that is more realistic than in previous studies.
Abstract: The rarity of tsunamis impels the scientific community to rely on numerical simulation for planning and risk assessment purposes because of the low availability of actual data from historic events. Numerical models, also called simulators, typically produce time series of outputs. Due to the large computational cost of such simulators, statistical emulation is required to carry out uncertainty quantification tasks, as emulators efficiently approximate simulators. There is thus a need to create emulators that respect the nature of time series outputs. We introduce here a novel statistical emulation of the input-output dependence of these computer models. We employ the Outer Product Emulator with two enhancements. Functional registration and Functional Principal Components techniques improve the predictions of the emulator. Our phase registration method captures fine variations in amplitude. Smoothness in the time series of outputs is modelled, and we are thus able to select more representative, and more parsimonious, regression functions than a fixed basis method such as a Fourier basis. We apply this approach to the high resolution tsunami wave propagation and coastal inundation for the Cascadia region in the Pacific Northwest. The coseismic representation in this analysis is novel, and more realistic than in previous studies. With the help of the emulator, we can carry out sensitivity analysis of the maximum wave elevation with respect to the source characteristics, and we are able to propagate uncertainties from the source characteristics to wave heights in order to issue probabilistic statements about tsunami hazard for Cascadia.

Journal ArticleDOI
TL;DR: In this article, the authors examined a bivariate count time series with some curious statistical features: Saffir-Simpson Category 3 and stronger annual hurricane counts in the North Atlantic and eastern Pacific Ocean Basins.
Abstract: This paper examines a bivariate count time series with some curious statistical features: Saffir–Simpson Category 3 and stronger annual hurricane counts in the North Atlantic and eastern Pacific Ocean Basins. As land and ocean temperatures on our planet warm, an intense climatological debate has arisen over whether hurricanes are becoming more numerous, or whether the strengths of the individual storms are increasing. Recent literature concludes that an increase in hurricane counts occurred in the Atlantic Basin circa 1994. This increase persisted through 2012; moreover, the 1994–2012 period was one of relative inactivity in the eastern Pacific Basin. When Atlantic activity eased in 2013, heavy activity in the eastern Pacific Basin commenced. When examined statistically, a Poisson white noise model for the annual severe hurricane counts is difficult to resoundingly reject. Yet, decadal cycles (longer term dependence) in the hurricane counts are plausible. This paper takes a statistical look at the issue, developing a stationary multivariate count time series model with Poisson marginal distributions and a flexible autocovariance structure. Our auto- and cross-correlations can be negative and have long-range dependence, features that most previous count models cannot achieve in tandem. Our model is new in the literature and is based on categorizing and super-positioning multivariate Gaussian time series. We derive the autocovariance function of the model and propose a method to estimate model parameters. In the end, we conclude that severe hurricane counts are indeed negatively correlated across the two ocean basins. Some evidence for long-range dependence is also presented; however, with only a 49-year record, this issue cannot be definitively judged without additional data.
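
The modeling challenge described above (Poisson marginals with flexible, possibly negative, auto- and cross-correlation) can be illustrated with a simpler Gaussian-copula construction, sketched below. This is related in spirit to, but not the same as, the paper's categorize-and-superposition construction, and all parameter values are invented.

```python
import numpy as np
from scipy import stats

def poisson_from_gaussian(z, lam):
    """Transform standard Gaussian series into series with Poisson(lam)
    marginals via the probability integral transform.  A copula-style sketch:
    dependence in z carries over, including negative cross-correlation."""
    return stats.poisson.ppf(stats.norm.cdf(z), lam)

rng = np.random.default_rng(6)
T = 500
rho, cross = 0.6, -0.5                       # AR(1) memory, negative cross-basin correlation
z = np.zeros((T, 2))
innov_cov = (1 - rho**2) * np.array([[1.0, cross], [cross, 1.0]])
for t in range(1, T):
    z[t] = rho * z[t - 1] + rng.multivariate_normal([0.0, 0.0], innov_cov)
counts = poisson_from_gaussian(z, lam=3.0)
print(np.corrcoef(counts.T)[0, 1])           # negative dependence between the two count series
```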

Journal ArticleDOI
TL;DR: In this article, the authors developed Bayesian state-space models using betting market data that can be uniformly applied across sporting organizations to better understand the role of randomness in game outcomes.
Abstract: Statistical applications in sports have long centered on how to best separate signal (e.g., team talent) from random noise. However, most of this work has concentrated on a single sport, and the development of meaningful cross-sport comparisons has been impeded by the difficulty of translating luck from one sport to another. In this manuscript we develop Bayesian state-space models using betting market data that can be uniformly applied across sporting organizations to better understand the role of randomness in game outcomes. These models can be used to extract estimates of team strength, the between-season, within-season and game-to-game variability of team strengths, as well as each team’s home advantage. We implement our approach across a decade of play in each of the National Football League (NFL), National Hockey League (NHL), National Basketball Association (NBA) and Major League Baseball (MLB), finding that the NBA demonstrates both the largest dispersion in talent and the largest home advantage, while the NHL and MLB stand out for their relative randomness in game outcomes. We conclude by proposing new metrics for judging competitiveness across sports leagues, both within the regular season and using traditional postseason tournament formats. Although we focus on sports, we discuss a number of other situations in which our generalizable models might be usefully applied.

Journal ArticleDOI
TL;DR: A statistical model is proposed that aims at reproducing the data-generating mechanism of an ensemble of runs via a Stochastic Generator of global annual wind data; it introduces an evolutionary spectrum approach with spatially varying parameters based on large-scale geographical descriptors such as altitude to better account for different regimes across the Earth's orography.
Abstract: Wind has the potential to make a significant contribution to future energy resources. Locating the sources of this renewable energy on a global scale is however extremely challenging, given the difficulty to store very large data sets generated by modern computer models. We propose a statistical model that aims at reproducing the data-generating mechanism of an ensemble of runs via a Stochastic Generator (SG) of global annual wind data. We introduce an evolutionary spectrum approach with spatially varying parameters based on large-scale geographical descriptors such as altitude to better account for different regimes across the Earth’s orography. We consider a multi-step conditional likelihood approach to estimate the parameters that explicitly accounts for nonstationary features while also balancing memory storage and distributed computation. We apply the proposed model to more than 18 million points of yearly global wind speed. The proposed SG requires orders of magnitude less storage for generating surrogate ensemble members from wind than does creating additional wind fields from the climate model, even if an effective lossy data compression algorithm is applied to the simulation output.

Journal ArticleDOI
TL;DR: The phylogenetic scan test (PhyloScan) is introduced for investigating cross-group differences in microbiome compositions using the Dirichlet-tree multinomial (DTM) model and is applied to the American Gut dataset to identify taxa associated with diet habits.
Abstract: In this paper, we introduce the phylogenetic scan test (PhyloScan) for investigating cross-group differences in microbiome compositions using the Dirichlet-tree multinomial (DTM) model. DTM models the microbiome data through a cascade of independent local DMs on the internal nodes of the phylogenetic tree. Each of the local DMs captures the count distributions of a certain number of operational taxonomic units at a given resolution. Since distributional differences tend to occur in clusters along evolutionary lineages, we design a scan statistic over the phylogenetic tree to allow nodes to borrow signal strength from their parents and children. We also derive a formula to bound the tail probability of the scan statistic, and verify its accuracy through simulations. The PhyloScan procedure is applied to the American Gut dataset to identify taxa associated with diet habits. Empirical studies performed on this dataset show that PhyloScan achieves higher testing power in most cases.
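
A minimal generative sketch of the Dirichlet-tree multinomial described above: counts are split down the phylogeny by independent local Dirichlet-multinomial draws until the leaves (OTUs) are reached. The toy tree and parameters are invented, and none of the PhyloScan testing machinery is reproduced.

```python
import numpy as np

def sample_dtm(node, total, rng, alpha=1.0):
    """Generative sketch of the Dirichlet-tree multinomial: at each internal
    node, split the node's count among its children with a local
    Dirichlet-multinomial draw, recursing to the leaves (OTUs).
    The tree is a nested list/str structure invented for illustration."""
    if isinstance(node, str):                     # leaf = OTU name
        return {node: total}
    k = len(node)
    p = rng.dirichlet(np.full(k, alpha))          # local Dirichlet probabilities
    child_counts = rng.multinomial(total, p)      # local multinomial split
    out = {}
    for child, c in zip(node, child_counts):
        out.update(sample_dtm(child, int(c), rng, alpha))
    return out

tree = [["otu1", "otu2"], ["otu3", ["otu4", "otu5"]]]   # toy phylogeny
print(sample_dtm(tree, total=1000, rng=np.random.default_rng(7)))
```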

Journal ArticleDOI
TL;DR: In this article, the authors propose a Marked Self-Exciting Process with Time-Dependent Excitation Function (MaSEPTiDE) model to predict tweet popularity.
Abstract: Information diffusion occurs on microblogging platforms like Twitter as retweet cascades. When a tweet is posted, it may be retweeted and henceforth further retweeted, and the retweeting process continues iteratively and indefinitely. A natural measure of the popularity of a tweet is the number of retweets it generates. Accurate predictions of tweet popularity can assist Twitter to rank contents more effectively and facilitate the assessment of potential for marketing and campaigning strategies. In this paper, we propose a model called the Marked Self-Exciting Process with Time-Dependent Excitation Function, or MaSEPTiDE for short, to model the retweeting dynamics and to predict the tweet popularity. Our model does not require expensive feature engineering but is capable of leveraging the observed dynamics to accurately predict the future evolution of retweet cascades. We apply our proposed methodology on a large amount of Twitter data and report substantial improvement in prediction performance over existing approaches in the literature.
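
The retweet-cascade mechanism described above is of the self-exciting type. The sketch below simulates a plain Hawkes process with an exponential kernel via Ogata-style thinning; it is a generic illustration, not the MaSEPTiDE model, which additionally uses marks and a time-dependent excitation function, and the parameters are made up.

```python
import numpy as np

def simulate_self_exciting(mu, alpha, beta, horizon, rng):
    """Ogata-style thinning for a self-exciting point process with intensity
    mu + sum_i alpha * exp(-beta * (t - t_i)).  Generic sketch of a
    retweet-cascade mechanism, not the MaSEPTiDE model."""
    events, t = [], 0.0
    while t < horizon:
        past = np.array(events)
        lam_bar = mu + (alpha * np.exp(-beta * (t - past)).sum() if past.size else 0.0)
        t += rng.exponential(1.0 / lam_bar)       # candidate event time
        if t >= horizon:
            break
        lam_t = mu + (alpha * np.exp(-beta * (t - past)).sum() if past.size else 0.0)
        if rng.random() < lam_t / lam_bar:        # accept with probability lambda(t)/lam_bar
            events.append(t)
    return events

events = simulate_self_exciting(mu=0.5, alpha=0.8, beta=1.2, horizon=100.0,
                                rng=np.random.default_rng(8))
print(len(events))
```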

Journal ArticleDOI
TL;DR: A network-flow-based matching algorithm for multilevel data is developed and applied to assess a school-based intervention through which students in treated schools were exposed to a new reading program during summer school; the summer intervention does not appear to increase reading test scores.
Abstract: Many observational studies of causal effects occur in settings with clustered treatment assignment. In studies of this type, treatment is applied to entire clusters of units. For example, an educational intervention might be administered to all the students in a school. We develop a matching algorithm for multilevel data based on a network flow algorithm. Earlier work on multilevel matching relied on integer programming, which allows for balance targeting on specific covariates but can be slow with larger data sets. Although we cannot directly specify minimal levels of balance for individual covariates, our algorithm is fast and scales easily to larger data sets. We apply this algorithm to assess a school-based intervention through which students in treated schools were exposed to a new reading program during summer school. In one variant of the algorithm, where we match both schools and students, we change the causal estimand through optimal subset matching to better maintain common support. In a second variant, we relax the common support assumption to preserve the causal estimand by only matching on schools. We find that the summer intervention does not appear to increase reading test scores. In a sensitivity analysis, however, we determine that an unobserved confounder could easily mask a larger treatment effect.