
Showing papers on "Nonparametric statistics published in 2014"


Journal ArticleDOI
01 Oct 2014-Genetics
TL;DR: The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures, allowing various parametric and nonparametric shrinkage and variable selection procedures to be combined in a unified and consistent manner.
Abstract: Many modern genomic data analyses require implementing regressions where the number of parameters (p, e.g., the number of marker effects) exceeds sample size (n). Implementing these large-p-with-small-n regressions poses several statistical and computational challenges, some of which can be confronted using Bayesian methods. This approach allows integrating various parametric and nonparametric shrinkage and variable selection procedures in a unified and consistent manner. The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures (Bayesian reproducing kernel Hilbert spaces regressions, RKHS). The software was originally developed for genomic applications; however, the methods implemented are useful for many nongenomic applications as well. The response can be continuous (censored or not) or categorical (either binary or ordinal). The algorithm is based on a Gibbs sampler with scalar updates and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in BGLR, present examples of the use of the package, and discuss practical issues emerging in real-data analysis.
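
For readers who want a concrete picture of the model class BGLR fits, the sketch below implements a Gibbs sampler with scalar updates for a simple Bayesian ridge regression on simulated large-p-with-small-n data. It is a hedged, numpy-only illustration of the idea, not BGLR's C/Fortran implementation; the priors, hyperparameter values, and simulation settings are assumptions made for the example.

```python
# Gibbs sampler for Bayesian ridge regression with scalar updates (sketch only).
import numpy as np

rng = np.random.default_rng(0)

# Simulated large-p-with-small-n data (p >> n)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:10] = rng.normal(0, 0.5, 10)          # a few nonzero marker effects
y = X @ beta_true + rng.normal(0, 1.0, n)

# Assumed scaled-inverse-chi-square hyperparameters for the variance components
nu_e, S_e = 5.0, 1.0
nu_b, S_b = 5.0, 0.01

n_iter, burn_in = 600, 200
beta = np.zeros(p)
sigma2_e, sigma2_b = 1.0, 0.01
xtx = np.sum(X ** 2, axis=0)                     # precomputed x_j' x_j
resid = y - X @ beta
post_mean = np.zeros(p)

for it in range(n_iter):
    # Scalar (coordinate-wise) updates of the regression effects
    for j in range(p):
        resid += X[:, j] * beta[j]               # residual with effect j removed
        prec = xtx[j] / sigma2_e + 1.0 / sigma2_b
        mean = (X[:, j] @ resid) / (sigma2_e * prec)
        beta[j] = rng.normal(mean, np.sqrt(1.0 / prec))
        resid -= X[:, j] * beta[j]               # restore residual with new effect
    # Variance components from their scaled-inverse-chi-square full conditionals
    sigma2_b = (S_b * nu_b + beta @ beta) / rng.chisquare(nu_b + p)
    sigma2_e = (S_e * nu_e + resid @ resid) / rng.chisquare(nu_e + n)
    if it >= burn_in:
        post_mean += beta / (n_iter - burn_in)

print("corr(true effects, posterior means):",
      round(np.corrcoef(beta_true, post_mean)[0, 1], 3))
```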

987 citations



Journal ArticleDOI
TL;DR: This paper proposes an efficient algorithm, called vector field consensus, for establishing robust point correspondences between two sets of points, and suggests a two-stage strategy in which the nonparametric model is used to reduce the size of the putative set and a parametric variant of the approach is then used to estimate the geometric parameters.
Abstract: In this paper, we propose an efficient algorithm, called vector field consensus, for establishing robust point correspondences between two sets of points. Our algorithm starts by creating a set of putative correspondences which can contain a very large number of false correspondences, or outliers, in addition to a limited number of true correspondences (inliers). Next, we solve for correspondence by interpolating a vector field between the two point sets, which involves estimating a consensus of inlier points whose matching follows a nonparametric geometrical constraint. We formulate this as a maximum a posteriori (MAP) estimation of a Bayesian model with hidden/latent variables indicating whether matches in the putative set are outliers or inliers. We impose nonparametric geometrical constraints on the correspondence, as a prior distribution, using Tikhonov regularizers in a reproducing kernel Hilbert space. MAP estimation is performed by the EM algorithm which, by also estimating the variance of the prior model (initialized to a large value), is able to obtain good estimates very quickly (e.g., avoiding many of the local minima inherent in this formulation). We illustrate this method on data sets in 2D and 3D and demonstrate that it is robust to a very large number of outliers (even up to 90%). We also show that in the special case where there is an underlying parametric geometrical model (e.g., the epipolar line constraint) we obtain better results than standard alternatives like RANSAC if a large number of outliers are present. This suggests a two-stage strategy, where we use our nonparametric model to reduce the size of the putative set and then apply a parametric variant of our approach to estimate the geometric parameters. Our algorithm is computationally efficient and we provide code for others to use it. In addition, our approach is general and can be applied to other problems, such as learning with a badly corrupted training data set.

489 citations


Journal ArticleDOI
TL;DR: The divisive method is shown to provide consistent estimates of both the number and the location of change points under standard regularity assumptions, and methods from cluster analysis are applied to assess performance and to allow simple comparisons of location estimates, even when the estimated number differs.
Abstract: Change point analysis has applications in a wide variety of fields. The general problem concerns the inference of a change in distribution for a set of time-ordered observations. Sequential detection is an online version in which new data are continually arriving and are analyzed adaptively. We are concerned with the related, but distinct, offline version, in which retrospective analysis of an entire sequence is performed. For a set of multivariate observations of arbitrary dimension, we consider nonparametric estimation of both the number of change points and the positions at which they occur. We do not make any assumptions regarding the nature of the change in distribution or any distribution assumptions beyond the existence of the αth absolute moment, for some α ∈ (0, 2). Estimation is based on hierarchical clustering and we propose both divisive and agglomerative algorithms. The divisive method is shown to provide consistent estimates of both the number and the location of change points under standard regularity assumptions.
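
One natural between-segment divergence consistent with the α-th moment condition above is an energy statistic built from α-th powers of Euclidean distances. The sketch below implements a single divisive step: it scans candidate split points and returns the one maximizing the scaled between-segment divergence. It is a simplified illustration assuming α = 1 and a minimum segment size; the full divisive method adds recursive splitting and a permutation test to decide how many change points to keep.

```python
# One divisive step of an energy-distance change-point search (sketch only).
import numpy as np

def energy_stat(A, B, alpha=1.0):
    """Sample divergence E(A, B) based on alpha-th powers of Euclidean distances."""
    d_ab = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2) ** alpha
    d_aa = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2) ** alpha
    d_bb = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2) ** alpha
    return 2 * d_ab.mean() - d_aa.mean() - d_bb.mean()

def best_split(X, alpha=1.0, min_size=10):
    """Return the split index maximizing the scaled between-segment divergence."""
    n = len(X)
    best_tau, best_q = None, -np.inf
    for tau in range(min_size, n - min_size):
        n1, n2 = tau, n - tau
        q = (n1 * n2 / (n1 + n2)) * energy_stat(X[:tau], X[tau:], alpha)
        if q > best_q:
            best_tau, best_q = tau, q
    return best_tau, best_q

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),      # segment 1
               rng.normal(1.0, 1.0, (80, 2))])      # segment 2: mean shift
tau_hat, q = best_split(X)
print("estimated change point:", tau_hat, "(true: 100)")
```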

454 citations


Journal ArticleDOI
TL;DR: This work considers bootstrap methods for computing standard errors and confidence intervals that take model selection into account; the methodology involves bagging, also known as bootstrap smoothing, to tame the erratic discontinuities of selection-based estimators.
Abstract: Classical statistical theory ignores model selection in assessing estimation accuracy. Here we consider bootstrap methods for computing standard errors and confidence intervals that take model selection into account. The methodology involves bagging, also known as bootstrap smoothing, to tame the erratic discontinuities of selection-based estimators. A useful new formula for the accuracy of bagging then provides standard errors for the smoothed estimators. Two examples, nonparametric and parametric, are carried through in detail: a regression model where the choice of degree (linear, quadratic, cubic, …) is determined by the Cp criterion and a Lasso-based estimation problem.
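
A minimal sketch of the smoothing idea: the degree of a polynomial regression is chosen by Cp within each bootstrap replication, the bagged (smoothed) estimate is the average of the bootstrap predictions, and its standard error is computed from the covariances between resampling counts and bootstrap replicates. The data, prediction point, candidate degrees, and noise variance are illustrative assumptions, not the paper's worked examples.

```python
# Bootstrap smoothing (bagging) of a Cp-selected polynomial fit, with a
# standard error for the smoothed estimate based on resampling counts.
import numpy as np

rng = np.random.default_rng(2)
n = 60
x = np.sort(rng.uniform(-2, 2, n))
y = 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.5, n)
x0 = 1.5                                     # point at which the mean response is estimated
degrees = [1, 2, 3]
sigma2 = 0.25                                # assumed known noise variance for Cp

def fit_predict(xs, ys, deg, xnew):
    coef = np.polyfit(xs, ys, deg)
    return np.polyval(coef, xnew)

def cp_select(xs, ys):
    """Pick the polynomial degree minimizing Mallows' Cp (known-sigma form)."""
    best_deg, best_cp = None, np.inf
    for deg in degrees:
        coef = np.polyfit(xs, ys, deg)
        rss = np.sum((ys - np.polyval(coef, xs)) ** 2)
        cp = rss + 2 * sigma2 * (deg + 1)
        if cp < best_cp:
            best_deg, best_cp = deg, cp
    return best_deg

B = 2000
t = np.empty(B)                              # bootstrap predictions at x0
N = np.zeros((B, n))                         # resampling counts, used in the SE formula
for b in range(B):
    idx = rng.integers(0, n, n)
    np.add.at(N[b], idx, 1)
    t[b] = fit_predict(x[idx], y[idx], cp_select(x[idx], y[idx]), x0)

smoothed = t.mean()                          # bagged (bootstrap-smoothed) estimate
cov_j = ((N - N.mean(axis=0)) * (t - t.mean())[:, None]).mean(axis=0)
se_smoothed = np.sqrt(np.sum(cov_j ** 2))    # accuracy formula for the bagged estimator
print(f"smoothed estimate at x0: {smoothed:.3f} +/- {se_smoothed:.3f}")
```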

329 citations


Journal ArticleDOI
TL;DR: A penalized spline regression model is developed to address the issues of choosing the number and location of knots in spline regression.
Abstract: Wind turbine power curve modeling is an important tool in turbine performance monitoring and power forecasting. There are several statistical techniques to fit the empirical power curve of a wind turbine, which can be classified into parametric and nonparametric methods. In this paper, we study four of these methods to estimate the wind turbine power curve. Polynomial regression is studied as the benchmark parametric model, and issues associated with this technique are discussed. We then introduce the locally weighted polynomial regression method, and show its advantages over the polynomial regression. Also, the spline regression method is examined to achieve more flexibility for fitting the power curve. Finally, we develop a penalized spline regression model to address the issues of choosing the number and location of knots in the spline regression. The performance of the presented methods is evaluated using two simulated data sets as well as actual operational power data from a wind farm in North America.
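
To make the penalized-spline idea concrete, the sketch below fits a power curve with a linear truncated-power spline basis, many knots, and a ridge penalty on the knot coefficients, so that the smoothing parameter rather than the knot placement controls flexibility. The simulated turbine data, knot grid, and penalty value are assumptions, not the authors' model; in practice the smoothing parameter would be chosen by criteria such as cross-validation or GCV.

```python
# Penalized (linear truncated-power basis) spline fit of a power-curve-like function.
import numpy as np

rng = np.random.default_rng(3)
n = 300
wind = np.sort(rng.uniform(0, 25, n))                     # wind speed (m/s)
true_power = 100 / (1 + np.exp(-(wind - 10)))             # smooth S-shaped curve
power = true_power + rng.normal(0, 5, n)                  # noisy measured power (kW)

knots = np.linspace(1, 24, 30)                            # many knots; penalty controls smoothness

def design(x):
    # Columns: [1, x, (x - k1)_+, ..., (x - kK)_+]
    return np.column_stack([np.ones_like(x), x,
                            np.clip(x[:, None] - knots[None, :], 0, None)])

X = design(wind)
lam = 10.0                                                # smoothing parameter (assumed value)
D = np.diag([0.0, 0.0] + [1.0] * len(knots))              # penalize only the knot coefficients
coef = np.linalg.solve(X.T @ X + lam * D, X.T @ power)

grid = np.linspace(0, 25, 6)
fitted = design(grid) @ coef
for w, p in zip(grid, fitted):
    print(f"wind {w:5.1f} m/s -> fitted power {p:7.1f} kW")
```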

193 citations


Journal ArticleDOI
TL;DR: A review of functional principal component analysis, and its use in explanatory analysis, modeling and forecasting, and classification of functional data is provided in this article from both methodological and practical viewpoints.
Abstract: Advances in data collection and storage have tremendously increased the presence of functional data, whose graphical representations are curves, images or shapes. As a new area of statistics, functional data analysis extends existing methodologies and theories from the realms of functional analysis, generalized linear model, multivariate data analysis, nonparametric statistics, regression models and many others. From both methodological and practical viewpoints, this paper provides a review of functional principal component analysis, and its use in explanatory analysis, modeling and forecasting, and classification of functional data.
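
A minimal numpy sketch of the FPCA step reviewed here: densely observed curves are centered, and the eigenfunctions, variance proportions, and principal component scores are read off an SVD of the data matrix. The simulated curves and the number of retained components are assumptions for illustration.

```python
# Functional PCA of densely observed curves via SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(4)
n_curves, n_grid = 200, 101
t = np.linspace(0, 1, n_grid)

# Simulated curves: mean function plus two random "modes of variation" plus noise
mean_fn = np.sin(2 * np.pi * t)
scores1 = rng.normal(0, 1.5, n_curves)
scores2 = rng.normal(0, 0.5, n_curves)
curves = (mean_fn
          + scores1[:, None] * np.sqrt(2) * np.cos(2 * np.pi * t)
          + scores2[:, None] * np.sqrt(2) * np.sin(4 * np.pi * t)
          + rng.normal(0, 0.1, (n_curves, n_grid)))

mu_hat = curves.mean(axis=0)
centered = curves - mu_hat
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Eigenfunctions (rows of Vt) and the fraction of variance explained by each component
var_explained = s**2 / np.sum(s**2)
fpc_scores = centered @ Vt[:2].T            # projections onto the first two eigenfunctions
print("variance explained by first 3 components:", np.round(var_explained[:3], 3))
print("score matrix shape:", fpc_scores.shape)
```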

177 citations


Journal ArticleDOI
TL;DR: In this paper, a general class of weighting strategies for balancing covariates is proposed, which unifies existing weighting methods, including commonly used weights such as inverse probability weights as special cases.
Abstract: Covariate balance is crucial for unconfounded descriptive or causal comparisons. However, lack of balance is common in observational studies. This article considers weighting strategies for balancing covariates. We define a general class of weights---the balancing weights---that balance the weighted distributions of the covariates between treatment groups. These weights incorporate the propensity score to weight each group to an analyst-selected target population. This class unifies existing weighting methods, including commonly used weights such as inverse-probability weights as special cases. General large-sample results on nonparametric estimation based on these weights are derived. We further propose a new weighting scheme, the overlap weights, in which each unit's weight is proportional to the probability of that unit being assigned to the opposite group. The overlap weights are bounded, and minimize the asymptotic variance of the weighted average treatment effect among the class of balancing weights. The overlap weights also possess a desirable small-sample exact balance property, based on which we propose a new method that achieves exact balance for means of any selected set of covariates. Two applications illustrate these methods and compare them with other approaches.
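
The sketch below contrasts inverse-probability weights with overlap weights on simulated observational data, using an estimated propensity score: under overlap weighting each unit is weighted by the probability of being assigned to the opposite group. The data-generating process is made up for illustration, and scikit-learn's logistic regression stands in for whatever propensity model an analyst would choose.

```python
# Inverse-probability weights vs overlap weights built from an estimated propensity score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(0, 1, (n, 3))
propensity = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, propensity)
Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n)   # true effect = 2

e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Inverse-probability weights (target: combined population)
w_ipw = np.where(T == 1, 1 / e_hat, 1 / (1 - e_hat))
# Overlap weights: each unit weighted by the probability of the *opposite* assignment
w_ow = np.where(T == 1, 1 - e_hat, e_hat)

def weighted_effect(w):
    treated = T == 1
    return (np.average(Y[treated], weights=w[treated])
            - np.average(Y[~treated], weights=w[~treated]))

print("IPW estimate:    ", round(weighted_effect(w_ipw), 3))
print("Overlap estimate:", round(weighted_effect(w_ow), 3))
```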

174 citations



Journal ArticleDOI
TL;DR: In this article, the authors proposed a nonparametric independence screening (NIS) method to select variables by ranking a measure of the non-parametric marginal contributions of each covariate given the exposure variable.
Abstract: The varying coefficient model is an important class of nonparametric statistical model, which allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is large, the issue of variable selection arises. In this article, we propose and investigate marginal nonparametric screening methods to screen variables in sparse ultra-high-dimensional varying coefficient models. The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independent screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance the practical utility and finite sample performance, two data-driven iterative NIS (INIS) methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods.

164 citations


Journal ArticleDOI
TL;DR: A novel, nonparametric method for summarizing ensembles of 2D and 3D curves is presented and an extension of a method from descriptive statistics, data depth, to curves is proposed, which is a generalization of traditional whisker plots or boxplots to multidimensional curves.
Abstract: In simulation science, computational scientists often study the behavior of their simulations by repeated solutions with variations in parameters and/or boundary values or initial conditions. Through such simulation ensembles, one can try to understand or quantify the variability or uncertainty in a solution as a function of the various inputs or model assumptions. In response to a growing interest in simulation ensembles, the visualization community has developed a suite of methods for allowing users to observe and understand the properties of these ensembles in an efficient and effective manner. An important aspect of visualizing simulations is the analysis of derived features, often represented as points, surfaces, or curves. In this paper, we present a novel, nonparametric method for summarizing ensembles of 2D and 3D curves. We propose an extension of a method from descriptive statistics, data depth, to curves. We also demonstrate a set of rendering and visualization strategies for showing rank statistics of an ensemble of curves, which is a generalization of traditional whisker plots or boxplots to multidimensional curves. Results are presented for applications in neuroimaging, hurricane forecasting and fluid dynamics.
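
The rank statistics behind such curve boxplots can be computed with a data-depth notion for functions. The sketch below uses the modified band depth (computed over all pairs of curves) to order an ensemble of 1-D curves, take the deepest curve as the functional median, and form a 50% central envelope. The simulated ensemble and the 50% cut-off are illustrative choices, and this is a generic band-depth sketch rather than the authors' visualization pipeline.

```python
# Modified band depth (j = 2) and a simple curve-boxplot summary for a 1-D ensemble.
import numpy as np

rng = np.random.default_rng(6)
n_curves, n_grid = 40, 100
t = np.linspace(0, 1, n_grid)
curves = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, (n_curves, 1)) \
         + rng.normal(0, 0.1, (n_curves, n_grid))

def modified_band_depth(Y):
    """Proportion of time each curve spends inside the band of every pair of curves."""
    n = len(Y)
    depth = np.zeros(n)
    for i in range(n):
        for j in range(i + 1, n):
            lo = np.minimum(Y[i], Y[j])
            hi = np.maximum(Y[i], Y[j])
            inside = (Y >= lo) & (Y <= hi)          # n x n_grid booleans
            depth += inside.mean(axis=1)
    return depth / (n * (n - 1) / 2)

depth = modified_band_depth(curves)
order = np.argsort(depth)[::-1]
median_curve = curves[order[0]]                      # deepest curve = functional median
central_half = curves[order[: n_curves // 2]]        # 50% central region ("box" of the curve boxplot)
band_lo, band_hi = central_half.min(axis=0), central_half.max(axis=0)
print("index of deepest (median) curve:", order[0])
print("band width at t = 0.5:", round(band_hi[n_grid // 2] - band_lo[n_grid // 2], 3))
```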

Journal ArticleDOI
TL;DR: Parametric methods were unable to predict phenotypic values when the underlying genetic architecture was based entirely on epistasis, and were slightly better than nonparametric methods for additive genetic architectures.
Abstract: Parametric and nonparametric methods have been developed for purposes of predicting phenotypes. These methods are based on retrospective analyses of empirical data consisting of genotypic and phenotypic scores. Recent reports have indicated that parametric methods are unable to predict phenotypes of traits with known epistatic genetic architectures. Herein, we review parametric methods including least squares regression, ridge regression, Bayesian ridge regression, least absolute shrinkage and selection operator (LASSO), Bayesian LASSO, best linear unbiased prediction (BLUP), Bayes A, Bayes B, Bayes C, and Bayes Cπ. We also review nonparametric methods including Nadaraya-Watson estimator, reproducing kernel Hilbert space, support vector machine regression, and neural networks. We assess the relative merits of these 14 methods in terms of accuracy and mean squared error (MSE) using simulated genetic architectures consisting of completely additive or two-way epistatic interactions in an F2 population derived from crosses of inbred lines. Each simulated genetic architecture explained either 30% or 70% of the phenotypic variability. The greatest impact on estimates of accuracy and MSE was due to genetic architecture. Parametric methods were unable to predict phenotypic values when the underlying genetic architecture was based entirely on epistasis. Parametric methods were slightly better than nonparametric methods for additive genetic architectures. Distinctions among parametric methods for additive genetic architectures were incremental. Heritability, i.e., proportion of phenotypic variability, had the second greatest impact on estimates of accuracy and MSE.
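
As a toy illustration of the parametric-versus-nonparametric comparison, the sketch below fits ridge regression and a Nadaraya-Watson (Gaussian-kernel) predictor to simulated marker data under an additive and a purely two-way epistatic architecture. Marker coding, effect sizes, sample sizes, and the kernel bandwidth are assumptions; the point is to show the two estimator classes side by side, not to reproduce the paper's simulation study.

```python
# Ridge (parametric) vs Nadaraya-Watson (nonparametric) prediction on simulated markers.
import numpy as np

rng = np.random.default_rng(7)

def simulate(n, p, epistatic):
    G = rng.integers(0, 3, (n, p)).astype(float)              # markers coded 0/1/2
    if epistatic:
        # Two-way interactions of centered markers: no marginal additive signal
        signal = (G[:, 0] - 1) * (G[:, 1] - 1) + (G[:, 2] - 1) * (G[:, 3] - 1)
    else:
        signal = G[:, :4] @ np.array([1.0, -0.8, 0.6, 0.5])   # additive effects
    signal = (signal - signal.mean()) / signal.std()
    return G, signal + rng.normal(0, 1.0, n)                   # ~50% genetic variance

def ridge_predict(Gtr, ytr, Gte, lam=10.0):
    mu = Gtr.mean(axis=0)
    Xtr, Xte = Gtr - mu, Gte - mu                              # center markers
    beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(Xtr.shape[1]),
                           Xtr.T @ (ytr - ytr.mean()))
    return ytr.mean() + Xte @ beta

def nw_predict(Gtr, ytr, Gte, h=3.0):
    d2 = ((Gte[:, None, :] - Gtr[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * h ** 2))                             # Gaussian kernel weights
    return (W @ ytr) / W.sum(axis=1)

for label, epi in [("additive", False), ("epistatic", True)]:
    G, y = simulate(400, 20, epi)
    Gtr, ytr, Gte, yte = G[:300], y[:300], G[300:], y[300:]
    for name, pred in [("ridge", ridge_predict(Gtr, ytr, Gte)),
                       ("Nadaraya-Watson", nw_predict(Gtr, ytr, Gte))]:
        print(f"{label:9s} {name:15s} predictive correlation:"
              f" {np.corrcoef(yte, pred)[0, 1]: .3f}")
```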

Journal ArticleDOI
TL;DR: In this article, the conditional average treatment effect (CATE) was proposed to capture the heterogeneity of a treatment effect across sub-populations when the unconfoundedness assumption applies.
Abstract: We consider a functional parameter called the conditional average treatment effect (CATE), designed to capture the heterogeneity of a treatment effect across subpopulations when the unconfoundedness assumption applies. In contrast to quantile regressions, the subpopulations of interest are defined in terms of the possible values of a set of continuous covariates rather than the quantiles of the potential outcome distributions. We show that the CATE parameter is nonparametrically identified under unconfoundedness and propose inverse probability weighted estimators for it. Under regularity conditions, some of which are standard and some are new in the literature, we show (pointwise) consistency and asymptotic normality of a fully nonparametric and a semiparametric estimator. We apply our methods to estimate the average effect of a first-time mother’s smoking during pregnancy on the baby’s birth weight as a function of the mother’s age. A robust qualitative finding is that the expected effect becomes stronger...
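
A minimal sketch of the IPW idea for CATE estimation: under unconfoundedness the pseudo-outcome Y*T/e(X) - Y*(1-T)/(1-e(X)) has conditional mean equal to the CATE, so smoothing it against the covariate of interest gives a nonparametric estimate. The example uses simulated data with a known propensity score; the smoking/birth-weight application and the paper's semiparametric estimator are not reproduced.

```python
# Kernel-smoothed IPW pseudo-outcomes as a nonparametric CATE estimator (sketch only).
import numpy as np

rng = np.random.default_rng(8)
n = 20000
age = rng.uniform(18, 40, n)                       # the conditioning covariate
e = 1 / (1 + np.exp(-(age - 29) / 5))              # treatment probability (known here)
T = rng.binomial(1, e)
cate_true = 1.0 + 0.1 * (age - 29)                 # effect grows with age
Y = 0.05 * age + cate_true * T + rng.normal(0, 1, n)

# IPW pseudo-outcome: its conditional mean given age equals CATE(age)
Z = Y * T / e - Y * (1 - T) / (1 - e)

def nw(x0, x, z, h=1.5):
    """Nadaraya-Watson smoother with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * z) / np.sum(w)

for a in [22, 29, 36]:
    print(f"age {a}: CATE estimate {nw(a, age, Z):.2f}  (true {1.0 + 0.1 * (a - 29):.2f})")
```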

Journal ArticleDOI
TL;DR: In this article, the generalized method of moments (GMM) is applied to obtain es- timators of the parameters in the nonresponse probability and the nonparametric joint distribution of the study variable y and covariate x.
Abstract: Estimation based on data with nonignorable nonresponse is considered when the joint distribution of the study variable y and covariate x is nonpara- metric and the nonresponse probability conditional on y and x has a parametric form. The likelihood based on observed data may not be identifiable even when the joint distribution of y and x is parametric. We show that this difficulty can be overcome by utilizing a nonresponse instrument, an auxiliary variable related to y but not related to the nonresponse probability conditional on y and x. Under some conditions we can apply the generalized method of moments (GMM) to obtain es- timators of the parameters in the nonresponse probability and the nonparametric joint distribution of y and x. Consistency and asymptotic normality of GMM es- timators are established. Simulation results and an application to a data set from the Korean Labor and Income Panel Survey are also presented.

Proceedings ArticleDOI
04 Jun 2014
TL;DR: A comparative evaluation of parametric and non-parametric approaches for speed prediction during highway driving shows that the relative performance of the different models varies strongly with the prediction horizon, which should be taken into account when selecting a prediction model for a given ITS application.
Abstract: Predicting the future speed of the ego-vehicle is a necessary component of many Intelligent Transportation Systems (ITS) applications, in particular for safety and energy management systems. In the last four decades many parametric speed prediction models have been proposed, the most advanced ones being developed for use in traffic simulators. More recently non-parametric approaches have been applied to closely related problems in robotics. This paper presents a comparative evaluation of parametric and non-parametric approaches for speed prediction during highway driving. Real driving data is used for the evaluation, and both short-term and long-term predictions are tested. The results show that the relative performance of the different models vary strongly with the prediction horizon. This should be taken into account when selecting a prediction model for a given ITS application.
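
As a toy version of the comparison, the sketch below evaluates a constant-speed (persistence) baseline against a simple non-parametric k-nearest-neighbour predictor that matches the recent speed history against a library of past trajectories, at several horizons. The simulated speed profile, history length, and k are assumptions; the actual evaluation in the paper uses recorded highway driving data.

```python
# Persistence baseline vs a kNN trajectory-matching speed predictor over several horizons.
import numpy as np

rng = np.random.default_rng(9)
T = 3000
speed = np.empty(T)
speed[0] = 25.0
for t in range(1, T):                              # synthetic highway speed profile (m/s)
    speed[t] = speed[t - 1] + 0.02 * (27 - speed[t - 1]) + rng.normal(0, 0.4)

hist, horizon, k = 10, 50, 20                      # 1 s steps: 10 s history, up to 50 s ahead
split = 2500
train, test = speed[:split], speed[split:]

# Library of (history, future) pairs from the training portion
H = np.array([train[i - hist:i] for i in range(hist, len(train) - horizon)])
F = np.array([train[i:i + horizon] for i in range(hist, len(train) - horizon)])

def knn_forecast(history):
    d = np.linalg.norm(H - history, axis=1)
    return F[np.argsort(d)[:k]].mean(axis=0)       # average future of the k closest histories

errors_persist, errors_knn = [], []
for i in range(hist, len(test) - horizon):
    h_vec, future = test[i - hist:i], test[i:i + horizon]
    errors_persist.append(np.abs(future - h_vec[-1]))       # hold last observed speed
    errors_knn.append(np.abs(future - knn_forecast(h_vec)))

for hz in [5, 20, 50]:
    mae_p = np.mean([e[hz - 1] for e in errors_persist])
    mae_k = np.mean([e[hz - 1] for e in errors_knn])
    print(f"{hz:2d} s ahead: persistence MAE {mae_p:.2f}  kNN MAE {mae_k:.2f}")
```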

Journal ArticleDOI
TL;DR: In this paper, the impact of external-environmental factors on the performance of economic producers is investigated, which can help to explain the efficiency differentials, as well as improve the managerial policy of the evaluated units.
Abstract: The performance of economic producers is often affected by external or environmental factors that, unlike the inputs and the outputs, are not under the control of the Decision Making Units (DMUs). These factors can be included in the model as exogenous variables and can help to explain the efficiency differentials, as well as improve the managerial policy of the evaluated units. A fully nonparametric methodology, which includes external variables in the frontier model and defines conditional DEA and FDH efficiency scores, is now available for investigating the impact of external-environmental factors on the performance. In this paper, we offer a state-of-the-art review of the literature, which has been proposed to include environmental variables in nonparametric and robust (to outliers) frontier models and to analyse and interpret the conditional efficiency scores, capturing their impact on the attainable set and/or on the distribution of the inefficiency scores. This paper develops and complements the approach of Badin et al. (2012) by suggesting a procedure that allows us to make local inference and provide confidence intervals for the impact of the external factors on the process. We advocate for the nonparametric conditional methodology, which avoids the restrictive “separability” assumption required by the two-stage approaches in order to provide meaningful results. An illustration with real data on mutual funds shows the usefulness of the proposed approach.
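
The unconditional building block of these conditional frontier methods is easy to sketch. The code below computes input-oriented FDH (free disposal hull) efficiency scores on made-up data; the conditional versions discussed in the paper would additionally restrict each unit's comparison set to units with similar values of the environmental variables. Data and dimensions are illustrative assumptions.

```python
# Input-oriented FDH efficiency scores for a set of decision making units (DMUs).
import numpy as np

rng = np.random.default_rng(10)
n_dmu = 50
inputs = rng.uniform(1, 10, (n_dmu, 2))                  # two inputs per DMU
outputs = (inputs.mean(axis=1, keepdims=True)            # output loosely tied to inputs
           * rng.uniform(0.5, 1.0, (n_dmu, 1)))

def fdh_input_efficiency(X, Y):
    n = len(X)
    theta = np.ones(n)
    for i in range(n):
        # DMUs producing at least as much output as DMU i (free disposability)
        dominating = np.all(Y >= Y[i], axis=1)
        # Radial input contraction needed to reach each dominating DMU's input mix
        ratios = np.max(X[dominating] / X[i], axis=1)
        theta[i] = ratios.min()                           # best attainable contraction
    return theta

theta = fdh_input_efficiency(inputs, outputs)
print("efficient DMUs (theta = 1):", np.sum(np.isclose(theta, 1.0)))
print("mean efficiency score:", round(theta.mean(), 3))
```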

Book
06 Dec 2014
TL;DR: This textbook introduces biostatistical methods, covering descriptive tools, probability and probability distributions, study designs, interval estimation, tests of hypotheses (parametric and nonparametric), and the analysis of categorical, survival, and survey data.
Abstract: 1. Introduction, 2. Data and Numbers, 3. Descriptive Tools, 4. Probability and Life Tables, 5. Probability Distributions, 6. Study Designs, 7. Interval Estimation, 8. Test of Hypotheses, 9. Test of Hypotheses Based on the Normal Distribution, 10. Nonparametric Tests, 11. Analysis of Categorical Data, 12. Analysis of Survival Data, 13. Analysis of Variance, 14. Linear Regression, 15. Logistic Regression, 16. Analysis of Survey Data, Appendix A. Statistical Tables, Appendix B. Selected Governmental Biostatistical Data, Appendix C. Solutions to Selected Exercises

Posted Content
01 Jan 2014
TL;DR: This book treats the latest developments in the theory of order-restricted inference, with special attention to nonparametric methods and algorithmic aspects, which are used in computing maximum likelihood estimators and developing distribution theory for inverse problems of this type.
Abstract: This book treats the latest developments in the theory of order-restricted inference, with special attention to nonparametric methods and algorithmic aspects. Among the topics treated are current status and interval censoring models, competing risk models, and deconvolution. Methods of order restricted inference are used in computing maximum likelihood estimators and developing distribution theory for inverse problems of this type. The authors have been active in developing these tools and present the state of the art and the open problems in the field. The earlier chapters provide an introduction to the subject, while the later chapters are written with graduate students and researchers in mathematical statistics in mind. Each chapter ends with a set of exercises of varying difficulty. The theory is illustrated with the analysis of real-life data, which are mostly medical in nature.

Journal ArticleDOI
TL;DR: This document provides a brief introduction to the R package gss for nonparametric statistical modeling in a variety of problem settings including regression, density estimation, and hazard estimation.
Abstract: This document provides a brief introduction to the R package gss for nonparametric statistical modeling in a variety of problem settings including regression, density estimation, and hazard estimation. Functional ANOVA (analysis of variance) decompositions are built into models on product domains, and modeling and inferential tools are provided for tasks such as interval estimates, the “testing” of negligible model terms, the handling of correlated data, etc. The methodological background is outlined, and data analysis is illustrated using real-data examples.

Journal ArticleDOI
TL;DR: A new method based on subsampling is proposed to deal with plug-in issues in the case of the Kolmogorov–Smirnov test of uniformity, and some nonparametric estimates satisfying those constraints in the Poisson or in the Hawkes framework are highlighted.
Abstract: When dealing with classical spike train analysis, the practitioner often performs goodness-of-fit tests to test whether the observed process is a Poisson process, for instance, or if it obeys another type of probabilistic model (Yana et al. in Biophys. J. 46(3):323–330, 1984; Brown et al. in Neural Comput. 14(2):325–346, 2002; Pouzat and Chaffiol in Technical report, http://arxiv.org/abs/arXiv:0909.2785 , 2009). In doing so, there is a fundamental plug-in step, where the parameters of the supposed underlying model are estimated. The aim of this article is to show that plug-in has sometimes very undesirable effects. We propose a new method based on subsampling to deal with those plug-in issues in the case of the Kolmogorov–Smirnov test of uniformity. The method relies on the plug-in of good estimates of the underlying model that have to be consistent with a controlled rate of convergence. Some nonparametric estimates satisfying those constraints in the Poisson or in the Hawkes framework are highlighted. Moreover, they share adaptive properties that are useful from a practical point of view. We show the performance of those methods on simulated data. We also provide a complete analysis with these tools on single unit activity recorded on a monkey during a sensory-motor task. Electronic Supplementary Material The online version of this article (doi:10.1186/2190-8567-4-3) contains supplementary material.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a nonparametric maximum likelihood approach to detect multiple change-points in the data sequence, which does not impose any parametric assumption on the underlying distributions.
Abstract: In multiple change-point problems, different data segments often follow different distributions, for which the changes may occur in the mean, scale or the entire distribution from one segment to another. Without the need to know the number of change-points in advance, we propose a nonparametric maximum likelihood approach to detecting multiple change-points. Our method does not impose any parametric assumption on the underlying distributions of the data sequence, which is thus suitable for detection of any changes in the distributions. The number of change-points is determined by the Bayesian information criterion and the locations of the change-points can be estimated via the dynamic programming algorithm and the use of the intrinsic order structure of the likelihood function. Under some mild conditions, we show that the new method provides consistent estimation with an optimal rate. We also suggest a prescreening procedure to exclude most of the irrelevant points prior to the implementation of the nonparametric likelihood method. Simulation studies show that the proposed method has satisfactory performance of identifying multiple change-points in terms of estimation accuracy and computation time.

OtherDOI
29 Sep 2014
TL;DR: This entry reviews common nonparametric approaches to incorporating time and other covariate effects for longitudinally observed response data; for the closely related setting of functional data, the prevailing approaches to modeling random effects are functional principal components analysis and B-splines.
Abstract: Nonparametric approaches have recently emerged as a flexible way to model longitudinal data. This entry reviews some of the common nonparametric approaches to incorporate time and other covariate effects for longitudinally observed response data. Smoothing procedures are invoked to estimate the associated nonparametric functions, but the choice of smoothers can vary and is often subjective. Both fixed and random effects may be included for vector or longitudinal covariates. A closely related type of data is functional data, where the prevailing approaches to model random effects are through functional principal components analysis and B-splines. Related semiparametric regression models also play an increasingly important role. Keywords: functional data analysis; scatter-plot smoother; mean curve; fixed effects; random effects; principal components analysis; semiparametric regression


01 Sep 2014
TL;DR: In this paper, the authors combine ideas from inverse optimization with the theory of variational inequalities to estimate the utility functions of players in a game from their observed actions and estimate the congestion function on a road network from traffic count data.
Abstract: Equilibrium modeling is common in a variety of fields such as game theory and transportation science. The inputs for these models, however, are often difficult to estimate, while their outputs, i.e., the equilibria they are meant to describe, are often directly observable. By combining ideas from inverse optimization with the theory of variational inequalities, we develop an efficient, data-driven technique for estimating the parameters of these models from observed equilibria. We use this technique to estimate the utility functions of players in a game from their observed actions and to estimate the congestion function on a road network from traffic count data. A distinguishing feature of our approach is that it supports both parametric and nonparametric estimation by leveraging ideas from statistical learning (kernel methods and regularization operators). In computational experiments involving Nash and Wardrop equilibria in a nonparametric setting, we find that a) we effectively estimate the unknown demand or congestion function, respectively, and b) our proposed regularization technique substantially improves the out-of-sample performance of our estimators.

Journal ArticleDOI
TL;DR: This technical paper offers a critical re-evaluation of (spectral) Granger causality measures in the analysis of biological time series, and demonstrates how both parametric and nonparametric spectral causality measures can become unreliable in the presence of measurement noise.

Reference EntryDOI
29 Sep 2014
TL;DR: In this article, a Student's t test is used to compare mean differences between treatments when the observations have been obtained in pairs, and the difference between the paired values is assumed to be normally distributed.
Abstract: The paired t test is used to compare mean differences between treatments when the observations have been obtained in pairs. The difference between the paired values is assumed to be normally distributed, and the null hypothesis that the expectation is zero is tested by Student's t test. The robustness properties are discussed, as is the asymptotic relative efficiency of nonparametric alternatives. Keywords: student; degrees of freedom; robustness; ARE; nonparametric; distribution-free; signed rank; normal scores
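
A small, hedged illustration of the comparison: the paired t test and its distribution-free alternative, the Wilcoxon signed-rank test, applied to the same simulated paired data via SciPy.

```python
# Paired t test vs Wilcoxon signed-rank test on the same paired observations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
before = rng.normal(50, 10, 25)
after = before + rng.normal(2, 5, 25)        # true mean difference of 2

t_stat, t_p = stats.ttest_rel(after, before)
w_stat, w_p = stats.wilcoxon(after, before)

print(f"paired t test:        t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {w_p:.3f}")
```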

Journal ArticleDOI
TL;DR: In this paper, a multiscale space on which nonparametric priors and posteriors are naturally defined is introduced, and the authors prove Bernstein-von-Mises theorems for a variety of priors in the setting of Gaussian non-parametric regression and in the i.i.d. sampling model.
Abstract: We continue the investigation of Bernstein–von Mises theorems for nonparametric Bayes procedures from [Ann. Statist. 41 (2013) 1999–2028]. We introduce multiscale spaces on which nonparametric priors and posteriors are naturally defined, and prove Bernstein–von Mises theorems for a variety of priors in the setting of Gaussian nonparametric regression and in the i.i.d. sampling model. From these results we deduce several applications where posterior-based inference coincides with efficient frequentist procedures, including Donsker– and Kolmogorov–Smirnov theorems for the random posterior cumulative distribution functions. We also show that multiscale posterior credible bands for the regression or density function are optimal frequentist confidence bands.

Journal ArticleDOI
TL;DR: In this paper, a nonparametric identification of the SACE is achieved by leveraging post-exposure longitudinal correlates of survival and outcome that may also mediate the exposure effects on survival, and a weighted analysis involving a consistent estimate of the survival process is shown to produce consistent estimates of SACE.
Abstract: In longitudinal studies, outcomes ascertained at follow-up are typically undefined for individuals who die prior to the follow-up visit. In such settings, outcomes are said to be truncated by death and inference about the effects of a point treatment or exposure, restricted to individuals alive at the follow-up visit, could be biased even if as in experimental studies, treatment assignment were randomized. To account for truncation by death, the survivor average causal effect (SACE) defines the effect of treatment on the outcome for the subset of individuals who would have survived regardless of exposure status. In this paper, the author nonparametrically identifies SACE by leveraging post-exposure longitudinal correlates of survival and outcome that may also mediate the exposure effects on survival and outcome. Nonparametric identification is achieved by supposing that the longitudinal data arise from a certain nonparametric structural equations model and by making the monotonicity assumption that the effect of exposure on survival agrees in its direction across individuals. A novel weighted analysis involving a consistent estimate of the survival process is shown to produce consistent estimates of SACE. A data illustration is given, and the methods are extended to the context of time-varying exposures. We discuss a sensitivity analysis framework that relaxes assumptions about independent errors in the nonparametric structural equations model and may be used to assess the extent to which inference may be altered by a violation of key identifying assumptions. © 2014 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.

Journal ArticleDOI
01 Jun 2014-Oikos
TL;DR: It is found that better forecasts were correlated with attributes of slow growing species: large maximum age and size for fishes and high trophic level for birds.
Abstract: Short-term forecasts based on time series of counts or survey data are widely used in population biology to provide advice concerning the management, harvest and conservation of natural populations. A common approach to produce these forecasts uses time-series models, of different types, fit to time series of counts. Similar time-series models are used in many other disciplines; however, relative to the data available in these other disciplines, population data are often unusually short and noisy, and models that perform well for data from other disciplines may not be appropriate for population data. In order to study the performance of time-series forecasting models for natural animal population data, we assembled 2379 time series of vertebrate population indices from actual surveys. Our data were comprised of three vastly different types: highly variable (marine fish productivity), strongly cyclic (adult salmon counts), and small variance but long-memory (bird and mammal counts). We tested the predictive performance of 49 different forecasting models grouped into three broad classes: autoregressive time-series models, non-linear regression-type models and non-parametric time-series models. Low-dimensional parametric autoregressive models gave the most accurate forecasts across a wide range of taxa; the most accurate model was one that simply treated the most recent observation as the forecast. More complex parametric and non-parametric models performed worse, except when applied to highly cyclic species. Across taxa, certain life history characteristics were correlated with lower forecast error; specifically, we found that better forecasts were correlated with attributes of slow growing species: large maximum age and size for fishes and high trophic level for birds. © 2014 Nordic Society Oikos.
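
The most accurate model in the study simply carries the most recent observation forward. The sketch below compares that naive forecast with a least-squares AR(1) fit on a short simulated abundance index, evaluating one-step-ahead absolute errors; the series and its length are assumptions chosen to mimic the short, noisy population series described above, not data from the assembled surveys.

```python
# Naive (last observation) forecast vs a least-squares AR(1) forecast on a short series.
import numpy as np

rng = np.random.default_rng(12)
T = 40                                        # population series are typically short
x = np.empty(T)
x[0] = 100.0
for t in range(1, T):
    x[t] = 0.7 * x[t - 1] + 30 + rng.normal(0, 15)   # AR(1)-like abundance index

train_len = 30
err_naive, err_ar1 = [], []
for t in range(train_len, T):
    hist = x[:t]
    # Naive forecast: last observation carried forward
    f_naive = hist[-1]
    # AR(1) fit by least squares on (x_{s-1}, x_s) pairs
    a, b = np.polyfit(hist[:-1], hist[1:], 1)
    f_ar1 = a * hist[-1] + b
    err_naive.append(abs(x[t] - f_naive))
    err_ar1.append(abs(x[t] - f_ar1))

print("naive MAE:", round(np.mean(err_naive), 2))
print("AR(1) MAE:", round(np.mean(err_ar1), 2))
```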

Posted Content
TL;DR: In this paper, an automatic statistician, focusing on regression problems, explores an open-ended space of statistical models to discover a good explanation of a data set and then produces a detailed report with figures and natural language text.
Abstract: This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models this allows us to automatically describe functions in simple terms. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.
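
A minimal numpy sketch of the modeling building block: Gaussian-process regression with a composed kernel (squared-exponential trend plus a periodic component), the kind of structure the automatic statistician searches over and then describes in words. Kernel hyperparameters here are fixed by hand rather than optimized or discovered by search, and the data are simulated.

```python
# GP regression with a composite kernel: squared-exponential (trend) + periodic component.
import numpy as np

def k_se(x1, x2, ell=2.0, sf=1.0):
    return sf**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

def k_per(x1, x2, period=1.0, ell=0.7, sf=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return sf**2 * np.exp(-2 * np.sin(np.pi * d / period)**2 / ell**2)

def k_sum(x1, x2):
    return k_se(x1, x2) + k_per(x1, x2)        # "smooth trend + periodicity"

rng = np.random.default_rng(13)
x = np.sort(rng.uniform(0, 6, 60))
y = 0.5 * x + np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)
x_star = np.linspace(0, 8, 5)                  # includes extrapolation beyond the data

noise = 0.2**2
K = k_sum(x, x) + noise * np.eye(len(x))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
K_star = k_sum(x_star, x)
mean = K_star @ alpha                           # predictive mean
var = np.diag(k_sum(x_star, x_star)) - np.sum(np.linalg.solve(L, K_star.T)**2, axis=0)

for xs, m, v in zip(x_star, mean, np.sqrt(np.maximum(var, 0))):
    print(f"x = {xs:4.1f}: predicted {m:6.2f} +/- {2 * v:4.2f}")
```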