Showing papers in "arXiv: Methodology in 2017"

Posted Content•

A Conceptual Introduction to Hamiltonian Monte Carlo

10 Jan 2017-arXiv: Methodology

TL;DR: This review provides a comprehensive conceptual account of the theoretical foundations of Hamiltonian Monte Carlo, focusing on developing a principled intuition behind the method and its optimal implementations rather than on exhaustive rigor.

Abstract: Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is confined within the mathematics of differential geometry, which has limited its dissemination, especially to the applied communities for which it is particularly important. In this review I provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather than any exhaustive rigor. Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

894 citations

Journal Article•DOI•

Visualization in Bayesian workflow

Jonah Gabry¹, Daniel Simpson², Aki Vehtari³, Michael Betancourt¹, Andrew Gelman¹

Columbia University¹, University of Toronto², Aalto University³

05 Sep 2017-arXiv: Methodology

TL;DR: The authors argue that visualization is helpful at every stage of the Bayesian workflow (model building, inference, model checking and evaluation, and model expansion) and is indispensable when drawing inferences from modern, high-dimensional models.

Abstract: Bayesian data analysis is about more than just computing a posterior distribution, and Bayesian visualization is about more than trace plots of Markov chains. Practical Bayesian data analysis, like all data analysis, is an iterative process of model building, inference, model checking and evaluation, and model expansion. Visualization is helpful in each of these stages of the Bayesian workflow and it is indispensable when drawing inferences from the types of modern, high-dimensional models that are used by applied researchers.

390 citations

Book Chapter•DOI•

The Econometrics of Randomized Experiments

Susan Athey¹,², Guido W. Imbens¹,²

Stanford University¹, National Bureau of Economic Research²

01 Jan 2017-arXiv: Methodology

TL;DR: This chapter presents econometric and statistical methods for analyzing randomized experiments, stresses randomization-based inference and the general efficiency gains from stratification, and contrasts intention-to-treat analyses with instrumental variables analyses that allow for general treatment effect heterogeneity.

Abstract: In this chapter, we present econometric and statistical methods for analyzing randomized experiments. For basic experiments, we stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population. We show how this perspective relates to regression analyses for randomized experiments. We discuss the analyses of stratified, paired, and clustered randomized experiments, and we stress the general efficiency gains from stratification. We also discuss complications in randomized experiments such as noncompliance. In the presence of noncompliance, we contrast intention-to-treat analyses with instrumental variables analyses allowing for general treatment effect heterogeneity. We consider, in detail, estimation and inference for heterogeneous treatment effects in settings with (possibly many) covariates. These methods allow researchers to explore heterogeneity by identifying subpopulations with different treatment effects while maintaining the ability to construct valid confidence intervals. We also discuss optimal assignment to treatment based on covariates in such settings. Finally, we discuss estimation and inference in experiments in settings with interactions between units, both in general network settings and in settings where the population is partitioned into groups with all interactions contained within these groups.
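
The contrast between intention-to-treat and instrumental variables analyses under noncompliance can be illustrated with a small sketch. It is not taken from the chapter: the function name `itt_and_wald`, the toy data, and the choice of the simple Wald ratio as the instrumental variables estimator are all assumptions made here for illustration.

```python
from statistics import mean


def itt_and_wald(z, d, y):
    """Hypothetical helper: z = random assignment (0/1),
    d = treatment actually received (0/1), y = outcome."""
    y_assigned = [yi for zi, yi in zip(z, y) if zi == 1]
    y_control = [yi for zi, yi in zip(z, y) if zi == 0]
    d_assigned = [di for zi, di in zip(z, d) if zi == 1]
    d_control = [di for zi, di in zip(z, d) if zi == 0]

    # Intention-to-treat effect: compare outcomes by assignment, ignoring compliance.
    itt_y = mean(y_assigned) - mean(y_control)
    # Effect of assignment on treatment received (the compliance rate here).
    itt_d = mean(d_assigned) - mean(d_control)
    if itt_d == 0:
        raise ValueError("assignment does not change treatment received")
    # Wald ratio: rescales the ITT effect by the effect of assignment on uptake.
    return itt_y, itt_y / itt_d


# Toy data (made up): half of the assigned units actually take the treatment.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]
y = [3, 3, 1, 1, 1, 1, 1, 1]
itt, wald = itt_and_wald(z, d, y)
print("ITT:", itt, "Wald IV:", wald)
```

With partial compliance the ratio estimate is larger in magnitude than the intention-to-treat estimate, because the same outcome difference is attributed only to the units whose treatment status was changed by assignment.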

268 citations

Journal Article•DOI•

The prior can generally only be understood in the context of the likelihood

Andrew Gelman, Daniel Simpson, Michael Betancourt

24 Aug 2017-arXiv: Methodology

TL;DR: This paper resolves an apparent paradox in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood.

Abstract: A key sticking point of Bayesian analysis is the choice of prior distribution, and there is a vast literature on potential defaults including uniform priors, Jeffreys' priors, reference priors, maximum entropy priors, and weakly informative priors. These methods, however, often manifest a key conceptual tension in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood. In this paper we resolve this apparent paradox by placing the choice of prior into the context of the entire Bayesian analysis, from inference to prediction to model evaluation.

264 citations

Posted Content•

A Case Study Competition Among Methods for Analyzing Large Spatial Data

Matthew J. Heaton, Abhirup Datta, Andrew O. Finley, Reinhard Furrer, Rajarshi Guhaniyogi, Florian Gerber, Robert B. Gramacy, Dorit Hammerling, Matthias Katzfuss, Finn Lindgren, Douglas Nychka, Furong Sun, Andrew Zammit-Mangion

13 Oct 2017-arXiv: Methodology

TL;DR: This study gives an introductory overview of several methods for analyzing large spatial data and reports a predictive competition among them: each research group implemented its own method to produce predictions at given locations, and the implementations were run on a common computing environment and compared on various predictive diagnostics.

Abstract: The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each implementation was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online.

243 citations

Journal Article•DOI•

Abandon Statistical Significance.

Blakeley B. McShane¹, David Gal², Andrew Gelman³, Christian P. Robert⁴, Jennifer L. Tackett¹

Northwestern University¹, University of Illinois at Chicago², Columbia University³, Paris Dauphine University⁴

22 Sep 2017-arXiv: Methodology

TL;DR: The authors discuss the problems that the null hypothesis significance testing (NHST) paradigm poses for replication and, more broadly, for the biomedical and social sciences, argue that modified p-value thresholds, confidence intervals, and Bayes factors leave these problems unresolved, and propose abandoning statistical significance.

Abstract: We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unresolved by proposals involving modified p-value thresholds, confidence intervals, and Bayes factors. We then discuss our own proposal, which is to abandon statistical significance. We recommend dropping the NHST paradigm--and the p-value thresholds intrinsic to it--as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to "ban" p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. We offer recommendations for how our proposal can be implemented in the scientific publication process as well as in statistical decision making more broadly.

232 citations

Journal Article•DOI•

Sparsity information and regularization in the horseshoe and other shrinkage priors

Juho Piironen, Aki Vehtari¹

Helsinki Institute for Information Technology¹

06 Jul 2017-arXiv: Methodology

TL;DR: This paper introduces the regularized horseshoe, a generalization of the horseshoe prior that allows a minimum level of regularization to be specified for the largest coefficients; it can be viewed as the continuous counterpart of the spike-and-slab prior with a finite slab width.

Abstract: The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but has previously suffered from two problems. First, there has been no systematic way of specifying a prior for the global shrinkage hyperparameter based on the prior information about the degree of sparsity in the parameter vector. Second, the horseshoe prior has the undesired property that there is no possibility of specifying separately information about sparsity and the amount of regularization for the largest coefficients, which can be problematic with weakly identified parameters, such as the logistic regression coefficients in the case of data separation. This paper proposes solutions to both of these problems. We introduce a concept of effective number of nonzero parameters, show an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions, and argue that the previous default choices are dubious based on their tendency to favor solutions with more unshrunk parameters than we typically expect a priori. Moreover, we introduce a generalization to the horseshoe prior, called the regularized horseshoe, that allows us to specify a minimum level of regularization to the largest values. We show that the new prior can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spike-and-slab with an infinitely wide slab. Numerical experiments on synthetic and real world data illustrate the benefit of both of these theoretical advances.
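
The abstract gives no formulas. One common way to write the regularized horseshoe is sketched below; the symbols (coefficients \theta_j, local scales \lambda_j, global scale \tau, slab width c) are notation assumed here rather than quoted from the listing, so the paper itself should be consulted for the exact specification and the recommended hyperpriors.

```latex
\begin{align}
  \theta_j \mid \lambda_j, \tau, c &\sim \mathcal{N}\!\left(0,\ \tau^2 \tilde{\lambda}_j^2\right),
  &
  \tilde{\lambda}_j^2 &= \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2},
  &
  \lambda_j &\sim \mathrm{C}^{+}(0, 1).
\end{align}
% Small coefficients: \tau^2 \lambda_j^2 \ll c^2 gives \tilde{\lambda}_j^2 \approx \lambda_j^2,
%   so the prior behaves like the original horseshoe.
% Large coefficients: \tau^2 \lambda_j^2 \gg c^2 gives \tau^2 \tilde{\lambda}_j^2 \approx c^2,
%   so the prior approaches a Gaussian slab \mathcal{N}(0, c^2).
% Letting c \to \infty recovers the original horseshoe (infinitely wide slab).
```

Read this way, a finite c is what supplies the minimum level of regularization for the largest coefficients.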

227 citations

Posted Content•

A Tutorial on Kernel Density Estimation and Recent Advances

Yen-Chi Chen¹

University of Washington¹

12 Apr 2017-arXiv: Methodology

TL;DR: This tutorial provides a gentle introduction to kernel density estimation (KDE) and recent advances regarding confidence bands and geometric/topological features, and illustrates how one can use KDE to estimate a cumulative distribution function and a receiver operating characteristic curve.

Abstract: This tutorial provides a gentle introduction to kernel density estimation (KDE) and recent advances regarding confidence bands and geometric/topological features. We begin with a discussion of basic properties of KDE: the convergence rate under various metrics, density derivative estimation, and bandwidth selection. Then, we introduce common approaches to the construction of confidence intervals/bands, and we discuss how to handle bias. Next, we talk about recent advances in the inference of geometric and topological features of a density function using KDE. Finally, we illustrate how one can use KDE to estimate a cumulative distribution function and a receiver operating characteristic curve. We provide R implementations related to this tutorial at the end.
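
As a rough illustration of the ideas named in the abstract, the sketch below builds a Gaussian-kernel density estimate and the cumulative distribution function obtained by integrating it. It is written in Python rather than the R of the tutorial's own implementations, and the function names, the simulated data, and the normal-reference rule-of-thumb bandwidth are assumptions made here, not the tutorial's recommendations.

```python
import math
import random
import statistics


def normal_reference_bandwidth(sample):
    """Rule-of-thumb bandwidth h = 1.06 * sd * n^(-1/5) (an assumption of this sketch)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def kde_density(sample, h):
    """Return a function x -> Gaussian-kernel density estimate at x."""
    scale = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density


def kde_cdf(sample, h):
    """Return a function x -> CDF estimate obtained by integrating the KDE."""
    def cdf(x):
        # The integral of a Gaussian kernel centred at xi is a normal CDF.
        total = sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
                    for xi in sample)
        return total / len(sample)

    return cdf


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]  # simulated standard normal data
h = normal_reference_bandwidth(sample)
density = kde_density(sample, h)
cdf = kde_cdf(sample, h)
print(f"h = {h:.3f}, density(0) = {density(0.0):.3f}, cdf(0) = {cdf(0.0):.3f}")
```

The bandwidth h controls the bias-variance trade-off: a larger h smooths more and flattens the estimated peak, which is why bandwidth selection is treated as a core topic.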

208 citations

Journal Article•DOI•

Using simulation studies to evaluate statistical methods

Tim P. Morris, Ian R. White, Michael J. Crowther¹

University of Leicester¹

08 Dec 2017-arXiv: Methodology

TL;DR: In this paper, the authors present a structured approach for planning and reporting simulation studies, which involves defining aims, data-generating mechanisms, estimands, methods and performance measures.

Abstract: Simulation studies are computer experiments that involve creating data by pseudorandom sampling. The key strength of simulation studies is the ability to understand the behaviour of statistical methods because some 'truth' (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias. While widely used, simulation studies are often poorly designed, analysed and reported. This tutorial outlines the rationale for using simulation studies and offers guidance for design, execution, analysis, reporting and presentation. In particular, this tutorial provides: a structured approach for planning and reporting simulation studies, which involves defining aims, data-generating mechanisms, estimands, methods and performance measures ('ADEMP'); coherent terminology for simulation studies; guidance on coding simulation studies; a critical discussion of key performance measures and their estimation; guidance on structuring tabular and graphical presentation of results; and new graphical presentations. With a view to describing recent practice, we review 100 articles taken from Volume 34 of Statistics in Medicine that included at least one simulation study and identify areas for improvement.
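
A minimal sketch of such a simulation study, laid out along the ADEMP headings, might look as follows. The specific aim (comparing two variance estimators), the settings, and the function name `run_simulation` are hypothetical choices made here for illustration; the Monte Carlo standard error of the bias is computed as the standard deviation of the estimates divided by the square root of the number of repetitions.

```python
import math
import random
import statistics


def run_simulation(n_sim=2000, n_obs=20, true_var=4.0, seed=1):
    """Hypothetical ADEMP-style study.
    Aim: compare the bias of two variance estimators.
    Data-generating mechanism: n_obs draws from N(0, true_var).
    Estimand: the variance true_var.
    Methods: divide-by-n and divide-by-(n-1) estimators.
    Performance measure: bias, with its Monte Carlo standard error."""
    rng = random.Random(seed)  # fixed seed so the study can be reproduced
    sd = math.sqrt(true_var)
    estimates = {"divide_by_n": [], "divide_by_n_minus_1": []}
    for _ in range(n_sim):
        data = [rng.gauss(0.0, sd) for _ in range(n_obs)]
        # Both methods are applied to the same simulated dataset.
        estimates["divide_by_n"].append(statistics.pvariance(data))
        estimates["divide_by_n_minus_1"].append(statistics.variance(data))

    summary = {}
    for method, values in estimates.items():
        bias = statistics.mean(values) - true_var
        mcse = statistics.stdev(values) / math.sqrt(n_sim)
        summary[method] = (bias, mcse)
    return summary


summary = run_simulation()
for method, (bias, mcse) in summary.items():
    print(f"{method}: bias = {bias:.3f} (Monte Carlo SE {mcse:.3f})")
```

Reporting the Monte Carlo standard error alongside each performance measure shows whether the number of repetitions is large enough to distinguish the methods.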

192 citations

Posted Content•

Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects

P. Richard Hahn¹, Jared S. Murray², Carlos M. Carvalho²

Arizona State University¹, University of Texas at Austin²

29 Jun 2017-arXiv: Methodology

TL;DR: This article proposed a Bayesian causal forest model for estimating heterogeneous treatment effects from observational data, which is geared specifically towards situations with small effect sizes, heterogeneous effects, and strong confounding.

Abstract: This paper presents a novel nonlinear regression model for estimating heterogeneous treatment effects from observational data, geared specifically towards situations with small effect sizes, heterogeneous effects, and strong confounding. Standard nonlinear regression models, which may work quite well for prediction, have two notable weaknesses when used to estimate heterogeneous treatment effects. First, they can yield badly biased estimates of treatment effects when fit to data with strong confounding. The Bayesian causal forest model presented in this paper avoids this problem by directly incorporating an estimate of the propensity function in the specification of the response model, implicitly inducing a covariate-dependent prior on the regression function. Second, standard approaches to response surface modeling do not provide adequate control over the strength of regularization over effect heterogeneity. The Bayesian causal forest model permits treatment effect heterogeneity to be regularized separately from the prognostic effect of control variables, making it possible to informatively "shrink to homogeneity". We illustrate these benefits via the reanalysis of an observational study assessing the causal effects of smoking on medical expenditures as well as extensive simulation studies.

179 citations

Posted Content•

Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso

Trevor Hastie, Robert Tibshirani, Ryan J. Tibshirani

27 Jul 2017-arXiv: Methodology

TL;DR: Following Bertsimas et al. (2016), who formulated best subset selection as a mixed integer optimization (MIO) problem, this paper presents an expanded set of simulations and finds that the relaxed lasso is the overall winner, performing about as well as the lasso in low-SNR scenarios and as well as best subset selection in high-SNR scenarios.

Abstract: In exciting new work, Bertsimas et al. (2016) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than what was thought possible in the statistics community. They presented empirical comparisons of best subset selection with other popular variable selection procedures, in particular, the lasso and forward stepwise selection. Surprisingly (to us), their simulations suggested that best subset selection consistently outperformed both methods in terms of prediction accuracy. Here we present an expanded set of simulations to shed more light on these comparisons. The summary is roughly as follows: (a) neither best subset selection nor the lasso uniformly dominates the other, with best subset selection generally performing better in high signal-to-noise ratio (SNR) regimes, and the lasso better in low SNR regimes; (b) best subset selection and forward stepwise perform quite similarly throughout; (c) the relaxed lasso (actually, a simplified version of the original relaxed estimator defined in Meinshausen, 2007) is the overall winner, performing just about as well as the lasso in low SNR scenarios, and as well as best subset selection in high SNR scenarios.

Posted Content•

In reply to Faes et al. and Barnett et al. regarding "A study of problems encountered in Granger causality analysis from a neuroscience perspective"

Patrick A. Stokes, Patrick L. Purdon

29 Sep 2017-arXiv: Methodology

TL;DR: This reply acknowledges that recent computational methods resolve the bias and variance problems of Granger-Geweke causality (GGC) estimated from separate model fits, but argues that the interpretational problems remain because they are inherent to the definition of GGC.

Abstract: This reply is in response to commentaries by Barnett, Barrett, and Seth (arXiv:1708.08001) and Faes, Stramaglia, and Marinazzo (arXiv:1708.06990) on our paper entitled "A study of problems encountered in Granger causality analysis from a neuroscience perspective" (PNAS 114(34):7063-7072, 2017). In our paper, we analyzed several properties of Granger-Geweke causality (GGC) and discussed potential problems in neuroscience applications. We demonstrated: (i) that GGC, estimated using separate model fits, is either severely biased, particularly when the true model is known, or a high variance is introduced to overcome the bias; and (ii) that GGC does not reflect some component dynamics of the system. The commentaries by both Faes et al. and Barnett et al. point out that the computational problems of (i) are resolved by using recent computational methods. We acknowledge that these problems are indeed resolved by these methods. However, the traditional computation using separate model fits continues to be presented and applied. More fundamentally, the interpretational problems stemming from (ii) are not in any way addressed by the improved methods because they are inherent to the definition of GGC. These properties are indeed acknowledged by both commentaries. We have no misconception of the GGC measure and do not claim that these properties are facially wrong. But we do discuss at length how these properties make it inappropriate and misleading for common types of scientific questions, how presentation of GGC results without model estimates is not decipherable, and how the absence of clear statements of questions of interest presents further opportunities for misinterpretation.

Journal Article•DOI•

A general framework for Vecchia approximations of Gaussian processes

Matthias Katzfuss, Joseph Guinness

21 Aug 2017-arXiv: Methodology

TL;DR: It is shown that the general Vecchia approach contains many popular existing GP approximations as special cases, allowing for comparisons among the different methods within a unified framework.

Abstract: Gaussian processes (GPs) are commonly used as models for functions, time series, and spatial fields, but they are computationally infeasible for large datasets. Focusing on the typical setting of modeling data as a GP plus an additive noise term, we propose a generalization of the Vecchia (1988) approach as a framework for GP approximations. We show that our general Vecchia approach contains many popular existing GP approximations as special cases, allowing for comparisons among the different methods within a unified framework. Representing the models by directed acyclic graphs, we determine the sparsity of the matrices necessary for inference, which leads to new insights regarding the computational properties. Based on these results, we propose a novel sparse general Vecchia approximation, which ensures computational feasibility for large spatial datasets but can lead to considerable improvements in approximation accuracy over Vecchia's original approach. We provide several theoretical results and conduct numerical comparisons. We conclude with guidelines for the use of Vecchia approximations in spatial statistics.

Posted Content•

Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition

Vincent Dorie¹, Jennifer Hill, Uri Shalit², Marc Scott, Daniel Cervone

Columbia University¹, Technion – Israel Institute of Technology²

09 Jul 2017-arXiv: Methodology

TL;DR: This paper reports results from the causal inference data analysis challenge launched as part of the 2016 Atlantic Causal Inference Conference, in which 30 competitors took part across two versions of the competition (black box algorithms and do-it-yourself analyses); methods that flexibly model the response surface performed better overall.

Abstract: Statisticians have made great progress in creating methods that reduce our reliance on parametric assumptions. However, this explosion in research has resulted in a breadth of inferential strategies that both create opportunities for more reliable inference and complicate the choices that an applied researcher has to make and defend. Relatedly, researchers advocating for new methods typically compare their method to at best 2 or 3 other causal inference strategies and test using simulations that may or may not be designed to equally tease out flaws in all the competing methods. The causal inference data analysis challenge, "Is Your SATT Where It's At?", launched as part of the 2016 Atlantic Causal Inference Conference, sought to make progress with respect to both of these issues. The researchers creating the data testing grounds were distinct from the researchers submitting methods whose efficacy would be evaluated. Results from 30 competitors across the two versions of the competition (black box algorithms and do-it-yourself analyses) are presented along with post-hoc analyses that reveal information about the characteristics of causal inference strategies and settings that affect performance. The most consistent conclusion was that methods that flexibly model the response surface perform better overall than methods that fail to do so. Finally, new methods are proposed that combine features of several of the top-performing submitted methods.

Journal Article•DOI•

Detecting causal associations in large nonlinear time series datasets

Jakob Runge, Dino Sejdinovic, Seth Flaxman

22 Feb 2017-arXiv: Methodology

TL;DR: A novel causal discovery method that flexibly combines linear or nonlinear conditional independence tests with a causal discovery algorithm to estimate causal networks from large-scale time series datasets is introduced.

Abstract: Identifying causal relationships from observational time series data is a key problem in disciplines such as climate science or neuroscience, where experiments are often not possible. Data-driven causal inference is challenging since datasets are often high-dimensional and nonlinear with limited sample sizes. Here we introduce a novel method that flexibly combines linear or nonlinear conditional independence tests with a causal discovery algorithm to reconstruct causal networks from large-scale time series datasets. We validate the method on a well-established climatic teleconnection connecting the tropical Pacific with extra-tropical temperatures and using large-scale synthetic datasets mimicking the typical properties of real data. The experiments demonstrate that our method outperforms alternative techniques in detection power from small to large-scale datasets and opens up entirely new possibilities to discover causal networks from time series across a range of research fields.

Journal Article•DOI•

A Review of Self-Exciting Spatio-Temporal Point Processes and Their Applications

Alex Reinhart

08 Aug 2017-arXiv: Methodology

TL;DR: This review describes the basic theory of self-exciting spatio-temporal point process models, surveys related estimation and inference techniques from each field, highlights several key applications, and suggests directions for future research.

Abstract: Self-exciting spatio-temporal point process models predict the rate of events as a function of space, time, and the previous history of events. These models naturally capture triggering and clustering behavior, and have been widely used in fields where spatio-temporal clustering of events is observed, such as earthquake modeling, infectious disease, and crime. In the past several decades, advances have been made in estimation, inference, simulation, and diagnostic tools for self-exciting point process models. In this review, I describe the basic theory, survey related estimation and inference techniques from each field, highlight several key applications, and suggest directions for future research.

Journal Article•DOI•

Estimation of causal effects with multiple treatments: a review and new ideas

Michael J. Lopez, Roee Gutman

18 Jan 2017-arXiv: Methodology

TL;DR: This paper reviews current methods for estimating causal effects with more than two treatment options, identifies and contrasts the treatment effects that each method estimates, and proposes matching techniques for multiple, nominal categorical treatments.

Abstract: The propensity score is a common tool for estimating the causal effect of a binary treatment in observational data. In this setting, matching, subclassification, imputation, or inverse probability weighting on the propensity score can reduce the initial covariate bias between the treatment and control groups. With more than two treatment options, however, estimation of causal effects requires additional assumptions and techniques, the implementations of which have varied across disciplines. This paper reviews current methods, and it identifies and contrasts the treatment effects that each one estimates. Additionally, we propose possible matching techniques for use with multiple, nominal categorical treatments, and use simulations to show how such algorithms can yield improved covariate similarity between those in the matched sets, relative to the pre-matched cohort. In sum, this manuscript provides a synopsis of how to notate and use causal methods for categorical treatments.

Journal Article•DOI•

Using stacking to average Bayesian predictive distributions

Yuling Yao, Aki Vehtari, Daniel Simpson, Andrew Gelman

06 Apr 2017-arXiv: Methodology

TL;DR: This work takes the idea of stacking from the point estimation literature and generalizes to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability.

Abstract: The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), pseudo-BMA using AIC-type weighting, and a variant of pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with BB-pseudo-BMA as an approximate alternative when computation cost is an issue.

Posted Content•

Piecewise-Deterministic Markov Chain Monte Carlo

Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, Arnaud Doucet

17 Jul 2017-arXiv: Methodology

TL;DR: Novel MCMC methods that address limitations of existing piecewise-deterministic schemes by bringing together piecewise-deterministic Markov processes, Hamiltonian dynamics and slice sampling are introduced and demonstrated on a variety of applications.

Abstract: A novel class of non-reversible Markov chain Monte Carlo schemes relying on continuous-time piecewise-deterministic Markov Processes has recently emerged. In these algorithms, the state of the Markov process evolves according to a deterministic dynamics which is modified using a Markov transition kernel at random event times. These methods enjoy remarkable features including the ability to update only a subset of the state components while other components implicitly keep evolving and the ability to use an unbiased estimate of the gradient of the log-target while preserving the target as invariant distribution. However, they also suffer from important limitations. The deterministic dynamics used so far do not exploit the structure of the target. Moreover, exact simulation of the event times is feasible for an important yet restricted class of problems and, even when it is, it is application specific. This limits the applicability of these techniques and prevents the development of a generic software implementation of them. We introduce novel MCMC methods addressing these shortcomings. In particular, we introduce novel continuous-time algorithms relying on exact Hamiltonian flows and novel non-reversible discrete-time algorithms which can exploit complex dynamics such as approximate Hamiltonian dynamics arising from symplectic integrators while preserving the attractive features of continuous-time algorithms. We demonstrate the performance of these schemes on a variety of applications.

Posted Content•

Augmented Minimax Linear Estimation

[...]

David A. Hirshberg, Stefan Wager

30 Nov 2017-arXiv: Methodology

TL;DR: In this article, a simple plug-in estimator based on an estimate of the conditional expectation function is proposed, and then corrected by subtracting a minimax linear estimate of its error.

Abstract: Many statistical estimands can be expressed as continuous linear functionals of a conditional expectation function. This includes the average treatment effect under unconfoundedness and generalizations for continuous-valued and personalized treatments. In this paper, we discuss a general approach to estimating such quantities: we begin with a simple plug-in estimator based on an estimate of the conditional expectation function, and then correct the plug-in estimator by subtracting a minimax linear estimate of its error. We show that our method is semiparametrically efficient under weak conditions and observe promising performance on both real and simulated data.

Posted Content•

Causal inference for social network data

[...]

Elizabeth L. Ogburn¹, Oleg Sofrygin², Iván Díaz³, Mark J. van der Laan²•Institutions (3)

Johns Hopkins University¹, University of California, Berkeley², Cornell University³

23 May 2017-arXiv: Methodology

TL;DR: Semiparametric estimation and inference are described for causal effects of specific interest in social network settings, allowing both for dependence due to contagion, or transmission of information across network ties, and for dependence due to latent similarities among nodes sharing ties.

Abstract: We describe semiparametric estimation and inference for causal effects using observational data from a single social network. Our asymptotic result is the first to allow for dependence of each observation on a growing number of other units as sample size increases. While previous methods have generally implicitly focused on one of two possible sources of dependence among social network observations, we allow for both dependence due to transmission of information across network ties, and for dependence due to latent similarities among nodes sharing ties. We describe estimation and inference for new causal effects that are specifically of interest in social network settings, such as interventions on network ties and network structure. Using our methods to reanalyze the Framingham Heart Study data used in one of the most influential and controversial causal analyses of social network data, we find that after accounting for network structure there is no evidence for the causal effects claimed in the original paper.

Posted Content•

Bayesian Regression Tree Ensembles that Adapt to Smoothness and Sparsity

[...]

Antonio R. Linero¹, Yun Yang²•Institutions (2)

Florida State University¹, University of Illinois at Urbana–Champaign²

29 Jul 2017-arXiv: Methodology

TL;DR: This work implements sparsity inducing soft decision trees in which the decisions are treated as probabilistic and adapts to the unknown smoothness and sparsity levels, and can be implemented by making minimal modifications to existing Bayesian additive regression tree algorithms.

Abstract: Ensembles of decision trees are a useful tool for obtaining flexible estimates of regression functions. Examples of these methods include gradient boosted decision trees, random forests, and Bayesian CART. Two potential shortcomings of tree ensembles are their lack of smoothness and vulnerability to the curse of dimensionality. We show that these issues can be overcome by instead considering sparsity inducing soft decision trees in which the decisions are treated as probabilistic. We implement this in the context of the Bayesian additive regression trees framework, and illustrate its promising performance through testing on benchmark datasets. We provide strong theoretical support for our methodology by showing that the posterior distribution concentrates at the minimax rate (up to a logarithmic factor) for sparse functions and functions with additive structures in the high-dimensional regime where the dimensionality of the covariate space is allowed to grow near exponentially in the sample size. Our method also adapts to the unknown smoothness and sparsity levels, and can be implemented by making minimal modifications to existing BART algorithms.

Posted Content•

Nonparametric causal effects based on incremental propensity score interventions

[...]

Edward H. Kennedy¹•Institutions (1)

Carnegie Mellon University¹

01 Apr 2017-arXiv: Methodology

TL;DR: This work characterizes incremental interventions and gives identifying conditions for corresponding effects, and develops general efficiency theory, proposes efficient nonparametric estimators that can attain fast convergence rates even when incorporating flexible machine learning, and explores finite-sample performance via simulation.

Abstract: Most work in causal inference considers deterministic interventions that set each unit's treatment to some fixed value. However, under positivity violations these interventions can lead to non-identification, inefficiency, and effects with little practical relevance. Further, corresponding effects in longitudinal studies are highly sensitive to the curse of dimensionality, resulting in widespread use of unrealistic parametric models. We propose a novel solution to these problems: incremental interventions that shift propensity score values rather than set treatments to fixed values. Incremental interventions have several crucial advantages. First, they avoid positivity assumptions entirely. Second, they require no parametric assumptions and yet still admit a simple characterization of longitudinal effects, independent of the number of timepoints. For example, they allow longitudinal effects to be visualized with a single curve instead of lists of coefficients. After characterizing these incremental interventions and giving identifying conditions for corresponding effects, we also develop general efficiency theory, propose efficient nonparametric estimators that can attain fast convergence rates even when incorporating flexible machine learning, and propose a bootstrap-based confidence band and simultaneous test of no treatment effect. Finally we explore finite-sample performance via simulation, and apply the methods to study time-varying sociological effects of incarceration on entry into marriage.
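
As a rough illustration of what "shifting propensity score values" can mean, the sketch below multiplies the odds of treatment by a factor `delta`. The abstract does not spell out the functional form, so this odds-multiplying shift, the function name `shift_propensity` and the parameter name `delta` are assumptions made for illustration only.

```python
def shift_propensity(pi, delta):
    """Propensity score after multiplying the odds of treatment by delta.

    Assumed form (not stated in the abstract): q = delta*pi / (delta*pi + 1 - pi).
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    return delta * pi / (delta * pi + 1.0 - pi)


# Doubling the odds of a unit with propensity 0.5 (odds 1) gives odds 2.
shifted = shift_propensity(0.5, 2.0)
```

Under this assumed form, a unit whose propensity score is 0 or 1 keeps that value for any `delta`, which is one way to see why such a shift would not need a positivity assumption.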

Journal Article•DOI•

Variable Selection Methods for Model-based Clustering

[...]

Michael Fop, Thomas Brendan Murphy

02 Jul 2017-arXiv: Methodology

TL;DR: This review provides a summary of the methods developed for variable selection in model-based clustering and existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.

Abstract: Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.

Posted Content•

Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification

[...]

Jean Feng, Noah Simon

21 Nov 2017-arXiv: Methodology

TL;DR: This manuscript proposes fitting a neural network with a sparse group lasso penalty on the first-layer input weights, which results in a neural net that only uses a small subset of the original features, and characterize the statistical convergence of the penalized empirical risk minimizer to the optimal neural network.

Abstract: Neural networks are usually not the tool of choice for nonparametric high-dimensional problems where the number of input features is much larger than the number of observations. Though neural networks can approximate complex multivariate functions, they generally require a large number of training observations to obtain reasonable fits, unless one can learn the appropriate network structure. In this manuscript, we show that neural networks can be applied successfully to high-dimensional settings if the true function falls in a low dimensional subspace, and proper regularization is used. We propose fitting a neural network with a sparse group lasso penalty on the first-layer input weights. This results in a neural net that only uses a small subset of the original features. In addition, we characterize the statistical convergence of the penalized empirical risk minimizer to the optimal neural network: we show that the excess risk of this penalized estimator only grows with the logarithm of the number of input features; and we show that the weights of irrelevant features converge to zero. Via simulation studies and data analyses, we show that these sparse-input neural networks outperform existing nonparametric high-dimensional estimation methods when the data has complex higher-order interactions.

Posted Content•

Unbiased Markov chain Monte Carlo with couplings

[...]

Pierre Jacob¹, John O'Leary¹, Yves F. Atchadé²•Institutions (2)

Harvard University¹, Boston University²

11 Aug 2017-arXiv: Methodology

TL;DR: Couplings of Markov chains, together with a telescopic sum argument of Glynn and Rhee (2014), are used to remove the bias of MCMC estimators; the theoretical validity of the resulting estimators is established and their efficiency relative to the underlying MCMC algorithms is studied.

Abstract: Markov chain Monte Carlo (MCMC) methods provide consistent approximations of integrals as the number of iterations goes to infinity. MCMC estimators are generally biased after any fixed number of iterations. We propose to remove this bias by using couplings of Markov chains together with a telescopic sum argument of Glynn and Rhee (2014). The resulting unbiased estimators can be computed independently in parallel. We discuss practical couplings for popular MCMC algorithms. We establish the theoretical validity of the proposed estimators and study their efficiency relative to the underlying MCMC algorithms. Finally, we illustrate the performance and limitations of the method on toy examples, on an Ising model around its critical temperature, on a high-dimensional variable selection problem, and on an approximation of the cut distribution arising in Bayesian inference for models made of multiple modules.

Posted Content•

Estimating network memberships by simplex vertex hunting

[...]

Jiashun Jin, Zheng Tracy Ke, Shengming Luo

25 Aug 2017-arXiv: Methodology

TL;DR: A new approach to membership estimation, Mixed-SCORE, is proposed with an easy-to-use Vertex Hunting step, and its convergence rate is derived using delicate spectral analysis, especially tight row-wise deviation bounds for $\hat{R}$.

Abstract: Consider an undirected mixed membership network with $n$ nodes and $K$ communities. For each node $i$, $1 \leq i \leq n$, we model the membership by a Probability Mass Function (PMF) $\pi_{i} = (\pi_{i}(1), \pi_{i}(2), \ldots, \pi_{i}(K))'$, where $\pi_{i}(k)$ is the probability that node $i$ belongs to community $k$, $1 \leq k \leq K$. We call node $i$ "pure" if $\pi_i$ is degenerate and "mixed" otherwise. The primary interest is to estimate $\pi_i$, $1 \leq i \leq n$. We model the adjacency matrix $A$ with a Degree Corrected Mixed Membership (DCMM) model, and let $\hat{\xi}_1, \hat{\xi}_2, \ldots, \hat{\xi}_K$ be the first $K$ eigenvectors. Define a matrix $\hat{R} \in \mathbb{R}^{n, K-1}$ by $\hat{R}(i,k) = \hat{\xi}_{k+1}(i)/\hat{\xi}_1(i)$, $1 \leq k \leq K-1$, $1 \leq i \leq n$. The oracle counterpart of $\hat{R}$ (denoted by $R$) under the DCMM model contains all information we need for the memberships. In fact, we have an interesting insight: there is a simplex ${\cal S}$ in $\mathbb{R}^{K-1}$ such that row $i$ of $R$ corresponds to a vertex of ${\cal S}$ if node $i$ is pure, and corresponds to an interior point of ${\cal S}$ otherwise. Vertex Hunting (i.e., estimating the vertices of ${\cal S}$) is therefore the key to our problem. We propose a new approach Mixed-SCORE to membership estimation, with an easy-to-use Vertex Hunting step. We derive the convergence rate of Mixed-SCORE using delicate spectral analysis, especially tight row-wise deviation bounds for $\hat{R}$. We have also applied it to $4$ network data sets with encouraging results.
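
The construction of $\hat{R}$ in the abstract can be sketched in a few lines of NumPy. This is only an illustration of the entrywise ratio $\hat{R}(i,k) = \hat{\xi}_{k+1}(i)/\hat{\xi}_1(i)$, not the Mixed-SCORE procedure itself: the Vertex Hunting step is not shown, the function name `score_ratio_matrix` is hypothetical, and reading "first $K$ eigenvectors" as those with the $K$ largest eigenvalues in absolute value is an assumption.

```python
import numpy as np


def score_ratio_matrix(A, K):
    """Return the n x (K-1) matrix with entries xi_{k+1}(i) / xi_1(i).

    Assumption: the "first K eigenvectors" are those whose eigenvalues are
    largest in absolute value. Entries of the leading eigenvector are assumed
    to be nonzero, which holds for a connected network.
    """
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)       # A is symmetric (undirected network)
    order = np.argsort(-np.abs(eigvals))[:K]   # indices of the K leading eigenvalues
    xi = eigvecs[:, order]                     # columns xi_1, ..., xi_K
    return xi[:, 1:] / xi[:, [0]]              # divide each later column by xi_1


# Toy example: complete graph on 4 nodes, K = 2 communities.
A = np.ones((4, 4)) - np.eye(4)
R_hat = score_ratio_matrix(A, K=2)
```

Each column of the result is defined only up to sign, because eigenvectors are; any downstream step would need to be insensitive to that.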

Posted Content•

Nested Markov Properties for Acyclic Directed Mixed Graphs

[...]

Thomas S. Richardson, Robin J. Evans, James M. Robins, Ilya Shpitser

23 Jan 2017-arXiv: Methodology

TL;DR: In this article, a graphical model associated with acyclic directed mixed graphs (ADMGs) is defined, and the authors show that marginal distributions of DAG models lie in this model, and prove that a characterization of these constraints given in (Tian and Pearl, 2002b) gives an alternative definition of the model.

Abstract: Directed acyclic graph (DAG) models may be characterized in at least four different ways: via a factorization, the d-separation criterion, the moralization criterion, and the local Markov property. As pointed out by Robins (1986, 1999), Verma and Pearl (1990), and Tian and Pearl (2002b), marginals of DAG models also imply equality constraints that are not conditional independences. The well-known 'Verma constraint' is an example. Constraints of this type were used for testing edges (Shpitser et al., 2009), and an efficient marginalization scheme via variable elimination (Shpitser et al., 2011). We show that equality constraints like the 'Verma constraint' can be viewed as conditional independences in kernel objects obtained from joint distributions via a fixing operation that generalizes conditioning and marginalization. We use these constraints to define, via Markov properties and a factorization, a graphical model associated with acyclic directed mixed graphs (ADMGs). We show that marginal distributions of DAG models lie in this model, prove that a characterization of these constraints given in Tian and Pearl (2002b) gives an alternative definition of the model, and finally show that the fixing operation we used to define the model can be used to give a particularly simple characterization of identifiable causal effects in hidden variable graphical causal models.

Journal Article•DOI•

Fast calibrated additive quantile regression

[...]

Matteo Fasiolo¹, Simon N. Wood¹, Margaux Zaffran², Raphael Nedellec³, Yannig Goude³•Institutions (3)

University of Bristol¹, Superior National School of Advanced Techniques², Électricité de France³

11 Jul 2017-arXiv: Methodology

TL;DR: A novel framework for fitting additive quantile regression models, which provides well-calibrated inference about the conditional quantiles and fast automatic estimation of the smoothing parameters, for model structures as diverse as those usable with distributional generalized additive models, while maintaining equivalent numerical efficiency and stability is proposed.

Abstract: We propose a novel framework for fitting additive quantile regression models, which provides well calibrated inference about the conditional quantiles and fast automatic estimation of the smoothing parameters, for model structures as diverse as those usable with distributional GAMs, while maintaining equivalent numerical efficiency and stability. The proposed methods are at once statistically rigorous and computationally efficient, because they are based on the general belief updating framework of Bissiri et al. (2016) to loss based inference, but compute by adapting the stable fitting methods of Wood et al. (2016). We show how the pinball loss is statistically suboptimal relative to a novel smooth generalisation, which also gives access to fast estimation methods. Further, we provide a novel calibration method for efficiently selecting the 'learning rate' balancing the loss with the smoothing priors during inference, thereby obtaining reliable quantile uncertainty estimates. Our work was motivated by a probabilistic electricity load forecasting application, used here to demonstrate the proposed approach. The methods described here are implemented by the qgam R package, available on the Comprehensive R Archive Network (CRAN).
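
For readers unfamiliar with the pinball loss mentioned in the abstract, a minimal sketch of its standard definition follows. It shows only the ordinary (non-smooth) loss for a quantile level `tau`; the smooth generalisation proposed in the paper and the qgam implementation are not reproduced here, and the function name `pinball_loss` is chosen for illustration.

```python
import numpy as np


def pinball_loss(y, q, tau):
    """Mean pinball (check) loss of predictions q for quantile level tau.

    For residual u = y - q the loss is tau * u when u >= 0 and
    (tau - 1) * u when u < 0, i.e. max(tau * u, (tau - 1) * u).
    """
    u = np.asarray(y, dtype=float) - np.asarray(q, dtype=float)
    return float(np.mean(np.maximum(tau * u, (tau - 1.0) * u)))


# At tau = 0.9, under-prediction is penalised nine times as much as over-prediction.
under = pinball_loss([1.0], [0.0], tau=0.9)  # prediction below the observation
over = pinball_loss([0.0], [1.0], tau=0.9)   # prediction above the observation
```

The asymmetry between the two cases is what makes the minimiser of the expected loss the `tau`-th conditional quantile rather than the mean.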

Posted Content•

Better together? Statistical learning in models made of modules

[...]

Pierre Jacob¹, Lawrence M. Murray², Christopher Holmes³, Christian P. Robert⁴•Institutions (4)

Harvard University¹, Uppsala University², University of Oxford³, Paris Dauphine University⁴

29 Aug 2017-arXiv: Methodology

TL;DR: This article investigates why modular approaches might be preferable to the full model in misspecified settings, and proposes principled criteria to choose between modular and full-model approaches.

Abstract: In modern applications, statisticians are faced with integrating heterogeneous data modalities relevant for an inference, prediction, or decision problem. In such circumstances, it is convenient to use a graphical model to represent the statistical dependencies, via a set of connected "modules", each relating to a specific data modality, and drawing on specific domain expertise in their development. In principle, given data, the conventional statistical update then allows for coherent uncertainty quantification and information propagation through and across the modules. However, misspecification of any module can contaminate the estimate and update of others, often in unpredictable ways. In various settings, particularly when certain modules are trusted more than others, practitioners have preferred to avoid learning with the full model in favor of approaches that restrict the information propagation between modules, for example by restricting propagation to only particular directions along the edges of the graph. In this article, we investigate why these modular approaches might be preferable to the full model in misspecified settings. We propose principled criteria to choose between modular and full-model approaches. The question arises in many applied settings, including large stochastic dynamical systems, meta-analysis, epidemiological models, air pollution models, pharmacokinetics-pharmacodynamics, and causal inference with propensity scores.
