Journal ArticleDOI

Recursive partitioning for heterogeneous causal effects

05 Jul 2016-Proceedings of the National Academy of Sciences of the United States of America (National Academy of Sciences)-Vol. 113, Iss: 27, pp 7353-7360
TL;DR: This paper provides a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects, and proposes an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation.
Abstract: In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7% and 22%.
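The honest, sample-splitting idea described in the abstract can be illustrated in a few lines. The sketch below is a simplification, not the authors' causal-tree splitting criterion: it grows an off-the-shelf regression tree on a transformed outcome (whose conditional mean equals the treatment effect when the propensity is a known constant), then computes leaf-wise differences in means on a held-out estimation sample. All data here are simulated.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated randomized experiment: effect is +2 when x0 > 0, else 0.
n = 4000
X = rng.normal(size=(n, 2))
W = rng.integers(0, 2, size=n)            # treatment indicator, p = 0.5
tau = np.where(X[:, 0] > 0, 2.0, 0.0)
Y = X[:, 1] + tau * W + rng.normal(size=n)

# Transformed outcome: E[Y* | X] equals the conditional treatment effect
# when the propensity is a known constant p (here 0.5, by randomization).
p = 0.5
Y_star = Y * W / p - Y * (1 - W) / (1 - p)

# Honest split: one half builds the partition, the other estimates effects.
X_tr, X_est, W_tr, W_est, Ys_tr, Ys_est, Y_tr, Y_est = train_test_split(
    X, W, Y_star, Y, test_size=0.5, random_state=1)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=100)
tree.fit(X_tr, Ys_tr)                     # partition chosen on the training half

# Leaf-wise difference in means on the held-out estimation half.
leaves = tree.apply(X_est)
for leaf in np.unique(leaves):
    m = leaves == leaf
    effect = Y_est[m & (W_est == 1)].mean() - Y_est[m & (W_est == 0)].mean()
    print(f"leaf {leaf}: estimated effect = {effect:.2f} (n = {m.sum()})")
```

Because the estimation half played no role in choosing the splits, the leaf-wise estimates are unbiased within leaves, which is what makes the confidence intervals in the paper valid.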
Citations
Journal ArticleDOI
TL;DR: This paper develops a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm, and shows that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution.
Abstract: Many scientific and engineering challenges—ranging from personalized medicine to customized marketing recommendations—require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.

1,156 citations
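The forest idea in the abstract above can be caricatured by averaging many trees fit to a transformed outcome on random subsamples. This is only a toy stand-in for the double-sample causal trees of Wager and Athey — it omits within-tree honesty and the variance theory needed for valid confidence intervals — but it shows how averaging recovers personalized effect estimates. All names and data are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy randomized experiment: effect is +2 when x0 > 0, else 0; p = 0.5.
n = 4000
X = rng.normal(size=(n, 2))
W = rng.integers(0, 2, size=n)
Y = X[:, 1] + np.where(X[:, 0] > 0, 2.0, 0.0) * W + rng.normal(size=n)
Y_star = Y * (W - 0.5) / 0.25             # transformed outcome, E[Y*|X] = tau(X)

def toy_causal_forest(X, Y_star, X_test, n_trees=200, subsample=0.5):
    """Average trees fit on random subsamples: a crude sketch of a causal
    forest (no honesty within trees, no confidence intervals)."""
    preds = np.zeros(len(X_test))
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
        t = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50)
        t.fit(X[idx], Y_star[idx])
        preds += t.predict(X_test)
    return preds / n_trees

# Effect estimates at two test points, one on each side of the x0 = 0 split.
X_test = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(toy_causal_forest(X, Y_star, X_test))   # roughly [2, 0]
```

Unlike the leaf-wise subgroup estimates of the honest-tree approach, the forest produces a distinct estimate at every test point, which is the sense in which it "personalizes" predictions.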

Journal ArticleDOI
TL;DR: This work presents a way of thinking about machine learning that gives it its own place in the econometric toolbox, and aims to make machine learning algorithms conceptually easier to use by providing a crisper understanding of how these algorithms work, where they excel, and where they can stumble.
Abstract: Machines are increasingly doing “intelligent” things. Face recognition algorithms use a large dataset of photos labeled as having a face or not to estimate a function that predicts the pre...

1,055 citations


Cites background or methods from "Recursive partitioning for heterogeneous causal effects"

  • ...A carefully constructed heterogeneity tree provides valid estimates of treatment effects in every leaf (Athey and Imbens 2016)....

    [...]

  • ...Athey and Imbens (2016) use sample-splitting to obtain valid (conditional) inference on... In particular, we have to avoid “forbidden regressions” (Angrist and Pischke 2008) in which correlation between first-stage residuals and fitted values exists and creates bias in the second stage....

    [...]

Posted Content
TL;DR: This is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference; in experiments, causal forests are found to be substantially more powerful than classical methods based on nearest-neighbor matching.
Abstract: Many scientific and engineering challenges -- ranging from personalized medicine to customized marketing recommendations -- require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.

816 citations


Cites background or methods from "Recursive partitioning for heterogeneous causal effects"

  • ...In growing trees to build our forest, we follow most closely the approach of Athey and Imbens [2016], who propose honest, causal trees, and obtain valid confidence intervals for average treatment effects for each of the subpopulations (leaves) identified by the algorithm. (Instead of personalizing predictions for each individual, this approach only provides treatment effect estimates for leaf-wise subgroups whose size must grow to infinity.) Other related approaches include those of Su et al. [2009] and Zeileis et al. [2008], which build a tree for treatment effects in subgroups and use statistical tests to determine splits; however, these papers do not analyze bias or consistency properties. Finally, we note a growing literature on estimating heterogeneous treatment effects using different machine learning methods. Imai and Ratkovic [2013], Signorovitch [2007], Tian et al. [2014] and Weisberg and Pontes [2015] develop lasso-like methods for causal inference in a sparse high-dimensional linear setting. Beygelzimer and Langford [2009], Dudík et al. [2011], and others discuss procedures for transforming outcomes that enable off-the-shelf loss minimization methods to be used for optimal treatment policy estimation....

    [...]

  • ...Following Athey and Imbens (2016), our proposed forest is composed of causal trees that estimate the effect of the treatment at the leaves of the trees; we thus refer to our algorithm as a causal forest....

    [...]


  • ...For completeness, we briefly outline the motivation for the splitting rule of Athey and Imbens (2016) we use for our double-sample trees....

    [...]

  • ...We implemented our simulations in R, using the packages causalTree (Athey and Imbens 2016) for building individual trees, randomForestCI (Wager, Hastie, and Efron 2014) for computing V̂IJ , and FNN (Beygelzimer et al. 2013) for k-NN regression....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors discuss recent developments in econometrics that they view as important for empirical researchers working on policy evaluation questions, focusing on three main areas, where in each case they highlight recommendations for applied work.
Abstract: In this paper we discuss recent developments in econometrics that we view as important for empirical researchers working on policy evaluation questions. We focus on three main areas, where in each case we highlight recommendations for applied work. First, we discuss new research on identification strategies in program evaluation, with particular focus on synthetic control methods, regression discontinuity, external validity, and the causal interpretation of regression methods. Second, we discuss various forms of supplementary analyses to make the identification strategies more credible. These include placebo analyses as well as sensitivity and robustness analyses. Third, we discuss recent advances in machine learning methods for causal effects. These advances include methods to adjust for differences between treated and control units in high-dimensional settings, and methods for identifying and estimating heterogeneous treatment effects.

664 citations

Journal ArticleDOI
TL;DR: A metalearner, the X-learner, is proposed, which can adapt to structural properties, such as the smoothness and sparsity of the underlying treatment effect, and is shown to be easy to use and to produce results that are interpretable.
Abstract: There is growing interest in estimating and analyzing heterogeneous treatment effects in experimental and observational studies. We describe a number of metaalgorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate the conditional average treatment effect (CATE) function. Metaalgorithms build on base algorithms-such as random forests (RFs), Bayesian additive regression trees (BARTs), or neural networks-to estimate the CATE, a function that the base algorithms are not designed to estimate directly. We introduce a metaalgorithm, the X-learner, that is provably efficient when the number of units in one treatment group is much larger than in the other and can exploit structural properties of the CATE function. For example, if the CATE function is linear and the response functions in treatment and control are Lipschitz-continuous, the X-learner can still achieve the parametric rate under regularity conditions. We then introduce versions of the X-learner that use RF and BART as base learners. In extensive simulation studies, the X-learner performs favorably, although none of the metalearners is uniformly the best. In two persuasion field experiments from political science, we demonstrate how our X-learner can be used to target treatment regimes and to shed light on underlying mechanisms. A software package is provided that implements our methods.

546 citations
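The three X-learner stages described in the abstract translate directly into code. The sketch below is an illustrative implementation on simulated data, using random forests as base learners; the constant propensity weight g = 0.5 is an assumption that holds under the simulated randomization, and would be estimated in observational data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated experiment: effect is +1 when x0 > 0, else 0.
n = 3000
X = rng.normal(size=(n, 2))
W = rng.integers(0, 2, size=n)
tau = np.where(X[:, 0] > 0, 1.0, 0.0)
Y = X[:, 1] + tau * W + rng.normal(scale=0.5, size=n)

# Stage 1: outcome models for each arm (the base fits).
mu0 = RandomForestRegressor(n_estimators=100).fit(X[W == 0], Y[W == 0])
mu1 = RandomForestRegressor(n_estimators=100).fit(X[W == 1], Y[W == 1])

# Stage 2: imputed individual effects, one second-stage regression per arm.
D1 = Y[W == 1] - mu0.predict(X[W == 1])   # treated: observed minus imputed control
D0 = mu1.predict(X[W == 0]) - Y[W == 0]   # control: imputed treated minus observed
tau1 = RandomForestRegressor(n_estimators=100).fit(X[W == 1], D1)
tau0 = RandomForestRegressor(n_estimators=100).fit(X[W == 0], D0)

# Stage 3: combine the two CATE estimates with a propensity weight g(x);
# constant 0.5 here because treatment was randomized with p = 0.5.
g = 0.5
def cate(X_new):
    return g * tau0.predict(X_new) + (1 - g) * tau1.predict(X_new)

print(cate(np.array([[1.0, 0.0], [-1.0, 0.0]])))   # roughly [1, 0]
```

The weighting in stage 3 is what lets the X-learner exploit unbalanced designs: when one arm is much larger, the estimator leans on the second-stage fit from the smaller arm, which uses the well-estimated outcome model of the larger arm.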

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations


"Recursive partitioning for heterogeneous causal effects" refers background in this paper

  • ...methods for a closely related problem, the problem of predicting outcomes as a function of covariates in similar environments. The most popular approaches (e.g. regression trees ([4]), random forests ([3]), LASSO ([24]), support vector machines ([26]), etc.) entail building a model of the relationship between attributes and outcomes, with a penalty parameter that penalizes model complexity. Cross-vali...

    [...]

  • ...exchangeable, and that there is no interference (the stable unit treatment value assumption, or SUTVA [20]). This assumption may be violated in settings where some units are connected through networks. Let p = pr(Wi = 1) be the marginal treatment probability, and let e(x) = pr(Wi = 1|Xi = x) be the conditional treatment probability (the “propensity score” as defined by [17]). In a randomized experiment ...

    [...]

  • ...receive the same prediction. In this paper, we focus on the analogous goal of deriving a partition of the population according to treatment effect heterogeneity, building on standard regression trees ([4], [3]). Whether the ultimate goal in an application is to derive a partition or fully personalized treatment effect estimates depends on the setting; settings where partitions may be desirable include those...

    [...]
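Breiman's abstract above highlights that "internal estimates monitor error": because each tree is trained on a bootstrap sample, the held-out (out-of-bag) observations provide an error estimate without a separate validation set. A minimal illustration with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Random feature selection at each split (max_features="sqrt") plus bagging;
# out-of-bag samples give an internal, cross-validation-free error estimate.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```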

Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.

40,785 citations


"Recursive partitioning for heterogeneous causal effects" refers background in this paper

  • ...Using Bayesian nonparametric methods, they project estimates of heterogeneous treatment effects onto the feature space using LASSO-type regularization methods to get low-dimensional summaries of heterogeneity....

    [...]

  • ...Beyond those previously discussed, Tian et al. (23) transform the features rather than the outcomes and then apply LASSO to the model with the original outcome and the transformed features....

    [...]

  • ...Imai and Ratkovic (25) use LASSO to estimate the effects of both treatments and attributes, but with different penalty terms for the two types of features to allow for the possibility that the treatment effects are present but the magnitudes of the interactions are small....

    [...]

  • ..., regression trees (5), random forests (6), LASSO (7), support vector machines (8), etc....

    [...]

  • ...The most popular approaches [e.g., regression trees (5), random forests (6), LASSO (7), support vector machines (8), etc.] entail building a model of the relationship between attributes and outcomes, with a penalty parameter that penalizes model complexity....

    [...]
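The lasso property described in the abstract above — that the l1 constraint produces coefficients that are exactly zero — is easy to see on synthetic data. A short illustration with scikit-learn (the design, coefficients, and alpha are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Ten candidate features, but only the first two enter the true model.
n = 500
X = rng.normal(size=(n, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))
# The l1 penalty shrinks the relevant coefficients slightly toward zero and
# drives most irrelevant coefficients exactly to zero, giving an
# interpretable, subset-selection-like fit.
print("nonzero coefficients at indices:", np.flatnonzero(model.coef_))
```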

Book
Vladimir Vapnik1
01 Jan 1995
TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Abstract: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?

40,147 citations

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"Recursive partitioning for heterogeneous causal effects" refers methods in this paper

  • ...problem of predicting outcomes as a function of covariates in similar environments. The most popular approaches (e.g. regression trees ([4]), random forests ([3]), LASSO ([24]), support vector machines ([26]), etc.) entail building a model of the relationship between attributes and outcomes, with a penalty parameter that penalizes model complexity. Cross-validation is often used to select the optimal lev...

    [...]

Journal ArticleDOI
TL;DR: The authors discusses the central role of propensity scores and balancing scores in the analysis of observational studies and shows that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates.
Abstract: : The results of observational studies are often disputed because of nonrandom treatment assignment. For example, patients at greater risk may be overrepresented in some treatment group. This paper discusses the central role of propensity scores and balancing scores in the analysis of observational studies. The propensity score is the (estimated) conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: matched sampling on the univariate propensity score which is equal percent bias reducing under more general conditions than required for discriminant matching, multivariate adjustment by subclassification on balancing scores where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and visual representation of multivariate adjustment by a two-dimensional plot. (Author)

23,744 citations
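The Rosenbaum–Rubin abstract above describes subclassification on the propensity score as one application. A minimal sketch on simulated confounded data — the data-generating process and quintile choice are illustrative, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Confounded assignment: x0 raises both treatment probability and the outcome.
n = 5000
X = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-X[:, 0]))
W = rng.binomial(1, p_true)
Y = 2.0 * X[:, 0] + 1.0 * W + rng.normal(size=n)   # true effect = 1

naive = Y[W == 1].mean() - Y[W == 0].mean()
print("naive difference in means:", round(naive, 2))   # biased upward

# Estimate the scalar propensity score and subclassify into quintiles.
e_hat = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]
edges = np.quantile(e_hat, [0, 0.2, 0.4, 0.6, 0.8, 1.0])
strata = np.clip(np.searchsorted(edges, e_hat, side="right") - 1, 0, 4)

# Within each stratum, treated and control are comparable; average the
# stratum-level effects, weighting by stratum size.
effects, weights = [], []
for s in range(5):
    m = strata == s
    effects.append(Y[m & (W == 1)].mean() - Y[m & (W == 0)].mean())
    weights.append(m.sum())
ate_sub = np.average(effects, weights=weights)
print("subclassified estimate:", round(ate_sub, 2))   # much closer to 1
```

Adjusting for the one-dimensional score removes most of the bias from the observed confounder, which is the "balancing score" property the paper establishes in general.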