Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study

Jared Lunceford, Marie Davidian
- 15 Oct 2004 - 
- Vol. 23, Iss: 19, pp 2937-2960
TLDR
The propensity score, the probability of treatment exposure conditional on covariates, is the basis for two approaches to adjusting for confounding: methods based on stratification of observations by quantiles of estimated propensity scores and methods based on weighting observations by the inverse of estimated propensity scores.
Abstract
Estimation of treatment effects with causal interpretation from observational data is complicated because exposure to treatment may be confounded with subject characteristics. The propensity score, the probability of treatment exposure conditional on covariates, is the basis for two approaches to adjusting for confounding: methods based on stratification of observations by quantiles of estimated propensity scores and methods based on weighting observations by the inverse of estimated propensity scores. We review popular versions of these approaches and related methods offering improved precision, describe theoretical properties and highlight their implications for practice, and present extensive comparisons of performance that provide guidance for practical use.

Stratification and Weighting Via the Propensity Score in Estimation of
Causal Treatment Effects: A Comparative Study
Jared K. Lunceford¹∗† and Marie Davidian²
¹ Merck Research Laboratories, RY34-A316, P.O. Box 2000, Rahway, NJ 07065-0900, U.S.A.
² Department of Statistics, North Carolina State University, Box 8203, Raleigh, NC 27695, U.S.A.
SUMMARY
Estimation of treatment effects with causal interpretation from observational data is complicated
because exposure to treatment may be confounded with subject characteristics. The propensity
score, the probability of treatment exposure conditional on covariates, is the basis for two
approaches to adjusting for confounding: methods based on stratification of observations by quantiles
of estimated propensity scores and methods based on weighting observations by the inverse of
estimated propensity scores. We review popular versions of these approaches and related
methods offering improved precision, describe theoretical properties and highlight their implications for
practice, and present extensive comparisons of performance that provide guidance for practical use.
KEY WORDS: covariate balance; double robustness; inverse-probability-of-treatment-
weighted-estimator; observational data.
1. INTRODUCTION
Observational data are often the basis for epidemiological and other investigations seeking to
make inference on the effect of treatment exposure on a response. Randomized studies aim to
balance distributions of subject characteristics across groups, so that groups are similar except
for the treatments. However, with observational data, treatment exposure may be associated with
covariates that are also associated with potential response, and groups may be seriously imbalanced
in these factors. Consequently, unbiased treatment comparisons from observational data require
methods that adjust for such confounding of exposure to treatment with subject characteristics,
and inferences with a causal interpretation cannot be made without appropriate adjustment.
Correspondence to: Jared K. Lunceford, Merck Research Laboratories, RY34-A316, P.O. Box 2000, Rahway, NJ
07065-0900, U.S.A.
E-mail: jared lunceford@merck.com, phone: 732-594-1725
Contract/grant sponsor: NIH; contract/grant numbers: R01-CA085848 and R37-AI031789

For comparing two treatments, “treated” and “control,” say, the propensity score is the proba-
bility of exposure to treatment conditional on observed covariates [1]. Properties of the propensity
score that facilitate causal inferences are given by Rosenbaum and Rubin [1] (see also [2, 3]), and
applications of methods using adjustments based on propensity scores are increasingly widespread,
e.g. [4, 5, 6]. A popular method for estimating the (causal) difference of two treatment means is
that of Rosenbaum and Rubin [7], where individuals are stratified based on estimated propensity
scores and the difference estimated as the average of within-stratum effects. An alternative ap-
proach is to adjust for confounding by using estimated propensity scores to construct weights for
individual observations [8, 9].
In this paper, we review approaches using stratification and weighting based on propensity
scores for making causal inferences from observational data and contrast their performance. A
main objective is to provide a mostly self-contained introduction to these methods and their un-
derpinnings, a description of their properties that highlights insights with implications for practice,
and a demonstration of relative performance that suggests guidelines for application. In Section 2,
we discuss the framework of counterfactuals or potential outcomes [10], which formalizes the notion
of “causal effect,” and assumptions required to justify adjustments for confounding. We describe
popular propensity-score-based approaches, along with some additional methods that may be less
familiar to practitioners and that may improve upon these. Section 3 presents theoretical properties of
the estimators, and Section 4 reports on extensive comparative simulations.
2. ESTIMATORS BASED ON THE PROPENSITY SCORE
2.1 Counterfactual Framework
Let $Z$ be an indicator of observed treatment exposure ($Z = 1$ if treated, $Z = 0$ if control)
and $X$ be a vector of covariates measured prior to receipt of treatment (baseline) or, if measured
post-treatment, not affected by either treatment. Each individual is assumed to have an associated
random vector $(Y_0, Y_1)$, where $Y_0$ and $Y_1$ are the values of the response that would be seen if,
possibly contrary to the fact of what actually happened, s/he were to receive control or treatment,
respectively. Consequently, $Y_0$ and $Y_1$ are referred to as counterfactuals (or potential outcomes)
and may be viewed as inherent characteristics of the individual. The response $Y$ actually observed
is assumed to be that which would be seen under the exposure actually received, formalized as
$$Y = Y_1 Z + (1 - Z) Y_0. \qquad (1)$$
Thus, $(Y, Z, X)$ are observed on each individual. It is important to distinguish between the observed
response $Y$ and the counterfactual responses $Y_0$ and $Y_1$. The latter are hypothetical and may never
be observed simultaneously; however, they are a convenient construct allowing precise statement
of questions of interest, as we now describe.
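
As an illustration of (1), the following Python sketch builds a small hypothetical population in
which both potential outcomes are known and reports the response that would actually be observed.
The function name, the constant individual effect of 2, and the random draws are assumptions made
for this example only; they do not come from the paper.

```python
import random


def observed_response(y0, y1, z):
    """Equation (1): Y = Y1 * Z + (1 - Z) * Y0."""
    return y1 * z + (1 - z) * y0


rng = random.Random(0)

# Hypothetical subjects: both potential outcomes exist for everyone,
# but only the one matching the received exposure Z is ever observed.
subjects = []
for _ in range(5):
    y0 = rng.gauss(0.0, 1.0)   # response under control
    y1 = y0 + 2.0              # response under treatment (assumed effect of 2)
    z = rng.randint(0, 1)      # exposure actually received
    subjects.append((y0, y1, z, observed_response(y0, y1, z)))

for y0, y1, z, y in subjects:
    print(f"Y0={y0:+.2f}  Y1={y1:+.2f}  Z={z}  observed Y={y:+.2f}")
```

In real data only the last two columns, together with covariates, would be available; the first two
are shown here solely because the population is simulated.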
The distributions of $Y_0$ and $Y_1$ may be thought of as representing the hypothetical distributions
of response for the population of individuals were all individuals to receive control or be treated,
respectively, so the means of these distributions correspond to the mean response if all individuals
were to receive each treatment. Hence, a difference in these means would be attributable to, or
caused by, the treatments. Formally, then,
$$\Delta = \mu_1 - \mu_0 = E(Y_1) - E(Y_0)$$
is referred to as the average causal effect (of the treated state relative to control). Estimation of
$\Delta$ is thus of central interest in comparing treatments.
This framework makes it possible to formalize the difficulty in estimating $\Delta$, and thus making
causal statements, from observational data. The counterfactuals are never both observed for any
subject; thus, whether estimation of $\Delta$ is possible relies on whether $E(Y_0)$ and $E(Y_1)$ may be
identified from the observed data $(Y, Z, X)$. The sample average response in the treated group
estimates $E(Y \mid Z = 1)$, the mean of observed responses among subjects observed to be treated,
which from (1) is equal to $E(Y_1 \mid Z = 1)$ but is different from $E(Y_1)$, the mean if the entire population
were treated, and similarly for control. In a randomized trial, as $Z$ is determined for each participant
at random, it is unrelated to how s/he might potentially respond, and thus $(Y_0, Y_1) \perp\!\!\!\perp Z$, where
$\perp\!\!\!\perp$ denotes statistical independence. Here, using (1), we thus have
$E(Y \mid Z = 1) = E(Y_1 \mid Z = 1) = E(Y_1)$, and similarly $E(Y \mid Z = 0) = E(Y_0)$, verifying that the sample average difference is
an unbiased estimator for $\Delta$ with a causal interpretation, as widely accepted. However, in an
observational study, because treatment exposure $Z$ is not controlled, $Z$ may not be independent of
$(Y_0, Y_1)$; indeed, the same characteristics that lead an individual to be exposed to a treatment may
also be associated, or “confounded,” with his/her potential response. In this case,
$E(Y \mid Z = 1) = E(Y_1 \mid Z = 1) \neq E(Y_1)$ and $E(Y \mid Z = 0) = E(Y_0 \mid Z = 0) \neq E(Y_0)$, so that the difference of observed
sample averages is not an unbiased estimator for $\Delta$. It is important to distinguish between the
conditions $(Y_0, Y_1) \perp\!\!\!\perp Z$ and $Y \perp\!\!\!\perp Z$. The former involves potential responses, which are indeed
independent of treatment assignment under randomization, while the latter involves the observed
response and is unlikely to be true under any circumstances unless treatment has no effect.
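
The contrast between randomized and confounded exposure can be checked numerically. The sketch
below is a hypothetical simulation, not one from the paper: it assumes a single binary covariate,
a true average causal effect of 2, and exposure probabilities of 0.8 and 0.2 in the confounded case.
All function and variable names are illustrative.

```python
import random
from statistics import fmean


def simulate(n, propensity, seed):
    """Draw (X, Z, Y) with Y0 = X + noise and Y1 = Y0 + 2, so the true effect is 2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y0 = x + rng.gauss(0.0, 1.0)
        y1 = y0 + 2.0
        z = 1 if rng.random() < propensity(x) else 0
        rows.append((x, z, y1 * z + (1 - z) * y0))  # observed Y from equation (1)
    return rows


def naive_difference(rows):
    """Difference of observed sample averages, treated minus control."""
    treated = [y for _, z, y in rows if z == 1]
    control = [y for _, z, y in rows if z == 0]
    return fmean(treated) - fmean(control)


# Randomized exposure: Z does not depend on X (or on the potential outcomes).
naive_randomized = naive_difference(simulate(100_000, lambda x: 0.5, seed=1))

# Confounded exposure: subjects with X = 1 respond better AND are treated more often.
naive_confounded = naive_difference(
    simulate(100_000, lambda x: 0.8 if x == 1 else 0.2, seed=1)
)

print(f"randomized: {naive_randomized:.3f}   confounded: {naive_confounded:.3f}")
```

Under these assumptions the naive difference should be close to 2 with randomization and close to
2.6 with confounding, since the treated group then contains a larger share of subjects with $X = 1$.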
In an observational study, although $(Y_0, Y_1) \perp\!\!\!\perp Z$ is unlikely to hold, it may be possible to
identify subject characteristics related to both potential response and treatment exposure, referred
to as “confounders.” If we believe that $X$ contains all such confounders, then, for individuals
sharing a particular value of $X$, there would be no association between the exposure states and
the values of potential responses; i.e. treatment exposure among individuals with a particular $X$ is
essentially at random. Formally, $Y_0, Y_1$ are independent of treatment exposure conditional on $X$,
written
$$(Y_0, Y_1) \perp\!\!\!\perp Z \mid X. \qquad (2)$$
Rosenbaum and Rubin [1] refer to (2) as the assumption of strongly ignorable treatment assignment;
(2) has also been called the assumption of no unmeasured confounders [9]. One must appreciate
that (2) is an assumption; willingness to assume (2) requires the analyst to have confidence that $X$
contains all characteristics related to both treatment and response and that there are no additional,
unmeasured such confounders.
The benefit of (2) is that $E(Y_0)$ and $E(Y_1)$ may be identified from $(Y, Z, X)$. The regression
relationship $E(Y \mid Z, X)$ depends only on the observed data, so is identifiable. Then the average for
$Z = 1$ over all $X$ satisfies
$$E\{E(Y \mid Z = 1, X)\} = E\{E(Y_1 \mid Z = 1, X)\} = E\{E(Y_1 \mid X)\} = E(Y_1),$$
where the first equality is from (1), the second follows from (2), and the outer expectation is
with respect to the distribution of $X$; similarly, $E\{E(Y \mid Z = 0, X)\} = E(Y_0)$. Thus, it should be
possible to make inferences on $\Delta$ if (2) may be assumed to hold. Methods using the propensity
score are one way to achieve this.
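
The identification argument can be illustrated with a discrete covariate, for which
$E(Y \mid Z, X)$ is simply a cell mean. The following sketch is a hypothetical example, not the
paper's method: it averages the within-$X$ means of each arm over the marginal distribution of $X$,
using simulated data in which (2) holds by construction and the true effect is assumed to be 2.

```python
import random
from collections import Counter, defaultdict
from statistics import fmean

rng = random.Random(2)

# Simulated observational data: exposure depends on X only, so (2) holds.
rows = []
for _ in range(100_000):
    x = 1 if rng.random() < 0.5 else 0
    y0 = x + rng.gauss(0.0, 1.0)
    y1 = y0 + 2.0                                   # assumed true effect of 2
    z = 1 if rng.random() < (0.8 if x == 1 else 0.2) else 0
    rows.append((x, z, y1 * z + (1 - z) * y0))


def standardized_mean(rows, arm):
    """Estimate E{E(Y | Z = arm, X)}: cell means averaged over the distribution of X."""
    n = len(rows)
    x_counts = Counter(x for x, _, _ in rows)       # marginal distribution of X
    cell = defaultdict(list)
    for x, z, y in rows:
        if z == arm:
            cell[x].append(y)
    return sum(fmean(cell[x]) * count / n for x, count in x_counts.items())


delta_hat = standardized_mean(rows, 1) - standardized_mean(rows, 0)
print(f"standardized estimate of the average causal effect: {delta_hat:.3f}")
```

With many or continuous covariates the cells become too sparse for this direct approach, which is
one motivation for summarizing $X$ through a scalar score.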
2.2 The Propensity Score
The propensity score $e(X) = P(Z = 1 \mid X)$, $0 < e(X) < 1$, is the probability of treatment given
the observed covariates. Rosenbaum and Rubin [1] showed that $X \perp\!\!\!\perp Z \mid e(X)$, so individuals from
either treatment group with the same propensity score are “balanced” in that the distribution of $X$
is the same regardless of exposure status. Rosenbaum and Rubin show that if (2) holds, in addition
$(Y_0, Y_1) \perp\!\!\!\perp Z \mid e(X)$, so that treatment exposure is unrelated to the counterfactuals for individuals
sharing the same propensity score. We now review ways these developments may be exploited to
derive estimators for $\Delta$ from observed data $(Y_i, Z_i, X_i)$, $i = 1, \ldots, n$, an i.i.d. sample containing
both treated and control subjects.
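
The balancing property $X \perp\!\!\!\perp Z \mid e(X)$ can be seen in a small simulation. The sketch
below assumes two binary covariates and a propensity score that depends only on their sum, so that
the distinct covariate values $(1, 0)$ and $(0, 1)$ share the score 0.5. These settings and all
names are hypothetical choices for illustration.

```python
import random

rng = random.Random(3)
E_BY_SUM = {0: 0.2, 1: 0.5, 2: 0.8}   # hypothetical propensity score e(X) as a function of X1 + X2

rows = []
for _ in range(100_000):
    x1 = 1 if rng.random() < 0.3 else 0
    x2 = 1 if rng.random() < 0.6 else 0
    e = E_BY_SUM[x1 + x2]
    z = 1 if rng.random() < e else 0
    rows.append((x1, x2, e, z))


def share_x1(rows, arm, e_value=None):
    """Proportion with X1 = 1 in one arm, optionally restricted to e(X) = e_value."""
    selected = [x1 for x1, _, e, z in rows
                if z == arm and (e_value is None or e == e_value)]
    return sum(selected) / len(selected)


# Marginally, treated and control subjects differ in their covariate distribution ...
overall_gap = share_x1(rows, 1) - share_x1(rows, 0)
# ... but among subjects sharing the propensity score 0.5 the distribution of X1 agrees.
gap_at_half = share_x1(rows, 1, 0.5) - share_x1(rows, 0, 0.5)

print(f"overall gap in P(X1 = 1): {overall_gap:.3f}")
print(f"gap given e(X) = 0.5:     {gap_at_half:.3f}")
```

In the other two score levels $X$ is constant, so balance there is trivial; the level 0.5 is the
informative one in this construction.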
In practice, the propensity score is unlikely to be known, so it is routine to estimate it from
the observed data $(Z_i, X_i)$, $i = 1, \ldots, n$, by assuming that $e(X)$ follows a parametric model, e.g. a
logistic regression model $e(X, \beta) = \{1 + \exp(-X^T \beta)\}^{-1}$, $\beta$ $(p \times 1)$. Interaction and higher-order
terms may also be included. Here, $\beta$ may be estimated by the maximum likelihood (ML) estimator
$\hat{\beta}$ solving
$$\sum_{i=1}^{n} \psi_\beta(Z_i, X_i, \beta) = \sum_{i=1}^{n} \frac{Z_i - e(X_i, \beta)}{e(X_i, \beta)\{1 - e(X_i, \beta)\}} \, \partial/\partial\beta\{e(X_i, \beta)\} = 0. \qquad (3)$$
We assume that the analyst is proficient at modeling $e(X, \beta)$, so that it is correctly specified, and
write $e = e(X, \beta)$ and $e_\beta = \partial/\partial\beta\{e(X, \beta)\}$, with subscript $i$ when evaluated at $X_i$.
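
For the logistic model, $\partial/\partial\beta\{e(X, \beta)\} = e(X, \beta)\{1 - e(X, \beta)\} X$, so (3) reduces to
$\sum_{i=1}^{n} \{Z_i - e(X_i, \beta)\} X_i = 0$. The following sketch solves this reduced equation by
Newton-Raphson for an intercept and one covariate. It is a minimal illustration under assumed
settings (simulated data, hypothetical true coefficients $-0.5$ and $1.0$, a fixed number of
iterations); in practice a standard logistic regression routine would be used.

```python
import math
import random


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def score(b0, b1, xs, zs):
    """Left side of (3) for the logistic model: sum of (Z_i - e_i) * (1, x_i)."""
    s0 = s1 = 0.0
    for x, z in zip(xs, zs):
        r = z - expit(b0 + b1 * x)
        s0 += r
        s1 += r * x
    return s0, s1


def fit_logistic(xs, zs, iterations=25):
    """Newton-Raphson for beta = (b0, b1), starting from zero."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        s0, s1 = score(b0, b1, xs, zs)
        h00 = h01 = h11 = 0.0              # information matrix: sum of e(1 - e) * x x^T
        for x in xs:
            e = expit(b0 + b1 * x)
            w = e * (1.0 - e)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det  # beta <- beta + H^{-1} * score
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1


rng = random.Random(4)
TRUE_B0, TRUE_B1 = -0.5, 1.0               # hypothetical true coefficients
xs = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
zs = [1 if rng.random() < expit(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, zs)
e_hat = [expit(b0_hat + b1_hat * x) for x in xs]   # estimated propensity scores
print(f"beta_hat = ({b0_hat:.3f}, {b1_hat:.3f})")
```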
2.3 Estimation of $\Delta$ Based on Stratification
The popular approach using stratification on estimated propensity scores to estimate $\Delta$ involves
the following steps: (i) Estimate $\beta$ as in (3) and calculate estimated propensity scores $\hat{e}_i = e(X_i, \hat{\beta})$
for all $i$; (ii) form $K$ strata according to the sample quantiles of the $\hat{e}_i$, where the $j$th sample
quantile $\hat{q}_j$, $j = 1, \ldots, K$, is such that the proportion of $\hat{e}_i \le \hat{q}_j$ is roughly $j/K$, $\hat{q}_0 = 0$, and $\hat{q}_K = 1$;
(iii) within each stratum, calculate the difference of sample means of the $Y_i$ for each treatment;
and (iv) estimate $\Delta$ by a weighted sum of the differences of sample means across strata, where
weighting is by the proportion of observations falling in each stratum. Defining $\hat{Q}_j = (\hat{q}_{j-1}, \hat{q}_j]$;
$n_j = \sum_{i=1}^{n} I(\hat{e}_i \in \hat{Q}_j)$, the number of individuals in stratum $j$; and
$n_{1j} = \sum_{i=1}^{n} Z_i I(\hat{e}_i \in \hat{Q}_j)$, the
number of these who are treated, the estimator using a weighted sum is
$$\hat{\Delta}_S = \sum_{j=1}^{K} \left(\frac{n_j}{n}\right) \left\{ n_{1j}^{-1} \sum_{i=1}^{n} Z_i Y_i I(\hat{e}_i \in \hat{Q}_j) - (n_j - n_{1j})^{-1} \sum_{i=1}^{n} (1 - Z_i) Y_i I(\hat{e}_i \in \hat{Q}_j) \right\}. \qquad (4)$$
As the weights $n_j/n \approx K^{-1}$, they may be replaced by $K^{-1}$
to yield an average across strata.
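
Steps (ii) to (iv) and the estimator (4) can be sketched as follows. This is an illustrative
implementation under stated assumptions, not the authors' code: strata are formed by ranking the
scores and splitting them into $K$ nearly equal groups (which presumes no ties), and the
demonstration passes the true propensity scores of a simulated population as stand-ins for the
estimated $\hat{e}_i$. The data-generating settings, including the true effect of 2, are hypothetical.

```python
import math
import random
from statistics import fmean


def stratified_estimate(e_hat, z, y, K=5):
    """Estimator (4): weighted sum over K quantile strata of within-stratum mean differences."""
    n = len(y)
    order = sorted(range(n), key=lambda i: e_hat[i])       # rank subjects by score
    estimate = 0.0
    for j in range(K):
        stratum = order[j * n // K:(j + 1) * n // K]       # roughly n / K subjects
        treated = [y[i] for i in stratum if z[i] == 1]
        control = [y[i] for i in stratum if z[i] == 0]
        if not treated or not control:
            raise ValueError(f"stratum {j + 1} lacks treated or control subjects")
        estimate += (len(stratum) / n) * (fmean(treated) - fmean(control))
    return estimate


# Simulated data: X confounds exposure and response; the true effect is 2.
rng = random.Random(5)
n = 50_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
e_true = [1.0 / (1.0 + math.exp(-xi)) for xi in x]         # stand-in for estimated scores
z = [1 if rng.random() < ei else 0 for ei in e_true]
y = [xi + rng.gauss(0.0, 1.0) + 2.0 * zi for xi, zi in zip(x, z)]

naive = fmean(yi for yi, zi in zip(y, z) if zi == 1) - fmean(
    yi for yi, zi in zip(y, z) if zi == 0
)
delta_s = stratified_estimate(e_true, z, y, K=5)
print(f"naive difference: {naive:.3f}   stratified (K = 5): {delta_s:.3f}")
```

In this setting the stratified estimate should lie much closer to 2 than the naive difference, but
not exactly at 2, because subjects within a stratum still differ somewhat in their scores.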
The rationale follows from the property $(Y_0, Y_1) \perp\!\!\!\perp Z \mid e(X)$ when (2) holds. Because treatment
exposure is essentially at random for individuals with the same propensity value, we expect mean

References
Journal ArticleDOI

The central role of the propensity score in observational studies for causal effects

Paul R. Rosenbaum, +1 more
- 01 Apr 1983 - 
TL;DR: The authors discusses the central role of propensity scores and balancing scores in the analysis of observational studies and shows that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates.
Journal ArticleDOI

Estimating causal effects of treatments in randomized and nonrandomized studies.

TL;DR: A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented in this paper, where the objective is to specify the benefits of randomization in estimating causal effects of treatments.
Journal ArticleDOI

Propensity score methods for bias reduction in the comparison of a treatment to a non‐randomized control group

TL;DR: The propensity score, defined as the conditional probability of being treated given the covariates, can be used to balance the variance of covariates in the two groups, and therefore reduce bias as mentioned in this paper.
Journal ArticleDOI

Marginal Structural Models and Causal Inference in Epidemiology

TL;DR: In this paper, the authors introduce marginal structural models, a new class of causal models that allow for improved adjustment of confounding in observational studies with exposures or treatments that vary over time, when there exist time-dependent confounders that are also affected by previous treatment.
Journal ArticleDOI

A generalization of sampling without replacement from a finite universe.

TL;DR: In this paper, two sampling schemes are discussed in connection with the problem of determining optimum selection probabilities according to the information available in a supplementary variable, which is a general technique for the treatment of samples drawn without replacement from finite universes when unequal selection probabilities are used.