What is the common method for estimating the difference of two treatment means?

A popular method for estimating the (causal) difference of two treatment means is that of Rosenbaum and Rubin [7], where individuals are stratified based on estimated propensity scores and the difference estimated as the average of within-stratum effects.

What is the implication of incorporating covariates in the propensity model?

The practical implication is that, at least in large samples, for these weighted estimators, incorporating covariates in the propensity model that are not related to treatment exposure but are associated with potential response will always lead to precision for estimating ∆ at least as great as that attained by disregarding such covariates.

How did the authors determine the relative performance of the estimators in the simulations?

To investigate relative performance in such a realistic setting, the authors carried out simulations involving a number of continuous and discrete covariates and a continuous response such that ∆0 > 0, where larger values of the response are preferred, so that treatment is beneficial.

Why are coverages for ∆̂S low?

Low coverages for ∆̂S are due to the residual biases in Table I, as estimated standard errors from (29) performed well, closely tracking the MC standard deviations.

Why is DR the efficient estimator in the class?

Because ∆̂DR is the efficient estimator in the class, in large samples, it has smaller variance than ∆̂IPW1 or ∆̂IPW2 , often dramatically so.

How was the joint distribution of (X, V ) specified?

The joint distribution of (X, V ) was specified by taking X3 ∼ Bernoulli(0.2) and then generating V3 as Bernoulli with P (V3 = 1|X3) = 0.75X3 + 0.25(1−X3).

What is the effect of the scaling on the probability of a complete case?

The scaling has the effect in practice of offering stability in the case where some complete-case probabilities may be small or are highly variable.

What is the effect of a covariate profile on the propensity for treatment?

All scenarios are such that values of X associated with lower responses are also associated with increased propensity for treatment, so that subjects with a covariate profile indicating poor response are those more likely to be treated.

What is the effect of including V in the propensity score model?

From (32) and these analogous expressions, the effect of including V in the propensity score model is to reduce the variance relative to that in the case where V is excluded.

What settings were chosen to represent the degree of association of the corresponding covariate to Z?

Settings of β and ξ that achieve the features described above were chosen to represent varying degrees of association of the corresponding covariate to Z or Y.

What is the heuristic account of large-sample results for ∆̂S?

In Section 3.2 (Stratification Estimators), the authors present a heuristic account of large-sample results for ∆̂S and ∆̂SR based on representing the stratification and within-stratum estimation schemes for each as solutions to sets of estimating equations.

What is the difference between the two classes of estimators?

As shown in Section 3.2, for fixed K, ∆̂S is not consistent, and evidently neither ∆̂S nor ∆̂SR makes use of inverse weighting, so these estimators are not members of this class.

(Open Access) Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study (2004) | Jared Lunceford

Q: What have the authors contributed in "Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study" ?

The authors review popular versions of these approaches and related methods offering improved precision, describe theoretical properties and highlight their implications for practice, and present extensive comparisons of performance that provide guidance for practical use.

Q: What future works have the authors mentioned in the paper "Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study" ?

An interesting avenue for future research would be to establish guidelines for choosing the number of strata based on theoretical analysis of the rate at which the number of strata should increase with sample size to eliminate bias.

Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study

Jared K. Lunceford¹∗† and Marie Davidian²

¹ Merck Research Laboratories, RY34-A316, P.O. Box 2000, Rahway, NJ 07065-0900, U.S.A.

² Department of Statistics, North Carolina State University, Box 8203, Raleigh, NC 27695, U.S.A.

SUMMARY

Estimation of treatment effects with causal interpretation from observational data is complicated because exposure to treatment may be confounded with subject characteristics. The propensity score, the probability of treatment exposure conditional on covariates, is the basis for two approaches to adjusting for confounding: methods based on stratification of observations by quantiles of estimated propensity scores and methods based on weighting observations by the inverse of estimated propensity scores. We review popular versions of these approaches and related methods offering improved precision, describe theoretical properties and highlight their implications for practice, and present extensive comparisons of performance that provide guidance for practical use.

KEY WORDS: covariate balance; double robustness; inverse-probability-of-treatment-weighted-estimator; observational data.

1. INTRODUCTION

Observational data are often the basis for epidemiological and other investigations seeking to make inference on the effect of treatment exposure on a response. Randomized studies aim to balance distributions of subject characteristics across groups, so that groups are similar except for the treatments. However, with observational data, treatment exposure may be associated with covariates that are also associated with potential response, and groups may be seriously imbalanced in these factors. Consequently, unbiased treatment comparisons from observational data require methods that adjust for such confounding of exposure to treatment with subject characteristics, and inferences with a causal interpretation cannot be made without appropriate adjustment.

∗ Correspondence to: Jared K. Lunceford, Merck Research Laboratories, RY34-A316, P.O. Box 2000, Rahway, NJ 07065-0900, U.S.A.

† E-mail: jared lunceford@merck.com, phone: 732-594-1725

Contract/grant sponsor: NIH; contract/grant numbers: R01-CA085848 and R37-AI031789

For comparing two treatments, “treated” and “control,” say, the propensity score is the probability of exposure to treatment conditional on observed covariates [1]. Properties of the propensity score that facilitate causal inferences are given by Rosenbaum and Rubin [1] (see also [2, 3]), and applications of methods using adjustments based on propensity scores are increasingly widespread, e.g. [4, 5, 6]. A popular method for estimating the (causal) difference of two treatment means is that of Rosenbaum and Rubin [7], where individuals are stratified based on estimated propensity scores and the difference estimated as the average of within-stratum effects. An alternative approach is to adjust for confounding by using estimated propensity scores to construct weights for individual observations [8, 9].

In this paper, we review approaches using stratification and weighting based on propensity scores for making causal inferences from observational data and contrast their performance. A main objective is to provide a mostly self-contained introduction to these methods and their underpinnings, a description of their properties that highlights insights with implications for practice, and a demonstration of relative performance that suggests guidelines for application. In Section 2, we discuss the framework of counterfactuals or potential outcomes [10], which formalizes the notion of “causal effect,” and assumptions required to justify adjustments for confounding. We describe popular propensity-score-based approaches, along with some additional methods that may be less familiar to practitioners and that may improve upon these. Section 3 presents theoretical properties of the estimators, and Section 4 reports on extensive comparative simulations.

2. ESTIMATORS BASED ON THE PROPENSITY SCORE

2.1 Counterfactual Framework

Let $Z$ be an indicator of observed treatment exposure ($Z = 1$ if treated, $Z = 0$ if control) and $X$ be a vector of covariates measured prior to receipt of treatment (baseline) or, if measured post-treatment, not affected by either treatment. Each individual is assumed to have an associated random vector $(Y_0, Y_1)$, where $Y_0$ and $Y_1$ are the values of the response that would be seen if, possibly contrary to the fact of what actually happened, s/he were to receive control or treatment, respectively. Consequently, $Y_0$ and $Y_1$ are referred to as counterfactuals (or potential outcomes) and may be viewed as inherent characteristics of the individual. The response $Y$ actually observed is assumed to be that which would be seen under the exposure actually received, formalized as

$$Y = Y_1 Z + (1 - Z) Y_0. \tag{1}$$

Thus, $(Y, Z, X)$ are observed on each individual. It is important to distinguish between the observed response $Y$ and the counterfactual responses $Y_0$ and $Y_1$. The latter are hypothetical and may never be observed simultaneously; however, they are a convenient construct allowing precise statement of questions of interest, as we now describe.

The distributions of $Y_0$ and $Y_1$ may be thought of as representing the hypothetical distributions of response for the population of individuals were all individuals to receive control or be treated, respectively, so the means of these distributions correspond to the mean response if all individuals were to receive each treatment. Hence, a difference in these means would be attributable to, or caused by, the treatments. Formally, then,

$$\Delta = \mu_1 - \mu_0 = E(Y_1) - E(Y_0)$$

is referred to as the average causal effect (of the treated state relative to control). Estimation of $\Delta$ is thus of central interest in comparing treatments.
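As a small illustration of these definitions, the sketch below builds a hypothetical population in which both potential outcomes are known (something never available in practice), applies equation (1) to obtain the observed response, and computes the average causal effect directly. The data-generating choices (normal responses, a constant effect of 2) and the function name `observed_response` are assumptions made for illustration only and are not taken from the paper.

```python
import random


def observed_response(y0, y1, z):
    """Equation (1): the observed Y is the potential outcome under the exposure received."""
    return y1 * z + (1 - z) * y0


rng = random.Random(1)
n = 1000
# Hypothetical potential outcomes; the constant effect of 2.0 is an assumption.
y0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y1 = [v + 2.0 for v in y0]
z = [rng.randint(0, 1) for _ in range(n)]
y = [observed_response(a, b, c) for a, b, c in zip(y0, y1, z)]

# Average causal effect Delta = E(Y1) - E(Y0); computable here only because
# both counterfactuals were simulated for every subject.
delta = sum(y1) / n - sum(y0) / n
```

In real data only `y`, `z` and the covariates would be available, which is why the identification arguments that follow are needed.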

This framework makes it possible to formalize the difficulty in estimating $\Delta$, and thus making causal statements, from observational data. The counterfactuals are never both observed for any subject; thus, whether estimation of $\Delta$ is possible relies on whether $E(Y_1)$ and $E(Y_0)$ may be identified from the observed data $(Y, Z, X)$. The sample average response in the treated group estimates $E(Y \mid Z = 1)$, the mean of observed responses among subjects observed to be treated, which from (1) is equal to $E(Y_1 \mid Z = 1)$ but is different from $E(Y_1)$, the mean if the entire population were treated, and similarly for control. In a randomized trial, as $Z$ is determined for each participant at random, it is unrelated to how s/he might potentially respond, and thus $(Y_0, Y_1) \perp\!\!\!\perp Z$, where $\perp\!\!\!\perp$ denotes statistical independence. Here, using (1), we thus have $E(Y \mid Z = 1) = E(Y_1 \mid Z = 1) = E(Y_1)$, and similarly $E(Y \mid Z = 0) = E(Y_0)$, verifying that the sample average difference is an unbiased estimator for $\Delta$ with a causal interpretation, as widely accepted. However, in an observational study, because treatment exposure $Z$ is not controlled, $Z$ may not be independent of $(Y_0, Y_1)$; indeed, the same characteristics that lead an individual to be exposed to a treatment may also be associated, or “confounded,” with his/her potential response. In this case, $E(Y \mid Z = 1) = E(Y_1 \mid Z = 1) \neq E(Y_1)$ and $E(Y \mid Z = 0) = E(Y_0 \mid Z = 0) \neq E(Y_0)$, so that the difference of observed sample averages is not an unbiased estimator for $\Delta$. It is important to distinguish between the conditions $(Y_0, Y_1) \perp\!\!\!\perp Z$ and $Y \perp\!\!\!\perp Z$. The former involves potential responses, which are indeed independent of treatment assignment under randomization, while the latter involves the observed response and is unlikely to be true under any circumstances unless treatment has no effect.
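The contrast between randomized and observational exposure can be made concrete with a short simulation. The sketch below is illustrative only: the binary confounder, the effect size of 2, and the exposure probabilities 0.8 and 0.2 are assumed values, not settings from the paper. It computes the naive difference of observed group means under each assignment mechanism.

```python
import random


def simulate(n, randomized, seed=0):
    """Return (naive difference of observed group means, true Delta)."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0       # binary confounder (assumed)
        y0 = -2.0 * x + rng.gauss(0.0, 1.0)      # X = 1 lowers the response
        y1 = y0 + 2.0                            # true effect Delta = 2
        # Randomized: P(Z = 1) = 0.5 for everyone; observational: depends on X.
        p = 0.5 if randomized else (0.8 if x == 1 else 0.2)
        z = 1 if rng.random() < p else 0
        y = y1 * z + (1 - z) * y0                # equation (1)
        (treated if z == 1 else control).append(y)
    naive = sum(treated) / len(treated) - sum(control) / len(control)
    return naive, 2.0


naive_rct, delta = simulate(20000, randomized=True)
naive_obs, _ = simulate(20000, randomized=False)
```

Under these assumed settings the naive difference is close to 2 when exposure is randomized. In the observational case its expectation is 0.8, since $E(Y \mid Z = 1) = 2 - 2(0.8) = 0.4$ and $E(Y \mid Z = 0) = -2(0.2) = -0.4$, so the treated group looks far less favorable than it should.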

In an observational study, although $(Y_0, Y_1) \perp\!\!\!\perp Z$ is unlikely to hold, it may be possible to identify subject characteristics related to both potential response and treatment exposure, referred to as “confounders.” If we believe that $X$ contains all such confounders, then, for individuals sharing a particular value of $X$, there would be no association between the exposure states and the values of potential responses; i.e. treatment exposure among individuals with a particular $X$ is essentially at random. Formally, $(Y_0, Y_1)$ are independent of treatment exposure conditional on $X$, written

$$(Y_0, Y_1) \perp\!\!\!\perp Z \mid X. \tag{2}$$

Rosenbaum and Rubin [1] refer to (2) as the assumption of strongly ignorable treatment assignment; (2) has also been called the assumption of no unmeasured confounders [9]. One must appreciate that (2) is an assumption; willingness to assume (2) requires the analyst to have confidence that $X$ contains all characteristics related to both treatment and response and that there are no additional, unmeasured such confounders.

The benefit of (2) is that $E(Y_1)$ and $E(Y_0)$ may be identified from $(Y, Z, X)$. The regression relationship $E(Y \mid Z, X)$ depends only on the observed data, so is identifiable. Then the average for $Z = 1$ over all $X$ satisfies

$$E\{ E(Y \mid Z = 1, X) \} = E\{ E(Y_1 \mid Z = 1, X) \} = E\{ E(Y_1 \mid X) \} = E(Y_1),$$

where the first equality is from (1), the second follows from (2), and the outer expectation is with respect to the distribution of $X$; similarly, $E\{ E(Y \mid Z = 0, X) \} = E(Y_0)$. Thus, it should be possible to make inferences on $\Delta$ if (2) may be assumed to hold. Methods using the propensity score are one way to achieve this.
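The identification argument can be checked numerically when $X$ is discrete, by replacing each expectation with its sample analogue. The following is a minimal sketch under assumed settings (a single binary confounder, a true effect of 2, exposure probabilities 0.8 and 0.2); the helper names `cell_mean` and `standardized_mean` are hypothetical, and this is not one of the estimators studied in the paper.

```python
import random

rng = random.Random(0)
n = 20000
data = []  # tuples (x, z, y) of observed data only
for _ in range(n):
    x = 1 if rng.random() < 0.5 else 0
    y0 = -2.0 * x + rng.gauss(0.0, 1.0)
    y1 = y0 + 2.0                                  # true Delta = 2 (assumed)
    z = 1 if rng.random() < (0.8 if x == 1 else 0.2) else 0
    data.append((x, z, y1 * z + (1 - z) * y0))     # equation (1)


def cell_mean(x, z):
    """Sample analogue of E(Y | Z = z, X = x)."""
    ys = [y for (xi, zi, y) in data if xi == x and zi == z]
    return sum(ys) / len(ys)


def standardized_mean(z):
    """Sample analogue of E{ E(Y | Z = z, X) }, averaging over the distribution of X."""
    total = 0.0
    for x in (0, 1):
        p_x = sum(1 for (xi, _, _) in data if xi == x) / n
        total += p_x * cell_mean(x, z)
    return total


delta_hat = standardized_mean(1) - standardized_mean(0)
```

Because exposure is at random within each level of $X$ in this simulation, `delta_hat` lands near the true value of 2 even though the unadjusted difference of group means would not.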

2.2 The Propensity Score

The propensity score $e(X) = P(Z = 1 \mid X)$, $0 < e(X) < 1$, is the probability of treatment given the observed covariates. Rosenbaum and Rubin [1] showed that $X \perp\!\!\!\perp Z \mid e(X)$, so individuals from either treatment group with the same propensity score are “balanced” in that the distribution of $X$ is the same regardless of exposure status. Rosenbaum and Rubin show that if (2) holds, in addition $(Y_0, Y_1) \perp\!\!\!\perp Z \mid e(X)$, so that treatment exposure is unrelated to the counterfactuals for individuals sharing the same propensity score. We now review ways these developments may be exploited to derive estimators for $\Delta$ from observed data $(Y_i, Z_i, X_i)$, $i = 1, \ldots, n$, an i.i.d. sample containing both treated and control subjects.

In practice, the propensity score is unlikely to be known, so it is routine to estimate it from the observed data $(Z_i, X_i)$, $i = 1, \ldots, n$, by assuming that $e(X)$ follows a parametric model, e.g. a logistic regression model $e(X, \beta) = \{1 + \exp(-X^T \beta)\}^{-1}$, $\beta$ $(p \times 1)$. Interaction and higher-order terms may also be included. Here, $\beta$ may be estimated by the maximum likelihood (ML) estimator $\hat{\beta}$ solving

$$\sum_{i=1}^{n} \psi_\beta(Z_i, X_i, \beta) = \sum_{i=1}^{n} \frac{Z_i - e(X_i, \beta)}{e(X_i, \beta)\{1 - e(X_i, \beta)\}} \, \frac{\partial}{\partial \beta}\{e(X_i, \beta)\} = 0. \tag{3}$$

We assume that the analyst is proficient at modeling $e(X, \beta)$, so that it is correctly specified, and write $e = e(X, \beta)$ and $e_\beta = \partial/\partial\beta\{e(X, \beta)\}$, with subscript $i$ when evaluated at $X_i$.
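For the logistic model, $\partial e / \partial \beta = e(1 - e)X$, so the score equation (3) reduces to $\sum_i \{Z_i - e(X_i, \beta)\} X_i = 0$. The sketch below solves this by Newton-Raphson for an intercept and a single covariate, using only the Python standard library. The function names, the simulated data, and the assumed true coefficients (-0.5, 1.0) are illustrative choices, not part of the paper; in practice one would normally rely on an established logistic regression routine.

```python
import math
import random


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(xs, zs, iters=25):
    """Newton-Raphson for e(X, beta) = expit(b0 + b1 * x); returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            e = expit(b0 + b1 * x)
            w = e * (1.0 - e)
            s0 += z - e              # score component for the intercept
            s1 += (z - e) * x        # score component for the slope
            h00 += w                 # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} * score (2x2 inverse written out).
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1


def score(xs, zs, b0, b1):
    """Left-hand side of (3) specialized to the logistic model."""
    s0 = sum(z - expit(b0 + b1 * x) for x, z in zip(xs, zs))
    s1 = sum((z - expit(b0 + b1 * x)) * x for x, z in zip(xs, zs))
    return s0, s1


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
zs = [1 if rng.random() < expit(-0.5 + 1.0 * x) else 0 for x in xs]  # assumed truth
b0_hat, b1_hat = fit_logistic(xs, zs)
e_hat = [expit(b0_hat + b1_hat * x) for x in xs]  # estimated propensity scores
```

The list `e_hat` plays the role of the estimated propensity scores used by the stratification and weighting estimators.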

2.3 Estimation of ∆ Based on Stratiﬁcation

The popular approach using stratification on estimated propensity scores to estimate $\Delta$ involves the following steps: (i) Estimate $\beta$ as in (3) and calculate estimated propensity scores $\hat{e}_i = e(X_i, \hat{\beta})$ for all $i$; (ii) form $K$ strata according to the sample quantiles of the $\hat{e}_i$, where the $j$th sample quantile $\hat{q}_j$, $j = 1, \ldots, K$, is such that the proportion of $\hat{e}_i \leq \hat{q}_j$ is roughly $j/K$, $\hat{q}_0 = 0$, and $\hat{q}_K = 1$; (iii) within each stratum, calculate the difference of sample means of the $Y_i$ for each treatment; and (iv) estimate $\Delta$ by a weighted sum of the differences of sample means across strata, where weighting is by the proportion of observations falling in each stratum. Defining $\hat{Q}_j = (\hat{q}_{j-1}, \hat{q}_j]$; $n_j = \sum_{i=1}^{n} I(\hat{e}_i \in \hat{Q}_j)$, the number of individuals in stratum $j$; and $n_{1j} = \sum_{i=1}^{n} Z_i I(\hat{e}_i \in \hat{Q}_j)$, the number of these who are treated, the estimator using a weighted sum is

$$\hat{\Delta}_S = \sum_{j=1}^{K} \frac{n_j}{n} \left\{ n_{1j}^{-1} \sum_{i=1}^{n} Z_i Y_i I(\hat{e}_i \in \hat{Q}_j) - (n_j - n_{1j})^{-1} \sum_{i=1}^{n} (1 - Z_i) Y_i I(\hat{e}_i \in \hat{Q}_j) \right\}. \tag{4}$$

As the weights $n_j/n \approx K^{-1}$, they may be replaced by $K^{-1}$ to yield an average across strata.
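Steps (ii) to (iv) and estimator (4) can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions: strata are formed by sorting on the score and cutting into $K$ nearly equal groups (one way to realize "roughly $j/K$"), the simulated data use a true effect of 2, and the true propensity score stands in for the estimated $\hat{e}_i$ to keep the example short. The function name `stratified_estimate` is hypothetical.

```python
import math
import random


def stratified_estimate(y, z, e_hat, K=5):
    """Estimator (4): weighted sum over K propensity-score strata of
    within-stratum differences in sample means, with weights n_j / n."""
    n = len(y)
    order = sorted(range(n), key=lambda i: e_hat[i])   # sort subjects by score
    estimate = 0.0
    for j in range(K):
        # Stratum j holds roughly n / K subjects between consecutive sample quantiles.
        idx = order[j * n // K:(j + 1) * n // K]
        treated = [y[i] for i in idx if z[i] == 1]
        control = [y[i] for i in idx if z[i] == 0]
        if not treated or not control:
            raise ValueError("stratum %d lacks treated or control subjects" % (j + 1))
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        estimate += (len(idx) / n) * diff
    return estimate


# Hypothetical data: higher X means lower response and higher propensity.
rng = random.Random(0)
n = 10000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
e = [1.0 / (1.0 + math.exp(-xi)) for xi in x]          # true score stands in for e_hat
z = [1 if rng.random() < ei else 0 for ei in e]
y = [2.0 * zi - xi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]  # true Delta = 2

naive = (sum(yi for yi, zi in zip(y, z) if zi) / sum(z)
         - sum(yi for yi, zi in zip(y, z) if not zi) / (n - sum(z)))
delta_s = stratified_estimate(y, z, e, K=5)
```

With these assumed settings `delta_s` is much closer to 2 than `naive`, but some bias remains because subjects within a stratum do not share exactly the same propensity score; this is in line with the paper's remark that, for fixed $K$, $\hat{\Delta}_S$ is not consistent.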

The rationale follows from the property $(Y_0, Y_1) \perp\!\!\!\perp Z \mid e(X)$ when (2) holds. Because treatment exposure is essentially at random for individuals with the same propensity value, we expect mean


References

The central role of the propensity score in observational studies for causal effects

Estimating causal effects of treatments in randomized and nonrandomized studies.

Propensity score methods for bias reduction in the comparison of a treatment to a non‐randomized control group

Marginal Structural Models and Causal Inference in Epidemiology

A generalization of sampling without replacement from a finite universe.
