Some practical guidance for the implementation of propensity score matching
Summary
1 Introduction
- Matching has become a popular approach to estimate causal treatment effects.
- It originated from the statistical literature and shows a close link to the experimental context.
- One possible balancing score is the propensity score, i.e. the probability of participating in a programme given observed characteristics X. Matching procedures based on this balancing score are known as propensity score matching (PSM) and will be the focus of this paper.
- To begin with, a first decision has to be made concerning the estimation of the propensity score (see subsection 3.1).
2 Evaluation Framework and Matching Basics
- Inference about the impact of a treatment on the outcome of an individual involves speculation about how this individual would have performed had he not received the treatment; this framework is known as the Roy-Rubin model.
- (1) The fundamental evaluation problem arises because only one of the potential outcomes is observed for each individual i.
- Hence, estimating the individual treatment effect τi is not possible and one has to concentrate on average treatment effects.
3.1 Estimating the Propensity Score
- When estimating the propensity score, two choices have to be made.
- When leaving the binary treatment case, the choice of the model becomes more important.
- Heckman, LaLonde, and Smith (1999) also point out that the data for participants and non-participants should stem from the same sources (e.g. the same questionnaire).
- Basically, the points made so far imply that the choice of variables should be based on economic theory and previous empirical findings.
- When using the full specification, bias arises from selecting a wide bandwidth in response to the weakness of the common support.
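The first step, estimating the propensity score P(D=1 | X), is usually done with a standard logit or probit model. The following is a minimal sketch using hypothetical simulated data and a hand-rolled logit fit via gradient ascent (a stand-in for any standard logit routine); all variable names and the data-generating process are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: X holds observed covariates, d the treatment dummy.
n = 500
X = rng.normal(size=(n, 2))
d = (X @ np.array([0.8, -0.5]) + rng.normal(size=n) > 0).astype(float)

# Fit a logit model by gradient ascent on the average log-likelihood.
Xc = np.column_stack([np.ones(n), X])        # add an intercept column
beta = np.zeros(Xc.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xc @ beta))     # current fitted P(D=1 | X)
    beta += 0.1 * Xc.T @ (d - p) / n         # score of the log-likelihood

# Estimated propensity scores for all individuals.
pscore = 1.0 / (1.0 + np.exp(-Xc @ beta))
```

In practice one would use a packaged logit/probit estimator; the point of the sketch is only that the output is a single score per individual, to be used in the matching steps that follow.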
3.2 Choosing a Matching Algorithm
- The PSM estimator in its general form was stated in equation (9).
- In nearest-neighbour matching, the individual from the comparison group that is closest in terms of the propensity score is chosen as the matching partner for each treated individual.
- The matching algorithms discussed so far have in common that only a few observations from the comparison group are used to construct the counterfactual outcome of a treated individual.
- The bandwidth choice is therefore a compromise between a small variance and an unbiased estimate of the true density function.
- If the propensity score is known, the estimator can directly be implemented as the difference between a weighted average of the outcomes for the treated and untreated individuals.
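Nearest-neighbour matching can be sketched in a few lines. The scores and outcomes below are hypothetical numbers chosen for illustration; with replacement, each treated unit simply takes the control whose propensity score is closest, and the ATT is the mean outcome difference across matched pairs.

```python
import numpy as np

# Hypothetical propensity scores and outcomes.
p_treated = np.array([0.30, 0.56, 0.72])
p_control = np.array([0.10, 0.28, 0.50, 0.60, 0.75])
y_treated = np.array([5.0, 6.0, 8.0])
y_control = np.array([2.0, 4.0, 5.0, 5.5, 7.0])

# Nearest-neighbour matching with replacement: for each treated unit,
# pick the index of the control with the closest propensity score.
idx = np.abs(p_treated[:, None] - p_control[None, :]).argmin(axis=1)

# ATT estimate: average outcome gap over the matched pairs.
att = (y_treated - y_control[idx]).mean()
```

Kernel matching would instead weight all controls by their distance to the treated unit, trading lower variance against potential bias from poor matches.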
3.3 Overlap and Common Support
- The discussion in section 2 has shown that ATT and ATE are only defined in the region of common support.
- Some formal guidelines might help the researcher to determine the region of common support more precisely.
- Implementing the common support condition ensures that any combination of characteristics observed in the treatment group can also be observed among the control group (Bryson, Dorsett, and Purdon, 2002).
- The trimming method on the other hand would explicitly exclude treated observations in that propensity score range and would therefore deliver more reliable results.
- It may be instructive to inspect the characteristics of discarded individuals since those can provide important clues when interpreting the estimated treatment effects.
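One simple formal guideline is the minima-and-maxima rule: keep only observations whose propensity score lies between the largest of the two group minima and the smallest of the two group maxima. A minimal sketch on hypothetical scores:

```python
import numpy as np

# Hypothetical estimated propensity scores for the two groups.
p_treated = np.array([0.15, 0.40, 0.65, 0.90])
p_control = np.array([0.05, 0.20, 0.45, 0.70])

# Region of common support: [max of group minima, min of group maxima].
lo = max(p_treated.min(), p_control.min())
hi = min(p_treated.max(), p_control.max())

# Treated observations outside this interval would be discarded.
on_support = (p_treated >= lo) & (p_treated <= hi)
```

Here the treated unit with score 0.90 falls outside the support and would be dropped; inspecting such discarded individuals, as noted above, can be informative for interpreting the estimated effects.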
3.4 Assessing the Matching Quality
- These procedures can also, as already mentioned, help in determining which interactions and higher order terms to include for a given set of covariates X. Smith and Todd (2005) note that this theorem holds for any X, including those that do not satisfy the CIA required to justify matching.
- Before matching differences are expected, but after matching the covariates should be balanced in both groups and hence no significant differences should be found.
- The t-test might be preferred if the evaluator is concerned with the statistical significance of the results.
- If the quality indicators are not satisfactory, one reason might be mis-specification of the propensity score model; hence it may be worth taking a step back, including e.g. interaction or higher-order terms in the score estimation, and testing the quality once again.
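A common balance diagnostic is the standardised bias for each covariate. The helper below is a sketch of that formula on hypothetical covariate values; a value close to zero after matching indicates good balance.

```python
import numpy as np

def standardised_bias(x_t, x_c):
    """Standardised bias (in %) for one covariate: the mean difference
    scaled by the square root of the average of the group variances."""
    return 100.0 * (x_t.mean() - x_c.mean()) / np.sqrt(
        0.5 * (x_t.var(ddof=1) + x_c.var(ddof=1)))

# Hypothetical covariate values in the treated and control groups.
sb = standardised_bias(np.array([1.0, 2.0, 3.0]),
                       np.array([2.0, 3.0, 4.0]))
```

The same function applied to the matched samples gives the after-matching bias, so the before/after comparison quantifies how much matching improved balance; the t-test is the alternative when statistical significance of the remaining differences is the concern.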
3.5 Choice-Based Sampling
- An additional problem arising in evaluation studies is that samples used are often choice-based (Smith and Todd, 2005).
- This is a situation where programme participants are oversampled relative to their frequency in the population of eligible persons.
- Hence, matching can be done on the (mis-weighted) estimate of the odds ratio (or of the log odds ratio).
- Clearly, with single nearest-neighbour matching it does not matter whether matching is performed on the odds ratio or the estimated propensity score (with wrong weights), since ranking of the observations is identical and therefore the same neighbours will be selected.
- For methods that take account of the absolute distance between observations, e.g. kernel matching, it does matter.
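The point about odds ratios can be seen numerically: the odds ratio p/(1-p) is a monotone transform of the propensity score, so rankings (and hence single nearest neighbours) are unchanged, while absolute distances are not. The scores below are hypothetical.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.7, 0.9])   # hypothetical propensity scores
odds = p / (1.0 - p)                 # odds ratio p / (1 - p)

# Monotone transform: the ranking of observations is identical, so
# single nearest-neighbour matching selects the same partners.
same_ranking = np.array_equal(np.argsort(p), np.argsort(odds))

# ...but gaps between adjacent observations differ, which matters
# for distance-weighting methods such as kernel matching.
gaps_p = np.diff(p)
gaps_odds = np.diff(odds)
```

Here the score gaps are roughly equal while the odds-ratio gaps grow sharply near p = 1, so kernel weights computed on the two scales would differ substantially.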
3.6 When to Compare and Locking-in Effects
- The major goal is to ensure that participants and non-participants are compared in the same economic environment and the same individual lifecycle position.
- The latter of the two alternatives implies that the outcome of participants who re-enter the labour market in July is compared with matched non-participants in July.
- Since they are involved in the programme, they do not have the same time to search for a new job as non-participants.
- Two opposing effects therefore determine the outcome at the end of the programme: first, the increased employment probability through the programme and, second, the reduced search intensity.
- So, if the authors are able to observe the outcome of the individuals for a reasonable time after the beginning or end of the programme, the occurrence of locking-in effects poses fewer problems but nevertheless has to be taken into account in the interpretation.
3.7 Estimation of Standard Errors
- Testing the statistical significance of treatment effects and computing their standard errors is not a straightforward thing to do.
- This method is a popular way to estimate standard errors in case analytical estimates are biased or unavailable.
- Each bootstrap draw includes the re-estimation of the results, including the first steps of the estimation (propensity score, common support, etc.).
- Furthermore, the authors assume homoscedasticity of the variances of the outcome variables within the treatment and control groups, and that the outcome variances do not depend on the estimated propensity score.
- This assumption can be justified by results from Lechner (2002), who finds little difference between bootstrapped variances and the variances calculated according to equation (15).
3.8 Available Software to Implement Matching
- The bulk of software tools to implement matching and estimate treatment effects is growing and allows researchers to choose the appropriate tool for their purposes.
- To obtain standard errors the user can choose between bootstrapping and the variance approximation proposed by Lechner (2001).
- Leuven and Sianesi (2003) provide the programme psmatch2 for implementing different kinds of matching estimators including covariate and propensity score matching.
- Standard errors are obtained using bootstrapping methods.
- Finally, Abadie, Drukker, Leber Herr, and Imbens (2004) offer the programme nnmatch for implementing covariate matching, where the user can choose between several different distance metrics.
4.1 Unobserved Heterogeneity - Rosenbaum Bounds
- If there are unobserved variables which affect assignment into treatment and the outcome variable simultaneously, a ‘hidden bias’ might arise.
- But still, both individuals differ in their odds of receiving treatment by a factor that involves the parameter γ and the difference in their unobserved covariates u.
- The authors follow Aakvik (2001) and assume for the sake of simplicity that the unobserved covariate is a dummy variable with ui ∈ {0, 1}.
- Let Q_MH^+ be the test statistic for the case where the authors have overestimated the treatment effect and Q_MH^- for the case where they have underestimated the treatment effect.
4.2 Failure of Common Support - Lechner Bounds
- In subsection 3.3 the authors have presented possible approaches to implement the common support restriction.
- Lechner (2000b) describes an approach to check the robustness of estimated treatment effects with respect to failure of common support.
- To introduce his approach some additional notation is needed.
- Assume that the share of participants within the common support (relative to the total number of participants), the ATT for those within the common support, and λ10 are identified.
- Lechner (2000b) states that either ignoring the common support problem or estimating ATT only for the subpopulation within the common support can both be misleading.
5 Conclusion
- The aim of this paper was to give some guidance for the implementation of propensity score matching.
- The first step of implementation is the estimation of the propensity score.
- If it is felt that some variables play a specifically important role in determining participation and outcomes, one can use an ‘overweighting’ strategy, for example by carrying out matching on sub-populations.
- If results among different algorithms differ, further investigations may be needed to reveal the source of disparity.
- Another important decision is when to measure the effects.
Frequently Asked Questions (5)
Q2. What is the reason for including the full set of covariates in small samples?
Including the full set of covariates in small samples might cause problems in terms of higher variance, since either some treated have to be discarded from the analysis or control units have to be used more than once.
Q3. How many subclasses are often enough to remove 95% of the bias associated with a single covariate?
Cochrane and Chambers (1965) show that five subclasses are often enough to remove 95% of the bias associated with a single covariate.
Q4. What is the standardised bias after matching?
The standardised bias before matching is given by SB_before = 100 · (X̄_1 − X̄_0) / √(0.5 · (V_1(X) + V_0(X))) (13), where X̄_1 (V_1) is the mean (variance) in the treatment group before matching and X̄_0 (V_0) the analogue for the control group. The standardised bias after matching is defined analogously, SB_after = 100 · (X̄_1M − X̄_0M) / √(0.5 · (V_1M(X) + V_0M(X))) (14), using the means and variances in the matched samples.
Q5. What is the problem with the estimation of the net effect?
The problem is that the estimated variance of the treatment effect should also include the variance due to the estimation of the propensity score, the imputation of the common support, and possibly also the order in which treated individuals are matched.