Some practical guidance for the implementation of propensity score matching
Summary
1 Introduction
- Matching has become a popular approach to estimate causal treatment effects.
- It originated from the statistical literature and shows a close link to the experimental context.
- One possible balancing score is the propensity score, i.e. the probability of participating in a programme given observed characteristics X. Matching procedures based on this balancing score are known as propensity score matching (PSM) and will be the focus of this paper.
- To begin with, a first decision has to be made concerning the estimation of the propensity score (see subsection 3.1).
2 Evaluation Framework and Matching Basics
- Inference about the impact of a treatment on the outcome of an individual involves speculation about how this individual would have performed had he not received the treatment; this framework is known as the Roy-Rubin model.
- (1) The fundamental evaluation problem arises because only one of the potential outcomes is observed for each individual i.
- Hence, estimating the individual treatment effect τi is not possible and one has to concentrate on average treatment effects.
3.1 Estimating the Propensity Score
- When estimating the propensity score, two choices have to be made.
- When leaving the binary treatment case, the choice of the model becomes more important.
- Heckman, LaLonde, and Smith (1999) also point out that the data for participants and non-participants should stem from the same sources (e.g. the same questionnaire).
- Basically, the points made so far imply that the choice of variables should be based on economic theory and previous empirical findings.
- When using the full specification, bias arises from selecting a wide bandwidth in response to the weakness of the common support.
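The first step, estimating the propensity score P(D=1 | X), is usually done with a standard logit or probit model. The following is a minimal sketch using hypothetical simulated data and a hand-rolled logit fit via gradient ascent (a stand-in for any standard logit routine); all variable names and the data-generating process are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: X holds observed covariates, d the treatment dummy.
n = 500
X = rng.normal(size=(n, 2))
d = (X @ np.array([0.8, -0.5]) + rng.normal(size=n) > 0).astype(float)

# Fit a logit model by gradient ascent on the average log-likelihood.
Xc = np.column_stack([np.ones(n), X])        # add an intercept column
beta = np.zeros(Xc.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xc @ beta))     # current fitted P(D=1 | X)
    beta += 0.1 * Xc.T @ (d - p) / n         # score of the log-likelihood

# Estimated propensity scores for all individuals.
pscore = 1.0 / (1.0 + np.exp(-Xc @ beta))
```

In practice one would use a packaged logit/probit estimator; the point of the sketch is only that the output is a single score per individual, to be used in the matching steps that follow.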
3.2 Choosing a Matching Algorithm
- The PSM estimator in its general form was stated in equation (9).
- In nearest-neighbour matching, the individual from the comparison group that is closest in terms of the propensity score is chosen as the matching partner for each treated individual.
- The matching algorithms discussed so far have in common that only a few observations from the comparison group are used to construct the counterfactual outcome of a treated individual.
- The bandwidth choice is therefore a compromise between a small variance and an unbiased estimate of the true density function.
- If the propensity score is known, the estimator can directly be implemented as the difference between a weighted average of the outcomes for the treated and untreated individuals.
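Nearest-neighbour matching can be sketched in a few lines. The scores and outcomes below are hypothetical numbers chosen for illustration; with replacement, each treated unit simply takes the control whose propensity score is closest, and the ATT is the mean outcome difference across matched pairs.

```python
import numpy as np

# Hypothetical propensity scores and outcomes.
p_treated = np.array([0.30, 0.56, 0.72])
p_control = np.array([0.10, 0.28, 0.50, 0.60, 0.75])
y_treated = np.array([5.0, 6.0, 8.0])
y_control = np.array([2.0, 4.0, 5.0, 5.5, 7.0])

# Nearest-neighbour matching with replacement: for each treated unit,
# pick the index of the control with the closest propensity score.
idx = np.abs(p_treated[:, None] - p_control[None, :]).argmin(axis=1)

# ATT estimate: average outcome gap over the matched pairs.
att = (y_treated - y_control[idx]).mean()
```

Kernel matching would instead weight all controls by their distance to the treated unit, trading lower variance against potential bias from poor matches.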
3.3 Overlap and Common Support
- The discussion in section 2 has shown that ATT and ATE are only defined in the region of common support.
- Some formal guidelines might help the researcher to determine the region of common support more precisely.
- Implementing the common support condition ensures that any combination of characteristics observed in the treatment group can also be observed among the control group (Bryson, Dorsett, and Purdon, 2002).
- The trimming method on the other hand would explicitly exclude treated observations in that propensity score range and would therefore deliver more reliable results.
- It may be instructive to inspect the characteristics of discarded individuals since those can provide important clues when interpreting the estimated treatment effects.
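One simple formal guideline is the minima-and-maxima rule: keep only observations whose propensity score lies between the largest of the two group minima and the smallest of the two group maxima. A minimal sketch on hypothetical scores:

```python
import numpy as np

# Hypothetical estimated propensity scores for the two groups.
p_treated = np.array([0.15, 0.40, 0.65, 0.90])
p_control = np.array([0.05, 0.20, 0.45, 0.70])

# Region of common support: [max of group minima, min of group maxima].
lo = max(p_treated.min(), p_control.min())
hi = min(p_treated.max(), p_control.max())

# Treated observations outside this interval would be discarded.
on_support = (p_treated >= lo) & (p_treated <= hi)
```

Here the treated unit with score 0.90 falls outside the support and would be dropped; inspecting such discarded individuals, as noted above, can be informative for interpreting the estimated effects.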
3.4 Assessing the Matching Quality
- These procedures can also, as already mentioned, help in determining which interactions and higher order terms to include for a given set of covariates X. Smith and Todd (2005) note that this theorem holds for any X, including those that do not satisfy the CIA required to justify matching.
- Before matching differences are expected, but after matching the covariates should be balanced in both groups and hence no significant differences should be found.
- The t-test might be preferred if the evaluator is concerned with the statistical significance of the results.
- If the quality indicators are not satisfactory, one reason might be mis-specification of the propensity score model; hence it may be worth taking a step back, including e.g. interaction or higher-order terms in the score estimation, and testing the quality once again.
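A common balance diagnostic is the standardised bias for each covariate. The helper below is a sketch of that formula on hypothetical covariate values; a value close to zero after matching indicates good balance.

```python
import numpy as np

def standardised_bias(x_t, x_c):
    """Standardised bias (in %) for one covariate: the mean difference
    scaled by the square root of the average of the group variances."""
    return 100.0 * (x_t.mean() - x_c.mean()) / np.sqrt(
        0.5 * (x_t.var(ddof=1) + x_c.var(ddof=1)))

# Hypothetical covariate values in the treated and control groups.
sb = standardised_bias(np.array([1.0, 2.0, 3.0]),
                       np.array([2.0, 3.0, 4.0]))
```

The same function applied to the matched samples gives the after-matching bias, so the before/after comparison quantifies how much matching improved balance; the t-test is the alternative when statistical significance of the remaining differences is the concern.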
3.5 Choice-Based Sampling
- An additional problem arising in evaluation studies is that samples used are often choice-based (Smith and Todd, 2005).
- This is a situation where programme participants are oversampled relative to their frequency in the population of eligible persons.
- Hence, matching can be done on the (mis-weighted) estimate of the odds ratio (or of the log odds ratio).
- Clearly, with single nearest-neighbour matching it does not matter whether matching is performed on the odds ratio or the estimated propensity score (with wrong weights), since ranking of the observations is identical and therefore the same neighbours will be selected.
- For methods that take account of the absolute distance between observations, e.g. kernel matching, it does matter.
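The point about odds ratios can be seen numerically: the odds ratio p/(1-p) is a monotone transform of the propensity score, so rankings (and hence single nearest neighbours) are unchanged, while absolute distances are not. The scores below are hypothetical.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.7, 0.9])   # hypothetical propensity scores
odds = p / (1.0 - p)                 # odds ratio p / (1 - p)

# Monotone transform: the ranking of observations is identical, so
# single nearest-neighbour matching selects the same partners.
same_ranking = np.array_equal(np.argsort(p), np.argsort(odds))

# ...but gaps between adjacent observations differ, which matters
# for distance-weighting methods such as kernel matching.
gaps_p = np.diff(p)
gaps_odds = np.diff(odds)
```

Here the score gaps are roughly equal while the odds-ratio gaps grow sharply near p = 1, so kernel weights computed on the two scales would differ substantially.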
3.6 When to Compare and Locking-in Effects
- The major goal is to ensure that participants and non-participants are compared in the same economic environment and the same individual lifecycle position.
- The latter of the two alternatives implies that the outcome of participants who re-enter the labour market in July is compared with matched non-participants in July.
- Since they are involved in the programme, they do not have the same time to search for a new job as non-participants.
- Two opposing effects therefore determine the outcome at the end of the programme: first, the increased employment probability through the programme and, second, the reduced search intensity.
- So, if the authors are able to observe the outcome of the individuals for a reasonable time after the beginning or end of the programme, the occurrence of locking-in effects poses fewer problems but nevertheless has to be taken into account in the interpretation.
3.7 Estimation of Standard Errors
- Testing the statistical significance of treatment effects and computing their standard errors is not a straightforward thing to do.
- This method is a popular way to estimate standard errors in case analytical estimates are biased or unavailable.
- Each bootstrap draw includes the re-estimation of the results, including the first steps of the estimation (propensity score, common support, etc.).
- Furthermore, the authors assume homoscedasticity of the variances of the outcome variables within the treatment and control groups, and that the outcome variances do not depend on the estimated propensity score.
- This assumption can be justified by results from Lechner (2002), who finds little difference between bootstrapped variances and the variances calculated according to equation (15).
3.8 Available Software to Implement Matching
- The bulk of software tools to implement matching and estimate treatment effects is growing and allows researchers to choose the appropriate tool for their purposes.
- To obtain standard errors the user can choose between bootstrapping and the variance approximation proposed by Lechner (2001).
- Leuven and Sianesi (2003) provide the programme psmatch2 for implementing different kinds of matching estimators including covariate and propensity score matching.
- Standard errors are obtained using bootstrapping methods.
- Finally, Abadie, Drukker, Leber Herr, and Imbens (2004) offer the programme nnmatch for implementing covariate matching, where the user can choose between several different distance metrics.
4.1 Unobserved Heterogeneity - Rosenbaum Bounds
- If there are unobserved variables which affect assignment into treatment and the outcome variable simultaneously, a ‘hidden bias’ might arise.
- But still, both individuals differ in their odds of receiving treatment by a factor that involves the parameter γ and the difference in their unobserved covariates u.
- The authors follow Aakvik (2001) and assume for the sake of simplicity that the unobserved covariate is a dummy variable with ui ∈ {0, 1}.
- Let Q_MH^+ be the test statistic for the case where the authors have overestimated the treatment effect and Q_MH^- for the case where they have underestimated the treatment effect.
4.2 Failure of Common Support - Lechner Bounds
- In subsection 3.3 the authors have presented possible approaches to implement the common support restriction.
- Lechner (2000b) describes an approach to check the robustness of estimated treatment effects with respect to failure of common support.
- To introduce his approach some additional notation is needed.
- Assume that the share of participants within the common support (relative to the total number of participants), the ATT for those within the common support, and λ10 are identified.
- Lechner (2000b) states that either ignoring the common support problem or estimating ATT only for the subpopulation within the common support can both be misleading.
5 Conclusion
- The aim of this paper was to give some guidance for the implementation of propensity score matching.
- The first step of implementation is the estimation of the propensity score.
- If it is felt that some variables play a specifically important role in determining participation and outcomes, one can use an ‘overweighting’ strategy, for example by carrying out matching on sub-populations.
- If results among different algorithms differ, further investigations may be needed to reveal the source of disparity.
- Another important decision is when to measure the effects.
Frequently Asked Questions (5)
Q2. What is the reason for including the full set of covariates in small samples?
Including the full set of covariates in small samples might cause problems in terms of higher variance, since either some treated have to be discarded from the analysis or control units have to be used more than once.
Q3. How many subclasses are often enough to remove 95% of the bias associated with a single covariate?
Cochrane and Chambers (1965) show that five subclasses are often enough to remove 95% of the bias associated with a single covariate.
Q4. What is the standardised bias after matching?
The standardised bias before matching is given by SB_before = 100 · (X̄_1 − X̄_0) / √(0.5 · (V_1(X) + V_0(X))) (13), where X̄_1 (V_1) is the mean (variance) in the treatment group before matching and X̄_0 (V_0) the analogue for the control group. The standardised bias after matching is defined analogously, SB_after = 100 · (X̄_1M − X̄_0M) / √(0.5 · (V_1M(X) + V_0M(X))) (14), using the means and variances in the matched samples.
Q5. What is the problem with the estimation of the net effect?
The problem is that the estimated variance of the treatment effect should also include the variance due to the estimation of the propensity score, the imputation of the common support, and possibly also the order in which treated individuals are matched.