
A Practitioner’s Guide to Cluster-Robust Inference

Journal of Human Resources (University of Wisconsin Press), Vol. 50, No. 2, 31 March 2015, pp. 317-372
A. Colin Cameron and Douglas L. Miller
Abstract
We consider statistical inference for regression when data are grouped into clusters, with
regression model errors independent across clusters but correlated within clusters. Examples
include data on individuals with clustering on village or region or other category such as
industry, and state-year differences-in-differences studies with clustering on state. In such
settings default standard errors can greatly overstate estimator precision. Instead, if the number
of clusters is large, statistical inference after OLS should be based on cluster-robust standard
errors. We outline the basic method as well as many complications that can arise in practice.
These include cluster-specific fixed effects, few clusters, multi-way clustering, and estimators
other than OLS.
Colin Cameron is a Professor in the Department of Economics at UC Davis. Doug Miller is
an Associate Professor in the Department of Economics at UC Davis. They thank four
referees and the journal editor for very helpful comments and guidance, participants at the
2013 California Econometrics Conference, a workshop sponsored by the U.K. Programme
Evaluation for Policy Analysis, seminars at the University of Southern California and at the
University of Uppsala, and the many people who over time have sent them cluster-related
puzzles (the solutions to some of which appear in this paper). Doug Miller acknowledges
financial support from the Center for Health and Wellbeing at the Woodrow Wilson School of
Public Policy at Princeton University.

I. Introduction
In an empiricist’s day-to-day practice, most effort is spent on getting unbiased or
consistent point estimates. That is, a lot of attention focuses on the parameters ($\hat{\beta}$). In this
paper we focus on getting accurate statistical inference, a fundamental component of which is
obtaining accurate standard errors ($se$, the estimated standard deviation of $\hat{\beta}$). We begin with
the basic reminder that empirical researchers should also really care about getting this part
right. An asymptotic 95% confidence interval is $\hat{\beta} \pm 1.96 \times se$, and hypothesis testing is
typically based on the Wald “t-statistic” $w = (\hat{\beta} - \beta_0)/se$. Both $\hat{\beta}$ and $se$ are critical
ingredients for statistical inference, and we should be paying as much attention to getting a
good $se$ as we do to obtaining $\hat{\beta}$.
In this paper, we consider statistical inference in regression models where observations
can be grouped into clusters, with model errors uncorrelated across clusters but correlated
within cluster. One leading example of “clustered errors” is individual-level cross-section data
with clustering on geographical region, such as village or state. Then model errors for
individuals in the same region may be correlated, while model errors for individuals in
different regions are assumed to be uncorrelated. A second leading example is panel data. Then
model errors in different time periods for a given individual (e.g., person or firm or region) may
be correlated, while model errors for different individuals are assumed to be uncorrelated.
Failure to control for within-cluster error correlation can lead to very misleadingly
small standard errors, and consequent misleadingly narrow confidence intervals, large
t-statistics and low p-values. It is not unusual to have applications where standard errors that
control for within-cluster correlation are several times larger than default standard errors that
ignore such correlation. As shown below, the need for such control increases not only with
the size of the within-cluster error correlation, but also with the size of the within-cluster
correlation of the regressors and with the number of observations within a cluster.
A leading example, highlighted by Moulton (1986, 1990), is when interest lies in measuring the
effect of a policy variable, or other aggregated regressor, that takes the same value for all
observations within a cluster.
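To get a feel for magnitudes before the formal treatment, one can use the Moulton-type approximation to the variance inflation factor presented in Section II,

$$\tau \;\simeq\; 1 + \rho_x \rho_u (\bar{N}_g - 1),$$

where $\rho_x$ is the within-cluster correlation of the regressor, $\rho_u$ is the within-cluster error correlation, and $\bar{N}_g$ is the average cluster size. The following numbers are purely illustrative: for an aggregated policy regressor ($\rho_x = 1$), a modest error correlation $\rho_u = 0.05$, and 81 observations per cluster, $\tau \simeq 1 + 1 \times 0.05 \times 80 = 5$, so correct standard errors are $\sqrt{5} \approx 2.2$ times as large as the default ones.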
One way to control for clustered errors in a linear regression model is to additionally
specify a model for the within-cluster error correlation, consistently estimate the parameters of
this error correlation model, and then estimate the original model by feasible generalized least
squares (FGLS) rather than ordinary least squares (OLS). Examples include random effects
estimators and, more generally, random coefficient and hierarchical models. If all goes well
this provides valid statistical inference, as well as estimates of the parameters of the original
regression model that are more efficient than OLS. However, these desirable properties hold
only under the very strong assumption that the model for within-cluster error correlation is
correctly specified.
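As an illustration of this FGLS route, the sketch below fits a random-intercept (random-effects) model using statsmodels' MixedLM on simulated equicorrelated data; the variable names and parameter values are our own illustrative choices, not from the paper.

```python
# A minimal sketch of the random-effects (FGLS) route: a random-intercept
# model estimated on simulated data with equicorrelated within-cluster errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
G, m = 50, 10                                        # 50 clusters of 10 observations
g = np.repeat(np.arange(G), m)                       # cluster identifier
x = rng.normal(size=G * m)
u = rng.normal(size=G)[g] + rng.normal(size=G * m)   # cluster effect + idiosyncratic error
y = 1.0 * x + u

X = sm.add_constant(x)
re_fit = sm.MixedLM(y, X, groups=g).fit()            # random intercept for each cluster
print(re_fit.summary())
```

The efficiency gain and the validity of the resulting standard errors both hinge on this random-intercept (equicorrelated) error model being correct.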
A more recent method to control for clustered errors is to estimate the regression model
with limited or no control for within-cluster error correlation, and then post-estimation obtain
“cluster-robust” standard errors proposed by White (1984, p.134-142) for OLS with a
multivariate dependent variable (directly applicable to balanced clusters); by Liang and Zeger
(1986) for linear and nonlinear models; and by Arellano (1987) for the fixed effects estimator
in linear panel models. These cluster-robust standard errors do not require specification of a
model for within-cluster error correlation, but do require the additional assumption that the
number of clusters, rather than just the number of observations, goes to infinity.
Cluster-robust standard errors are now widely used, popularized in part by Rogers
(1993) who incorporated the method in Stata, and by Bertrand, Duflo and Mullainathan (2004)
who pointed out that many differences-in-differences studies failed to control for clustered
errors, and those that did often clustered at the wrong level. Cameron and Miller (2011) and
Wooldridge (2003, 2006) provide surveys, and lengthy expositions are given in Angrist and
Pischke (2009) and Wooldridge (2010).
One goal of this paper is to provide the practitioner with the methods to implement
cluster-robust inference. To this end we include in the paper reference to relevant Stata
commands (for version 13), since Stata is the computer package most often used in applied
microeconometrics research. And we will post on our websites more expansive Stata code and
the datasets used in this paper. A second goal is to present how to deal with complications such
as determining when there is a need to cluster, incorporating fixed effects, and inference when
there are few clusters. A third goal is to provide an exposition of the underlying econometric
theory, as this can aid in understanding complications. In practice the most difficult
complication to deal with can be “few” clusters; see Section VI. There is no clear-cut definition
of “few”; depending on the situation “few” may range from less than 20 to less than 50 clusters
in the balanced case.
We focus on OLS, for simplicity and because this is the most commonly-used
estimation method in practice. Section II presents the basic results for OLS with clustered
errors. In principle, implementation is straightforward as econometrics packages include
cluster-robust as an option for the commonly-used estimators; in Stata it is the
vce(cluster) option. The remainder of the survey concentrates on complications that
often arise in practice. Section III addresses how the addition of fixed effects impacts
cluster-robust inference. Section IV deals with the obvious complication that it is not always
clear what to cluster over. Section V considers clustering when there is more than one way to
do so and these ways are not nested in each other. Section VI considers how to adjust inference
when there are just a few clusters as, without adjustment, test statistics based on the
cluster-robust standard errors over-reject and confidence intervals are too narrow. Section VII
presents extensions to the full range of estimators: instrumental variables, nonlinear models
such as logit and probit, and generalized method of moments. Section VIII presents both
empirical examples and real-data based simulations. Concluding thoughts are given in Section
IX.
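For readers working outside Stata, the following minimal sketch shows one way to obtain cluster-robust OLS standard errors in Python via statsmodels; the simulated data and all parameter values are our own, purely for illustration.

```python
# Cluster-robust OLS standard errors with statsmodels: the analogue of
# Stata's vce(cluster) option, shown on simulated clustered data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
G, m = 40, 25
g = np.repeat(np.arange(G), m)                             # cluster identifier
x = rng.normal(size=G)[g] + 0.5 * rng.normal(size=G * m)   # regressor correlated within cluster
u = rng.normal(size=G)[g] + rng.normal(size=G * m)         # error correlated within cluster
y = 2.0 + 1.0 * x + u

X = sm.add_constant(x)
fit_default = sm.OLS(y, X).fit()                           # default i.i.d. standard errors
fit_cluster = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": g})
print(fit_default.bse[1], fit_cluster.bse[1])              # cluster-robust SE is larger here
```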
II. Cluster-Robust Inference
In this section we present the fundamentals of cluster-robust inference. For these basic
results we assume that the model does not include cluster-specific fixed effects, that it is clear
how to form the clusters, and that there are many clusters. We relax these conditions in
subsequent sections.
Clustered errors have two main consequences: they (usually) reduce the precision of $\hat{\beta}$,
and the standard estimator for the variance of $\hat{\beta}$, $\widehat{\mathrm{V}}[\hat{\beta}]$, is (usually) biased downward from the
true variance. Computing cluster-robust standard errors is a fix for the latter issue. We illustrate
these issues, initially in the context of a very simple model and then in the following subsection
in a more typical model.
A. A Simple Example
For simplicity, we begin with OLS with a single regressor that is nonstochastic, and
assume no intercept in the model. The results extend to multiple regression with stochastic
regressors.
Let $y_i = \beta x_i + u_i$, $i = 1, \ldots, N$, where $x_i$ is nonstochastic and $\mathrm{E}[u_i] = 0$. The OLS
estimator $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$ can be re-expressed as $\hat{\beta} - \beta = \sum_i x_i u_i / \sum_i x_i^2$, so in general

$$\mathrm{V}[\hat{\beta}] = \mathrm{E}[(\hat{\beta} - \beta)^2] = \mathrm{V}\Big[\sum\nolimits_i x_i u_i\Big] \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2. \qquad (1)$$

If errors are uncorrelated over $i$, then $\mathrm{V}[\sum_i x_i u_i] = \sum_i \mathrm{V}[x_i u_i] = \sum_i x_i^2 \mathrm{V}[u_i]$. In the
simplest case of homoskedastic errors, $\mathrm{V}[u_i] = \sigma^2$ and (1) simplifies to $\mathrm{V}[\hat{\beta}] = \sigma^2 / \sum_i x_i^2$.
If instead errors are heteroskedastic, then (1) becomes
$$\mathrm{V}_{\text{het}}[\hat{\beta}] = \Big(\sum\nolimits_i x_i^2\, \mathrm{E}[u_i^2]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2,$$

using $\mathrm{V}[u_i] = \mathrm{E}[u_i^2]$ since $\mathrm{E}[u_i] = 0$. Implementation seemingly requires consistent
estimates of each of the $N$ error variances $\mathrm{E}[u_i^2]$. In a very influential paper, one that extends
naturally to the clustered setting, White (1980) noted that instead all that is needed is an
estimate of the scalar $\sum_i x_i^2 \mathrm{E}[u_i^2]$, and that one can simply use $\sum_i x_i^2 \hat{u}_i^2$, where
$\hat{u}_i = y_i - \hat{\beta} x_i$ is the OLS residual, provided $N \to \infty$. This leads to the estimated variance

$$\widehat{\mathrm{V}}_{\text{het}}[\hat{\beta}] = \Big(\sum\nolimits_i x_i^2 \hat{u}_i^2\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2.$$

The resulting standard error for $\hat{\beta}$ is often called a robust standard error, though a better, more
precise term is heteroskedastic-robust standard error.
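As an illustration, here is a minimal numpy sketch of this estimate for the simple single-regressor, no-intercept model above; the data-generating process is simulated with our own illustrative values.

```python
# Direct implementation of the heteroskedastic-robust variance estimate
# for the single nonstochastic regressor model with no intercept.
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x = rng.normal(size=N)
u = (1 + np.abs(x)) * rng.normal(size=N)   # heteroskedastic, mean-zero errors
y = 1.0 * x + u

beta_hat = np.sum(x * y) / np.sum(x**2)    # OLS slope
uhat = y - beta_hat * x                    # OLS residuals
V_het = np.sum(x**2 * uhat**2) / np.sum(x**2) ** 2
print(beta_hat, np.sqrt(V_het))            # heteroskedastic-robust standard error
```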
What if errors are correlated over $i$? In the most general case where all errors are
correlated with each other,

$$\mathrm{V}\Big[\sum\nolimits_i x_i u_i\Big] = \sum\nolimits_i \sum\nolimits_j \mathrm{Cov}[x_i u_i,\, x_j u_j] = \sum\nolimits_i \sum\nolimits_j x_i x_j\, \mathrm{E}[u_i u_j],$$

so

$$\mathrm{V}_{\text{cor}}[\hat{\beta}] = \Big(\sum\nolimits_i \sum\nolimits_j x_i x_j\, \mathrm{E}[u_i u_j]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2.$$

The obvious extension of White (1980) is to use $\widehat{\mathrm{V}}[\hat{\beta}] = \big(\sum_i \sum_j x_i x_j \hat{u}_i \hat{u}_j\big)/\big(\sum_i x_i^2\big)^2$, but this
equals zero since $\sum_i x_i \hat{u}_i = 0$. Instead one needs to first set a large fraction of the error
correlations $\mathrm{E}[u_i u_j]$ to zero. For time series data with errors assumed to be correlated only up
to, say, $m$ periods apart as well as heteroskedastic, White’s result can be extended to yield a
heteroskedastic- and autocorrelation-consistent (HAC) variance estimate; see Newey and West
(1987).
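A brief sketch of HAC (Newey-West) standard errors using statsmodels follows; the AR(1) error design and the lag choice are our own illustrative assumptions, not prescriptions.

```python
# HAC (Newey-West) standard errors for time-series errors correlated over
# nearby periods, illustrated with simulated AR(1) errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
T = 500
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.7 * u[t - 1] + rng.normal()   # serially correlated errors
y = 1.0 * x + u

X = sm.add_constant(x)
fit_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(fit_hac.bse)                         # heteroskedastic- and autocorrelation-consistent SEs
```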
In this paper we consider clustered errors, with $\mathrm{E}[u_i u_j] = 0$ unless observations $i$ and $j$
are in the same cluster (such as the same region). Then

$$\mathrm{V}_{\text{clu}}[\hat{\beta}] = \Big(\sum\nolimits_i \sum\nolimits_j x_i x_j\, \mathrm{E}[u_i u_j]\, \mathbf{1}[i,j \text{ in same cluster}]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2, \qquad (2)$$

where the indicator function $\mathbf{1}[A]$ equals 1 if event $A$ happens and equals 0 if event $A$ does
not happen. Provided the number of clusters goes to infinity, we can use the variance estimate

$$\widehat{\mathrm{V}}_{\text{clu}}[\hat{\beta}] = \Big(\sum\nolimits_i \sum\nolimits_j x_i x_j \hat{u}_i \hat{u}_j\, \mathbf{1}[i,j \text{ in same cluster}]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2. \qquad (3)$$

This estimate is called a cluster-robust estimate, though more precisely it is heteroskedastic-
and cluster-robust. This estimate reduces to $\widehat{\mathrm{V}}_{\text{het}}[\hat{\beta}]$ in the special case that there is only one
observation in each cluster.

Typically $\widehat{\mathrm{V}}_{\text{clu}}[\hat{\beta}]$ exceeds $\widehat{\mathrm{V}}_{\text{het}}[\hat{\beta}]$ due to the addition of terms when $i \neq j$. The
amount of increase is larger (1) the more positively associated are the regressors across
observations in the same cluster (via $x_i x_j$ in (3)), (2) the more correlated are the errors (via
$\mathrm{E}[u_i u_j]$ in (2)), and (3) the more observations there are in the same cluster (via $\mathbf{1}[i,j \text{ in same cluster}]$ in (3)).
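For the simple model, the double sum in (3) collapses to a sum over clusters of squared within-cluster terms, which the following numpy sketch exploits; the simulated data and values are our own, purely for illustration.

```python
# Direct implementation of the cluster-robust variance estimate (3) for the
# single-regressor model, computed cluster by cluster.
import numpy as np

rng = np.random.default_rng(4)
G, m = 50, 20
g = np.repeat(np.arange(G), m)
x = rng.normal(size=G)[g] + rng.normal(size=G * m)   # regressor correlated within cluster
u = rng.normal(size=G)[g] + rng.normal(size=G * m)   # error correlated within cluster
y = 1.0 * x + u

beta_hat = np.sum(x * y) / np.sum(x**2)
uhat = y - beta_hat * x
# sum_i sum_j x_i x_j uhat_i uhat_j 1[i,j in same cluster]
#   = sum_g ( sum_{i in cluster g} x_i uhat_i )^2
s_g = np.array([np.sum(x[g == c] * uhat[g == c]) for c in range(G)])
V_clu = np.sum(s_g**2) / np.sum(x**2) ** 2
V_het = np.sum(x**2 * uhat**2) / np.sum(x**2) ** 2
print(np.sqrt(V_clu), np.sqrt(V_het))                # cluster-robust SE exceeds het-robust here
```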
There are several take-away messages. First, there can be a great loss of efficiency in OLS
estimation if errors are correlated within cluster rather than completely uncorrelated.
Intuitively, if errors are positively correlated within cluster then an additional observation in
the cluster no longer provides a completely independent piece of new information. Second,
failure to control for this within-cluster error correlation can lead to using standard errors that
are too small, with consequent overly-narrow confidence intervals, overly-large t-statistics,
and over-rejection of true null hypotheses. Third, it is straightforward to obtain cluster-robust
standard errors, though they do rely on the assumption that the number of clusters goes to
infinity (see Section VI for the few clusters case).
B. Clustered Errors and Two Leading Examples
Let $i$ denote the $i$th of $N$ individuals in the sample, and $g$ denote the $g$th of $G$
clusters. Then for individual $i$ in cluster $g$ the linear model with (one-way) clustering is

$$y_{ig} = \mathbf{x}_{ig}'\boldsymbol{\beta} + u_{ig}, \qquad (4)$$

where $\mathbf{x}_{ig}$ is a $K \times 1$ vector. As usual it is assumed that $\mathrm{E}[u_{ig}|\mathbf{x}_{ig}] = 0$. The key assumption
is that errors are uncorrelated across clusters, while errors for individuals belonging to the same
cluster may be correlated. Thus

$$\mathrm{E}[u_{ig} u_{jg'} \,|\, \mathbf{x}_{ig}, \mathbf{x}_{jg'}] = 0, \quad \text{unless } g = g'. \qquad (5)$$
1. Example 1: Individuals in Cluster
Hersch (1998) uses cross-section individual-level data to estimate the impact of job
injury risk on wages. Since there is no individual-level data on job injury rate, a more
aggregated measure such as job injury risk in the individual’s industry is used as a regressor.
Then for individual $i$ (with $N = 5960$) in industry $g$ (with $G = 211$),

$$y_{ig} = \beta \times x_g + \mathbf{z}_{ig}'\boldsymbol{\gamma} + u_{ig}.$$

The regressor $x_g$ is perfectly correlated within industry. The error term will be
positively correlated within industry if the model systematically overpredicts (or
underpredicts) wages in a given industry. In this case default OLS standard errors will be
downward biased.
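The sketch below mimics this design with simulated data: the regressor is constant within each "industry" cluster, and the default and cluster-robust standard errors diverge sharply. The cluster counts and error magnitudes are our own illustrative choices, not Hersch's (1998) data.

```python
# A Hersch-type design: the regressor is constant within cluster, so default
# OLS standard errors are badly downward biased relative to cluster-robust ones.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
G, m = 211, 28                                 # roughly 211 industries, 28 workers each
g = np.repeat(np.arange(G), m)
x_g = rng.normal(size=G)[g]                    # industry-level regressor, constant within cluster
u = 0.5 * rng.normal(size=G)[g] + rng.normal(size=G * m)
y = 1.0 * x_g + u

X = sm.add_constant(x_g)
fit_default = sm.OLS(y, X).fit()
fit_cluster = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": g})
print(fit_default.bse[1], fit_cluster.bse[1])  # default SE understates the true variability
```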
References

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics 119(1): 249-275.

Liang, Kung-Yee, and Scott L. Zeger. 1986. "Longitudinal Data Analysis Using Generalized Linear Models." Biometrika 73(1): 13-22.

Newey, Whitney K., and Kenneth D. West. 1987. "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55(3): 703-708.

White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48(4): 817-838.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
