The Meaning of Failed Replications: A Review and Proposal
Summary
Introduction
- The Institute for the Study of Labor (IZA) in Bonn is a local and virtual international research center and a place of communication between science, politics and business.
- The paper contrasts this definition with decades of unsuccessful attempts to standardize terminology, and argues that many prominent results described as replication tests should not be described as such.
1 The problem
- But economics and other social sciences have yet to clearly define what a replication is.
- Thus if a replication test gives discrepant results, under current usage of the term, this could mean a wide spectrum of things—from signaling a legitimate disagreement over the best methods, to signaling incompetence and fraud.
- It shows that usage compatible with the proposed definition is already widespread in the literature, but so is usage that is incompatible.
- It then reviews decades of attempts to resolve this conceptual confusion across the social sciences, and shows how the terminology proposed here addresses the problem.
- The remaining sections argue that the proposed definition creates better incentives for researchers, and applies this definition to classify many recent and prominent critique papers in labor economics, development economics, and other subfields.
2 A proposal to define replication and robustness
- Consider the proposed definitions of replication and robustness tests in Table 1.
- They are distinguished by whether or not the follow-up test should give, in expectation, exactly the same quantitative result.
2.1 What sets a replication apart
- A replication test estimates parameters drawn from the same sampling distribution as those in the original study.
- This form of replication can remedy sampling error or low power, in addition to the errors addressed by a verification test.
- A robustness test estimates parameters drawn from a different sampling distribution from those in the original study.
- This includes dropping influential observations, since a truncated sample cannot represent the same population.
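The sampling-distribution distinction above can be sketched numerically. In this illustrative Python simulation (the populations, their means, and the sample sizes are invented for the example), a replication re-samples the same population as the original study, so the two estimates share a sampling distribution and should agree in expectation; a robustness test samples a different population, so its estimate need not agree with the original even when every study is carried out flawlessly:

```python
import random

random.seed(0)

def mean_estimate(population, n=500):
    """Estimate a population mean from a random sample of size n."""
    return sum(random.choices(population, k=n)) / n

# Hypothetical outcome data: city A and city B have different true means.
city_a = [random.gauss(2.0, 1.0) for _ in range(10_000)]
city_b = [random.gauss(2.6, 1.0) for _ in range(10_000)]

original = mean_estimate(city_a)     # original study: sample from city A
replication = mean_estimate(city_a)  # replication: re-sample the SAME population
robustness = mean_estimate(city_b)   # robustness: extend to a DIFFERENT population

# The replication estimate is drawn from the original's sampling
# distribution, so the two agree up to sampling error; the robustness
# estimate is drawn from a different one, so a gap is expected.
print(original, replication, robustness)
```

Dropping influential observations works the same way in this sketch: truncating `city_a` before sampling changes the population being estimated, so the resulting estimate belongs on the robustness side of the ledger.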
2.2 Examples
- Restricting the term replication to this meaning fits the intuitive meaning that social science borrows from natural science.
- Results discrepant from the original should not be described as a replication issue; there is no reason these tests should yield identical results in expectation.
- This remains true if the new results are “qualitatively” different, such as rejecting the null in city A but failing to reject in city B, or getting a different sign with and without including an interaction term.
- Thus declaring a failure to replicate requires demonstrating that the new estimate should be quantitatively identical to the old result in expectation.
- If confounders can change quickly over time, the population in the follow-up is materially different; the sampling distribution for the new estimates is not the same, and the follow-up study is a robustness test (an extension to new data).
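When a follow-up does qualify as a replication, the claim that it "failed" amounts to testing whether two estimates that should be quantitatively identical in expectation actually differ by more than sampling error allows. A minimal sketch of that comparison for two independent estimates (the effect sizes and standard errors below are hypothetical):

```python
import math

def replication_z(est_orig, se_orig, est_new, se_new):
    """z-statistic for the null that two independent estimates share
    the same expectation -- a precondition for calling a discrepancy
    a failed replication rather than a robustness result."""
    return (est_new - est_orig) / math.sqrt(se_orig**2 + se_new**2)

# Hypothetical numbers: original effect 0.40 (SE 0.10),
# replication estimate 0.15 (SE 0.08).
z = replication_z(0.40, 0.10, 0.15, 0.08)
print(round(z, 2))  # about -1.95: borderline evidence of a discrepancy
```

If the two estimates are instead drawn from different sampling distributions, as in a robustness test, this statistic has no "failure" interpretation: a large `z` may simply reflect a different population or specification.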
3.1 Usage compatible with this proposal is widespread
- Many economics journals already endorse the key goal of Table 1: restricting the meaning of the word replication to a sense that does not include robustness tests.
- Authors in the American Economic Review “are expected to send their data, programs, and sufficient details to permit replication.”
- Authors of experimental papers must provide “sufficient explanation to make it possible to use the submitted computer programs to replicate the data analysis”.
- This usage of the term replication is strictly incompatible with a meaning that includes new regression specifications or new data, since the authors of the original paper could not logically be required to “explain” to other authors in what ways they should modify the original work.
- Easterly et al. (2004) revisit the results of Burnside and Dollar (2000) with new data, and describe their inquiry as a “robustness” test, not a replication test.
3.2 Incompatible usage is also widespread
- Many journals and organizations work with a definition that is irreconcilable with Table 1 and the usage in subsection 3.1.
- That is, they define replication so that follow-up studies can fail to replicate an original paper’s findings even when the original study’s code and data are correct and reproduce that study’s results precisely.
- Pesaran’s (2003) editorial policy for the Journal of Applied Econometrics considers “replication” to include testing “if the substantive empirical finding of the paper can be replicated using data from other periods, countries, regions, or other entities as appropriate.”
- Numerous researchers also work with a different definition than that proposed in Table 1. Johnson et al. (2013) run other studies’ regressions on a new, extended dataset unavailable to the previous authors and describe this inquiry as “replication”.
3.3 Past attempts at a definition have not worked
- The social science literature recognizes this confusion but has not resolved it.
- The literature is chronically afflicted with attempts to define replication.
- There are decades of attempts, across the social sciences, to distinguish two things: studies that revisit an earlier paper by strictly reproducing its findings with the same data and methods it describes, and studies that revisit those findings by changing the data and/or methods.
- Thus if a ‘replication’ study finds a different result, that could mean that the study used identical data and methods or completely different data and methods.
- Third, Table 2 shows not just a range of blurry meanings, but strictly incompatible meanings.
4 Why this unambiguous distinction is needed
- This is because the two concepts carry sharply different normative messages about the original research, with consequences for the field as a whole.
- At worst, failed replications are linked to “fraud” (Trikalinos et al. 2008) and “doubts about scientific ethics” (Furman et al. 2012).
- Robustness tests often speak of “plausible” alterations to regression specifications, but the original specifications can seem just as “plausible” to another competent researcher.
- To be clear, I do not criticize these reactions, but consider them inevitable sequelae of the field’s confused terminology.
- Second, confusion in the meaning of replication harms research by creating perverse incentives for those conducting replication and robustness checks.
5 Most prominent critiques are not replications
- The definitions proposed in Table 1 would clarify the nature of scientific critiques in the literature.
- Égert (2013), in contrast, presents a robustness test of the same paper, with reanalysis and extension: he uses alternative estimators and data to challenge choices in Reinhart and Rogoff (2010) that are legitimately disputable.
- The rest, a large majority, are robustness tests.
- And of those, 59% are reanalyses with new methods, 27% are extensions to new data, and 14% carry out both reanalysis and extension.
- Most of these papers do not deserve the vague associations with incompetence or ethics that accompany failed “replications”, but most of them could receive that label in the absence of terminology that clearly distinguishes the two types of critiques.
6 Replication yesterday and tomorrow
- “[T]he replication standard is extremely important to the further development of the discipline,” writes King (1995).
- The meaning of replication needs to be standardized just as the meaning of “statistically significant” once was.
- “Mindlessly taking the exact same data and checking to see if an author ‘made a mistake’ is not a useful activity in the social sciences,” writes Hamermesh (1997).
- In computational science, “[m]ost of the work in a modern research project is hidden in computational scripts that go to produce the reported results,” writes Donoho (2010).
- Standardization requires leadership—in this case by professional associations and by institutions championing the noble work of replication.
Frequently Asked Questions
Q. What kinds of tests count as replication tests?
Replication tests comprise verification tests, such as fixing coding errors so that the code does exactly what the original paper describes, or having the same sample of students take the same exam again to remedy measurement error, and reproduction tests, which re-sample the same population with otherwise identical methods to remedy sampling error or low power.