Journal Article

The Meaning of Failed Replications: A Review and Proposal

TL;DR: The author proposes an unambiguous definition of replication, contrasts it with decades of unsuccessful attempts to standardize terminology, argues that many prominent results described as replication tests should not be described as such, and calls on professional associations to formally adopt this definition, thereby improving incentives for researchers to conduct more and better replication tests.
Abstract: Economists are increasingly using publicly shared data and code to check each other’s work, an exercise often called ‘replication’ testing. But this much-needed trend has not been accompanied by a consensus about what ‘replication’ means. If a follow-up study does not ‘replicate’ an original result, according to current usage of the term, this can mean anything from an unremarkable disagreement over methods to scientific incompetence or misconduct. This paper proposes an unambiguous definition of replication. Many social scientists already use the term in the way suggested here, but many more do not. The paper contrasts this definition with decades of unsuccessful attempts to standardize terminology, and argues that many prominent results described as replication tests should not be described as such. It argues that professional associations should formally adopt this definition, thereby improving incentives for researchers to conduct more and better replication tests.

Summary (3 min read)

Introduction

  • The paper contrasts this definition with decades of unsuccessful attempts to standardize terminology, and argues that many prominent results described as replication tests should not be described as such.

1 The problem

  • But economics and other social sciences have yet to clearly define what a replication is.
  • Thus if a replication test gives discrepant results, under current usage of the term, this could mean a wide spectrum of things—from signaling a legitimate disagreement over the best methods (science), to signaling incompetence and fraud (pseudoscience).
  • It shows that usage compatible with the proposed definition is already widespread in the literature, but so is usage that is incompatible.
  • It then reviews decades of attempts to resolve this conceptual confusion across the social sciences, and shows how the terminology proposed here addresses the problem.
  • The remaining sections argue that the proposed definition creates better incentives for researchers, and apply this definition to classify many recent and prominent critique papers in labor economics, development economics, and other subfields.

2 A proposal to define replication and robustness

  • Consider the proposed definitions of replication and robustness tests in Table 1.
  • They are distinguished by whether or not the follow-up test should give, in expectation, exactly the same quantitative result (formalized in the sketch below).
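
A minimal way to state this criterion formally (the notation is mine, not the paper's), as a LaTeX sketch:

    % \hat{\theta}_0: the original study's estimate; \hat{\theta}_1: the
    % follow-up estimate. The follow-up qualifies as a replication test only
    % if both estimators are drawn from the same sampling distribution, so
    % that in expectation they coincide:
    \mathbb{E}[\hat{\theta}_1] = \mathbb{E}[\hat{\theta}_0]
    % A robustness test changes the sampling distribution (new population,
    % new specification, or a restricted sample), so this equality need not
    % hold, and a discrepancy is not evidence of error in the original study.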

2.1 What sets a replication apart

  • A replication test estimates parameters drawn from the same sampling distribution as those in the original study.
  • This form of replication can remedy sampling error or low power, in addition to the errors addressed by a verification test.
  • A robustness test estimates parameters drawn from a different sampling distribution from those in the original study.
  • This includes dropping influential observations, since a truncated sample cannot represent the same population (see the simulation sketch after this list).
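
To make the distinction concrete, here is a toy simulation in Python (my own sketch, not code from the paper; the data-generating process and all numbers are hypothetical). A replication re-samples the same population, so the new estimate agrees with the original in expectation; dropping influential observations truncates the sample, which changes the population and hence the quantity being estimated:

    # Toy contrast of replication vs. robustness (hypothetical DGP, not the
    # paper's). The cubic term makes the best-linear-predictor slope a
    # property of the population, so truncation shifts the estimand.
    import numpy as np

    rng = np.random.default_rng(0)

    def draw_sample(n=100_000):
        # Hypothetical population: y = 2x + 0.5x^3 + noise, with x ~ N(0, 1)
        x = rng.normal(size=n)
        y = 2.0 * x + 0.5 * x**3 + rng.normal(size=n)
        return x, y

    def ols_slope(x, y):
        # Slope of the bivariate OLS regression of y on x
        return np.cov(x, y, bias=True)[0, 1] / np.var(x)

    x0, y0 = draw_sample()  # "original study"
    x1, y1 = draw_sample()  # fresh draw from the SAME population

    # Replication test: same sampling distribution, so both estimates
    # approach the same population slope (3.5 for this process).
    print(ols_slope(x0, y0), ols_slope(x1, y1))

    # Robustness test: dropping influential observations (|x| >= 1)
    # truncates the sample to a different population whose slope is
    # roughly 2.3, so disagreement with 3.5 is not evidence of error.
    keep = np.abs(x1) < 1.0
    print(ols_slope(x1[keep], y1[keep]))

The cubic term is deliberate: it makes the slope of the best linear fit depend on which population is sampled, so the truncated sample estimates a genuinely different quantity than the full one.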

2.2 Examples

  • Restricting the term replication to this meaning fits the intuitive meaning that social science borrows from natural science.
  • Results from a robustness test that are discrepant from the original should not be described as a failure to replicate; there is no reason such tests should yield identical results in expectation.
  • This remains true if the new results are “qualitatively” different, such as rejecting the null in city A but failing to reject in city B, or getting a different sign with and without including an interaction term.
  • Thus declaring a failure to replicate requires demonstrating that the new estimate should be quantitatively identical to the old result in expectation.
  • If confounders can change quickly relative to the gap between the two studies, the population in the follow-up is materially different; the sampling distribution for the new estimates is not the same, and the follow-up study is a robustness test (an extension to new data).

3.1 Usage compatible with this proposal is widespread

  • Many economics journals already endorse the key goal of Table 1: restricting the meaning of the word replication to a sense that does not include robustness tests.
  • Authors in the American Economic Review “are expected to send their data, programs, and sufficient details to permit replication.”
  • Authors of experimental papers must provide “sufficient explanation to make it possible to use the submitted computer programs to replicate the data analysis”.
  • This usage of the term replication is strictly incompatible with a meaning that includes new regression specifications or new data, since the authors of the original paper could not logically be required to “explain” to other authors in what ways they should modify the original work.
  • Easterly et al. (2004) revisit the results of Burnside and Dollar (2000) with new data, and describe their inquiry as a “robustness” test, not a replication test.

3.2 Incompatible usage is also widespread

  • Many journals and organizations work with a definition that is irreconcilable with Table 1 and the usage in subsection 3.1.
  • That is, they define replication so that follow-up studies can fail to replicate an original paper’s findings even when the original study’s code and data are correct and reproduce that study’s results precisely.
  • Pesaran’s (2003) editorial policy for the Journal of Applied Econometrics considers “replication” to include testing “if the substantive empirical finding of the paper can be replicated using data from other periods, countries, regions, or other entities as appropriate.”
  • Numerous researchers also work with a different definition than that proposed in Table 1. Johnson et al. (2013) run other studies’ regressions on a new, extended dataset unavailable to the previous authors and describe this inquiry as “replication”.

3.3 Past attempts at a definition have not worked

  • The social science literature recognizes this confusion but has not resolved it.
  • The literature is chronically afflicted with attempts to define replication.
  • There are decades of attempts, across the social sciences, to distinguish two things: studies that revisit an earlier paper by strictly reproducing its findings with the same data and methods it describes, and studies that revisit those findings by changing the data and/or methods.
  • Thus if a ‘replication’ study finds a different result, that could mean that the study used identical data and methods or completely different data and methods.
  • Third, Table 2 shows not just a range of blurry meanings, but strictly incompatible meanings.

4 Why this unambiguous distinction is needed

  • This is because the two concepts carry sharply different normative messages about the original research, with consequences for the field as a whole.
  • At worst, failed replications are linked to “fraud” (Trikalinos et al. 2008) and “doubts about scientific ethics” (Furman et al. 2012).
  • Robustness tests often speak of “plausible” alterations to regression specifications, but the original specifications can seem just as “plausible” to another competent researcher.
  • I do not criticize these reactions, but consider them inevitable sequelae of the field’s confused terminology.
  • Second, confusion in the meaning of replication harms research by creating perverse incentives for those conducting replication and robustness checks.

5 Most prominent critiques are not replications

  • The definitions proposed in Table 1 would clarify the nature of scientific critiques in the literature.
  • Égert (2013), in contrast, presents a robustness test of the same paper, with reanalysis and extension: he uses alternative estimators and data to challenge choices in Reinhart and Rogoff (2010) that are legitimately disputable.
  • The rest, a large majority, are robustness tests.
  • And of those, 59% are reanalyses with new methods, 27% are extensions to new data, and 14% carry out both reanalysis and extension.
  • Most of these papers do not deserve the vague associations with incompetence or ethics that accompany failed “replications”, but most of them could receive that label in the absence of terminology that clearly distinguishes the two types of critiques.

6 Replication yesterday and tomorrow

  • “[T]he replication standard is extremely important to the further development of the discipline,” writes King (1995).
  • The meaning of replication needs to be standardized just as the meaning of “statistically significant” once was.
  • “Mindlessly taking the exact same data and checking to see if an author ‘made a mistake’ is not a useful activity in the social sciences,” writes Hamermesh (1997).
  • In computational science, “[m]ost of the work in a modern research project is hidden in computational scripts that go to produce the reported results,” writes Donoho (2010).
  • Standardization requires leadership—in this case by professional associations and by institutions championing the noble work of replication.


Clemens, Michael A. (Working Paper)
The Meaning of Failed Replications: A Review and Proposal
IZA Discussion Papers, No. 9000
Provided in Cooperation with: IZA – Institute of Labor Economics
Suggested Citation: Clemens, Michael A. (2015): The Meaning of Failed Replications: A Review and Proposal, IZA Discussion Papers, No. 9000, Institute for the Study of Labor (IZA), Bonn.
This version is available at: http://hdl.handle.net/10419/110735


IZA Discussion Paper No. 9000
April 2015
ABSTRACT
The Meaning of Failed Replications: A Review and Proposal*
The welcome rise of replication tests in economics has not been accompanied by a single,
clear definition of replication. A discrepant replication, in current usage of the term, can signal
anything from an unremarkable disagreement over methods to scientific incompetence or
misconduct. This paper proposes an unambiguous definition of replication, one that reflects
currently common but unstandardized use. It contrasts this definition with decades of
unsuccessful attempts to standardize terminology, and argues that many prominent results
described as replication tests in labor, development, and other fields of economics should
not be described as such. Adopting this definition can improve incentives for researchers,
encouraging more and better replication tests.
NON-TECHNICAL SUMMARY
Economists are increasingly using publicly shared data and code to check each other’s work,
an exercise often called ‘replication’ testing. But this much-needed trend has not been
accompanied by a consensus about what ‘replication’ means. Many follow-up studies on
influential papers in labor economics and other fields have been unable to replicate the
original result. But according to current usage of the term, this can mean anything from an
unremarkable disagreement over methods to scientific incompetence or misconduct.
This paper proposes an unambiguous definition of replication. Many social scientists already
use the term in the way suggested here, but many more do not. The paper contrasts this
definition with decades of unsuccessful attempts to standardize terminology, and argues that
many prominent results described as replication tests should not be described as such. It
argues that professional associations should formally adopt this definition, thereby improving
incentives for researchers to conduct more and better replication tests.
JEL Classification: B40, C18, C80
Keywords: replication, robustness, transparency, open data, ethics, reproducible,
replicate, misconduct, fraud, error, code, registry
Corresponding author:
Michael A. Clemens
Center for Global Development
2055 L Street NW, 5th floor
Washington, DC 20036
USA
E-mail: mclemens@cgdev.org
* This research was generously supported by Good Ventures. Chris Blattman, Annette Brown, Angus
Deaton, Gabriel Demombynes, Stefan Dercon, John Hoddinott, Macartan Humphreys, Stephan
Klasen, Ted Miguel, Emily Oster, Justin Sandefur, Bill Savedoff, and Ben Wood provided helpful
comments. All viewpoints and any errors are the sole responsibility of the author and do not represent
CGD, its Board of Directors, or its funders.

1 The problem
Social science is benefiting from a surge of interest in subjecting published research to replication tests. But economics and other social sciences have yet to clearly define what a replication is. Thus if a replication test gives discrepant results, under current usage of the term, this could mean a wide spectrum of things—from signaling a legitimate disagreement over the best methods (science), to signaling incompetence and fraud (pseudoscience). Terminology that lumps together fundamentally different things impedes scientific progress, hobbling researchers with fruitless debates and poor incentives.
This paper argues that the movement for replication in social science will become stronger with clear terminology. It begins by proposing an unambiguous definition of replication. It shows that usage compatible with the proposed definition is already widespread in the literature, but so is usage that is incompatible. It then reviews decades of attempts to resolve this conceptual confusion across the social sciences, and shows how the terminology proposed here addresses the problem. The remaining sections argue that the proposed definition creates better incentives for researchers, and apply this definition to classify many recent and prominent critique papers in labor economics, development economics, and other subfields. It concludes by arguing that the need for this terminology arises from a generational shift in how empirical social science is conducted.
2 A proposal to define replication and robustness
Consider the proposed definitions of replication and robustness tests in Table 1. They are distinguished by whether or not the follow-up test should give, in expectation, exactly the same quantitative result.

Citations
Journal Article
TL;DR: In this paper, the authors exploit a new multi-country historical dataset on public (government) debt to search for a systemic relationship between high public debt levels, growth, and inflation. Their main result is that whereas the link between growth and debt seems relatively weak at “normal” debt levels, median growth rates for countries with debt over roughly ninety percent of GDP are about one percent lower than otherwise; average growth rates are several percent lower.
Abstract: In this paper, we exploit a new multi-country historical dataset on public (government) debt to search for a systemic relationship between high public debt levels, growth, and inflation. Our main result is that whereas the link between growth and debt seems relatively weak at “normal” debt levels, median growth rates for countries with debt over roughly ninety percent of GDP are about one percent lower than otherwise; average (mean) growth rates are several percent lower.

392 citations

Posted Content
TL;DR: Two approaches for identifying the conditional probability of publication as a function of a study’s results are proposed, the first based on systematic replication studies and the second based on meta-studies.
Abstract: Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study’s results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for selective publication. We apply our methods to recent large-scale replication studies in experimental economics and psychology, and to meta-studies of the effects of minimum wages and de-worming programs.

64 citations

Posted Content
Abstract: We report on various aspects of replication research in economics. Our report includes (i) a brief history of data sharing and replication; (ii) the results of a survey administered to the editors of all 333 economics journals listed in Web of Science; (iii) an analysis of 162 replication studies that have been published in peer-reviewed economics journals from 1977–2014; (iv) a discussion of the future of replication research in economics; and (v) observations on how replications can be better integrated into research efforts to address problems associated with publication bias and other Type I error phenomena. This paper is part of an ongoing project which includes the website replicationnetwork.com, which provides additional, regularly updated information on replications in economics.

45 citations

Book Chapter
01 Jan 2017
TL;DR: A number of critical innovations spurred the rapid expansion in the use of field experiments by academics as mentioned in this paper. But as researchers got more involved in the design and implementation of the interventions they tested, new ethical issues arose.
Abstract: A number of critical innovations spurred the rapid expansion in the use of field experiments by academics. Some of these were econometric but many were intensely practical. Researchers learned how to work with a wide range of implementing organizations from small, local nongovernmental organizations to large government bureaucracies. They improved data collection techniques and switched to digital data collection. As researchers got more involved in the design and implementation of the interventions they tested, new ethical issues arose. Finally, the dramatic rise in the use of experiments increased the benefits associated with research transparency. This chapter records some of these practical innovations. It focuses on how to select and effectively work with the organization running an intervention which is being evaluated; ways to minimize attrition, monitor enumerators, and ensure data are collected consistently in treatment and comparison areas; practical ethical issues such as when to start the ethics approval process; and research transparency, including how to prevent publication bias and data mining and the role of experimental registries, preanalysis plans, data publication reanalysis, and replication efforts.

34 citations

Posted Content
TL;DR: The authors reviewed the research on the impacts of incarceration on crime and concluded that, at typical policy margins in the United States today, decarceration has zero net impact on crime outside of prison, while imprisoning people temporarily stops them from committing crime outside prison walls, it also tends to increase their criminality after release.
Abstract: This paper reviews the research on the impacts of incarceration on crime. Where data availability permits, reviewed studies are replicated and reanalyzed. Among three dozen studies I reviewed, I obtained or reconstructed the data and code for eight. Replication and reanalysis revealed significant methodological concerns in seven and led to major reinterpretations of four. I estimate that, at typical policy margins in the United States today, decarceration has zero net impact on crime outside of prison. That estimate is uncertain, but at least as much evidence suggests that decarceration reduces crime as increases it. The crux of the matter is that tougher sentences hardly deter crime, and that while imprisoning people temporarily stops them from committing crime outside prison walls, it also tends to increase their criminality after release. As a result, "tough-on-crime" initiatives can reduce crime in the short run but cause offsetting harm in the long run. A cost-benefit analysis finds that even under a devil's advocate reading of this evidence, in which incarceration does reduce crime in U.S., it is unlikely to increase aggregate welfare.

15 citations

References
Posted Content
TL;DR: This article examined whether the Solow growth model is consistent with the international variation in the standard of living and showed that an augmented Solow model that includes accumulation of human as well as physical capital provides an excellent description of the cross-country data.
Abstract: This paper examines whether the Solow growth model is consistent with the international variation in the standard of living. It shows that an augmented Solow model that includes accumulation of human as well as physical capital provides an excellent description of the cross-country data. The model explains about 80 percent of the international variation in income per capita, and the estimated influences of physical-capital accumulation, human-capital accumulation, and population growth confirm the model's predictions. The paper also examines the implications of the Solow model for convergence in standards of living -- that is, for whether poor countries tend to grow faster than rich countries. The evidence indicates that, holding population growth and capital accumulation constant, countries converge at about the rate the augmented Solow model predicts.

432 citations

Posted Content
TL;DR: The authors found that published extensions typically produced results that conflicted with the original studies; of the 20 extensions published, 12 conflicted with earlier results, and only 3 provided full confirmation, and they consumed 1.1% of the journal space.
Abstract: Replication is rare in marketing. Of 1,120 papers sampled from three major marketing journals, none were replications. Only 1.8% of the papers were extensions, and they consumed 1.1% of the journal space. On average, these extensions appeared seven years after the original study. The publication rate for such works has been decreasing since the 1970s. Published extensions typically produced results that conflicted with the original studies; of the 20 extensions published, 12 conflicted with the earlier results, and only 3 provided full confirmation. Published replications do not attract as many citations after publication as do the original studies, even when the results fail to support the original studies.

372 citations

Posted Content
TL;DR: The authors analyzed the effectiveness of foreign aid programs to gain insights into political regimes in aid recipient countries and found that the impact of aid does not vary according to whether recipient governments are liberal democratic or highly repressive.
Abstract: Critics of foreign aid programs have long argued that poverty reflects government failure. In this paper I analyze the effectiveness of foreign aid programs to gain insights into political regimes in aid recipient countries. My analytical framework shows how three stylized political/economic regimes labeled egalitarian, elitist and laissez-faire would use foreign aid. I then test reduced form equations using data on nonmilitary aid flows to 96 countries. I find that models of elitist political regimes best predict the impact of foreign aid. Aid does not significantly increase investment and growth, nor benefit the poor as measured by improvements in human development indicators, but it does increase the size of government. I also find that the impact of aid does not vary according to whether recipient governments are liberal democratic or highly repressive. But liberal political regimes and democracies, ceteris paribus, have on average 30% lower infant mortality than the least free regimes. This may be due to greater empowerment of the poor under liberal regimes even though the political elite continues to receive the benefits of aid programs. An implication is that short term aid targeted to support new liberal regimes may be a more successful means of reducing poverty than current programs.

249 citations

Journal Article
TL;DR: The authors showed that the Penn World Table (PWT) GDP estimates vary substantially across different versions of the PWT despite being derived from very similar underlying data and using almost identical methodologies.
Abstract: This paper sheds light on two problems in the Penn World Table (PWT) GDP estimates. First, we show that these estimates vary substantially across different versions of the PWT despite being derived from very similar underlying data and using almost identical methodologies; that this variability is systematic; and that it is intrinsic to the methodology deployed by the PWT to estimate growth rates. Moreover, this variability matters for the cross-country growth literature. While growth studies that use low-frequency data remain robust to data revisions, studies that use annual data are less robust. Second, the PWT methodology leads to GDP estimates that are not valued at purchasing power parity (PPP) prices. This is surprising because the raison d’être of the PWT is to adjust national estimates of GDP by valuing output at common international (PPP) prices so that the resulting PPP-adjusted estimates of GDP are comparable across countries. We propose an approach to address these two problems of variability and valuation.

244 citations


"The Meaning of Failed Replications:..." refers background in this paper

  • ...Johnson et al. (2013) run other studies’ regressions on a new, extended dataset unavailable to the previous authors and describe this inquiry as “replication”....

    [...]

Book
21 Jul 1994
TL;DR: In this paper, the authors have created a general framework of analysis which fully integrates macroeconomic theory with a detailed look at the microeconomic workings of the labour market, and illuminated by up-to-the-minute empirical evidence relating to all OECD countries.
Abstract: The authors have fully revised and updated part of their 1991 book - "Unemployment: Macroeconomic Performance and the Labour Market" - to create a shorter, accessible undergraduate textbook on unemployment. The authors question the inevitability of present levels of unemployment in the Western world, and the view that a trade-off exists between price stability and unemployment levels. Students are presented with an explanation of the reasons for unemployment, the existence of an average level, and the reasons that unemployment levels often fluctuate. The authors have created a general framework of analysis which fully integrates macroeconomic theory with a detailed look at the microeconomic workings of the labour market. This is illuminated by up-to-the-minute empirical evidence relating to all OECD countries. This book also incorporates the latest theoretical thinking on topics such as insider-outsider theories, and hysteresis in labour markets, as well as revealing the role of factors such as union bargaining, efficiency wages and labour mobility. The final section weighs up various governmental practices to combat unemployment, and reveals the different institutions and recent experiences of OECD countries.

219 citations


"The Meaning of Failed Replications:..." refers background in this paper

  • ...Dority and Fuess (2007) alter both the specifications and dataset of Layard et al. (1994) and describe this exercise as “replication”....

    [...]

Frequently Asked Questions (2)
Q1. What are the contributions in “The Meaning of Failed Replications: A Review and Proposal”?

This paper proposes an unambiguous definition of replication, one that reflects currently common but unstandardized use. Many social scientists already use the term in the way suggested here, but many more do not. The paper contrasts this definition with decades of unsuccessful attempts to standardize terminology, and argues that many prominent results described as replication tests should not be described as such. It further argues that professional associations should formally adopt this definition, thereby improving incentives for researchers to conduct more and better replication tests.

Replication tests include: fixing coding errors so that the code does exactly what the original paper describes (verification), having the same sample of students take the same exam again to remedy measurement error (verification), and re-sampling the same population to remedy sampling error or low power with otherwise identical methods (reproduction).