Journal ArticleDOI

Estimating the reproducibility of psychological science

28 Aug 2015-Science (American Association for the Advancement of Science)-Vol. 349, Iss: 6251, pp 943-943
TL;DR: A large-scale replication of 100 studies suggests that the reproducibility of psychological science is lower than expected, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Abstract: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
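One reported metric, whether the original effect size falls inside the 95% confidence interval of the replication effect, can be illustrated for a single study pair. The sketch below is purely illustrative (the correlations and sample size are hypothetical, not values from the project) and uses Fisher's z transformation, the scale on which correlational effect sizes are typically compared.

```r
# Illustrative sketch only: does an original correlation fall inside the
# replication's 95% CI? Values are hypothetical, not data from the project.
r_orig <- 0.40   # original effect size (correlation)
r_rep  <- 0.20   # replication effect size
n_rep  <- 120    # replication sample size

z_rep  <- atanh(r_rep)          # Fisher z transform
se_rep <- 1 / sqrt(n_rep - 3)   # standard error of Fisher z
ci_r   <- tanh(z_rep + c(-1, 1) * qnorm(0.975) * se_rep)  # CI back on the r scale

covered <- r_orig >= ci_r[1] && r_orig <= ci_r[2]
cat(sprintf("Replication 95%% CI: [%.2f, %.2f]; original effect covered: %s\n",
            ci_r[1], ci_r[2], covered))
```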


Citations
Journal ArticleDOI
26 May 2016-Nature

2,609 citations

Journal ArticleDOI
Daniel J. Benjamin1, James O. Berger2, Magnus Johannesson1, Magnus Johannesson3, Brian A. Nosek4, Brian A. Nosek5, Eric-Jan Wagenmakers6, Richard A. Berk7, Kenneth A. Bollen8, Björn Brembs9, Lawrence D. Brown7, Colin F. Camerer10, David Cesarini11, David Cesarini12, Christopher D. Chambers13, Merlise A. Clyde2, Thomas D. Cook14, Thomas D. Cook15, Paul De Boeck16, Zoltan Dienes17, Anna Dreber3, Kenny Easwaran18, Charles Efferson19, Ernst Fehr20, Fiona Fidler21, Andy P. Field17, Malcolm R. Forster22, Edward I. George7, Richard Gonzalez23, Steven N. Goodman24, Edwin J. Green25, Donald P. Green26, Anthony G. Greenwald27, Jarrod D. Hadfield28, Larry V. Hedges15, Leonhard Held20, Teck-Hua Ho29, Herbert Hoijtink30, Daniel J. Hruschka31, Kosuke Imai32, Guido W. Imbens24, John P. A. Ioannidis24, Minjeong Jeon33, James Holland Jones34, Michael Kirchler35, David Laibson36, John A. List37, Roderick J. A. Little23, Arthur Lupia23, Edouard Machery38, Scott E. Maxwell39, Michael A. McCarthy21, Don A. Moore40, Stephen L. Morgan41, Marcus R. Munafò42, Shinichi Nakagawa43, Brendan Nyhan44, Timothy H. Parker45, Luis R. Pericchi46, Marco Perugini47, Jeffrey N. Rouder48, Judith Rousseau49, Victoria Savalei50, Felix D. Schönbrodt51, Thomas Sellke52, Betsy Sinclair53, Dustin Tingley36, Trisha Van Zandt16, Simine Vazire54, Duncan J. Watts55, Christopher Winship36, Robert L. Wolpert2, Yu Xie32, Cristobal Young24, Jonathan Zinman44, Valen E. Johnson18, Valen E. Johnson1 
University of Southern California1, Duke University2, Stockholm School of Economics3, University of Virginia4, Center for Open Science5, University of Amsterdam6, University of Pennsylvania7, University of North Carolina at Chapel Hill8, University of Regensburg9, California Institute of Technology10, New York University11, Research Institute of Industrial Economics12, Cardiff University13, Mathematica Policy Research14, Northwestern University15, Ohio State University16, University of Sussex17, Texas A&M University18, Royal Holloway, University of London19, University of Zurich20, University of Melbourne21, University of Wisconsin-Madison22, University of Michigan23, Stanford University24, Rutgers University25, Columbia University26, University of Washington27, University of Edinburgh28, National University of Singapore29, Utrecht University30, Arizona State University31, Princeton University32, University of California, Los Angeles33, Imperial College London34, University of Innsbruck35, Harvard University36, University of Chicago37, University of Pittsburgh38, University of Notre Dame39, University of California, Berkeley40, Johns Hopkins University41, University of Bristol42, University of New South Wales43, Dartmouth College44, Whitman College45, University of Puerto Rico46, University of Milan47, University of California, Irvine48, Paris Dauphine University49, University of British Columbia50, Ludwig Maximilian University of Munich51, Purdue University52, Washington University in St. Louis53, University of California, Davis54, Microsoft55
TL;DR: The default P-value threshold for statistical significance is proposed to be changed from 0.05 to 0.005 for claims of new discoveries, in order to reduce the rate of false positive findings.
Abstract: We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.
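One practical consequence of the proposal is the larger sample needed to hold power constant at the stricter threshold. The sketch below is a rough illustration using base R's power.t.test; the assumed effect size (d = 0.4) and the 80% power target are hypothetical choices, not figures from the article.

```r
# Illustrative sketch: sample size per group for a two-sample t-test with
# 80% power at alpha = .05 versus the proposed alpha = .005.
# The effect size d = 0.4 is an arbitrary assumption.
n_05  <- power.t.test(delta = 0.4, sd = 1, sig.level = 0.05,
                      power = 0.80, type = "two.sample")$n
n_005 <- power.t.test(delta = 0.4, sd = 1, sig.level = 0.005,
                      power = 0.80, type = "two.sample")$n
cat(sprintf("n per group: %.0f at alpha = .05 vs %.0f at alpha = .005\n",
            ceiling(n_05), ceiling(n_005)))
```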

1,586 citations

Journal ArticleDOI
TL;DR: In this article, the authors introduce the current state-of-the-art of network estimation and propose two novel statistical methods: the correlation stability coefficient and the bootstrapped difference test for edge-weights and centrality indices.
Abstract: The usage of psychological networks that conceptualize behavior as a complex interplay of psychological and other components has gained increasing popularity in various research fields. While prior publications have tackled the topics of estimating and interpreting such networks, little work has been conducted to check how accurate (i.e., prone to sampling variation) networks are estimated, and how stable (i.e., interpretation remains similar with less observations) inferences from the network structure (such as centrality indices) are. In this tutorial paper, we aim to introduce the reader to this field and tackle the problem of accuracy under sampling variation. We first introduce the current state-of-the-art of network estimation. Second, we provide a rationale why researchers should investigate the accuracy of psychological networks. Third, we describe how bootstrap routines can be used to (A) assess the accuracy of estimated network connections, (B) investigate the stability of centrality indices, and (C) test whether network connections and centrality estimates for different variables differ from each other. We introduce two novel statistical methods: for (B) the correlation stability coefficient, and for (C) the bootstrapped difference test for edge-weights and centrality indices. We conducted and present simulation studies to assess the performance of both methods. Finally, we developed the free R-package bootnet that allows for estimating psychological networks in a generalized framework in addition to the proposed bootstrap methods. We showcase bootnet in a tutorial, accompanied by R syntax, in which we analyze a dataset of 359 women with posttraumatic stress disorder available online.
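For readers who want to see what the described workflow looks like in code, here is a minimal sketch using the bootnet package; the data frame name mydata is a placeholder, and the settings are illustrative rather than the tutorial's exact choices.

```r
# Minimal sketch of the bootstrap workflow described above (illustrative
# settings; "mydata" is a placeholder data frame of item responses).
library(bootnet)

net <- estimateNetwork(mydata, default = "EBICglasso")  # estimate the network

# (A) Accuracy of edge weights: nonparametric bootstrap
edge_boot <- bootnet(net, nBoots = 1000)
plot(edge_boot, labels = FALSE, order = "sample")

# (B) Stability of centrality indices: case-dropping bootstrap + CS-coefficient
case_boot <- bootnet(net, nBoots = 1000, type = "case")
corStability(case_boot)

# (C) Bootstrapped difference test for edge weights
plot(edge_boot, "edge", plot = "difference")
```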

1,584 citations

Posted Content
TL;DR: This article proposes changing the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.
Abstract: We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.

1,415 citations

References
Journal ArticleDOI
TL;DR: The metafor package provides functions for conducting meta-analyses in R and includes functions for fitting the meta-analytic fixed- and random-effects models and allows for the inclusion of moderator variables (study-level covariates) in these models.
Abstract: The metafor package provides functions for conducting meta-analyses in R. The package includes functions for fitting the meta-analytic fixed- and random-effects models and allows for the inclusion of moderator variables (study-level covariates) in these models. Meta-regression analyses with continuous and categorical moderators can be conducted in this way. Functions for the Mantel-Haenszel and Peto's one-step method for meta-analyses of 2 x 2 table data are also available. Finally, the package provides various plot functions (for example, for forest, funnel, and radial plots) and functions for assessing the model fit, for obtaining case diagnostics, and for tests of publication bias.
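A minimal sketch of the kind of analysis described in the methods excerpt quoted below, a fixed-effect meta-analysis of Fisher-transformed correlations with metafor, might look as follows; the two-study data frame is hypothetical, not the project's data.

```r
# Illustrative sketch: fixed-effect meta-analysis of an original/replication
# pair on the Fisher-z scale with metafor. Correlations and sample sizes
# are hypothetical.
library(metafor)

dat <- data.frame(study = c("original", "replication"),
                  ri    = c(0.40, 0.20),   # observed correlations
                  ni    = c(80, 120))      # sample sizes

dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)  # Fisher z + variances
res <- rma(yi, vi, data = dat, method = "FE")                  # fixed-effect model
summary(res)
predict(res, transf = transf.ztor)  # pooled estimate back-transformed to r
```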

11,237 citations


"Estimating the reproducibility of p..." refers methods in this paper

  • ...We conducted fixed-effect meta-analyses using the R package metafor (27) on Fisher-transformed correlations for all study-pairs in subset MA and on study-pairs with the odds ratio as the dependent variable....


Journal ArticleDOI
TL;DR: Quantitative procedures for computing the tolerance for filed and future null results are reported and illustrated, and the implications are discussed.
Abstract: For any given research area, one cannot tell how many studies have been conducted but never reported. The extreme view of the "file drawer problem" is that journals are filled with the 5% of the studies that show Type I errors, while the file drawers are filled with the 95% of the studies that show nonsignificant results. Quantitative procedures for computing the tolerance for filed and future null results are reported and illustrated, and the implications are discussed.
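The "tolerance for filed and future null results" described above corresponds to Rosenthal's fail-safe N, which is available in the metafor package cited earlier; the sketch below uses hypothetical correlations and sample sizes purely for illustration.

```r
# Illustrative sketch: Rosenthal's fail-safe N -- how many unpublished null
# results would be needed to push the combined p-value above .05.
# Correlations and sample sizes are hypothetical.
library(metafor)

dat <- escalc(measure = "ZCOR",
              ri = c(0.35, 0.28, 0.42, 0.31),  # hypothetical correlations
              ni = c(60, 85, 40, 120))         # hypothetical sample sizes

fsn(yi, vi, data = dat, type = "Rosenthal")
```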

7,159 citations


"Estimating the reproducibility of p..." refers background in this paper

  • ...inflated effect sizes due to publication, selection, reporting, or other biases (9, 12-23)....


  • ...problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results (12-23)....


Journal ArticleDOI
TL;DR: It is shown that the average statistical power of studies in the neurosciences is very low, and the consequences include overestimates of effect size and low reproducibility of results.
Abstract: A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
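The abstract's point that low power produces overestimated effect sizes (the so-called winner's curse) can be seen in a small simulation: among underpowered studies, only those that happen to overestimate the effect reach significance. The sketch below is illustrative only; the true effect size and per-group sample size are arbitrary assumptions.

```r
# Illustrative simulation: with a true effect of d = 0.3 and only n = 20 per
# group (low power), the studies that reach p < .05 report inflated effects.
set.seed(1)
d_true <- 0.3
n      <- 20
sig_d <- replicate(5000, {
  x <- rnorm(n, mean = 0)
  y <- rnorm(n, mean = d_true)
  d_hat <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d
  if (t.test(y, x)$p.value < 0.05) d_hat else NA
})
mean(sig_d, na.rm = TRUE)   # average d among significant studies: well above 0.3
mean(!is.na(sig_d))         # empirical power at these settings
```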

5,683 citations


"Estimating the reproducibility of p..." refers background in this paper

  • ...inflated effect sizes due to publication, selection, reporting, or other biases (9, 12-23)....


  • ...problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results (12-23)....


  • ...publication bias favoring positive results together produce a literature with upwardly biased effect sizes (14, 16, 32, 33)....


Book
01 Jun 2015
TL;DR: A practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses, together with a detailed overview of the similarities and differences between within- and between-subjects designs.
Abstract: Effect sizes are the most important outcome of empirical studies. Most articles on effect sizes highlight their importance to communicate the practical significance of results. For scientists themselves, effect sizes are most useful because they facilitate cumulative science. Effect sizes can be used to determine the sample size for follow-up studies, or for examining effects across studies. This article aims to provide a practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses. Whereas many articles about effect sizes focus on between-subjects designs and address within-subjects designs only briefly, I provide a detailed overview of the similarities and differences between within- and between-subjects designs. I suggest that some research questions in experimental psychology examine inherently intra-individual effects, which makes effect sizes that incorporate the correlation between measures the best summary of the results. Finally, a supplementary spreadsheet is provided to make it as easy as possible for researchers to incorporate effect size calculations into their workflow.
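As a small illustration of the workflow the primer describes (a sketch, not the article's supplementary spreadsheet), the snippet below computes Cohen's d for a between-subjects comparison from hypothetical data and feeds it into an a priori power analysis.

```r
# Illustrative sketch: Cohen's d for a between-subjects design, then the
# sample size needed to detect it with 80% power. Data are hypothetical.
x <- c(5.1, 6.0, 4.8, 5.6, 6.2, 5.9, 4.9, 5.3)   # group 1
y <- c(6.1, 6.8, 5.9, 6.5, 7.0, 6.3, 6.6, 5.8)   # group 2

sd_pooled <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                    (length(x) + length(y) - 2))
d <- (mean(y) - mean(x)) / sd_pooled              # Cohen's d

power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80)  # n per group
```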

5,374 citations

15 Aug 2006
TL;DR: The author argues that the probability a research claim is true depends on study power, bias, and the pre-study odds of a true relationship, discusses the implications for the conduct and interpretation of research, and suggests that claimed research findings may often be simply accurate measures of the prevailing bias.
Abstract: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser pre-selection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
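The framework the essay describes boils down to a positive predictive value (PPV) calculation: the probability that a claimed finding is true given statistical power, the pre-study odds R of a true relationship, the significance level, and bias u. The sketch below implements that calculation as commonly stated for this framework; the example values of power, R, and u are hypothetical.

```r
# Illustrative sketch: positive predictive value of a claimed finding,
#   PPV = ((1 - beta) * R + u * beta * R) /
#         ((1 - beta) * R + u * beta * R + alpha + u * (1 - alpha)),
# where beta is the Type II error rate, R the pre-study odds, u the bias.
ppv <- function(power, R, alpha = 0.05, u = 0) {
  beta <- 1 - power
  num  <- (1 - beta) * R + u * beta * R
  num / (num + alpha + u * (1 - alpha))
}
ppv(power = 0.80, R = 0.5)             # well-powered test of a plausible hypothesis
ppv(power = 0.20, R = 0.1, u = 0.2)    # underpowered, unlikely hypothesis, some bias
```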

5,003 citations

Trending Questions (1)
Do any factors affect the reproducibility of psychological studies?

The paper mentions that the strength of the original evidence is a better predictor of replication success than characteristics of the original and replication teams.