Author
Štěpán Bahník
Other affiliations: Prague College, Academy of Sciences of the Czech Republic, University of Würzburg
Bio: Štěpán Bahník is an academic researcher at the University of Economics, Prague. He has contributed to research on replication (statistics) and anchoring, has an h-index of 13, and has co-authored 42 publications receiving 6,511 citations. His previous affiliations include Prague College and the Academy of Sciences of the Czech Republic.
Papers
Estimating the Reproducibility of Psychological Science
Alexander A. Aarts, Joanna E. Anderson, Christopher J. Anderson, Peter Raymond Attridge, +287 more (116 institutions)
TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Abstract: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
5,532 citations
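To make one of the abstract's replication criteria concrete ("47% of original effect sizes were in the 95% confidence interval of the replication effect size"), here is a minimal Python sketch, not the project's actual analysis code, assuming effects are expressed as correlation coefficients and compared on the Fisher-z scale; the numbers at the end are invented for illustration.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher z-transform of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def original_in_replication_ci(r_orig: float, r_rep: float, n_rep: int) -> bool:
    """True if the original effect lies in the replication's 95% CI."""
    z_rep = fisher_z(r_rep)
    se = 1 / math.sqrt(n_rep - 3)            # standard error of Fisher z
    lo, hi = z_rep - 1.96 * se, z_rep + 1.96 * se
    return lo <= fisher_z(r_orig) <= hi

# Hypothetical numbers: original r = .40, replication r = .20 with n = 120.
print(original_in_replication_ci(0.40, 0.20, 120))  # False -> fails this criterion
```

By this criterion the hypothetical replication fails, even though its effect is in the same direction, which is why the abstract reports this measure separately from statistical significance.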
Investigating Variation in Replicability: A "Many Labs" Replication Project
Institutions: University of Florida, University of Padua, University of Würzburg, Pennsylvania State University, University of Social Sciences and Humanities, Tilburg University, City University of New York, Koç University, University of Michigan, University of Kuala Lumpur, Texas A&M University, San Diego State University, Mount Saint Vincent University, Radboud University Nijmegen, Virginia Commonwealth University, Texas A&M University–Commerce, Loyola University Chicago, Worcester Polytechnic Institute, London School of Economics and Political Science, James Madison University, Occidental College, McDaniel College, Connecticut College, Wilfrid Laurier University, University of Brasília, California State University, Northridge, University of Virginia, Ohio State University, University of Wisconsin-Madison, Ithaca College, Charles University in Prague, Western Kentucky University, Washington and Lee University
TL;DR: The authors tested variation in the replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants and found that replicability depends more on the effect itself than on the sample and setting used to investigate it.
Abstract: Although replication is a central tenet of science, direct replications are rare in psychology. This research tested variation in the replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants. In the aggregate, 10 effects replicated consistently. One effect – imagined contact reducing prejudice – showed weak support for replicability. And two effects – flag priming influencing conservatism and currency priming influencing system justification – did not replicate. We compared whether the conditions such as lab versus online or US versus international sample predicted effect magnitudes. By and large they did not. The results of this small sample of effects suggest that replicability is more dependent on the effect itself than on the sample and setting used to investigate the effect.
767 citations
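The abstract aggregates each effect across 36 samples; a standard way to do this is an inverse-variance-weighted (fixed-effect) mean. The sketch below illustrates that aggregation with made-up per-sample numbers; it is not the paper's analysis pipeline.

```python
import math

def fixed_effect_mean(effects, ses):
    """Inverse-variance-weighted mean effect and its standard error."""
    weights = [1 / se**2 for se in ses]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se_mean = math.sqrt(1 / sum(weights))
    return mean, se_mean

effects = [0.45, 0.30, 0.52, 0.38]   # hypothetical per-sample Cohen's d values
ses     = [0.10, 0.12, 0.15, 0.11]   # their standard errors
mean, se = fixed_effect_mean(effects, ses)
print(f"aggregate d = {mean:.2f} ± {1.96 * se:.2f}")
```

Precise samples get more weight, which is why a large aggregate sample can give a clear verdict on an effect even when individual labs' estimates vary.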
Many Labs 2: Investigating Variation in Replicability Across Samples and Settings
Richard A. Klein, Michelangelo Vianello, Fred Hasselman, Byron G. Adams, +187 more (118 institutions)
TL;DR: The authors conducted preregistered replications of 28 classic and contemporary published findings, with protocols peer reviewed in advance, to examine variation in effect magnitudes across samples and settings; very little heterogeneity was attributable to the order in which the tasks were performed or to whether the tasks were administered in a lab or online.
Abstract: We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%) were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered in lab versus online. Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.
495 citations
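The heterogeneity measures the abstract reports, Cochran's Q and tau (the between-sample standard deviation of true effects), can be computed from per-sample effects and standard errors. This is a minimal sketch using the DerSimonian-Laird estimator with invented numbers, not the paper's actual code.

```python
import math

def q_and_tau(effects, ses):
    """Cochran's Q and DerSimonian-Laird tau for a set of per-sample effects."""
    weights = [1 / se**2 for se in ses]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w**2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)        # DerSimonian-Laird estimate
    return q, math.sqrt(tau2)

effects = [0.10, 0.35, 0.05, 0.42, 0.18]  # hypothetical per-sample effects
ses     = [0.08, 0.09, 0.07, 0.10, 0.08]
q, tau = q_and_tau(effects, ses)
print(f"Q = {q:.1f} on {len(effects)-1} df, tau = {tau:.2f}")  # tau ≈ 0.13 here
```

A tau near .10, as in this invented example, corresponds to what the abstract calls slight heterogeneity; tau above .20 would indicate moderate heterogeneity.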
Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results
Institutions: University of Sussex, INSEAD, University of Virginia, University of Padua, University of Cologne, University of Cincinnati, University of Economics, Prague, Hong Kong Polytechnic University, University of Liverpool, Stockholm School of Economics, Linnaeus University, University of Hong Kong, University of California, Berkeley, City University of New York, New York University, University of Manchester, Westat, Temple University, Northwestern University, University of Zurich, University of Sheffield, Stockholm University, Ludwig Maximilian University of Munich, University of Minnesota, Xiamen University, Oregon State University, Universidade Federal de Santa Catarina, University of Washington, Queen Mary University of London, University of Nottingham, Cardiff University, University of Maryland, College Park, Brigham Young University, Loyola University Maryland, University of Toronto, University of Giessen, United States Military Academy, State University of New York at Oswego, Concordia University, University of Bamberg, University of Amsterdam, Center for Open Science
TL;DR: Twenty-nine teams involving 61 analysts used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players.
Abstract: Twenty-nine teams involving 61 analysts used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability. These findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions. Crowdsourcing data analysis, a strategy in which numerous research teams are recruited to simultaneously investigate the same research question, makes transparent how defensible, yet subjective, analytic choices influence research results.
396 citations
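To make the abstract's "odds-ratio units" concrete, here is a minimal sketch computing an odds ratio and a Wald 95% CI from a 2x2 table of red cards by skin tone. The counts are invented for illustration and are not the study's data.

```python
import math

def odds_ratio(a, b, c, d):
    """OR and 95% CI for a 2x2 table: rows = group, cols = event yes/no."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

# Hypothetical counts: 40 red cards in 10,000 dyads with dark-skin-toned
# players vs. 30 red cards in 10,000 dyads with light-skin-toned players.
print(odds_ratio(40, 9_960, 30, 9_970))  # OR ≈ 1.33
```

An odds ratio of 1 means no difference between groups; the study's team estimates ranged from 0.89 (slightly reversed) to 2.93 (nearly tripled odds) from the same underlying data, depending on analytic choices such as which covariates to include.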
Registered Replication Report: Schooler and Engstler-Schooler (1990)
Institutions: University of Otago, Stonehill College, Mount Saint Vincent University, University of Würzburg, State University of New York at Brockport, University of Nebraska–Lincoln, Erasmus University Rotterdam, Università degli Studi Suor Orsola Benincasa, University of Leeds, George Fox University, Texas A&M University–Commerce, University of Social Sciences and Humanities, Lehigh Carbon Community College, University of Warwick, Flinders University, University of Münster, Rochester Institute of Technology, University of Virginia, Victoria University of Wellington, Goldsmiths, University of London, University of North Dakota, College of Charleston, University of Stirling, Kent State University, University of Tasmania, University of Oxford, University of Düsseldorf, Ohio State University, University of Central Lancashire, University of Maine, Iowa State University, Nebraska Wesleyan University, University of Navarra, University of Wyoming, Masaryk University, University of Portsmouth, University of Texas at El Paso, Niagara University, Charles University in Prague, Arkansas State University
TL;DR: Participants who verbally described a robber after watching a video of a simulated bank robbery were 25% worse at identifying the robber in a lineup than participants who instead listed U.S. states and capitals, a finding termed the "verbal overshadowing" effect.
Abstract: Trying to remember something now typically improves your ability to remember it later. However, after watching a video of a simulated bank robbery, participants who verbally described the robber were 25% worse at identifying the robber in a lineup than were participants who instead listed U.S. states and capitals—this has been termed the “verbal overshadowing” effect (Schooler & Engstler-Schooler, 1990). More recent studies suggested that this effect might be substantially smaller than first reported. Given uncertainty about the effect size, the influence of this finding in the memory literature, and its practical importance for police procedures, we conducted two collections of preregistered direct replications (RRR1 and RRR2) that differed only in the order of the description task and a filler task. In RRR1, when the description task immediately followed the robbery, participants who provided a description were 4% less likely to select the robber than were those in the control condition. In RRR2, when the description was delayed by 20 min, they were 16% less likely to select the robber. These findings reveal a robust verbal overshadowing effect that is strongly influenced by the relative timing of the tasks. The discussion considers further implications of these replications for our understanding of verbal overshadowing.
180 citations
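The abstract's "4% less likely" and "16% less likely" are differences in lineup identification rates between conditions. A minimal sketch of the kind of two-proportion comparison involved, with invented counts rather than the RRR's actual data or its exact analysis:

```python
from math import sqrt, erfc

def two_prop_z(x1, n1, x2, n2):
    """z statistic and two-sided p for a difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))   # two-sided normal p-value

# Hypothetical counts: 54% correct IDs in the control condition vs. 38%
# after a delayed description (not the replication reports' data).
print(two_prop_z(540, 1000, 380, 1000))
```

With large multi-lab samples, even the smaller 4% difference found in RRR1 can be estimated precisely, which is what allowed the reports to attribute the difference between 4% and 16% to the timing of the description task.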
Cited by
Estimating the Reproducibility of Psychological Science
Alexander A. Aarts, Joanna E. Anderson, Christopher J. Anderson, Peter Raymond Attridge, +287 more (116 institutions)
TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Abstract: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
5,532 citations
A manifesto for reproducible science
TL;DR: This work argues for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation, and incentives, in the hope of facilitating action toward improving the transparency, reproducibility, and efficiency of scientific research.
Abstract: Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery. Here we argue for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives. There is some evidence from both simulations and empirical studies supporting the likely effectiveness of these measures, but their broad adoption by researchers, institutions, funders and journals will require iterative evaluation and improvement. We discuss the goals of these measures, and how they can be implemented, in the hope that this will facilitate action toward improving the transparency, reproducibility and efficiency of scientific research.
1,951 citations
Inside the Turk: Understanding Mechanical Turk as a Participant Pool
TL;DR: This paper discusses the characteristics of Mechanical Turk (MTurk) as a participant pool for psychology and other social sciences, highlighting the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.
Abstract: Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.
1,926 citations
Redefine statistical significance
Institutions: University of Southern California, Duke University, Stockholm School of Economics, University of Virginia, Center for Open Science, University of Amsterdam, University of Pennsylvania, University of North Carolina at Chapel Hill, University of Regensburg, California Institute of Technology, New York University, Research Institute of Industrial Economics, Cardiff University, Mathematica Policy Research, Northwestern University, Ohio State University, University of Sussex, Texas A&M University, Royal Holloway, University of London, University of Zurich, University of Melbourne, University of Wisconsin-Madison, University of Michigan, Stanford University, Rutgers University, Columbia University, University of Washington, University of Edinburgh, National University of Singapore, Utrecht University, Arizona State University, Princeton University, University of California, Los Angeles, Imperial College London, University of Innsbruck, Harvard University, University of Chicago, University of Pittsburgh, University of Notre Dame, University of California, Berkeley, Johns Hopkins University, University of Bristol, University of New South Wales, Dartmouth College, Whitman College, University of Puerto Rico, University of Milan, University of California, Irvine, Paris Dauphine University, University of British Columbia, Ludwig Maximilian University of Munich, Purdue University, Washington University in St. Louis, University of California, Davis, Microsoft
TL;DR: The default P-value threshold for statistical significance is proposed to be changed from 0.05 to 0.005 for claims of new discoveries, in order to reduce the rate of false positives among such claims.
Abstract: We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.
1,586 citations
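One practical consequence of the proposal is larger required samples. Here is a worked sketch, using a standard normal-approximation power formula rather than anything from the paper itself, of the per-group sample size a two-sided two-sample comparison needs for 80% power at alpha = .05 versus alpha = .005.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d: float, alpha: float, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided two-sample z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the threshold
    z_beta = norm.ppf(power)            # quantile corresponding to power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for alpha in (0.05, 0.005):
    print(alpha, n_per_group(d=0.4, alpha=alpha))
# For d = 0.4: roughly 99 vs. 167 per group, i.e. alpha = .005 requires
# about 70% more participants than alpha = .05 at the same power.
```

The roughly 70% increase in sample size shown here is the trade-off the proposal accepts in exchange for fewer false-positive claims of new discoveries.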