Author

Štěpán Bahník

Bio: Štěpán Bahník is an academic researcher from the University of Economics, Prague. The author has contributed to research in the topics of Replication (statistics) and Anchoring. The author has an h-index of 13 and has co-authored 42 publications receiving 6,511 citations. Previous affiliations of Štěpán Bahník include Prague College and the Academy of Sciences of the Czech Republic.

Papers
Journal ArticleDOI
28 Aug 2015-Science
TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Abstract: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
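
As a rough illustration of one replication criterion mentioned in the abstract (whether the original effect size falls within the 95% confidence interval of the replication effect size), here is a minimal Python sketch for correlation effect sizes using the Fisher z-transform. The function name and the numbers are hypothetical, not taken from the project's data or code.

```python
import math
from statistics import NormalDist

def replication_ci_covers_original(r_original, r_replication, n_replication, alpha=0.05):
    """True if r_original lies within the (1 - alpha) CI around r_replication."""
    z = math.atanh(r_replication)              # Fisher z of the replication estimate
    se = 1.0 / math.sqrt(n_replication - 3)    # standard error of Fisher z
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
    return lo <= r_original <= hi

# Hypothetical example: original r = .45, replication r = .20 with n = 120.
print(replication_ci_covers_original(0.45, 0.20, 120))
```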

5,532 citations

Journal ArticleDOI
TL;DR: The authors examined variation in the replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants and found that replicability depended more on the effect itself than on the sample and setting used to investigate it.
Abstract: Although replication is a central tenet of science, direct replications are rare in psychology. This research tested variation in the replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants. In the aggregate, 10 effects replicated consistently. One effect – imagined contact reducing prejudice – showed weak support for replicability, and two effects – flag priming influencing conservatism and currency priming influencing system justification – did not replicate. We examined whether conditions such as lab versus online administration or a US versus international sample predicted effect magnitudes; by and large, they did not. The results of this small sample of effects suggest that replicability is more dependent on the effect itself than on the sample and setting used to investigate the effect.
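
The moderator comparison described above (for example, lab versus online administration) can be sketched as a comparison of inverse-variance-weighted mean effect sizes between two groups of samples. The Python snippet below is a minimal illustration under that assumption; the site-level effect sizes, variances, and function names are made up, and this is not the authors' analysis code.

```python
from statistics import NormalDist

def weighted_mean_and_var(effects, variances):
    # Fixed-effect (inverse-variance) weighted mean and its sampling variance.
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return mean, 1.0 / sum(weights)

# Hypothetical site-level effect sizes (Cohen's d) and sampling variances.
lab_d, lab_var = [0.42, 0.35, 0.51, 0.28], [0.02, 0.03, 0.02, 0.04]
online_d, online_var = [0.38, 0.45, 0.30], [0.01, 0.02, 0.02]

m_lab, v_lab = weighted_mean_and_var(lab_d, lab_var)
m_onl, v_onl = weighted_mean_and_var(online_d, online_var)
z = (m_lab - m_onl) / (v_lab + v_onl) ** 0.5
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"lab mean d = {m_lab:.2f}, online mean d = {m_onl:.2f}, p = {p:.3f}")
```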

767 citations

Journal ArticleDOI
24 Dec 2018
TL;DR: This paper conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings, and found that very little heterogeneity was attributable to the order in which the tasks were performed or to whether the tasks were administered in the lab versus online.
Abstract: We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%) were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered in lab versus online. Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.
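
The heterogeneity measures named in the abstract, Cochran's Q and tau, can be computed from site-level effect sizes and their sampling variances; the sketch below uses the DerSimonian-Laird estimator of tau-squared. All numbers are hypothetical and the code is illustrative, not the paper's analysis pipeline.

```python
from scipy.stats import chi2

def q_and_tau(effects, variances):
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total_w
    q = sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))  # Cochran's Q
    df = len(effects) - 1
    c = total_w - sum(w ** 2 for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c)   # DerSimonian-Laird between-site variance estimate
    return q, chi2.sf(q, df), tau2 ** 0.5

effects = [0.10, 0.25, 0.05, 0.30, 0.18]        # hypothetical per-site Cohen's d values
variances = [0.010, 0.020, 0.015, 0.010, 0.020]
q, p, tau = q_and_tau(effects, variances)
print(f"Q = {q:.2f}, p = {p:.3f}, tau = {tau:.3f}")
```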

495 citations

Journal ArticleDOI
23 Aug 2018
TL;DR: In this paper, 29 teams involving 61 analysts used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players.
Abstract: Twenty-nine teams involving 61 analysts used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability. These findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions. Crowdsourcing data analysis, a strategy in which numerous research teams are recruited to simultaneously investigate the same research question, makes transparent how defensible, yet subjective, analytic choices influence research results.
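
One of the many defensible analytic routes the abstract alludes to is a logistic regression of red-card incidence on a skin-tone rating, with the exponentiated coefficient read as an odds ratio. The Python sketch below simulates hypothetical data to show the mechanics; the column names, covariate, and effect sizes are assumptions, not the crowdsourced project's dataset or any team's actual model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "skin_tone": rng.uniform(0, 1, n),           # 0 = very light, 1 = very dark (hypothetical rating)
    "position_forward": rng.integers(0, 2, n),   # hypothetical binary covariate
})
# Simulate red-card outcomes with a modest positive skin-tone effect.
logit_p = -3.0 + 0.4 * df["skin_tone"] + 0.2 * df["position_forward"]
df["red_card"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

model = smf.logit("red_card ~ skin_tone + position_forward", data=df).fit(disp=0)
print("odds ratio per unit of skin tone:", float(np.exp(model.params["skin_tone"])))
```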

396 citations

Journal ArticleDOI
Victoria K. Alogna1, M. K. Attaya2, P. Aucoin3, Štěpán Bahník4, S. Birch5, Angie R. Birt3, Brian H. Bornstein6, Samantha Bouwmeester7, Maria A. Brandimonte8, Charity Brown9, K. Buswell10, Curt A. Carlson11, Maria A. Carlson11, Simon Chu, Aleksandra Cislak12, M. Colarusso13, Melissa F. Colloff14, Kimberly S. Dellapaolera6, Jean-Francois Delvenne9, A. Di Domenico, Aaron Drummond15, Gerald Echterhoff16, John E. Edlund17, Casey Eggleston18, Beth Fairfield, Gregory Franco19, Fiona Gabbert20, Bradlee W. Gamblin21, Maryanne Garry19, R. Gentry10, Elizabeth Gilbert18, D. L. Greenberg22, Jamin Halberstadt1, Lauren C. Hall15, Peter J. B. Hancock23, D. Hirsch24, Glenys A. Holt25, Joshua Conrad Jackson1, Jonathan Jong26, Andre Kehn21, C. Koch10, René Kopietz16, U. Körner27, Melina A. Kunar14, Calvin K. Lai18, Stephen R. H. Langton23, Fábio Pitombo Leite28, Nicola Mammarella, John E. Marsh29, K. A. McConnaughy2, S. McCoy30, Alex H. McIntyre23, Christian A. Meissner31, Robert B. Michael19, A. A. Mitchell32, M. Mugayar-Baldocchi22, R. Musselman13, C. Ng1, Austin Lee Nichols33, Narina Nunez34, Matthew A. Palmer25, J. E. Pappagianopoulos2, Marilyn S. Petro32, Christopher R. Poirier2, Emma Portch9, M. Rainsford25, A. Rancourt30, C. Romig24, Eva Rubínová35, Mevagh Sanson19, Liam Satchell36, James D. Sauer36, Kimberly Schweitzer34, J. Shaheed10, Faye Collette Skelton29, G. A. Sullivan2, Kyle J. Susa37, Jessica K. Swanner31, W. B. Thompson38, R. Todaro24, Joanna Ulatowska, Tim Valentine20, Peter P. J. L. Verkoeijen7, Marek A. Vranka39, Kimberley A. Wade14, Christopher A. Was24, Dawn R. Weatherford40, K. Wiseman34, Tara Zaksaite9, Daniel V. Zuj25, Rolf A. Zwaan7 
TL;DR: This article reports two preregistered direct replications of the "verbal overshadowing" effect, finding that participants who verbally described a robber were less likely to identify the robber in a lineup than were participants who instead listed U.S. states and capitals (a 4% deficit when the description immediately followed the robbery and a 16% deficit after a 20-minute delay).
Abstract: Trying to remember something now typically improves your ability to remember it later. However, after watching a video of a simulated bank robbery, participants who verbally described the robber were 25% worse at identifying the robber in a lineup than were participants who instead listed U.S. states and capitals—this has been termed the “verbal overshadowing” effect (Schooler & Engstler-Schooler, 1990). More recent studies suggested that this effect might be substantially smaller than first reported. Given uncertainty about the effect size, the influence of this finding in the memory literature, and its practical importance for police procedures, we conducted two collections of preregistered direct replications (RRR1 and RRR2) that differed only in the order of the description task and a filler task. In RRR1, when the description task immediately followed the robbery, participants who provided a description were 4% less likely to select the robber than were those in the control condition. In RRR2, when the description was delayed by 20 min, they were 16% less likely to select the robber. These findings reveal a robust verbal overshadowing effect that is strongly influenced by the relative timing of the tasks. The discussion considers further implications of these replications for our understanding of verbal overshadowing.
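
The basic contrast reported in the abstract, the difference in correct lineup identifications between the description and control conditions, can be illustrated with a simple two-proportion z-test. The counts below are invented for illustration; this is not the registered replication report's analysis code.

```python
from statistics import NormalDist

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    # Difference in proportions and two-sided p-value from a pooled z-test.
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    return p_a - p_b, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: description condition vs. states-and-capitals control.
diff, p = two_proportion_z(hits_a=140, n_a=400, hits_b=205, n_b=400)
print(f"difference in identification rate = {diff:.1%}, p = {p:.4f}")
```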

180 citations


Cited by

Journal ArticleDOI
26 May 2016-Nature

2,609 citations

Journal ArticleDOI
TL;DR: This work argues for the adoption of measures to optimize key elements of the scientific process (methods, reporting and dissemination, reproducibility, evaluation, and incentives), in the hope that this will facilitate action toward improving the transparency, reproducibility, and efficiency of scientific research.
Abstract: Improving the reliability and efficiency of scientific research will increase the credibility of the published scientific literature and accelerate discovery. Here we argue for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives. There is some evidence from both simulations and empirical studies supporting the likely effectiveness of these measures, but their broad adoption by researchers, institutions, funders and journals will require iterative evaluation and improvement. We discuss the goals of these measures, and how they can be implemented, in the hope that this will facilitate action toward improving the transparency, reproducibility and efficiency of scientific research.

1,951 citations

Journal ArticleDOI
TL;DR: This paper discusses the characteristics of Mechanical Turk (MTurk) as a participant pool for psychology and other social sciences, highlighting the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on both controllable and uncontrollable factors.
Abstract: Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.

1,926 citations

Journal ArticleDOI
Daniel J. Benjamin1, James O. Berger2, Magnus Johannesson3, Magnus Johannesson1, Brian A. Nosek4, Brian A. Nosek5, Eric-Jan Wagenmakers6, Richard A. Berk7, Kenneth A. Bollen8, Björn Brembs9, Lawrence D. Brown7, Colin F. Camerer10, David Cesarini11, David Cesarini12, Christopher D. Chambers13, Merlise A. Clyde2, Thomas D. Cook14, Thomas D. Cook15, Paul De Boeck16, Zoltan Dienes17, Anna Dreber3, Kenny Easwaran18, Charles Efferson19, Ernst Fehr20, Fiona Fidler21, Andy P. Field17, Malcolm R. Forster22, Edward I. George7, Richard Gonzalez23, Steven N. Goodman24, Edwin J. Green25, Donald P. Green26, Anthony G. Greenwald27, Jarrod D. Hadfield28, Larry V. Hedges15, Leonhard Held20, Teck-Hua Ho29, Herbert Hoijtink30, Daniel J. Hruschka31, Kosuke Imai32, Guido W. Imbens24, John P. A. Ioannidis24, Minjeong Jeon33, James Holland Jones34, Michael Kirchler35, David Laibson36, John A. List37, Roderick J. A. Little23, Arthur Lupia23, Edouard Machery38, Scott E. Maxwell39, Michael A. McCarthy21, Don A. Moore40, Stephen L. Morgan41, Marcus R. Munafò42, Shinichi Nakagawa43, Brendan Nyhan44, Timothy H. Parker45, Luis R. Pericchi46, Marco Perugini47, Jeffrey N. Rouder48, Judith Rousseau49, Victoria Savalei50, Felix D. Schönbrodt51, Thomas Sellke52, Betsy Sinclair53, Dustin Tingley36, Trisha Van Zandt16, Simine Vazire54, Duncan J. Watts55, Christopher Winship36, Robert L. Wolpert2, Yu Xie32, Cristobal Young24, Jonathan Zinman44, Valen E. Johnson18, Valen E. Johnson1 
University of Southern California1, Duke University2, Stockholm School of Economics3, University of Virginia4, Center for Open Science5, University of Amsterdam6, University of Pennsylvania7, University of North Carolina at Chapel Hill8, University of Regensburg9, California Institute of Technology10, New York University11, Research Institute of Industrial Economics12, Cardiff University13, Mathematica Policy Research14, Northwestern University15, Ohio State University16, University of Sussex17, Texas A&M University18, Royal Holloway, University of London19, University of Zurich20, University of Melbourne21, University of Wisconsin-Madison22, University of Michigan23, Stanford University24, Rutgers University25, Columbia University26, University of Washington27, University of Edinburgh28, National University of Singapore29, Utrecht University30, Arizona State University31, Princeton University32, University of California, Los Angeles33, Imperial College London34, University of Innsbruck35, Harvard University36, University of Chicago37, University of Pittsburgh38, University of Notre Dame39, University of California, Berkeley40, Johns Hopkins University41, University of Bristol42, University of New South Wales43, Dartmouth College44, Whitman College45, University of Puerto Rico46, University of Milan47, University of California, Irvine48, Paris Dauphine University49, University of British Columbia50, Ludwig Maximilian University of Munich51, Purdue University52, Washington University in St. Louis53, University of California, Davis54, Microsoft55
TL;DR: The authors propose changing the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries, in order to reduce the rate of false-positive findings.
Abstract: We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.
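
A back-of-the-envelope way to see one practical consequence of the proposed threshold is to compare the per-group sample size needed to keep 80% power for a two-sample comparison at alpha = 0.05 versus 0.005, using the standard normal approximation n ≈ 2((z_{1-α/2} + z_power)/d)². The effect size and power values below are illustrative assumptions, not figures from the paper.

```python
from statistics import NormalDist
from math import ceil

def n_per_group(d, alpha, power=0.80):
    # Normal-approximation sample size for a two-sample comparison of means.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: ~{n_per_group(d=0.4, alpha=alpha)} participants per group")
```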

1,586 citations