Journal ArticleDOI

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

17 Oct 2011-Psychological Science (SAGE Publications)-Vol. 22, Iss: 11, pp 1359-1366
TL;DR: It is shown that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings, flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates, and a simple, low-cost, and straightforwardly effective disclosure-based solution is suggested.
Abstract: In this article, we accomplish two things. First, we show that despite empirical psychologists' nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
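A minimal simulation sketch in the spirit of the article's computer simulations (an illustration, not the authors' code): every individual t-test is run at α = .05 on pure null data, yet combining just two researcher degrees of freedom, a second correlated dependent variable and one round of optional stopping, pushes the study-level false-positive rate well above the nominal .05.

```python
# Sketch under assumed parameters (not the authors' code): all data are null,
# each t-test uses alpha = .05, but a second correlated DV plus one optional
# top-up of observations inflates the study-level false-positive rate.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def one_study(n=20, topup=10, rho=0.5, looks=2):
    cov = [[1.0, rho], [rho, 1.0]]
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # control group
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # treatment (no true effect)
    for look in range(looks):
        # test both DVs and report whichever one "worked"
        ps = [ttest_ind(a[:, j], b[:, j]).pvalue for j in (0, 1)]
        if min(ps) < 0.05:
            return True
        if look < looks - 1:  # not significant yet: collect more data, re-test
            a = np.vstack([a, rng.multivariate_normal([0.0, 0.0], cov, size=topup)])
            b = np.vstack([b, rng.multivariate_normal([0.0, 0.0], cov, size=topup)])
    return False

sims = 10_000
rate = sum(one_study() for _ in range(sims)) / sims
print(f"study-level false-positive rate: {rate:.3f}")  # noticeably above .05
```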

Citations
Journal ArticleDOI
TL;DR: It is shown that the average statistical power of studies in the neurosciences is very low, and the consequences include overestimates of effect size and low reproducibility of results.
Abstract: A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
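The arithmetic behind the second claim can be sketched with the positive predictive value (PPV), the probability that a significant result reflects a true effect. The numbers below are illustrative assumptions, not estimates from the paper.

```python
# Illustrative arithmetic (assumed numbers, not estimates from the paper):
# PPV = R(1 - beta) / (R(1 - beta) + alpha), where R is the pre-study odds
# that a probed effect is real, alpha the significance threshold, and
# 1 - beta the power. Lower power means a significant result is less
# likely to reflect a true effect.
def ppv(power, alpha=0.05, prior_odds=0.25):
    return prior_odds * power / (prior_odds * power + alpha)

for power in (0.8, 0.5, 0.2):
    print(f"power = {power:.1f} -> P(effect is true | significant) = {ppv(power):.2f}")
# power = 0.8 -> 0.80; power = 0.5 -> 0.71; power = 0.2 -> 0.50
```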

5,683 citations

Journal ArticleDOI
28 Aug 2015-Science
TL;DR: A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Abstract: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
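As one concrete illustration of the confidence-interval criterion mentioned above, here is a sketch of the check for a correlation-type effect via the Fisher z-transform; the r values and sample size are hypothetical inputs, not data from the project, which applied analogous computations per effect-size type.

```python
# Does the original effect size fall inside the replication's 95% CI?
# Sketch for a correlation using the Fisher z-transform; inputs below are
# hypothetical, not data from the Reproducibility Project.
import math

def original_in_replication_ci(r_orig, r_rep, n_rep, z_crit=1.96):
    z_rep = math.atanh(r_rep)          # Fisher z of the replication estimate
    se = 1.0 / math.sqrt(n_rep - 3)    # standard error of z
    lo, hi = z_rep - z_crit * se, z_rep + z_crit * se
    return lo <= math.atanh(r_orig) <= hi

print(original_in_replication_ci(r_orig=0.40, r_rep=0.20, n_rep=120))  # False here
```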

5,532 citations

Journal ArticleDOI
TL;DR: The American Statistical Association (ASA) released a policy statement on p-values and statistical significance in 2016, prompted by the ASA Board’s concern about the reproducibility and replicability of scientific conclusions.
Abstract: Cobb’s concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p < 0.05: “We teach it because it’s what we do; we do it because it’s what we teach.” This concern was brought to the attention of the ASA Board. The ASA Board was also stimulated by highly visible discussions over the last few years. For example, ScienceNews (Siegfried 2010) wrote: “It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.” A November 2013 article in Phys.org Science News Wire (2013) cited “numerous deep flaws” in null hypothesis significance testing. A ScienceNews article (Siegfried 2014) on February 7, 2014, said “statistical techniques for testing hypotheses ... have more flaws than Facebook’s privacy policies.” A week later, statistician and “Simply Statistics” blogger Jeff Leek responded. “The problem is not that people use P-values poorly,” Leek wrote, “it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis” (Leek 2014). That same week, statistician and science writer Regina Nuzzo published an article in Nature entitled “Scientific Method: Statistical Errors” (Nuzzo 2014). That article is now one of the most highly viewed Nature articles, as reported by altmetric.com (http://www.altmetric.com/details/2115792#score). Of course, it was not simply a matter of responding to some articles in print. The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. Without getting into definitions and distinctions of these terms, we observe that much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices, such as the one taken by the editors of Basic and Applied Social Psychology, who decided to ban p-values (null hypothesis significance testing) (Trafimow and Marks 2015). Misunderstanding or misuse of statistical inference is only one cause of the “reproducibility crisis” (Peng 2015), but to our community, it is an important one. When the ASA Board decided to take up the challenge of developing a policy statement on p-values and statistical significance, it did so recognizing this was not a lightly taken step. The ASA has not previously taken positions on specific matters of statistical practice. The closest the association has come to this is a statement on the use of value-added models (VAM) for educational assessment (Morganstein and Wasserstein 2014) and a statement on risk-limiting post-election audits (American Statistical Association 2010). However, these were truly policy-related statements. The VAM statement addressed a key educational policy issue, acknowledging the complexity of the issues involved, citing limitations of VAMs as effective performance models, and urging that they be developed and interpreted with the involvement of statisticians. The statement on election auditing was also in response to a major but specific policy issue (close elections in 2008), and said that statistically based election audits should become a routine part of election processes. By contrast, the Board envisioned that the ASA statement on p-values and statistical significance would shed light on an aspect of our field that is too often misunderstood and misused in the broader research community, and, in the process, provide the community a service.
The intended audience would be researchers, practitioners, and science writers who are not primarily statisticians. Thus, this statement would be quite different from anything previously attempted. The Board tasked Wasserstein with assembling a group of experts representing a wide variety of points of view. On behalf of the Board, he reached out to more than two dozen such people, all of whom said they would be happy to be involved. Several expressed doubt about whether agreement could be reached, but those who did said, in effect, that if there was going to be a discussion, they wanted to be involved. Over the course of many months, group members discussed what format the statement should take, tried to more concretely visualize the audience for the statement, and began to find points of agreement. That turned out to be relatively easy to do, but it was just as easy to find points of intense disagreement. The time came for the group to sit down together to hash out these points, and so in October 2015, 20 members of the group met at the ASA Office in Alexandria, Virginia. The 2-day meeting was facilitated by Regina Nuzzo, and by the end of the meeting, a good set of points around which the statement could be built was developed. The next 3 months saw multiple drafts of the statement, reviewed by group members, by Board members (in a lengthy discussion at the November 2015 ASA Board meeting), and by members of the target audience. Finally, on January 29, 2016, the Executive Committee of the ASA approved the statement. The statement development process was lengthier and more controversial than anticipated. For example, there was considerable discussion about how best to address the issue of multiple potential comparisons (Gelman and Loken 2014). We debated at some length the issues behind the words “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.”

4,361 citations

Journal ArticleDOI
TL;DR: In this article, the authors highlight the disadvantages of detecting outliers with the standard deviation around the mean and present the median absolute deviation, an alternative and more robust measure of dispersion that is easy to implement, and they explain the procedures for calculating this indicator in SPSS and R software.
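The article gives SPSS and R procedures; the following is an analogous Python sketch of the MAD-based outlier rule it advocates. The constant 1.4826 scales the MAD to estimate σ under normality, and the cutoff of 3 is one conventional, conservative choice.

```python
# Median-absolute-deviation (MAD) outlier detection, a Python analogue of
# the SPSS/R procedures described in the article. The 1.4826 constant makes
# MAD a consistent estimator of sigma under normality; cutoff = 3 is one
# conventional conservative threshold.
import numpy as np

def mad_outliers(x, cutoff=3.0):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # robust sigma estimate
    return np.abs(x - med) > cutoff * mad      # True where x is flagged

data = [2.1, 2.4, 2.3, 2.2, 2.5, 9.8]          # made-up sample with one outlier
print(mad_outliers(data))                      # only 9.8 is flagged
```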

2,647 citations


Cites background from "False-Positive Psychology: Undisclo..."

  • ...In a recent article, Simmons, Nelson, and Simonsohn (2011) showed how, due to the misuse of statistical tools, significant results could easily turn out to be false positives (i.e., effects considered significant whereas the null hypothesis is actually true)....

Journal ArticleDOI
TL;DR: In a randomized double-blind study, science faculty from research-intensive universities rated a male applicant’s materials as significantly more competent and hireable than the (identical) female applicant’s, and participants’ preexisting subtle bias against women played a moderating role.
Abstract: Despite efforts to recruit and retain more women, a stark gender disparity persists within academic science. Abundant research has demonstrated gender bias in many demographic groups, but has yet to experimentally investigate whether science faculty exhibit a bias against female students that could contribute to the gender disparity in academic science. In a randomized double-blind study (n = 127), science faculty from research-intensive universities rated the application materials of a student—who was randomly assigned either a male or female name—for a laboratory manager position. Faculty participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student. Mediation analyses indicated that the female student was less likely to be hired because she was viewed as less competent. We also assessed faculty participants’ preexisting subtle bias against women using a standard instrument and found that preexisting subtle bias against women played a moderating role, such that subtle bias against women was associated with less support for the female student, but was unrelated to reactions to the male student. These results suggest that interventions addressing faculty gender bias might advance the goal of increasing the participation of women in science.

2,362 citations

References
Journal ArticleDOI
Ziva Kunda
TL;DR: It is proposed that motivation may affect reasoning through reliance on a biased set of cognitive processes--that is, strategies for accessing, constructing, and evaluating beliefs--that are considered most likely to yield the desired conclusion.
Abstract: It is proposed that motivation may affect reasoning through reliance on a biased set of cognitive processes—that is, strategies for accessing, constructing, and evaluating beliefs. The motivation to be accurate enhances use of those beliefs and strategies that are considered most appropriate, whereas the motivation to arrive at particular conclusions enhances use of those that are considered most likely to yield the desired conclusion. There is considerable evidence that people are more likely to arrive at conclusions that they want to arrive at, but their ability to do so is constrained by their ability to construct seemingly reasonable justifications for these conclusions. These ideas can account for a wide variety of research concerned with motivated reasoning.
The notion that goals or motives affect reasoning has a long and controversial history in social psychology. The propositions that motives may affect perceptions (Erdelyi, 1974), attitudes (Festinger, 1957), and attributions (Heider, 1958) have been put forth by some psychologists and challenged by others. Although early researchers and theorists took it for granted that motivation may cause people to make self-serving attributions and permit them to believe what they want to believe because they want to believe it, this view, and the research used to uphold it, came under concentrated criticism in the 1970s. The major and most damaging criticism of the motivational view was that all research purported to demonstrate motivated reasoning could be reinterpreted in entirely cognitive, nonmotivational terms (Miller & Ross, 1975; Nisbett & Ross, 1980). Thus people could draw self-serving conclusions not because they wanted to but because these conclusions seemed more plausible, given their prior beliefs and expectancies. Because both cognitive and motivational accounts could be generated for any empirical study, some theorists argued that the hot versus cold cognition controversy could not be solved, at least in the attribution paradigm (Ross & Fletcher, 1985; Tetlock & Levi, 1982). One reason for the persistence of this controversy lies in the failure of researchers to explore the mechanisms underlying motivated reasoning. Recently, several authors have attempted to rectify this neglect (Kruglanski & Freund, 1983; Kunda, 1987; Pyszczynski & Greenberg, 1987; Sorrentino & Higgins, 1986). All these authors share a view of motivation as having its effects through cognitive processes: People rely on cognitive processes and representations to arrive at their desired conclusions, but motivation plays a role in determining which of these will be used on a given occasion.

6,643 citations


"False-Positive Psychology: Undisclo..." refers background in this paper

  • ...of ambiguous information and remarkably adept at reaching justifiable conclusions that mesh with their desires (Babcock & Loewenstein, 1997; Dawson, Gilovich, & Regan, 2002; Gilovich, 1983; Hastorf & Cantril, 1954; Kunda, 1990; Zuckerman, 1979)....

Journal ArticleDOI
30 Aug 2005-PLoS Medicine
TL;DR: In this essay, the author discusses the implications of these problems for the conduct and interpretation of research and suggests that claimed research findings may often be simply accurate measures of the prevailing bias.
Abstract: There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser pre-selection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
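The framework's central quantity can be written explicitly; in the no-bias case, the essay's positive predictive value of a claimed finding is:

```latex
% No-bias form of the essay's positive predictive value (PPV) of a claimed
% finding, with R the pre-study odds of a true relationship, \alpha the
% type I error rate, and 1 - \beta the power:
\[
  \mathrm{PPV} \;=\; \frac{(1-\beta)\,R}{(1-\beta)\,R + \alpha}
\]
% PPV > 1/2 requires (1-\beta)R > \alpha: at alpha = .05 and power = .8,
% the pre-study odds R must exceed 1/16 for a claim to be more likely true
% than false. Bias and multiple teams chasing significance lower PPV further.
```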

5,003 citations

Journal ArticleDOI
TL;DR: In this article, a group sequential design is proposed to divide patient entry into a number of equal-sized groups so that the decision to stop the trial or continue is based on repeated significance tests of the accumulated data after each group is evaluated.
Abstract: In clinical trials with sequential patient entry, fixed sample size designs are unjustified on ethical grounds and sequential designs are often impracticable. One solution is a group sequential design dividing patient entry into a number of equal-sized groups so that the decision to stop the trial or continue is based on repeated significance tests of the accumulated data after each group is evaluated. Exact results are obtained for a trial with two treatments and a normal response with known variance. The design problem of determining the required size and number of groups is also considered. Simulation shows that these normal results may be adapted to other types of response data. An example shows that group sequential designs can sometimes be statistically superior to standard sequential designs.
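A minimal simulation sketch of the point (not Pocock's exact computations): with five interim looks at a null two-arm trial, testing each look at a nominal .05 inflates the overall type I error to roughly .14, while a constant Pocock-style boundary (nominal p ≈ .0158 per look, a commonly tabulated value for K = 5) keeps it near .05.

```python
# Group sequential sketch: K interim looks at a null two-arm trial with
# normal response and known variance 1. Naive repeated testing at .05
# inflates overall type I error (~.14 for K = 5); a constant Pocock-style
# per-look level (~.0158, a commonly tabulated value) restores ~.05.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
K, group_n, sims = 5, 20, 20_000

def trial_rejects(alpha_per_look):
    a = b = np.empty(0)
    for _ in range(K):                   # one look after each new group
        a = np.append(a, rng.normal(0, 1, group_n))
        b = np.append(b, rng.normal(0, 1, group_n))
        z = (a.mean() - b.mean()) / np.sqrt(2 / len(a))  # known-variance z-test
        if abs(z) > norm.ppf(1 - alpha_per_look / 2):
            return True                  # trial stops, "effect" declared
    return False

for alpha in (0.05, 0.0158):
    rate = sum(trial_rejects(alpha) for _ in range(sims)) / sims
    print(f"per-look alpha = {alpha}: overall type I error ~ {rate:.3f}")
# naive .05 boundary -> roughly .14; Pocock-style boundary -> roughly .05
```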

1,573 citations


"False-Positive Psychology: Undisclo..." refers background in this paper

  • ...Something like this has been proposed for medical trials that monitor outcomes as the study progresses (see, e.g., Pocock, 1977)....

Journal ArticleDOI
TL;DR: It is found that the percentage of respondents who have engaged in questionable practices was surprisingly high, which suggests that some questionable practices may constitute the prevailing research norm.
Abstract: Cases of clear scientific misconduct have received significant media attention recently, but less flagrantly questionable research practices may be more prevalent and, ultimately, more damaging to the academic enterprise. Using an anonymous elicitation format supplemented by incentives for honest reporting, we surveyed over 2,000 psychologists about their involvement in questionable research practices. The impact of truth-telling incentives on self-admissions of questionable research practices was positive, and this impact was greater for practices that respondents judged to be less defensible. Combining three different estimation methods, we found that the percentage of respondents who have engaged in questionable practices was surprisingly high. This finding suggests that some questionable practices may constitute the prevailing research norm.

1,504 citations

Journal ArticleDOI
TL;DR: The authors found that self-serving effects for both success and failure are obtained in most but not all experimental paradigms, and that these attributions are better understood in motivational than in information-processing terms.
Abstract: Do causal attributions serve the need to protect and/or enhance self-esteem? In a recent review, Miller and Ross (1975) proposed that there is evidence for a self-serving effect in the attribution of success but not in the attribution of failure, and that this effect reflects biases in information-processing rather than self-esteem maintenance. The present review indicated that self-serving effects for both success and failure are obtained in most but not all experimental paradigms. Processes that may suppress or even reverse the self-serving effect were discussed. Most important, the examination of research in which self-serving effects are obtained suggested that these attributions are better understood in motivational than in information-processing terms.

1,144 citations


"False-Positive Psychology: Undisclo..." refers background in this paper

  • ...of ambiguous information and remarkably adept at reaching justifiable conclusions that mesh with their desires (Babcock & Loewenstein, 1997; Dawson, Gilovich, & Regan, 2002; Gilovich, 1983; Hastorf & Cantril, 1954; Kunda, 1990; Zuckerman, 1979)....
