Journal ArticleDOI

Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse.

01 Jan 2011-Behavioral Ecology and Sociobiology (Springer-Verlag)-Vol. 65, Iss: 1, pp 47-55
TL;DR: Full model tests and P value adjustments can be used as a guide to how frequently type I errors arise by sampling variation alone; the authors favour the presentation of full models, since these best reflect the range of predictors investigated and ensure a balanced representation of non-significant results.
Abstract: Fitting generalised linear models (GLMs) with more than one predictor has become the standard method of analysis in evolutionary and behavioural research. Often, GLMs are used for exploratory data analysis, where one starts with a complex full model including interaction terms and then simplifies by removing non-significant terms. While this approach can be useful, it is problematic if significant effects are interpreted as if they arose from a single a priori hypothesis test. This is because model selection involves cryptic multiple hypothesis testing, a fact that has only rarely been acknowledged or quantified. We show that the probability of finding at least one ‘significant’ effect is high, even if all null hypotheses are true (e.g. 40% when starting with four predictors and their two-way interactions). This probability is close to theoretical expectations when the sample size (N) is large relative to the number of predictors including interactions (k). In contrast, type I error rates strongly exceed even those expectations when model simplification is applied to models that are over-fitted before simplification (low N/k ratio). The increase in false-positive results arises primarily from an overestimation of effect sizes among significant predictors, leading to upward-biased effect sizes that often cannot be reproduced in follow-up studies (‘the winner's curse’). Despite having their own problems, full model tests and P value adjustments can be used as a guide to how frequently type I errors arise by sampling variation alone. We favour the presentation of full models, since they best reflect the range of predictors investigated and ensure a balanced representation also of non-significant results.
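
The headline figure in the abstract is easy to check by simulation. The sketch below is an illustrative reconstruction, not the authors' code; it assumes numpy and statsmodels and arbitrary settings for N and the number of runs. It fits a full model with four unrelated predictors plus their two-way interactions to pure noise and counts how often at least one term comes out 'significant'; with a generous N the proportion lands near the quoted 40% (1 - 0.95^10 ≈ 0.40).

```python
# Monte Carlo sketch: chance of >=1 "significant" term in a full model with
# 4 predictors and their two-way interactions when all nulls are true.
# Illustrative assumptions only (N, alpha, n_sims); not the original simulation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N, n_sims, alpha = 200, 1000, 0.05
false_positive_runs = 0

for _ in range(n_sims):
    X = rng.normal(size=(N, 4))                      # 4 unrelated predictors
    # all two-way interactions (6 extra columns)
    inter = [X[:, i] * X[:, j] for i in range(4) for j in range(i + 1, 4)]
    design = sm.add_constant(np.column_stack([X] + inter))
    y = rng.normal(size=N)                           # response independent of X
    fit = sm.OLS(y, design).fit()
    if (fit.pvalues[1:] < alpha).any():              # any non-intercept term "significant"?
        false_positive_runs += 1

print(f"runs with >=1 significant term: {false_positive_runs / n_sims:.2f}")
# expected near 0.40, i.e. 1 - 0.95**10 with 10 independent tests
```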

Citations
Journal ArticleDOI
23 May 2018-PeerJ
TL;DR: This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.
Abstract: The use of linear mixed effects models (LMMs) is increasingly common in the analysis of biological data. Whilst LMMs offer a flexible approach to modelling a broad range of data types, ecological data are often complex and require complex model structures, and the fitting and interpretation of such models is not always straightforward. The ability to achieve robust biological inference requires that practitioners know how and when to apply these tools. Here, we provide a general overview of current methods for the application of LMMs to biological data, and highlight the typical pitfalls that can be encountered in the statistical modelling process. We tackle several issues regarding methods of model selection, with particular reference to the use of information theory and multi-model inference in ecology. We offer practical solutions and direct the reader to key references that provide further technical detail for those seeking a deeper understanding. This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.

1,210 citations


Cites background or methods from "Cryptic multiple hypotheses testing..."

  • ...This cryptic multiple testing can lead to hugely inflated Type I errors (Forstmeier & Schielzeth, 2011)....

  • ...Performing ‘full model tests’ (comparing the global model to an intercept only model) before investigating single-predictor effects controls the Type I error rate (Forstmeier & Schielzeth, 2011)....

  • ...…deletion procedures have come under heavy criticism; they can overestimate the effect size of significant predictors (Whittingham et al., 2006; Forstmeier & Schielzeth, 2011; Burnham, Anderson & Huyvaert, 2011) and force the researcher to focus on a single best model as if it were the only…...

  • ...Because stepwise deletion can cause biased effect sizes, presenting means and SEs of parameters from the global model should be more robust, especially when the n/k ratio is low (Forstmeier & Schielzeth, 2011)....

  • ...Guidelines for the ideal ratio of data points (n) to estimated parameters (k) vary widely (see Forstmeier & Schielzeth, 2011)....

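The 'full model test' mentioned in the quotes above (comparing the global model against an intercept-only model before interpreting individual predictors) can be sketched as follows. This is a hypothetical example on invented data, shown for an ordinary linear model rather than a mixed model, using the overall F-test reported by statsmodels; it is not code from either paper.

```python
# Sketch of a "full model test": check the global model against an
# intercept-only model with a single F-test before inspecting predictors.
# Data and settings are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
N = 120
X = rng.normal(size=(N, 5))            # five candidate predictors
y = rng.normal(size=N)                 # here: no true effects at all

full = sm.OLS(y, sm.add_constant(X)).fit()

# full.f_pvalue is the P value of the overall F-test against the
# intercept-only model; only if this is small is it worth interpreting
# individual coefficients.
print(f"overall F = {full.fvalue:.2f}, P = {full.f_pvalue:.3f}")
```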

Journal ArticleDOI
TL;DR: An overview of how mixed-effect models can be used to partition variation in, and correlations among, phenotypic attributes into between- and within-individual variance components is provided.
Abstract: Growing interest in proximate and ultimate causes and consequences of between- and within-individual variation in labile components of the phenotype - such as behaviour or physiology - characterizes current research in evolutionary ecology. The study of individual variation requires tools for quantification and decomposition of phenotypic variation into between- and within-individual components. This is essential as variance components differ in their ecological and evolutionary implications. We provide an overview of how mixed-effect models can be used to partition variation in, and correlations among, phenotypic attributes into between- and within-individual variance components. Optimal sampling schemes to accurately estimate (with sufficient power) a wide range of repeatabilities and key (co)variance components, such as between- and within-individual correlations, are detailed. Mixed-effect models enable the usage of unambiguous terminology for patterns of biological variation that currently lack a formal statistical definition (e.g. 'animal personality' or 'behavioural syndromes'), and facilitate cross-fertilisation between disciplines such as behavioural ecology, ecological physiology and quantitative genetics.
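
As a rough illustration of the variance partitioning described in this abstract, the sketch below fits a random-intercept mixed model to invented repeated-measures data and derives repeatability as the between-individual share of the total variance. Column names, sample sizes, and the use of statsmodels' MixedLM are all assumptions made for the example.

```python
# Sketch: partitioning phenotypic variance into between- and within-individual
# components with a random-intercept mixed model, then computing repeatability.
# Data and column names are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_ind, n_obs = 50, 6
ind = np.repeat(np.arange(n_ind), n_obs)
between = rng.normal(0, 1.0, n_ind)[ind]          # individual-specific deviations
within = rng.normal(0, 1.0, n_ind * n_obs)        # within-individual noise
df = pd.DataFrame({"individual": ind, "behaviour": 10 + between + within})

m = smf.mixedlm("behaviour ~ 1", df, groups=df["individual"]).fit()
var_between = float(m.cov_re.iloc[0, 0])          # between-individual variance
var_within = m.scale                              # residual (within-individual) variance
repeatability = var_between / (var_between + var_within)
print(f"between = {var_between:.2f}, within = {var_within:.2f}, R = {repeatability:.2f}")
```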

854 citations


Cites background from "Cryptic multiple hypotheses testing..."

  • ...…to encourage working towards the most significant description of the data set, because step-wise approaches are statistically problematic (e.g. Whittingham et al. 2006; Forstmeier & Schielzeth 2011; Simmons, Nelson & Simonsohn 2011) and inhibit general inferences (Dochtermann & Jenkins 2011)....

Journal ArticleDOI
TL;DR: It is shown that automated model selection techniques should not be relied on in the analysis of complex multivariable datasets, as this can lead to extreme biases when predictors are collinear, have strong effects but differ in their degree of measurement error.
Abstract: There has been a great deal of recent discussion of the practice of regression analysis (or more generally, linear modelling) in behaviour and ecology. In this paper, I wish to highlight two factors that have been under-considered, collinearity and measurement error in predictors, as well as to consider what happens when both exist at the same time. I examine what the consequences are for conventional regression analysis (ordinary least squares, OLS) as well as model averaging methods, typified by information theoretic approaches based around Akaike’s information criterion. Collinearity causes variance inflation of estimated slopes in OLS analysis, as is well known. In the presence of collinearity, model averaging reduces this variance for predictors with weak effects, but also can lead to parameter bias. When collinearity is strong or when all predictors have strong effects, model averaging relies heavily on the full model including all predictors and hence the results from this and OLS are essentially the same. I highlight that it is not safe to simply eliminate collinear variables without due consideration of their likely independent effects as this can lead to biases. Measurement error is also considered and I show that when collinearity exists, this can lead to extreme biases when predictors are collinear, have strong effects but differ in their degree of measurement error. I highlight techniques for dealing with and diagnosing these problems. These results reinforce that automated model selection techniques should not be relied on in the analysis of complex multivariable datasets.
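
One standard diagnostic for the collinearity problem discussed here is the variance inflation factor (VIF). The snippet below is a small assumed example (simulated predictors, arbitrary correlation of 0.9) using statsmodels' variance_inflation_factor; it only shows the mechanics and does not reproduce the paper's analyses.

```python
# Sketch: diagnosing collinearity via variance inflation factors (VIFs).
# Simulated data; the correlation level (0.9) is an arbitrary illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
N = 200
x1 = rng.normal(size=N)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=N)   # collinear with x1
x3 = rng.normal(size=N)                                    # independent
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF = 1 / (1 - R^2) from regressing each predictor on the others;
# values well above ~1 signal inflated slope variances.
for i, name in zip(range(1, 4), ["x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(X, i), 2))
```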

309 citations

Journal ArticleDOI
TL;DR: It is argued that a culture of ‘you can publish if you found a significant effect’ creates a systematic bias against the null hypothesis, which renders meta-analyses questionable and may even lead to a situation where hypotheses become difficult to falsify.
Abstract: Recently there has been a growing concern that many published research findings do not hold up in attempts to replicate them. We argue that this problem may originate from a culture of 'you can publish if you found a significant effect'. This culture creates a systematic bias against the null hypothesis which renders meta-analyses questionable and may even lead to a situation where hypotheses become difficult to falsify. In order to pinpoint the sources of error and possible solutions, we review current scientific practices with regard to their effect on the probability of drawing a false-positive conclusion. We explain why the proportion of published false-positive findings is expected to increase with (i) decreasing sample size, (ii) increasing pursuit of novelty, (iii) various forms of multiple testing and researcher flexibility, and (iv) incorrect P-values, especially due to unaccounted pseudoreplication, i.e. the non-independence of data points (clustered data). We provide examples showing how statistical pitfalls and psychological traps lead to conclusions that are biased and unreliable, and we show how these mistakes can be avoided. Ultimately, we hope to contribute to a culture of 'you can publish if your study is rigorous'. To this end, we highlight promising strategies towards making science more objective. Specifically, we enthusiastically encourage scientists to preregister their studies (including a priori hypotheses and complete analysis plans), to blind observers to treatment groups during data collection and analysis, and unconditionally to report all results. Also, we advocate reallocating some efforts away from seeking novelty and discovery and towards replicating important research findings of one's own and of others for the benefit of the scientific community as a whole. We believe these efforts will be aided by a shift in evaluation criteria away from the current system which values metrics of 'impact' almost exclusively and towards a system which explicitly values indices of scientific rigour.
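
The pseudoreplication point in this abstract (incorrect P-values from unaccounted non-independence of clustered data) can be illustrated with a toy simulation. The design below (10 clusters, 20 observations each, no true group effect) is an assumption chosen for illustration and is not taken from the paper.

```python
# Sketch: how unaccounted pseudoreplication (clustered data) inflates type I
# error. We test a group difference that does not exist, once ignoring the
# cluster structure and once on cluster means. Settings are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_clusters, n_per, n_sims, alpha = 10, 20, 2000, 0.05
naive_fp = means_fp = 0

for _ in range(n_sims):
    # two treatment groups of 5 clusters each; cluster effects create
    # non-independence, but there is no true treatment effect
    cluster_fx = rng.normal(0, 1, n_clusters)
    y = np.repeat(cluster_fx, n_per) + rng.normal(0, 1, n_clusters * n_per)
    group = np.repeat(np.arange(n_clusters) % 2, n_per)

    naive_p = stats.ttest_ind(y[group == 0], y[group == 1]).pvalue
    cluster_means = y.reshape(n_clusters, n_per).mean(axis=1)
    means_p = stats.ttest_ind(cluster_means[::2], cluster_means[1::2]).pvalue

    naive_fp += naive_p < alpha
    means_fp += means_p < alpha

print(f"type I error ignoring clusters: {naive_fp / n_sims:.2f}")   # far above 0.05
print(f"type I error on cluster means:  {means_fp / n_sims:.2f}")   # close to 0.05
```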

291 citations


Cites background or methods from "Cryptic multiple hypotheses testing..."

  • ...Simulations (Forstmeier & Schielzeth, 2011) revealed that P-values begin to become excessively small once there are fewer than three data points per predictor (N < 3k with k being the number of parameters to be estimated)....

  • ...In a simulation study it was shown (Forstmeier & Schielzeth, 2011), that when all null hypotheses are true (using randomly generated data), the chance of finding at least one significant effect lies close to 70%....

  • ...However, when screening the literature in the field of ecology and evolution, Forstmeier & Schielzeth (2011) found that authors rarely described the initial full model that they had fitted....

  • ...This has been termed ‘cryptic multiple hypotheses testing’ (Forstmeier & Schielzeth, 2011)....

Journal ArticleDOI
07 Jul 2017-PeerJ
TL;DR: The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process, and potential arguments against removing significance thresholds are discussed.
Abstract: The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
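
The 'one third' figure quoted in this abstract follows from simple arithmetic: with 80% power and a true effect, one study is significant and the other is not with probability 2 × 0.8 × 0.2. A two-line check:

```python
# Arithmetic behind the "one third of the cases" claim above: with a true
# effect and 80% power, two independent studies give one significant and one
# non-significant result with probability 2 * power * (1 - power).
power = 0.80
p_conflict = 2 * power * (1 - power)
print(p_conflict)   # 0.32, i.e. roughly one third of study pairs
```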

240 citations

References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate), a criterion that is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: SUMMARY The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses - the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
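
For concreteness, here is a minimal sketch of the Benjamini-Hochberg step-up procedure the abstract describes. The P values in the example are invented; for real analyses, statsmodels.stats.multitest.multipletests(pvals, method='fdr_bh') offers a maintained implementation.

```python
# Sketch of the Benjamini-Hochberg step-up procedure: reject the hypotheses
# with the k smallest P values, where k is the largest i such that
# p_(i) <= (i/m) * q. Example P values are made up.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking which hypotheses are rejected at FDR q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest i with p_(i) <= i*q/m
        reject[order[:k + 1]] = True              # reject all smaller P values too
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))          # rejects the two smallest
```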

83,420 citations


"Cryptic multiple hypotheses testing..." refers methods in this paper

  • ...The second approach is to control table-wide type I error rates by using sequential Bonferroni correction (Holm 1979; Hochberg 1988; Rice 1989; Wright 1992) or false discovery rate (FDR) control (Benjamini and Hochberg 1995; Storey and Tibshirani 2003)....

Book
19 Jun 2013
TL;DR: The second edition of this book is unique in that it focuses on methods for making formal statistical inference from all the models in an a priori set (Multi-Model Inference).
Abstract: Introduction * Information and Likelihood Theory: A Basis for Model Selection and Inference * Basic Use of the Information-Theoretic Approach * Formal Inference From More Than One Model: Multi-Model Inference (MMI) * Monte Carlo Insights and Extended Examples * Statistical Theory and Numerical Results * Summary
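
A small worked example of the multi-model inference idea: Akaike weights computed from a set of candidate-model AIC values. The AIC numbers below are invented for illustration and do not come from the book.

```python
# Sketch: Akaike weights for comparing a candidate model set.
# AIC values are invented for illustration.
import numpy as np

aic = np.array([102.3, 103.1, 107.8])            # AICs of candidate models
delta = aic - aic.min()                          # AIC differences from the best model
weights = np.exp(-0.5 * delta) / np.exp(-0.5 * delta).sum()
print(weights.round(3))                          # relative support for each model
```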

36,993 citations


"Cryptic multiple hypotheses testing..." refers background or methods in this paper

  • ...However, one should keep in mind that the standard errors (as well as point estimates) are conditional on the model structure (Burnham and Anderson 2002)....

  • ...Burnham and Anderson (2002) also emphasise that standard errors in linear models are conditional on the model structure and criticise stepwise selection procedures for their failure to incorporate model structure uncertainty into estimates of precision, i.e....

  • ...Since parameter estimates are conditional on the model (Burnham and Anderson 2002; Lukacs et al. 2010), estimates for a particular predictor might change signs depending on whether or not a correlated predictor is included....

  • ...Automated procedures of model simplification, however, often make us forget that this constitutes a case of multiple hypotheses testing that will lead to high rates of type I errors (Zhang 1992; Whittingham et al. 2006; Mundry and Nunn 2009) as well as biased effect size estimates (Burnham and Anderson 2002; Lukacs et al. 2010)....

Book
01 Jan 1991
TL;DR: In this book, the effects of predictor scaling on the coefficients of regression equations are investigated, with a focus on interactions between continuous predictors in multiple regression.
Abstract: Introduction * Interactions between Continuous Predictors in Multiple Regression * The Effects of Predictor Scaling on Coefficients of Regression Equations * Testing and Probing Three-Way Interactions * Structuring Regression Equations to Reflect Higher Order Relationships * Model and Effect Testing with Higher Order Terms * Interactions between Categorical and Continuous Variables * Reliability and Statistical Power * Conclusion * Some Contrasts Between ANOVA and MR in Practice
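
A brief sketch of the predictor-scaling issue this book treats: with an interaction term in the model, mean-centering the predictors changes the 'main effect' coefficients (each is the simple slope where the other predictor equals zero) but leaves the interaction coefficient unchanged. Data and coefficients below are invented.

```python
# Sketch: effect of mean-centering on coefficients when an interaction term
# is included. Simulated data; true coefficients are arbitrary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
N = 300
x1 = rng.normal(5, 1, N)
x2 = rng.normal(3, 1, N)
y = 1 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0, 1, N)

def fit(a, b):
    X = sm.add_constant(np.column_stack([a, b, a * b]))
    return sm.OLS(y, X).fit().params

# The "main effect" coefficients differ between the two fits because centering
# moves the zero point of the other predictor; the interaction term does not change.
print(fit(x1, x2).round(2))                          # raw predictors
print(fit(x1 - x1.mean(), x2 - x2.mean()).round(2))  # mean-centered predictors
```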

27,897 citations

Book
01 Jan 2000
TL;DR: Suitable for those new to statistics as well as students on intermediate and more advanced courses, the book walks students through from basic to advanced level concepts, all the while reinforcing knowledge through the use of SAS(R).
Abstract: Hot on the heels of the 3rd edition of Andy Field's award-winning Discovering Statistics Using SPSS comes this brand new version for students using SAS(R). Andy has teamed up with a co-author, Jeremy Miles, to adapt the book with all the most up-to-date commands and programming language from SAS(R) 9.2. If you're using SAS(R), this is the only book on statistics that you will need! The book provides a comprehensive collection of statistical methods, tests and procedures, covering everything you're likely to need to know for your course, all presented in Andy's accessible and humorous writing style. Suitable for those new to statistics as well as students on intermediate and more advanced courses, the book walks students through from basic to advanced level concepts, all the while reinforcing knowledge through the use of SAS(R). A 'cast of characters' supports the learning process throughout the book, from providing tips on how to enter data in SAS(R) properly to testing knowledge covered in chapters interactively, and 'real world' and invented examples illustrate the concepts and make the techniques come alive. The book's companion website (see link above) provides students with a wide range of invented and real published research datasets. Lecturers can find multiple choice questions and PowerPoint slides for each chapter to support their teaching.

25,020 citations

Journal ArticleDOI
TL;DR: In this paper, a simple and widely applicable multiple test procedure of the sequentially rejective type is presented, i.e. hypotheses are rejected one at a time until no further rejections can be done.
Abstract: This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, i.e. hypotheses are rejected one at a time until no further rejections can be done. It is shown that the test has a prescribed level of significance protection against error of the first kind for any combination of true hypotheses. The power properties of the test and a number of possible applications are also discussed.
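
A minimal sketch of Holm's sequentially rejective procedure as described in this abstract. The P values in the example are invented; statsmodels.stats.multitest.multipletests(pvals, method='holm') provides an equivalent maintained implementation.

```python
# Sketch of Holm's sequentially rejective (step-down) procedure: compare the
# ordered P values p_(1) <= ... <= p_(m) against alpha/m, alpha/(m-1), ...
# and stop at the first non-rejection. Example P values are made up.
import numpy as np

def holm(pvals, alpha=0.05):
    """Return a boolean array marking which hypotheses Holm's method rejects."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                                  # stop at the first non-rejection
    return reject

print(holm([0.001, 0.02, 0.03, 0.4], alpha=0.05))  # only the smallest P value is rejected
```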

20,459 citations


"Cryptic multiple hypotheses testing..." refers methods in this paper

  • ...The second approach is to control table-wide type I error rates by using sequential Bonferroni correction (Holm 1979; Hochberg 1988; Rice 1989; Wright 1992) or false discovery rate (FDR) control (Benjamini and Hochberg 1995; Storey and Tibshirani 2003)....