
Showing papers on "Statistical hypothesis testing published in 2020"


Book
23 Jul 2020
TL;DR: The idea of a randomization test has been explored in the context of data analysis for a long time, as discussed in this book, and randomization methods have been applied to a wide range of problems in biology, such as single-species ecology and community ecology.
Abstract: Preface to the Second Edition Preface to the First Edition Randomization The Idea of a Randomization Test Examples of Randomization Tests Aspects of Randomization Testing Raised by the Examples Sampling the Randomization Distribution or Systematic Enumeration Equivalent Test Statistics Significance Levels for Classical and Randomization Tests Limitations of Randomization Tests Confidence Limits by Randomization Applications of Randomization in Biology Single Species Ecology Genetics, Evolution and Natural Selection Community Ecology Randomization and Observational Studies Chapter Summary The Jackknife The Jackknife Estimator Applications of Jackknifing in Biology Single Species Analyses Genetics, Evolution and Natural Selection Community Ecology Chapter Summary The Bootstrap Resampling with Replacement Standard Bootstrap Confidence Limits Simple Percentile Confidence Limits Bias Corrected Percentile Confidence Limits Accelerated Bias Corrected Percentile Limits Other Methods for Constructing Confidence Intervals Transformations to Improve Bootstrap Intervals Parametric Confidence Intervals A Better Estimate of Bias Bootstrap Tests of Significance Balanced Bootstrap Sampling Applications of Bootstrapping in Biology Single Species Ecology Genetics, Evolution and Natural Selection Community Ecology Further Reading Chapter Summary Monte Carlo Methods Monte Carlo Tests Generalized Monte Carlo Tests Implicit Statistical Models Applications of Monte Carlo Methods in Biology Single Species Ecology Chapter Summary Some General Considerations Questions about Computer-Intensive Methods Power Number of Random Sets of Data Needed for a Test Determining a Randomization Distribution Exactly The number of replications for confidence intervals More Efficient Bootstrap Sampling Methods The Generation of Pseudo-Random Numbers The Generation of Random Permutations Chapter Summary One and Two Sample Tests The Paired Comparisons Design The One Sample Randomization Test The Two Sample Randomization Test Bootstrap Tests Randomizing Residuals Comparing the Variation in Two Samples A Simulation Study The Comparison of Two Samples on Multiple Measurements Further Reading Chapter Summary Exercises Analysis of Variance One Factor Analysis of Variance Tests for Constant Variance Testing for Mean Differences Using Residuals Examples of More Complicated Types of Analysis of Variance Procedures for Handling Unequal Group Variances Other Aspects of Analysis of Variance Further Reading Chapter Summary Exercises Regression Analysis Simple Linear Regression Randomizing Residuals Testing for a Non-Zero B Value Confidence Limits for B Multiple Linear Regression Alternative Randomization Methods with Multiple Regression Bootstrapping and Jackknifing with Regression Further Reading Chapter Summary Exercises Distance Matrices and Spatial Data Testing for Association between Distance Matrices The Mantel Test Sampling the Randomization Distribution Confidence Limits for Regression Coefficients The Multiple Mantel Test Other Approaches with More than Two Matrices Further Reading Chapter Summary Exercises Other Analyses on Spatial Data Spatial Data Analysis The Study of Spatial Point Patterns Mead's Randomization Test Tests for Randomness Based on Distances Testing for an Association between Two Point Patterns The Besag-Diggle Test Tests Using Distances between Points Testing for Random Marking Further Reading Chapter Summary Exercises Time Series Randomization and Time Series Randomization Tests for Serial Correlation 
Randomization Tests for Trend Randomization Tests for Periodicity Irregularly Spaced Series Tests on Times of Occurrence Discussion on Procedures for Irregular Series Bootstrap and Monte Carlo Tests Further Reading Chapter Summary Exercises Multivariate Data Univariate and Multivariate Tests Sample Means and Covariance Matrices Comparison of Sample Mean Vectors Chi-Squared Analyses for Count Data Principal Component Analysis and Other One Sample Methods Discriminant Function Analysis Further Reading Chapter Summary Exercises Survival and Growth Data Bootstrapping Survival Data Bootstrapping for Variable Selection Bootstrapping for Model Selection Group Comparisons Growth Data Further Reading Chapter Summary Exercises Non-Standard Situations The Construction of Tests in Non-Standard Situations Species Co-Occurrences on Islands An Alternative Generalized Monte Carlo Test Examining Time Changes in Niche Overlap Probing Multivariate Data with Random Skewers Ant Species Sizes in Europe Chapter Summary Bayesian Methods The Bayesian Approach to Data Analysis The Gibbs Sampler and Related Methods Biological Applications Further Reading Chapter Summary Exercises Conclusion and Final Comments Randomization Bootstrapping Monte Carlo Methods in General Classical versus Bayesian Inference Appendix Software for Computer Intensive Statistics References Index
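For illustration, a minimal sketch of the book's two-sample randomization test idea in Python (the data and the number of resamples are arbitrary choices made for this sketch):

```python
import numpy as np

def two_sample_randomization_test(x, y, n_resamples=9999, rng=None):
    """Two-sided randomization test for a difference in group means.

    The observed difference in means is compared with the differences
    obtained after repeatedly shuffling the group labels.
    """
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # add 1 to numerator and denominator so the observed labelling counts
    p_value = (count + 1) / (n_resamples + 1)
    return observed, p_value

# Example usage with made-up data
x = [12.1, 9.8, 11.4, 10.9, 13.0]
y = [9.5, 10.2, 8.8, 9.9, 10.4]
print(two_sample_randomization_test(x, y))
```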

4,706 citations


Journal ArticleDOI
TL;DR: This paper shows why P values do not differentiate inconclusive null findings from those that provide important evidence for the absence of an effect, and provides a tutorial on how to use Bayesian hypothesis testing to overcome this issue.
Abstract: Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect.
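As a rough sketch of the kind of default Bayesian t-test such tutorials build on, the JZS Bayes factor of Rouder et al. (2009) can be computed by numerical integration; the Cauchy prior scale r = 0.707 and the use of scipy here are assumptions of this sketch, not the article's own code:

```python
import numpy as np
from scipy import integrate, stats

def jzs_bf10_two_sample(x, y, r=0.707):
    """Approximate JZS Bayes factor BF10 for an independent-samples t-test.

    Follows Rouder et al. (2009): a Cauchy(0, r) prior on the standardized
    effect size under H1; BF10 > 1 favours the alternative hypothesis.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    t = stats.ttest_ind(x, y).statistic
    nu = n1 + n2 - 2                      # degrees of freedom
    n_eff = n1 * n2 / (n1 + n2)           # effective sample size

    def integrand(g):
        a = (1 + n_eff * r**2 * g) ** -0.5
        b = (1 + t**2 / ((1 + n_eff * r**2 * g) * nu)) ** (-(nu + 1) / 2)
        prior = (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g))
        return a * b * prior

    marginal_h1, _ = integrate.quad(integrand, 0, np.inf)
    marginal_h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    return marginal_h1 / marginal_h0

# Example usage with made-up data
rng = np.random.default_rng(0)
print(jzs_bf10_two_sample(rng.normal(0, 1, 40), rng.normal(0.1, 1, 40)))
```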

303 citations


Journal ArticleDOI
TL;DR: A survey on the current trends of the proposals of statistical analyses for the comparison of algorithms of computational intelligence can be found in this paper, along with a description of the statistical background of these tests.
Abstract: A key aspect of the design of evolutionary and swarm intelligence algorithms is studying their performance. Statistical comparisons are also a crucial part which allows for reliable conclusions to be drawn. In the present paper we gather and examine the approaches taken from different perspectives to summarise the assumptions made by these statistical tests, the conclusions reached and the steps followed to perform them correctly. In this paper, we conduct a survey on the current trends of the proposals of statistical analyses for the comparison of algorithms of computational intelligence and include a description of the statistical background of these tests. We illustrate the use of the most common tests in the context of the Competition on single-objective real parameter optimisation of the IEEE Congress on Evolutionary Computation (CEC) 2017 and describe the main advantages and drawbacks of the use of each kind of test and put forward some recommendations concerning their use.
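Two of the nonparametric tests commonly used in such algorithm comparisons, the Friedman test and the pairwise Wilcoxon signed-rank test, can be run as in the following sketch (the benchmark error values are invented for illustration):

```python
import numpy as np
from scipy import stats

# Invented example: mean errors of 3 algorithms on 5 benchmark functions
# (rows = benchmark problems, columns = algorithms).
errors = np.array([
    [0.12, 0.10, 0.20],
    [1.30, 1.10, 1.50],
    [0.05, 0.04, 0.07],
    [2.10, 1.90, 2.60],
    [0.55, 0.50, 0.80],
])

# Friedman test: do the algorithms differ over the set of problems?
stat, p = stats.friedmanchisquare(*errors.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")

# Pairwise Wilcoxon signed-rank test between algorithms 0 and 2; in practice
# the p-values of all pairs would be corrected for multiple comparisons
# (e.g. with Holm's procedure).
w, p_w = stats.wilcoxon(errors[:, 0], errors[:, 2])
print(f"Wilcoxon W = {w:.2f}, p = {p_w:.3f}")
```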

255 citations


Journal ArticleDOI
TL;DR: The 2.5 release of HyPhy includes a completely re-engineered computational core and analysis library that introduces new classes of evolutionary models and statistical tests, delivers substantial performance and stability enhancements, improves usability, streamlines end-to-end analysis workflows, makes it easier to develop custom analyses, and is mostly backwards compatible with previous HyPhy releases.
Abstract: HYpothesis testing using PHYlogenies (HyPhy) is a scriptable, open-source package for fitting a broad range of evolutionary models to multiple sequence alignments, and for conducting subsequent parameter estimation and hypothesis testing, primarily in the maximum likelihood statistical framework. It has become a popular choice for characterizing various aspects of the evolutionary process: natural selection, evolutionary rates, recombination, and coevolution. The 2.5 release (available from www.hyphy.org) includes a completely re-engineered computational core and analysis library that introduces new classes of evolutionary models and statistical tests, delivers substantial performance and stability enhancements, improves usability, streamlines end-to-end analysis workflows, makes it easier to develop custom analyses, and is mostly backward compatible with previous HyPhy releases.

252 citations


Journal ArticleDOI
TL;DR: A survey on the current trends of the proposals of statistical analyses for the comparison of algorithms of computational intelligence is conducted and a description of the statistical background of these tests is included.
Abstract: A key aspect of the design of evolutionary and swarm intelligence algorithms is studying their performance. Statistical comparisons are also a crucial part which allows for reliable conclusions to be drawn. In the present paper we gather and examine the approaches taken from different perspectives to summarise the assumptions made by these statistical tests, the conclusions reached and the steps followed to perform them correctly. In this paper, we conduct a survey on the current trends of the proposals of statistical analyses for the comparison of algorithms of computational intelligence and include a description of the statistical background of these tests. We illustrate the use of the most common tests in the context of the Competition on single-objective real parameter optimisation of the IEEE Congress on Evolutionary Computation (CEC) 2017 and describe the main advantages and drawbacks of the use of each kind of test and put forward some recommendations concerning their use.

196 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a test statistic based on the sample covariance between the residuals, which they call the generalised covariance measure (GCM), and prove that the validity of this form of test relies almost entirely on the weak requirement that the regression procedures are able to estimate the conditional means of $X$ given $Z$ and of $Y$ given $Z$ at a slow rate.
Abstract: It is a common saying that testing for conditional independence, that is, testing whether two random vectors $X$ and $Y$ are independent given $Z$, is a hard statistical problem if $Z$ is a continuous random variable (or vector). In this paper, we prove that conditional independence is indeed a particularly difficult hypothesis to test for. Valid statistical tests are required to have a size that is smaller than a pre-defined significance level, and different tests usually have power against a different class of alternatives. We prove that a valid test for conditional independence does not have power against any alternative. Given the nonexistence of a uniformly valid conditional independence test, we argue that tests must be designed so their suitability for a particular problem may be judged easily. To address this need, we propose in the case where $X$ and $Y$ are univariate to nonlinearly regress $X$ on $Z$, and $Y$ on $Z$, and then compute a test statistic based on the sample covariance between the residuals, which we call the generalised covariance measure (GCM). We prove that validity of this form of test relies almost entirely on the weak requirement that the regression procedures are able to estimate the conditional means of $X$ given $Z$, and of $Y$ given $Z$, at a slow rate. We extend the methodology to handle settings where $X$ and $Y$ may be multivariate or even high dimensional. While our general procedure can be tailored to the setting at hand by combining it with any regression technique, we develop the theoretical guarantees for kernel ridge regression. A simulation study shows that the test based on GCM is competitive with state of the art conditional independence tests. Code is available as the R package $\mathtt{GeneralisedCovarianceMeasure}$ on CRAN.
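A minimal sketch of the GCM idea, with an off-the-shelf scikit-learn regressor standing in for the kernel ridge regression analysed in the paper (the choice of gradient boosting and the toy data are assumptions of this sketch):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor

def gcm_test(x, y, z, regressor=GradientBoostingRegressor):
    """Generalised covariance measure test of X independent of Y given Z
    (univariate X and Y).

    Regress X on Z and Y on Z, multiply the residuals, and compare the
    normalised mean of these products with a standard normal distribution.
    """
    x, y, z = np.asarray(x, float), np.asarray(y, float), np.asarray(z, float)
    z = z.reshape(len(z), -1)
    rx = x - regressor().fit(z, x).predict(z)
    ry = y - regressor().fit(z, y).predict(z)
    r = rx * ry
    t_stat = np.sqrt(len(r)) * r.mean() / r.std(ddof=0)
    p_value = 2 * stats.norm.sf(abs(t_stat))
    return t_stat, p_value

# Toy example: X and Y both driven by Z, conditionally independent given Z
rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = np.sin(z) + 0.3 * rng.normal(size=500)
y = z**2 + 0.3 * rng.normal(size=500)
print(gcm_test(x, y, z))
```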

195 citations


Journal ArticleDOI
TL;DR: A generative null model, provided as an open-access software platform, is presented that generates surrogate maps whose spatial autocorrelation (SA) is matched to that of a target brain map; it can simulate surrogate brain maps that preserve the SA of cortical, subcortical, parcellated, and dense brain maps.

187 citations


Journal ArticleDOI
TL;DR: In four examples from the gerontology literature, different ways to specify alternative models that can be used to reject the presence of a meaningful or predicted effect in hypothesis tests are illustrated.
Abstract: Researchers often conclude an effect is absent when a null-hypothesis significance test yields a nonsignificant p value. However, it is neither logically nor statistically correct to conclude an effect is absent when a hypothesis test is not significant. We present two methods to evaluate the presence or absence of effects: Equivalence testing (based on frequentist statistics) and Bayes factors (based on Bayesian statistics). In four examples from the gerontology literature, we illustrate different ways to specify alternative models that can be used to reject the presence of a meaningful or predicted effect in hypothesis tests. We provide detailed explanations of how to calculate, report, and interpret Bayes factors and equivalence tests. We also discuss how to design informative studies that can provide support for a null model or for the absence of a meaningful effect. The conceptual differences between Bayes factors and equivalence tests are discussed, and we also note when and why they might lead to similar or different inferences in practice. It is important that researchers are able to falsify predictions or can quantify the support for predicted null effects. Bayes factors and equivalence tests provide useful statistical tools to improve inferences about null effects.
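A hedged sketch of one of the two approaches, equivalence testing via two one-sided t-tests (TOST); the equivalence bounds of ±0.5 are placeholder values, not a recommendation from the article:

```python
import numpy as np
from scipy import stats

def tost_ind(x, y, low, upp):
    """Two one-sided tests (TOST) for equivalence of two independent means.

    Equivalence is claimed at level alpha when the larger of the two
    one-sided p-values is below alpha, i.e. the difference in means is
    significantly above `low` and significantly below `upp`.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    diff = x.mean() - y.mean()
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    t_low = (diff - low) / se          # H0: diff <= low vs H1: diff > low
    t_upp = (diff - upp) / se          # H0: diff >= upp vs H1: diff < upp
    p_low = stats.t.sf(t_low, df)
    p_upp = stats.t.cdf(t_upp, df)
    return diff, max(p_low, p_upp)

# Toy example with placeholder equivalence bounds of +/- 0.5 units
rng = np.random.default_rng(1)
a, b = rng.normal(0, 1, 80), rng.normal(0.1, 1, 80)
print(tost_ind(a, b, -0.5, 0.5))
```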

175 citations


Journal ArticleDOI
TL;DR: The novelty of the proposed AMSD-kNN method is an unsupervised learning strategy for SHM that combines a new multivariate distance measure with a one-class kNN rule, finding sufficient nearest neighbours to guarantee well-conditioned estimates of the local covariance matrices.

129 citations


Journal ArticleDOI
TL;DR: The Mann-Kendall (MK) statistical test has been widely applied to trend detection in hydrometeorological time series; the results of this paper indicate that, in addition to the significance level and the sample length, the power of the MK test is closely related to the sample variance and the magnitude of the trend.
Abstract: The Mann-Kendall (MK) statistical test has been widely applied in the trend detection of the hydrometeorological time series. Previous studies have mainly focused on the null hypothesis of “no trend” or the “Type I Error”. However, few studies address the capability of the MK test to successfully recognize the trends. In some cases, especially when the trend test is jointly applied with hydropower station design, flood risk assessment, and water quality evaluation, the “Type II error” is equally important and should not be neglected. To cope with this problem, we carry out Monte Carlo simulations and the results indicate that in addition to the significance level and the sample length, the MK test power has a close relationship with the sample variance and the magnitude of the trend. For a given time series with fixed length, the power of the MK test increases as the slope increases and declines with increasing sample variance. A deterministic relationship between the slope and the standard deviation of the white noise that can be used for evaluating the power of the MK test has also been detected. Furthermore, we find that a positive autocorrelation contained in the time series will increase both the Type I and the Type II errors due to the enlargement of the variance in the MK statistics. Finally, we recommend that researchers slightly increase the significance level and lengthen the time series sample to improve the power of the MK test in future studies.
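A minimal sketch of the MK test (without tie correction) together with a Monte Carlo power estimate of the kind the paper describes; the slope, noise level, series length and number of simulations are arbitrary choices for this sketch:

```python
import numpy as np
from scipy import stats

def mann_kendall_z(x):
    """Mann-Kendall Z statistic for a monotonic trend (ties ignored)."""
    x = np.asarray(x, float)
    n = len(x)
    s = sum(np.sign(x[j] - x[k]) for k in range(n - 1) for j in range(k + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / np.sqrt(var_s)
    if s < 0:
        return (s + 1) / np.sqrt(var_s)
    return 0.0

def mk_power(slope, sigma, n=50, alpha=0.05, n_sim=2000, rng=None):
    """Monte Carlo power of the MK test for a linear trend in white noise."""
    rng = np.random.default_rng(rng)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    t = np.arange(n)
    rejections = 0
    for _ in range(n_sim):
        series = slope * t + rng.normal(0, sigma, n)
        if abs(mann_kendall_z(series)) > z_crit:
            rejections += 1
    return rejections / n_sim

# Power rises with the slope and falls with the noise standard deviation
print(mk_power(slope=0.02, sigma=1.0))
print(mk_power(slope=0.02, sigma=2.0))
```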

125 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide a guide for executing and interpreting a Bayesian ANOVA with JASP, an open-source statistical software program with a graphical user interface, using two empirical examples.
Abstract: Analysis of variance (ANOVA) is the standard procedure for statistical inference in factorial designs. Typically, ANOVAs are executed using frequentist statistics, where p-values determine statistical significance in an all-or-none fashion. In recent years, the Bayesian approach to statistics is increasingly viewed as a legitimate alternative to the p-value. However, the broad adoption of Bayesian statistics-and Bayesian ANOVA in particular-is frustrated by the fact that Bayesian concepts are rarely taught in applied statistics courses. Consequently, practitioners may be unsure how to conduct a Bayesian ANOVA and interpret the results. Here we provide a guide for executing and interpreting a Bayesian ANOVA with JASP, an open-source statistical software program with a graphical user interface. We explain the key concepts of the Bayesian ANOVA using two empirical examples.

Journal ArticleDOI
TL;DR: In this article, a statistical significance test for necessary conditions is proposed, which evaluates the evidence against the null hypothesis of an effect being due to chance, and is based on the approximate permutation test.
Abstract: In this article, we present a statistical significance test for necessary conditions. This is an elaboration of necessary condition analysis (NCA), which is a data analysis approach that estimates the necessity effect size of a condition X for an outcome Y. NCA puts a ceiling on the data, representing the level of X that is necessary (but not sufficient) for a given level of Y. The empty space above the ceiling relative to the total empirical space characterizes the necessity effect size. We propose a statistical significance test that evaluates the evidence against the null hypothesis of an effect being due to chance. Such a randomness test helps protect researchers from making Type 1 errors and drawing false positive conclusions. The test is an “approximate permutation test.” The test is available in NCA software for R. We provide suggestions for further statistical development of NCA.

Journal ArticleDOI
TL;DR: A surprisingly simple method for producing statistical significance statements without any regularity conditions is presented, and it is shown that, in settings where computing the MLE is hard, it is sufficient to upper bound the maximum likelihood for the purpose of constructing valid tests and intervals.
Abstract: We propose a general method for constructing confidence sets and hypothesis tests that have finite-sample guarantees without regularity conditions. We refer to such procedures as “universal”. The method is very simple and is based on a modified version of the usual likelihood-ratio statistic that we call “the split likelihood-ratio test” (split LRT) statistic. The (limiting) null distribution of the classical likelihood-ratio statistic is often intractable when used to test composite null hypotheses in irregular statistical models. Our method is especially appealing for statistical inference in these complex setups. The method we suggest works for any parametric model and also for some nonparametric models, as long as computing a maximum-likelihood estimator (MLE) is feasible under the null. Canonical examples arise in mixture modeling and shape-constrained inference, for which constructing tests and confidence sets has been notoriously difficult. We also develop various extensions of our basic methods. We show that in settings when computing the MLE is hard, for the purpose of constructing valid tests and intervals, it is sufficient to upper bound the maximum likelihood. We investigate some conditions under which our methods yield valid inferences under model misspecification. Further, the split LRT can be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid P values and confidence sequences. Finally, when combined with the method of sieves, it can be used to perform model selection with nested model classes.
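A toy sketch of the split likelihood-ratio test for a Gaussian mean with unknown variance, assuming an even data split and maximum-likelihood fitting with numpy/scipy; it illustrates the general recipe rather than any specific example from the paper:

```python
import numpy as np
from scipy import stats

def split_lrt_gaussian_mean(x, mu0=0.0, alpha=0.05, rng=None):
    """Universal 'split likelihood-ratio test' of H0: mu = mu0 for a
    Gaussian sample with unknown variance (a toy instance of the method).

    The sample is split in two: the alternative's parameters are fitted on
    one half, the null's on the other, and H0 is rejected when the ratio of
    the two likelihoods evaluated on the null-fitting half exceeds 1/alpha.
    """
    rng = np.random.default_rng(rng)
    x = rng.permutation(np.asarray(x, float))
    d1, d0 = x[: len(x) // 2], x[len(x) // 2:]

    # MLE over the full model from D1 (mean and variance free)
    mu1, sd1 = d1.mean(), d1.std(ddof=0)
    # MLE under H0 from D0 (mean fixed at mu0, variance free)
    sd0 = np.sqrt(np.mean((d0 - mu0) ** 2))

    log_t = (stats.norm.logpdf(d0, mu1, sd1).sum()
             - stats.norm.logpdf(d0, mu0, sd0).sum())
    reject = log_t >= np.log(1 / alpha)   # Markov's inequality gives level alpha
    return log_t, reject

# Example usage with made-up data drawn away from the null value mu0 = 0
x = np.random.default_rng(2).normal(0.4, 1.0, 200)
print(split_lrt_gaussian_mean(x))
```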

Journal ArticleDOI
TL;DR: Crowdsourced testing of research hypotheses helps reveal the true consistency of empirical support for a scientific claim.
Abstract: To what extent are research results influenced by subjective decisions that scientists make as they design studies? Fifteen research teams independently designed studies to answer five original research questions related to moral judgments, negotiations, and implicit cognition. Participants from two separate large samples (total N > 15,000) were then randomly assigned to complete one version of each study. Effect sizes varied dramatically across different sets of materials designed to test the same hypothesis: materials from different teams rendered statistically significant effects in opposite directions for four out of five hypotheses, with the narrowest range in estimates being d = -0.37 to +0.26. Meta-analysis and a Bayesian perspective on the results revealed overall support for two hypotheses, and a lack of support for three hypotheses. Overall, practically none of the variability in effect sizes was attributable to the skill of the research team in designing materials, while considerable variability was attributable to the hypothesis being tested. In a forecasting survey, predictions of other scientists were significantly correlated with study results, both across and within hypotheses. Crowdsourced testing of research hypotheses helps reveal the true consistency of empirical support for a scientific claim.

Journal ArticleDOI
M. Baak, R. Koopman, H. Snoek, Sander Klous
TL;DR: In this paper, a new and practical correlation coefficient, ϕK, based on several refinements to Pearson's hypothesis test of independence of two variables, is proposed, which works consistently between categorical, ordinal and interval variables, and can be used to calculate correlations between variables of mixed type.

Journal ArticleDOI
Riko Kelter
TL;DR: A non-technical introduction to Bayesian hypothesis testing in JASP is provided, comparing traditional tests and statistical methods with their Bayesian counterparts and showing the strengths and limitations of JASP for frequentist NHST and Bayesian inference.
Abstract: Although null hypothesis significance testing (NHST) is the agreed gold standard in medical decision making and the most widespread inferential framework used in medical research, it has several drawbacks. Bayesian methods can complement or even replace frequentist NHST, but these methods have been underutilised mainly due to a lack of easy-to-use software. JASP is an open-source software for common operating systems, which has recently been developed to make Bayesian inference more accessible to researchers, including the most common tests, an intuitive graphical user interface and publication-ready output plots. This article provides a non-technical introduction to Bayesian hypothesis testing in JASP by comparing traditional tests and statistical methods with their Bayesian counterparts. The comparison shows the strengths and limitations of JASP for frequentist NHST and Bayesian inference. Specifically, Bayesian hypothesis testing via Bayes factors can complement and even replace NHST in most situations in JASP. While p-values can only reject the null hypothesis, the Bayes factor can state evidence for both the null and the alternative hypothesis, making confirmation of hypotheses possible. Also, effect sizes can be precisely estimated in the Bayesian paradigm via JASP. Bayesian inference has not been widely used until now due to the dearth of accessible software. Medical decision making can be complemented by Bayesian hypothesis testing in JASP, providing richer information than single p-values and thus strengthening the credibility of an analysis. Through an easy point-and-click interface, researchers used to other graphical statistical packages like SPSS can seamlessly transition to JASP and benefit from the listed advantages with only a few limitations.

Proceedings Article
12 Jul 2020
TL;DR: It is argued that it is problematic to measure accuracy with respect to data that reflects bias, and that accuracy should instead be considered with respect to ideal, unbiased data.
Abstract: A trade-off between accuracy and fairness is almost taken as a given in the existing literature on fairness in machine learning. Yet, it is not preordained that accuracy should decrease with increased fairness. Novel to this work, we examine fair classification through the lens of mismatched hypothesis testing: trying to find a classifier that distinguishes between two ideal distributions when given two mismatched distributions that are biased. Using Chernoff information, a tool in information theory, we theoretically demonstrate that, contrary to popular belief, there always exist ideal distributions such that optimal fairness and accuracy (with respect to the ideal distributions) are achieved simultaneously: there is no trade-off. Moreover, the same classifier yields the lack of a trade-off with respect to ideal distributions while yielding a trade-off when accuracy is measured with respect to the given (possibly biased) dataset. To complement our main result, we formulate an optimization to find ideal distributions and derive fundamental limits to explain why a trade-off exists on the given biased dataset. We also derive conditions under which active data collection can alleviate the fairness-accuracy trade-off in the real world. Our results lead us to contend that it is problematic to measure accuracy with respect to data that reflects bias, and instead, we should be considering accuracy with respect to ideal, unbiased data.
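For reference, the Chernoff information between two discrete distributions, the information-theoretic quantity the paper uses, can be computed by a one-dimensional optimisation as in this sketch (the two-outcome distributions are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_information(p, q):
    """Chernoff information between two discrete distributions.

    C(P, Q) = max over lambda in [0, 1] of
              -log sum_x p(x)**lambda * q(x)**(1 - lambda);
    it governs the best achievable exponential decay of the error
    probability in binary hypothesis testing between P and Q.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)

    def neg_exponent(lam):
        return np.log(np.sum(p ** lam * q ** (1 - lam)))

    res = minimize_scalar(neg_exponent, bounds=(0.0, 1.0), method="bounded")
    return -res.fun, res.x   # (Chernoff information, optimal lambda)

# Toy example: distinguishing two biased coins
print(chernoff_information([0.7, 0.3], [0.4, 0.6]))
```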

Journal ArticleDOI
TL;DR: It is demonstrated and illustrated that the Monte Carlo technique leads to overly precise conclusions on the values of estimated parameters, and to incorrect hypothesis tests, thus pointing out a fundamental flaw.
Abstract: The Monte Carlo technique is widely used and recommended for including uncertainties in LCA. Typically, 1000 or 10,000 runs are done, but a clear argument for that number is not available, and with the growing size of LCA databases, an excessively high number of runs can be time-consuming. We therefore investigate whether a large number of runs is useful, or whether it might be unnecessary or even harmful. We review the standard theory of probability distributions for describing stochastic variables, including the combination of different stochastic variables into a calculation. We also review the standard theory of inferential statistics for estimating a probability distribution, given a sample of values. For estimating the distribution of a function of probability distributions, two major techniques are available: analytical, applying probability theory, and numerical, using Monte Carlo simulation. Because the analytical technique is often unavailable, the obvious way out is Monte Carlo. However, we demonstrate and illustrate that it leads to overly precise conclusions on the values of estimated parameters, and to incorrect hypothesis tests. We demonstrate the effect for two simple cases: one system in a stand-alone analysis and a comparative analysis of two alternative systems. Both cases illustrate that statistical hypotheses that should not be rejected in fact are rejected in a highly convincing way, thus pointing out a fundamental flaw. Apart from the obvious recommendation to use larger samples for estimating input distributions, we suggest restricting the number of Monte Carlo runs to a number not greater than the sample sizes used for the input parameters. As a final note, when the input parameters are not estimated using samples, but through a procedure, such as the popular pedigree approach, the Monte Carlo approach should not be used at all.

Posted Content
TL;DR: A class of kernel-based two-sample tests is proposed, which aim to determine whether two sets of samples are drawn from the same distribution; the consistency proof for the proposed adaptation method applies both to kernels on deep features and to simpler radial basis kernels or multiple kernel learning.
Abstract: We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution. Our tests are constructed from kernels parameterized by deep neural nets, trained to maximize test power. These tests adapt to variations in distribution smoothness and shape over space, and are especially suited to high dimensions and complex data. By contrast, the simpler kernels used in prior kernel testing work are spatially homogeneous, and adaptive only in lengthscale. We explain how this scheme includes popular classifier-based two-sample tests as a special case, but improves on them in general. We provide the first proof of consistency for the proposed adaptation method, which applies both to kernels on deep features and to simpler radial basis kernels or multiple kernel learning. In experiments, we establish the superior performance of our deep kernels in hypothesis testing on benchmark and real-world data. The code of our deep-kernel-based two sample tests is available at this https URL.
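A hedged sketch of the underlying kernel two-sample (MMD) test with a plain Gaussian kernel and a permutation p-value; the deep-kernel parameterisation trained to maximise test power, which is the paper's contribution, is not shown here:

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd2(x, y, bandwidth):
    """Biased estimate of the squared MMD with a Gaussian (RBF) kernel."""
    k = lambda a, b: np.exp(-cdist(a, b, "sqeuclidean") / (2 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd_permutation_test(x, y, n_perms=500, rng=None):
    """Kernel two-sample test: permutation p-value for the MMD statistic.

    Uses the median heuristic for the bandwidth; the deep-kernel variant in
    the paper instead learns the kernel to maximise test power.
    """
    rng = np.random.default_rng(rng)
    pooled = np.vstack([x, y])
    bandwidth = np.median(cdist(pooled, pooled)) + 1e-12
    observed = mmd2(x, y, bandwidth)
    count = 0
    for _ in range(n_perms):
        perm = rng.permutation(pooled)
        if mmd2(perm[: len(x)], perm[len(x):], bandwidth) >= observed:
            count += 1
    return observed, (count + 1) / (n_perms + 1)

# Toy example: two-dimensional samples with shifted means
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1, size=(100, 2))
y = rng.normal(0.5, 1, size=(100, 2))
print(mmd_permutation_test(x, y))
```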

Journal ArticleDOI
TL;DR: In this article, the authors investigate the problem of statistical inference of true model parameters based on SGD when the population loss function is strongly convex and satisfies certain smoothness conditions, and propose two consistent estimators of the asymptotic covariance of the average iterate from SGD: (1) a plug-in estimator, and (2) a batch-means estimator.
Abstract: The stochastic gradient descent (SGD) algorithm has been widely used in statistical estimation for large-scale data due to its computational and memory efficiency. While most existing works focus on the convergence of the objective function or the error of the obtained solution, we investigate the problem of statistical inference of true model parameters based on SGD when the population loss function is strongly convex and satisfies certain smoothness conditions. Our main contributions are twofold. First, in the fixed dimension setup, we propose two consistent estimators of the asymptotic covariance of the average iterate from SGD: (1) a plug-in estimator, and (2) a batch-means estimator, which is computationally more efficient and only uses the iterates from SGD. Both proposed estimators allow us to construct asymptotically exact confidence intervals and hypothesis tests. Second, for high-dimensional linear regression, using a variant of the SGD algorithm, we construct a debiased estimator of each regression coefficient that is asymptotically normal. This gives a one-pass algorithm for computing both the sparse regression coefficients and confidence intervals, which is computationally attractive and applicable to online data.

Journal ArticleDOI
TL;DR: A parameter is introduced that measures the goodness of fit of a model but does not depend on the sample size; a step-by-step illustration of the proposed method is given using a model for post-neonatal mortality developed in a large cohort of more than 300,000 observations.
Abstract: Evaluating the goodness of fit of logistic regression models is crucial to ensure the accuracy of the estimated probabilities. Unfortunately, such evaluation is problematic in large samples. Because the power of traditional goodness of fit tests increases with the sample size, practically irrelevant discrepancies between estimated and true probabilities are increasingly likely to cause the rejection of the hypothesis of perfect fit in larger and larger samples. This phenomenon has been widely documented for popular goodness of fit tests, such as the Hosmer-Lemeshow test. To address this limitation, we propose a modification of the Hosmer-Lemeshow approach. By standardizing the noncentrality parameter that characterizes the alternative distribution of the Hosmer-Lemeshow statistic, we introduce a parameter that measures the goodness of fit of a model but does not depend on the sample size. We provide the methodology to estimate this parameter and construct confidence intervals for it. Finally, we propose a formal statistical test to rigorously assess whether the fit of a model, albeit not perfect, is acceptable for practical purposes. The proposed method is compared in a simulation study with a competing modification of the Hosmer-Lemeshow test, based on repeated subsampling. We provide a step-by-step illustration of our method using a model for postneonatal mortality developed in a large cohort of more than 300 000 observations.

Proceedings Article
18 Jun 2020
TL;DR: An improved method for symbolic regression that seeks to fit data to formulas that are Pareto-optimal, in the sense of having the best accuracy for a given complexity is presented.
Abstract: We present an improved method for symbolic regression that seeks to fit data to formulas that are Pareto-optimal, in the sense of having the best accuracy for a given complexity. It improves on the previous state-of-the-art by typically being orders of magnitude more robust toward noise and bad data, and also by discovering many formulas that stumped previous methods. We develop a method for discovering generalized symmetries (arbitrary modularity in the computational graph of a formula) from gradient properties of a neural network fit. We use normalizing flows to generalize our symbolic regression method to probability distributions from which we only have samples, and employ statistical hypothesis testing to accelerate robust brute-force search.

Proceedings ArticleDOI
02 Feb 2020
TL;DR: In this article, the authors present a new theory of hypothesis testing whose main concept is the s-value, a notion of evidence which, unlike p-values, allows for effortlessly combining evidence from several tests, even in the common scenario where the decision to perform a new test depends on the previous test outcome: safe tests based on s-values generally preserve Type-I error guarantees under such 'optional continuation'.
Abstract: We present a new theory of hypothesis testing. The main concept is the s-value, a notion of evidence which, unlike p-values, allows for effortlessly combining evidence from several tests, even in the common scenario where the decision to perform a new test depends on the previous test outcome: safe tests based on s-values generally preserve Type-I error guarantees under such ‘optional continuation’. S-values exist for completely general testing problems with composite null and alternatives. Their prime interpretation is in terms of gambling or investing, each S-value corresponding to a particular investment. Surprisingly, optimal "GROW" S-values, which lead to fastest capital growth, are fully characterized by the joint information projection (JIPr) between the set of all Bayes marginal distributions on ${\mathcal{H}_0}$ and ${\mathcal{H}_1}$. Thus, optimal s-values also have an interpretation as Bayes factors, with priors given by the JIPr. We illustrate the theory using two classical testing scenarios: the one-sample t-test and the 2 × 2-contingency table. In the t-test setting, GROW S-values correspond to adopting the right Haar prior on the variance, like in Jeffreys’ Bayesian t-test. However, unlike Jeffreys’, the default safe t-test puts a discrete 2-point prior on the effect size, leading to better behaviour in terms of statistical power. Sharing Fisherian, Neymanian and Jeffreys-Bayesian interpretations, S-values and safe tests may provide a methodology acceptable to adherents of all three schools.
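A toy sketch of optional continuation with e-values (s-values): a simple likelihood-ratio e-value for a Bernoulli null against a fixed point alternative, multiplied across batches; the GROW construction via the joint information projection described in the paper is more general than this:

```python
import numpy as np

def bernoulli_e_value(data, p0=0.5, p1=0.7):
    """Likelihood-ratio E-value for H0: Bernoulli(p0) against the simple
    alternative Bernoulli(p1).  Under H0 its expectation is 1, so by
    Markov's inequality P(E >= 1/alpha) <= alpha.
    """
    data = np.asarray(data)
    k, n = data.sum(), len(data)
    return (p1**k * (1 - p1)**(n - k)) / (p0**k * (1 - p0)**(n - k))

rng = np.random.default_rng(4)
alpha = 0.05
running_e = 1.0
# Optional continuation: run batches one after another, multiplying the
# E-values; the Type-I error guarantee is preserved no matter when we stop.
for batch in range(5):
    data = rng.binomial(1, 0.7, size=20)     # data actually from p = 0.7
    running_e *= bernoulli_e_value(data)
    print(f"batch {batch}: running E = {running_e:.2f}, "
          f"reject H0 = {running_e >= 1 / alpha}")
```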

Journal ArticleDOI
TL;DR: This paper describes a range of circumstances within biological research where the McNemar test can be effectively applied, describes the different variants of the test that exist, explains how these variants can be accessed in R, and offers guidance on which of these variants to adopt.
Abstract: It is not uncommon for researchers to want to interrogate paired binomial data. For example, researchers may want to compare an organism’s response (positive or negative) to two different stimuli. If they apply both stimuli to a sample of individuals, it would be natural to present the data in a 2 × 2 table. There would be two cells with concordant results (the frequency of individuals which responded positively or negatively to both stimuli) and two cells with discordant results (the frequency of individuals who responded positively to one stimulus, but negatively to the other). The key issue is whether the totals in the two discordant cells are sufficiently different to suggest that the stimuli trigger different reactions. In terms of the null hypothesis testing paradigm, this would translate as a P value which is the probability of seeing the observed difference in these two values or a more extreme difference if the two stimuli produced an identical reaction. The statistical test designed to provide this P value is the McNemar test. Here, we seek to promote greater and better use of the McNemar test. To achieve this, we fully describe a range of circumstances within biological research where it can be effectively applied, describe the different variants of the test that exist, explain how these variants can be accessed in R, and offer guidance on which of these variants to adopt. To support our arguments, we highlight key recent methodological advances and compare these with a novel survey of current usage of the test. When analysing paired binomial data, researchers appear to reflexively apply a chi-squared test, with the McNemar test being largely overlooked, despite it often being more appropriate. As these tests evaluate a different null hypothesis, selecting the appropriate test is essential for effective analysis. When using the McNemar test, there are four methods that can be applied. Recent advice has outlined clear guidelines on which method should be used. By conducting a survey, we provide support for these guidelines, but identify that the method chosen in publications is rarely specified or the most appropriate. Our study provides clear guidance on which method researchers should select and highlights examples of when this test should be used and how it can be implemented easily to improve future research.
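The paper's guidance concerns the R implementations; for comparison, a sketch of the exact and asymptotic variants using the statsmodels Python package (the paired counts are invented for illustration):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Invented paired binomial data: responses of the same 50 organisms to two
# stimuli.  Rows = stimulus A (positive, negative), columns = stimulus B.
table = np.array([[16, 5],    # A+, B+   and   A+, B-
                  [14, 15]])  # A-, B+   and   A-, B-

# The test only uses the discordant cells (5 and 14): are they more
# unbalanced than chance would allow if both stimuli acted identically?
exact = mcnemar(table, exact=True)                     # exact binomial version
asymptotic = mcnemar(table, exact=False, correction=True)
print(f"exact p = {exact.pvalue:.3f}, "
      f"chi2 = {asymptotic.statistic:.2f}, p = {asymptotic.pvalue:.3f}")
```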

Journal ArticleDOI
TL;DR: This paper reviews the problems associated with post-hoc power, particularly the fact that the resulting calculated power is a monotone function of the p value and therefore contains no additional helpful information.
Abstract: Post-hoc power estimates (power calculated for hypothesis tests after performing them) are sometimes requested by reviewers in an attempt to promote more rigorous designs. However, they should never be requested or reported because they have been shown to be logically invalid and practically misleading. We review the problems associated with post-hoc power, particularly the fact that the resulting calculated power is a monotone function of the p-value and therefore contains no additional helpful information. We then discuss some situations that seem at first to call for post-hoc power analysis, such as attempts to decide on the practical implications of a null finding, or attempts to determine whether the sample size of a secondary data analysis is adequate for a proposed analysis, and consider possible approaches to achieving these goals. We make recommendations for practice in situations in which clear recommendations can be made, and point out other situations where further methodological research and discussion are required.
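The monotone relationship is easy to see for a two-sided z-test, where post-hoc ("observed") power is a deterministic function of the p value alone, as in this sketch:

```python
import numpy as np
from scipy import stats

def post_hoc_power_from_p(p, alpha=0.05):
    """'Observed power' of a two-sided z-test, computed from its p-value.

    Because the observed effect size and the p-value determine each other,
    post-hoc power is a deterministic, monotone function of p and adds no
    information beyond the p-value itself.
    """
    z_obs = stats.norm.isf(p / 2)            # |z| implied by the p-value
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.cdf(z_obs - z_crit) + stats.norm.cdf(-z_obs - z_crit)

# A study with p exactly at the 0.05 threshold has observed power of ~50%
for p in [0.01, 0.05, 0.20, 0.50]:
    print(f"p = {p:.2f}  ->  post-hoc power = {post_hoc_power_from_p(p):.2f}")
```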

Posted ContentDOI
21 Jul 2020
TL;DR: The hierarchical bootstrap has been shown to be an effective tool for accurately analysing hierarchically nested data, and while it has been used extensively in the statistical literature, its use is not widespread in neuroscience, despite the ubiquity of hierarchical datasets, as discussed in this paper.
Abstract: A common feature in many neuroscience datasets is the presence of hierarchical data structures, most commonly recording the activity of multiple neurons in multiple animals across multiple trials. Accordingly, the measurements constituting the dataset are not independent, even though the traditional statistical analyses often applied in such cases (e.g., Student’s t-test) treat them as such. The hierarchical bootstrap has been shown to be an effective tool to accurately analyze such data, and while it has been used extensively in the statistical literature, its use is not widespread in neuroscience - despite the ubiquity of hierarchical datasets. In this paper, we illustrate the intuitiveness and utility of this approach to analyze hierarchically nested datasets. We use simulated neural data to show that traditional statistical tests can result in a false positive rate of over 45%, even if the Type-I error rate is set at 5%. While summarizing data across non-independent points (or lower levels) can potentially fix this problem, this approach greatly reduces the statistical power of the analysis. The hierarchical bootstrap, when applied sequentially over the levels of the hierarchical structure, keeps the Type-I error rate within the intended bound and retains more statistical power than summarizing methods. We conclude by demonstrating the effectiveness of the method in two real-world examples, first analyzing singing data in Bengalese finches (Lonchura striata var. domestica) and second quantifying changes in behavior under optogenetic control in flies (Drosophila melanogaster).
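A simplified sketch of the hierarchical bootstrap for data nested as animals > neurons > trials; the resampling scheme and the tiny invented dataset are illustrative assumptions, not the paper's code:

```python
import numpy as np

def hierarchical_bootstrap_mean(data, n_boot=2000, rng=None):
    """Hierarchical bootstrap of the grand mean for data nested as
    {animal: {neuron: array of per-trial measurements}}.

    At each iteration, animals are resampled with replacement, then neurons
    within each sampled animal, then trials within each sampled neuron, so
    that uncertainty at every level of the hierarchy is propagated.
    """
    rng = np.random.default_rng(rng)
    animals = list(data.keys())
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        values = []
        for animal in rng.choice(animals, size=len(animals), replace=True):
            neurons = list(data[animal].keys())
            for neuron in rng.choice(neurons, size=len(neurons), replace=True):
                trials = data[animal][neuron]
                resampled = rng.choice(trials, size=len(trials), replace=True)
                values.append(resampled.mean())
        boot_means[b] = np.mean(values)
    return boot_means

# Tiny invented dataset: 2 animals x 2 neurons x a handful of trials
data = {
    "animal_1": {"n1": np.array([1.0, 1.2, 0.9]), "n2": np.array([1.4, 1.6, 1.5])},
    "animal_2": {"n1": np.array([0.7, 0.8, 0.6]), "n2": np.array([1.1, 0.9, 1.0])},
}
boot = hierarchical_bootstrap_mean(data)
print(np.percentile(boot, [2.5, 97.5]))   # 95% bootstrap interval for the mean
```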

Journal ArticleDOI
TL;DR: In this article, the observed ranks are conceptualized as an impoverished reflection of an underlying continuous scale, and inference concerns the parameters that govern the latent representation, which can be used to obtain Bayes factors for rank-order problems.
Abstract: Bayesian inference for rank-order problems is frustrated by the absence of an explicit likelihood function. This hurdle can be overcome by assuming a latent normal representation that is consistent with the ordinal information in the data: the observed ranks are conceptualized as an impoverished reflection of an underlying continuous scale, and inference concerns the parameters that govern the latent representation. We apply this generic data-augmentation method to obtain Bayes factors for three popular rank-based tests: the rank sum test, the signed rank test, and Spearman's ρs.

Journal ArticleDOI
TL;DR: Feature selection through ensemble classifiers helps to select important variables and is thus applicable to different sample distributions; experiments demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles.
Abstract: Various methods for differential expression analysis have been widely used to identify features which best distinguish between different categories of samples. Multiple hypothesis testing may leave out explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing holds a non-mainstream position, considering the large computation overhead of large-scale matrix operation. Random forest provides a classification strategy for calculation of variable importance. However, it may be unsuitable for different distributions of samples. Based on the thought of using an ensemble classifier, we develop a feature selection tool for differential expression analysis on expression profiles (i.e., ECFS-DEA for short). Considering the differences in sample distribution, a graphical user interface is designed to allow the selection of different base classifiers. Inspired by random forest, a common measure which is applicable to any base classifier is proposed for calculation of variable importance. After an interactive selection of a feature on sorted individual variables, a projection heatmap is presented using k-means clustering. ROC curve is also provided, both of which can intuitively demonstrate the effectiveness of the selected feature. Feature selection through ensemble classifiers helps to select important variables and thus is applicable for different sample distributions. Experiments on simulation and realistic data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.

Journal ArticleDOI
Riko Kelter
TL;DR: An extensive simulation study is conducted to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student's (and Welch’s) t-test.
Abstract: The replication crisis hit the medical sciences about a decade ago, but today still most of the flaws inherent in null hypothesis significance testing (NHST) have not been solved. While the drawbacks of p-values have been detailed in endless venues, for clinical research, only a few attractive alternatives have been proposed to replace p-values and NHST. Bayesian methods are one of them, and they are gaining increasing attention in medical research, as some of their advantages include the description of model parameters in terms of probability, as well as the incorporation of prior information in contrast to the frequentist framework. While Bayesian methods are not the only remedy to the situation, there is an increasing agreement that they are an essential way to avoid common misconceptions and false interpretation of study results. The requirements necessary for applying Bayesian statistics have transitioned from detailed programming knowledge into simple point-and-click programs like JASP. Still, the multitude of Bayesian significance and effect measures which contrast the gold standard of significance in medical research, the p-value, causes a lack of agreement on which measure to report. Therefore, in this paper, we conduct an extensive simulation study to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution. In it, we analyse the behaviour of these measures for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student’s (and Welch’s) t-test. The results show that some measures cannot state evidence for both the null and the alternative. While the different indices behave similarly regarding increasing sample size and noise, the prior modelling influences the obtained results and extreme priors allow for cherry-picking similar to p-hacking in the frequentist paradigm. The indices behave quite differently regarding their ability to control the type I error rates and regarding their ability to detect an existing effect. Based on the results, two of the commonly used indices can be recommended for more widespread use in clinical and biomedical research, as they improve the type I error control compared to the classic two-sample t-test and enjoy multiple other desirable properties.

Journal ArticleDOI
TL;DR: Accuracy, false detection and computational time provide a comprehensive assessment of each feature selection method and shed light on alternatives to the Lasso-regularization which are not as popular in practice yet.
Abstract: In this paper, we review state-of-the-art methods for feature selection in statistics with an application-oriented eye. Indeed, sparsity is a valuable property and the profusion of research on the topic might have provided little guidance to practitioners. We demonstrate empirically how noise and correlation impact both the accuracy—the number of correct features selected—and the false detection—the number of incorrect features selected—for five methods: the cardinality-constrained formulation, its Boolean relaxation, l1 regularization and two methods with non-convex penalties. A cogent feature selection method is expected to exhibit a two-fold convergence, namely the accuracy and the false detection rate should converge to 1 and 0 respectively, as the sample size increases. As a result, a proper method should recover all and nothing but the true features. Empirically, the integer optimization formulation and its Boolean relaxation are the closest to exhibiting these two properties consistently in various regimes of noise and correlation. In addition, apart from the discrete optimization approach, which requires a substantial, yet often affordable, computational time, all methods terminate in times comparable with the glmnet package for Lasso. We released code for methods that were not publicly implemented. Jointly considered, accuracy, false detection and computational time provide a comprehensive assessment of each feature selection method and shed light on alternatives to Lasso regularization which are not as popular in practice yet.