
Showing papers on "Statistical hypothesis testing published in 2022"


Journal ArticleDOI
01 Mar 2022
TL;DR: The authors suggest a language of evidence that allows for a more nuanced way to communicate scientific findings, as a simple and intuitive alternative to statistical significance testing, and provide examples of how to rewrite results sections in research papers accordingly.
Abstract: Despite much criticism, black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff still is the standard way to report scientific findings. One obstacle to progress is likely a lack of knowledge about suitable alternatives. Here, we suggest language of evidence that allows for a more nuanced approach to communicate scientific findings as a simple and intuitive alternative to statistical significance testing. We provide examples for rewriting results sections in research papers accordingly. Language of evidence has previously been suggested in medical statistics, and it is consistent with reporting approaches of international research networks, like the Intergovernmental Panel on Climate Change, for example. Instead of re-inventing the wheel, ecology and evolution might benefit from adopting some of the 'good practices' that exist in other fields.

149 citations


Journal ArticleDOI
01 Jan 2022-Neuron
TL;DR: The most widely used methods, such as the t-test and ANOVA, do not take data dependence into account and are therefore often misused. The authors introduce linear and generalized mixed-effects models that account for data dependence and provide clear instruction on how to recognize when they are needed and how to apply them.

104 citations


Journal ArticleDOI
24 Jan 2022
TL;DR: The most commonly used tests include the t-test, analysis of variance (ANOVA), non-parametric tests, the chi-square test, and post-hoc analyses.
Abstract: In medical research, when independent variables are categorical (i.e., dividing groups), statistical analysis is often required. This situation mostly occurs in randomized controlled trials and observational studies that have multiple patient groups. Also, when analyzing continuous independent variables in a single patient group, breakpoints can be set to categorize them into several groups. To test statistical differences between groups, a proper statistical method should be selected, mainly based on the type of dependent variable (i.e., result) and the context. The most commonly used tests include the t-test, analysis of variance (ANOVA), non-parametric tests, the chi-square test, and post-hoc analyses. In this article, the author explains these statistical methods and how to select among them. Through this paper, researchers will be able to understand statistical methods and receive help when choosing and performing statistical analysis. The article can also be used as a reference when researchers justify their statistical approaches when publishing research results.
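
As a rough illustration of how these choices map onto code, the sketch below uses hypothetical data (the function choices reflect common practice, not the paper itself): a t-test for two groups, ANOVA or Kruskal-Wallis for three or more groups, and a chi-square test for a categorical outcome.

```python
# Hedged sketch with simulated data: common between-group tests in scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 30)   # group A
b = rng.normal(0.3, 1.0, 30)   # group B
c = rng.normal(0.6, 1.0, 30)   # group C

t, p_t = stats.ttest_ind(a, b)            # two groups, roughly normal outcome
f, p_anova = stats.f_oneway(a, b, c)      # three or more groups
h, p_kw = stats.kruskal(a, b, c)          # non-parametric alternative to ANOVA

# Categorical outcome vs. categorical group: chi-square on a contingency table
table = np.array([[20, 10], [12, 18]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={p_t:.3f}, ANOVA p={p_anova:.3f}, Kruskal p={p_kw:.3f}, chi2 p={p_chi2:.3f}")
```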

82 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used permutation analysis to identify differentially expressed genes between two conditions using human population RNA-seq samples, and found that two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates.
Abstract: When identifying differentially expressed genes between two conditions using human population RNA-seq samples, we found a phenomenon by permutation analysis: two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates. Expanding the analysis to limma-voom, NOISeq, dearseq, and Wilcoxon rank-sum test, we found that FDR control is often failed except for the Wilcoxon rank-sum test. Particularly, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for population-level RNA-seq studies with large sample sizes, we recommend the Wilcoxon rank-sum test.
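
A minimal sketch of the recommended workflow, assuming a genes-by-samples matrix of normalized expression values and a two-group design (the data below are simulated stand-ins): a per-gene Wilcoxon rank-sum test followed by Benjamini-Hochberg FDR control.

```python
# Per-gene Wilcoxon rank-sum test with Benjamini-Hochberg FDR control.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
counts = rng.poisson(10, size=(500, 60)).astype(float)   # 500 genes x 60 samples (simulated)
condition = np.array([True] * 30 + [False] * 30)          # two groups of 30 samples

pvals = np.array([
    mannwhitneyu(g[condition], g[~condition], alternative="two-sided").pvalue
    for g in counts
])
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} genes called differentially expressed at FDR 0.05")
```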

50 citations


Journal ArticleDOI
TL;DR: This paper derives upper bounds for the $L^2$ minimax risk in nonparametric estimation, obtains asymptotic distributions for the constructed network, and develops a related hypothesis testing procedure that is proven minimax optimal under suitable network architectures.

19 citations


Journal ArticleDOI
Long Cheng1, Mingkun Xue1, Yan Wang1, Yong Wang1, Yangyang Bi 
TL;DR: A modified generalized probability data association algorithm based on time of arrival is proposed that can mitigate the influence of NLOS errors and achieve higher localization accuracy than existing methods.
Abstract: Wireless sensor networks (WSNs) are composed of many micro sensor nodes, and localization is one of the most important applications of WSN technology. At present, many positioning algorithms have high positioning accuracy in line-of-sight environments but poor accuracy in non-line-of-sight (NLOS) environments. In this article, we propose a modified generalized probability data association algorithm based on time of arrival. We divide the range measurements into N different groups, and each group obtains the corresponding position estimate, model probabilities, and covariance matrix of the mobile node through an IMM-EKF. We use the model probabilities and a hypothesis test to perform NLOS identification for the N groups: the model probability provided by each group is used for the first NLOS identification, and the innovation and innovation covariance matrix are used for the second NLOS identification in the hypothesis test. Position estimates contaminated by NLOS error are discarded, and the remaining position estimates are weighted by their corresponding association probabilities. The simulation and experimental results show that the proposed algorithm can mitigate the influence of NLOS errors and achieve higher localization accuracy than existing methods.
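
The second-stage check can be illustrated with a standard innovation gate: under line-of-sight conditions the normalized innovation squared is approximately chi-square distributed. The sketch below is a generic illustration of that idea (the variable names and Kalman-filter context are assumptions), not the authors' implementation.

```python
# Generic innovation-based NLOS check: flag a measurement group when its
# normalized innovation squared exceeds a chi-square threshold.
import numpy as np
from scipy.stats import chi2

def is_nlos(innovation: np.ndarray, S: np.ndarray, alpha: float = 0.05) -> bool:
    """innovation: filter innovation vector; S: innovation covariance matrix."""
    d2 = float(innovation @ np.linalg.inv(S) @ innovation)   # Mahalanobis distance squared
    return d2 > chi2.ppf(1.0 - alpha, df=innovation.size)

# Example with a hypothetical 2-D innovation and diagonal covariance
print(is_nlos(np.array([0.9, -0.7]), np.diag([0.25, 0.25])))
```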

19 citations


Journal ArticleDOI
17 Feb 2022-PeerJ
TL;DR: PCAtest is an R package that implements permutation-based statistical tests to evaluate the overall significance of a PCA, the significance of each PC axis, and of contributions of each observed variable to the significant axes.
Abstract: Principal Component Analysis (PCA) is one of the most broadly used statistical methods for the ordination and dimensionality-reduction of multivariate datasets across many scientific disciplines. Trivial PCs can be estimated from data sets without any correlational structure among the original variables, and traditional criteria for selecting non-trivial PC axes are difficult to implement, partially subjective or based on ad hoc thresholds. PCAtest is an R package that implements permutation-based statistical tests to evaluate the overall significance of a PCA, the significance of each PC axis, and of contributions of each observed variable to the significant axes. Based on simulation and empirical results, I encourage R users to routinely apply PCAtest to test the significance of their PCA before proceeding with the direct interpretation of PC axes and/or the utilization of PC scores in subsequent evolutionary and ecological analyses.
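
A minimal Python analogue of the permutation idea (this is not the PCAtest R package itself; the data are simulated): shuffle each variable independently to destroy correlations, then compare the observed leading eigenvalue with its permutation null distribution.

```python
# Permutation test for overall PCA significance (Python analogue, simulated data).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
X[:, 1] += 0.8 * X[:, 0]          # induce some correlational structure

def leading_eigenvalue(data: np.ndarray) -> float:
    Z = (data - data.mean(0)) / data.std(0, ddof=1)
    return float(np.linalg.eigvalsh(np.cov(Z, rowvar=False)).max())

observed = leading_eigenvalue(X)
null = np.array([
    leading_eigenvalue(np.column_stack([rng.permutation(col) for col in X.T]))
    for _ in range(999)
])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"observed PC1 eigenvalue = {observed:.2f}, permutation p = {p_value:.3f}")
```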

18 citations


Journal ArticleDOI
TL;DR: In this paper , the Wilcoxon rank sum test (WRST) has been used for operation state monitoring and automated fault detection of wind turbines, which can monitor one or several key process parameters of the wind turbine simultaneously.

17 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose a new SAT sampler, present a new statistical test to verify sampler uniformity, and report an evaluation of the proposed sampler and five other state-of-the-art SAT samplers.
Abstract: Many analyses on configurable software systems are intractable when confronted with colossal and highly-constrained configuration spaces. These analyses could instead use statistical inference, where a tractable sample accurately predicts results for the entire space. To do so, the laws of statistical inference require each member of the population to be equally likely to be included in the sample, i.e., the sampling process needs to be "uniform". SAT samplers have been developed to generate uniform random samples at a reasonable computational cost. However, there is a lack of experimental validation over colossal spaces to show whether the samplers indeed produce uniform samples or not. This paper (i) proposes a new sampler, (ii) presents a new statistical test to verify sampler uniformity, and (iii) reports the evaluation of the proposed sampler and five other state-of-the-art samplers. The experimental results show that only the proposed sampler satisfies both scalability and uniformity.
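
As a generic illustration of what a uniformity check looks like (a plain chi-square goodness-of-fit test, not the specific statistical test proposed in the paper; the counts are hypothetical), one can compare observed draw counts per valid configuration against equal expected counts on a small, fully enumerable space.

```python
# Chi-square goodness-of-fit check of sampler uniformity over a toy space of
# five valid configurations; the counts are hypothetical.
from scipy.stats import chisquare

observed = [210, 190, 205, 195, 200]   # draws per configuration
stat, p = chisquare(observed)          # null: all configurations equally likely
print(f"chi-square = {stat:.2f}, p = {p:.3f}  (small p suggests non-uniform sampling)")
```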

14 citations


Journal ArticleDOI
TL;DR: It is concluded that studies in ecology and evolutionary biology are mostly exploratory and descriptive, and should shift from claiming to ‘test’ specific hypotheses statistically to describing and discussing many hypotheses (possible true effect sizes) that are most compatible with the authors' data, given their statistical model.
Abstract: A paradigm shift away from null hypothesis significance testing seems in progress. Based on simulations, we illustrate some of the underlying motivations. First, p‐values vary strongly from study to study, hence dichotomous inference using significance thresholds is usually unjustified. Second, ‘statistically significant’ results have overestimated effect sizes, a bias declining with increasing statistical power. Third, ‘statistically non‐significant’ results have underestimated effect sizes, and this bias gets stronger with higher statistical power. Fourth, the tested statistical hypotheses usually lack biological justification and are often uninformative. Despite these problems, a screen of 48 papers from the 2020 volume of the Journal of Evolutionary Biology exemplifies that significance testing is still used almost universally in evolutionary biology. All screened studies tested default null hypotheses of zero effect with the default significance threshold of p = 0.05; none presented a pre‐specified alternative hypothesis, a pre‐study power calculation, or the probability of ‘false negatives’ (beta error rate). The results sections of the papers presented 49 significance tests on average (median 23, range 0–390). Of 41 studies that contained verbal descriptions of a ‘statistically non‐significant’ result, 26 (63%) falsely claimed the absence of an effect. We conclude that studies in ecology and evolutionary biology are mostly exploratory and descriptive. We should thus shift from claiming to ‘test’ specific hypotheses statistically to describing and discussing many hypotheses (possible true effect sizes) that are most compatible with our data, given our statistical model. We already have the means for doing so, because we routinely present compatibility (‘confidence’) intervals covering these hypotheses.
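
The first point is easy to reproduce in a small simulation (the effect size and sample size below are illustrative, not taken from the paper): identical replicate studies of modest power yield wildly different p-values.

```python
# Replicate the same modest-power study many times and record the p-value each time.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
pvals = []
for _ in range(1000):
    control = rng.normal(0.0, 1.0, 20)
    treated = rng.normal(0.5, 1.0, 20)   # true effect d = 0.5, roughly one-third power
    pvals.append(ttest_ind(treated, control).pvalue)

pvals = np.array(pvals)
print(f"share of replicates with p < 0.05: {np.mean(pvals < 0.05):.2f}")
print(f"p-values range from {pvals.min():.4f} to {pvals.max():.2f}")
```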

Book ChapterDOI
01 Jan 2022
TL;DR: These notes survey the low-degree method, which originated in the study of the sum-of-squares (SoS) hierarchy of convex programs and posits that the second moment of the low-degree likelihood ratio gives insight into how much computational time is required to solve a given hypothesis testing problem, which can in turn be used to predict the computational hardness of a variety of statistical inference tasks.
Abstract: These notes survey and explore an emerging method, which we call the low-degree method, for understanding statistical-versus-computational tradeoffs in high-dimensional inference problems. In short, the method posits that a certain quantity—the second moment of the low-degree likelihood ratio—gives insight into how much computational time is required to solve a given hypothesis testing problem, which can in turn be used to predict the computational hardness of a variety of statistical inference tasks. While this method originated in the study of the sum-of-squares (SoS) hierarchy of convex programs, we present a self-contained introduction that does not require knowledge of SoS. In addition to showing how to carry out predictions using the method, we include a discussion investigating both rigorous and conjectural consequences of these predictions. These notes include some new results, simplified proofs, and refined conjectures. For instance, we point out a formal connection between spectral methods and the low-degree likelihood ratio, and we give a sharp low-degree lower bound against subexponential-time algorithms for tensor PCA.
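
The central quantity is the norm of the low-degree projection of the likelihood ratio, which has a standard variational form (notation may differ slightly from the notes):

```latex
% Degree-D advantage: the best correlation with the planted distribution P
% achievable by a polynomial of degree at most D, relative to the null Q.
\[
  \bigl\lVert L^{\le D} \bigr\rVert
  \;=\;
  \max_{f :\, \deg f \le D}
  \frac{\mathbb{E}_{X \sim \mathbb{P}}\,[f(X)]}{\sqrt{\mathbb{E}_{X \sim \mathbb{Q}}\,[f(X)^{2}]}} .
\]
% Heuristic: if this quantity remains bounded as the problem size grows, the
% testing problem is predicted to be hard for algorithms of roughly n^{O(D)}
% runtime; if it diverges, the problem is predicted to be easy at that runtime.
```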

Journal ArticleDOI
TL;DR: The fourth edition of Testing Statistical Hypotheses covers finite-sample theory and large-sample theory across two volumes.
Abstract: Testing Statistical Hypotheses, 4th Edition, covers finite-sample theory and large-sample theory across two volumes.

Journal ArticleDOI
TL;DR: Zhang et al. propose Active Drift Detection based on Meta-learning (Meta-ADD), a novel framework that learns to classify concept drift by offline pre-training a model on data streams with known drifts and then online fine-tuning the model to improve detection accuracy.

Journal ArticleDOI
TL;DR: This commentary presents the history behind NHST along with the limitations that modern‐day NHST presents, and suggests that a statistics reform regarding NHST be considered.
Abstract: Traditional null hypothesis significance testing (NHST) incorporating the critical level of significance of 0.05 has become the cornerstone of decision‐making in health care, and nowhere less so than in obstetric and gynecological research. However, such practice is controversial. In particular, it was never intended for clinical significance to be inferred from statistical significance. The inference of clinical importance based on statistical significance (p < 0.05), and lack of clinical significance otherwise (p ≥ 0.05) represents misunderstanding of the original purpose of NHST. Furthermore, the limitations of NHST—sensitivity to sample size, plus type I and II errors—are frequently ignored. Therefore, decision‐making based on NHST has the potential for recurrent false claims about the effectiveness of interventions or importance of exposure to risk factors, or dismissal of important ones. This commentary presents the history behind NHST along with the limitations that modern‐day NHST presents, and suggests that a statistics reform regarding NHST be considered.

Journal ArticleDOI
12 May 2022-Synthese
TL;DR: The authors argue that the main criticisms of statistical significance testing arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science, and argue that banning the use of p-value thresholds in interpreting data does not diminish but rather exacerbates data-dredging and biasing selection effects.
Abstract: While the common procedure of statistical significance testing and its accompanying concept of p-values have long been surrounded by controversy, renewed concern has been triggered by the replication crisis in science. Many blame statistical significance tests themselves, and some regard them as sufficiently damaging to scientific practice as to warrant being abandoned. We take a contrary position, arguing that the central criticisms arise from misunderstanding and misusing the statistical tools, and that in fact the purported remedies themselves risk damaging science. We argue that banning the use of p-value thresholds in interpreting data does not diminish but rather exacerbates data-dredging and biasing selection effects. If an account cannot specify outcomes that will not be allowed to count as evidence for a claim—if all thresholds are abandoned—then there is no test of that claim. The contributions of this paper are: To explain the rival statistical philosophies underlying the ongoing controversy; To elucidate and reinterpret statistical significance tests, and explain how this reinterpretation ameliorates common misuses and misinterpretations; To argue why recent recommendations to replace, abandon, or retire statistical significance undermine a central function of statistics in science: to test whether observed patterns in the data are genuine or due to background variability.

Journal ArticleDOI
R.U.Memetov1
TL;DR: The author shows that excluding random slopes can lead to a substantial increase in false-positive conclusions in null-hypothesis tests, and that the same is true for Bayesian hypothesis testing with mixed models, which then often yield Bayes factors reflecting very strong evidence for a mean effect on the population level even when no such effect exists.
Abstract: Mixed models are gaining popularity in psychology. For frequentist mixed models, previous research showed that excluding random slopes (differences between individuals in the direction and size of an effect) from a model when they are in the data can lead to a substantial increase in false-positive conclusions in null-hypothesis tests. Here, I demonstrated through five simulations that the same is true for Bayesian hypothesis testing with mixed models, which often yield Bayes factors reflecting very strong evidence for a mean effect on the population level even if there was no such effect. Including random slopes in the model largely eliminates the risk of strong false positives but reduces the chance of obtaining strong evidence for true effects. I recommend starting analysis by testing the support for random slopes in the data and removing them from the models only if there is clear evidence against them.
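
A hedged frequentist sketch of the structural difference at issue (using statsmodels rather than the Bayesian models from the paper; data and variable names are simulated): a random-intercept-only fit versus a fit that also includes random slopes for the predictor.

```python
# Random-intercept-only vs. random-slope mixed models on simulated data in
# which individuals have idiosyncratic slopes but the population mean effect is zero.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_subj, n_trials = 30, 40
subj = np.repeat(np.arange(n_subj), n_trials)
x = rng.normal(size=n_subj * n_trials)
slopes = rng.normal(0.0, 0.5, size=n_subj)          # individual slopes, zero mean effect
y = slopes[subj] * x + rng.normal(size=n_subj * n_trials)
df = pd.DataFrame({"y": y, "x": x, "subj": subj})

m0 = smf.mixedlm("y ~ x", df, groups=df["subj"]).fit()                  # intercepts only
m1 = smf.mixedlm("y ~ x", df, groups=df["subj"], re_formula="~x").fit() # + random slopes
print(f"p for x, intercepts only: {m0.pvalues['x']:.3f}; with random slopes: {m1.pvalues['x']:.3f}")
```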

Journal ArticleDOI
TL;DR: In this paper, the authors investigated how the statistical distribution of the residual is affected when the reference null space is estimated, with a refined covariance term that also accounts for the uncertainty related to the estimate.

Journal ArticleDOI
TL;DR: This paper argues that exploratory hypothesis tests avoid researcher commitment and researcher prophecy biases, reduce the probability of data fraud, facilitate inference to the best explanation, and allow peer reviewers to make additional contributions at the data analysis stage.
Abstract: Preregistration has been proposed as a useful method for making a publicly verifiable distinction between confirmatory hypothesis tests, which involve planned tests of ante hoc hypotheses, and exploratory hypothesis tests, which involve unplanned tests of post hoc hypotheses. This distinction is thought to be important because it has been proposed that confirmatory hypothesis tests provide more compelling results (less uncertain, less tentative, less open to bias) than exploratory hypothesis tests. In this article, we challenge this proposition and argue that there are several advantages of exploratory hypothesis tests that can make their results more compelling than those of confirmatory hypothesis tests. We also consider some potential disadvantages of exploratory hypothesis tests and conclude that their advantages can outweigh the disadvantages. We conclude that exploratory hypothesis tests avoid researcher commitment and researcher prophecy biases, reduce the probability of data fraud, are more appropriate in the context of unplanned deviations, facilitate inference to the best explanation, and allow peer reviewers to make additional contributions at the data analysis stage. In contrast, confirmatory hypothesis tests may lead to an inappropriate level of confidence in research conclusions, less appropriate analyses in the context of unplanned deviations, and greater bias and errors in theoretical inferences.

Journal ArticleDOI
TL;DR: In this article , the authors address the issue of sampling uncertainty when researchers make a claim about effect magnitude: informal assessment of the range of magnitudes represented by the confidence interval; testing of hypotheses of substantial (meaningful) and non-substantial magnitudes; assessment of probabilities of substantial and trivial (inconsequential) magnitudes with Bayesian methods based on non-informative or informative priors; and testing of the nil or zero hypothesis.
Abstract: A sample provides only an approximate estimate of the magnitude of an effect, owing to sampling uncertainty. The following methods address the issue of sampling uncertainty when researchers make a claim about effect magnitude: informal assessment of the range of magnitudes represented by the confidence interval; testing of hypotheses of substantial (meaningful) and non-substantial magnitudes; assessment of the probabilities of substantial and trivial (inconsequential) magnitudes with Bayesian methods based on non-informative or informative priors; and testing of the nil or zero hypothesis. Assessment of the confidence interval, testing of substantial and non-substantial hypotheses, and assessment of Bayesian probabilities with a non-informative prior are subject to differing interpretations but are all effectively equivalent and can reasonably define and provide necessary and sufficient evidence for substantial and trivial effects. Informative priors in Bayesian assessments are problematic, because they are hard to quantify and can bias the outcome. Rejection of the nil hypothesis (presented as statistical significance), and failure to reject the nil hypothesis (presented as statistical non-significance), provide neither necessary nor sufficient evidence for substantial and trivial effects. To properly account for sampling uncertainty in effect magnitudes, researchers should therefore replace rather than supplement the nil-hypothesis test with one or more of the other three equivalent methods. Surprisal values, second-generation p values, and the hypothesis comparisons of evidential statistics are three other recent approaches to sampling uncertainty that are not recommended. Important issues beyond sampling uncertainty include representativeness of sampling, accuracy of the statistical model, individual differences, individual responses, and rewards of benefit and costs of harm of clinically or practically important interventions and side effects.

Journal ArticleDOI
TL;DR: This perspective discusses the applicability and utility of integrating equivalence testing when conducting sex comparisons in cardiovascular research, and recommends that cardiovascular researchers consider implementing this statistical tool to better understand similar and different cardiovascular processes between the sexes.
Abstract: The number of research studies investigating whether similar or different cardiovascular responses or adaptations exist between males and females are increasing. Traditionally, difference-based statistical methods (e.g., t-test, ANOVA, etc.) have been implemented to compare cardiovascular function between males and females, with a P-value >0.05 used to denote similarity between sexes. However, an absence of evidence (i.e., large P-value) is not evidence of absence (i.e., no sex differences). Equivalence testing determines whether two measures or groups provide statistically equivalent outcomes, in that they differ by less than an 'ideally prespecified' smallest effect size of interest. Our perspective discusses the applicability and utility of integrating equivalence testing when conducting sex comparisons in cardiovascular research. An emphasis is placed on how cardiovascular researchers may conduct equivalence testing across multiple study designs (e.g., cross-sectional comparisons, repeated measures intervention, etc.). The strengths and weaknesses of this statistical tool are discussed. Equivalence analyses are relatively simple to conduct, may be used in conjunction with traditional hypothesis testing to interpret findings, and permits the determination of statistically equivalent responses between sexes. We recommend that cardiovascular researchers consider implementing equivalence testing to better our understanding of similar and different cardiovascular processes between sexes.
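
A hedged sketch of such an equivalence analysis using the two one-sided tests (TOST) procedure in statsmodels; the group data and the ±5-unit equivalence bounds below are purely illustrative assumptions, not values from the paper.

```python
# TOST equivalence test for a two-group (e.g., male/female) comparison.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(5)
males = rng.normal(100.0, 10.0, size=40)
females = rng.normal(101.0, 10.0, size=40)

# Equivalence bounds: the prespecified smallest effect size of interest (+/- 5 units)
p_equiv, lower_test, upper_test = ttost_ind(males, females, low=-5.0, upp=5.0)
print(f"TOST p = {p_equiv:.3f}  (p < 0.05 suggests statistical equivalence within +/-5 units)")
```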

Journal ArticleDOI
TL;DR: In this article, the authors provide a brief overview of geometric and statistical aspects of angles in multidimensional spaces, showing that the angle between two independent vectors is concentrated around the right angle, with a more pronounced peak in higher-dimensional spaces.
Abstract: Parallelism between evolutionary trajectories in a trait space is often seen as evidence for repeatability of phenotypic evolution, and angles between trajectories play a pivotal role in the analysis of parallelism. However, properties of angles in multidimensional spaces have not been widely appreciated by biologists. To remedy this situation, this study provides a brief overview on geometric and statistical aspects of angles in multidimensional spaces. Under the null hypothesis that trajectory vectors have no preferred directions (i.e. uniform distribution on hypersphere), the angle between two independent vectors is concentrated around the right angle, with a more pronounced peak in a higher-dimensional space. This probability distribution is closely related to t- and beta distributions, which can be used for testing the null hypothesis concerning a pair of trajectories. A recently proposed method with eigenanalysis of a vector correlation matrix can be connected to the test of no correlation or concentration of multiple vectors, for which simple test procedures are available in the statistical literature. Concentration of vectors can also be examined by tools of directional statistics such as the Rayleigh test. These frameworks provide biologists with baselines to make statistically justified inferences for (non)parallel evolution.
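
The concentration around the right angle is easy to verify numerically (a quick simulation with arbitrary dimensions, not code from the paper):

```python
# Angles between independent random directions concentrate near 90 degrees
# as the dimension grows.
import numpy as np

rng = np.random.default_rng(6)

def random_angles(dim: int, n_pairs: int = 10_000) -> np.ndarray:
    u = rng.normal(size=(n_pairs, dim))
    v = rng.normal(size=(n_pairs, dim))
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for dim in (2, 10, 100):
    angles = random_angles(dim)
    print(f"dim={dim:>3}: mean angle={angles.mean():5.1f} deg, sd={angles.std():4.1f} deg")
```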

Journal ArticleDOI
TL;DR: The authors argue that Tendeiro and Kiers are overly pessimistic and that several of their "issues" with NHBT may in fact be conceived as pronounced advantages, illustrating their arguments with simple, concrete examples.
Abstract: Tendeiro and Kiers (2019) provide a detailed and scholarly critique of Null Hypothesis Bayesian Testing (NHBT) and its central component-the Bayes factor-that allows researchers to update knowledge and quantify statistical evidence. Tendeiro and Kiers conclude that NHBT constitutes an improvement over frequentist p-values, but primarily elaborate on a list of 11 "issues" of NHBT. We believe that several issues identified by Tendeiro and Kiers are of central importance for elucidating the complementary roles of hypothesis testing versus parameter estimation and for appreciating the virtue of statistical thinking over conducting statistical rituals. But although we agree with many of their thoughtful recommendations, we believe that Tendeiro and Kiers are overly pessimistic, and that several of their "issues" with NHBT may in fact be conceived as pronounced advantages. We illustrate our arguments with simple, concrete examples and end with a critical discussion of one of the recommendations by Tendeiro and Kiers, which is that "estimation of the full posterior distribution offers a more complete picture" than a Bayes factor hypothesis test. (PsycInfo Database Record (c) 2022 APA, all rights reserved).

Journal ArticleDOI
TL;DR: In this paper, a joint test of two-sample mean vectors and covariance matrices for high-dimensional data is proposed, in which the high-power regions of high-dimensional mean tests and covariance tests are expanded to a wider alternative space and then combined in the simultaneous test.
Abstract: Power-enhanced tests with high-dimensional data have received growing attention in theoretical and applied statistics in recent years. Existing tests possess their respective high-power regions, and we may lack prior knowledge about the alternatives when testing for a problem of interest in practice. There is a critical need of developing powerful testing procedures against more general alternatives. This article studies the joint test of two-sample mean vectors and covariance matrices for high-dimensional data. We first expand the high-power regions of high-dimensional mean tests or covariance tests to a wider alternative space and then combine their strengths together in the simultaneous test. We develop a new power-enhanced simultaneous test that is powerful to detect differences in either mean vectors or covariance matrices under either sparse or dense alternatives. We prove that the proposed testing procedures align with the power enhancement principles introduced by Fan, Liao, and Yao and achieve the accurate asymptotic size and consistent asymptotic power. We demonstrate the finite-sample performance using simulation studies and a real application to find differentially expressed gene-sets in cancer studies. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: In this paper, a 95% confidence interval is defined as the subset of possible effect sizes that have p-values larger than 0.05, as calculated from the same data and the same background statistical assumptions.
Abstract: It has long been argued that we need to consider much more than an observed point estimate and a p-value to understand statistical results. One of the most persistent misconceptions about p-values is that they are necessarily calculated assuming a null hypothesis of no effect is true. Instead, p-values can and should be calculated for multiple hypothesized values for the effect size. For example, a p-value function allows us to visualize results continuously by examining how the p-value varies as we move across possible effect sizes. For more focused discussions, a 95% confidence interval shows the subset of possible effect sizes that have p-values larger than 0.05 as calculated from the same data and the same background statistical assumptions. In this sense a confidence interval can be taken as showing the effect sizes that are most compatible with the data, given the assumptions, and thus may be better termed a compatibility interval. The question that should then be asked is whether any or all of the effect sizes within the interval are substantial enough to be of practical importance.
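
A small numerical sketch of a p-value function and the corresponding 95% compatibility interval, using an illustrative estimate and standard error under a normal approximation (values are assumptions, not from the paper):

```python
# p-value function: compute the p-value for a grid of hypothesized effect sizes,
# not only for the null of zero effect; the 95% compatibility interval is the
# set of effect sizes with p > 0.05.
import numpy as np
from scipy.stats import norm

estimate, se = 3.2, 1.5              # observed effect and standard error (assumed)
deltas = np.linspace(-2, 8, 201)     # hypothesized effect sizes
pvalues = 2 * norm.sf(np.abs(estimate - deltas) / se)

compatible = deltas[pvalues > 0.05]
print(f"p-value at delta = 0: {2 * norm.sf(abs(estimate) / se):.3f}")
print(f"95% compatibility interval: {compatible.min():.2f} to {compatible.max():.2f}")
```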

Journal ArticleDOI
TL;DR: In this article, the authors define always-valid p-values and confidence intervals that let users try to take advantage of data as fast as it becomes available, providing valid statistical inference whenever they make their decision.
Abstract: A/B tests are typically analyzed via frequentist p-values and confidence intervals, but these inferences are wholly unreliable if users endogenously choose sample sizes by continuously monitoring their tests. We define always valid p-values and confidence intervals that let users try to take advantage of data as fast as it becomes available, providing valid statistical inference whenever they make their decision. Always valid inference can be interpreted as a natural interface for a sequential hypothesis test, which empowers users to implement a modified test tailored to them. In particular, we show in an appropriate sense that the measures we develop trade off sample size and power efficiently, despite a lack of prior knowledge of the user’s relative preference between these two goals. We also use always valid p-values to obtain multiple hypothesis testing control in the sequential context. Our methodology has been implemented in a large-scale commercial A/B testing platform to analyze hundreds of thousands of experiments to date.
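
The general construction can be sketched for a stream of approximately normal observations with known variance, using a normal mixture likelihood ratio (mSPRT-style); the parameter values below are illustrative assumptions, and this is a sketch of the approach rather than the platform's implementation.

```python
# Always-valid p-values from a mixture sequential probability ratio test,
# testing H0: theta = 0 against a N(0, tau2) mixture of alternatives.
import numpy as np

def always_valid_pvalues(y, sigma2=1.0, tau2=1.0, theta0=0.0):
    y = np.asarray(y, dtype=float)
    n = np.arange(1, len(y) + 1)
    ybar = np.cumsum(y) / n
    # Mixture likelihood ratio against H0 with a normal mixing distribution
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n**2 * tau2 * (ybar - theta0) ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    # Always-valid p-value: running minimum of 1 / likelihood ratio, capped at 1
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

rng = np.random.default_rng(7)
stream = rng.normal(0.3, 1.0, size=500)      # simulated metric stream, true effect 0.3
p = always_valid_pvalues(stream)
print("first index with p < 0.05:", int(np.argmax(p < 0.05)) if (p < 0.05).any() else None)
```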

Journal ArticleDOI
TL;DR: The authors propose a sequential, anytime-valid method to test the conditional independence of a response Y and a predictor X given a random vector Z, based on e-statistics and test martingales, which generalize likelihood ratios and allow valid inference at arbitrary stopping times.
Abstract: We propose a sequential, anytime-valid method to test the conditional independence of a response $Y$ and a predictor $X$ given a random vector $Z$. The proposed test is based on e-statistics and test martingales, which generalize likelihood ratios and allow valid inference at arbitrary stopping times. In accordance with the recently introduced model-X setting, our test depends on the availability of the conditional distribution of $X$ given $Z$, or at least a sufficiently sharp approximation thereof. Within this setting, we derive a general method for constructing e-statistics for testing conditional independence, show that it leads to growth-rate optimal e-statistics for simple alternatives, and prove that our method yields tests with asymptotic power one in the special case of a logistic regression model. A simulation study is done to demonstrate that the approach is competitive in terms of power when compared to established sequential and nonsequential testing methods, and robust with respect to violations of the model-X assumption.
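
The underlying test-martingale mechanics can be illustrated generically (the e-values below are hypothetical numbers, not the paper's conditional-independence e-statistics): multiply per-round e-values and reject whenever the running product exceeds 1/alpha, which keeps the type-I error below alpha at any stopping time by Ville's inequality.

```python
# Generic test-martingale sketch: cumulative product of e-values with a 1/alpha threshold.
import numpy as np

def wealth_process(e_values):
    """Cumulative product of nonnegative e-values (each with expectation <= 1 under H0)."""
    return np.cumprod(np.asarray(e_values, dtype=float))

alpha = 0.05
e_values = [1.2, 0.9, 1.5, 1.3, 1.1, 1.8, 1.4]   # hypothetical per-round e-values
wealth = wealth_process(e_values)
rejected_at = int(np.argmax(wealth >= 1 / alpha)) if (wealth >= 1 / alpha).any() else None
print(wealth, "reject at round:", rejected_at)
```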

Journal ArticleDOI
TL;DR: In this paper, the authors argue that Bayesian data analysis provides suitable tools to assess practical significance rigorously, demonstrate their claims in a case study comparing different test techniques, and apply cumulative prospect theory on top of the statistical model to quantitatively connect the statistical analysis output to a practically meaningful context.
Abstract: A key goal of empirical research in software engineering is to assess practical significance, which answers the question whether the observed effects of some compared treatments show a relevant difference in practice in realistic scenarios. Even though plenty of standard techniques exist to assess statistical significance, connecting it to practical significance is not straightforward or routinely done; indeed, only a few empirical studies in software engineering assess practical significance in a principled and systematic way. In this paper, we argue that Bayesian data analysis provides suitable tools to assess practical significance rigorously. We demonstrate our claims in a case study comparing different test techniques. The case study's data was previously analyzed (Afzal et al., 2015) using standard techniques focusing on statistical significance. Here, we build a multilevel model of the same data, which we fit and validate using Bayesian techniques. Our method is to apply cumulative prospect theory on top of the statistical model to quantitatively connect our statistical analysis output to a practically meaningful context. This is then the basis both for assessing and arguing for practical significance. Our study demonstrates that Bayesian analysis provides a technically rigorous yet practical framework for empirical software engineering. A substantial side effect is that any uncertainty in the underlying data will be propagated through the statistical model, and its effects on practical significance are made clear. Thus, in combination with cumulative prospect theory, Bayesian analysis supports seamlessly assessing practical significance in an empirical software engineering context, thus potentially clarifying and extending the relevance of research for practitioners.

Journal ArticleDOI
TL;DR: The use of more than one animal is widely believed to allow an inference on the population, but the authors explain that a useful inference on the population would require larger numbers and a different statistical approach, which for many questions is ethically and/or economically not justifiable.
Abstract: The field of in vivo neurophysiology currently uses statistical standards that are based on tradition rather than formal analysis. Typically, data from two (or few) animals are pooled for one statistical test, or a significant test in a first animal is replicated in one (or few) further animals. The use of more than one animal is widely believed to allow an inference on the population. Here, we explain that a useful inference on the population would require larger numbers and a different statistical approach. The field should consider performing studies at that standard, potentially through coordinated multicenter efforts, for selected questions of exceptional importance. Yet, for many questions, this is ethically and/or economically not justifiable. We explain why in those studies with two (or few) animals, any useful inference is limited to the sample of investigated animals, irrespective of whether it is based on few animals, two animals, or a single animal.