Showing papers on "Statistical hypothesis testing published in 2019"

PDF

Open Access

Journal Article•DOI•

Raincloud plots: a multi-platform tool for robust data visualization

[...]

Micah Allen¹, Micah Allen², Micah Allen³, Davide Poggiali⁴, Kirstie Whitaker¹, Tom R. Marshall⁵, Rogier A. Kievit⁶ - Show less +3 more•Institutions (6)

University of Cambridge¹, Aarhus University², Aarhus University Hospital³, University of Padua⁴, University of Oxford⁵, Cognition and Brain Sciences Unit⁶

01 Apr 2019

TL;DR: This tutorial paper provides basic demonstrations of the strength of raincloud plots and similar approaches, outline potential modifications for their optimal use, and provides open-source code for their streamlined implementation in R, Python and Matlab.

...read moreread less

Abstract: Across scientific disciplines, there is a rapidly growing recognition of the need for more statistically robust, transparent approaches to data visualization. Complementary to this, many scientists have called for plotting tools that accurately and transparently convey key aspects of statistical effects and raw data with minimal distortion. Previously common approaches, such as plotting conditional mean or median barplots together with error-bars have been criticized for distorting effect size, hiding underlying patterns in the raw data, and obscuring the assumptions upon which the most commonly used statistical tests are based. Here we describe a data visualization approach which overcomes these issues, providing maximal statistical information while preserving the desired 'inference at a glance' nature of barplots and other similar visualization devices. These "raincloud plots" can visualize raw data, probability density, and key summary statistics such as median, mean, and relevant confidence intervals in an appealing and flexible format with minimal redundancy. In this tutorial paper, we provide basic demonstrations of the strength of raincloud plots and similar approaches, outline potential modifications for their optimal use, and provide open-source code for their streamlined implementation in R, Python and Matlab ( https://github.com/RainCloudPlots/RainCloudPlots). Readers can investigate the R and Python tutorials interactively in the browser using Binder by Project Jupyter.

...read moreread less

796 citations

Journal Article•DOI•

Abandon Statistical Significance

[...]

Blakeley B. McShane¹, David Gal², Andrew Gelman³, Christian P. Robert⁴, Jennifer L. Tackett¹ - Show less +1 more•Institutions (4)

Northwestern University¹, University of Illinois at Chicago², Columbia University³, Paris Dauphine University⁴

20 Mar 2019-The American Statistician

TL;DR: This work recommends dropping the NHST paradigm—and the p-value thresholds intrinsic to it—as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences and argues that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

...read moreread less

Abstract: We discuss problems the null hypothesis significance testing (NHST) paradigm poses for replication and more broadly in the biomedical and social sciences as well as how these problems remain unreso

...read moreread less

565 citations

Journal Article•DOI•

Why You Should Always Include a Random Slope for the Lower-Level Variable Involved in a Cross-Level Interaction

[...]

Jan Paul Heisig, Merlin Schaeffer¹•Institutions (1)

University of Copenhagen¹

01 Apr 2019-European Sociological Review

TL;DR: In this paper, the authors argue that multilevel models involving cross-level interactions should always include random slopes on the lower-level components of those interactions Failure to do so will usually result in severely anti-conservative statistical inference.

...read moreread less

Abstract: Mixed-effects multilevel models are often used to investigate cross-level interactions, a specific type of context effect that may be understood as an upper-level variable moderating the association between a lower-level predictor and the outcome We argue that multilevel models involving cross-level interactions should always include random slopes on the lower-level components of those interactions Failure to do so will usually result in severely anti-conservative statistical inference We illustrate the problem with extensive Monte Carlo simulations and examine its practical relevance by studying 30 prototypical cross-level interactions with European Social Survey data for 28 countries In these empirical applications, introducing a random slope term reduces the absolute t-ratio of the cross-level interaction term by 31 per cent or more in three quarters of cases, with an average reduction of 42 per cent Many practitioners seem to be unaware of these issues Roughly half of the cross-level interaction estimates published in the European Sociological Review between 2011 and 2016 are based on models that omit the crucial random slope term Detailed analysis of the associated test statistics suggests that many of the estimates would not reach conventional thresholds for statistical significance in correctly specified models that include the random slope This raises the question how much robust evidence of cross-level interactions sociology has actually produced over the past decades

...read moreread less

247 citations

Journal Article•DOI•

Inferential statistics as descriptive statistics: there is no replication crisis if we don't expect replication

[...]

Valentin Amrhein¹, David Trafimow², Sander Greenland³•Institutions (3)

University of Basel¹, New Mexico State University², University of California, Los Angeles³

20 Mar 2019-The American Statistician

TL;DR: This paper found that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to a failure to replicate, and that statistical inference often fails to replicate.

...read moreread less

Abstract: Statistical inference often fails to replicate. One reason is that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to ...

...read moreread less

232 citations

Journal Article•DOI•

Learning New Physics from a Machine

[...]

R. T. D’Agnolo¹, A. Wulzer², A. Wulzer³, A. Wulzer⁴•Institutions (4)

SLAC National Accelerator Laboratory¹, École Polytechnique Fédérale de Lausanne², CERN³, University of Padua⁴

08 Jan 2019-Physical Review D

TL;DR: In this paper, the authors use neural networks to detect data departures from a given reference model, with no prior bias on the nature of the new physics responsible for the discrepancy, using a likelihood-ratio hypothesis test.

...read moreread less

Abstract: We propose using neural networks to detect data departures from a given reference model, with no prior bias on the nature of the new physics responsible for the discrepancy. The virtues of neural networks as unbiased function approximants make them particularly suited for this task. An algorithm that implements this idea is constructed, as a straightforward application of the likelihood-ratio hypothesis test. The algorithm compares observations with an auxiliary set of reference-distributed events, possibly obtained with a Monte Carlo event generator. It returns a $p$ value, which measures the compatibility of the reference model with the data. It also identifies the most discrepant phase-space region of the data set, to be selected for further investigation. The most interesting potential applications are model-independent new physics searches, although our approach could also be used to compare the theoretical predictions of different Monte Carlo event generators, or for data validation algorithms. In this work we study the performance of our algorithm on a few simple examples. The results confirm the model independence of the approach, namely that it displays good sensitivity to a variety of putative signals. Furthermore, we show that the reach does not depend much on whether a favorable signal region is selected based on prior expectations. We identify directions for improvement towards applications to real experimental data sets.

...read moreread less

139 citations

Journal Article•DOI•

Automatic Pavement Crack Detection by Multi-Scale Image Fusion

[...]

Haifeng Li¹, Dezhen Song², Yu Liu¹, Binbin Li²•Institutions (2)

Civil Aviation University of China¹, Texas A&M University²

01 Jun 2019-IEEE Transactions on Intelligent Transportation Systems

TL;DR: This work develops a windowed minimal intensity path-based method to extract the candidate cracks in the image at each scale, and develops a crack evaluation model based on a multivariate statistical hypothesis test that outperforms all counterparts.

...read moreread less

Abstract: Pavement crack detection from images is a challenging problem due to intensity inhomogeneity, topology complexity, low contrast, and noisy texture background. Traditional learning-based approaches have difficulties in obtaining representative training samples. We propose a new unsupervised multi-scale fusion crack detection (MFCD) algorithm that does not require training data. First, we develop a windowed minimal intensity path-based method to extract the candidate cracks in the image at each scale. Second, we find the crack correspondences across different scales. Finally, we develop a crack evaluation model based on a multivariate statistical hypothesis test. Our approach successfully combines strengths from both the large-scale detection (robust but poor in localization) and the small-scale detection (detail-preserving but sensitive to clutter). We analyze and experimentally test the computational complexity of our MFCD algorithm. We have implemented the algorithm and have it extensively tested on three public data sets, including two public pavement data sets and an airport runway data set. Compared with six existing methods, experimental results show that our method outperforms all counterparts. Specifically, it increases the precision, recall, and F1-measure over the state-of-the-art by 22%, 12%, and 19%, respectively, on one public data set.

...read moreread less

134 citations

Journal Article•DOI•

Exploration, Inference, and Prediction in Neuroscience and Biomedicine.

[...]

Danilo Bzdok¹, Danilo Bzdok², John P. A. Ioannidis³•Institutions (3)

French Institute for Research in Computer Science and Automation¹, RWTH Aachen University², Stanford University³

01 Apr 2019-Trends in Neurosciences

TL;DR: In this article, the antagonistic philosophies behind two quantitative approaches: certifying robust effects in understandable variables, and evaluating how accurately a built model can forecast future outcomes, are discussed.

...read moreread less

123 citations

Journal Article•DOI•

Unsupervised Detection of Anomalous Sound Based on Deep Learning and the Neyman–Pearson Lemma

[...]

Yuma Koizumi¹, Shoichiro Saito¹, Hisashi Uematsu¹, Kawachi Yuta¹, Noboru Harada¹ - Show less +1 more•Institutions (1)

Nippon Telegraph and Telephone¹

01 Jan 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this article, an objective function based on the Neyman-Pearson lemma was proposed for unsupervised anomaly detection in sound ADS using an autoencoder AE.

...read moreread less

Abstract: This paper proposes a novel optimization principle and its implementation for unsupervised anomaly detection in sound ADS using an autoencoder AE. The goal of the unsupervised-ADS is to detect unknown anomalous sounds without training data of anomalous sounds. The use of an AE as a normal model is a state-of-the-art technique for the unsupervised-ADS. To decrease the false positive rate FPR, the AE is trained to minimize the reconstruction error of normal sounds, and the anomaly score is calculated as the reconstruction error of the observed sound. Unfortunately, since this training procedure does not take into account the anomaly score for anomalous sounds, the true positive rate TPR does not necessarily increase. In this study, we define an objective function based on the Neyman–Pearson lemma by considering the ADS as a statistical hypothesis test. The proposed objective function trains the AE to maximize the TPR under an arbitrary low FPR condition. To calculate the TPR in the objective function, we consider that the set of anomalous sounds is the complementary set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm. Through experiments using synthetic data, we found that the proposed method improved the performance measures of the ADS under low FPR conditions. In addition, we confirmed that the proposed method could detect anomalous sounds in real environments.

...read moreread less

122 citations

Journal Article•DOI•

A tutorial on testing hypotheses using the Bayes factor.

[...]

Herbert Hoijtink¹, Joris Mulder, Caspar J. van Lissa¹, Xin Gu²•Institutions (2)

Utrecht University¹, East China Normal University²

11 Feb 2019-Psychological Methods

TL;DR: After reading this tutorial and executing the associated code, researchers will be able to use their own data for the evaluation of hypotheses by means of the Bayes factor, not only in thecontext of ANOVA models, but also in the context of other statistical models.

...read moreread less

Abstract: Learning about hypothesis evaluation using the Bayes factor could enhance psychological research. In contrast to null-hypothesis significance testing it renders the evidence in favor of each of the hypotheses under consideration (it can be used to quantify support for the null-hypothesis) instead of a dichotomous reject/do-not-reject decision; it can straightforwardly be used for the evaluation of multiple hypotheses without having to bother about the proper manner to account for multiple testing; and it allows continuous reevaluation of hypotheses after additional data have been collected (Bayesian updating). This tutorial addresses researchers considering to evaluate their hypotheses by means of the Bayes factor. The focus is completely applied and each topic discussed is illustrated using Bayes factors for the evaluation of hypotheses in the context of an ANOVA model, obtained using the R package bain. Readers can execute all the analyses presented while reading this tutorial if they download bain and the R-codes used. It will be elaborated in a completely nontechnical manner: what the Bayes factor is, how it can be obtained, how Bayes factors should be interpreted, and what can be done with Bayes factors. After reading this tutorial and executing the associated code, researchers will be able to use their own data for the evaluation of hypotheses by means of the Bayes factor, not only in the context of ANOVA models, but also in the context of other statistical models. (PsycINFO Database Record (c) 2019 APA, all rights reserved).

...read moreread less

98 citations

Journal Article•DOI•

A methodology for energy multivariate time series forecasting in smart buildings based on feature selection

[...]

Aurora González-Vidal¹, Fernando Jiménez¹, Antonio F. Gómez-Skarmeta¹•Institutions (1)

University of Murcia¹

01 Aug 2019-Energy and Buildings

TL;DR: A methodology to transform the time-dependent database into a structure that standard machine learning algorithms can process, and then, apply different types of feature selection methods for regression tasks is proposed.

...read moreread less

97 citations

Journal Article•DOI•

Guiding New Physics Searches with Unsupervised Learning

[...]

Andrea De Simone¹, Thomas Jacques¹•Institutions (1)

International School for Advanced Studies¹

29 Mar 2019-European Physical Journal C

TL;DR: A statistical test upon a test statistic which measures deviations between two samples, using a Nearest Neighbors approach to estimate the local ratio of the density of points, which is model-independent and non-parametric.

...read moreread less

Abstract: We propose a new scientific application of unsupervised learning techniques to boost our ability to search for new phenomena in data, by detecting discrepancies between two datasets. These could be, for example, a simulated standard-model background, and an observed dataset containing a potential hidden signal of New Physics. We build a statistical test upon a test statistic which measures deviations between two samples, using a Nearest Neighbors approach to estimate the local ratio of the density of points. The test is model-independent and non-parametric, requiring no knowledge of the shape of the underlying distributions, and it does not bin the data, thus retaining full information from the multidimensional feature space. As a proof-of-concept, we apply our method to synthetic Gaussian data, and to a simulated dark matter signal at the Large Hadron Collider. Even in the case where the background can not be simulated accurately enough to claim discovery, the technique is a powerful tool to identify regions of interest for further study.

...read moreread less

Posted Content•

The Odds are Odd: A Statistical Test for Detecting Adversarial Examples

[...]

Kevin Roth¹, Yannic Kilcher¹, Thomas Hofmann¹•Institutions (1)

ETH Zurich¹

13 Feb 2019-arXiv: Learning

TL;DR: In this paper, the authors investigate conditions under which test statistics exist that can reliably detect examples, which have been adversarially manipulated in a white-box attack, and provide conditions for detectability via the suggested test statistics.

...read moreread less

Abstract: We investigate conditions under which test statistics exist that can reliably detect examples, which have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular if they follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement to defend models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test time predictions for adversarial attacks with high accuracy.

...read moreread less

Journal Article•DOI•

What Have We (Not) Learnt from Millions of Scientific Papers with P Values

[...]

John P. A. Ioannidis¹•Institutions (1)

Stanford University¹

20 Mar 2019-The American Statistician

TL;DR: While much of the unreliability of past and present research is driven by small, underpowered studies, NHST with P values may be also particularly problematic in the era of overpowered big data.

...read moreread less

Abstract: P values linked to null hypothesis significance testing (NHST) is the most widely (mis)used method of statistical inference. Empirical data suggest that across the biomedical literature (1990–2015)...

...read moreread less

Journal Article•DOI•

Methods to adjust for multiple comparisons in the analysis and sample size calculation of randomised controlled trials with multiple primary outcomes

[...]

Victoria Vickerstaff¹, Rumana Z Omar¹, Gareth Ambler¹•Institutions (1)

University College London¹

21 Jun 2019-BMC Medical Research Methodology

TL;DR: Simulation studies are used to investigate the disjunctive power, marginal power and FWER obtained after applying Bonferroni, Holm, Hochberg, Dubey/Armitage-Parmar and Stepdown-minP adjustment methods.

...read moreread less

Abstract: Multiple primary outcomes may be specified in randomised controlled trials (RCTs). When analysing multiple outcomes it’s important to control the family wise error rate (FWER). A popular approach to do this is to adjust the p-values corresponding to each statistical test used to investigate the intervention effects by using the Bonferroni correction. It’s also important to consider the power of the trial to detect true intervention effects. In the context of multiple outcomes, depending on the clinical objective, the power can be defined as: ‘disjunctive power’, the probability of detecting at least one true intervention effect across all the outcomes or ‘marginal power’ the probability of finding a true intervention effect on a nominated outcome. We provide practical recommendations on which method may be used to adjust for multiple comparisons in the sample size calculation and the analysis of RCTs with multiple primary outcomes. We also discuss the implications on the sample size for obtaining 90% disjunctive power and 90% marginal power. We use simulation studies to investigate the disjunctive power, marginal power and FWER obtained after applying Bonferroni, Holm, Hochberg, Dubey/Armitage-Parmar and Stepdown-minP adjustment methods. Different simulation scenarios were constructed by varying the number of outcomes, degree of correlation between the outcomes, intervention effect sizes and proportion of missing data. The Bonferroni and Holm methods provide the same disjunctive power. The Hochberg and Hommel methods provide power gains for the analysis, albeit small, in comparison to the Bonferroni method. The Stepdown-minP procedure performs well for complete data. However, it removes participants with missing values prior to the analysis resulting in a loss of power when there are missing data. The sample size requirement to achieve the desired disjunctive power may be smaller than that required to achieve the desired marginal power. The choice between whether to specify a disjunctive or marginal power should depend on the clincial objective.

...read moreread less

Posted Content•

Notes on Computational Hardness of Hypothesis Testing: Predictions using the Low-Degree Likelihood Ratio.

[...]

Dmitriy Kunisky¹, Alexander S. Wein¹, Afonso S. Bandeira¹•Institutions (1)

New York University¹

26 Jul 2019-arXiv: Statistics Theory

TL;DR: These notes survey and explore an emerging method, which is called the low-degree method, for predicting and understanding statistical-versus-computational tradeoffs in high-dimensional inference problems, which posits that a certain quantity gives insight into how much computational time is required to solve a given hypothesis testing problem.

...read moreread less

Abstract: These notes survey and explore an emerging method, which we call the low-degree method, for predicting and understanding statistical-versus-computational tradeoffs in high-dimensional inference problems. In short, the method posits that a certain quantity -- the second moment of the low-degree likelihood ratio -- gives insight into how much computational time is required to solve a given hypothesis testing problem, which can in turn be used to predict the computational hardness of a variety of statistical inference tasks. While this method originated in the study of the sum-of-squares (SoS) hierarchy of convex programs, we present a self-contained introduction that does not require knowledge of SoS. In addition to showing how to carry out predictions using the method, we include a discussion investigating both rigorous and conjectural consequences of these predictions. These notes include some new results, simplified proofs, and refined conjectures. For instance, we point out a formal connection between spectral methods and the low-degree likelihood ratio, and we give a sharp low-degree lower bound against subexponential-time algorithms for tensor PCA.

...read moreread less

Journal Article•DOI•

Power Analysis, Sample Size, and Assessment of Statistical Assumptions—Improving the Evidential Value of Lighting Research

[...]

J Uttley¹•Institutions (1)

University of Sheffield¹

25 Jan 2019-Leukos

TL;DR: Addressing issues raised in this article related to sample sizes, statistical test assumptions, and reporting of effect sizes can improve the evidential value of lighting research.

...read moreread less

Abstract: The reporting of accurate and appropriate conclusions is an essential aspect of scientific research, and failure in this endeavor can threaten the progress of cumulative knowledge. This is highligh...

...read moreread less

Journal Article•DOI•

Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing

[...]

Leonardo Novelli¹, Patricia Wollstadt², Pedro A. M. Mediano³, Michael Wibral⁴, Joseph T. Lizier¹ - Show less +1 more•Institutions (4)

University of Sydney¹, Honda², Imperial College London³, University of Göttingen⁴

15 Jul 2019

TL;DR: The algorithm presented—as implemented in the IDTxl open-source software—addresses challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization, and was validated on synthetic datasets involving random networks of increasing size.

...read moreread less

Abstract: Network inference algorithms are valuable tools for the study of large-scale neuroimaging datasets. Multivariate transfer entropy is well suited for this task, being a model-free measure that captures nonlinear and lagged dependencies between time series to infer a minimal directed network model. Greedy algorithms have been proposed to efficiently deal with high-dimensional datasets while avoiding redundant inferences and capturing synergistic effects. However, multiple statistical comparisons may inflate the false positive rate and are computationally demanding, which limited the size of previous validation studies. The algorithm we present-as implemented in the IDTxl open-source software-addresses these challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization. The method was validated on synthetic datasets involving random networks of increasing size (up to 100 nodes), for both linear and nonlinear dynamics. The performance increased with the length of the time series, reaching consistently high precision, recall, and specificity (>98% on average) for 10,000 time samples. Varying the statistical significance threshold showed a more favorable precision-recall trade-off for longer time series. Both the network size and the sample size are one order of magnitude larger than previously demonstrated, showing feasibility for typical EEG and magnetoencephalography experiments.

...read moreread less

Journal Article•DOI•

An Empirical Comparison of Machine-Learning Methods on Bank Client Credit Assessments

[...]

Lkhagvadorj Munkhdalai, Tsendsuren Munkhdalai, Oyun-Erdene Namsrai, Jong Yun Lee, Keun Ho Ryu - Show less +1 more

29 Jan 2019-Sustainability

TL;DR: If lending institutions in the 2001s had used their own credit scoring model constructed by machine-learning methods explored in this study, their expected credit losses would have been lower, and they would be more sustainable.

...read moreread less

Abstract: Machine learning and artificial intelligence have achieved a human-level performance in many application domains, including image classification, speech recognition and machine translation. However, in the financial domain expert-based credit risk models have still been dominating. Establishing meaningful benchmark and comparisons on machine-learning approaches and human expert-based models is a prerequisite in further introducing novel methods. Therefore, our main goal in this study is to establish a new benchmark using real consumer data and to provide machine-learning approaches that can serve as a baseline on this benchmark. We performed an extensive comparison between the machine-learning approaches and a human expert-based model—FICO credit scoring system—by using a Survey of Consumer Finances (SCF) data. As the SCF data is non-synthetic and consists of a large number of real variables, we applied two variable-selection methods: the first method used hypothesis tests, correlation and random forest-based feature importance measures and the second method was only a random forest-based new approach (NAP), to select the best representative features for effective modelling and to compare them. We then built regression models based on various machine-learning algorithms ranging from logistic regression and support vector machines to an ensemble of gradient boosted trees and deep neural networks. Our results demonstrated that if lending institutions in the 2001s had used their own credit scoring model constructed by machine-learning methods explored in this study, their expected credit losses would have been lower, and they would be more sustainable. In addition, the deep neural networks and XGBoost algorithms trained on the subset selected by NAP achieve the highest area under the curve (AUC) and accuracy, respectively.

...read moreread less

Journal Article•DOI•

Replication Bayes factors from evidence updating

[...]

Alexander Ly¹, Alexander Etz², Maarten Marsman¹, Eric-Jan Wagenmakers¹•Institutions (2)

University of Amsterdam¹, University of California, Irvine²

01 Jan 2019-Behavior Research Methods

TL;DR: A general method that allows experimenters to quantify the evidence from the data of a direct replication attempt given data already acquired from an original study is described.

...read moreread less

Abstract: We describe a general method that allows experimenters to quantify the evidence from the data of a direct replication attempt given data already acquired from an original study. These so-called replication Bayes factors are a reconceptualization of the ones introduced by Verhagen and Wagenmakers (Journal of Experimental Psychology: General, 143(4), 1457–1475 2014) for the common t test. This reconceptualization is computationally simpler and generalizes easily to most common experimental designs for which Bayes factors are available.

...read moreread less

Journal Article•DOI•

Machine learning-based statistical testing hypothesis for fault detection in photovoltaic systems

[...]

Radhia Fazai¹, Kamaleldin Abodayeh², Majdi Mansouri¹, Mohamed Trabelsi, Hazem Nounou¹, Mohamed Nounou¹, George E. Georghiou³ - Show less +3 more•Institutions (3)

Texas A&M University at Qatar¹, Prince Sultan University², University of Cyprus³

15 Sep 2019-Solar Energy

TL;DR: The developed GPR-based GLRT approach is assessed using simulated and real PV data through monitoring the key PV system variables and the computation time, missed detection rate, and false alarm rate are computed to evaluate the fault detection performance of the proposed approach.

...read moreread less

Journal Article•DOI•

Semantic and Cognitive Tools to Aid Statistical Science: Replace Confidence and Significance by Compatibility and Surprise

[...]

Zad Rafi¹, Sander Greenland²•Institutions (2)

New York University¹, University of California, Los Angeles²

18 Sep 2019-arXiv: Methodology

TL;DR: In this paper, the Shannon transform of the $P$-value $p, also known as the binary surprisal, is used to measure the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing.

...read moreread less

Abstract: Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and $P$-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review simple aids to statistical interpretations. These aids emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. We use the Shannon transform of the $P$-value $p$, also known as the binary surprisal or $S$-value $s=-\log_{2}(p)$, to measure the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing. We also use tables or graphs of test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. Finally, we reinterpret $P$-values and interval estimates in unconditional terms, which describe compatibility of data with the entire set of analysis assumptions. We illustrate these methods with a reanalysis of data from an existing record-based cohort study. In line with other recent recommendations, we advise that teaching materials and research reports discuss $P$-values as measures of compatibility rather than significance, compute $P$-values for alternative hypotheses whenever they are computed for null hypotheses, and interpret interval estimates as showing values of high compatibility with data, rather than regions of confidence. Our recommendations emphasize cognitive devices for displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates. We believe these simple reforms are well worth the minor effort they require.

...read moreread less

Posted Content•

Veridical Data Science

[...]

Bin Yu, Karl Kumbier

23 Jan 2019-arXiv: Machine Learning

TL;DR: The PCS framework will be illustrated through the development of the DeepTune framework for characterizing V4 neurons, and the PDR desiderata for interpretable machine learning as part of veridical data science will be proposed.

...read moreread less

Abstract: Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, comprised of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. Moreover, we develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others and compare it to existing methods in high dimensional, sparse linear model simulations. Over a wide range of misspecified simulation models, PCS inference demonstrates favorable performance in terms of ROC curves. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.

...read moreread less

Posted Content•

Multivariate Rank-based Distribution-free Nonparametric Testing using Measure Transportation

[...]

Nabarun Deb¹, Bodhisattva Sen¹•Institutions (1)

Columbia University¹

18 Sep 2019-arXiv: Statistics Theory

TL;DR: This paper proposes a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation, and proposes (multivariate) rank versions of distance covariance and energy statistic for testing scenarios (i) and (ii) respectively.

...read moreread less

Abstract: In this paper, we propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation. Unlike other existing proposals in the literature, these multivariate ranks share a number of useful properties with the usual one-dimensional ranks; most importantly, these ranks are distribution-free. This crucial observation allows us to design nonparametric tests that are exactly distribution-free under the null hypothesis. We demonstrate the applicability of this approach by constructing exact distribution-free tests for two classical nonparametric problems: (i) testing for mutual independence between random vectors, and (ii) testing for the equality of multivariate distributions. In particular, we propose (multivariate) rank versions of distance covariance (Szekely et al., 2007) and energy statistic (Szekely and Rizzo, 2013) for testing scenarios (i) and (ii) respectively. In both these problems, we derive the asymptotic null distribution of the proposed test statistics. We further show that our tests are consistent against all fixed alternatives. Moreover, the proposed tests are tuning-free, computationally feasible and are well-defined under minimal assumptions on the underlying distributions (e.g., they do not need any moment assumptions). We also demonstrate the efficacy of these procedures via extensive simulations. In the process of analyzing the theoretical properties of our procedures, we end up proving some new results in the theory of measure transportation and in the limit theory of permutation statistics using Stein's method for exchangeable pairs, which may be of independent interest.

...read moreread less

Journal Article•DOI•

Selection of appropriate statistical methods for data analysis

[...]

Prabhaker Mishra¹, Chandra Mani Pandey¹, Uttam Singh¹, Amit Keshri¹, Mayilvaganan Sabaretnam¹ - Show less +1 more•Institutions (1)

Sanjay Gandhi Post Graduate Institute of Medical Sciences¹

01 Jul 2019-Annals of Cardiac Anaesthesia

TL;DR: The present article has discussed the parametric and non-parametric methods, their assumptions, and how to select appropriate statistical methods for analysis and interpretation of the biomedical data.

...read moreread less

Abstract: In biostatistics, for each of the specific situation, statistical methods are available for analysis and interpretation of the data. To select the appropriate statistical method, one need to know the assumption and conditions of the statistical methods, so that proper statistical method can be selected for data analysis. Two main statistical methods are used in data analysis: descriptive statistics, which summarizes data using indexes such as mean and median and another is inferential statistics, which draw conclusions from data using statistical tests such as student's t-test. Selection of appropriate statistical method depends on the following three things: Aim and objective of the study, Type and distribution of the data used, and Nature of the observations (paired/unpaired). All type of statistical methods that are used to compare the means are called parametric while statistical methods used to compare other than means (ex-median/mean ranks/proportions) are called nonparametric methods. In the present article, we have discussed the parametric and non-parametric methods, their assumptions, and how to select appropriate statistical methods for analysis and interpretation of the biomedical data.

...read moreread less

Proceedings Article•

Calibration tests in multi-class classification: A unifying framework

[...]

David Widmann¹, Fredrik Lindsten², Dave Zachariah¹•Institutions (2)

Uppsala University¹, Linköping University²

01 Oct 2019

TL;DR: These estimators can be interpreted as test statistics associated with well-defined bounds and approximations of the p-value under the null hypothesis that the model is calibrated, significantly improving the interpretability of calibration measures, which otherwise lack any meaningful unit or scale.

...read moreread less

Abstract: In safety-critical applications a probabilistic model is usually required to be calibrated, i.e., to capture the uncertainty of its predictions accurately. In multi-class classification, calibration of the most confident predictions only is often not sufficient. We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected calibration error, the maximum calibration error, and the maximum mean calibration error. We propose and evaluate empirically different consistent and unbiased estimators for a specific class of measures based on matrix-valued kernels. Importantly, these estimators can be interpreted as test statistics associated with well-defined bounds and approximations of the p-value under the null hypothesis that the model is calibrated, significantly improving the interpretability of calibration measures, which otherwise lack any meaningful unit or scale.

...read moreread less

Posted Content•

Inference for multiple heterogeneous networks with a common invariant subspace

[...]

Jesús Arroyo¹, Avanti Athreya², Joshua Cape³, Guodong Chen², Carey E. Priebe², Joshua T. Vogelstein² - Show less +2 more•Institutions (3)

Texas A&M University¹, Johns Hopkins University², University of Pittsburgh³

24 Jun 2019-arXiv: Methodology

TL;DR: A new model is introduced, the common subspace independent-edge multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph, and is both flexible enough to meaningfully account for important graph differences and tractable enough to allow for accurate inference in multiple networks.

...read moreread less

Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge (COSIE) multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The COSIE model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices - the multiple adjacency spectral embedding (MASE) - leads, in a COSIE model, to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, MASE estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation and hypothesis testing. In both simulated and real data, the COSIE model and the MASE embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing and community detection. Specifically, when MASE is applied to a dataset of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by patient and a meaningful determination of heterogeneity across scans of different subjects.

...read moreread less

Journal Article•DOI•

How Do I Know What My Theory Predicts

[...]

Zoltan Dienes¹•Institutions (1)

University of Sussex¹

14 Nov 2019

TL;DR: In this paper, a Bayes factor is used to quantify the amount of evidence for or against a theory relative to the null hypothesis, which can then be quantified by a Bayesian approach.

...read moreread less

Abstract: To get evidence for or against a theory relative to the null hypothesis, one needs to know what the theory predicts. The amount of evidence can then be quantified by a Bayes factor. Specifying the ...

...read moreread less

Journal Article•DOI•

Bayesian clinical trial design using historical data that inform the treatment effect

[...]

Matthew A. Psioda¹, Joseph G. Ibrahim¹•Institutions (1)

University of North Carolina at Chapel Hill¹

01 Jul 2019-Biostatistics

TL;DR: The broadly applicable, simulation-based methodology provides a framework for calibrating the informativeness of a prior while simultaneously identifying the minimum sample size required for a new trial such that the overall design has appropriate power to detect a non-null treatment effect and reasonable type I error control.

...read moreread less

Abstract: We consider the problem of Bayesian sample size determination for a clinical trial in the presence of historical data that inform the treatment effect. Our broadly applicable, simulation-based methodology provides a framework for calibrating the informativeness of a prior while simultaneously identifying the minimum sample size required for a new trial such that the overall design has appropriate power to detect a non-null treatment effect and reasonable type I error control. We develop a comprehensive strategy for eliciting null and alternative sampling prior distributions which are used to define Bayesian generalizations of the traditional notions of type I error control and power. Bayesian type I error control requires that a weighted-average type I error rate not exceed a prespecified threshold. We develop a procedure for generating an appropriately sized Bayesian hypothesis test using a simple partial-borrowing power prior which summarizes the fraction of information borrowed from the historical trial. We present results from simulation studies that demonstrate that a hypothesis test procedure based on this simple power prior is as efficient as those based on more complicated meta-analytic priors, such as normalized power priors or robust mixture priors, when all are held to precise type I error control requirements. We demonstrate our methodology using a real data set to design a follow-up clinical trial with time-to-event endpoint for an investigational treatment in high-risk melanoma.

...read moreread less

Posted Content•

On Binscatter

[...]

Matias D. Cattaneo, Richard K. Crump, Max H. Farrell, Yingjie Feng

25 Feb 2019-arXiv: Econometrics

TL;DR: A foundational, thorough analysis of binscatter is presented: an array of theoretical and practical results that aid both in understanding current practices and in offering theory-based guidance for future applications are given.

...read moreread less

Abstract: \textit{Binscatter}, or a binned scatter plot, is a very popular tool in applied microeconomics. It provides a flexible, yet parsimonious way of visualizing and summarizing mean, quantile, and other nonparametric regression functions in large data sets. It is also often used for informal evaluation of substantive hypotheses such as linearity or monotonicity of the unknown function. This paper presents a foundational econometric analysis of binscatter, offering an array of theoretical and practical results that aid both understanding current practices (i.e., their validity or lack thereof) as well as guiding future applications. In particular, we highlight important methodological problems related to covariate adjustment methods used in current practice, and provide a simple, valid approach. Our results include a principled choice for the number of bins, confidence intervals and bands, hypothesis tests for parametric and shape restrictions for mean, quantile, and other functions of interest, among other new methods, all applicable to canonical binscatter as well as to nonlinear, higher-order polynomial, smoothness-restricted and covariate-adjusted extensions thereof. Companion general-purpose software packages for \texttt{Python}, \texttt{R}, and \texttt{Stata} are provided. From a technical perspective, we present novel theoretical results for possibly nonlinear semi-parametric partitioning-based series estimation with random partitions that are of independent interest.

...read moreread less

Journal Article•DOI•

More about the basic assumptions of t-test: normality and sample size

[...]

Tae Kyun Kim¹, Jae Hong Park²•Institutions (2)

Pusan National University¹, Inje University²

01 Apr 2019-Korean Journal of Anesthesiology

TL;DR: The relationships between normality, power, and sample size were discussed and it was found that as the sample size decreased in the normality test, sufficient power was not guaranteed even with the same significance level.

...read moreread less

Abstract: Most parametric tests start with the basic assumption on the distribution of populations. The conditions required to conduct the t-test include the measured values in ratio scale or interval scale, simple random extraction, normal distribution of data, appropriate sample size, and homogeneity of variance. The normality test is a kind of hypothesis test which has Type I and II errors, similar to the other hypothesis tests. It means that the sample size must influence the power of the normality test and its reliability. It is hard to find an established sample size for satisfying the power of the normality test. In the current article, the relationships between normality, power, and sample size were discussed. As the sample size decreased in the normality test, sufficient power was not guaranteed even with the same significance level. In the independent t-test, the change in power according to sample size and sample size ratio between groups was observed. When the sample size of one group was fixed and that of another group increased, power increased to some extent. However, it was not more efficient than increasing the sample sizes of both groups equally. To ensure the power in the normality test, sufficient sample size is required. The power is maximized when the sample size ratio between two groups is 1 : 1.

...read moreread less

Collapse