
Showing papers on "Statistical hypothesis testing" published in 2016


Journal ArticleDOI
TL;DR: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant, as discussed by the authors. There are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof; instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists.
Abstract: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so, and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.

1,584 citations
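
The abstract's point that selecting analyses by their P values can produce small P values even when the test hypothesis is true is easy to demonstrate by simulation. A minimal sketch in Python, assuming independent normal outcomes and one-sample t-tests; the scenario and numbers are illustrative, not taken from the paper:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_outcomes, n = 2000, 10, 30

# Every null hypothesis is true: all outcomes have mean 0.
reported_p = []
for _ in range(n_experiments):
    data = rng.normal(size=(n_outcomes, n))
    pvals = [stats.ttest_1samp(x, 0.0).pvalue for x in data]
    reported_p.append(min(pvals))  # "select" the most impressive analysis

# Under honest reporting about 5% of single tests fall below 0.05;
# selecting the smallest of 10 p-values inflates that rate dramatically.
print("fraction of 'significant' reports:", np.mean(np.array(reported_p) < 0.05))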


Journal Article
TL;DR: This paper provided definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions, and provided an explanatory list of 25 misinterpretations of P values, confidence intervals, and power.
Abstract: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so-and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.

1,354 citations


01 Jan 2016
Introduction to Robust Estimation and Hypothesis Testing

968 citations


Journal ArticleDOI
TL;DR: This paper provides a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects, and proposes an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation.
Abstract: In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7–22%.

913 citations
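
The sample-splitting ("honest") idea can be sketched with off-the-shelf tools: one half of the data grows a partition, the other half estimates a treatment effect within each leaf. The sketch below uses an ordinary regression tree on a simple transformed outcome rather than the authors' modified splitting criterion, and all data and settings are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, p = 4000, 5
X = rng.normal(size=(n, p))
w = rng.integers(0, 2, size=n)                 # randomized treatment, P(w=1) = 0.5
tau = 1.0 * (X[:, 0] > 0)                      # heterogeneous treatment effect
y = X[:, 1] + tau * w + rng.normal(size=n)

half = n // 2
tr, est = np.arange(half), np.arange(half, n)  # split: build partition vs. estimate

# Build the partition on the first half using an IPW-style transformed outcome
# (not the paper's modified splitting criterion).
proxy = 2 * (2 * w[tr] - 1) * y[tr]
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200).fit(X[tr], proxy)

# Honest step: estimate the treatment effect within each leaf on the second half.
leaves = tree.apply(X[est])
for leaf in np.unique(leaves):
    m = leaves == leaf
    effect = y[est][m][w[est][m] == 1].mean() - y[est][m][w[est][m] == 0].mean()
    print(f"leaf {leaf}: estimated effect {effect:.2f}")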


Journal ArticleDOI
TL;DR: A general approach to valid inference after model selection by the lasso is developed to form valid confidence intervals for the selected coefficients and test whether all relevant variables have been included in the model.
Abstract: We develop a general approach to valid inference after model selection. At the core of our framework is a result that characterizes the distribution of a post-selection estimator conditioned on the selection event. We specialize the approach to model selection by the lasso to form valid confidence intervals for the selected coefficients and test whether all relevant variables have been included in the model.

616 citations


Journal ArticleDOI
TL;DR: A new Empirical Bayes approach for large-scale hypothesis testing is introduced, including estimation of false discovery rates (FDRs) and effect sizes, and it is argued that the local false sign rate is a better measure of significance than the local FDR because it is both more generally applicable and can be more robustly estimated.
Abstract: We introduce a new Empirical Bayes approach for large-scale hypothesis testing, including estimating false discovery rates (FDRs) and effect sizes. This approach has two key differences from existing approaches to FDR analysis. First, it assumes that the distribution of the actual (unobserved) effects is unimodal, with a mode at 0. This "unimodal assumption" (UA), although natural in many contexts, is not usually incorporated into standard FDR analysis, and we demonstrate how incorporating it brings many benefits. Specifically, the UA facilitates efficient and robust computation - estimating the unimodal distribution involves solving a simple convex optimization problem - and enables more accurate inferences provided that it holds. Second, the method takes as its input two numbers for each test (an effect size estimate and corresponding standard error), rather than the one number usually used ($p$ value or $z$ score). When available, using two numbers instead of one helps account for variation in measurement precision across tests. It also facilitates estimation of effects, and unlike standard FDR methods, our approach provides interval estimates (credible regions) for each effect in addition to measures of significance. To provide a bridge between interval estimates and significance measures, we introduce the term "local false sign rate" to refer to the probability of getting the sign of an effect wrong and argue that it is a better measure of significance than the local FDR because it is both more generally applicable and can be more robustly estimated. Our methods are implemented in an R package ashr available from http://github.com/stephens999/ashr.

615 citations
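
The local false sign rate has a simple form once a posterior is available for each effect. A minimal sketch under a single normal prior (the paper fits a flexible unimodal mixture with a point mass via ashr, which this does not reproduce; the inputs are made up):

import numpy as np
from scipy import stats

# Observed effect estimates and their standard errors (illustrative numbers).
beta_hat = np.array([0.8, -0.1, 0.3])
se = np.array([0.3, 0.2, 0.4])

# Simplified model: beta ~ N(0, tau^2) prior, beta_hat | beta ~ N(beta, se^2).
tau2 = 0.25
post_var = 1.0 / (1.0 / tau2 + 1.0 / se**2)
post_mean = post_var * beta_hat / se**2

# lfsr = probability of getting the sign wrong = min(P(beta <= 0), P(beta >= 0)).
p_neg = stats.norm.cdf(0.0, loc=post_mean, scale=np.sqrt(post_var))
lfsr = np.minimum(p_neg, 1.0 - p_neg)
print(lfsr)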


Journal ArticleDOI
TL;DR: Independent hypothesis weighting (IHW) is described, a method that assigns weights using covariates independent of the P-values under the null hypothesis but informative of each test's power or prior probability of the null hypothesis.
Abstract: For multiple hypothesis testing in genomics and other large-scale data analyses, the independent hypothesis weighting (IHW) approach uses data-driven P-value weight assignment to improve power while controlling the false discovery rate. Hypothesis weighting improves the power of large-scale multiple testing. We describe independent hypothesis weighting (IHW), a method that assigns weights using covariates independent of the P-values under the null hypothesis but informative of each test's power or prior probability of the null hypothesis ( http://www.bioconductor.org/packages/IHW ). IHW increases power while controlling the false discovery rate and is a practical approach to discovering associations in genomics, high-throughput biology and other large data sets.

480 citations
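
The mechanic that IHW builds on is p-value weighting combined with the Benjamini-Hochberg procedure. A minimal sketch with fixed, externally supplied weights; IHW itself learns the weights from an independent covariate, which is not shown here:

import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Benjamini-Hochberg applied to p_i / w_i, with weights averaging to 1."""
    pvals, weights = np.asarray(pvals), np.asarray(weights)
    weights = weights / weights.mean()          # normalize so the mean weight is 1
    q = pvals / weights                         # weighted p-values
    m = len(q)
    order = np.argsort(q)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = q[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.009, 0.04, 0.20, 0.65]
weights = [2.0, 2.0, 0.5, 0.5, 1.0]             # e.g., derived from an informative covariate
print(weighted_bh(pvals, weights))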


Journal ArticleDOI
TL;DR: This paper reviews the history of the multiple-testing issue within the atmospheric sciences literature and illustrates a statistically principled and computationally easy approach to dealing with it—namely, control of the false discovery rate.
Abstract: Special care must be exercised in the interpretation of multiple statistical hypothesis tests—for example, when each of many tests corresponds to a different location. Correctly interpreting results of multiple simultaneous tests requires a higher standard of evidence than is the case when evaluating results of a single test, and this has been known in the atmospheric sciences literature for more than a century. Even so, the issue continues to be widely ignored, leading routinely to overstatement and overinterpretation of scientific results, to the detriment of the discipline. This paper reviews the history of the multiple-testing issue within the atmospheric sciences literature and illustrates a statistically principled and computationally easy approach to dealing with it—namely, control of the false discovery rate.

467 citations
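
Controlling the false discovery rate across many gridpoint tests is straightforward with standard libraries. A hedged sketch using the Benjamini-Hochberg procedure via statsmodels, with simulated p-values standing in for a field of local tests:

import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
# Illustrative p-values from tests at 500 grid points: most nulls true,
# plus a small cluster of locations with a real signal.
p_null = rng.uniform(size=450)
p_signal = rng.beta(0.1, 1.0, size=50)           # concentrated near zero
pvals = np.concatenate([p_null, p_signal])

naive = pvals < 0.05                              # "local" significance, no correction
reject_fdr, _, _, _ = multipletests(pvals, alpha=0.10, method="fdr_bh")

print("locally significant points:   ", naive.sum())
print("points surviving FDR control: ", reject_fdr.sum())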


Posted Content
TL;DR: This work develops likelihood-free inference methods and highlights hypothesis testing as a principle for learning in implicit generative models, which it uses to derive the objective function used by GANs and many other related objectives.
Abstract: Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties: they do not require a likelihood function to be specified, only a generating procedure; they provide samples that are sharp and compelling; and they allow us to harness our knowledge of building highly accurate neural network classifiers. Here, we develop our understanding of GANs with the aim of forming a rich view of this growing area of machine learning---to build connections to the diverse set of statistical thinking on this topic, of which much can be gained by a mutual exchange of ideas. We frame GANs within the wider landscape of algorithms for learning in implicit generative models--models that only specify a stochastic procedure with which to generate data--and relate these ideas to modelling problems in related fields, such as econometrics and approximate Bayesian computation. We develop likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models, using which we are able to derive the objective function used by GANs, and many other related objectives. The testing viewpoint directs our focus to the general problem of density ratio estimation. There are four approaches for density ratio estimation, one of which is a solution using classifiers to distinguish real from generated data. Other approaches such as divergence minimisation and moment matching have also been explored in the GAN literature, and we synthesise these views to form an understanding in terms of the relationships between them and the wider literature, highlighting avenues for future exploration and cross-pollination.

343 citations
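
The density-ratio view can be made concrete with a classifier: if a classifier estimates P(real | x) on balanced classes, then p(x)/q(x) is approximately P(real | x) / (1 - P(real | x)). A toy sketch with logistic regression on one-dimensional data; none of this is the paper's code:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
real = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))     # "data" distribution p
fake = rng.normal(loc=0.5, scale=1.2, size=(2000, 1))     # "model" distribution q

X = np.vstack([real, fake])
y = np.concatenate([np.ones(2000), np.zeros(2000)])        # 1 = real, 0 = generated

clf = LogisticRegression().fit(X, y)

x_grid = np.linspace(-3, 3, 7).reshape(-1, 1)
p_real = clf.predict_proba(x_grid)[:, 1]
ratio = p_real / (1.0 - p_real)                            # estimated p(x)/q(x) on the grid
print(np.round(ratio, 2))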


Posted Content
TL;DR: By incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, the bootstrap-based procedure has much greater ability to detect truly false null hypotheses.
Abstract: Empiricism in the sciences allows us to test theories, formulate optimal policies, and learn how the world works. In this manner, it is critical that our empirical work provides accurate conclusions about underlying data patterns. False positives represent an especially important problem, as vast public and private resources can be misguided if we base decisions on false discovery. This study explores one especially pernicious influence on false positives—multiple hypothesis testing (MHT). While MHT potentially affects all types of empirical work, we consider three common scenarios where MHT influences inference within experimental economics: jointly identifying treatment effects for a set of outcomes, estimating heterogeneous treatment effects through subgroup analysis, and conducting hypothesis testing for multiple treatment conditions. Building upon the work of Romano and Wolf (2010), we present a correction procedure that incorporates the three scenarios, and illustrate the improvement in power by comparing our results with those obtained by the classic studies due to Bonferroni (1935) and Holm (1979). Importantly, under weak assumptions, our testing procedure asymptotically controls the familywise error rate – the probability of one false rejection – and is asymptotically balanced. We showcase our approach by revisiting the data reported in Karlan and List (2007), to deepen our understanding of why people give to charitable causes.

292 citations
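
For comparison, the classical corrections that the bootstrap-based procedure improves upon are readily available; a brief sketch using statsmodels with made-up p-values:

from statsmodels.stats.multitest import multipletests

# Illustrative p-values from, e.g., several outcomes or treatment arms.
pvals = [0.004, 0.012, 0.025, 0.031, 0.18, 0.44]

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject, p_adj.round(3))

# Holm rejects at least as many hypotheses as Bonferroni while still controlling
# the familywise error rate; the paper's bootstrap procedure gains further power
# by exploiting dependence among the test statistics.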


Journal ArticleDOI
TL;DR: In this article, the authors propose new inference tools for forward stepwise regression, least angle regression, and the lasso, exploiting the fact that the selection events of these procedures can be expressed as polyhedral constraints on the observation vector y.
Abstract: We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact Type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package selectiveInference, freely available on the CRAN repository, implements the new inference tools described in this article. Suppl...
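
The flavor of conditioning on a polyhedral selection event can be seen in one dimension: if a mean is tested only because the sample mean exceeds a threshold (a half-space event), the valid p-value comes from a truncated normal. A toy sketch, not the selectiveInference implementation:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma, c = 25, 1.0, 0.3                 # test mu = 0 only when the sample mean exceeds c
se = sigma / np.sqrt(n)

ybars = rng.normal(loc=0.0, scale=se, size=5000)   # the null is true in every dataset
selected = ybars[ybars > c]                        # only these analyses get reported

p_naive = stats.norm.sf(selected, scale=se)            # ignores the selection step
p_selective = p_naive / stats.norm.sf(c, scale=se)     # N(0, se^2) truncated to (c, infinity)

print("selected analyses:               ", selected.size)
print("naive p < 0.05 among selected:   ", np.round(np.mean(p_naive < 0.05), 2))
print("selective p < 0.05 among selected:", np.round(np.mean(p_selective < 0.05), 2))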

Journal ArticleDOI
TL;DR: In this paper, the authors focus on two common inferential scenarios: testing the nullity of a normal mean (i.e., the Bayesian equivalent of the t-test) and testing the correlation.

Journal ArticleDOI
TL;DR: A novel ensemble model for bankruptcy prediction that utilizes Extreme Gradient Boosting to learn an ensemble of decision trees is proposed, together with a new approach for generating synthetic features to improve prediction.
Abstract: We propose a novel ensemble model for bankruptcy prediction. We use Extreme Gradient Boosting as an ensemble of decision trees. We propose a new approach for generating synthetic features to improve prediction. The presented method is evaluated on real-life data of Polish companies. Bankruptcy prediction has been a subject of interest for almost a century and it still ranks among the hottest topics in economics. The aim of predicting financial distress is to develop a predictive model that combines various econometric measures and allows one to foresee the financial condition of a firm. In this domain, various methods have been proposed based on statistical hypothesis testing, statistical modeling (e.g., generalized linear models), and recently artificial intelligence (e.g., neural networks, Support Vector Machines, decision trees). In this paper, we propose a novel approach for bankruptcy prediction that utilizes Extreme Gradient Boosting for learning an ensemble of decision trees. Additionally, in order to reflect higher-order statistics in data and impose prior knowledge about the data representation, we introduce a new concept that we refer to as synthetic features. A synthetic feature is a combination of the econometric measures using arithmetic operations (addition, subtraction, multiplication, division). Each synthetic feature can be seen as a single regression model that is developed in an evolutionary manner. We evaluate our solution using the collected data about Polish companies in five tasks corresponding to bankruptcy prediction in the 1st, 2nd, 3rd, 4th, and 5th year. We compare our approach with the reference methods.
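
The synthetic-feature idea, arithmetic combinations of the raw measures fed to a boosted tree ensemble, can be sketched as follows. Scikit-learn's gradient boosting stands in for Extreme Gradient Boosting, the data are simulated, and the feature construction is a simplified, non-evolutionary version:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n, p = 1000, 6
X = rng.normal(size=(n, p)) + 3.0                 # stand-in econometric measures
y = (X[:, 0] / X[:, 1] + 0.5 * X[:, 2] * X[:, 3] + rng.normal(size=n) > 5.5).astype(int)

# Synthetic features: pairwise arithmetic combinations of the raw measures.
ratios = [X[:, i] / X[:, j] for i in range(p) for j in range(p) if i != j]
products = [X[:, i] * X[:, j] for i in range(p) for j in range(i + 1, p)]
X_aug = np.column_stack([X] + ratios + products)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
print("raw features:  ", cross_val_score(model, X, y, cv=5).mean().round(3))
print("with synthetic:", cross_val_score(model, X_aug, y, cv=5).mean().round(3))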

Journal ArticleDOI
TL;DR: In this article, the authors explore the concept of statistical evidence and how it can be quantified using the Bayes factor, and discuss the philosophical issues inherent in its use.

Journal ArticleDOI
TL;DR: The practical advantages of Bayesian inference are demonstrated through two concrete examples, which show how Bayesian analyses can be more informative, more elegant, and more flexible than the orthodox methodology that remains dominant within the field of psychology.
Abstract: The practical advantages of Bayesian inference are demonstrated here through two concrete examples. In the first example, we wish to learn about a criminal’s IQ: a problem of parameter estimation. In the second example, we wish to quantify and track support in favor of the null hypothesis that Adam Sandler movies are profitable regardless of their quality: a problem of hypothesis testing. The Bayesian approach unifies both problems within a coherent predictive framework, in which parameters and models that predict the data successfully receive a boost in plausibility, whereas parameters and models that predict poorly suffer a decline. Our examples demonstrate how Bayesian analyses can be more informative, more elegant, and more flexible than the orthodox methodology that remains dominant within the field of psychology.
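
A minimal worked Bayes factor for a point null, in the spirit of the paper's hypothesis-testing example but with made-up binomial data (H0: theta = 0.5 versus H1: theta uniform on (0, 1)):

import numpy as np
from scipy.special import betaln, comb

k, n = 34, 50                      # illustrative successes out of n trials

# Marginal likelihood under H0: theta fixed at 0.5.
log_m0 = np.log(comb(n, k)) + n * np.log(0.5)
# Marginal likelihood under H1: theta ~ Beta(1, 1); the integral is a Beta function.
log_m1 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1)

bf10 = np.exp(log_m1 - log_m0)     # evidence for H1 over H0
print(f"BF10 = {bf10:.2f}")

# "Tracking support" as data arrive simply means recomputing this quantity
# after each new observation.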

Journal ArticleDOI
TL;DR: In this article, Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals are discussed, as well as Bayesian approaches to meta-analysis, randomized controlled trials, and planning.
Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty, on the other hand. Among frequentists in psychology a shift of emphasis from hypothesis testing to estimation has been dubbed "the New Statistics" (Cumming, 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and planning (e.g., power analysis).

Journal ArticleDOI
TL;DR: The reader is acquainted with the basic research tools that are utilised while conducting various studies, and an understanding of quantitative and qualitative variables and the measures of central tendency is covered.
Abstract: Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretations and reporting the research findings. Statistical analysis gives meaning to otherwise meaningless numbers, thereby breathing life into lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.

Journal ArticleDOI
TL;DR: Five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology are tested: the Kolmogorov-Smirnov and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots, kernel density estimates, and locally adaptive, variable-bandwidth KDEs.
Abstract: The increase in detrital geochronological data presents challenges to existing approaches to data visualization and comparison, and highlights the need for quantitative techniques able to evaluate and compare multiple large data sets. We test five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology: the Kolmogorov-Smirnov (K-S) and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots (PDPs), kernel density estimates (KDEs), and locally adaptive, variable-bandwidth KDEs (LA-KDEs). We assess these metrics by applying them to 20 large synthetic data sets and one large empirical data set, and evaluate their utility in terms of sample similarity based on the following three criteria. (1) Similarity of samples from the same population should systematically increase with increasing sample size. (2) Metrics should maximize sensitivity by using the full range of possible coefficients. (3) Metrics should minimize artifacts resulting from sample-specific complexity. K-S and Kuiper test p-values passed only one criterion, indicating that they are poorly suited as quantitative descriptors of sample similarity. Likeness and Similarity coefficients of PDPs, as well as K-S and Kuiper test D and V values, performed better by passing two of the criteria. Cross-correlation of PDPs passed all three criteria. All coefficients calculated from KDEs and LA-KDEs failed at least two of the criteria. As hypothesis tests of derivation from a common source, individual K-S and Kuiper p-values too frequently reject the null hypothesis that samples come from a common source when they are identical. However, mean p-values calculated by repeated subsampling and comparison (minimum of 4 trials) consistently yield a binary discrimination of identical versus different source populations. Cross-correlation and Likeness of PDPs and Cross-correlation of KDEs yield the widest divergence in coefficients and thus a consistent discrimination between identical and different source populations, with Cross-correlation of PDPs requiring the smallest sample size. In light of this, we recommend acquisition of large detrital geochronology data sets for quantitative comparison. We also recommend repeated subsampling of detrital geochronology data sets and calculation of the mean and standard deviation of the comparison metric in order to capture the variability inherent in sampling a multimodal population. These statistical tools are implemented using DZstats, a MATLAB-based code that can be accessed via an executable file graphical user interface. It implements all of the statistical tests discussed in this paper, and exports the results both as spreadsheets and as graphic files.
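
Two of the compared ingredients, the two-sample K-S test and a cross-correlation-style coefficient between density estimates on a shared grid, are easy to compute. A hedged sketch with synthetic age data; the coefficient here is a simplified stand-in for the paper's PDP-based definition, and DZstats is not used:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Two synthetic detrital-age samples (Ma) drawn from the same mixture of age peaks.
peaks, weights = [300, 1100, 1800], [0.5, 0.3, 0.2]
draw = lambda n: np.concatenate(
    [rng.normal(m, 40, size=int(w * n)) for m, w in zip(peaks, weights)])
ages1, ages2 = draw(300), draw(120)

# Kolmogorov-Smirnov two-sample test (D statistic and p-value).
D, p = stats.ks_2samp(ages1, ages2)

# Cross-correlation coefficient of kernel density estimates on a shared grid.
grid = np.linspace(0, 2500, 1000)
kde1, kde2 = stats.gaussian_kde(ages1)(grid), stats.gaussian_kde(ages2)(grid)
r2 = np.corrcoef(kde1, kde2)[0, 1] ** 2

print(f"K-S D = {D:.3f}, p = {p:.3f}, cross-correlation R^2 = {r2:.3f}")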

Journal ArticleDOI
TL;DR: This tutorial study clarifies the relationships between traditional methods based on allele frequency differentiation and EA methods and provides a unified framework for their underlying statistical tests, and demonstrates how techniques developed in the area of genomewide association studies, such as inflation factors and linear mixed models, benefit genome scan methods.
Abstract: Population differentiation (PD) and ecological association (EA) tests have recently emerged as prominent statistical methods to investigate signatures of local adaptation using population genomic data. Based on statistical models, these genomewide testing procedures have attracted considerable attention as tools to identify loci potentially targeted by natural selection. An important issue with PD and EA tests is that incorrect model specification can generate large numbers of false-positive associations. Spurious association may indeed arise when shared demographic history, patterns of isolation by distance, cryptic relatedness or genetic background are ignored. Recent works on PD and EA tests have widely focused on improvements of test corrections for those confounding effects. Despite significant algorithmic improvements, there is still a number of open questions on how to check that false discoveries are under control and implement test corrections, or how to combine statistical tests from multiple genome scan methods. This tutorial study provides a detailed answer to these questions. It clarifies the relationships between traditional methods based on allele frequency differentiation and EA methods and provides a unified framework for their underlying statistical tests. We demonstrate how techniques developed in the area of genomewide association studies, such as inflation factors and linear mixed models, benefit genome scan methods and provide guidelines for good practice while conducting statistical tests in landscape and population genomic applications. Finally, we highlight how the combination of several well-calibrated statistical tests can increase the power to reject neutrality, improving our ability to infer patterns of local adaptation in large population genomic data sets.
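
One of the reviewed calibration tools, the genomic inflation factor and the associated genomic-control rescaling, is simple to compute. A minimal sketch on simulated association statistics; this follows standard genomic-control practice, not code from the tutorial:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated chi-square (1 df) association statistics with mild confounding
# that inflates every statistic by a constant factor.
chi2_obs = 1.3 * rng.chisquare(df=1, size=20000)

# Genomic inflation factor: observed median over the theoretical null median.
lam = np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)
print(f"lambda_GC = {lam:.2f}")          # ~1 means well calibrated, >1 suggests inflation

# Genomic control: rescale the statistics and recompute p-values.
p_adj = stats.chi2.sf(chi2_obs / lam, df=1)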

Journal ArticleDOI
TL;DR: In this paper, the authors derived a closed-form statistic for transient detection, flux measurement, and any image-difference hypothesis testing, which is mathematically proven to be the optimal transient detection statistic in the limit of background-dominated noise.
Abstract: Transient detection and flux measurement via image subtraction stand at the base of time domain astronomy. Due to the varying seeing conditions, the image subtraction process is non-trivial, and existing solutions suffer from a variety of problems. Starting from basic statistical principles, we develop the optimal statistic for transient detection, flux measurement, and any image-difference hypothesis testing. We derive a closed-form statistic that: (1) is mathematically proven to be the optimal transient detection statistic in the limit of background-dominated noise, (2) is numerically stable, (3) for accurately registered, adequately sampled images, does not leave subtraction or deconvolution artifacts, (4) allows automatic transient detection to the theoretical sensitivity limit by providing credible detection significance, (5) has uncorrelated white noise, (6) is a sufficient statistic for any further statistical test on the difference image, and, in particular, allows us to distinguish particle hits and other image artifacts from real transients, (7) is symmetric to the exchange of the new and reference images, (8) is at least an order of magnitude faster to compute than some popular methods, and (9) is straightforward to implement. Furthermore, we present extensions of this method that make it resilient to registration errors, color-refraction errors, and any noise source that can be modeled. In addition, we show that the optimal way to prepare a reference image is the proper image coaddition presented in Zackay & Ofek. We demonstrate this method on simulated data and real observations from the PTF data release 2. We provide an implementation of this algorithm in MATLAB and Python.

Journal ArticleDOI
TL;DR: In this article, the authors explore the properties of the RV coefficient using simulated data sets and show that it is adversely affected by attributes of the data (sample size and number of variables) that do not characterize the covariance structure between sets of variables.
Abstract: Modularity describes the case where patterns of trait covariation are unevenly dispersed across traits. Specifically, trait correlations are high and concentrated within subsets of variables (modules), but the correlations between traits across modules are relatively weaker. For morphometric data sets, hypotheses of modularity are commonly evaluated using the RV coefficient, an association statistic used in a wide variety of fields. In this article, I explore the properties of the RV coefficient using simulated data sets. Using data drawn from a normal distribution where the data were neither modular nor integrated in structure, I show that the RV coefficient is adversely affected by attributes of the data (sample size and the number of variables) that do not characterize the covariance structure between sets of variables. Thus, with the RV coefficient, patterns of modularity or integration in data are confounded with trends generated by sample size and the number of variables, which limits biological interpretations and renders comparisons of RV coefficients across data sets uninformative. As an alternative, I propose the covariance ratio (CR) for quantifying modular structure and show that it is unaffected by sample size or the number of variables. Further, statistical tests based on the CR exhibit appropriate type I error rates and display higher statistical power relative to the RV coefficient when evaluating modular data. Overall, these findings demonstrate that the RV coefficient does not display statistical characteristics suitable for reliable assessment of hypotheses of modular or integrated structure and therefore should not be used to evaluate these patterns in morphological data sets. By contrast, the covariance ratio meets these criteria and provides a useful alternative method for assessing the degree of modular structure in morphological data.
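
The RV coefficient itself takes only a few lines to compute, which makes the sample-size dependence documented here easy to reproduce by simulation; a minimal sketch on random, non-modular data:

import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks of column-centered variables."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxy = X.T @ Y
    Sxx = X.T @ X
    Syy = Y.T @ Y
    return np.trace(Sxy @ Sxy.T) / np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))

rng = np.random.default_rng(8)
for n in (20, 50, 200, 1000):
    X, Y = rng.normal(size=(n, 5)), rng.normal(size=(n, 5))  # no real association
    print(n, round(rv_coefficient(X, Y), 3))

# The expected RV under independence shrinks toward 0 only as n grows,
# which is the sample-size dependence the paper documents.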

Proceedings Article
04 Nov 2016
TL;DR: The properties, performance, and uses of C2ST are established: their main theoretical properties are analyzed, their performance is compared against state-of-the-art alternatives, and their use to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks, is proposed.
Abstract: The goal of two-sample tests is to assess whether two samples, $S_P \sim P^n$ and $S_Q \sim Q^m$, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary classifiers. In particular, construct a dataset by pairing the $n$ examples in $S_P$ with a positive label, and by pairing the $m$ examples in $S_Q$ with a negative label. If the null hypothesis "$P = Q$" is true, then the classification accuracy of a binary classifier on a held-out subset of this dataset should remain near chance-level. As we will show, such Classifier Two-Sample Tests (C2ST) learn a suitable representation of the data on the fly, return test statistics in interpretable units, have a simple null distribution, and their predictive uncertainty allow to interpret where $P$ and $Q$ differ. The goal of this paper is to establish the properties, performance, and uses of C2ST. First, we analyze their main theoretical properties. Second, we compare their performance against a variety of state-of-the-art alternatives. Third, we propose their use to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks (GANs). Fourth, we showcase the novel application of GANs together with C2ST for causal discovery.
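
A minimal classifier two-sample test on toy data; logistic regression stands in for the neural-network classifiers considered in the paper, and the binomial null for held-out accuracy follows the chance-level argument in the abstract:

import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
S_P = rng.normal(0.0, 1.0, size=(1000, 2))
S_Q = rng.normal(0.2, 1.0, size=(1000, 2))         # slightly shifted distribution

X = np.vstack([S_P, S_Q])
y = np.concatenate([np.zeros(1000), np.ones(1000)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Under H0 (P = Q), held-out accuracy is Binomial(n_test, 0.5) / n_test.
n_te = len(y_te)
n_correct = int(round(acc * n_te))
p_value = stats.binom.sf(n_correct - 1, n_te, 0.5)
print(f"held-out accuracy = {acc:.3f}, p-value = {p_value:.4f}")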

Journal ArticleDOI
TL;DR: This work theoretically establishes the limiting distribution of the principal eigenvalue of the suitably centred and scaled adjacency matrix and uses that distribution for the test of the hypothesis that a random graph is of Erdős–Rényi (noise) type, and designs a recursive bipartitioning algorithm, which naturally uncovers nested community structure.
Abstract: Community detection in networks is a key exploratory tool with applications in a diverse set of areas, ranging from finding communities in social and biological networks to identifying link farms in the World Wide Web. The problem of finding communities or clusters in a network has received much attention from statistics, physics and computer science. However, most clustering algorithms assume knowledge of the number of clusters k. We propose to determine k automatically in a graph generated from a stochastic block model by using a hypothesis test of independent interest. Our main contribution is twofold: first, we theoretically establish the limiting distribution of the principal eigenvalue of the suitably centred and scaled adjacency matrix and use that distribution for our test of the hypothesis that a random graph is of Erdős–Rényi (noise) type. Secondly, we use this test to design a recursive bipartitioning algorithm, which naturally uncovers nested community structure. Using simulations and quantifiable classification tasks on real world networks with ground truth, we show that our algorithm outperforms state of the art methods.

Journal ArticleDOI
TL;DR: The basic concepts and practical use of nonparametric tests are discussed as a guide to their proper use.
Abstract: Conventional statistical tests are usually called parametric tests. Parametric tests are used more frequently than nonparametric tests in many medical articles because most medical researchers are familiar with them and statistical software packages strongly support them. Parametric tests require an important assumption: the assumption of normality, which means that the distribution of sample means is normally distributed. However, parametric tests can be misleading when this assumption is not satisfied. In this circumstance, nonparametric tests are the alternative methods available, because they do not require the normality assumption. Nonparametric tests are statistical methods based on signs and ranks. In this article, we discuss the basic concepts and practical use of nonparametric tests as a guide to their proper use.
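
A brief illustration of the parametric/nonparametric pairing discussed here, comparing the two-sample t-test with its rank-based counterpart on skewed, made-up data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
# Skewed outcomes (e.g., hospital length of stay) in two groups.
group_a = rng.lognormal(mean=1.0, sigma=0.8, size=25)
group_b = rng.lognormal(mean=1.4, sigma=0.8, size=25)

t_stat, p_t = stats.ttest_ind(group_a, group_b)          # parametric comparison of means
u_stat, p_u = stats.mannwhitneyu(group_a, group_b,
                                 alternative="two-sided")  # rank-based alternative

print(f"t-test p = {p_t:.3f}, Mann-Whitney p = {p_u:.3f}")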

Journal ArticleDOI
TL;DR: This work proposes to retain the likelihood ratio test in combination with decision criteria that increase with sample size, and addresses the concern that structural equation models cannot necessarily be expected to provide an exact description of real-world phenomena.
Abstract: One of the most important issues in structural equation modeling concerns testing model fit. We propose to retain the likelihood ratio test in combination with decision criteria that increase with sample size. Specifically, rooted in Neyman–Pearson hypothesis testing, we advocate balancing α- and β-error risks. This strategy has a number of desirable consequences and addresses several objections that have been raised against the likelihood ratio test in model evaluation. First, balancing error risks avoids logical problems with Fisher-type hypothesis tests when predicting the null hypothesis (i.e., model fit). Second, both types of statistical decision errors are controlled. Third, larger samples are encouraged (rather than penalized) because both error risks diminish as the sample size increases. Finally, the strategy addresses the concern that structural equation models cannot necessarily be expected to provide an exact description of real-world phenomena.

Journal ArticleDOI
TL;DR: This paper presents and discusses the main procedures to estimate the size of an effect with respect to the specific statistical test used for hypothesis testing and can be seen as an introduction and a guide for the reader interested in the use of effect size estimation.
Abstract: The evidence-based medicine paradigm demands scientific reliability, but modern research seems to overlook it sometimes. Power analysis represents a way to show the meaningfulness of findings, regardless of the emphasis placed on statistical significance. Within this statistical framework, the estimation of the effect size represents a means to show the relevance of the evidence produced through research. In this regard, this paper presents and discusses the main procedures to estimate the size of an effect with respect to the specific statistical test used for hypothesis testing. Thus, this work can be seen as an introduction and a guide for the reader interested in the use of effect size estimation in their scientific endeavour.
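
As a concrete instance of the kind of quantity the paper catalogues, Cohen's d for a two-group comparison can be reported alongside the test itself; a minimal sketch with made-up data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
treated = rng.normal(10.5, 2.0, size=40)
control = rng.normal(9.5, 2.0, size=40)

t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d: mean difference divided by the pooled standard deviation.
n1, n2 = len(treated), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1))
                    / (n1 + n2 - 2))
d = (treated.mean() - control.mean()) / pooled_sd
print(f"p = {p_value:.3f}, Cohen's d = {d:.2f}")   # the p-value alone omits the magnitude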

Journal ArticleDOI
14 Sep 2016
TL;DR: This review highlights the important strategic role that stone artifact replication experiments must continue to play in further developing a scientific approach to archaeology, including their use as models in which information from empirically documented situations is used to generate predictions.
Abstract: For many years, intuition and common sense often guided the transference of patterning ostensibly evident in experimental flintknapping results to interpretations of the archaeological record, with little emphasis placed on hypothesis testing, experimental variables, experimental design, or statistical analysis of data. Today, archaeologists routinely take steps to address these issues. We build on these modern efforts by reviewing several important uses of replication experiments: (1) as a means of testing a question, hypothesis, or assumption about certain parameters of stone-tool technology; (2) as a model, in which information from empirically documented situations is used to generate predictions; and (3) as a means of validating analytical methods. This review highlights the important strategic role that stone artifact replication experiments must continue to play in further developing a scientific approach to archaeology.

Journal ArticleDOI
TL;DR: An informal introduction to the foundational ideas behind Bayesian data analysis, using a linear mixed models analysis of data from a typical psycholinguistics experiment, and some examples illustrating the flexibility of model specification in the Bayesian framework.
Abstract: We present the fundamental ideas underlying statistical hypothesis testing using the frequentist framework. We start with a simple example that builds up the one-sample t-test from the beginning, explaining important concepts such as the sampling distribution of the sample mean, and the iid assumption. Then, we examine the meaning of the p-value in detail and discuss several important misconceptions about what a p-value does and does not tell us. This leads to a discussion of Type I, II error and power, and Type S and M error. An important conclusion from this discussion is that one should aim to carry out appropriately powered studies. Next, we discuss two common issues that we have encountered in psycholinguistics and linguistics: running experiments until significance is reached and the ‘garden-of-forking-paths’ problem discussed by Gelman and others. The best way to use frequentist methods is to run appropriately powered studies, check model assumptions, clearly separate exploratory data analysis from planned comparisons decided upon before the study was run, and always attempt to replicate results.
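
The Type S (sign) and Type M (magnitude) errors mentioned here are easiest to grasp by simulating an underpowered design; a hedged sketch with illustrative numbers, following the general logic of design analysis rather than any code from the paper:

import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
true_effect, sigma, n = 0.2, 1.0, 20            # small effect, small sample: low power
se = sigma / np.sqrt(n)

est = rng.normal(true_effect, se, size=100000)  # sampling distribution of the estimate
p = 2 * stats.norm.sf(np.abs(est) / se)         # two-sided z-test p-values
sig = p < 0.05

print("power:", sig.mean().round(3))
print("Type S error rate (wrong sign among significant results):",
      (est[sig] < 0).mean().round(3))
print("Type M exaggeration factor (|estimate| / true effect):",
      (np.abs(est[sig]).mean() / true_effect).round(2))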

Posted Content
TL;DR: In this article, the authors proposed a classifier two-sample test (C2ST) to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks (GANs).
Abstract: The goal of two-sample tests is to assess whether two samples, $S_P \sim P^n$ and $S_Q \sim Q^m$, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary classifiers. In particular, construct a dataset by pairing the $n$ examples in $S_P$ with a positive label, and by pairing the $m$ examples in $S_Q$ with a negative label. If the null hypothesis "$P = Q$" is true, then the classification accuracy of a binary classifier on a held-out subset of this dataset should remain near chance-level. As we will show, such Classifier Two-Sample Tests (C2ST) learn a suitable representation of the data on the fly, return test statistics in interpretable units, have a simple null distribution, and their predictive uncertainty allow to interpret where $P$ and $Q$ differ. The goal of this paper is to establish the properties, performance, and uses of C2ST. First, we analyze their main theoretical properties. Second, we compare their performance against a variety of state-of-the-art alternatives. Third, we propose their use to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks (GANs). Fourth, we showcase the novel application of GANs together with C2ST for causal discovery.

Proceedings ArticleDOI
13 Aug 2016
TL;DR: The Incremental Kolmogorov-Smirnov algorithm maintains the test over samples that change one observation at a time in O(log N) per update, a significant speed-up compared to the O(N log N) cost of the non-incremental implementation, and is used to detect concept drifts without true labels.
Abstract: Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing efficient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for efficient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.