Journal Article•DOI•

Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn

TL;DR: The authors argue that permutation should be viewed as generating an exact discrete null distribution rather than as a way of estimating the tail probability of the test statistic, and they develop a strategy for computing exact p-values when permutations are randomly drawn.
Abstract: Permutation tests are amongst the most commonly used statistical tools in modern genomic research, a process by which p-values are attached to a test statistic by randomly permuting the sample or gene labels. Yet permutation p-values published in the genomic literature are often computed incorrectly, understated by about 1/m, where m is the number of permutations. The same is often true in the more general situation when Monte Carlo simulation is used to assign p-values. Although the p-value understatement is usually small in absolute terms, the implications can be serious in a multiple testing context. The understatement arises from the intuitive but mistaken idea of using permutation to estimate the tail probability of the test statistic. We argue instead that permutation should be viewed as generating an exact discrete null distribution. The relevant literature, some of which is likely to have been relatively inaccessible to the genomic community, is reviewed and summarized. A computation strategy is developed for exact p-values when permutations are randomly drawn. The strategy is valid for any number of permutations and samples. Some simple recommendations are made for the implementation of permutation tests in practice.
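As a rough illustration of the understatement described in the abstract, the sketch below compares the naive estimate b/m with the commonly recommended (b + 1)/(m + 1) form, where b is the number of randomly drawn permutations whose statistic is at least as extreme as the observed one. The function name, the test statistic (absolute difference in means) and the toy data are assumptions made for illustration; this is not the exact computation strategy developed in the paper, only a minimal sketch of why a permutation p-value should never be zero.

```python
import random


def permutation_pvalues(x, y, n_perm=999, seed=0):
    """Two-sample permutation test on the absolute difference in means.

    Returns (b, naive, corrected), where b counts the randomly drawn
    permutations whose statistic is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)

    def stat(values):
        # Split the pooled values back into two groups of the original sizes.
        first, second = values[:n_x], values[n_x:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = stat(pooled)
    shuffled = pooled[:]  # work on a copy so the observed labelling is kept
    b = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point ties count as "at least as extreme".
        if stat(shuffled) >= observed - 1e-12:
            b += 1

    naive = b / n_perm                  # can be exactly zero
    corrected = (b + 1) / (n_perm + 1)  # never below 1 / (n_perm + 1)
    return b, naive, corrected


# Hypothetical toy data with a large group difference.
x = [10.1, 11.3, 12.2, 13.0, 12.7]
y = [1.0, 2.1, 0.5, 1.7, 2.4]
b, naive, corrected = permutation_pvalues(x, y, n_perm=99, seed=1)
print(b, naive, corrected)
```

The two estimates differ by (m - b)/(m(m + 1)), which is close to 1/m when b is small, in line with the size of the understatement quoted in the abstract.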


Citations
Journal Article•DOI•
TL;DR: This paper presents a generic framework for permutation inference for complex general linear models (GLMs) when the errors are exchangeable and/or have a symmetric distribution, and shows that, even in the presence of nuisance effects, these permutation inferences are powerful while providing excellent control of false positives in a wide range of common and relevant imaging research scenarios.

2,756 citations

Posted Content•DOI•
20 Jun 2016-bioRxiv
TL;DR: The FGSEA method is presented, which estimates arbitrarily low GSEA P-values with higher accuracy and much faster than other implementations, together with a polynomial algorithm that calculates GSEA P-values exactly.
Abstract: Preranked gene set enrichment analysis (GSEA) is a widely used method for interpretation of gene expression data in terms of biological processes. Here we present the FGSEA method, which is able to estimate arbitrarily low GSEA P-values with higher accuracy and much faster than other implementations. We also present a polynomial algorithm to calculate GSEA P-values exactly, which we use to practically confirm the accuracy of the method.

1,433 citations

Posted Content•DOI•
20 Jun 2016-bioRxiv
TL;DR: It is shown that hundreds of thousands of permutations can be made in a few minutes, which leads to very accurate p-values and allows standard FDR correction procedures to be applied, which are more accurate than the ones currently used.
Abstract: Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow because a large number of samples is required for the analysis to have good statistical power. In this paper we present a novel algorithm that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used. The method is implemented in the form of an R package and is freely available at https://github.com/ctlab/fgsea.

1,221 citations

Journal Article•DOI•
TL;DR: This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project, which covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment.
Abstract: Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses were demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This will provide a range of usage scenarios from which readers can construct their own analysis pipelines.

1,128 citations

Journal Article•DOI•
TL;DR: Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms and is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression.
Abstract: Competitive gene set tests are commonly used in molecular pathway analysis to test for enrichment of a particular gene annotation category amongst the differential expression results from a microarray experiment. Existing gene set tests that rely on gene permutation are shown here to be extremely sensitive to inter-gene correlation. Several data sets are analyzed to show that inter-gene correlation is non-ignorable even for experiments on homogeneous cell populations using genetically identical model organisms. A new gene set test procedure (CAMERA) is proposed based on the idea of estimating the inter-gene correlation from the data, and using it to adjust the gene set test statistic. An efficient procedure is developed for estimating the inter-gene correlation and characterizing its precision. CAMERA is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression. Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms. CAMERA can be used to analyze specified sets or as a pathway analysis tool using a database of molecular signatures.

651 citations

References
Book•
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, and gives Minitab macros for implementing them.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations

Journal Article•DOI•
TL;DR: This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, in which hypotheses are rejected one at a time until no further rejections can be done.
Abstract: This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, i.e. hypotheses are rejected one at a time until no further rejections can be done. It is shown that the test has a prescribed level of significance protection against error of the first kind for any combination of true hypotheses. The power properties of the test and a number of possible applications are also discussed.

20,459 citations
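
The sequentially rejective idea summarized in the entry above can be sketched as a step-down loop over sorted p-values. The thresholds alpha / (n - rank) used here are those of the Holm step-down procedure, which is an assumption about which procedure the entry describes; the function name `holm_reject` and the example p-values are hypothetical.

```python
def holm_reject(pvalues, alpha=0.05):
    """Step-down multiple testing: reject hypotheses one at a time.

    P-values are visited from smallest to largest. The hypothesis at
    0-based rank r is rejected if its p-value is <= alpha / (n - r);
    the procedure stops at the first p-value that fails its threshold.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    reject = [False] * n
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (n - rank):
            reject[idx] = True
        else:
            break  # no further rejections can be made
    return reject


# Hypothetical p-values, returned in the original order of the hypotheses.
print(holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05))
```

A p-value reported as exactly zero passes every threshold in such a procedure, which is one way to see why a small understatement of permutation p-values can matter in a multiple testing context.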

Book•
23 Jul 2020
TL;DR: This book covers randomization tests, the jackknife, the bootstrap and Monte Carlo methods, with applications in biology such as single species ecology and community ecology.
Abstract: Preface to the Second Edition Preface to the First Edition
Randomization The Idea of a Randomization Test Examples of Randomization Tests Aspects of Randomization Testing Raised by the Examples Sampling the Randomization Distribution or Systematic Enumeration Equivalent Test Statistics Significance Levels for Classical and Randomization Tests Limitations of Randomization Tests Confidence Limits by Randomization Applications of Randomization in Biology Single Species Ecology Genetics, Evolution and Natural Selection Community Ecology Randomization and Observational Studies Chapter Summary
The Jackknife The Jackknife Estimator Applications of Jackknifing in Biology Single Species Analyses Genetics, Evolution and Natural Selection Community Ecology Chapter Summary
The Bootstrap Resampling with Replacement Standard Bootstrap Confidence Limits Simple Percentile Confidence Limits Bias Corrected Percentile Confidence Limits Accelerated Bias Corrected Percentile Limits Other Methods for Constructing Confidence Intervals Transformations to Improve Bootstrap Intervals Parametric Confidence Intervals A Better Estimate of Bias Bootstrap Tests of Significance Balanced Bootstrap Sampling Applications of Bootstrapping in Biology Single Species Ecology Genetics, Evolution and Natural Selection Community Ecology Further Reading Chapter Summary
Monte Carlo Methods Monte Carlo Tests Generalized Monte Carlo Tests Implicit Statistical Models Applications of Monte Carlo Methods in Biology Single Species Ecology Chapter Summary
Some General Considerations Questions about Computer-Intensive Methods Power Number of Random Sets of Data Needed for a Test Determining a Randomization Distribution Exactly The number of replications for confidence intervals More Efficient Bootstrap Sampling Methods The Generation of Pseudo-Random Numbers The Generation of Random Permutations Chapter Summary
One and Two Sample Tests The Paired Comparisons Design The One Sample Randomization Test The Two Sample Randomization Test Bootstrap Tests Randomizing Residuals Comparing the Variation in Two Samples A Simulation Study The Comparison of Two Samples on Multiple Measurements Further Reading Chapter Summary Exercises
Analysis of Variance One Factor Analysis of Variance Tests for Constant Variance Testing for Mean Differences Using Residuals Examples of More Complicated Types of Analysis of Variance Procedures for Handling Unequal Group Variances Other Aspects of Analysis of Variance Further Reading Chapter Summary Exercises
Regression Analysis Simple Linear Regression Randomizing Residuals Testing for a Non-Zero B Value Confidence Limits for B Multiple Linear Regression Alternative Randomization Methods with Multiple Regression Bootstrapping and Jackknifing with Regression Further Reading Chapter Summary Exercises
Distance Matrices and Spatial Data Testing for Association between Distance Matrices The Mantel Test Sampling the Randomization Distribution Confidence Limits for Regression Coefficients The Multiple Mantel Test Other Approaches with More than Two Matrices Further Reading Chapter Summary Exercises
Other Analyses on Spatial Data Spatial Data Analysis The Study of Spatial Point Patterns Mead's Randomization Test Tests for Randomness Based on Distances Testing for an Association between Two Point Patterns The Besag-Diggle Test Tests Using Distances between Points Testing for Random Marking Further Reading Chapter Summary Exercises
Time Series Randomization and Time Series Randomization Tests for Serial Correlation Randomization Tests for Trend Randomization Tests for Periodicity Irregularly Spaced Series Tests on Times of Occurrence Discussion on Procedures for Irregular Series Bootstrap and Monte Carlo Tests Further Reading Chapter Summary Exercises
Multivariate Data Univariate and Multivariate Tests Sample Means and Covariance Matrices Comparison of Sample Mean Vectors Chi-Squared Analyses for Count Data Principal Component Analysis and Other One Sample Methods Discriminant Function Analysis Further Reading Chapter Summary Exercises
Survival and Growth Data Bootstrapping Survival Data Bootstrapping for Variable Selection Bootstrapping for Model Selection Group Comparisons Growth Data Further Reading Chapter Summary Exercises
Non-Standard Situations The Construction of Tests in Non-Standard Situations Species Co-Occurrences on Islands An Alternative Generalized Monte Carlo Test Examining Time Changes in Niche Overlap Probing Multivariate Data with Random Skewers Ant Species Sizes in Europe Chapter Summary
Bayesian Methods The Bayesian Approach to Data Analysis The Gibbs Sampler and Related Methods Biological Applications Further Reading Chapter Summary Exercises
Conclusion and Final Comments Randomization Bootstrapping Monte Carlo Methods in General Classical versus Bayesian Inference
Appendix Software for Computer Intensive Statistics References Index

4,706 citations

Book•
01 Jan 1935

4,510 citations