
Showing papers on "False discovery rate" published in 2006


Journal ArticleDOI
TL;DR: In this article, a two-stage adaptive procedure is proposed to control the false discovery rate at the desired level q, and this framework enables analytical study of the properties of other procedures that exist in the literature.
Abstract: We provide a new two-stage procedure in which the linear step-up procedure is used in stage one to estimate m0, the number of true null hypotheses, providing a new level q' which is used in the linear step-up procedure in the second stage. We prove that a general form of the two-stage procedure controls the false discovery rate at the desired level q. This framework enables us to study analytically the properties of other procedures that exist in the literature. A simulation study is presented that shows that two-stage adaptive procedures improve in power over the original procedure, mainly because they provide tighter control of the false discovery rate. We further study the performance of the current suggestions, some variations of the procedures, and previous suggestions, in the case where the test statistics are positively dependent, a case for which the original procedure controls the false discovery rate. In the setting studied here the newly proposed two-stage procedure is the only one that controls the false discovery rate. The procedures are illustrated with two examples of biological importance.

2,319 citations
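To make the two-stage idea concrete, here is a minimal Python sketch (not the authors' code) of an adaptive step-up of the kind described: stage one runs the linear step-up (BH) procedure at q' = q/(1+q) and uses its rejection count to estimate m0, and stage two reruns the step-up at the adjusted level q'·m/m0. The q' = q/(1+q) choice and the function names are assumptions for illustration.

```python
import numpy as np

def bh_stepup(pvals, q):
    """Linear step-up (Benjamini-Hochberg): return the number of rejections."""
    m = len(pvals)
    ranked = np.sort(pvals)
    below = ranked <= q * np.arange(1, m + 1) / m
    return int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0

def two_stage_adaptive(pvals, q=0.05):
    """Sketch of a two-stage adaptive step-up procedure at target level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    q1 = q / (1.0 + q)                     # stage-one level q'
    r1 = bh_stepup(pvals, q1)              # stage-one rejections
    if r1 == 0:
        return np.zeros(m, dtype=bool)     # no discoveries
    if r1 == m:
        return np.ones(m, dtype=bool)      # everything rejected
    m0_hat = m - r1                        # estimated number of true nulls
    k = bh_stepup(pvals, q1 * m / m0_hat)  # stage-two step-up at adjusted level
    cutoff = np.sort(pvals)[k - 1] if k > 0 else -1.0
    return pvals <= cutoff
```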


Journal ArticleDOI
TL;DR: I compared pairwise significance from three published studies using three critical values corresponding to Bonferroni, FDR, and modified FDR methods and suggest that the modified FDR method may provide the most biologically important critical value for evaluating significance of population differentiation in conservation genetics.
Abstract: Studies in conservation genetics often attempt to determine genetic differentiation between two or more temporally or geographically distinct sample collections. Pairwise p-values from Fisher’s exact tests or contingency Chi-square tests are commonly reported with a Bonferroni correction for multiple tests. While the Bonferroni correction controls the experiment-wise α, this correction is very conservative and results in greatly diminished power to detect differentiation among pairs of sample collections. An alternative is to control the false discovery rate (FDR) that provides increased power, but this method only maintains experiment-wise α when none of the pairwise comparisons are significant. Recent modifications to the FDR method provide a moderate approach to determining significance level. Simulations reveal that critical values of multiple comparison tests with both the Bonferroni method and a modified FDR method approach a minimum asymptote very near zero as the number of tests gets large, but the Bonferroni method approaches zero much more rapidly than the modified FDR method. I compared pairwise significance from three published studies using three critical values corresponding to Bonferroni, FDR, and modified FDR methods. Results suggest that the modified FDR method may provide the most biologically important critical value for evaluating significance of population differentiation in conservation genetics. Ultimately, more thorough reporting of statistical significance is needed to allow interpretation of biological significance of genetic differentiation among populations.

792 citations
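As a small illustration of how the per-test critical values diverge as the number of pairwise tests grows, the sketch below (mine, not the paper's) compares Bonferroni with the step-up constants of the Benjamini–Hochberg FDR procedure and of the Benjamini–Yekutieli correction, the latter taken here as a stand-in for the "modified FDR" method; that identification is an assumption.

```python
import numpy as np

def critical_values(k, alpha=0.05):
    """Per-test critical constants for k tests under three correction rules."""
    i = np.arange(1, k + 1)
    bonferroni = np.full(k, alpha / k)        # one fixed level for every test
    bh = i * alpha / k                        # Benjamini-Hochberg step-up constants
    by = i * alpha / (k * np.sum(1.0 / i))    # Benjamini-Yekutieli constants
    return bonferroni, bh, by

# smallest (most stringent) constant for, e.g., all pairwise tests among
# 5, 10 and 20 sample collections
for k in (10, 45, 190):
    bon, bh, by = critical_values(k)
    print(f"k={k:3d}  Bonferroni={bon[0]:.5f}  BH={bh[0]:.5f}  modified={by[0]:.5f}")
```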


Journal ArticleDOI
TL;DR: This work provides a new perspective on a class of model selection rules which has been introduced recently by several authors, and exhibits a close connection with FDR-controlling procedures under stringent control of the false discovery rate.
Abstract: We attempt to recover an n-dimensional vector observed in white noise, where n is large and the vector is known to be sparse, but the degree of sparsity is unknown. We consider three different ways of defining sparsity of a vector: using the fraction of nonzero terms; imposing power-law decay bounds on the ordered entries; and controlling the ℓ_p norm for p small. We obtain a procedure which is asymptotically minimax for ℓ_r loss, simultaneously throughout a range of such sparsity classes. The optimal procedure is a data-adaptive thresholding scheme, driven by control of the False Discovery Rate (FDR). FDR control is a relatively recent innovation in simultaneous testing, ensuring that at most a certain fraction of the rejected null hypotheses will correspond to false rejections. In our treatment, the FDR control parameter q_n also plays a determining role in asymptotic minimaxity. If q = lim q_n ∈ [0, 1/2] and also q_n > γ/log(n), we get sharp asymptotic minimaxity, simultaneously, over a wide range of sparse parameter spaces and loss functions. On the other hand, q = lim q_n ∈ (1/2, 1] forces the risk to exceed the minimax risk by a factor growing with q. To our knowledge, this relation between ideas in simultaneous inference and asymptotic decision theory is new. Our work provides a new perspective on a class of model selection rules which has been introduced recently by several authors. These new rules impose complexity penalization of the form 2·log(potential model size / actual model size). We exhibit a close connection with FDR-controlling procedures under stringent control of the false discovery rate.

456 citations
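The data-adaptive thresholding scheme can be sketched in a few lines; the version below is my own simplification with a known noise level σ and should not be read as the paper's exact rule.

```python
import numpy as np
from scipy.stats import norm

def fdr_hard_threshold(y, q=0.1, sigma=1.0):
    """Hard-threshold a noisy vector at a level chosen by an FDR criterion.

    Compare the k-th largest |y| with t_k = sigma * z(q*k/(2n)); threshold at
    the boundary value of the last index where the ordered magnitudes exceed it.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    abs_sorted = np.sort(np.abs(y))[::-1]        # |y| in decreasing order
    k = np.arange(1, n + 1)
    t = sigma * norm.isf(q * k / (2.0 * n))      # FDR comparison boundary
    crossings = np.nonzero(abs_sorted >= t)[0]
    if crossings.size == 0:
        return np.zeros_like(y)                  # estimate: all coordinates zero
    thresh = t[crossings.max()]
    return np.where(np.abs(y) >= thresh, y, 0.0)

# toy use: a sparse mean vector observed in unit-variance white noise
rng = np.random.default_rng(0)
mu = np.zeros(1000)
mu[:20] = 4.0
print(np.count_nonzero(fdr_hard_threshold(mu + rng.standard_normal(1000), q=0.1)))
```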


Journal Article
TL;DR: This manuscript presents a comprehensive set of tools for the computation of 3D structural statistical shape analysis, which has been applied in several studies on brain morphometry, but can potentially be employed in other 3D shape problems.
Abstract: Shape analysis has become of increasing interest to the neuroimaging community due to its potential to precisely locate morphological changes between healthy and pathological structures. This manuscript presents a comprehensive set of tools for the computation of 3D structural statistical shape analysis. It has been applied in several studies on brain morphometry, but can potentially be employed in other 3D shape problems. Its main limitation is the necessity of spherical topology. The input of the proposed shape analysis is a set of binary segmentations of a single brain structure, such as the hippocampus or caudate. These segmentations are converted into a corresponding spherical harmonic description (SPHARM), which is then sampled into a triangulated surface (SPHARM-PDM). After alignment, differences between groups of surfaces are computed using the Hotelling T(2) two-sample metric. Statistical p-values, both raw and corrected for multiple comparisons, result in significance maps. Additional visualizations of the group tests are provided via mean difference magnitude and vector maps, as well as maps of the group covariance information. The correction for multiple comparisons is performed via two separate methods that each have a distinct view of the problem. The first aims to control the family-wise error rate (FWER), or false positives, via the extrema histogram of non-parametric permutations. The second method controls the false discovery rate and results in a less conservative estimate of the false negatives.

445 citations


Journal ArticleDOI
TL;DR: In this article, the false discovery rate (FDR), a criterion proposed in 1995 to control the proportion of false declarations of significance among those individual deviations from null hypotheses considered significant, is applied to assess the significance of local statistics of spatial association.
Abstract: Assessing the significance of multiple and dependent comparisons is an important, and often ignored, issue that becomes more critical as the size of data sets increases. If not accounted for, false-positive differences are very likely to be identified. The need to address this issue has led to the development of a myriad of procedures to account for multiple testing. The simplest and most widely used technique is the Bonferroni method, which controls the probability that a true null hypothesis is incorrectly rejected. However, it is a very conservative procedure. As a result, the larger the data set the greater the chances that truly significant differences will be missed. In 1995, a new criterion, the false discovery rate (FDR), was proposed to control the proportion of false declarations of significance among those individual deviations from null hypotheses considered to be significant. It is more powerful than all previously proposed methods. Multiple and dependent comparisons are also fundamental in spatial analysis. As the number of locations increases, assessing the significance of local statistics of spatial association becomes a complex matter. In this article we use empirical and simulated data to evaluate the use of the FDR approach in appraising the occurrence of clusters detected by local indicators of spatial association. Results show a significant gain in identification of meaningful clusters when controlling the FDR, in comparison to more conservative approaches. When no control is adopted, false clusters are likely to be identified. If a conservative approach is used, clusters are only partially identified and true clusters are largely missed. In contrast, when the FDR approach is adopted, clusters are fully identified. Incorporating a correction for spatial dependence to conservative methods improves the results, but not enough to match those obtained by the FDR approach.

378 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a method for multiple hypothesis testing that maintains control of the false discovery rate while incorporating prior information about the hypotheses, which takes the form of p-value weights.
Abstract: We present a method for multiple hypothesis testing that maintains control of the False Discovery Rate while incorporating prior information about the hypotheses. The prior information takes the form of p-value weights. If the assignment of weights is positively associated with the null hypotheses being false, the procedure improves power, except in cases where power is already near one. Even if the assignment of weights is poor, power is only reduced slightly, as long as the weights are not too large. We also provide a similar method to control False Discovery Exceedance.

343 citations
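A common way to implement p-value weighting of this kind is to rescale the weights to average one and apply the linear step-up procedure to p_i / w_i; the sketch below follows that reading and is not the authors' code.

```python
import numpy as np

def weighted_bh(pvals, weights, q=0.05):
    """Weighted linear step-up: BH applied to p_i / w_i with mean-one weights."""
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w * len(w) / w.sum()           # rescale so the weights average to one
    pw = p / w                         # up-weighted hypotheses get smaller pw
    m = len(pw)
    order = np.argsort(pw)
    below = pw[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]) + 1)
        reject[order[:k]] = True       # reject the k hypotheses with smallest pw
    return reject
```

With uniform weights this reduces to the ordinary step-up procedure, which is consistent with the abstract's point that poorly chosen but moderate weights cost little power.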


Journal ArticleDOI
TL;DR: The FDR-based procedure controls the expected proportion of erroneously rejected hypotheses among the rejected hypotheses, offering a more objective, powerful, and consistent measure of Type I error than Bonferroni correction and maintaining a better balance between power and specificity.

315 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present and evaluate a simple method that uses the detection (Present/Absent) call generated by the Affymetrix Microarray Suite version 5 software (MAS5) to remove data that is not reliably detected before further analysis, and compare it with filtering by expression level.
Abstract: Affymetrix GeneChips® are widely used for expression profiling of tens of thousands of genes. The large number of comparisons can lead to false positives. Various methods have been used to reduce false positives, but they have rarely been compared or quantitatively evaluated. Here we describe and evaluate a simple method that uses the detection (Present/Absent) call generated by the Affymetrix microarray suite version 5 software (MAS5) to remove data that is not reliably detected before further analysis, and compare this with filtering by expression level. We explore the effects of various thresholds for removing data in experiments of different size (from 3 to 10 arrays per treatment), as well as their relative power to detect significant differences in expression. Our approach sets a threshold for the fraction of arrays called Present in at least one treatment group. This method removes a large percentage of probe sets called Absent before carrying out the comparisons, while retaining most of the probe sets called Present. It preferentially retains the more significant probe sets (p ≤ 0.001) and those probe sets that are turned on or off, and improves the false discovery rate. Permutations to estimate false positives indicate that probe sets removed by the filter contribute a disproportionate number of false positives. Filtering by fraction Present is effective when applied to data generated either by the MAS5 algorithm or by other probe-level algorithms, for example RMA (robust multichip average). Experiment size greatly affects the ability to reproducibly detect significant differences, and also impacts the effect of filtering; smaller experiments (3–5 samples per treatment group) benefit from more restrictive filtering (≥50% Present). Use of a threshold fraction of Present detection calls (derived by MAS5) provided a simple method that effectively eliminated from analysis probe sets that are unlikely to be reliable while preserving the most significant probe sets and those turned on or off; it thereby increased the ratio of true positives to false positives.

267 citations
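A sketch of the detection-call filter described above, with an assumed data layout (one row per probe set, one column per array) and the 50% cutoff used only as an example:

```python
import numpy as np

def present_fraction_filter(calls_a, calls_b, min_fraction=0.5):
    """Keep probe sets whose fraction of 'P' (Present) MAS5 calls reaches the
    threshold in at least one of the two treatment groups.

    calls_a, calls_b: arrays of detection calls ('P', 'M', 'A') with shape
    (n_probesets, n_arrays_in_group).
    """
    frac_a = np.mean(np.asarray(calls_a) == "P", axis=1)
    frac_b = np.mean(np.asarray(calls_b) == "P", axis=1)
    return (frac_a >= min_fraction) | (frac_b >= min_fraction)

# usage sketch: filter first, then test only the retained probe sets and apply
# an FDR correction to that smaller family of hypotheses
# keep = present_fraction_filter(calls_ctrl, calls_trt)   # hypothetical inputs
# pvals = test_expression(expr[keep])                     # hypothetical test step
```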


Journal ArticleDOI
TL;DR: The calculated effect sizes may be further used in simple analyses that can help to estimate the true effect of a predictor variable and thus make general conclusions, and the omission of nonsignificant results from publications is undesirable.
Abstract: Studies in behavioral ecology often investigate several traits and then apply multiple statistical tests to discover their pairwise associations. Traditionally, such approaches require the adjustment of individual significance levels because as more statistical tests are performed the greater the likelihood that Type I errors are committed (i.e., rejecting H0 when it is true) (Rice 1989). Bonferroni correction that lowers the critical P values for each particular test based on the number of tests to be performed is frequently used to reduce problems associated with multiple comparisons (Cabin and Mitchell 2000). However, this procedure dramatically increases the risk of committing Type II errors as it results in a high risk of not rejecting a H0 when it is false. To reach 80% statistical power, it is necessary to have huge sample sizes to detect medium (r = 0.3 or d = 0.5; sensu Cohen 1988) or small (r = 0.1 or d = 0.2; sensu Cohen 1988) strength effects (e.g., say N = 128 or N = 788, respectively, for a 2-sample t-test), but sample size is often limited when studying behavior. The strict application of Bonferroni correction in the field of ecology and behavioral ecology has therefore been criticized for mathematical and logical reasons (Wright 1992; Benjamini and Hochberg 1995; Perneger 1998; Moran 2003; Nakagawa 2004). As a potential solution, Wright (1992) and Chandler (1995) advocated that the sacrificial loss of power can be avoided by choosing an experimentwise error rate higher than the usually accepted 5%, which results in a balance between different types of errors. As another alternative, the researcher might be more interested in controlling the proportion of erroneously rejected null hypotheses, the so-called false discovery rate, than in controlling for familywise error rate (Benjamini and Hochberg, 1995). Although this approach allows for increased power in large series of repeated tests, it is rarely applied in ecological studies (Garcia 2003, 2004). Recently, Nakagawa (2004) suggested reporting effect sizes together with confidence intervals (CIs) for all potential relationships to allow the readers to judge the biological importance of the results and to reduce publication bias. Due to the low power of the tests, the majority of investigated relationships are expected to be nonsignificant, which is thought to make publication difficult. Such difficulty is generally assumed to cause behavioral ecologists to selectively report data (Moran 2003; Nakagawa 2004). The omission of nonsignificant results from publications is undesirable for both scientific and ethical reasons, which makes Bonferroni adjustment problematic. It is noteworthy that direct tests comparing effect sizes of representative samples of published and unpublished studies showed no evidence of publication bias in the biological literature (Koricheva 2003; Moller et al. 2005). However, independent of publication bias, conclusions drawn from effect sizes and the associated CIs should be encouraged. Such an approach considers the magnitude of an effect on a continuous scale, whereas conventional hypothesis testing based on significance levels tends to treat biological questions as all-or-nothing effects depending on whether P values exceed the critical limit or not (Chow 1988; Wilkinson and Task Force Stat Inference 1999; Thompson 2002).
Hence, using the same data, the former approach may reveal that a particular effect is small but still biologically important, whereas the latter approach may lead the investigator to conclude that the hypothesized phenomenon does not exist in nature. Although such philosophical differences may dramatically influence our knowledge, presenting standardized effect sizes is still uncommon in ecology and evolution (Nakagawa 2004). Here, I suggest that, in addition to their presentation, the calculated effect sizes may be further used in simple analyses that can help to estimate the true effect of a predictor variable and thus support general conclusions. These analytical tools rely on the fact that the strength and direction of relationships, as reflected by standardized measures of effect size (Pearson’s r, Cohen’s d, or Hedges’ g), are comparable and independent of the scale on which the variables were measured (e.g., Hedges and Olkin 1985; Cohen 1988; Rosenthal 1991). Thus, if multiple traits are measured and multiple correlations are calculated, the corresponding effect sizes tabulated among the variables measured will have a certain statistical distribution with measurable attributes. Below, I present 4 simple analyses to demonstrate how such statistical attributes can be used to make general interpretations. I will confine myself to a typical sampling design from behavioral ecology in which the experimenter is interested in explaining variation in certain traits (response variables) in the light of other (predictor) variables. Specific sampling designs can be tailored to the biological question at hand, as will be illustrated using real data on the collared flycatcher, Ficedula albicollis, from Garamszegi et al. (2004). I will also discuss the confounding effect of collinearity between variables, which may violate the assumption of statistical independence, and the potentially low power of the suggested tests.

220 citations


Journal ArticleDOI
TL;DR: The multiplicity problem is reviewed, the advantage of the FDR approach is illustrated, and the Benjamini–Hochberg methods for controlling the false discovery rate are promoted for widespread adoption in ecology.
Abstract: Ecologists routinely use Bonferroni-based methods to control the alpha inflation associated with multiple hypothesis testing, despite the aggravating loss of power incurred. Some critics call for abandonment of this approach of controlling the familywise error rate (FWER), contending that too many unwary researchers have adopted it in the name of scientific rigour even though it often does more harm than good. We do not recommend rejecting multiplicity correction altogether. Instead, we recommend using an alternative approach. In particular, we advocate the Benjamini–Hochberg and related methods for controlling the false discovery rate (FDR). Unlike the FWER approach, which safeguards against falsely rejecting even a single null hypothesis, the FDR approach controls the rate at which null hypotheses are falsely rejected (i.e., false discoveries are made). The FDR approach represents a compromise between outright refusal to control for multiplicity, which maximizes alpha inflation, and strict adhe...

156 citations
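For reference, a minimal sketch of the Benjamini–Hochberg procedure in the form it is usually applied: convert raw p-values into BH-adjusted p-values and reject every hypothesis whose adjusted value falls below the chosen FDR level. This is intended to match the familiar step-up rule (largest i with p_(i) ≤ iq/m), not any particular software implementation.

```python
import numpy as np

def bh_adjusted_pvalues(pvals):
    """BH-adjusted p-values: rejecting those <= q controls the FDR at level q
    for independent or positively dependent tests."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                                  # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)            # m * p_(i) / i
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

# usage: reject = bh_adjusted_pvalues(pvals) <= 0.10   # control FDR at 10%
```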


Journal ArticleDOI
TL;DR: All group testing methods identified pathways already described as involved in the pathogenesis of prostate cancer, and pathways recurrently identified across these analyses are more likely to be reliable than those from a single analysis of a single dataset.
Abstract: Motivation: The wide use of DNA microarrays for the investigation of the cell transcriptome triggered the invention of numerous methods for the processing of microarray data and led to a growing number of microarray studies that examine the same biological conditions. However, comparisons made on the level of gene lists obtained by different statistical methods or from different datasets hardly converge. We aimed at examining such discrepancies on the level of apparently affected biologically related groups of genes, e.g. metabolic or signalling pathways. This can be achieved by group testing procedures, e.g. over-representation analysis, functional class scoring (FCS), or global tests. Results: Three public prostate cancer datasets obtained with the same microarray platform (HGU95A/HGU95Av2) were analyzed. Each dataset was subjected to normalization by either variance stabilizing normalization (vsn) or mixed model normalization (MMN). Then, statistical analysis of microarrays was applied to the vsn-normalized data and mixed model analysis to the data normalized by MMN. For multiple testing adjustment the false discovery rate was calculated and the threshold was set to 0.05. Gene lists from the same method applied to different datasets showed overlaps between 42 and 52%, while lists from different methods applied to the same dataset had between 63 and 85% of genes in common. The six gene lists obtained by the two statistical methods applied to the three datasets were then subjected to group testing by Fisher's exact test. Group testing by GSEA and global test was applied to the three datasets, as well. Fisher's exact test followed by global test showed more consistent results with respect to the concordance between analyses on gene lists obtained by different methods and different datasets than the GSEA. However, all group testing methods identified pathways that had already been described to be involved in the pathogenesis of prostate cancer. Moreover, pathways recurrently identified in these analyses are more likely to be reliable than those from a single analysis on a single dataset. Contact: b.brors@dkfz.de Supplementary Information: Supplementary Figure 1 and Supplementary Tables 1--4 are available at Bioinformatics online.
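The over-representation step mentioned above can be sketched as a one-sided Fisher's exact test on a 2×2 table per pathway; the function below is an illustration, not the workflow used in the paper.

```python
from scipy.stats import fisher_exact

def pathway_overrepresentation(gene_list, pathway_genes, universe):
    """One-sided Fisher's exact test for over-representation of one pathway
    within a list of differentially expressed genes."""
    gene_list, pathway_genes, universe = map(set, (gene_list, pathway_genes, universe))
    a = len(gene_list & pathway_genes)                  # in list, in pathway
    b = len(gene_list - pathway_genes)                  # in list, not in pathway
    c = len((universe - gene_list) & pathway_genes)     # not in list, in pathway
    d = len(universe - gene_list - pathway_genes)       # in neither
    _, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p_value
```

The per-pathway p-values produced this way form a family of tests themselves, so in practice they are again adjusted, for example by an FDR-controlling procedure.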

Journal ArticleDOI
TL;DR: This work investigates the performance of a stratified false discovery control approach and shows that controlling FDR at a low rate, e.g. 5% or 10%, may not be feasible for some GWA studies.
Abstract: The multiplicity problem has become increasingly important in genetic studies as the capacity for high-throughput genotyping has increased. The control of False Discovery Rate (FDR) (Benjamini and Hochberg [1995] J. R. Stat. Soc. Ser. B 57:289-300) has been adopted to address the problems of false positive control and low power inherent in high-volume genome-wide linkage and association studies. In many genetic studies, there is often a natural stratification of the m hypotheses to be tested. Given the FDR framework and the presence of such stratification, we investigate the performance of a stratified false discovery control approach (i.e. control or estimate FDR separately for each stratum) and compare it to the aggregated method (i.e. consider all hypotheses in a single stratum). Under the fixed rejection region framework (i.e. reject all hypotheses with unadjusted p-values less than a pre-specified level and then estimate FDR), we demonstrate that the aggregated FDR is a weighted average of the stratum-specific FDRs. Under the fixed FDR framework (i.e. reject as many hypotheses as possible and meanwhile control FDR at a pre-specified level), we specify a condition necessary for the expected total number of true positives under the stratified FDR method to be equal to or greater than that obtained from the aggregated FDR method. Application to a recent Genome-Wide Association (GWA) study by Maraganore et al. ([2005] Am. J. Hum. Genet. 77:685-693) illustrates the potential advantages of control or estimation of FDR by stratum. Our analyses also show that controlling FDR at a low rate, e.g. 5% or 10%, may not be feasible for some GWA studies.
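A sketch of the stratified idea under the fixed-FDR framework: run the same FDR-controlling step-up separately within each stratum instead of on the pooled p-values, and compare the number of discoveries. The use of statsmodels' BH implementation and the stratum labels are assumptions for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests  # assumed available

def stratified_vs_aggregated(pvals, strata, q=0.05):
    """Compare discoveries from stratum-wise BH control with the aggregated analysis."""
    p = np.asarray(pvals, dtype=float)
    strata = np.asarray(strata)
    reject_strat = np.zeros(len(p), dtype=bool)
    for s in np.unique(strata):
        idx = strata == s
        reject_strat[idx] = multipletests(p[idx], alpha=q, method="fdr_bh")[0]
    reject_agg = multipletests(p, alpha=q, method="fdr_bh")[0]
    return int(reject_strat.sum()), int(reject_agg.sum())

# e.g. strata might separate candidate-gene SNPs from anonymous genome-wide SNPs
```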

Journal ArticleDOI
TL;DR: In this paper, a general class of methods for exceedance control of FDP based on inverting tests of uniformity is presented, which produces a confidence envelope for the FDP as a function of rejection threshold.
Abstract: Multiple testing methods to control the false discovery rate, the expected proportion of falsely rejected null hypotheses among all rejections, have received much attention. It can be valuable to control not the mean of this false discovery proportion (FDP), but rather the probability that the FDP exceeds a specified bound. In this article we construct a general class of methods for exceedance control of FDP based on inverting tests of uniformity. The method also produces a confidence envelope for the FDP as a function of rejection threshold. We discuss how to select a procedure with good power.

Journal ArticleDOI
TL;DR: This short article discusses a simple method for assessing sample size requirements in microarray experiments by estimating the false discovery rate and false negative rate of a list of genes for a given hypothesized mean difference and various sample sizes.
Abstract: In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments. Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various sample sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable as per gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes. Our method seems to be useful for sample size assessment in microarray experiments.

Journal ArticleDOI
TL;DR: In this article, results on the false discovery rate (FDR) and the false nondiscovery rate (FNR) are derived for single-step multiple testing procedures; these extend previously known results, providing further insights into the notions of FDR and FNR and related measures.
Abstract: Results on the false discovery rate (FDR) and the false nondiscovery rate (FNR) are developed for single-step multiple testing procedures. In addition to verifying desirable properties of FDR and FNR as measures of error rates, these results extend previously known results, providing further insights, particularly under dependence, into the notions of FDR and FNR and related measures. First, considering fixed configurations of true and false null hypotheses, inequalities are obtained to explain how an FDR- or FNR-controlling single-step procedure, such as a Bonferroni or Sidak procedure, can potentially be improved. Two families of procedures are then constructed, one that modifies the FDR-controlling and the other that modifies the FNR-controlling Sidak procedure. These are proved to control FDR or FNR under independence less conservatively than the corresponding families that modify the FDR- or FNR-controlling Bonferroni procedure. Results of numerical investigations of the performance of the modified Sidak FDR procedure over its competitors are presented. Second, considering a mixture model where different configurations of true and false null hypotheses are assumed to have certain probabilities, results are also derived that extend some of Storey’s work to the dependence case.

Journal ArticleDOI
TL;DR: A family of methods that use a set of P-values to estimate or control the false discovery rate and similar error rates for microarray studies are described.
Abstract: The analysis of microarray data often involves performing a large number of statistical tests, usually at least one test per queried gene. Each test has a certain probability of reaching an incorrect inference; therefore, it is crucial to estimate or control error rates that measure the occurrence of erroneous conclusions in reporting and interpreting the results of a microarray study. In recent years, many innovative statistical methods have been developed to estimate or control various error rates for microarray studies. Researchers need guidance choosing the appropriate statistical methods for analysing these types of data sets. This review describes a family of methods that use a set of P-values to estimate or control the false discovery rate and similar error rates. Finally, these methods are classified in a manner that suggests the appropriate method for specific applications and diagnostic procedures that can identify problems in the analysis are described.

Journal ArticleDOI
01 Aug 2006-Genetics
TL;DR: It is shown that the general applicability of FDR for declaring significant linkages in the analysis of a single trait is dubious, and a generalized version of the GWER is proposed, called GWERk, that allows one to provide a more liberal balance between true positives and false positives at no additional cost in computation or assumptions.
Abstract: Linkage analysis involves performing significance tests at many loci located throughout the genome. Traditional criteria for declaring a linkage statistically significant have been formulated with the goal of controlling the rate at which any single false positive occurs, called the genomewise error rate (GWER). As complex traits have become the focus of linkage analysis, it is increasingly common to expect that a number of loci are truly linked to the trait. This is especially true in mapping quantitative trait loci (QTL), where sometimes dozens of QTL may exist. Therefore, alternatives to the strict goal of preventing any single false positive have recently been explored, such as the false discovery rate (FDR) criterion. Here, we characterize some of the challenges that arise when defining relaxed significance criteria that allow for at least one false positive linkage to occur. In particular, we show that the FDR suffers from several problems when applied to linkage analysis of a single trait. We therefore conclude that the general applicability of FDR for declaring significant linkages in the analysis of a single trait is dubious. Instead, we propose a significance criterion that is more relaxed than the traditional GWER, but does not appear to suffer from the problems of the FDR. A generalized version of the GWER is proposed, called GWERk, that allows one to provide a more liberal balance between true positives and false positives at no additional cost in computation or assumptions.
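A permutation sketch of a relaxed genomewise criterion of this kind, assuming GWERk is read as the probability of more than k false positive linkages; instead of recording the maximum scan statistic per permutation (the usual GWER threshold), one records the (k+1)-th largest. This is my construction, not the authors' code.

```python
import numpy as np

def gwer_k_threshold(perm_stats, k=1, alpha=0.05):
    """Permutation threshold for a relaxed genomewise error criterion.

    perm_stats: array of shape (n_permutations, n_loci) of genome-scan
    statistics computed under permuted trait values (global null). For each
    permutation take the (k+1)-th largest statistic; the (1 - alpha) quantile
    of these values is the threshold. With k = 0 this is the usual
    max-statistic GWER threshold.
    """
    perm_stats = np.asarray(perm_stats, dtype=float)
    kth_largest = np.sort(perm_stats, axis=1)[:, -(k + 1)]
    return float(np.quantile(kth_largest, 1.0 - alpha))
```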

Journal ArticleDOI
TL;DR: The authors generalize the local fdr as a function of multiple statistics, combining a common test statistic for assessing DE with its standard error information, and show that the resulting fdr2d performs better than commonly used modified test statistics.
Abstract: Motivation: The false discovery rate (fdr) is a key tool for statistical assessment of differential expression (DE) in microarray studies. Overall control of the fdr alone, however, is not sufficient to address the problem of genes with small variance, which generally suffer from a disproportionally high rate of false positives. It is desirable to have an fdr-controlling procedure that automatically accounts for gene variability. Methods: We generalize the local fdr as a function of multiple statistics, combining a common test statistic for assessing DE with its standard error information. We use a non-parametric mixture model for DE and non-DE genes to describe the observed multi-dimensional statistics, and estimate the distribution for non-DE genes via the permutation method. We demonstrate this fdr2d approach for simulated and real microarray data. Results: The fdr2d allows objective assessment of DE as a function of gene variability. We also show that the fdr2d performs better than commonly used modified test statistics. Availability: An R-package OCplus containing functions for computing fdr2d() and other operating characteristics of microarray data is available at http://www.meb.ki.se/~yudpaw. Contact: alexander.ploner@meb.ki.se

Journal ArticleDOI
TL;DR: A novel feature selection procedure based on a mixture model and a non-gaussianity measure of a gene's expression profile is presented, which can be used to find genes that define either small outlier subgroups or major subdivisions, depending on the sign of kurtosis.
Abstract: Motivation: Elucidating the molecular taxonomy of cancers and finding biological and clinical markers from microarray experiments is problematic due to the large number of variables being measured. Feature selection methods that can identify relevant classifiers or that can remove likely false positives prior to supervised analysis are therefore desirable. Results: We present a novel feature selection procedure based on a mixture model and a non-gaussianity measure of a gene's expression profile. The method can be used to find genes that define either small outlier subgroups or major subdivisions, depending on the sign of kurtosis. The method can also be used as a filtering step, prior to supervised analysis, in order to reduce the false discovery rate. We validate our methodology using six independent datasets by rediscovering major classifiers in ER negative and ER positive breast cancer and in prostate cancer. Furthermore, our method finds two novel subtypes within the basal subgroup of ER negative breast tumours, associated with apoptotic and immune response functions respectively, and with statistically different clinical outcome. Availability: An R-function pack that implements the methods used here has been added to vabayelMix, available from (www.cran.r-project.org). Contact: aet21@cam.ac.uk Supplementary information: Supplementary information is available at Bioinformatics online.
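The non-gaussianity idea can be illustrated with a plain sample kurtosis per gene, ranking genes by how far their expression profile departs from a single Gaussian; this ignores the paper's mixture-model machinery and is only a sketch.

```python
import numpy as np
from scipy.stats import kurtosis

def kurtosis_ranking(expr):
    """Rank genes (rows of expr) by the absolute excess kurtosis of their profile.

    Large positive kurtosis suggests a small outlier subgroup of samples;
    negative kurtosis suggests a split into major subdivisions.
    """
    k = kurtosis(expr, axis=1, fisher=True, bias=False)   # excess kurtosis per gene
    order = np.argsort(-np.abs(k))                        # strongest departures first
    return order, k

# usage sketch: keep only the top-ranked genes before supervised analysis,
# shrinking the family of hypotheses and thus the scope for false discoveries
```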

Journal ArticleDOI
Nicolai Meinshausen
TL;DR: In this paper, a confidence envelope for false discovery control when testing multiple hypotheses of association simultaneously is proposed, which allows for an exploratory approach when choosing suitable rejection regions while still retaining strong control over the proportion of false discoveries.
Abstract: . We propose a confidence envelope for false discovery control when testing multiple hypotheses of association simultaneously. The method is valid under arbitrary and unknown dependence between the test statistics and allows for an exploratory approach when choosing suitable rejection regions while still retaining strong control over the proportion of false discoveries.

Journal ArticleDOI
TL;DR: It is shown that the set Î ⊆ {1, …, p}, consisting of the indices of the rejected hypotheses β_i = 0, is a consistent estimator of I_0, under appropriate conditions on the design matrix X and the control values used in either procedure.

Journal ArticleDOI
TL;DR: The ChIP-Chip data structure is investigated, and methods are proposed for inferring the location of transcription factor binding sites from these data; these involve testing, for each probe, whether it is part of a bound sequence using a scan statistic that takes into account the spatial structure of the data.
Abstract: Cawley et al. (2004) have recently mapped the locations of binding sites for three transcription factors along human chromosomes 21 and 22 using ChIP–Chip experiments. ChIP–Chip experiments are a new approach to the genomewide identification of transcription factor binding sites and consist of chromatin (Ch) immunoprecipitation (IP) of transcription factor-bound genomic DNA followed by high density oligonucleotide hybridization (Chip) of the IP-enriched DNA. We investigate the ChIP–Chip data structure and propose methods for inferring the location of transcription factor binding sites from these data. The proposed methods involve testing for each probe whether it is part of a bound sequence using a scan statistic that takes into account the spatial structure of the data. Different multiple testing procedures are considered for controlling the familywise error rate and false discovery rate. A nested-Bonferroni adjustment, which is more powerful than the traditional Bonferroni adjustment when the test statis...

Book ChapterDOI
TL;DR: In this article, a stepdown procedure is proposed for controlling the false discovery proportion (FDP), defined as the ratio of the number of false positives to the total number of rejections.
Abstract: Consider the problem of testing multiple null hypotheses. A classical approach to dealing with the multiplicity problem is to restrict attention to procedures that control the familywise error rate (FWER), the probability of even one false rejection. However, if s is large, control of the FWER is so stringent that the ability of a procedure which controls the FWER to detect false null hypotheses is limited. Consequently, it is desirable to consider other measures of error control. We will consider methods based on control of the false discovery proportion (FDP) defined by the number of false rejections divided by the total number of rejections (defined to be 0 if there are no rejections). The false discovery rate proposed by Benjamini and Hochberg (1995) controls E(FDP). Here, we construct methods such that, for any γ and α, P{FDP > γ} ≤ α. Based on p-values of individual tests, we consider stepdown procedures that control the FDP, without imposing dependence assumptions on the joint distribution of the p-values. A greatly improved version of a method given in Lehmann and Romano (10) is derived and generalized to provide a means by which any sequence of nondecreasing constants can be rescaled to ensure control of the FDP. We also provide a stepdown procedure that controls the FDR under a dependence assumption.
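For orientation, here is a sketch of a stepdown of this general shape using constants of the Lehmann–Romano form, alpha_i = (floor(gamma*i) + 1) * alpha / (s + floor(gamma*i) + 1 - i). The chapter's improved and generalized constants differ, and the conditions under which this simple version controls P{FDP > gamma} are those given by Lehmann and Romano, so treat it purely as an illustration of the stepdown mechanics.

```python
import numpy as np

def fdp_stepdown(pvals, gamma=0.1, alpha=0.05):
    """Stepdown with Lehmann-Romano-style constants aimed at P{FDP > gamma} <= alpha.

    Rejects the hypotheses with the j smallest p-values, where j is the largest
    index such that p_(i) <= alpha_i for every i <= j.
    """
    p = np.asarray(pvals, dtype=float)
    s = len(p)
    order = np.argsort(p)
    i = np.arange(1, s + 1)
    const = (np.floor(gamma * i) + 1) * alpha / (s + np.floor(gamma * i) + 1 - i)
    ok = p[order] <= const
    j = s if ok.all() else int(np.argmin(ok))   # count of leading satisfied comparisons
    reject = np.zeros(s, dtype=bool)
    reject[order[:j]] = True
    return reject
```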

Journal ArticleDOI
TL;DR: In this paper, the Neyman-Pearson test is used to compare an observed value of a statistic with a specified region of the statistic's range; if the value falls in the region, the data are considered unlikely to have been generated given that the hypothesis is true, and the hypothesis is rejected.
Abstract: Recent issues of Health Services Research (HSR), the Journal of Health Economics, and Medical Care each contain articles that lack attention to the requirements of multiple hypotheses. The problems with multiple hypotheses are well known and often addressed in textbooks on research methods under the topics of joint tests (e.g., Greene 2003; Kennedy 2003) and significance level adjustment (e.g., Kleinbaum et al. 1998; Rothman and Greenland 1998; Portney and Watkins 2000; Myers and Well 2003; Stock and Watson 2003); yet, a look at applied journals in health services research quickly reveals that attention to the issue is not universal. This paper has two goals: to remind researchers of issues regarding multiple hypotheses and to provide a few helpful guidelines. I first discuss when to combine hypotheses into a composite for a joint test; I then discuss the adjustment of test criterion for sets of hypotheses. Although often treated in statistics as two solutions to the same problem (Johnson and Wichern 1992), here I treat them as separate tasks with distinct motivations. In this paper I focus on Neyman–Pearson testing using Fisher's p-value as the interpretational quantity. Classically, a test compares an observed value of a statistic with a specified region of the statistic's range; if the value falls in the region, the data are considered not likely to have been generated given the hypothesis is true, and the hypothesis is rejected. However, it is common practice to instead compare a p-value to a significance level, rejecting the hypothesis if the p-value is smaller than the significance level. Because most tests are based on tail areas of distributions, this is a distinction without a difference for the purpose of this paper, and so I will use the p-value and significance-level terms in this discussion. Of greater import is the requirement that hypotheses are stated a priori. A test is based on the prior assertion that if a given hypothesis is true, the data generating process will produce a value of the selected statistic that falls into the rejection region with probability equal to the corresponding significance level, which typically corresponds to a p-value smaller than the significance level. Setting hypotheses a priori is important in order to avoid a combinatorial explosion of error. For example, in a multiple regression model the a posteriori interpretation of regression coefficients in the absence of prior hypotheses does not account for the fact that the pattern of coefficients may be generated by chance. The important distinction is between the a priori hypothesis “the coefficient estimates for these particular variables in the data will be significant” and the a posteriori observation that “the coefficient estimates for these particular variables are significant.” In the first case, even if variables other than those identified in the hypothesis do not have statistically significant coefficients, the hypothesis is rejected nonetheless. In the second case, the observation applies to any set of variables that happen to have “statistically significant” coefficients. Hence, it is the probability that any set of variables have resultant “significant” statistics that drives the a posteriori case. As the investigator will interpret any number of significant coefficients that happen to result, the probability of significant results, given that no relationships actually exist, is the probability of getting any pattern of significance across the set of explanatory variables. 
This is different from a specific a priori case in which the pattern is preestablished by the explicit hypotheses. See the literatures on False Discovery Rate (e.g., Benjamini and Hochberg 1995; Benjamini and Liu 1999; Yekutieli and Benjamini 1999; Kwong, Holland, and Cheung 2002; Sarkar 2004; Ghosh, Chen, and Raghunathan 2005) and Empirical Bayes (Efron et al. 2001; Cox and Wong 2004) for methods appropriate for a posteriori investigation.

Journal ArticleDOI
TL;DR: A balanced probability analysis is presented, which provides the biologist with an approach to interpret results in the context of the total number of genes truly differentially expressed and false discovery and false negative rates for the list of genes reaching any significance threshold.
Abstract: Nucleotide-microarray technology, which allows the simultaneous measurement of the expression of tens of thousands of genes, has become an important tool in the study of disease. In disorders such as malignancy, gene expression often undergoes broad changes of sizable magnitude, whereas in many common multifactorial diseases, such as diabetes, obesity, and atherosclerosis, the changes in gene expression are modest. In the latter circumstance, it is therefore challenging to distinguish the truly changing from nonchanging genes, especially because statistical significance must be considered in the context of multiple hypothesis testing. Here, we present a balanced probability analysis (BPA), which provides the biologist with an approach to interpret results in the context of the total number of genes truly differentially expressed and false discovery and false negative rates for the list of genes reaching any significance threshold. In situations where the changes are of modest magnitude, sole consideration of the false discovery rate can result in poor power to detect genes truly differentially expressed. Concomitant analysis of the rate of truly differentially expressed genes not identified, i.e., the false negative rate, allows balancing of the two error rates and a more thorough insight into the data. To this end, we have developed a unique, model-based procedure for the estimation of false negative rates, which allows application of BPA to real data in which changes are modest.
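As a rough, p-value-based stand-in for the kind of balance the abstract argues for (not the authors' model-based BPA), one can estimate the proportion of true nulls from the p-value histogram and then report an approximate FDR and false negative rate at any threshold; the λ = 0.5 tuning constant is a common convention assumed here.

```python
import numpy as np

def fdr_fnr_at_threshold(pvals, t, lam=0.5):
    """Rough FDR and false-negative-rate estimates at p-value threshold t."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))  # estimated fraction of true nulls
    r = max(1, int(np.sum(p <= t)))                 # number of genes passing threshold
    fdr = min(1.0, pi0 * m * t / r)                 # expected false discoveries / r
    n_alt = (1.0 - pi0) * m                         # estimated truly changed genes
    true_pos = max(0.0, r - pi0 * m * t)            # rejections minus expected false ones
    fnr = 0.0 if n_alt == 0 else min(1.0, max(0.0, (n_alt - true_pos) / n_alt))
    return fdr, fnr
```

Scanning fdr and fnr over a grid of thresholds gives the kind of two-sided picture described above: lowering the threshold trades a smaller FDR for a larger share of truly changed genes that go undetected.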

Journal ArticleDOI
TL;DR: Three general approaches for addressing multiplicity in large research problems are presented, together with a general framework for ensuring reproducible results in complex research, where a researcher faces more than just one large research problem.
Abstract: The multiplicity problem is evident in the simplest form of statistical analysis of gene expression data ‐ the identification of differentially expressed genes. In more complex analysis, the problem is compounded by the multiplicity of hypotheses per gene. Thus, in some cases, it may be necessary to consider testing millions of hypotheses. We present three general approaches for addressing multiplicity in large research problems: (a) use the scalability of false discovery rate (FDR) controlling procedures; (b) apply FDR-controlling procedures to a selected subset of hypotheses; (c) apply hierarchical FDR-controlling procedures. We also offer a general framework for ensuring reproducible results in complex research, where a researcher faces more than just one large research problem. We demonstrate these approaches by analyzing the results of a complex experiment involving the study of gene expression levels in different brain regions across multiple mouse strains.

Journal ArticleDOI
TL;DR: An omnibus test is constructed that combines SNP and haplotype analysis and balances the desire for statistical power against the implicit costs of false positive results, which up to now appear to be common in the literature.
Abstract: The genetic case-control association study of unrelated subjects is a leading method to identify single nucleotide polymorphisms (SNPs) and SNP haplotypes that modulate the risk of complex diseases. Association studies often genotype several SNPs in a number of candidate genes; we propose a two-stage approach to address the inherent statistical multiple comparisons problem. In the first stage, each gene's association with disease is summarized by a single p-value that controls a familywise error rate. In the second stage, summary p-values are adjusted for multiplicity using a false discovery rate (FDR) controlling procedure. For the first stage, we consider marginal and joint tests of SNPs and haplotypes within genes, and we construct an omnibus test that combines SNP and haplotype analysis. Simulation studies show that when disease susceptibility is conferred by a SNP, and all common SNPs in a gene are genotyped, marginal analysis of SNPs using the Simes test has similar or higher power than marginal or joint haplotype analysis. Conversely, haplotype analysis can be more powerful when disease susceptibility is conferred by a haplotype. The omnibus test tracks the more powerful of the two approaches, which is generally unknown. Multiple testing balances the desire for statistical power against the implicit costs of false positive results, which up to now appear to be common in the literature.
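A sketch of the two-stage structure described above: within each gene the Simes test collapses the SNP-level p-values into one gene-level p-value, and the gene-level p-values are then passed to a BH-style step-up. The data layout and function names are assumptions; this is not the authors' implementation.

```python
import numpy as np

def simes_pvalue(pvals):
    """Simes combination of the SNP p-values within one gene."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    return float(np.min(p * n / np.arange(1, n + 1)))

def gene_level_fdr(snp_pvals_by_gene, q=0.05):
    """Stage 1: Simes p-value per gene.  Stage 2: BH step-up across genes."""
    genes = list(snp_pvals_by_gene)
    gene_p = np.array([simes_pvalue(snp_pvals_by_gene[g]) for g in genes])
    m = len(gene_p)
    order = np.argsort(gene_p)
    below = gene_p[order] <= q * np.arange(1, m + 1) / m
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    return [genes[i] for i in order[:k]]   # genes declared associated at FDR q

# usage sketch with invented numbers:
# gene_level_fdr({"GENE1": [0.0004, 0.21, 0.5], "GENE2": [0.3, 0.8]}, q=0.05)
```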

Journal ArticleDOI
TL;DR: In this review, strategies are described for optimising the genotyping cost by discarding unpromising genes at an earlier stage, saving resources for the genes that show a trend of association, and new methods of analysis that combine evidence across genes to increase sensitivity to multiple true associations in the presence of many non-associated genes are reviewed.
Abstract: Recent developments in the statistical analysis of genome-wide studies are reviewed. Genome-wide analyses are becoming increasingly common in areas such as scans for disease-associated markers and gene expression profiling. The data generated by these studies present new problems for statistical analysis, owing to the large number of hypothesis tests, comparatively small sample size and modest number of true gene effects. In this review, strategies are described for optimising the genotyping cost by discarding unpromising genes at an earlier stage, saving resources for the genes that show a trend of association. In addition, there is a review of new methods of analysis that combine evidence across genes to increase sensitivity to multiple true associations in the presence of many non-associated genes. Some methods achieve this by including only the most significant results, whereas others model the overall distribution of results as a mixture of distributions from true and null effects. Because genes are correlated even when having no effect, permutation testing is often necessary to estimate the overall significance, but this can be very time consuming. Efficiency can be improved by fitting a parametric distribution to permutation replicates, which can be re-used in subsequent analyses. Methods are also available to generate random draws from the permutation distribution. The review also includes discussion of new error measures that give a more reasonable interpretation of genome-wide studies, together with improved sensitivity. The false discovery rate allows a controlled proportion of positive results to be false, while detecting more true positives; and the local false discovery rate and false-positive report probability give clarity on whether or not a statistically significant test represents a real discovery.

Journal ArticleDOI
TL;DR: This work develops a procedure for estimating the latent FDR (ELF) based on a Poisson regression model and finds that ELF performs substantially better than the standard FDR approach in estimating the FDP.
Abstract: Motivation: Wide-scale correlations between genes are commonly observed in gene expression data, due to both biological and technical reasons. These correlations increase the variability of the standard estimate of the false discovery rate (FDR). We highlight the false discovery proportion (FDP, instead of the FDR) as the suitable quantity for assessing differential expression in microarray data, demonstrate the deleterious effects of correlation on FDP estimation and propose an improved estimation method that accounts for the correlations. Methods: We analyse the variation pattern of the distribution of test statistics under permutation using the singular value decomposition. The results suggest a latent FDR model that accounts for the effects of correlation, and is statistically closer to the FDP. We develop a procedure for estimating the latent FDR (ELF) based on a Poisson regression model. Results: For simulated data based on the correlation structure of real datasets, we find that ELF performs substantially better than the standard FDR approach in estimating the FDP. We illustrate the use of ELF in the analysis of breast cancer and lymphoma data. Availability: R code to perform ELF is available in http://www.meb.ki.se/~yudpaw. Contact: yudi.pawitan@ki.se Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The present communication shows that the variance of the proposed estimators may be intolerably high, the correlation structure of microarray data being the main cause of their instability.
Abstract: Some extended false discovery rate (FDR) controlling multiple testing procedures rely heavily on empirical estimates of the FDR constructed from gene expression data. Such estimates are also used as performance indicators when comparing different methods for microarray data analysis. The present communication shows that the variance of the proposed estimators may be intolerably high, the correlation structure of microarray data being the main cause of their instability.