
Showing papers on "Bonferroni correction published in 2004"


Journal ArticleDOI
TL;DR: The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power is less than 20% to detect a small effect and less than 50% to detect a medium effect.
Abstract: Recently, Jennions and Moller (2003) carried out a meta-analysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (Ho) significance test is the probability that the test will reject Ho when a research hypothesis (Ha) is true. Knowledge of effect size is particularly important for statistical power analysis (for statistical power analysis, see Cohen, 1988; Nakagawa and Foster, in press). There are many kinds of effect size measures available (e.g., Pearson’s r, Cohen’s d, Hedges’s g), but most of these fall into one of two major types, namely the r family and the d family (Rosenthal, 1994). The r family shows the strength of relationship between two variables while the d family shows the size of difference between two variables. As a benchmark for research planning and evaluation, Cohen (1988) proposed ‘conventional’ values for small, medium, and large effects: r = .10, .30, and .50 and d = .20, .50, and .80, respectively (in the way that p values of .05, .01, and .001 are conventional points, although these conventional values of effect size have been criticized; e.g., Rosenthal et al., 2000). The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power is less than 20% to detect a small effect and less than 50% to detect a medium effect. This means, for example, that the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or β) (i.e., not rejecting Ho when Ho is false; note that statistical power equals 1 − β) than if they had flipped a coin, when an experimental effect is of medium size (i.e., r = .30, d = .50). Here, I highlight and discuss an implication of this low statistical power for one of the most widely used statistical procedures, Bonferroni correction (Cabin and Mitchell, 2000). Bonferroni corrections are employed to reduce Type I errors (i.e., rejecting Ho when Ho is true) when multiple tests or comparisons are conducted. Two kinds of Bonferroni procedures are commonly used. One is the standard Bonferroni procedure, where a modified significance criterion (α/k, where k is the number of statistical tests conducted on given data) is used. The other is the sequential Bonferroni procedure, which was introduced by Holm (1979) and popularized in the field of ecology and evolution by Rice (1989) (see these papers for the procedure). For example, in a recent volume of Behavioral Ecology (vol. 13, 2002), nearly one-fifth of papers (23 out of 117) included Bonferroni corrections. Twelve articles employed the standard procedure while 11 articles employed the sequential procedure (10 citing Rice, 1989, and one citing Holm, 1979). A serious problem associated with the standard Bonferroni procedure is a substantial reduction in the statistical power of rejecting an incorrect Ho in each test (e.g., Holm, 1979; Perneger, 1998; Rice, 1989). The sequential Bonferroni procedure also incurs a reduction in power, but to a lesser extent (which is the reason that the sequential procedure is used in preference by some researchers; Moran, 2003).
Thus, both procedures exacerbate the existing problem of low power identified by Jennions and Moller (2003). For example, suppose an experiment in which an experimental group and a control group each consist of 30 subjects. After an experimental period, we measure five different variables and conduct a series of t tests on each variable. Even prior to applying Bonferroni corrections, the statistical power of each test to detect a medium effect is 61% (α = .05), which is less than the recommended acceptable 80% level (Cohen, 1988). In the field of behavioral ecology and animal behavior, it is usually difficult to use large sample sizes (in many cases, n < 30) because of practical and ethical reasons (see Still, 1992). When standard Bonferroni corrections are applied, the statistical power of each t test drops to as low as 33% (to detect a medium effect at α/5 = .01). Although sequential Bonferroni corrections do not reduce the power of the tests to the same extent on average (33–61% per t test), the probability of making a Type II error for some of the tests (β = 1 − power, so 39–66%) remains unacceptably high. Furthermore, statistical power would be even lower if we measured more than five variables or if we were interested in detecting a small effect. Bonferroni procedures appear to raise another set of problems. There is no formal consensus for when Bonferroni procedures should be used, even among statisticians (Perneger, 1998). It seems, in some cases, that Bonferroni corrections are applied only when their results remain significant. Some researchers may think that their results are ‘more significant’ if the results pass the rigor of Bonferroni corrections, although this is logically incorrect (Cohen, 1990, 1994; Yoccoz, 1991). Many researchers are already reluctant to report nonsignificant results (Jennions and Moller, 2002a,b). The wide use of Bonferroni procedures may be aggravating the tendency of researchers not to present nonsignificant results, because presentation of more tests with nonsignificant results may make previously ‘significant’ results ‘nonsignificant’ under Bonferroni procedures. The more detailed research (i.e., research measuring more variables) researchers do, the less probability they have of finding significant results. Moran (2003) recently named this paradox a hyper-Red Queen phenomenon (see the paper for more discussion of problems with the sequential method). Imagine that we conduct a study in which we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had ‘appropriately’ employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is …
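To make the two procedures under discussion concrete, here is a minimal sketch in Python (mine, not the paper's; the p-values are hypothetical). Holm's sequential procedure tests the i-th smallest p-value against a progressively less stringent criterion, which is why it retains somewhat more power than the standard α/k rule.

```python
# Sketch of the two procedures: standard Bonferroni and Holm's (1979)
# sequential ("sequentially rejective") Bonferroni. p-values are hypothetical.
import numpy as np

def standard_bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / k."""
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def sequential_bonferroni(pvals, alpha=0.05):
    """Holm: compare the i-th smallest p-value to alpha / (k - i),
    stopping at the first non-rejection."""
    pvals = np.asarray(pvals)
    k = len(pvals)
    reject = np.zeros(k, dtype=bool)
    for i, idx in enumerate(np.argsort(pvals)):
        if pvals[idx] > alpha / (k - i):
            break
        reject[idx] = True
    return reject

pvals = [0.003, 0.012, 0.021, 0.040, 0.260]
print(standard_bonferroni(pvals))   # rejects only p = 0.003
print(sequential_bonferroni(pvals)) # rejects p = 0.003 and p = 0.012
```

On this toy input the standard rule rejects one hypothesis and Holm's rule two, illustrating the power difference the editorial describes.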

1,996 citations


Journal ArticleDOI
01 Jun 2004-Oikos
TL;DR: It is concluded that some reasonable control of alpha inflation is required of authors as a safeguard against striking but spurious findings, which may strongly affect the credibility of ecological research.
Abstract: I analyze some criticisms made about the application of alpha-inflation correction procedures to repeated-test tables in ecological studies. Common pitfalls during application, the statistical properties of many ecological datasets, and the strong control of the tablewise error rate made by the widely used sequential Bonferroni procedures, seem to be responsible for some ‘illogical’ results when such corrections are applied. Sharpened Bonferroni-type procedures may alleviate the decrease in power associated with standard methods as the number of tests increases. More powerful methods, based on controlling the false discovery rate (FDR), deserve a more frequent use in ecological studies, especially in those involving large repeated-test tables in which several or many individual null hypotheses have been rejected, and the most significant p-value is relatively large. I conclude that some reasonable control of alpha inflation is required of authors as a safeguard against striking but spurious findings, which may strongly affect the credibility of ecological research. Moran (2003) recently suggested rejecting the application of the sequential Bonferroni rule in ecological studies. He based his proposal on certain mathematical, logical, and practical objections which led him to conclude that it would be better for ecological research to abandon the awkward constraints derived from the sequential Bonferroni rule, allowing the researcher to interpret the multiple test outcomes more freely without correcting for alpha inflation. Thereby, detailed ecological research would be stimulated, while avoiding the loss of potentially relevant results, which are at risk of remaining unknown when authors are required to adhere strictly to the sequential Bonferroni rule. The likely increase in the frequency of ‘false positives’ in the ecological literature would be of minor importance, since these spurious results will not be confirmed by subsequent experiments. In other contexts, even stronger claims against alpha corrections have recently been the subject of controversy (Perneger 1998, 1999, Feise 2002). Surprisingly few people have questioned the same corrections which are implicit in the standard post hoc methods routinely applied to perform multiple comparisons between treatments for a single dependent variable. Accepting Moran’s arguments, it could be argued that relevant research results are perhaps not being published because people use these alpha-corrected methods instead of looking directly at the individual pairwise-test p-values. There is an apparent inconsistency between the unquestioned acceptance of the ‘alpha inflation under repeated tests’ principle in the univariate case, and the controversy about the convenience or not of applying the same statistical principle in the multivariate case. Arbitrary rejection of the application of a well-founded statistical principle does not seem an acceptable scientific solution for a problem. If the way in which alpha-inflation corrections are routinely applied in multivariate ecological studies does not work, it seems more reasonable to analyze and improve the procedures rather than simply ‘kill the principle’.
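Since the abstract recommends FDR control as the more powerful alternative, a minimal sketch of the Benjamini-Hochberg step-up procedure may help; the p-values are invented for illustration and the implementation is generic, not tied to any package.

```python
# Generic sketch of the Benjamini-Hochberg (1995) step-up FDR procedure,
# contrasted with the standard Bonferroni cut-off, on invented p-values.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Reject the k* smallest p-values, where k* is the largest k
    with p_(k) <= q * k / m."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.nonzero(below)[0].max()   # largest k meeting the bound
        reject[order[:k_star + 1]] = True
    return reject

# a hypothetical repeated-test table with 20 tests, 5 of them small
pvals = np.array([0.001, 0.004, 0.006, 0.008, 0.009] + [0.2] * 15)
print("BH rejections:", benjamini_hochberg(pvals).sum())      # 5
print("Bonferroni rejections:", (pvals <= 0.05 / 20).sum())   # 1
```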

546 citations


Journal ArticleDOI
TL;DR: An algorithm is described that avoids the nested simulations otherwise needed to obtain P values corrected for multiple testing; such nested simulations would limit the applicability of the approach because of computer running-time restrictions.
Abstract: Haplotypes—that is, linear arrangements of alleles on the same chromosome that were inherited as a unit—are expected to carry important information in the context of association fine mapping of complex diseases. In consideration of a set of tightly linked markers, there is an enormous number of different marker combinations that can be analyzed. Therefore, a severe multiple-testing problem is introduced. One method to deal with this problem is Bonferroni correction by the number of combinations that are considered. Bonferroni correction is appropriate for independent tests but will result in a loss of power in the presence of linkage disequilibrium in the region. A second method is to perform simulations. It is unfortunate that most methods of haplotype analysis already require simulations to obtain an uncorrected P value for a specific marker combination. Thus, it seems that nested simulations are necessary to obtain P values that are corrected for multiple testing, which, apparently, limits the applicability of this approach because of computer running-time restrictions. Here, an algorithm is described that avoids such nested simulations. We check the validity of our approach under two disease models for haplotype analysis of family data. The true type I error rate of our algorithm corresponds to the nominal significance level. Furthermore, we observe a strong gain in power with our method to obtain the global P value, compared with the Bonferroni procedure to calculate the global P value. The method described here has been implemented in the latest update of our program FAMHAP.
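FAMHAP's actual algorithm is not spelled out in the abstract, so the following is only a generic sketch of the single-layer idea it alludes to: one set of permutations can serve double duty, giving each marker combination an uncorrected p-value by ranking and, via the per-permutation maximum statistic, a multiplicity-corrected global p-value (the max-T idea of Westfall and Young). The toy statistic and all data below are illustrative.

```python
# Generic single-layer permutation sketch (max-T flavour); not FAMHAP's internals.
import numpy as np

rng = np.random.default_rng(0)

def corrected_pvalues(stat_fn, data, labels, n_perm=1000):
    """One set of permutations serves double duty: per-combination ranks give
    uncorrected p-values, and the per-permutation maximum statistic gives
    p-values corrected for testing many marker combinations."""
    observed = stat_fn(data, labels)                       # (n_combinations,)
    perm = np.array([stat_fn(data, rng.permutation(labels))
                     for _ in range(n_perm)])              # (n_perm, n_combinations)
    uncorrected = (perm >= observed).mean(axis=0)
    corrected = (perm.max(axis=1)[:, None] >= observed).mean(axis=0)
    return uncorrected, corrected

def stat_fn(data, labels):
    # toy statistic: absolute mean difference between groups, per column
    return np.abs(data[labels == 1].mean(axis=0) - data[labels == 0].mean(axis=0))

data = rng.normal(size=(200, 8))
data[:100, 0] += 0.5                      # one genuine effect, in column 0
labels = np.repeat([1, 0], 100)
print(corrected_pvalues(stat_fn, data, labels))
```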

163 citations


Journal Article
TL;DR: The sequentially rejective Bonferroni test is an easily applied, versatile statistical tool that enables researchers to make simultaneous inferences from their data without risking an unacceptably high overall type I error rate.

84 citations


Journal ArticleDOI
TL;DR: The sequentially rejective Bonferroni test is well known among statisticians but, as this paper notes, is not used routinely in the scientific literature.

83 citations


Journal ArticleDOI
TL;DR: Although no individual SNP remained significantly associated with disease after Bonferroni correction, haplotype analyses suggest that at least one susceptibility locus for schizophrenia is located within the GRM8 region in the Japanese population.
Abstract: Glutamatergic dysfunction has been implicated in the pathophysiology of schizophrenia. The Group III metabotropic glutamate receptors 4 (mGluR4), 6, 7, and 8 are thought to modulate glutamatergic transmission in the brain by inhibiting glutamate release at the synapse. We tested association of schizophrenia with GRM8 using 22 single nucleotide polymorphisms (SNPs), at average intervals of 40.3 kb across the GRM8 region, in 100 case-control pairs. Although we observed significant associations of schizophrenia with two SNPs, SNP18 (rs2237748, allele: P = 0.0279; genotype: P = 0.0124) and SNP19 (rs2299472, allele: P = 0.0302; genotype: P = 0.0127), neither SNP remained significantly associated with disease after Bonferroni correction. Both SNP18 and SNP19 were included in a large region (>330 kb) in which SNPs are in linkage disequilibrium (LD) at the 3' region of GRM8. We also tested haplotype association of schizophrenia with constructed haplotypes of the SNPs in LD. Significant associations were detected for the combinations of SNP5-SNP6 (χ² = 18.12, df = 3, P = 0.0004, Pcorr = 0.0924 with Bonferroni correction), SNP4-SNP5-SNP6 (χ² = 27.50, df = 7, P = 0.0075, Pcorr = 0.015 with Bonferroni correction), and SNP5-SNP6-SNP7 (χ² = 23.92, df = 7, P = 0.0011, Pcorr = 0.0022 with Bonferroni correction). Thus, we conclude that at least one susceptibility locus for schizophrenia is located within the GRM8 region in Japanese.

52 citations


Journal ArticleDOI
TL;DR: Several methods for performing a comparison of capability indices for two different processes or the same process before and after an adjustment are considered.
Abstract: When selecting a supplier or assessing process improvement, it is useful to compare capability indices for two different processes or the same process before and after adjustment. Using computer simulation, several methods for performing this comparison …

41 citations


Journal ArticleDOI
TL;DR: Questions that ask respondents to "choose all that apply" from a set of items occur frequently in surveys, and it is often of interest to test for independence between two categorical variables.
Abstract: Questions that ask respondents to "choose all that apply" from a set of items occur frequently in surveys. Categorical variables that summarize this type of survey data are called both pick any/c variables and multiple-response categorical variables. It is often of interest to test for independence between two categorical variables. When both categorical variables can have multiple responses, traditional Pearson chi-square tests for independence should not be used because of the within-subject dependence among responses. An intuitively constructed version of the Pearson statistic is proposed to perform the test using bootstrap procedures to approximate its sampling distribution. First- and second-order adjustments to the proposed statistic are given in order to use a chi-square distribution approximation. A Bonferroni adjustment is proposed to perform the test when the joint set of responses for individual subjects is unavailable. Simulations show that the bootstrap procedures hold the correct size more consistently than the other procedures.
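As a rough illustration of the bootstrap approach described above (not the authors' exact statistic or its first- and second-order adjustments), the sketch below sums 2×2 Pearson statistics over all item pairs and resamples the two sets of responses independently to impose the null hypothesis; all data are fabricated.

```python
# Hedged sketch: W and Y are binary (subjects x items) matrices of
# "choose all that apply" responses; the modified Pearson statistic sums
# 2x2 chi-square statistics over all item pairs, and resampling W-rows and
# Y-rows independently generates the null (independence) distribution.
import numpy as np

rng = np.random.default_rng(1)

def pearson_2x2(a, b):
    """Pearson chi-square for the 2x2 table of two binary vectors."""
    n = len(a)
    table = np.array([[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                      [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]])
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / n
    return np.sum((table - expected) ** 2 / np.maximum(expected, 1e-12))

def modified_pearson(W, Y):
    return sum(pearson_2x2(W[:, i], Y[:, j])
               for i in range(W.shape[1]) for j in range(Y.shape[1]))

def bootstrap_pvalue(W, Y, B=999):
    observed = modified_pearson(W, Y)
    n = len(W)
    boot = [modified_pearson(W[rng.integers(0, n, n)], Y[rng.integers(0, n, n)])
            for _ in range(B)]  # separate resampling enforces independence
    return (np.sum(np.asarray(boot) >= observed) + 1) / (B + 1)

W = rng.integers(0, 2, size=(100, 3))   # toy pick-any/3 responses
Y = rng.integers(0, 2, size=(100, 2))   # toy pick-any/2 responses
print(bootstrap_pvalue(W, Y))
```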

41 citations


Journal ArticleDOI
TL;DR: It is found that the coverage probabilities associated with the various methods of constructing simultaneous confidence intervals (for ratios) in many-to-one comparisons depend on the ratios of the coefficient of variation for the mean of the control group to the coefficient of variation for the mean of the treatments.
Abstract: Objectives: In this article, we illustrate and compare exact simultaneous confidence sets with various approximate simultaneous confidence intervals for multiple ratios as applied to many-to-one comparisons. Quite different datasets are analyzed to clarify the points. Methods: The methods are based on existing probability inequalities (e.g., Bonferroni, Slepian and Sidak), estimation of nuisance parameters and re-sampling techniques. Exact simultaneous confidence sets based on the multivariate t-distribution are constructed and compared with approximate simultaneous confidence intervals. Results: It is found that the coverage probabilities associated with the various methods of constructing simultaneous confidence intervals (for ratios) in many-to-one comparisons depend on the ratios of the coefficient of variation for the mean of the control group to the coefficient of variation for the mean of the treatments. If the ratios of the coefficients of variation are less than one, the Bonferroni-corrected Fieller confidence intervals have almost the same coverage probability as the exact simultaneous confidence sets. Otherwise, the use of Bonferroni intervals leads to conservative results. Conclusions: When the ratio of the coefficient of variation for the mean of the control group to the coefficient of variation for the mean of the treatments is greater than one (e.g., in balanced designs with increasing effects), the Bonferroni simultaneous confidence intervals are too conservative. Therefore, we recommend not using Bonferroni for this kind of data. On the other hand, the plug-in method maintains the intended confidence coefficient quite satisfactorily; therefore, it can serve as the best alternative in any case.
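For intuition on why Bonferroni-corrected intervals can be conservative: each of k simultaneous two-sided intervals is simply built at level 1 − α/k, which widens the critical value. A small illustration (degrees of freedom and k are arbitrary choices, not taken from the paper):

```python
# Bonferroni route to simultaneous intervals: build each of k two-sided
# intervals at confidence level 1 - alpha/k. Illustrative df and k.
from scipy import stats

alpha, k, df = 0.05, 4, 45
t_per_comparison = stats.t.ppf(1 - alpha / 2, df)     # unadjusted: ~2.01
t_bonferroni = stats.t.ppf(1 - alpha / (2 * k), df)   # adjusted: ~2.59
print(t_per_comparison, t_bonferroni)
```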

41 citations


Journal ArticleDOI
TL;DR: A simple strategy is described that adjusts alpha for multiple primary efficacy measures and modifies the sample size so that statistical power is maintained for each test.
Abstract: Background A researcher must carefully balance the risk of 2 undesirable outcomes when designing a clinical trial: false-positive results (type I error) and false-negative results (type II error). In planning the study, careful attention is routinely paid to statistical power (i.e., the complement of type II error) and corresponding sample size requirements. However, Bonferroni-type alpha adjustments to protect against type I error for multiple tests are often resisted. Here, a simple strategy is described that adjusts alpha for multiple primary efficacy measures, yet maintains statistical power for each test. Method To illustrate the approach, multiplicity-adjusted sample size requirements were estimated for effects of various magnitudes with statistical power analyses for 2-tailed comparisons of 2 groups using χ² tests and t tests. These analyses estimated the required sample size for hypothetical clinical trial protocols in which the prespecified number of primary efficacy measures ranged from 1 to 5. Corresponding Bonferroni-adjusted alpha levels were used for these calculations. Results Relative to that required for 1 test, the sample size increased by about 20% for 2 dependent variables and 30% for 3 dependent variables. Conclusion The strategy described adjusts alpha for multiple primary efficacy measures and, in turn, modifies the sample size to maintain statistical power. Although the strategy is not novel, it is typically overlooked in psychopharmacology trials. The number of primary efficacy measures must be prespecified and carefully limited when a clinical trial protocol is prepared. If multiple tests are designated in the protocol, the alpha-level adjustment should be anticipated and incorporated in sample size calculations.
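The flavor of the calculation can be approximated with standard power-analysis tools; the sketch below (my reconstruction, not the authors' code, assuming a medium effect d = 0.5, a two-sided test and 80% power) splits α across m primary efficacy measures and reproduces roughly the 20% and 30% sample-size increases mentioned in the results.

```python
# Approximate reconstruction of the multiplicity-adjusted sample sizes:
# two-group two-sided t test, d = 0.5, 80% power, alpha split over m tests.
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for m in range(1, 6):   # m = number of prespecified primary efficacy measures
    n = power_analysis.solve_power(effect_size=0.5, power=0.80,
                                   alpha=0.05 / m, alternative='two-sided')
    print(f"{m} measure(s): n = {n:.0f} per group")
# 1 measure: n = 64; 2: ~77 (about +20%); 3: ~85 (about +30%)
```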

40 citations


Journal ArticleDOI
TL;DR: Results support the following conclusions: failure to control for multiple significance testing results in unacceptable FWE rates, the FWE rate for the MPTs approximated the alpha set for the analyses, and the statistical power advantage that MPTs provide over Bonferroni adjustments is important when using small sample sizes such as those that are typical of recent electrocortical studies.
Abstract: This study examined the relative family-wise error (FWE) rate and statistical power of multivariate permutation tests (MPTs), Bonferroni-adjusted alpha, and uncorrected-alpha tests of significance for bivariate associations. Although there are many previous applications of MPTs, this is the first to apply it to testing bivariate associations. Electrocortical studies were selected as an example class because the sample sizes that are typical of electrocortical studies published in 2001 and 2002 are small and their multiple significance tests are typically nonindependent. Because Bonferroni adjustments assume independent predictors, we expected that MPTs would be more powerful than the Bonferroni adjustment. Results support the following conclusions: (a) failure to control for multiple significance testing results in unacceptable FWE rates, (b) the FWE rate for the MPTs approximated the alpha set for the analyses, and (c) the statistical power advantage that MPTs provide over Bonferroni adjustments is important when using small sample sizes such as those that are typical of recent electrocortical studies.

01 Jan 2004
TL;DR: The problem of multiple comparisons: we would like to control the false positive rate not just for any single test but also for the entire collection (or family) of tests that makes up the experiment.
Abstract: Statistical analysis of a data set typically involves testing not just a single hypothesis, but rather many (often very many!). For any particular test, we may assign a pre-set probability α of a type-1 error (i.e., a false positive: rejecting the null hypothesis when in fact it is true). The problem is that using a value of (say) α = 0.05 means that roughly one out of every twenty such tests will show a false positive. Thus, if our experiment involves performing 100 tests, we expect 5 to be declared significant if we use a value of α = 0.05 for each. This is the problem of multiple comparisons: we would like to control the false positive rate not just for any single test but also for the entire collection (or family) of tests that makes up our experiment.
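A quick arithmetic check of the claim above, under the assumption that the 100 tests are independent and all truly null:

```python
# Expected count of false positives, and the family-wise error rate,
# assuming 100 independent true nulls at a per-test alpha of 0.05.
alpha, k = 0.05, 100
print("expected false positives:", alpha * k)        # 5.0
print("P(at least one):", 1 - (1 - alpha) ** k)      # ~0.994
```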

Journal Article
TL;DR: An asymptotic experiment-wise error rate criterion was algebraically derived, and Bonferroni test statistics were found to provide excellent approximations to the (asymptotically) exact test statistics.
Abstract: A number of state assessment programs that employ Rasch-based common item equating procedures estimate the equating constant with only those common items for which the two tests' Rasch item difficulty parameter estimates differ by less than 0.3 logits. The results of this study present evidence that this practice results in an inflated probability of incorrectly dropping an item from the common item set if the number of examinees is small (e.g., 500 or less) and the reverse if the number of examinees is large (e.g., 5000 or more). An asymptotic experiment-wise error rate criterion was algebraically derived; this same criterion can also be applied to the Mantel-Haenszel statistic. Bonferroni test statistics were found to provide excellent approximations to the (asymptotically) exact test statistics.

Journal ArticleDOI
TL;DR: This study compares a typical heuristic algorithm with classical and Bayesian regression models in ascertaining the presence of acute bronchopulmonary disease events in lung transplant recipients, suggesting the clinical usefulness of the Bayesian approach compared with the classical and heuristic approaches.
Abstract: This study compares a typical heuristic algorithm with classical and Bayesian regression models in ascertaining the presence of acute bronchopulmonary disease events in lung transplant recipients. These models attempt to predict whether an epoch will end in an event, based on the preceding two weeks of data. The data consist of 150 two-week epochs of daily to biweekly spirometry and symptom covariates for 30 subjects over 60 subject-years. Seventy-five 'event' epochs end on a day when an acute bronchopulmonary disease event is documented in the medical record; 75 randomly selected 'non-event' epochs end on a day when no event is documented. The data are partitioned by randomly assigning 15 subjects for training and the remaining 15 subjects for testing. For cross-validation, a second random partition is generated from the same data set. The statistical models are trained and tested on both partitions. For the heuristic algorithm, its historical event classifications on the same test cases are used. Classification performance on both partitions of all models is compared using receiver operating characteristic curves, sensitivity and specificity, and a Shannon information score. Data partition did not appreciably affect statistical model performance. All statistical models, unlike the heuristic algorithm, performed significantly differently from chance (family significance < 0.05, Pearson independence chi-square, Bonferroni multiple correction), and better than the heuristic algorithm. The best models were Bayesian changepoint models. Through a clinically oriented discussion, a case classified by all of these algorithms is presented, suggesting the clinical usefulness of the Bayesian approach compared with the classical and heuristic approaches.

01 Jan 2004
TL;DR: It is found that among a collection of primate sequences, even an optimal sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in inferences of base frequency parameters.
Abstract: Genetic epidemiology aims at identifying biological mechanisms responsible for human diseases. Genome-wide association studies, made possible by recent improvements in genotyping technologies, are now promisingly investigated. In these studies, common first-stage strategies focus on marginal effects but lead to multiple testing and are unable to capture the possibly complex interplay between genetic factors. We have adapted the use of the local score statistic, already successfully applied to analyse long molecular sequences. Via sum statistics, this method captures local and possible distant dependences between markers. Dedicated to genome-wide association studies, it is fast to compute, able to handle large datasets, circumvents the multiple-testing problem and outlines a set of genomic regions (segments) for further analyses. Applied to simulated and real data, our approach outperforms classical Bonferroni and FDR corrections for multiple testing. It is implemented in a software tool termed LHiSA (Local High-scoring Segments for Association), available at: http://stat.genopole.cnrs.fr/software/lhisa. KEYWORDS: association studies, local score, sum statistics, SNP

Journal ArticleDOI
Peter Kraft
TL;DR: The authors’ permutation procedure does not have the desired statistical property—that is, it rejects the global null hypothesis of no interaction too often when none of the estimated interaction parameters differ from their null value.
Abstract: To the Editor: Complex diseases are (by definition) influenced by multiple genes, environmental factors, and their interactions. There is currently a strong interest in studies testing for association between combinations of these factors and disease, in part because genes that affect the risk of disease only in the presence of another genetic variant or particular environment may not be detected in a marginal (gene-by-gene) analysis (Culverhouse et al. 2002). Such studies raise the problem of multiple comparisons. Even when a small number of candidate genes and environmental factors is examined, a large number of possible interactions may need to be tested, as illustrated by a recent article in The American Journal of Human Genetics (Bugawan et al. 2003). Bugawan et al. (2003) investigated potential interaction between the IL4R locus and five tightly linked SNPs in the IL4 and IL13 loci on chromosome 5, through use of a sample of 90 patients with type I diabetes and 94 population-based controls. They independently tested each of the chromosome 5 SNPs for interaction with IL4R, through use of logistic regression (cf. their table 7), and corrected for multiple comparisons through use of a permutation procedure. They concluded that there is statistically significant evidence for an epistatic interaction between at least one of the chromosome 5 SNPs and the IL4R locus. However, the authors’ permutation procedure does not have the desired statistical property—that is, it rejects the global null hypothesis of no interaction too often when none of the estimated interaction parameters differ from their null value. In this letter, I discuss why their procedure fails, present several alternatives, and compare the performance of these alternatives in a small simulation study. The procedure presented by Bugawan et al. (2003) amounts to plugging the order statistics for the observed p values, p(1), …, p(5), into their joint cumulative distribution function under the null: p = F0(p(1), …, p(5)) = Pr(P(1) ⩽ p(1), …, P(5) ⩽ p(5)). (Here, italicized uppercase letters refer to random variables, and lowercase letters refer to observed values of the corresponding variables. This differs from the notation in the Bugawan et al. [2003] article.) The authors estimate F0 by permuting case-control labels 200 times and calculating the ordered p values for each permutation. A simple example shows that this approach is inappropriate. Consider the p values from two independent tests, P1 and P2. If we assume a large enough sample size, P1 and P2 are independently uniform on (0,1) under the null, and, hence, the cumulative distribution function for the associated order statistics, F0(p(1), p(2)), is p(1)(2p(2) − p(1)) (Bickel and Doksum 1977). The distribution of P = F0(P(1), P(2)) under the global null is shown in figure 1a. P does not have a uniform distribution under the null, as we expect for a p value. In this case, a test that rejects the global null hypothesis that both tests are null when P < .05 would have a type I error rate between 10% and 15%. As shown in figure 1b, the magnitude of the type I error rate increases as the number of independent tests increases.
[Figure 1: Density of global p values for the multiple-comparisons procedure used by Bugawan et al. (2003) under the global null hypothesis, for two independent tests (a) and three independent tests (b).]
There are several alternative, theoretically justified and simple procedures that correct for multiple comparisons, besides the notoriously conservative Bonferroni correction. Simes’s test (Simes 1986), for example, controls the overall significance level (also known as the “familywise error rate”) when the tests are independent or exhibit a special type of dependence (Sarkar 1998). Simes’s test rejects the global null hypothesis that all K test-specific null hypotheses are true if p(k) ⩽ αk/K for any k in 1, …, K. Simulation results reported in table 1 suggest that Simes’s test has the appropriate false-positive rate, even when the tests are correlated.
[Table 1: Observed false-positive rates (false-discovery rates) for procedures with nominal 5% rates in the context of testing five possible gene × gene interactions, calculated from 500 simulated data sets.]
Other approaches with particular appeal in the context of multiple-gene and multiple-environmental-factor studies aim to control the false-discovery rate—that is, the expected proportion of rejected null hypotheses that are falsely rejected. This approach is particularly useful when a portion of the null hypotheses can be assumed false, as in microarray studies. Devlin et al. (2003) recently proposed a variant of the Benjamini and Hochberg (1995) step-up procedure that controls the false-discovery rate when testing a large number of possible gene × gene interactions in multilocus association studies. The Benjamini and Hochberg procedure is related to Simes’s test; setting k* = max k such that p(k) ⩽ αk/K, it rejects all k* null hypotheses corresponding to p(1), …, p(k*). In fact, the Benjamini and Hochberg procedure reduces to Simes’s test when all null hypotheses are true (Benjamini and Yekutieli 2001). Devlin et al.’s (2003) proof for the validity of their false-discovery-rate procedure requires that the analyzed genes be statistically independent. This is not the case for the IL4 and IL13 SNPs studied by Bugawan et al. (2003), but the simulation results in table 1 suggest that Devlin et al.’s (2003) procedure controls the false-discovery rate even when the analyzed genes are correlated. The p values reported in table 7 of Bugawan et al. (2003) do not lead to any significant results at the .05 level when any of the alternative procedures discussed here are used. Clearly, effective methods are needed for adjusting for multiple comparisons when testing for association between multiple factors and complex disease. On the one hand, blithely reporting any results marginally “significant” at the .05 level or relying on outdated and ill-performing stepwise model-building procedures (see, e.g., Burnham and Anderson [2002] and Devlin et al. [2003]) will lead to spurious results, expensive follow-up studies with little chance of replication, and confusion. On the other hand, overly conservative procedures will create missed opportunities. Although the procedures discussed here are known to control the familywise error rate or false-discovery rate in particular situations (e.g., independent covariates), their performance in more general situations needs further investigation.
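A minimal sketch of Simes's global test as stated above (illustrative p-values, generic implementation): the global null is rejected when any ordered p-value p(k) falls at or below αk/K, equivalently when the Simes-adjusted global p-value is at most α.

```python
# Minimal sketch of Simes's (1986) global test; the five p-values are invented.
import numpy as np

def simes_global_p(pvals):
    """Simes-adjusted global p-value: min over k of p_(k) * K / k."""
    p = np.sort(np.asarray(pvals))
    K = len(p)
    return (p * K / np.arange(1, K + 1)).min()

pvals = [0.011, 0.020, 0.030, 0.040, 0.250]
print(simes_global_p(pvals))   # 0.05: borderline at the .05 level
```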

Journal ArticleDOI
TL;DR: In this article, the authors adapt O'Brien's rank-sum test, originally developed for comparisons of independent groups, to studies involving paired data and demonstrate that it too has power advantages for alternatives in which the treatment improves the outcome on most or all of the endpoints.
Abstract: Clinical trials and other types of studies often examine the effects of a particular treatment or experimental condition on a number of different response variables. Although the usual approach for analysing such data is to examine each variable separately, this can increase the chance of false positive findings. Bonferroni's inequality or Hotelling's T2 statistic can be employed to control the overall type I error rate, but these tests generally lack power for alternatives in which the treatment improves the outcome on most or all of the endpoints. For the comparison of independent groups, O'Brien (1984) developed a rank-sum type test that has greater power than the Bonferroni and T2 procedures when one treatment is uniformly better (i.e. for all endpoints) than the other treatment(s). In this paper we adapt the rank-sum test to studies involving paired data and demonstrate that it, too, has power advantages for such alternatives. Simulation results are described, and an example from a study measuring th...
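The paired-data adaptation sits behind the truncated abstract, so the sketch below shows only the independent-groups version of O'Brien's rank-sum test as the abstract describes it: rank each endpoint across all subjects, sum each subject's ranks, and compare the per-subject rank sums between groups with an ordinary t test. Data are simulated so that one treatment is uniformly slightly better.

```python
# Sketch of the independent-groups version of O'Brien's (1984) rank-sum test;
# the paper's paired adaptation is not reproduced. Data are simulated.
import numpy as np
from scipy import stats

def obrien_rank_sum(x, y):
    """x: (n1, k) endpoint matrix for group 1; y: (n2, k) for group 2."""
    combined = np.vstack([x, y])
    ranks = np.apply_along_axis(stats.rankdata, 0, combined)  # rank per endpoint
    scores = ranks.sum(axis=1)                                # one score per subject
    return stats.ttest_ind(scores[:len(x)], scores[len(x):])

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1, size=(30, 5))   # treatment: uniformly slightly better
y = rng.normal(0.0, 1, size=(30, 5))   # control
print(obrien_rank_sum(x, y))
```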

01 Jan 2004
TL;DR: In this paper, some methods are presented that control the probability of making at least one type I error when testing more than one hypothesis at the same time, no matter how many (or which) of the hypotheses are false.
Abstract: When testing more than one hypothesis at the same time, the probability of making at least one type I error increases. In this paper some methods are presented that keep that probability below the desired overall level α, no matter how many (or which) of the hypotheses are false. The paper deals mainly with methods based on marginal p-values; these methods are of considerable practical importance. Sample size calculations are illustrated for two of the methods, the Bonferroni procedure and the Bonferroni-Holm procedure.

01 Jan 2004
TL;DR: This document provides a tutorial for using the multtest package, a collection of functions for multiple hypothesis testing that can be used to identify differentially expressed genes in microarray experiments.
Abstract: Overview: The multtest package contains a collection of functions for multiple hypothesis testing. These functions can be used to identify differentially expressed genes in microarray experiments, i.e., genes whose expression levels are associated with a response or covariate of interest. This document provides a tutorial for using the multtest package; for a detailed introduction to multiple testing, consult the document multtest.intro in the inst/doc directory of the package, and see Shaffer (1995) and Dudoit et al. (2002) for a review of multiple testing procedures and complete references. The multtest package implements multiple testing procedures for controlling different Type I error rates, including procedures for controlling the family-wise Type I error rate (FWER): Bonferroni, Hochberg (1988), Holm (1979), …

Journal ArticleDOI
TL;DR: It is shown that permutation testing can be used to obtain a desired false-positive error rate and, moreover, that such an approach has the added advantage of providing additional protection against false claims of nonsignificance, or type II error.
Abstract: To the Editor: Our study (Bugawan et al. 2003) reported a negative association of a specific IL4-524 haplotype with type 1 diabetes (T1D), consistent with a previous report (Mirel et al. 2002), and presented evidence for a genetic interaction between IL4-524 and IL4R SNPs. To test the latter, we computed relevant P values by permuting multilocus genotypes separately in case and control groups. The criticism raised by Kraft (2004 [in this issue]) is not directed at our implementation of permutation testing, per se, but at permutation testing in general. His argument is that permutation testing does not properly account for multiple comparisons, resulting in an increase in false claims of significance, or type I familywise error (FWE). In the place of permutation testing, Kraft advocates the use of the Simes method—an elaboration of the classic Bonferroni procedure. In response, we wish to show that permutation testing can be used to obtain a desired false-positive error rate (as, indeed, can be demonstrated using Kraft’s example) and, moreover, that such an approach has the added advantage of providing additional protection against false claims of nonsignificance, or type II error. It should be noted that permutation methods are well established as a robust approach for obtaining overall significance levels while minimizing type II error (e.g., Good 1994; Doerge and Churchill 1996; Lynch and Walsh 1998), that such methods are extensible to multiple-testing scenarios (Westfall and Young 1993), and that examples of their application to human genetics are not uncommon (e.g., Lewis et al. 2003). However, as with any statistical method, the validity is dependent on correct application. Kraft provides an analysis of the permutation testing by discussing the distribution of two P values obtained from hypothetically permuted distributions (i.e., independent and uniformly distributed under the null hypothesis). The joint cumulative distribution function (CDF) for these two P values is given as F(P(1), P(2)) = P(1)(2P(2) − P(1)), where P(1) and P(2) are, respectively, the first- and second-ordered P values.
As such, Kraft notes that Pr(P < .05) for this joint distribution is ∼0.1, indicating that we would expect to see the smaller P value, or P(1) < .05, about 10% of the time. Kraft’s argument, therefore, is that for independent tests, use of a critical value of .05 leads to a type I error rate of 10%. In fact, the proper approach for permutation testing—adjusted or unadjusted for multiple comparisons—is to find the critical value corresponding to the desired type I error rate. Specifically, if we consider the simulations presented by Kraft as equivalent to the result of a permutation test, we would seek the value of x in the permuted distribution for which Pr(P …
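The letter's truncated final point can be illustrated numerically: under permutation, one seeks the critical value x with Pr(P(1) ⩽ x) equal to the desired α, rather than comparing the smallest p-value with α itself. A sketch under the assumption of K independent uniform p-values (where the answer is the Šidák point 1 − (1 − α)^(1/K)):

```python
# Numerical illustration, assuming K = 5 independent uniform p-values under
# the global null: the critical value for the smallest p-value is ~0.0102
# (the Sidak point), not 0.05.
import numpy as np

rng = np.random.default_rng(3)
K, alpha = 5, 0.05
min_p = rng.uniform(size=(100_000, K)).min(axis=1)   # null distribution of P_(1)
print(np.quantile(min_p, alpha))      # ~0.0102: permutation-style critical value
print(1 - (1 - alpha) ** (1 / K))     # 0.0102...: closed-form Sidak point
print((min_p <= alpha).mean())        # ~0.23: a naive 0.05 cut-off inflates FWE
```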

Book ChapterDOI
01 Jan 2004
TL;DR: Two linear mixed models for the normal mouse data are considered; the models agree that array variance is much larger than other sources of variability but differ somewhat in their lists of genes exhibiting the most significant mouse effects.
Abstract: We consider two linear mixed models for the normal mouse data [Pritchard et al., 2001]. One models the log2 intensity measurements directly and the other models the log2 ratios. In each approach, we treat a mouse as a fixed effect, and alternatively, we also model it as a random effect to assess its variability directly. We compare the results from these mixed model approaches. The models agree that array variance is much larger than other sources of variability, but differ somewhat in their lists of genes exhibiting the most significant mouse effects. Under a Bonferroni criterion, the ratio-based model we consider produces more genes with significant mouse effects than the intensity-based model, but fewer genes with significant tissue effects. Both models demonstrate a general statistical framework for concurrently estimating sources of variability and assessing their significance.
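As a rough illustration of the two modelling choices (not the chapter's actual models or data; all column names and the toy data are fabricated), statsmodels can fit a linear mixed model with mouse as a fixed effect or as a random grouping factor:

```python
# Toy sketch of the two approaches: mouse as a fixed effect (grouping by
# array) versus mouse as a random effect. Data and names are fabricated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "log2_intensity": rng.normal(8, 1, 120),
    "tissue": np.tile(["kidney", "liver", "testis"], 40),
    "mouse": np.repeat([f"m{i}" for i in range(6)], 20),
    "array": np.repeat([f"a{i}" for i in range(12)], 10),
})

# Mouse as a fixed effect, with array as the random grouping factor:
fixed = smf.mixedlm("log2_intensity ~ tissue + mouse", df, groups="array").fit()

# Mouse as a random effect instead, to estimate its variance component directly:
random = smf.mixedlm("log2_intensity ~ tissue", df, groups="mouse").fit()
print(fixed.summary())
print(random.summary())
```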

01 Jan 2004
TL;DR: It is concluded that, first, there are no differences in the performance of the FM-100 test; however, application of the Bonferroni correction does work in favour of accepting the null hypothesis.
Abstract: [Table: Summary of results for the fair, medium and dark groups, reported as mean (standard deviation); no significant difference was found between the groups (ANOVA).]
It has been cautioned previously that the popular procedure of analysing dTES should not be embarked on without first considering the distribution of the data. The distribution of the TES for the population as a whole had a skewness of +0.59. For the dTES transformation, skewness was +0.01. A log10 transformation skewed the distribution in the opposite direction (skewness = −0.56). Since minimum skewness was found with the dTES data, all further comparisons between groups were made using dTES. Analysed using ANOVA or an unpaired t-test (equal variance, one tail) with Bonferroni correction, there were no statistically significant differences in any of the measures used. It must be concluded that, first, there are no differences in the performance of the FM-100 test; however, application of the Bonferroni correction does work in favour of accepting the null hypothesis. Second, the observation …