
Showing papers on "Statistical hypothesis testing published in 2004"


Journal ArticleDOI
TL;DR: The steps of model selection are outlined and several ways that it is now being implemented are highlighted, so that researchers in ecology and evolution will find a valuable alternative to traditional null hypothesis testing, especially when more than one hypothesis is plausible.
Abstract: Recently, researchers in several areas of ecology and evolution have begun to change the way in which they analyze data and make biological inferences. Rather than the traditional null hypothesis testing approach, they have adopted an approach called model selection, in which several competing hypotheses are simultaneously confronted with data. Model selection can be used to identify a single best model, thus lending support to one particular hypothesis, or it can be used to make inferences based on weighted support from a complete set of competing models. Model selection is widely accepted and well developed in certain fields, most notably in molecular systematics and mark-recapture analysis. However, it is now gaining support in several other areas, from molecular evolution to landscape ecology. Here, we outline the steps of model selection and highlight several ways that it is now being implemented. By adopting this approach, researchers in ecology and evolution will find a valuable alternative to traditional null hypothesis testing, especially when more than one hypothesis is plausible.
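
The "weighted support from a complete set of competing models" that the abstract mentions is often implemented with information-criterion weights. The sketch below is a minimal, generic illustration using Akaike weights; it assumes AIC-based model selection (the abstract does not name a specific criterion) and the AIC values are hypothetical.

```python
import numpy as np

def akaike_weights(aic_values):
    """Convert a set of AIC scores into Akaike weights (relative model support)."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()                 # AIC differences from the best model
    rel_likelihood = np.exp(-0.5 * delta)   # relative likelihood of each model
    return rel_likelihood / rel_likelihood.sum()

# Hypothetical AIC scores for three competing hypotheses
print(akaike_weights([210.4, 212.1, 219.8]))  # weights sum to 1; larger = more support
```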

3,489 citations


Journal ArticleDOI
TL;DR: The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power of less than 20% to detect a small effect and power of less than 50% to detect a medium effect existed.
Abstract: Recently, Jennions and Moller (2003) carried out a meta-analysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (H0) significance test is the probability that the test will reject H0 when a research hypothesis (HA) is true. Knowledge of effect size is particularly important for statistical power analysis (for statistical power analysis, see Cohen, 1988; Nakagawa and Foster, in press). There are many kinds of effect size measures available (e.g., Pearson’s r, Cohen’s d, Hedges’s g), but most of these fall into one of two major types, namely the r family and the d family (Rosenthal, 1994). The r family shows the strength of relationship between two variables while the d family shows the size of difference between two variables. As a benchmark for research planning and evaluation, Cohen (1988) proposed ‘conventional’ values for small, medium, and large effects: r = .10, .30, and .50 and d = .20, .50, and .80, respectively (in the way that p values of .05, .01, and .001 are conventional points, although these conventional values of effect size have been criticized; e.g., Rosenthal et al., 2000). The meta-analysis on statistical power by Jennions and Moller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power of less than 20% to detect a small effect and power of less than 50% to detect a medium effect existed. This means, for example, that the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or β) (i.e., not rejecting H0 when H0 is false; note that statistical power equals 1 − β) than if they had flipped a coin, when an experimental effect is of medium size (i.e., r = .30, d = .50). Here, I highlight and discuss an implication of this low statistical power for one of the most widely used statistical procedures, Bonferroni correction (Cabin and Mitchell, 2000). Bonferroni corrections are employed to reduce Type I errors (i.e., rejecting H0 when H0 is true) when multiple tests or comparisons are conducted. Two kinds of Bonferroni procedures are commonly used. One is the standard Bonferroni procedure, where a modified significance criterion (α/k, where k is the number of statistical tests conducted on given data) is used. The other is the sequential Bonferroni procedure, which was introduced by Holm (1979) and popularized in the field of ecology and evolution by Rice (1989) (see these papers for the procedure). For example, in a recent volume of Behavioral Ecology (vol. 13, 2002), nearly one-fifth of papers (23 out of 117) included Bonferroni corrections. Twelve articles employed the standard procedure while 11 articles employed the sequential procedure (10 citing Rice, 1989, and one citing Holm, 1979). A serious problem associated with the standard Bonferroni procedure is a substantial reduction in the statistical power of rejecting an incorrect H0 in each test (e.g., Holm, 1979; Perneger, 1998; Rice, 1989). The sequential Bonferroni procedure also incurs a reduction in power, but to a lesser extent (which is the reason that the sequential procedure is used in preference by some researchers; Moran, 2003).
Thus, both procedures exacerbate the existing problem of low power identified by Jennions and Moller (2003). For example, suppose an experiment where both an experimental group and a control group consist of 30 subjects. After an experimental period, we measure five different variables and conduct a series of t tests on each variable. Even prior to applying Bonferroni corrections, the statistical power of each test to detect a medium effect is 61% (α = .05), which is less than a recommended acceptable 80% level (Cohen, 1988). In the field of behavioral ecology and animal behavior, it is usually difficult to use large sample sizes (in many cases, n < 30) because of practical and ethical reasons (see Still, 1992). When standard Bonferroni corrections are applied, the statistical power of each t test drops to as low as 33% (to detect a medium effect at α/5 = .01). Although sequential Bonferroni corrections do not reduce the power of the tests to the same extent, on average (33–61% per t test), the probability of making a Type II error for some of the tests (β = 1 − power, so 39–66%) remains unacceptably high. Furthermore, statistical power would be even lower if we measured more than five variables or if we were interested in detecting a small effect. Bonferroni procedures appear to raise another set of problems. There is no formal consensus for when Bonferroni procedures should be used, even among statisticians (Perneger, 1998). It seems, in some cases, that Bonferroni corrections are applied only when their results remain significant. Some researchers may think that their results are ‘more significant’ if the results pass the rigor of Bonferroni corrections, although this is logically incorrect (Cohen, 1990, 1994; Yoccoz, 1991). Many researchers are already reluctant to report nonsignificant results (Jennions and Moller, 2002a,b). The wide use of Bonferroni procedures may be aggravating the tendency of researchers not to present nonsignificant results, because presentation of more tests with nonsignificant results may make previously ‘significant’ results ‘nonsignificant’ under Bonferroni procedures. The more detailed research (i.e., research measuring more variables) researchers do, the less probability they have of finding significant results. Moran (2003) recently named this paradox a hyper-Red Queen phenomenon (see the paper for more discussion on problems with the sequential method). Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had ‘appropriately’ employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is ...
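
As a rough numerical check of the power figures quoted above, the sketch below computes the power of an independent two-sample t test from the noncentral t distribution. It assumes 30 subjects per group, a medium effect (d = 0.5), and one-tailed tests — an assumption I am making because it appears to reproduce the quoted 61% and 33%; it is an illustration, not the author's own calculation.

```python
import numpy as np
from scipy import stats

def t_test_power(d, n_per_group, alpha, two_sided=True):
    """Power of an independent two-sample t test for a true effect size d (Cohen's d)."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)                 # noncentrality parameter
    if two_sided:
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
    t_crit = stats.t.ppf(1 - alpha, df)
    return 1 - stats.nct.cdf(t_crit, df, ncp)

# Medium effect (d = 0.5), n = 30 per group, five variables tested
print(t_test_power(0.5, 30, 0.05, two_sided=False))       # about 0.61 before correction
print(t_test_power(0.5, 30, 0.05 / 5, two_sided=False))   # about 0.33 after a standard Bonferroni correction
```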

1,996 citations


Journal ArticleDOI
TL;DR: In this article, it is shown that the goals of the two approaches are essentially equivalent, and that the FDR point estimates can be used to define valid FDR controlling procedures in both finite sample and asymptotic settings.
Abstract: Summary. The false discovery rate (FDR) is a multiple hypothesis testing quantity that describes the expected proportion of false positive results among all rejected null hypotheses. Benjamini and Hochberg introduced this quantity and proved that a particular step-up p-value method controls the FDR. Storey introduced a point estimate of the FDR for fixed significance regions. The former approach conservatively controls the FDR at a fixed predetermined level, and the latter provides a conservatively biased estimate of the FDR for a fixed predetermined significance region. In this work, we show in both finite sample and asymptotic settings that the goals of the two approaches are essentially equivalent. In particular, the FDR point estimates can be used to define valid FDR controlling procedures. In the asymptotic setting, we also show that the point estimates can be used to estimate the FDR conservatively over all significance regions simultaneously, which is equivalent to controlling the FDR at all levels simultaneously. The main tool that we use is to translate existing FDR methods into procedures involving empirical processes. This simplifies finite sample proofs, provides a framework for asymptotic results and proves that these procedures are valid even under certain forms of dependence.
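
The two ingredients discussed above can be sketched in a few lines: the Benjamini–Hochberg step-up procedure for a fixed level, and a Storey-type point estimate of the FDR for a fixed significance region with tuning parameter λ. This is a simplified illustration with simulated p-values, not the empirical-process formulation used in the paper.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean rejection mask."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()        # largest index meeting its threshold
        reject[order[:k + 1]] = True
    return reject

def storey_fdr_estimate(pvals, t, lam=0.5):
    """Storey-type point estimate of the FDR for the fixed significance region p <= t."""
    p = np.asarray(pvals)
    m = len(p)
    pi0_hat = (p > lam).sum() / (m * (1 - lam))     # estimated proportion of true nulls
    n_rejected = max((p <= t).sum(), 1)
    return min(pi0_hat * m * t / n_rejected, 1.0)

pvals = np.random.default_rng(0).uniform(size=1000)   # hypothetical p-values
print(benjamini_hochberg(pvals).sum(), storey_fdr_estimate(pvals, t=0.05))
```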

1,413 citations


Book
15 Jul 2004
TL;DR: In this book, the authors present statistical methods for spatial public health data, covering risk estimation, disease mapping, and cluster detection in the context of geographic information systems (GISs).
Abstract: Preface.Acknowledgments.1 Introduction.1.1 Why Spatial Data in Public Health?1.2 Why Statistical Methods for Spatial Data?1.3 Intersection of Three Fields of Study.1.4 Organization of the Book.2 Analyzing Public Health Data.2.1 Observational vs. Experimental Data.2.2 Risk and Rates.2.2.1 Incidence and Prevalence.2.2.2 Risk.2.2.3 Estimating Risk: Rates and Proportions.2.2.4 Relative and Attributable Risks.2.3 Making Rates Comparable: Standardized Rates.2.3.1 Direct Standardization.2.3.2 Indirect Standardization.2.3.3 Direct or Indirect?2.3.4 Standardizing to What Standard?2.3.5 Cautions with Standardized Rates.2.4 Basic Epidemiological Study Designs.2.4.1 Prospective Cohort Studies.2.4.2 Retrospective Case-Control Studies.2.4.3 Other Types of Epidemiological Studies.2.5 Basic Analytic Tool: The Odds Ratio.2.6 Modeling Counts and Rates.2.6.1 Generalized Linear Models.2.6.2 Logistic Regression.2.6.3 Poisson Regression.2.7 Challenges in the Analysis of Observational Data.2.7.1 Bias.2.7.2 Confounding.2.7.3 Effect Modification.2.7.4 Ecological Inference and the Ecological Fallacy.2.8 Additional Topics and Further Reading.2.9 Exercises.3 Spatial Data.3.1 Components of Spatial Data.3.2 An Odyssey into Geodesy.3.2.1 Measuring Location: Geographical Coordinates.3.2.2 Flattening the Globe: Map Projections and Coordinate Systems.3.2.3 Mathematics of Location: Vector and Polygon Geometry.3.3 Sources of Spatial Data.3.3.1 Health Data.3.3.2 Census-Related Data.3.3.3 Geocoding.3.3.4 Digital Cartographic Data.3.3.5 Environmental and Natural Resource Data.3.3.6 Remotely Sensed Data.3.3.7 Digitizing.3.3.8 Collect Your Own!3.4 Geographic Information Systems.3.4.1 Vector and Raster GISs.3.4.2 Basic GIS Operations.3.4.3 Spatial Analysis within GIS.3.5 Problems with Spatial Data and GIS.3.5.1 Inaccurate and Incomplete Databases.3.5.2 Confidentiality.3.5.3 Use of ZIP Codes.3.5.4 Geocoding Issues.3.5.5 Location Uncertainty.4 Visualizing Spatial Data.4.1 Cartography: The Art and Science of Mapmaking.4.2 Types of Statistical Maps.MAP STUDY: Very Low Birth Weights in Georgia Health Care District 9.4.2.1 Maps for Point Features.4.2.2 Maps for Areal Features.4.3 Symbolization.4.3.1 Map Generalization.4.3.2 Visual Variables.4.3.3 Color.4.4 Mapping Smoothed Rates and Probabilities.4.4.1 Locally Weighted Averages.4.4.2 Nonparametric Regression.4.4.3 Empirical Bayes Smoothing.4.4.4 Probability Mapping.4.4.5 Practical Notes and Recommendations.CASE STUDY: Smoothing New York Leukemia Data.4.5 Modifiable Areal Unit Problem.4.6 Additional Topics and Further Reading.4.6.1 Visualization.4.6.2 Additional Types of Maps.4.6.3 Exploratory Spatial Data Analysis.4.6.4 Other Smoothing Approaches.4.6.5 Edge Effects.4.7 Exercises.5 Analysis of Spatial Point Patterns.5.1 Types of Patterns.5.2 Spatial Point Processes.5.2.1 Stationarity and Isotropy.5.2.2 Spatial Poisson Processes and CSR.5.2.3 Hypothesis Tests of CSR via Monte Carlo Methods.5.2.4 Heterogeneous Poisson Processes.5.2.5 Estimating Intensity Functions.DATA BREAK: Early Medieval Grave Sites.5.3 K Function.5.3.1 Estimating the K Function.5.3.2 Diagnostic Plots Based on the K Function.5.3.3 Monte Carlo Assessments of CSR Based on the K Function.DATA BREAK: Early Medieval Grave Sites.5.3.4 Roles of First- and Second-Order Properties.5.4 Other Spatial Point Processes.5.4.1 Poisson Cluster Processes.5.4.2 Contagion/Inhibition Processes.5.4.3 Cox Processes.5.4.4 Distinguishing Processes.5.5 Additional Topics and Further Reading.5.6 Exercises.6 Spatial Clusters of Health 
Events: Point Data for Cases and Controls.6.1 What Do We Have? Data Types and Related Issues.6.2 What Do We Want? Null and Alternative Hypotheses.6.3 Categorization of Methods.6.4 Comparing Point Process Summaries.6.4.1 Goals.6.4.2 Assumptions and Typical Output.6.4.3 Method: Ratio of Kernel Intensity Estimates.DATA BREAK: Early Medieval Grave Sites.6.4.4 Method: Difference between K Functions.DATA BREAK: Early Medieval Grave Sites.6.5 Scanning Local Rates.6.5.1 Goals.6.5.2 Assumptions and Typical Output.6.5.3 Method: Geographical Analysis Machine.6.5.4 Method: Overlapping Local Case Proportions.DATA BREAK: Early Medieval Grave Sites.6.5.5 Method: Spatial Scan Statistics.DATA BREAK: Early Medieval Grave Sites.6.6 Nearest-Neighbor Statistics.6.6.1 Goals.6.6.2 Assumptions and Typical Output.6.6.3 Method: q Nearest Neighbors of Cases.CASE STUDY: San Diego Asthma.6.7 Further Reading.6.8 Exercises.7 Spatial Clustering of Health Events: Regional Count Data.7.1 What Do We Have and What Do We Want?7.1.1 Data Structure.7.1.2 Null Hypotheses.7.1.3 Alternative Hypotheses.7.2 Categorization of Methods.7.3 Scanning Local Rates.7.3.1 Goals.7.3.2 Assumptions.7.3.3 Method: Overlapping Local Rates.DATA BREAK: New York Leukemia Data.7.3.4 Method: Turnbull et al.'s CEPP.7.3.5 Method: Besag and Newell Approach.7.3.6 Method: Spatial Scan Statistics.7.4 Global Indexes of Spatial Autocorrelation.7.4.1 Goals.7.4.2 Assumptions and Typical Output.7.4.3 Method: Moran's I .7.4.4 Method: Geary's c.7.5 Local Indicators of Spatial Association.7.5.1 Goals.7.5.2 Assumptions and Typical Output.7.5.3 Method: Local Moran's I.7.6 Goodness-of-Fit Statistics.7.6.1 Goals.7.6.2 Assumptions and Typical Output.7.6.3 Method: Pearson's chi2.7.6.4 Method: Tango's Index.7.6.5 Method: Focused Score Tests of Trend.7.7 Statistical Power and Related Considerations.7.7.1 Power Depends on the Alternative Hypothesis.7.7.2 Power Depends on the Data Structure.7.7.3 Theoretical Assessment of Power.7.7.4 Monte Carlo Assessment of Power.7.7.5 Benchmark Data and Conditional Power Assessments.7.8 Additional Topics and Further Reading.7.8.1 Related Research Regarding Indexes of Spatial Association.7.8.2 Additional Approaches for Detecting Clusters and/or Clustering.7.8.3 Space-Time Clustering and Disease Surveillance.7.9 Exercises.8 Spatial Exposure Data.8.1 Random Fields and Stationarity.8.2 Semivariograms.8.2.1 Relationship to Covariance Function and Correlogram.8.2.2 Parametric Isotropic Semivariogram Models.8.2.3 Estimating the Semivariogram.DATA BREAK: Smoky Mountain pH Data.8.2.4 Fitting Semivariogram Models.8.2.5 Anisotropic Semivariogram Modeling.8.3 Interpolation and Spatial Prediction.8.3.1 Inverse-Distance Interpolation.8.3.2 Kriging.CASE STUDY: Hazardous Waste Site Remediation.8.4 Additional Topics and Further Reading.8.4.1 Erratic Experimental Semivariograms.8.4.2 Sampling Distribution of the Classical Semivariogram Estimator.8.4.3 Nonparametric Semivariogram Models.8.4.4 Kriging Non-Gaussian Data.8.4.5 Geostatistical Simulation.8.4.6 Use of Non-Euclidean Distances in Geostatistics.8.4.7 Spatial Sampling and Network Design.8.5 Exercises.9 Linking Spatial Exposure Data to Health Events.9.1 Linear Regression Models for Independent Data.9.1.1 Estimation and Inference.9.1.2 Interpretation and Use with Spatial Data.DATA BREAK: Raccoon Rabies in Connecticut.9.2 Linear Regression Models for Spatially Autocorrelated Data.9.2.1 Estimation and Inference.9.2.2 Interpretation and Use with Spatial Data.9.2.3 Predicting New Observations: Universal 
Kriging.DATA BREAK: New York Leukemia Data.9.3 Spatial Autoregressive Models.9.3.1 Simultaneous Autoregressive Models.9.3.2 Conditional Autoregressive Models.9.3.3 Concluding Remarks on Conditional Autoregressions.9.3.4 Concluding Remarks on Spatial Autoregressions.9.4 Generalized Linear Models.9.4.1 Fixed Effects and the Marginal Specification.9.4.2 Mixed Models and Conditional Specification.9.4.3 Estimation in Spatial GLMs and GLMMs.DATA BREAK: Modeling Lip Cancer Morbidity in Scotland.9.4.4 Additional Considerations in Spatial GLMs.CASE STUDY: Very Low Birth Weights in Georgia Health Care District 9.9.5 Bayesian Models for Disease Mapping.9.5.1 Hierarchical Structure.9.5.2 Estimation and Inference.9.5.3 Interpretation and Use with Spatial Data.9.6 Parting Thoughts.9.7 Additional Topics and Further Reading.9.7.1 General References.9.7.2 Restricted Maximum Likelihood Estimation.9.7.3 Residual Analysis with Spatially Correlated Error Terms.9.7.4 Two-Parameter Autoregressive Models.9.7.5 Non-Gaussian Spatial Autoregressive Models.9.7.6 Classical/Bayesian GLMMs.9.7.7 Prediction with GLMs.9.7.8 Bayesian Hierarchical Models for Spatial Data.9.8 Exercises.References.Author Index.Subject Index.

1,134 citations


Journal ArticleDOI
TL;DR: In this article, the authors discuss statistically rigorous approaches for comparing the accuracy of thematic maps derived by image classification in remote sensing studies, noting that the conventional comparison of kappa coefficients assumes independent samples, an assumption commonly violated when the same sample of ground data sites is used for each map.
Abstract: The accuracy of thematic maps derived by image classification analyses is often compared in remote sensing studies. This comparison is typically achieved by a basic subjective assessment of the observed difference in accuracy but should be undertaken in a statistically rigorous fashion. One approach for the evaluation of the statistical significance of a difference in map accuracy that has been widely used in remote sensing research is based on the comparison of the kappa coefficient of agreement derived for each map. The conventional approach to the comparison of kappa coefficients assumes that the samples used in their calculation are independent, an assumption that is commonly unsatisfied because the same sample of ground data sites is often used for each map. Alternative methods to evaluate the statistical significance of differences in accuracy are available for both related and independent samples. Approaches for map comparison based on the kappa coefficient and proportion of correctly allocated cases, the two most widely used metrics of thematic map accuracy in remote sensing, are discussed. An example illustrates how classifications based on the same sample of ground data sites may be compared rigorously and highlights the importance of distinguishing between one- and two-sided statistical tests in the comparison of classification accuracy statements.
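
One standard option for related samples (the same ground-data sites classified by two methods) is McNemar's test on the cross-tabulation of correct/incorrect allocations. The sketch below illustrates that idea with hypothetical counts; it is a common related-samples test, not necessarily the exact procedure worked through in the paper.

```python
from scipy import stats

# Hypothetical discordant counts for the same ground-data sites under two classifications:
# b = sites correct in map 1 but wrong in map 2; c = wrong in map 1 but correct in map 2
b, c = 34, 18

# McNemar's chi-square with continuity correction (two-sided); for a one-sided
# comparison of accuracy, evaluate the direction of the difference as well.
chi2 = (abs(b - c) - 1) ** 2 / (b + c)
p_two_sided = stats.chi2.sf(chi2, df=1)
print(chi2, p_two_sided)
```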

1,003 citations


BookDOI
01 Jan 2004
TL;DR: Beyond Significance Testing provides integrative and clear presentations about the limitations of statistical tests and reviews alternative methods of data analysis, such as effect size estimation (at both the group and case levels) and interval estimation (i.e., confidence intervals).
Abstract: Practices of data analysis in psychology and related disciplines are changing. This is evident in the longstanding controversy about statistical tests in the behavioral sciences and the increasing number of journals requiring effect size information. Beyond Significance Testing offers integrative and clear presentations about the limitations of statistical tests and reviews alternative methods of data analysis, such as effect size estimation (at both the group and case levels) and interval estimation (i.e., confidence intervals). Written in a clear and accessible style, the book is intended for applied researchers and students who may not have strong quantitative backgrounds. Readers will learn how to measure effect size on continuous or dichotomous outcomes in comparative studies with independent or dependent samples. They will also learn how to calculate and correctly interpret confidence intervals for effect sizes. Numerous research examples from a wide range of areas illustrate the application of these principles and how to estimate substantive significance instead of just statistical significance.
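
A minimal sketch of the kind of effect size and interval estimate the book discusses, for two independent groups, using the common large-sample approximation to the standard error of Cohen's d. The data are simulated and illustrative only.

```python
import numpy as np
from scipy import stats

def cohens_d_ci(x, y, conf=0.95):
    """Cohen's d for two independent samples with an approximate confidence interval."""
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    d = (np.mean(x) - np.mean(y)) / sp
    se = np.sqrt((nx + ny) / (nx * ny) + d**2 / (2 * (nx + ny)))   # large-sample approximation
    z = stats.norm.ppf(0.5 + conf / 2)
    return d, (d - z * se, d + z * se)

rng = np.random.default_rng(1)
print(cohens_d_ci(rng.normal(0.5, 1, 40), rng.normal(0.0, 1, 40)))
```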

924 citations


Journal ArticleDOI
TL;DR: In this paper, higher criticism is used to test whether n normal means are all zero versus the alternative that a small fraction of the means is nonzero, and it is shown that higher criticism works well over a range of non-Gaussian cases.
Abstract: Higher criticism, or second-level significance testing, is a multiple-comparisons concept mentioned in passing by Tukey. It concerns a situation where there are many independent tests of significance and one is interested in rejecting the joint null hypothesis. Tukey suggested comparing the fraction of observed significances at a given α-level to the expected fraction under the joint null. In fact, he suggested standardizing the difference of the two quantities and forming a z-score; the resulting z-score tests the significance of the body of significance tests. We consider a generalization, where we maximize this z-score over a range of significance levels 0 < α ≤ α0. We are able to show that the resulting higher criticism statistic is effective at resolving a very subtle testing problem: testing whether n normal means are all zero versus the alternative that a small fraction is nonzero. The subtlety of this “sparse normal means” testing problem can be seen from work of Ingster and Jin, who studied such problems in great detail. In their studies, they identified an interesting range of cases where the small fraction of nonzero means is so small that the alternative hypothesis exhibits little noticeable effect on the distribution of the p-values either for the bulk of the tests or for the few most highly significant tests. In this range, when the amplitude of nonzero means is calibrated with the fraction of nonzero means, the likelihood ratio test for a precisely specified alternative would still succeed in separating the two hypotheses. We show that the higher criticism is successful throughout the same region of amplitude sparsity where the likelihood ratio test would succeed. Since it does not require a specification of the alternative, this shows that higher criticism is in a sense optimally adaptive to unknown sparsity and size of the nonnull effects. While our theoretical work is largely asymptotic, we provide simulations in finite samples and suggest some possible applications. We also show that higher criticism works well over a range of non-Gaussian cases.
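
The statistic itself is short to write down: standardize the gap between the observed and expected fraction of significances at each level and maximize over 0 < α ≤ α0. The sketch below is a simplified version on simulated z-scores; the restriction to p ≥ 1/n is the commonly used "HC+" variant that avoids instability at tiny p-values, an assumption on my part rather than a reproduction of the paper's exact definition.

```python
import numpy as np
from scipy import stats

def higher_criticism(pvals, alpha0=0.5):
    """Higher criticism: maximal standardized exceedance of the observed over the
    expected fraction of significant tests, over levels 0 < alpha <= alpha0."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    keep = (i <= int(alpha0 * n)) & (p >= 1.0 / n)   # "HC+" style restriction
    return hc[keep].max()

rng = np.random.default_rng(0)
z_null = rng.normal(size=10000)
z_alt = z_null.copy()
z_alt[:50] += 3.0                                    # a small fraction of nonzero means
p_null = 2 * stats.norm.sf(np.abs(z_null))
p_alt = 2 * stats.norm.sf(np.abs(z_alt))
print(higher_criticism(p_null), higher_criticism(p_alt))  # larger under the sparse alternative
```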

812 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose unit root tests for large n and T panels in which the cross-sectional units are correlated and derive their asymptotic distribution under the null hypothesis of a unit root and local alternatives.

717 citations


Journal ArticleDOI
TL;DR: When a statistical equation incorporates a multiplicative term in an attempt to model interaction effects, the statistical significance of the lower-order coefficients is largely useless for the typical purposes of hypothesis testing.
Abstract: When a statistical equation incorporates a multiplicative term in an attempt to model interaction effects, the statistical significance of the lower-order coefficients is largely useless for the typical purposes of hypothesis testing. This fact remains largely unappreciated in political science, however. This brief article explains this point, provides examples, and offers some suggestions for more meaningful interpretation.
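
A minimal illustration of the point, under an assumed model y = b0 + b1*x + b2*z + b3*x*z with simulated data: the quantity of interest is the marginal effect of x at a given value of z, b1 + b3*z, with its conditional standard error, rather than the significance of b1 alone. The data and coefficient values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x, z = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.2 * x + 0.5 * z + 0.8 * x * z + rng.normal(size=n)   # simulated data

X = sm.add_constant(np.column_stack([x, z, x * z]))
fit = sm.OLS(y, X).fit()
b, V = fit.params, fit.cov_params()

# Marginal effect of x at chosen values of z, with its conditional standard error
for z0 in (-1.0, 0.0, 1.0):
    effect = b[1] + b[3] * z0
    se = np.sqrt(V[1, 1] + z0**2 * V[3, 3] + 2 * z0 * V[1, 3])
    print(f"z = {z0:+.1f}: dE[y]/dx = {effect:.2f} (SE {se:.2f})")
```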

717 citations


Journal ArticleDOI
TL;DR: NCPA performed appropriately in these simulated samples and was not prone to a high rate of false positives under sampling assumptions that typify real data sets, and cross‐validation using multiple DNA regions is shown to be a powerful method of minimizing inference errors.
Abstract: Nested clade phylogeographical analysis (NCPA) has become a common tool in intraspecific phylogeography. To evaluate the validity of its inferences, NCPA was applied to actual data sets with 150 strong a priori expectations, the majority of which had not been analysed previously by NCPA. NCPA did well overall, but it sometimes failed to detect an expected event and less commonly resulted in a false positive. An examination of these errors suggested some alterations in the NCPA inference key, and these modifications reduce the incidence of false positives at the cost of a slight reduction in power. Moreover, NCPA does equally well in inferring events regardless of the presence or absence of other, unrelated events. A reanalysis of some recent computer simulations that are seemingly discordant with these results revealed that NCPA performed appropriately in these simulated samples and was not prone to a high rate of false positives under sampling assumptions that typify real data sets. NCPA makes a posteriori use of an explicit inference key for biological interpretation after statistical hypothesis testing. Alternatives to NCPA that claim that biological inference emerges directly from statistical testing are shown in fact to use an a priori inference key, albeit implicitly. It is argued that the a priori and a posteriori approaches to intraspecific phylogeography are complementary, not contradictory. Finally, cross-validation using multiple DNA regions is shown to be a powerful method of minimizing inference errors. A likelihood ratio hypothesis testing framework has been developed that allows testing of phylogeographical hypotheses, extends NCPA to testing specific hypotheses not within the formal inference key (such as the out-of-Africa replacement hypothesis of recent human evolution) and integrates intra- and interspecific phylogeographical inference.

657 citations


Journal ArticleDOI
TL;DR: This article describes how to compute the statistical power of both fixed- and mixed-effects moderator tests in meta-analysis, which are analogous to analysis of variance and multiple regression analysis for effect sizes, and how to compute the power of goodness-of-fit tests associated with these models.
Abstract: Calculation of the statistical power of statistical tests is important in planning and interpreting the results of research studies, including meta-analyses. It is particularly important in moderator analyses in meta-analysis, which are often used as sensitivity analyses to rule out moderator effects but also may have low statistical power. This article describes how to compute statistical power of both fixed- and mixed-effects moderator tests in meta-analysis that are analogous to the analysis of variance and multiple regression analysis for effect sizes. It also shows how to compute power of tests for goodness of fit associated with these models. Examples from a published meta-analysis demonstrate that power of moderator tests and goodness-of-fit tests is not always high.
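
The power of a fixed-effects between-groups (Q_B) moderator test is typically approximated with a noncentral chi-square distribution. The sketch below follows that general Hedges-and-Pigott-style approach with a hypothetical noncentrality parameter; it is not a reproduction of the article's examples.

```python
from scipy import stats

def qb_power(noncentrality, df, alpha=0.05):
    """Approximate power of a between-groups (Q_B) moderator test in meta-analysis:
    under the alternative, Q_B is roughly noncentral chi-square with the given df."""
    crit = stats.chi2.ppf(1 - alpha, df)
    return stats.ncx2.sf(crit, df, noncentrality)

# Hypothetical: two groups of studies (df = 1) and a modest noncentrality parameter
print(qb_power(noncentrality=3.0, df=1))   # often well below the conventional 0.80
```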

Book
01 Jan 2004
TL;DR: This book provides a valuable primer that delineates what we know, what we would like to know, and the limits of what we can know when we try to learn about a system that is composed of other learners.
Abstract: 1. The Interactive Learning Problem 2. Reinforcement and Regret 3. Equilibrium 4. Conditional No-Regret Learning 5. Prediction, Postdiction, and Calibration 6. Fictitious Play and Its Variants 7. Bayesian Learning 8. Hypothesis Testing 9. Conclusion

Journal ArticleDOI
TL;DR: This paper illustrates the nonparametric analysis of ordinal data obtained from two-way factorial designs, including a repeated measures design, and shows how to quantify the effects of experimental factors on ratings through estimated relative marginal effects.
Abstract: Plant disease severity often is assessed using an ordinal rating scale rather than a continuous scale of measurement. Although such data usually should be analyzed with nonparametric methods, and not with the typical parametric techniques (such as analysis of variance), limitations in the statistical methodology available had meant that experimental designs generally could not be more complicated than a one-way layout. Very recent advancements in the theoretical formulation of hypotheses and associated test statistics within a nonparametric framework, together with development of software for implementing the methods, have made it possible for plant pathologists to properly analyze ordinal data from more complicated designs using nonparametric techniques. In this paper, we illustrate the nonparametric analysis of ordinal data obtained from two-way factorial designs, including a repeated measures design, and show how to quantify the effects of experimental factors on ratings through estimated relative marginal effects.
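
The basic quantity behind such rank-based analyses is the estimated relative (marginal) effect of each factor level: pool all observations, rank them, and rescale each group's mean rank. The sketch below shows only that estimator on hypothetical ordinal ratings; the full ANOVA-type test statistics used in the paper are not reproduced here.

```python
import numpy as np
from scipy import stats

def relative_effects(groups):
    """Estimated relative effects: (mean mid-rank of each group - 0.5) / N, ranking over
    all observations pooled. Values near 0.5 mean the group tends neither to smaller
    nor to larger ratings than the rest."""
    pooled = np.concatenate(groups)
    ranks = stats.rankdata(pooled)            # mid-ranks handle ties in ordinal ratings
    n_total = len(pooled)
    out, start = [], 0
    for g in groups:
        r = ranks[start:start + len(g)]
        out.append((r.mean() - 0.5) / n_total)
        start += len(g)
    return out

# Hypothetical ordinal disease-severity ratings (1-5) for two treatments
print(relative_effects([np.array([1, 2, 2, 3, 1]), np.array([3, 4, 4, 5, 3])]))
```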

Journal ArticleDOI
TL;DR: A simple natural test, seen as an asymptotic version of the well-known ANOVA F-test, is proposed for testing the null hypothesis that the groups under comparison have equal mean functions.

Book
01 Jan 2004
TL;DR: Presenting and summarizing data. Probability. Random variables and their distributions. Population and sample. Hypotheses testing. Correlation. Two independent variables. Concepts of experimental design. Blocking. Split-plot design. Analysis of covariance. Repeated measures. Lack of fit.
Abstract: Presenting and summarizing data. Probability. Random variables and their distributions. Population and sample. Hypotheses testing. Correlation. Two independent variables. Concepts of experimental design. Blocking. Split-plot design. Analysis of covariance. Repeated measures. Lack of fit.

Journal ArticleDOI
TL;DR: An approach is described that overcomes some of the problems associated with analyzing community data sets and makes data interpretation simple and effective; the Bray-Curtis coefficient is suggested as an ideal coefficient for constructing similarity matrices, and a quantitative measure of sample dispersion is introduced.
Abstract: Terminal restriction fragment length polymorphism (T-RFLP) is increasingly being used to examine microbial community structure and accordingly, a range of approaches have been used to analyze data sets. A number of published reports have included data and results that were statistically flawed or lacked rigorous statistical testing. A range of simple, yet powerful techniques are available to examine community data, however their use is seldom, if ever, discussed in microbial literature. We describe an approach that overcomes some of the problems associated with analyzing community datasets and offer an approach that makes data interpretation simple and effective. The Bray-Curtis coefficient is suggested as an ideal coefficient to be used for the construction of similarity matrices. Its strengths include its ability to deal with data sets containing multiple blocks of zeros in a meaningful manner. Non-metric multi-dimensional scaling is described as a powerful, yet easily interpreted method to examine community patterns based on T-RFLP data. Importantly, we describe the use of significance testing of data sets to allow quantitative assessment of similarity, removing subjectivity in comparing complex data sets. Finally, we introduce a quantitative measure of sample dispersion and suggest its usefulness in describing site heterogeneity.
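
A sketch of the workflow described above, Bray-Curtis dissimilarities followed by non-metric multidimensional scaling, using SciPy and scikit-learn as stand-ins for the PRIMER-style tools ecologists often use. The T-RFLP relative-abundance profiles are simulated and illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
profiles = rng.random((12, 40))          # 12 samples x 40 terminal restriction fragments
profiles[profiles < 0.6] = 0             # T-RFLP profiles typically contain many zeros

# Bray-Curtis dissimilarity matrix (SciPy's 'braycurtis' metric)
dist = squareform(pdist(profiles, metric="braycurtis"))

# Non-metric MDS on the precomputed dissimilarities
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = nmds.fit_transform(dist)
print(coords.shape, nmds.stress_)        # 2-D ordination and its stress value
```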

Journal ArticleDOI
Motohiro Yogo1
TL;DR: In this article, weak instruments can lead to bias in estimators and size distortion in hypothesis tests in the instrumental variables regression model, and weak instruments affect the identification of the elasticity of intertemporal substitution (EIS) through the linearized Euler equation.
Abstract: In the instrumental variables (IV) regression model, weak instruments can lead to bias in estimators and size distortion in hypothesis tests. This paper examines how weak instruments affect the identification of the elasticity of intertemporal substitution (EIS) through the linearized Euler equation. Conventional IV methods result in an empirical puzzle that the EIS is significantly less than 1 but its reciprocal is not different from 1. This paper shows that weak instruments can explain the puzzle and reports valid confidence intervals for the EIS using pivotal statistics. The EIS is less than 1 and not significantly different from 0 for eleven developed countries.

Journal ArticleDOI
TL;DR: In this article, the authors adopt a decision-theoretic approach, using loss functions that combine the competing goals of discovering as many differentially expressed genes as possible, while keeping the number of false discoveries manageable.
Abstract: We consider the choice of an optimal sample size for multiple-comparison problems. The motivating application is the choice of the number of microarray experiments to be carried out when learning about differential gene expression. However, the approach is valid in any application that involves multiple comparisons in a large number of hypothesis tests. We discuss two decision problems in the context of this setup: the sample size selection and the decision about the multiple comparisons. We adopt a decision-theoretic approach, using loss functions that combine the competing goals of discovering as many differentially expressed genes as possible, while keeping the number of false discoveries manageable. For consistency, we use the same loss function for both decisions. The decision rule that emerges for the multiple-comparison problem takes the exact form of the rules proposed in the recent literature to control the posterior expected falsediscovery rate. For the sample size selection, we combine the expe...

Book ChapterDOI
13 Jul 2004
TL;DR: In this article, the authors propose a new statistical approach to analyze stochastic systems against specifications given in a sublogic of CSL, where the system under investigation is an unknown, deployed black-box that can be passively observed to obtain sample traces, but cannot be controlled.
Abstract: We propose a new statistical approach to analyzing stochastic systems against specifications given in a sublogic of continuous stochastic logic (CSL). Unlike past numerical and statistical analysis methods, we assume that the system under investigation is an unknown, deployed black-box that can be passively observed to obtain sample traces, but cannot be controlled. Given a set of executions (obtained by Monte Carlo simulation) and a property, our algorithm checks, based on statistical hypothesis testing, whether the sample provides evidence to conclude the satisfaction or violation of a property, and computes a quantitative measure (p-value of the tests) of confidence in its answer; if the sample does not provide statistical evidence to conclude the satisfaction or violation of the property, the algorithm may respond with a “don’t know” answer. We implemented our algorithm in a Java-based prototype tool called VeStA, and experimented with the tool using case studies analyzed in [15]. Our empirical results show that our approach may, at least in some cases, be faster than previous analysis methods.
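
The core statistical step can be illustrated with a simple binomial test on Monte Carlo traces: each trace either satisfies the path property or not, and we test whether the satisfaction probability meets a threshold. This is a deliberate simplification of the CSL-based algorithm; the counts and threshold are hypothetical, and scipy.stats.binomtest requires SciPy 1.7 or later.

```python
from scipy.stats import binomtest

# Hypothetical: 38 of 200 simulated traces satisfy the path formula, and the
# specification requires the satisfaction probability to be at least 0.25.
theta = 0.25
satisfied, n_traces = 38, 200

# H0: p >= theta (property holds); a small p-value is evidence of violation.
result = binomtest(satisfied, n_traces, theta, alternative="less")
print(result.pvalue)   # large p-value -> no statistical evidence either way ("don't know")
```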

Journal ArticleDOI
TL;DR: An algorithm, AFTER, is proposed to convexly combine the models for better prediction performance, and the results show an advantage of combining by AFTER over selection in terms of forecasting accuracy in several settings.

Journal ArticleDOI
TL;DR: In this paper, the authors study the behavior of the chi-square difference test when the base model is misspecified and show that, in that circumstance, a nonsignificant chi-square difference cannot be used to justify the nested model, and the z test for the statistical significance of a parameter estimate can also be misleading.
Abstract: In mean and covariance structure analysis, the chi-square difference test is often applied to evaluate the number of factors, cross-group constraints, and other nested model comparisons. Let model Ma be the base model within which model Mb is nested. In practice, this test is commonly used to justify Mb even when Ma is misspecified. The authors study the behavior of the chi-square difference test in such a circumstance. Monte Carlo results indicate that a nonsignificant chi-square difference cannot be used to justify the constraints in Mb. They also show that when the base model is misspecified, the z test for the statistical significance of a parameter estimate can also be misleading. For specific models, the analysis further shows that the intercept and slope parameters in growth curve models can be estimated consistently even when the covariance structure is misspecified, but only in linear growth models. Similarly, with misspecified covariance structures, the mean parameters in multiple group models can be estimated consistently under null conditions.
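
The mechanics of the chi-square difference test itself are simple (the authors' point concerns its behavior when the base model Ma is misspecified). A sketch with hypothetical fit statistics:

```python
from scipy import stats

# Hypothetical model fit statistics: Mb is nested within (more constrained than) Ma
chi2_a, df_a = 112.4, 48     # base model Ma
chi2_b, df_b = 131.9, 53     # constrained model Mb

delta_chi2 = chi2_b - chi2_a
delta_df = df_b - df_a
p_value = stats.chi2.sf(delta_chi2, delta_df)
# A nonsignificant difference does not by itself justify Mb if Ma is misspecified
print(delta_chi2, delta_df, p_value)
```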

Book
01 Jan 2004
TL;DR: This book covers the statistical design, normalization, and quality assessment of microarray experiments, together with methods for clustering, differential expression testing, and prediction from gene expression profiles, illustrated with data sets from several biological experiments.
Abstract: Preface.1 Preliminaries.1.1 Using the R Computing Environment.1.1.1 Installing smida.1.1.2 Loading smida.1.2 Data Sets from Biological Experiments.1.2.1 Arabidopsis experiment: Anna Amtmann.1.2.2 Skin cancer experiment: Nighean Barr.1.2.3 Breast cancer experiment: John Bartlett.1.2.4 Mammary gland experiment: Gusterson group.1.2.5 Tuberculosis experiment: B G@S group.I Getting Good Data.2 Set-up of a Microarray Experiment.2.1 Nucleic Acids: DNA and RNA.2.2 Simple cDNA Spotted Microarray Experiment.2.2.1 Growing experimental material.2.2.2 Obtaining RNA.2.2.3 Adding spiking RNA and poly-T primer.2.2.4 Preparing the enzyme environment.2.2.5 Obtaining labelled cDNA.2.2.6 Preparing cDNA mixture for hybridization.2.2.7 Slide hybridization.3 Statistical Design of Microarrays.3.1 Sources of Variation.3.2 Replication.3.2.1 Biological and technical replication.3.2.2 How many replicates?3.2.3 Pooling samples.3.3 Design Principles.3.3.1 Blocking, crossing and randomization.3.3.2 Design and normalization.3.4 Single-channelMicroarray Design.3.4.1 Design issues.3.4.2 Design layout.3.4.3 Dealing with technical replicates.3.5 Two-channelMicroarray Designs.3.5.1 Optimal design of dual-channel arrays.3.5.2 Several practical two-channel designs.4 Normalization.4.1 Image Analysis.4.1.1 Filtering.4.1.2 Gridding.4.1.3 Segmentation.4.1.4 Quantification.4.2 Introduction to Normalization.4.2.1 Scale of gene expression data.4.2.2 Using control spots for normalization.4.2.3 Missing data.4.3 Normalization for Dual-channel Arrays.4.3.1 Order for the normalizations.4.3.2 Spatial correction.4.3.3 Background correction.4.3.4 Dye effect normalization.4.3.5 Normalization within and across conditions.4.4 Normalization of Single-channel Arrays.4.4.1 Affymetrix data structure.4.4.2 Normalization of Affymetrix data.5 Quality Assessment.5.1 Using MIAME in Quality Assessment.5.1.1 Components of MIAME.5.2 Comparing Multivariate Data.5.2.1 Measurement scale.5.2.2 Dissimilarity and distance measures.5.2.3 Representing multivariate data.5.3 Detecting Data Problems.5.3.1 Clerical errors.5.3.2 Normalization problems.5.3.3 Hybridization problems.5.3.4 Array mishandling.5.4 Consequences of Quality Assessment Checks.6 Microarray Myths: Data.6.1 Design.6.1.1 Single-versus dual-channel designs?6.1.2 Dye-swap experiments.6.2 Normalization.6.2.1 Myth: 'microarray data is Gaussian'.6.2.2 Myth: 'microarray data is not Gaussian'.6.2.3 Confounding spatial and dye effect.6.2.4 Myth: 'non-negative background subtraction'.II Getting Good Answers.7 Microarray Discoveries.7.1 Discovering Sample Classes.7.1.1 Why cluster samples?7.1.2 Sample dissimilarity measures.7.1.3 Clustering methods for samples.7.2 Exploratory Supervised Learning.7.2.1 Labelled dendrograms.7.2.2 Labelled PAM-type clusterings.7.3 Discovering Gene Clusters.7.3.1 Similarity measures for expression profiles.7.3.2 Gene clustering methods.8 Differential Expression.8.1 Introduction.8.1.1 Classical versus Bayesian hypothesis testing.8.1.2 Multiple testing 'problem'.8.2 Classical Hypothesis Testing.8.2.1 What is a hypothesis test?8.2.2 Hypothesis tests for two conditions.8.2.3 Decision rules.8.2.4 Results from skin cancer experiment.8.3 Bayesian Hypothesis Testing.8.3.1 A general testing procedure.8.3.2 Bayesian t-test.9 Predicting Outcomes with Gene Expression Profiles.9.1 Introduction.9.1.1 Probabilistic classification theory.9.1.2 Modelling and predicting continuous variables.9.2 Curse of Dimensionality: Gene Filtering.9.2.1 Use only significantly expressed genes.9.2.2 PCA 
and gene clustering.9.2.3 Penalized methods.9.2.4 Biological selection.9.3 Predicting ClassMemberships.9.3.1 Variance-bias trade-off in prediction.9.3.2 Linear discriminant analysis.9.3.3 k-nearest neighbour classification.9.4 Predicting Continuous Responses.9.4.1 Penalized regression: LASSO.9.4.2 k-nearest neighbour regression.10 Microarray Myths: Inference.10.1 Differential Expression.10.1.1 Myth: 'Bonferroni is too conservative'.10.1.2 FPR and collective multiple testing.10.1.3 Misinterpreting FDR.10.2 Prediction and Learning.10.2.1 Cross-validation.Bibliography.Index.

01 Jan 2004
TL;DR: It is demonstrated that the introduction of phases permits history to be taken into account when making action choices, and this can result in policies of higher quality than would be obtained if history dependence were ignored.
Abstract: Asynchronous stochastic systems are abundant in the real world. Examples include queuing systems, telephone exchanges, and computer networks. Yet, little attention has been given to such systems in the model checking and planning literature, at least not without making limiting and often unrealistic assumptions regarding the dynamics of the systems. The most common assumption is that of history-independence: the Markov assumption. In this thesis, we consider the problems of verification and planning for stochastic processes with asynchronous events, without relying on the Markov assumption. We establish the foundation for statistical probabilistic model checking, an approach to probabilistic model checking based on hypothesis testing and simulation. We demonstrate that this approach is competitive with state-of-the-art numerical solution methods for probabilistic model checking. While the verification result can be guaranteed only with some probability of error, we can set this error bound arbitrarily low (at the cost of efficiency). Our contribution in planning consists of a formalism, the generalized semi-Markov decision process (GSMDP), for planning with asynchronous stochastic events. We consider both goal directed and decision theoretic planning. In the former case, we rely on statistical model checking to verify plans, and use the simulation traces to guide plan repair. In the latter case, we present the use of phase-type distributions to approximate a GSMDP with a continuous-time MDP, which can then be solved using existing techniques. We demonstrate that the introduction of phases permits us to take history into account when making action choices, and this can result in policies of higher quality than we would get if we ignored history dependence.

Journal ArticleDOI
TL;DR: Tests of equivalence are suggested, which use the hypothesis of dissimilarity as the null hypothesis; that is, the null hypothesis is that the model is unacceptable, which flips the burden of proof back onto the model.
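
One common way to make "dissimilarity" the null hypothesis is the two one-sided tests (TOST) procedure: test whether the mean model-observation difference lies within a pre-specified equivalence margin. The sketch below uses simulated differences and an arbitrary margin, and is not necessarily the specific equivalence test proposed in the paper.

```python
import numpy as np
from scipy import stats

def tost_paired(diff, margin, alpha=0.05):
    """Two one-sided tests: H0 is that the mean difference lies OUTSIDE +/- margin
    (model and observations dissimilar); rejecting both one-sided nulls implies equivalence."""
    d = np.asarray(diff, dtype=float)
    n, mean, se = len(d), d.mean(), d.std(ddof=1) / np.sqrt(len(d))
    t_lower = (mean + margin) / se          # tests H0: mean <= -margin
    t_upper = (mean - margin) / se          # tests H0: mean >= +margin
    p_lower = stats.t.sf(t_lower, n - 1)
    p_upper = stats.t.cdf(t_upper, n - 1)
    return max(p_lower, p_upper) < alpha    # equivalence concluded only if both reject

rng = np.random.default_rng(0)
differences = rng.normal(0.1, 1.0, 60)      # hypothetical model-minus-observed differences
print(tost_paired(differences, margin=0.5))
```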

Journal ArticleDOI
TL;DR: From the results of this survey, it is shown that likelihood ratios are able to serve all the important statistical needs of researchers in empirical psychology in a format that is more straightforward and easier to interpret than traditional inferential statistics.
Abstract: Empirical studies in psychology typically employ null hypothesis significance testing to draw statistical inferences. We propose that likelihood ratios are a more straightforward alternative to this approach. Likelihood ratios provide a measure of the fit of two competing models; the statistic represents a direct comparison of the relative likelihood of the data, given the best fit of the two models. Likelihood ratios offer an intuitive, easily interpretable statistic that allows the researcher great flexibility in framing empirical arguments. In support of this position, we report the results of a survey of empirical articles in psychology, in which the common uses of statistics by empirical psychologists are examined. From the results of this survey, we show that likelihood ratios are able to serve all the important statistical needs of researchers in empirical psychology in a format that is more straightforward and easier to interpret than traditional inferential statistics.
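
A small illustration of the kind of statistic advocated: for a one-sample comparison of a null mean against the best-fitting alternative under a normal model, the likelihood ratio can be written via the standard identity λ = (1 + t²/(n − 1))^(n/2), where t is the one-sample t statistic. The data are simulated; this is a generic textbook identity, not a summary of the article's survey.

```python
import numpy as np

def likelihood_ratio_one_sample(x, mu0=0.0):
    """Likelihood ratio comparing the best-fitting mean against mu0 under a normal model,
    using the identity lambda = (1 + t^2/(n-1))**(n/2) with the one-sample t statistic."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    return (1 + t**2 / (n - 1)) ** (n / 2)

rng = np.random.default_rng(0)
data = rng.normal(0.4, 1.0, 25)
print(likelihood_ratio_one_sample(data))   # how many times better the ML mean fits than mu0 = 0
```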

Journal ArticleDOI
TL;DR: Kline reviews the controversy regarding significance testing, offers methods for effect size and confidence interval estimation, and suggests some alternative methodologies, concluding that there is no "magical alternative" to statistical tests and that such tests are appropriate in some circumstances when applied correctly.
Abstract: REX B. KLINE Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research Washington, DC: American Psychological Association, 2004, 336 pages (ISBN 1-59147-118-4, US$49.95 Hardcover) In 1999, a blue-ribbon task force assembled by the American Psychological Association published their findings with regards to the long-standing controversy pertaining to null hypothesis significance testing (NHST). The task force dictated effect sizes and confidence intervals be reported, and p values and dichotomous accept-reject decisions be given less weight. Editorial policies in a number of journals came to reflect the views of the task force as did a subsequent revision to the American Psychological Association Publication Manual. Rex B. Kline wrote Beyond Significance Testing. Reforming Data Analysis Methods in Behavioral Research as a follow-up to both the task force recommendations and the revision to the publication manual. Kline's 1998 book Principles and Practice of Structural Equation Modeling (Guilford Press) was well received and a second edition is being published this fall. In Beyond Significance Testing, Kline reviews the controversy regarding significance testing, offers methods for effect size and confidence interval estimation, and suggests some alternative methodologies. There is an accompanying website that includes resources for instructors and students. Part I of the book is a review of fundamental concepts and the debate regarding significance testing. Part II provides statistics for effect size and confidence interval estimation for parametric and nonparametric two-group, oneway, and factorial designs. Part III examines metaanalysis, resampling, and Bayesian estimation procedures. In the first chapter, Kline provides a scholarly summary of the null hypothesis testing debate concluding with the APA task force findings and what Kline regards as ambiguous recommendations in the publication manual. Kline predicts the future will see a smaller role for traditional statistical testing (p values) in psychology. This change will take time and may not occur until the next generation of researchers are trained, but Kline anticipates the social sciences will then become more like the natural sciences in that "we will report the directions and magnitudes of our effects, determine whether they replicate, and evaluate them for their theoretical, clinical, or practical significance" (p. 15). Chapter 2 is a review of fundamental concepts of research design, including sampling and estimation, the logic of statistical significance testing, and t, F, and chi-square tests. The problems with statistical tests are revisited in Chapter 3. What follows is a long list of errors in interpretation of p values and conclusions made after null hypothesis testing. The emphasis on null hypothesis significance testing in psychology is also argued to inhibit advancement of the discipline. To be fair, Kline recognizes there is yet no "magical alternative" to statistical tests and that such tests are appropriate in some circumstances when applied correctly. Nonetheless, Kline envisions a future where effect sizes and confidence intervals are reported, substantive rather than statistical significance predominates, and "NHST-Centric" thinking has diminished. Part II covers effect size and confidence interval calculations. Chapter 4 is a presentation of parametric effect size indexes. Independent and dependent sample statistics are covered separately. 
The textbook's website has a supplementary chapter on twogroup multivariate designs. Group difference indexes such as d are distinguished from measures of association such as r. Case level analyses of group differences are also reviewed. Sections not relevant to a reader's needs can be skipped without loss of continuity. Interpretive guidelines for effect size magnitude and how one might be fooled by effect size estimation are sections that should not be passed over. …

Journal ArticleDOI
TL;DR: This article extends false discovery rates to random fields, for which there are uncountably many hypothesis tests, and develops a method for finding regions in the field's domain where there is a significant signal while controlling either the proportion of area or the proportion of clusters in which false rejections occur.
Abstract: This article extends false discovery rates to random fields, for which there are uncountably many hypothesis tests. We develop a method for finding regions in the field's domain where there is a significant signal while controlling either the proportion of area or the proportion of clusters in which false rejections occur. The method produces confidence envelopes for the proportion of false discoveries as a function of the rejection threshold. From the confidence envelopes, we derive threshold procedures to control either the mean or the specified tail probabilities of the false discovery proportion. An essential ingredient of this construction is a new algorithm to compute a confidence superset for the set of all true-null locations. We demonstrate our method with applications to scan statistics and functional neuroimaging.

Journal ArticleDOI
TL;DR: The intermediate and deep layers of the superior colliculus (SC) compose a retinotopically organized motor map and are known to be important for the control of saccadic eye movements, but recent studies have shown that the functions of the SC are not restricted to the motor control ofSaccades.

Journal ArticleDOI
TL;DR: Standard methods, such as t tests and analyses of variance, may be poor choices for data that have unique features; the use of proper statistical methods leads to more meaningful study results and conclusions.
Abstract: OBJECTIVE: Psychiatric clinical studies, including those in drug abuse research, often provide data that are challenging to analyze and use for hypothesis testing because they are heavily skewed and marked by an abundance of zero values. The authors consider methods of analyzing data with those characteristics. METHOD: The possible meaning of zero values and the statistical methods that are appropriate for analyzing data with many zero values in both cross-sectional and longitudinal designs are reviewed. The authors illustrate the application of these alternative methods using sample data collected with the Addiction Severity Index. RESULTS: Data that include many zeros, if the zero value is considered the lowest value on a scale that measures severity, may be analyzed with several methods other than standard parametric tests. If zero values are considered an indication of a case without a problem, for which a measure of severity is not meaningful, analyses should include separate statistical models for t...
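
One of the alternatives alluded to, when zeros indicate cases without a problem, is a two-part (hurdle-style) model: a logistic model for zero versus nonzero, plus a separate model for severity among the nonzero cases. The sketch below uses simulated data and generic statsmodels calls; it is an illustration of the idea, not the authors' exact analysis of the Addiction Severity Index.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)                          # e.g., a hypothetical treatment indicator
any_problem = rng.binomial(1, 0.3 + 0.2 * group)       # many zeros: no problem reported
severity = np.where(any_problem == 1,
                    np.exp(rng.normal(1.0 + 0.3 * group, 0.5)), 0.0)

X = sm.add_constant(group.astype(float))

# Part 1: does the predictor affect the probability of reporting any problem at all?
logit_fit = sm.Logit(any_problem, X).fit(disp=False)

# Part 2: among cases with a problem, does it affect (log-transformed) severity?
pos = severity > 0
ols_fit = sm.OLS(np.log(severity[pos]), X[pos]).fit()

print(logit_fit.params, ols_fit.params)
```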