Topic

Statistical hypothesis testing

About: Statistical hypothesis testing is a research topic. Over its lifetime, 19,580 publications have been published within this topic, receiving 1,037,815 citations. The topic is also known as confirmatory data analysis.


Papers
Journal Article
TL;DR: As a potential alternative to standard null hypothesis significance testing, methods for graphical presentation of data--particularly condition means and their corresponding confidence intervals--for a wide range of factorial designs used in experimental psychology are described.
Abstract: As a potential alternative to standard null hypothesis significance testing, we describe methods for graphical presentation of data--particularly condition means and their corresponding confidence intervals--for a wide range of factorial designs used in experimental psychology. We describe and illustrate confidence intervals specifically appropriate for between-subject versus within-subject factors. For designs involving more than two levels of a factor, we describe the use of contrasts for graphical illustration of theoretically meaningful components of main effects and interactions. These graphical techniques lend themselves to a natural and straightforward assessment of statistical power. To the extent that a variety of informative means of constructing inferences from data are made available and clearly understood, researchers will increase their likelihood of forming appropriate conclusions and communicating effectively with their audiences. A number of years ago, we advocated and described computational approaches to the use of confidence intervals as part of a graphical approach to data interpretation (Loftus & Masson, 1994; see also Loftus, 2002). The power and effectiveness of graphical data presentation is undeniable (Tufte, 1983) and is common in all forms of scientific communication in experimental psychology and in other fields. In many instances, however, plots of descriptive statistics (typically means) are not accompanied by any indication of variability or stability associated with those descriptive statistics. The diligent reader, then, is forced to refer to a dreary accompanying recital of significance tests to determine how the pattern of means should be interpreted.

754 citations
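
The within-subject intervals the paper describes are built so that between-subject variability does not inflate the error bars. As a rough sketch of the idea (using the Cousineau/Morey normalization as a stand-in for the authors' exact ANOVA-based computation, with invented data), one might compute them like this:

```python
# A minimal sketch (not the authors' code) of within-subject confidence
# intervals in the spirit of Loftus & Masson (1994), via the
# Cousineau/Morey normalization; the data values are invented.
import numpy as np
from scipy import stats

# rows = subjects, columns = experimental conditions (hypothetical data)
scores = np.array([
    [480, 500, 520],
    [510, 540, 555],
    [450, 470, 495],
    [530, 545, 580],
])
n_subj, n_cond = scores.shape

# Remove between-subject variability: center each subject on the grand mean
normalized = scores - scores.mean(axis=1, keepdims=True) + scores.mean()

# Morey (2008) bias correction for the variances of normalized scores
correction = n_cond / (n_cond - 1)
sem = np.sqrt(correction * normalized.var(axis=0, ddof=1) / n_subj)

# 95% within-subject CI half-widths around each condition mean
t_crit = stats.t.ppf(0.975, df=n_subj - 1)
for mean, hw in zip(scores.mean(axis=0), t_crit * sem):
    print(f"{mean:6.1f} +/- {hw:.1f}")
```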

Book
03 Sep 2009
TL;DR: This valuable book shows second language researchers how to use the statistical program SPSS to conduct statistical tests frequently done in second language research, including chi-square, t-tests, correlation, multiple regression, ANOVA and non-parametric analogs to these tests.
Abstract: This valuable book shows second language researchers how to use the statistical program SPSS to conduct statistical tests frequently done in SLA research. Using data sets from real SLA studies, A Guide to Doing Statistics in Second Language Research Using SPSS shows newcomers to both statistics and SPSS how to generate descriptive statistics, how to choose a statistical test, and how to conduct and interpret a variety of basic statistical tests. The author covers the statistical tests that are most commonly used in second language research, including chi-square, t-tests, correlation, multiple regression, ANOVA and non-parametric analogs to these tests. The text is abundantly illustrated with graphs and tables depicting actual data sets, and exercises throughout the book help readers understand concepts (such as the difference between independent and dependent variables) and work out statistical analyses. Answers to all exercises are provided on the book's companion website, along with sample data sets and other supplementary material.

754 citations
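
The book itself works in SPSS; purely as a hedged illustration of the same family of tests it covers, here is how a t-test, a correlation, and a chi-square test might be run in Python with scipy.stats (all data below are invented):

```python
# A sketch of the test families covered by the book (t-test, correlation,
# chi-square), run in Python rather than SPSS; all data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(70, 10, size=30)  # e.g., scores of one learner group
group_b = rng.normal(75, 10, size=30)  # e.g., scores of a second group

# Independent-samples t-test comparing the two groups
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Pearson correlation between two hypothetical measures
hours = rng.uniform(0, 20, size=30)
r, r_p = stats.pearsonr(hours, group_a)

# Chi-square test of independence on a 2x2 contingency table
table = np.array([[20, 10], [12, 18]])
chi2, chi_p, dof, _ = stats.chi2_contingency(table)

print(f"t = {t_stat:.2f} (p = {t_p:.3f}); r = {r:.2f} (p = {r_p:.3f}); "
      f"chi2 = {chi2:.2f} (p = {chi_p:.3f})")
```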

Journal Article
TL;DR: A list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact is provided, and a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers are suggested.
Abstract: Many metabolomics, and other high-content or high-throughput, experiments are set up such that the primary aim is the discovery of biomarker metabolites that can discriminate, with a certain level of certainty, between nominally matched ‘case’ and ‘control’ samples. However, it is unfortunately very easy to find markers that are apparently persuasive but that are in fact entirely spurious, and there are well-known examples in the proteomics literature. The main types of danger are not entirely independent of each other, but include bias, inadequate sample size (especially relative to the number of metabolite variables and to the required statistical power to prove that a biomarker is discriminant), excessive false discovery rate due to multiple hypothesis testing, inappropriate choice of particular numerical methods, and overfitting (generally caused by the failure to perform adequate validation and cross-validation). Many studies fail to take these into account, and thereby fail to discover anything of true significance (despite their claims). We summarise these problems, and provide pointers to a substantial existing literature that should assist in the improved design and evaluation of metabolomics experiments, thereby allowing robust scientific conclusions to be drawn from the available data. We provide a list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact, and suggest a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers. These tools can be applied to individual metabolites by using multiple univariate tests performed in parallel across all metabolite peaks. They may also be applied to the validation of multivariate models. We stress in particular that classical p-values such as “p < 0.05”, that are often used in biomedicine, are far too optimistic when multiple tests are done simultaneously (as in metabolomics). Ultimately it is desirable that all data and metadata are available electronically, as this allows the entire community to assess conclusions drawn from them. These analyses apply to all high-dimensional ‘omics’ datasets.

747 citations
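
One of the simpler checks the abstract alludes to is controlling the false discovery rate rather than applying a raw p < 0.05 threshold to every metabolite. A minimal sketch (not taken from the paper) of the Benjamini-Hochberg procedure on simulated p-values:

```python
# Minimal sketch of Benjamini-Hochberg FDR control across many parallel
# univariate tests, one standard remedy for the multiple-testing problem
# described above; the p-values are simulated, not real metabolite data.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of discoveries at FDR level alpha."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    # Largest k (1-indexed) with p_(k) <= (k/m) * alpha; reject 1..k
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject

rng = np.random.default_rng(1)
# 950 null "metabolites" (uniform p-values) plus 50 true signals (small p)
p_vals = np.concatenate([rng.uniform(size=950), rng.beta(1, 50, size=50)])
discoveries = benjamini_hochberg(p_vals)
print(f"naive p < 0.05: {np.sum(p_vals < 0.05)}; "
      f"BH discoveries: {discoveries.sum()}")
```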

Journal Article
TL;DR: The authors consider an alternative explanation that adds the hypothesis that people like to be perceived as fair; the resulting theory has additional testable implications, whose validity they confirm through new experiments.
Abstract: A norm of 50-50 division appears to have considerable force in a wide range of economic environments, both in the real world and in the laboratory. Even in settings where one party unilaterally determines the allocation of a prize (the dictator game), many subjects voluntarily cede exactly half to another individual. The hypothesis that people care about fairness does not by itself account for key experimental patterns. We consider an alternative explanation, which adds the hypothesis that people like to be perceived as fair. The properties of equilibria for the resulting signaling game correspond closely to laboratory observations. The theory has additional testable implications, the validity of which we confirm through new experiments.

733 citations

Journal Article
TL;DR: This paper characterizes an important class of problems in which the LRT and the F-test fail, illustrates this nonstandard behavior, and briefly sketches several acceptable alternatives, focusing on Bayesian posterior predictive probability values.
Abstract: The likelihood ratio test (LRT) and the related F-test, popularized in astrophysics by Eadie and coworkers in 1971, Bevington in 1969, Lampton, Margon, & Bowyer, in 1976, Cash in 1979, and Avni in 1978, do not (even asymptotically) adhere to their nominal χ2 and F-distributions in many statistical tests common in astrophysics, thereby casting many marginal line or source detections and nondetections into doubt. Although the above authors illustrate the many legitimate uses of these statistics, in some important cases it can be impossible to compute the correct false positive rate. For example, it has become common practice to use the LRT or the F-test to detect a line in a spectral model or a source above background despite the lack of certain required regularity conditions. (These applications were not originally suggested by Cash or by Bevington.) In these and other settings that involve testing a hypothesis that is on the boundary of the parameter space, contrary to common practice, the nominal χ2 distribution for the LRT or the F-distribution for the F-test should not be used. In this paper, we characterize an important class of problems in which the LRT and the F-test fail and illustrate this nonstandard behavior. We briefly sketch several possible acceptable alternatives, focusing on Bayesian posterior predictive probability values. We present this method in some detail since it is a simple, robust, and intuitive approach. This alternative method is illustrated using the gamma-ray burst of 1997 May 8 (GRB 970508) to investigate the presence of an Fe K emission line during the initial phase of the observation. There are many legitimate uses of the LRT and the F-test in astrophysics, and even when these tests are inappropriate, there remain several statistical alternatives (e.g., judicious use of error bars and Bayes factors). Nevertheless, there are numerous cases of the inappropriate use of the LRT and similar tests in the literature, bringing substantive scientific results into question.

730 citations
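
To see the boundary problem concretely, consider testing an extra component whose amplitude is constrained to be non-negative. The simulation sketch below (my own illustration, not the paper's code; data are simulated) shows that the LRT statistic's null distribution is then a 50:50 mixture of a point mass at zero and chi-squared with one degree of freedom, so nominal chi-squared tail areas are miscalibrated:

```python
# Simulation sketch (my illustration, not the paper's code) of a boundary
# case: an extra component with amplitude mu >= 0, testing H0: mu = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sigma, n_sims = 50, 1.0, 100_000

# Sample means of n Gaussian observations under the null (mu = 0)
xbar = rng.normal(0.0, sigma / np.sqrt(n), size=n_sims)

# With mu constrained non-negative, the MLE is max(xbar, 0), so the LRT
# statistic is n * max(xbar, 0)^2 / sigma^2 -- exactly zero half the time
lrt = n * np.clip(xbar, 0.0, None) ** 2 / sigma**2

crit = stats.chi2.ppf(0.95, df=1)  # nominal 5% chi-squared(1) cutoff
# True null law is a 50:50 mix of a point mass at 0 and chi-squared(1),
# so the actual exceedance rate is about 0.025, not the nominal 0.05
print(f"exceedance rate at nominal 5% cutoff: {np.mean(lrt > crit):.3f}")
```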


Network Information

Related Topics (5)

- Estimator: 97.3K papers, 2.6M citations (88% related)
- Linear model: 19K papers, 1M citations (88% related)
- Inference: 36.8K papers, 1.3M citations (87% related)
- Regression analysis: 31K papers, 1.7M citations (86% related)
- Sampling (statistics): 65.3K papers, 1.2M citations (83% related)
Performance Metrics

No. of papers in the topic in previous years:

Year    Papers
2023    267
2022    696
2021    959
2020    998
2019    1,033
2018    943