
Effect Size, Confidence Intervals and Statistical Power in Psychological Research

Arnoldo Téllez, Cirilo H. García, Víctor Corral-Verdugo
- 01 Jan 2015 - 
- Vol. 8, Iss: 3, pp 27-46


Psychology in Russia: State of the Art
Volume 8, Issue 3, 2015
Lomonosov Moscow State University
Russian Psychological Society
ISSN 2074-6857 (Print) / ISSN 2307-2202 (Online)
© Lomonosov Moscow State University, 2015
© Russian Psychological Society, 2015
doi: 10.11621/pir.2015.0303
http://psychologyinrussia.com
Eect size, condence intervals and statistical power
in psychological research
1
Arnoldo Téllez a,b*, Cirilo H. García a, Víctor Corral-Verdugo c

a Psychology School, Universidad Autónoma de Nuevo León (UANL), Monterrey, México
b Centro de Investigación y Desarrollo en Ciencias de la Salud (CIDICS), Universidad Autónoma de Nuevo León, Monterrey, México
c Universidad de Sonora, Hermosillo, México

* Corresponding author. E-mail: arnoldo.tellez@uanl.mx
Quantitative psychological research is focused on detecting the occurrence of certain population phenomena by analyzing data from a sample, and statistics is a particularly helpful mathematical tool that is used by researchers to evaluate hypotheses and make decisions to accept or reject such hypotheses. In this paper, the various statistical tools in psychological research are reviewed. The limitations of null hypothesis significance testing (NHST) and the advantages of using effect size and its respective confidence intervals are explained, as the latter two measurements can provide important information about the results of a study. These measurements also can facilitate data interpretation and easily detect trivial effects, enabling researchers to make decisions in a more clinically relevant fashion. Moreover, it is recommended to establish an appropriate sample size by calculating the optimum statistical power at the moment that the research is designed. Psychological journal editors are encouraged to follow APA recommendations strictly and ask authors of original research studies to report the effect size, its confidence intervals, statistical power and, when required, any measure of clinical significance. Additionally, we must account for the teaching of statistics at the graduate level. At that level, students do not receive sufficient information concerning the importance of using different types of effect sizes and their confidence intervals according to the different types of research designs; instead, most of the information is focused on the various tools of NHST.
Keywords: effect size, confidence intervals, statistical power, NHST
¹ A brighter day is dawning in which researchers will ask not only if a sample result is likely but also if an effect is practically noteworthy or replicable (Thompson, 2002).

Introduction
In the last three decades, a critical movement has developed within quantitative psychological research. This movement emerged in response to the misguided use of classical statistics based on null hypothesis significance testing (NHST). NHST promotes dichotomous thinking and provides limited information regarding the essence of investigated phenomena. Dichotomous thinking in science -- manifested as only accepting or rejecting research hypotheses -- prevents the advancement of science and may even skew the accumulation of knowledge by stimulating the exclusive publication of studies whose hypotheses have been accepted. The so-called "new statistics" movement critically challenges NHST postulates and operations. This movement is an approach based on the analytical tool of estimation (Cumming, 2014), which promotes the use of effect size as a descriptive statistic, confidence intervals as inferential statistics, and meta-analysis as a reliable form of knowledge accumulation. This paper is intended to analyze the disadvantages of NHST and the advantages of using effect size, confidence intervals and statistical power in quantitative psychological research, especially in clinical studies. Also noted and stressed is the need for editors of scientific psychological journals to adhere to the policies recommended by the APA in this regard.
Quantitative research in psychology
Typically, quantitative psychological research is focused on detecting the occurrence of certain population phenomena by analyzing data from a sample. An example is the case in which a researcher wishes to know whether a treatment to improve the quality of life of those who suffer from breast cancer performs better than a placebo treatment given to another group or to those on a waiting list (also known as the control group or contrast group) (Wilkinson, 1999). Statistics is likewise used to decide whether an independent variable or treatment did or did not have an important effect. In quantitative research methodology, there are two ways of quantifying this effect: (1) Null Hypothesis Significance Testing and (2) Effect Size (ES), together with its respective confidence intervals (CI). These two approaches are reviewed below.
Null Hypothesis Significance Testing (NHST)
The outcome of NHST depends on the effect size in the population, the size of the sample used, and the alpha level or p value that is selected (p being the abbreviation for probability). Most psychology research is focused on rejecting the null hypothesis and obtaining a small p value instead of observing the relevance of the results that are obtained (Kirk, 2001).
Among the limitations of NHST are its sensitivity to sample size, its inability to accept the null hypothesis, and its lack of capacity to determine the practical significance of statistical results.
Kirk (2001) states that NHST only establishes the probability of obtaining a more or less extreme effect if the null hypothesis is true. It does not, however, communicate the magnitude of the effect or its practical significance, meaning

Eect size, condence intervals and statistical power in psychological research 29
whether the effect is found to be useful or important. As a result, inferential statistical testing has been criticized; as expressed by Ivarsson, Andersen, Johnson, and Lindwall (2013): "p levels may have little, if anything, to do with real-world meaning and practical value" (p. 97). Some authors, such as Schmidt (1996), even suggest that statistical contrast is unnecessary and recommend focusing only on ES estimation, and Cohen (1994) suggests that "NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it" (p. 997).
Ronald Fisher was the father of modern statistics and experimental design. Since his time, it has been established as a convention that the p value for statistical significance must be less than .05, which means that an observed difference between two groups has less than a 5% probability of occurring by chance or sampling error if the null hypothesis is assumed to be true. In other words, if p is equal to or less than .05, the null hypothesis can be rejected because 95 times out of 100 the observed difference between the means would not be due to chance. Some researchers use stricter significance levels, such as p ≤ .01 (1%) and p ≤ .005 (0.5%). The convention of p < .05 has been used almost blindly until now. The question is why a different p value has not been agreed upon, for example, .06 or .04. Indeed, there is no theoretical or practical argument that sustains the criterion of p < .05 as an important cut-off point. This circumstance has led some statistical experts and methodologists, such as Rosnow and Rosenthal (1989), to express sarcastically that "Surely, God loves the .06 nearly as much as the .05" (p. 1277).
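To make the mechanics of this decision rule concrete, the following minimal sketch (added here, not part of the original article) obtains a p value for a two-group comparison with an independent-samples t test and applies the conventional alpha = .05 cut-off. It assumes Python with NumPy and SciPy, and all of the simulated means, standard deviations and group sizes are hypothetical values chosen only for illustration.

```python
# Minimal NHST sketch: compare two hypothetical groups with an
# independent-samples t test and apply the conventional alpha = .05 cut-off.
# All numbers below are illustrative assumptions, not data from the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha = 0.05

# Hypothetical quality-of-life scores (0-100 scale) for treatment and control
treatment = rng.normal(loc=60, scale=25, size=50)
control = rng.normal(loc=50, scale=25, size=50)

t_stat, p_value = stats.ttest_ind(treatment, control)  # two-sided by default

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference is 'statistically significant'.")
else:
    print("Fail to reject H0.")
# Note that this decision says nothing about the size or practical
# importance of the difference, which is the article's central criticism.
```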
NHST is deeply implanted in the minds of researchers and encourages dichotomous thinking, a type of thinking that perceives the world only in black or white, without intermediate shades (Kirk, 2001). From the perspective of NHST, results are significant or not significant, and even worse, this approach has led to the idea that if the results are significant, they are real, and if they are not significant, then they are not real, which has slowed the advancement of science. Furthermore, this approach has provoked researchers not to report the data from their work because they consider such data to be not significant; their reasoning is that "there were no important results" or that the "hypothesis was not proved." Moreover, publishing only the statistically significant data in scientific journals skews the corresponding knowledge and gives the wrong idea about psychological phenomena (Cumming, 2014). For that reason, Cumming (2014) has proposed that we consider the so-called "new statistics," the transition from dichotomous thinking to estimation thinking, by using ES, confidence intervals and meta-analysis.
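As a minimal illustration of the estimation toolkit described above, the sketch below pools several hypothetical standardized effect sizes with a fixed-effect, inverse-variance-weighted meta-analysis. The effect sizes and standard errors are invented for illustration, and the weighting formula is the standard fixed-effect approach rather than a procedure taken from the article.

```python
# Fixed-effect meta-analysis sketch: inverse-variance weighting of
# hypothetical standardized effect sizes (e.g., Cohen's d) from five studies.
import numpy as np

d = np.array([0.42, 0.30, 0.55, 0.18, 0.47])    # hypothetical effect sizes
se = np.array([0.20, 0.15, 0.25, 0.12, 0.18])   # their standard errors

w = 1.0 / se**2                                  # inverse-variance weights
d_pooled = np.sum(w * d) / np.sum(w)             # weighted mean effect
se_pooled = np.sqrt(1.0 / np.sum(w))             # SE of the pooled effect
ci_low, ci_high = d_pooled - 1.96 * se_pooled, d_pooled + 1.96 * se_pooled

print(f"Pooled d = {d_pooled:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```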
Some authors (Cumming, 2014; Schmidt and Hunter, 2004; Tressoldi, Giofre, Sella, & Cumming, 2013) summarize the difficulties of NHST as follows:
1. NHST is centered on rejecting the null hypothesis at a level that was previously chosen, usually .05; thus, researchers obtain only an answer to the question of whether there is or is not a change that is different from zero.
2. It is probable that the p value will be different if the experiment is repeated, which means that p values offer a very loose measure of result replicability (a small simulation following this list illustrates the point).
3. NHST does not offer an ES estimation.

4. NHST does not provide information about the accuracy and error probability of an estimated parameter.
5. Randomness (in sampling or in the assignment of participants to groups) is one of the key pieces of the NHST procedure, because without it such statistical contrasts are irrelevant when the null hypothesis is assumed to be false a priori. Nevertheless, there is no evidence that the null hypothesis is true with respect to attempting to reject it.
6. The likelihood of rejecting the null hypothesis increases as the sample size increases; therefore, NHST tells us more about N than about the hypothesis. The interpretation of statistical significance becomes meaningless when the sample size is so large that any detected difference, however small or even trivial, allows the rejection of the null hypothesis. For example, when applying an intervention to a group of N = 50 compared with a control group of the same size, on a quality-of-life scale from 0 to 100 points, a difference of about 10 points between the two groups is needed to reach p < .05, but with a sample of 500 persons in each group, statistical significance can be reached with a difference of only 3 points (see Figure 1 and the sketch that follows it). The question is whether a treatment that improves quality of life by only 3 points is important to patients, regardless of whether the difference is statistically significant.
7. Many researchers espouse the idea that the significance level is equal to causality, but it is not; statistical significance is only one element among many that enables us to discuss causality (Nyirongo, Mukaka & Kalilani-Phiri, 2008).
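Point 2 above, the instability of p across replications, is easy to demonstrate by simulation. The sketch below (an added illustration, not material from the article) repeatedly draws two samples from populations with a fixed, modest true difference and records the p value each time; the sample size, means, standard deviation and number of replications are arbitrary assumptions.

```python
# "Dance of the p values" sketch: replicate the same two-group experiment
# many times and watch how much p varies even though the true effect is fixed.
# Sample sizes, means and SD are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n, true_diff, sd, reps = 30, 8.0, 25.0, 1000

p_values = []
for _ in range(reps):
    treatment = rng.normal(50 + true_diff, sd, n)
    control = rng.normal(50, sd, n)
    p_values.append(stats.ttest_ind(treatment, control).pvalue)

p_values = np.array(p_values)
print(f"Median p = {np.median(p_values):.3f}")
print(f"Range of p: {p_values.min():.4f} to {p_values.max():.3f}")
print(f"Share of replications with p < .05: {np.mean(p_values < 0.05):.0%}")
```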
Some statisticians and researchers believe that not only is NHST unnecessary, it has also damaged scientific development. As stated by Schmidt and Hunter (2002, p. 65): "Significance tests are a disastrous method for testing hypotheses, but a better method does exist: the use of point estimates (ESs) and confidence intervals (CIs)."
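To show what reporting a point estimate together with its confidence interval can look like in practice, the sketch below computes Cohen's d for two hypothetical groups and an approximate 95% CI based on a common large-sample formula for the standard error of d. The data, the helper function name and the choice of formula are illustrative assumptions, not computations taken from the article.

```python
# Effect-size sketch: Cohen's d with an approximate 95% CI for two
# hypothetical groups (large-sample normal approximation for the SE of d).
import numpy as np

def cohens_d_with_ci(x, y, z=1.96):
    n1, n2 = len(x), len(y)
    # Pooled standard deviation
    sp = np.sqrt(((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1))
                 / (n1 + n2 - 2))
    d = (np.mean(x) - np.mean(y)) / sp
    # Common approximation for the standard error of d
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

rng = np.random.default_rng(seed=7)
treatment = rng.normal(60, 25, 50)   # hypothetical quality-of-life scores
control = rng.normal(50, 25, 50)

d, (lo, hi) = cohens_d_with_ci(treatment, control)
print(f"Cohen's d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A narrow interval that excludes zero conveys both the magnitude and the precision of the effect, information that a bare p value does not provide.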
Figure 1. This figure shows that the larger the sample size is, the smaller the differences between groups that will be detected at a significance of p < .05 (x-axis: sample size, from 20 to 500; y-axis: difference in quality of life on a 0–100 scale, from 0 to 16 points).

Eect size, condence intervals and statistical power in psychological research 31
Why are NHSTs still used?
If NHST offers little information to prove a hypothesis, then the question becomes, why is it still used? Perhaps the explanation of their intensive use in psychological research can be found in the fact that most of the measurements are ordinal.
Inferential statistical tests have been used and are still used in making the decision to reject or accept a hypothesis. It is likely that the main attraction of NHSTs is their objectivity when establishing a criterion such as the p < .05 minimum value, which excludes researcher subjectivity; on the other hand, practical and clinical significance requires a component of subjectivity. Often, researchers do not want to commit to a decision that is imbued with implicit subjectivity, for example, in terms of social relevance, clinical importance, and financial benefit. Nonetheless, Kirk (2001) argues that science would gain greater benefits if researchers focused on the magnitude of an effect and its practical significance, believing as well that "No one is in a better position than the researcher who collected and analyzed the data to decide whether the effects are trivial or not" (p. 214).
Change of APA editorial policy (under pressure) regarding ES
For many years, critics of NHST, usually experts in statistics and methodology in the social and behavioral sciences, have recommended reporting ES in addition to statistical significance (Wilkinson, 1999).
This pressure was especially high in the American Psychological Association (APA) and was reflected for the first time in the Publication Manual of the APA, 4th ed. (1994), in which authors of research studies are "encouraged" to report the ES (p. 18). This soft recommendation, however, contrasted with rigid demands for less essential aspects, such as the order and form of the literature references.
In 1999, after a long period of work, Wilkinson and the APA Task Force on Statistical Inference prepared a report that stated: "researchers must always publish the effect size in the main results" (p. 599).
In response to Wilkinson and the Task Force recommendations, the APA, in its Publication Manual, 5th ed. (2001), recommended the following to researchers: "For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of the effect size or strength of the relationship in your Results section" (p. 25). As observed, the APA had yet to dare to fully endorse the use of ES.
In the sixth edition, the APA (2010) stated that "NHST is but a starting point" and that additional reporting elements such as effect sizes, confidence intervals, and extensive description are needed to convey the most complete meaning of the results (p. 33). Additionally, it stated that a complete report of all of the hypotheses tested, effect size estimates and their confidence intervals are the minimum expectations for all APA journals (APA, 2010, p. 33). In this last edition, the APA already widely recommended the use of ES in addition to ES confidence intervals, and the fact that it affirmed that NHST "is but a starting point" indicates a very clear change in the method of analyzing and interpreting research results in psychology. This was a consequence of the pressure from outstanding researchers and statisticians, such as

Citations

Recommendations for the Conduct, Reporting, Editing and Publication of Scholarly Work in Medical Journals

TL;DR: The ICMJE recommendations, the Recommendations for the Conduct, Reporting, Editing and Publication of Scholarly Work in Medical Journals, are published for the first time online at www.ICMJE.org.

Analyzing Students' Misconceptions about Newton's Laws through Four-Tier Newtonian Test (FTNT).

TL;DR: In this article, a study aimed at analyzing student misconceptions about Newton's Laws through the Four-Tier Newtonian Test (FTNT) was carried out on 30 students (15 boys and 15 girls, with an average age of 16 years) at a senior high school in Bandung, Indonesia.
Journal ArticleDOI

The measurement scale of resilience among family caregivers of children with cancer: a psychometric evaluation

TL;DR: The Mexican Measurement Scale of Resilience RESI-M shows reliability and construct validity in family caregivers of children with cancer and does not show a bias in relation to social desirability.
Journal ArticleDOI

Reduced perineuronal net expression in Fmr1 KO mice auditory cortex and amygdala is linked to impaired fear-associated memory.

TL;DR: These studies suggest a link between impaired PV and PNN regulation within specific regions of the fear conditioning circuit and impaired tone memory formation in Fmr1 KO mice.
Journal ArticleDOI

Psychological Effects of Group Hypnotherapy on Breast Cancer Patients During Chemotherapy

TL;DR: Results show that the hypnotherapy group significantly decreased anxiety, distress, increased self- esteem, and optimism in the first 12 sessions, but at the end of the 24 sessions, only self-esteem and optimism remained significant compared with the control group.
References
More filters
Book

Statistical Power Analysis for the Behavioral Sciences

TL;DR: The concepts of power analysis are discussed in this paper, where Chi-square Tests for Goodness of Fit and Contingency Tables, t-Test for Means, and Sign Test are used.
Journal ArticleDOI

A power primer.

TL;DR: A convenient, although not comprehensive, presentation of required sample sizes is provided. The sample sizes necessary for .80 power to detect effects at these levels are tabled for eight standard statistical tests.
Journal ArticleDOI

The earth is round (p < .05)

TL;DR: The author reviewed the problems with null hypothesis significance testing, including the near universal misinterpretation of p as the probability that H₀ is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects H₀ one thereby affirms the theory that led to the test.
Journal ArticleDOI

Interpretation of changes in health-related quality of life the remarkable universality of half a standard deviation

TL;DR: In most circumstances, the threshold of discrimination for changes in health-related quality of life for chronic diseases appears to be approximately half a SD, which research in psychology has shown is approximately 1 part in 7.
Related Papers (5)

Manipulating the Alpha Level Cannot Cure Significance Testing

David Trafimow, +60 more