Psychology in Russia: State of the Art
Volume 8, Issue 3, 2015
Lomonosov Moscow State University
Russian Psychological Society
ISSN 2074-6857 (Print) / ISSN 2307-2202 (Online)
© Lomonosov Moscow State University, 2015
© Russian Psychological Society, 2015
doi: 10.11621/pir.2015.0303
http://psychologyinrussia.com
Effect size, confidence intervals and statistical power
in psychological research 1

Arnoldo Téllez a,b*, Cirilo H. García a, Víctor Corral-Verdugo c

a Psychology School, Universidad Autónoma de Nuevo León (UANL), Monterrey, México
b Centro de Investigación y Desarrollo en Ciencias de la Salud (CIDICS), Universidad Autónoma de Nuevo León, Monterrey, México
c Universidad de Sonora, Hermosillo, México
* Corresponding author. E-mail: arnoldo.tellez@uanl.mx
Quantitative psychological research is focused on detecting the occurrence of certain population phenomena by analyzing data from a sample, and statistics is a particularly helpful mathematical tool that researchers use to evaluate hypotheses and make decisions to accept or reject them. In this paper, the various statistical tools in psychological research are reviewed. The limitations of null hypothesis significance testing (NHST) and the advantages of using effect size and its respective confidence intervals are explained, as the latter two measurements can provide important information about the results of a study. These measurements can also facilitate data interpretation and easily detect trivial effects, enabling researchers to make decisions in a more clinically relevant fashion. Moreover, it is recommended to establish an appropriate sample size by calculating the optimum statistical power at the moment the research is designed. Psychological journal editors are encouraged to follow APA recommendations strictly and to ask authors of original research studies to report the effect size, its confidence intervals, statistical power and, when required, any measure of clinical significance. Additionally, we must attend to the teaching of statistics at the graduate level. At that level, students do not receive sufficient information concerning the importance of using different types of effect sizes and their confidence intervals according to the different types of research designs; instead, most of the information is focused on the various tools of NHST.
Keywords: effect size, confidence intervals, statistical power, NHST
1 A brighter day is dawning in which researchers will ask not only if a sample result is likely but also if an effect is practically noteworthy or replicable (Thompson, 2002).
Introduction
In the last three decades, a critical movement within quantitative psychological research began to develop. This development emerged in response to the misguided use of classical statistics based on null hypothesis significance testing (NHST). NHST promotes dichotomous thinking and provides limited information regarding the essence of investigated phenomena. Dichotomous thinking in science -- manifested as only accepting or rejecting research hypotheses -- hinders the advancement of science and may even skew the accumulation of knowledge by stimulating the exclusive publication of studies whose hypotheses have been accepted. The so-called "new statistics" movement critically challenges NHST postulates and operations. This movement is an approach based on the analytical tool of estimation (Cumming, 2014), which promotes the use of effect size as a descriptive statistic, confidence intervals as inferential statistics, and meta-analysis as a reliable form of knowledge accumulation. This paper is intended to analyze the disadvantages of NHST and the advantages of using effect size, confidence intervals and statistical power in quantitative psychological research, especially in clinical studies. Also noted and stressed is the need for editors of scientific psychological journals to adhere to the policies recommended by the APA in this regard.
Quantitative research in psychology
Typically, quantitative psychological research is focused on detecting the occurrence of certain population phenomena by analyzing data from a sample. An example is the case in which a researcher wishes to know whether a treatment to improve the quality of life of those who suffer from breast cancer performs better than a placebo treatment given to another group or to those on a waiting list (also known as the control group or contrast group) (Wilkinson, 1999). Similarly, statistics is used to decide whether an independent variable or treatment did or did not have an important effect. In quantitative research methodology, there are two ways of quantifying this effect: (1) null hypothesis significance testing (NHST) and (2) effect size (ES), together with its respective confidence intervals (CI). These two approaches are reviewed below.
Null Hypothesis Significance Testing (NHST)
The outcome of NHST depends on the effect size in the population, the size of the sample used, and the alpha level or p value that is selected (p being the abbreviation for probability). Most psychology research is focused on rejecting the null hypothesis and obtaining a small p value instead of observing the relevance of the results that are obtained (Kirk, 2001).
Among the limitations of NHST are its sensitivity to sample size, its inability to accept the null hypothesis, and its lack of capacity to determine the practical significance of the statistical results.
Kirk (2001) states that NHST only establishes the probability of obtaining a more or less extreme effect if the null hypothesis is true. It does not, however, communicate the magnitude of the effect or its practical significance, meaning
whether the effect is found to be useful or important. As a result, inferential statistical testing has been criticized; as expressed by Ivarsson, Andersen, Johnson, and Lindwall (2013): "p levels may have little, if anything, to do with real-world meaning and practical value" (p. 97). Some authors, such as Schmidt (1996), even suggest that statistical contrast is unnecessary and recommend focusing only on ES estimation, and Cohen (1994) suggests that "NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it" (p. 997).
Ronald Fisher was the father of modern statistics and experimental design. Since his time, it has been established as a convention that the p value for statistical significance must be less than .05, which means that an observed difference between two groups has less than a 5% probability of occurring by chance or sampling error if the null hypothesis is assumed to be true. In other words, if p is equal to or less than .05, then the null hypothesis can be rejected because 95 times out of 100, the observed difference between the means is not due to chance. Some researchers use stricter significance levels, such as p ≤ .01 (1%) and p ≤ .005 (0.5%). The convention of p < .05 has been used almost blindly until now. The question is why a different p value has not been agreed upon, for example, .06 or .04. Indeed, there is no theoretical or practical argument that sustains the criterion of p ≤ .05 as an important cut-off point. This circumstance has led some statistical experts and methodologists, such as Rosnow and Rosenthal (1989), to remark sarcastically that "Surely, God loves the .06 nearly as much as the .05" (p. 1277).
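The arbitrariness of the .05 cut-off can be illustrated with a quick computation. The following sketch uses a two-sided z-approximation for the difference between two equal groups; the group size (n = 50 per group) and standard deviation (25 points) are illustrative assumptions, not figures taken from any study cited here:

```python
import math

def two_sample_p(diff, sd, n):
    """Two-sided p value for a mean difference `diff` between two
    groups of size n each, using a normal (z) approximation."""
    z = diff / (sd * math.sqrt(2.0 / n))
    # p = 2 * (1 - Phi(z)), with Phi computed from the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

print(two_sample_p(9.7, 25.0, 50))   # ~ .052, "not significant"
print(two_sample_p(9.9, 25.0, 50))   # ~ .048, "significant"
```

Two observed differences that are practically indistinguishable (9.7 versus 9.9 points on a 0-100 scale) fall on opposite sides of the .05 line, which is precisely the point of Rosnow and Rosenthal's remark.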
NHST is deeply implanted in the minds of researchers and encourages dichotomous thinking, a type of thinking that perceives the world only in black and white, without intermediate shades (Kirk, 2001). From the perspective of NHST, results are significant or not significant; even worse, this approach has led to the idea that if the results are significant, they are real, and if they are not significant, then they are not real, which has slowed the advancement of science. Furthermore, this approach has led researchers not to report the data from their work because they consider such data to be not significant; their reasoning is that "there were no important results" or that the "hypothesis was not proved." Moreover, publishing only the statistically significant data in scientific journals skews the corresponding knowledge and gives a wrong idea about psychological phenomena (Cumming, 2014). For that reason, Cumming (2014) has proposed that we consider the so-called "new statistics," the transition from dichotomous thinking to estimation thinking, by using ES, confidence intervals and meta-analysis.
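Estimation thinking of this kind can be sketched in a few lines of code. The function below computes Cohen's d for two independent groups together with an approximate 95% confidence interval based on the common large-sample standard error of d; the means, standard deviations and group sizes in the usage line are illustrative assumptions, not data from any study discussed here:

```python
import math

def cohens_d_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Cohen's d for two independent groups plus an approximate 95% CI."""
    # Pooled standard deviation of the two groups
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    # Large-sample standard error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Hypothetical example: treatment mean 60, control mean 50, SD 25, n = 50 each
d, (lo, hi) = cohens_d_ci(60, 25, 50, 50, 25, 50)
```

Reporting the result as, say, d = 0.40, 95% CI [0.00, 0.80] conveys both the magnitude of the effect and how imprecisely it has been estimated, information that a bare p value cannot provide.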
Some authors (Cumming, 2014; Schmidt and Hunter, 2004; Tressoldi, Giofrè, Sella, & Cumming, 2013) summarize the difficulties of NHST as follows:
1. NHST is centered on rejecting the null hypothesis at a level that was previously chosen, usually .05; thus, researchers obtain only an answer to the question of whether there is or is not a change that is different from zero.
2. It is probable that the p value will be different if the experiment is repeated, which means that p values offer a very loose measure of result replicability.
3. NHST does not offer an ES estimation.
4. NHST does not provide information about the accuracy and error probability of an estimated parameter.
5. Randomness (in sampling or in participants' assignment to groups) is one of the key pieces of the NHST procedure, because without it such statistical contrasts are irrelevant, as the null hypothesis can then be assumed to be false a priori. Nevertheless, there is no evidence that the null hypothesis is true when one attempts to reject it.
6. The likelihood of rejecting the null hypothesis increases as the sample size increases; therefore, NHST tells us more about N than about the hypothesis. The interpretation of statistical significance becomes meaningless when the sample size is so large that any detected difference, however small or even trivial, allows the rejection of the null hypothesis. For example, when applying an intervention to a group of N = 50 compared with a control group of the same size, on a quality-of-life scale from 0 to 100 points, a difference of 10 points between the two groups is needed to reach p < .05, but with a sample of 500 persons in each group, statistical significance can be reached with a difference of only 3 points (see Figure 1). The question is: is a treatment that produces a quality-of-life improvement of only 3 points important to patients, regardless of whether it is statistically significant?
7. Many researchers espouse the idea that the significance level equals causality, but it does not; statistical significance is only one element among many that enables us to discuss causality (Nyirongo, Mukaka & Kalilani-Phiri, 2008).
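Point 6 can be verified with a back-of-the-envelope calculation. Under a two-sided z-approximation, the smallest mean difference between two equal groups that reaches p < .05 is roughly 1.96 · SD · sqrt(2/n). Assuming (hypothetically) a standard deviation of about 25 points on the 0-100 quality-of-life scale, this reproduces the figures quoted above:

```python
import math

def min_detectable_diff(sd, n, z=1.96):
    """Smallest mean difference between two groups of size n each
    that reaches two-sided p < .05 under a normal approximation."""
    return z * sd * math.sqrt(2.0 / n)

for n in (50, 500):
    print(n, round(min_detectable_diff(25.0, n), 1))
# n = 50  -> about 9.8 points; n = 500 -> about 3.1 points
```

The detectable difference shrinks with sqrt(n), so a tenfold increase in sample size makes a difference roughly three times smaller "significant", whether or not it matters to patients.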
Some statisticians and researchers believe that not only is NHST unnecessary, it has also damaged scientific development. As stated by Schmidt and Hunter (2002, p. 65): "Significance tests are a disastrous method for testing hypotheses, but a better method does exist: the use of point estimates (ESs) and confidence intervals (CIs)."
Figure 1. This figure shows that the larger the sample size is, the smaller the differences between groups that will be detected at a significance of p < .05. (Axes: sample size, 20 to 500, versus difference in quality of life on a 0-100 scale.)
Why are NHSTs still used?
If NHST offers little information to prove a hypothesis, then the question becomes, why is it still used? Perhaps the explanation of its intensive use in psychological research can be found in the fact that most of the measurements are ordinal.
Inferential statistical tests have been used and are still used in making the decision to reject or accept a hypothesis. It is likely that the main attraction of NHST is its apparent objectivity in establishing a criterion such as the p < .05 minimum value, which excludes researcher subjectivity; on the other hand, practical and clinical significance requires a component of subjectivity. Often, researchers do not want to commit to a decision that carries implicit subjectivity, for example, in terms of social relevance, clinical importance, and financial benefit. Nonetheless, Kirk (2001) argues that science would gain greater benefits if researchers focused on the magnitude of an effect and its practical significance, believing as well that "No one is in a better position than the researcher who collected and analyzed the data to decide whether the effects are trivial or not" (p. 214).
Change of APA editorial policy (under pressure) regarding ES
For many years, critics of NHST, usually experts in statistics and methodology in the social and behavioral sciences, have recommended reporting ES in addition to statistical significance (Wilkinson, 1999).
This pressure was especially high in the American Psychological Association (APA) and was reflected for the first time in the Publication Manual of the APA, 4th ed. (1994), in which authors of research studies are "encouraged" to report the ES (p. 18). This soft recommendation, however, contrasted with rigid demands for less essential aspects, such as the order and form of the literature references.
In 1999, after a long period of work, Wilkinson and the APA Task Force on Statistical Inference prepared a report that stated: "researchers must always publish the effect size in the main results" (p. 599).
In response to Wilkinson and the Task Force recommendations, the APA, in its Publication Manual, 5th ed. (2001), recommended the following to researchers: "For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of the effect size or strength of the relationship in your Results section" (p. 25). As observed, the APA did not yet dare to fully endorse the use of ES.
In the sixth edition, the APA (2010) stated that "NHST is but a starting point and that additional reporting elements such as effect sizes, confidence intervals, and extensive description are needed to convey the most complete meaning of the results" (p. 33). Additionally, it stated that a complete report of all of the hypotheses tested, effect size estimates and their confidence intervals are the minimum expectations for all APA journals (APA, 2010, p. 33). In this last edition, the APA already widely recommended the use of ES in addition to ES confidence intervals, and the fact that it affirmed that NHST "is but a starting point" indicates a very clear change in the method of analyzing and interpreting research results in psychology. This was a consequence of the pressure from outstanding researchers and statisticians, such as