Psychology in Russia: State of the Art
Volume 8, Issue 3, 2015
Lomonosov Moscow State University
Russian Psychological Society
ISSN 2074-6857 (Print) / ISSN 2307-2202 (Online)
© Lomonosov Moscow State University, 2015
© Russian Psychological Society, 2015
doi: 10.11621/pir.2015.0303
http://psychologyinrussia.com
Effect size, confidence intervals and statistical power
in psychological research 1

Arnoldo Téllez a,b*, Cirilo H. García a, Víctor Corral-Verdugo c

a Psychology School, Universidad Autónoma de Nuevo León (UANL), Monterrey, México
b Centro de Investigación y Desarrollo en Ciencias de la Salud (CIDICS), Universidad Autónoma de Nuevo León, Monterrey, México
c Universidad de Sonora, Hermosillo, México
* Corresponding author. E-mail: arnoldo.tellez@uanl.mx
Quantitative psychological research is focused on detecting the occurrence of certain population phenomena by analyzing data from a sample, and statistics is a particularly helpful mathematical tool that researchers use to evaluate hypotheses and make decisions to accept or reject them. In this paper, the various statistical tools in psychological research are reviewed. The limitations of null hypothesis significance testing (NHST) and the advantages of using effect size and its respective confidence intervals are explained, as the latter two measurements can provide important information about the results of a study. These measurements can also facilitate data interpretation and easily detect trivial effects, enabling researchers to make decisions in a more clinically relevant fashion. Moreover, it is recommended to establish an appropriate sample size by calculating the optimum statistical power at the moment the research is designed. Psychological journal editors are encouraged to follow APA recommendations strictly and to ask authors of original research studies to report the effect size, its confidence intervals, statistical power and, when required, any measure of clinical significance. Additionally, we must attend to the teaching of statistics at the graduate level. At that level, students do not receive sufficient information concerning the importance of using different types of effect sizes and their confidence intervals according to the different types of research designs; instead, most of the information is focused on the various tools of NHST.
Keywords: effect size, confidence intervals, statistical power, NHST
1 A brighter day is dawning in which researchers will ask not only if a sample result is likely but also if an effect is practically noteworthy or replicable (Thompson, 2002).
Introduction
In the last three decades, a critical movement within quantitative psychological research began to develop. This development emerged in response to the misguided use of classical statistics based on null hypothesis significance testing (NHST). NHST promotes dichotomous thinking and provides limited information regarding the essence of investigated phenomena. Dichotomous thinking in science -- manifested as only accepting or rejecting research hypotheses -- hinders the advancement of science and may even skew the accumulation of knowledge by stimulating the exclusive publication of studies whose hypotheses have been accepted. The so-called "new statistics" movement critically challenges NHST postulates and operations. This movement is an approach based on the analytical tool of estimation (Cumming, 2014), which promotes the use of effect size as a descriptive statistic, confidence intervals as inferential statistics, and meta-analysis as a reliable form of knowledge accumulation. This paper is intended to analyze the disadvantages of NHST and the advantages of using effect size, confidence intervals and statistical power in quantitative psychological research, especially in clinical studies. Also noted and stressed is the need for editors of scientific psychological journals to adhere to the policies recommended by the APA in this regard.
Quantitative research in psychology
Typically, quantitative psychological research is focused on detecting the occurrence of certain population phenomena by analyzing data from a sample. An example is the case in which a researcher wishes to know whether a treatment to improve the quality of life of those who suffer from breast cancer performs better than a placebo treatment given to another group or to those on a waiting list (also known as the control group or contrast group) (Wilkinson, 1999). Similarly, statistics is used to decide whether an independent variable or treatment did or did not have an important effect. In quantitative research methodology, there are two ways of quantifying this effect: (1) null hypothesis significance testing (NHST) and (2) effect size (ES), together with its respective confidence intervals (CI). These two approaches are reviewed below.
Null Hypothesis Significance Testing (NHST)
The outcome of NHST depends on the effect size in the population, the size of the sample used, and the alpha level or p value that is selected (p being the abbreviation for probability). Most psychology research is focused on rejecting the null hypothesis and obtaining a small p value instead of observing the relevance of the results that are obtained (Kirk, 2001).
Among the limitations of NHST are its sensitivity to sample size, its inability to accept the null hypothesis, and its lack of capacity to determine the practical significance of the statistical results.
Kirk (2001) states that NHST only establishes the probability of obtaining a more or less extreme effect if the null hypothesis is true. It does not, however, communicate the magnitude of the effect or its practical significance, meaning
whether the effect is found to be useful or important. As a result, inferential statistical testing has been criticized; as expressed by Ivarsson, Andersen, Johnson, and Lindwall (2013): "p levels may have little, if anything, to do with real-world meaning and practical value" (p. 97). Some authors, such as Schmidt (1996), even suggest that statistical contrast is unnecessary and recommend focusing only on ES estimation, and Cohen (1994) suggests that "NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it" (p. 997).
Ronald Fisher was the father of modern statistics and experimental design. Since his time, it has been established as a convention that the p value for statistical significance must be less than .05, which means that an observed difference between two groups has less than a 5% probability of occurring by chance or sampling error if the null hypothesis is assumed to be true. In other words, if p is equal to or less than .05, then the null hypothesis can be rejected because 95 times out of 100, the observed difference between the means is not due to chance. Some researchers use stricter significance levels, such as p ≤ .01 (1%) and p ≤ .005 (0.5%). The convention of p < .05 has been used almost blindly until now. The question is why a different p value has not been agreed upon, for example, .06 or .04. Indeed, there is no theoretical or practical argument that sustains the criterion of p ≤ .05 as an important cut-off point. This circumstance has led some statistical experts and methodologists, such as Rosnow and Rosenthal (1989), to remark sarcastically that "Surely, God loves the .06 nearly as much as the .05" (p. 1277).
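The arbitrariness of the .05 cut-off can be illustrated with a quick computation. The following sketch uses a two-sided z-approximation for the difference between two equal groups; the group size (n = 50 per group) and standard deviation (25 points) are illustrative assumptions, not figures taken from any study cited here:

```python
import math

def two_sample_p(diff, sd, n):
    """Two-sided p value for a mean difference `diff` between two
    groups of size n each, using a normal (z) approximation."""
    z = diff / (sd * math.sqrt(2.0 / n))
    # p = 2 * (1 - Phi(z)), with Phi computed from the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

print(two_sample_p(9.7, 25.0, 50))   # ~ .052, "not significant"
print(two_sample_p(9.9, 25.0, 50))   # ~ .048, "significant"
```

Two observed differences that are practically indistinguishable (9.7 versus 9.9 points on a 0-100 scale) fall on opposite sides of the .05 line, which is precisely the point of Rosnow and Rosenthal's remark.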
NHST is deeply implanted in the minds of researchers and encourages dichotomous thinking, a type of thinking that perceives the world only in black and white, without intermediate shades (Kirk, 2001). From the perspective of NHST, results are significant or not significant; even worse, this approach has led to the idea that if the results are significant, they are real, and if they are not significant, then they are not real, which has slowed the advancement of science. Furthermore, this approach has led researchers not to report the data from their work because they consider such data to be not significant; their reasoning is that "there were no important results" or that the "hypothesis was not proved." Moreover, publishing only the statistically significant data in scientific journals skews the corresponding knowledge and gives a wrong idea about psychological phenomena (Cumming, 2014). For that reason, Cumming (2014) has proposed that we consider the so-called "new statistics," the transition from dichotomous thinking to estimation thinking, by using ES, confidence intervals and meta-analysis.
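Estimation thinking of this kind can be sketched in a few lines of code. The function below computes Cohen's d for two independent groups together with an approximate 95% confidence interval based on the common large-sample standard error of d; the means, standard deviations and group sizes in the usage line are illustrative assumptions, not data from any study discussed here:

```python
import math

def cohens_d_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Cohen's d for two independent groups plus an approximate 95% CI."""
    # Pooled standard deviation of the two groups
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    # Large-sample standard error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Hypothetical example: treatment mean 60, control mean 50, SD 25, n = 50 each
d, (lo, hi) = cohens_d_ci(60, 25, 50, 50, 25, 50)
```

Reporting the result as, say, d = 0.40, 95% CI [0.00, 0.80] conveys both the magnitude of the effect and how imprecisely it has been estimated, information that a bare p value cannot provide.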
Some authors (Cumming, 2014; Schmidt and Hunter, 2004; Tressoldi, Giofrè, Sella, & Cumming, 2013) summarize the difficulties of NHST as follows:
1. NHST is centered on rejecting the null hypothesis at a level that was previously chosen, usually .05; thus, researchers obtain only an answer to the question of whether there is or is not a change that is different from zero.
2. It is probable that the p value will be different if the experiment is repeated, which means that p values offer a very loose measure of result replicability.
3. NHST does not offer an ES estimation.
4. NHST does not provide information about the accuracy and error probability of an estimated parameter.
5. Randomness (in sampling or in participants' assignment to groups) is one of the key pieces of the NHST procedure, because without it such statistical contrasts are irrelevant, as the null hypothesis can then be assumed to be false a priori. Nevertheless, there is no evidence that the null hypothesis is true when one attempts to reject it.
6. The likelihood of rejecting the null hypothesis increases as the sample size increases; therefore, NHST tells us more about N than about the hypothesis. The interpretation of statistical significance becomes meaningless when the sample size is so large that any detected difference, however small or even trivial, allows the rejection of the null hypothesis. For example, when applying an intervention to a group of N = 50 compared with a control group of the same size, on a quality-of-life scale from 0 to 100 points, a difference of 10 points between the two groups is needed to reach p < .05, but with a sample of 500 persons in each group, statistical significance can be reached with a difference of only 3 points (see Figure 1). The question is: is a treatment that produces a quality-of-life improvement of only 3 points important to patients, regardless of whether it is statistically significant?
7. Many researchers espouse the idea that the significance level equals causality, but it does not; statistical significance is only one element among many that enables us to discuss causality (Nyirongo, Mukaka & Kalilani-Phiri, 2008).
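Point 6 can be verified with a back-of-the-envelope calculation. Under a two-sided z-approximation, the smallest mean difference between two equal groups that reaches p < .05 is roughly 1.96 · SD · sqrt(2/n). Assuming (hypothetically) a standard deviation of about 25 points on the 0-100 quality-of-life scale, this reproduces the figures quoted above:

```python
import math

def min_detectable_diff(sd, n, z=1.96):
    """Smallest mean difference between two groups of size n each
    that reaches two-sided p < .05 under a normal approximation."""
    return z * sd * math.sqrt(2.0 / n)

for n in (50, 500):
    print(n, round(min_detectable_diff(25.0, n), 1))
# n = 50  -> about 9.8 points; n = 500 -> about 3.1 points
```

The detectable difference shrinks with sqrt(n), so a tenfold increase in sample size makes a difference roughly three times smaller "significant", whether or not it matters to patients.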
Some statisticians and researchers believe that not only is NHST unnecessary, it has also damaged scientific development. As stated by Schmidt and Hunter (2002, p. 65): "Significance tests are a disastrous method for testing hypotheses, but a better method does exist: the use of point estimates (ESs) and confidence intervals (CIs)."
Figure 1. This figure shows that the larger the sample size is, the smaller the differences between groups that will be detected at a significance of p < .05. (Axes: sample size, 20 to 500, versus difference in quality of life on a 0-100 scale.)
Why are NHSTs still used?
If NHST offers little information to prove a hypothesis, then the question becomes, why is it still used? Perhaps the explanation of its intensive use in psychological research can be found in the fact that most of the measurements are ordinal.
Inferential statistical tests have been used and are still used in making the decision to reject or accept a hypothesis. It is likely that the main attraction of NHST is its apparent objectivity in establishing a criterion such as the p < .05 minimum value, which excludes researcher subjectivity; on the other hand, practical and clinical significance requires a component of subjectivity. Often, researchers do not want to commit to a decision that carries implicit subjectivity, for example, in terms of social relevance, clinical importance, and financial benefit. Nonetheless, Kirk (2001) argues that science would gain greater benefits if researchers focused on the magnitude of an effect and its practical significance, believing as well that "No one is in a better position than the researcher who collected and analyzed the data to decide whether the effects are trivial or not" (p. 214).
Change of APA editorial policy (under pressure) regarding ES
For many years, critics of NHST, usually experts in statistics and methodology in the social and behavioral sciences, have recommended reporting ES in addition to statistical significance (Wilkinson, 1999).
This pressure was especially high in the American Psychological Association (APA) and was reflected for the first time in the Publication Manual of the APA, 4th ed. (1994), in which authors of research studies are "encouraged" to report the ES (p. 18). This soft recommendation, however, contrasted with rigid demands for less essential aspects, such as the order and form of the literature references.
In 1999, after a long period of work, Wilkinson and the APA Task Force on Statistical Inference prepared a report that stated: "researchers must always publish the effect size in the main results" (p. 599).
In response to Wilkinson and the Task Force recommendations, the APA, in its Publication Manual, 5th ed. (2001), recommended the following to researchers: "For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of the effect size or strength of the relationship in your Results section" (p. 25). As observed, the APA did not yet dare to fully endorse the use of ES.
In the sixth edition, the APA (2010) stated that "NHST is but a starting point and that additional reporting elements such as effect sizes, confidence intervals, and extensive description are needed to convey the most complete meaning of the results" (p. 33). Additionally, it stated that a complete report of all of the hypotheses tested, effect size estimates and their confidence intervals are the minimum expectations for all APA journals (APA, 2010, p. 33). In this last edition, the APA already widely recommended the use of ES in addition to ES confidence intervals, and the fact that it affirmed that NHST "is but a starting point" indicates a very clear change in the method of analyzing and interpreting research results in psychology. This was a consequence of the pressure from outstanding researchers and statisticians, such as