Review Article
https://doi.org/10.1038/s41593-020-0660-4

Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence

Christian Keysers 1,2 ✉, Valeria Gazzola 1,2 and Eric-Jan Wagenmakers 2

1 Netherlands Institute for Neuroscience, Royal Netherlands Academy of Arts and Sciences, Amsterdam, The Netherlands. 2 Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands. ✉ e-mail: c.keysers@nin.knaw.nl

Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis, or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect.
Neuroscientists need to know, and publish, whether a manipulation has no effect as much as whether it does. One may use drugs to block a candidate pathway. If the drug has an effect, that pathway is involved; if it does not, one would like to conclude that the pathway is not involved. Or one may alter activity in a brain region X and measure behavior B. If de-activating X changes B, X is involved in B; if B remains unchanged, one would like to conclude that X is not involved in B.
Neuroscience research is characterized by advanced measurement techniques and sophisticated experimental designs, but the data analyses almost always employ the standard framework of frequentist statistics, featuring P value null-hypothesis significance testing (NHST). NHST is arguably appropriate when one wants to quantify evidence against the null hypothesis (H0: there is no effect) and therefore for the presence of an effect (but see ref. 1); however, NHST is problematic when one wants to quantify evidence for the null hypothesis. It is notoriously difficult to establish whether non-significant results support the null hypothesis (i.e., yield evidence for absence) or are simply not informative (i.e., show absence of evidence; refs. 2–4). NHST biases us to emphasize positive effects, because those are the effects it equips us to quantify, and to ignore null findings. If we agree that the absence of an effect is important information, this bias is unacceptable. Here we aim to highlight how an alternative statistical framework, Bayesian inference, can resolve this problem in neuroscience practice.
We will first illustrate why it is problematic to quantify evidence for the null hypothesis based on the dominant frequentist approaches. We will then show how Bayesian statistics provides a way out of this predicament, through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP (ref. 5).
The P value predicament
When we conduct a t-test to compare two conditions A and B, a resulting P value below a critical threshold α shows that one is unlikely to encounter differences this extreme or more extreme if the experimental manipulation had no effect (H0: μA = μB). For a fixed sample size, the smaller the P value, the more evidence we have against H0. Fisher argued that a low P value signals that "either the null hypothesis is false, or an exceptionally rare event has occurred" (ref. 6). But what if we find no significant effect (for example, P = 0.3)? Apart from sampling variability (i.e., 'bad luck'), there are two fundamentally different causal explanations for a non-significant P value: the manipulation had a non-zero effect, but the sample size was too small to detect it (i.e., there was insufficient power); or the manipulation had no effect (i.e., the true effect is zero). When the sample size is small, either explanation is plausible. As the sample size grows, a non-significant P value increasingly suggests that the manipulation did not have an effect (or an effect so small that it is not meaningful). While a power analysis can help disentangle these alternatives, the relationship between sample size, power, P value and evidence for H0 is complex enough that we are rightly reticent to draw strong conclusions from a non-significant P value. This has been famously and elegantly phrased in the antimetabole: 'absence of evidence [read: the data are not informative, the design was underpowered] is not evidence of absence [read: the data provide support in favor of the null]' (ref. 7).
Intuitively, one may believe that if lower P values provide more evidence against H0, higher P values should provide more evidence in favor of H0. We would thus expect that if we simulate truly random data with no effect, high P values should be relatively frequent, especially with large sample sizes. This, however, is not the case. When we draw random samples from two identical distributions (i.e., where H0 is true; Fig. 1a, leftmost column), P < 0.05 is rare (as expected), but all P values are equally likely. As the sample size increases, and we thus intuitively have more evidence that the two distributions have the same mean, high P values do not become more frequent (Fig. 1a, leftmost column, comparing the top and bottom rows). Higher P values are thus not a reliable metric of more evidence for H0.
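To make the flat distribution of P values under H0 concrete, the following minimal sketch (Python with NumPy and SciPy; illustrative only, not the simulation code behind Fig. 1) draws samples with no true effect and tabulates the resulting P values per decile; the counts stay roughly uniform whether n = 10 or n = 100.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (10, 100):
    # 1,000 one-sample t-tests on data with no true effect (mu = 0).
    # A two-sided test is used here for simplicity; under H0 the P values
    # are uniform for one- and two-sided tests alike.
    p_values = [stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
                for _ in range(1000)]
    counts, _ = np.histogram(p_values, bins=10, range=(0, 1))
    print(n, counts / 1000)   # roughly 0.1 in every decile, regardless of n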
Hence, NHST leaves the neuroscientist in a peculiar predicament: significant P values indicate evidence against H0 (but see refs. 1,8), but non-significant P values do not allow us to conclude that the data support H0. This inherent limitation of P values impedes our ability to draw the important conclusion that a manipulation has no effect and hence that a particular molecular pathway or brain circuitry is not involved, or that a particular stimulus dimension does not matter for brain activity.
A Bayesian solution
In contrast to frequentist NHST, which focuses exclusively on the null hypothesis (H0), Bayesian hypothesis testing aims to quantify the relative plausibility of two rival hypotheses, H1 and H0 (Box 1).
Figure 2 shows an example of how evidence is computed, using a Bayesian approach, for the case of a t-test when the question of interest is whether an experimental manipulation has a positive effect. This translates into two rival hypotheses: the manipulation had no effect versus the manipulation increased the dependent variable. Rather than being expressed in raw values specific to a given experiment, the hypotheses are expressed using the population standardized effect size δ (with δ = (μA – μB)/σ). The sceptic's hypothesis, H0: δ = 0, states that the effect is absent, whereas the alternative hypothesis, H+: δ > 0, states that the effect is positive (Fig. 2a). Note that a 'one-tailed' H1 is denoted H+ to indicate the direction of the hypothesized effect. To quantify which hypothesis best predicts the data, we quantify the observed effect size d (d = (mA – mB)/s) in the data and transform it into a t-value, t = d × √n, because the distribution of t-values expected for any δ is well known. Next, we transform the qualitative hypotheses H0 and H+ into quantitative predictions about the probability of encountering every t-value using this t-distribution. This is achieved by assigning prior probability distributions to δ (Fig. 2b) and then computing the probability of each observable t based on these δ-value distributions (Fig. 2c). For the sceptic's H0: δ = 0, the distribution of effect sizes is simply a spike at δ = 0 (red in Fig. 2b), and this makes predictions about the likelihood of each observable t-value using the same distribution that is used in a frequentist t-test with n participants: the Student's t distribution with n – 2 degrees of freedom (red in Fig. 2c). For H+: δ > 0, we need to be specific about the probability of each possible positive δ to become specific about t. The one-tailed nature of our hypothesis is reflected in a truncated distribution, with negative values having zero probability under H+ (ref. 9, p. 283; note that two-tailed hypotheses are usually implemented by means of symmetrical distributions, for example, the dotted line in Fig. 3b). We also know that most neuroscience papers report effect sizes of δ < 1 (ref. 10), with smaller effect sizes being more common than larger effect sizes; this is reflected in a peak for small positive δ and low probability for δ > 1. Indeed, the very fact that we feel the need to perform a test in the first place corresponds to the presumption that the effect size must be fairly small (ref. 9). These considerations about the plausible direction and magnitude of the effect under H+ generate the prior distribution shown in blue in Fig. 2b (see the section "Default priors provide an objective anchor" for guidance on how to define this prior distribution). For each of the hypothesized δ values, we can make predictions about t using the non-central t distribution with non-centrality parameter δ√n. The mixture of these non-central t-distributions associated with each δ, weighted by the prior plausibility of that δ, predicts the probability of each possible t-value under H+ (blue in Fig. 2c).
When the data arrive (Fig. 2d), we first calculate the t-value for our data, which we will call t1, and then see where t1 falls on the t-distribution expected under H0 (red) and under H+ (blue). The traditional frequentist P value corresponds to the area to the right of t1 under the red distribution; note that the predictions from H+, indicated by the blue distribution, are entirely ignored in the frequentist approach. In contrast, for the Bayesian approach, we take the ordinates p(t1 | H0) and p(t1 | H+) and calculate the evidence that the data provide in favor of H+ over H0 as p(t1 | H+) ÷ p(t1 | H0) (Fig. 2e). At that specific t1 value, the ratio equals 4, indicating that our data were predicted four times better by H+ than by H0; we may conclude that our data support H+. The evidence, that is, the relative predictive performance of H0 versus H+, is known as the Bayes factor (refs. 9,11,12; Box 1). We abbreviate it as BF and use subscripts to denote which model is in the numerator versus the denominator; thus, BF+0 = p(t1 | H+) ÷ p(t1 | H0) and BF0+ = p(t1 | H0) ÷ p(t1 | H+).
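For readers who want to reproduce the logic of Fig. 2 numerically, the sketch below (Python with SciPy; an illustration under stated assumptions, not the authors' or JASP's implementation) computes BF+0 for a one-sample or paired t-test as the ratio of the predictive densities of the observed t-value under H+ and H0, using a half-Cauchy prior of width r = √2/2 on δ.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def bf_plus_zero(t_obs, n, r=np.sqrt(2) / 2):
    # BF+0 = p(t_obs | H+) / p(t_obs | H0) for a one-sample/paired design.
    df = n - 1
    # Under H0 (delta = 0), t follows the central Student's t distribution.
    p_t_h0 = stats.t.pdf(t_obs, df)
    # Under H+, mix non-central t densities over positive delta values,
    # weighted by a Cauchy(0, r) prior truncated at zero (half-Cauchy).
    def integrand(delta):
        prior = 2 * stats.cauchy.pdf(delta, loc=0, scale=r)
        likelihood = stats.nct.pdf(t_obs, df, delta * np.sqrt(n))
        return prior * likelihood
    p_t_hplus, _ = quad(integrand, 0, np.inf)
    return p_t_hplus / p_t_h0

# A large observed t supports H+; a t near zero yields BF+0 below 1.
print(bf_plus_zero(2.5, 20))   # evidence for H+
print(bf_plus_zero(0.1, 20))   # BF+0 < 1: the data lean towards H0

For a two-sample design, the degrees of freedom and the scaling of the non-centrality parameter change (to n1 + n2 - 2 and √(n1·n2/(n1 + n2)), respectively), but the construction is the same.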
If the t-value from our data were to be closer to 0, as exemplified by another hypothetical t-value, t2 (Fig. 2e), the ordinates of the red and blue distributions would be about equally high, indicating that the observed t2 is about equally likely to occur under H0 and H+; hence the predictive performance of H0 and H+ is about equal, the Bayes factor is near 1, and consequently we have absence of evidence. If the t-value were to fall at t3 (Fig. 2e), this value would be 4 times more likely to occur under H0 than under H+; consequently, BF+0 = 1/4, that is, BF0+ = 4, and we may conclude that our data support H0; in other words, we have some evidence of absence.
Thus, the P value of a frequentist approach has two logical states, significant versus not significant, which translate into evidence for H1 ("great, I found the effect") versus a state of suspended disbelief ("I did not find an effect, but it could be because I was unlucky, or because the effect does not exist, or because my sample size was too small"), whereas the BF has three qualitatively different logical states: BF10 > x ("great, I have compelling evidence for the effect"), 1/x < BF10 < x ("oops, my data are not sufficiently diagnostic") and BF10 < 1/x ("great, I have compelling evidence for the absence of the effect"). Here x is the researcher-defined target level of evidence.
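As a minimal illustration of this three-state logic (a sketch only; x = 3 here simply follows the conventional threshold discussed below, not a prescribed default), a helper such as the following maps a BF10 value onto the three qualitative conclusions.

def interpret_bf10(bf10, x=3):
    # x is the researcher-defined target level of evidence (e.g., 3 or 10).
    if bf10 > x:
        return "evidence for H1: the effect is present"
    if bf10 < 1 / x:
        return "evidence for H0: evidence of absence"
    return "data not sufficiently diagnostic: absence of evidence"

print(interpret_bf10(7.2))   # evidence for H1
print(interpret_bf10(0.2))   # evidence of absence
print(interpret_bf10(1.4))   # absence of evidence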
The BF should primarily be seen as a continuous measure of evidence. However, since larger deviations from 1 provide stronger evidence, Jeffreys proposed reference values to guide the interpretation of the strength of the evidence (ref. 9). These values were spaced out in exponential half steps of 10 (10^0.5 ≈ 3, 10^1 = 10, 10^1.5 ≈ 30, etc.), so as to be equidistant on a log scale. He then compared these values with critical values in frequentist t-tests (see Extended Data Fig. 1a for a modern equivalent) and χ2 tests, and declared: "Users of these tests speak of the 5 per cent point [P = 0.05] in much the same way as I should speak of the K = 10^1/2 [i.e., BF10 ≈ 3] point, and of the 1 per cent [P = 0.01] point as I should speak of the K = 10^1 point [i.e., BF10 = 10]; and for moderate numbers of observations the points are not very different" (ref. 9). These reference values remain in use: BF > 3 is considered moderate evidence for the hypothesis in the numerator (i.e., H1 if BF10 > 3), roughly similar to P < 0.05; BF > 10 is considered strong evidence, roughly similar to P < 0.01 (ref. 13). Because BF10 = 1/BF01, this also defines the bounds for evidence for the hypothesis in the denominator: BF < 1/3 is moderate and BF < 1/10 is strong evidence. BF values between 1/3 and 3 indicate that there is insufficient evidence to draw a conclusion for or against either hypothesis. While these guidelines enable us to reach somewhat discrete conclusions, the magnitude of the BF should be considered a continuous quantity, and the strength of the conclusions expressed in the discussion section of a paper should reflect the magnitude of the BF. For new discoveries, Jeffreys suggested that x = 10 is more appropriate than x = 3; however, each scientist and field will need to decide whether to privilege the sensitivity of the test for small samples or effects by using smaller x values such as 3, or to avoid false conclusions by using higher x values such as 10. Regardless, readers can judge the strength of the evidence directly from the numerical value of the BF, with a BF twice as high providing evidence twice as strong. In contrast, it can be difficult to interpret an actual P value as a strength of evidence: P = 0.01 does not provide five times as much evidence as P = 0.05.
Fig. 1 | P value of a t-test and BF+0 as a function of effect size and sample size. a, Each histogram shows the distribution of P values obtained from 1,000 one-tailed one-sample t-tests based on n random numbers drawn from a normal distribution with mean µ and s.d. = 1. To differentiate levels of significance, the first bin was split into multiple bins based on standard critical values. Note how, when there is an effect in the data (i.e., µ > 0, all but the leftmost column), increasing sample size (downwards) or effect size (rightwards) leads to a leftwards shift of the distribution: more evidence for an effect leads to lower P values. In this case, P values < 0.05 are considered hits and are shown in green, while P values > 0.05 are considered misses and shown in red. However, somewhat counterintuitively, the converse does not hold true: in the absence of an effect (µ = 0, leftmost column), increasing sample size does not lead to a rightward shift (increase) of the P values. Instead, the distribution is completely flat, with all P values equally likely (note that the distribution seems to thin out below 0.05, but this is because we subdivided the leftmost bin into several bins to resolve levels of significance). In this case, P < 0.05 represents false alarms, shown in red, and P > 0.05 represents correct rejections, shown in green. P values are thus not a symmetrical instrument: cases with much evidence for H1 (high effect size and sample size) give us quasi-certainty of finding a very low P value, whereas cases with much evidence for H0 (for example, µ = 0 with n = 100) do not make P values close to 1 highly likely; instead, any P value remains as likely as any other. b, Distribution of BF+0 values (using r = √2/2 for the width of the Cauchy prior on effect size) obtained from 1,000 t-tests based on n random numbers drawn from a normal distribution N(µ, 1) with mean µ and s.d. = 1. Each histogram has the same bounds, specified below the graphs, representing conventional limits for moderate and strong evidence. When an effect is absent (μ = 0, leftmost column), evidence of absence (green bars and percentages, BF+0 < 1/3) increases with increasing sample size, and the false alarm rate is well controlled. When an effect is present (μ > 0), evidence for a positive effect (BF+0 > 3, green bars and green percentages) increases with sample size and effect size, and misses (BF+0 < 1/3, red bars and red percentages) are rare (μ = 0.5) or absent (μ = 1.2 or 2). When percentages are not shown, they are 0% (red) or 100% (green). Data can be found at https://osf.io/md9kp/.
Crucially, the three-state system of the Bayes factor allows us to differentiate between evidence of absence and absence of evidence. This represents a fundamental conceptual step forward in the way we interpret data: instead of one outcome (i.e., P < α) that generates knowledge, we now have two (i.e., BF10 > x and BF01 > x).
Box 1 | Bayesian updating

The Bayesian formalism describes how an optimal observer updates beliefs in response to data. In the context of hypothesis testing, at the start, observers entertain a set of two or more rival accounts. In the context of a t-test, they would be called hypotheses H0 and H1; in the case of an ANOVA, they would be called models. Each is specified via parameters we can call θ, for example, the effect size δ in a t-test hypothesis or a regression parameter β in an ANOVA. Prior to looking at the data, the rival accounts have prior probabilities, and the parameter values within each account also have prior probabilities. At the level of the accounts, we may assume them to be equally believable a priori (for example, prior hypothesis probabilities p(H0) = p(H1) = 0.5). At the level of the parameters within each account, they are associated with prior parameter distributions (for example, H0: δ = 0, H1: δ ~ Cauchy; Fig. 2). When data become available, the probabilities are reallocated: accounts and parameters-within-accounts that predict the data relatively well receive a boost in credibility, whereas those that predict the data poorly suffer a decline (ref. 30). Note the similarity to models of reinforcement learning (ref. 31). Mathematically, this updating is done using Bayes' rule, as we describe below separately for parameters and accounts.
Updating parameter estimates

p(θ | data) = p(θ) × p(data | θ) / p(data),

where p(θ | data) represents the posterior beliefs about θ, p(θ) the prior beliefs about θ, and p(data | θ) / p(data) the predictive updating factor. Here the probability of each possible value of θ within an account after seeing the data (i.e., the posterior parameter beliefs) is calculated as the product of the prior probability of that value (i.e., the prior parameter beliefs) times the predictive updating factor. The latter reflects how likely the observed data are according to that particular parameter value, divided by the average predictive performance across all values of θ weighted by their prior probability, i.e., p(data) = ∫ p(data | θ) p(θ) dθ. This posterior parameter belief is the basis for the credible intervals (CIs) that a Bayesian analysis provides for the parameters conditional on a given model.
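To see the parameter-updating rule in action, the following minimal sketch (Python; a toy grid approximation under the simplifying assumption that the standard deviation is known and equal to 1, not the computation JASP performs) reallocates prior beliefs about the effect size δ after seeing the data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(0.4, 1.0, size=30)        # simulated data, true delta = 0.4

delta_grid = np.linspace(-2, 2, 401)        # candidate values of the parameter
prior = stats.cauchy.pdf(delta_grid, loc=0, scale=np.sqrt(2) / 2)
# Likelihood of the observed data for each candidate delta.
likelihood = np.array([stats.norm.pdf(data, loc=d, scale=1.0).prod()
                       for d in delta_grid])
posterior = prior * likelihood              # Bayes' rule, up to a constant
posterior /= np.trapz(posterior, delta_grid)   # divide by p(data) to normalize
print(delta_grid[np.argmax(posterior)])     # posterior mode, close to 0.4

Values of δ that predicted the data well gain credibility relative to the prior; values that predicted the data poorly lose credibility.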
Updating the plausibility of the rival accounts

For two rival accounts of the data (for example, H0 vs H1), Bayes' rule can best be written in the form of odds (ref. 32):

p(H0 | data) / p(H1 | data) = [p(H0) / p(H1)] × [p(data | H0) / p(data | H1)],

where the left-hand side represents the posterior odds for H0 vs H1, the first factor on the right the prior odds for H0 vs H1, and the second factor the predictive updating factor. This equation shows that the change from prior hypothesis odds to posterior hypothesis odds is brought about by the predictive updating factor, commonly known as the Bayes factor (ref. 12).
For instance, assume the rival hypotheses are equally plausible a priori (i.e., p(H0) = p(H1) = 0.5). The prior hypothesis odds are then equal to one. If the predictive updating factor is 10 (i.e., the observed data are 10 times more likely under H0 than under H1), the posterior odds are then also 10. Given that for mutually exclusive hypotheses p(H0) + p(H1) = 1, these odds mean that the data have increased the probability of H0 from 0.5 (the prior hypothesis probability) to 10/11 ≈ 0.91 (the posterior H0 probability).
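The same arithmetic can be written as a one-line conversion from a Bayes factor and prior odds to a posterior probability (a minimal sketch; the function name is ours and not part of any package).

def posterior_prob_h0(bf01, prior_odds=1.0):
    # Bayes' rule in odds form: posterior odds = prior odds x Bayes factor.
    posterior_odds = prior_odds * bf01
    return posterior_odds / (1 + posterior_odds)

print(posterior_prob_h0(10))   # 0.909..., matching the 10/11 example above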
The Bayes factor quantifies the degree to which the data warrant a change in beliefs, and it therefore represents the strength of evidence that the data provide for H0 vs H1. Note that this strength measure is symmetric: evidence may support H0 just as it may support H1; neither of the rival hypotheses enjoys a special status.
For a neuroscientist who wants to know whether or not their manipulation had an effect, the posterior odds might seem like the most obvious metric, as they reflect the plausibility of one hypothesis over another after considering the data. However, these posterior odds depend both on the evidence provided by the data (i.e., the Bayes factor) and on the prior odds. The prior odds capture subjective beliefs before the experiment and introduce an often-undesirable element of subjectivity that could bias the conclusions drawn from the posterior beliefs. Scientists who embrace a certain theoretical standpoint and those who do not might fiercely disagree on these prior odds while agreeing on the evidence, that is, the extent to which the data should change their beliefs. As beliefs are considered less valuable for scientific reporting than evidence, the data-informed Bayes factor is the less controversial and thus the favored metric to report.
There are three broad qualitative categories of Bayes factors. First, the Bayes factor may support H1; second, the Bayes factor may support H0; third, the Bayes factor may be near 1 and support neither of the two rival hypotheses. In the second case we have 'evidence of absence', and in the third case we have 'absence of evidence' (see also ref. 2). More fine-grained classification schemes have been proposed (ref. 16).
To develop an intuition for the continuous strength of evidence that a Bayes factor provides, one may use a probability wheel. Examples are shown in Fig. 3b. To construct the wheel, we have assumed that H0 and H1 are equally likely; the red part of the wheel is then the posterior probability for H1, and the blue part is the complementary probability for H0. Now pretend that the wheel is a pizza, with the red area covered with pepperoni and the blue area covered with mozzarella. Imagine that you poke your finger blindly onto the pizza and that it comes back covered in the non-dominant topping (in this case, pepperoni). How surprised are you? Your level of imagined surprise is an indication of the strength of evidence that a Bayes factor provides. We additionally compare the BF with traditional P values in Extended Data Fig. 1.