Review Article
https://doi.org/10.1038/s41593-020-0660-4

Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence

Christian Keysers 1,2 ✉, Valeria Gazzola 1,2 and Eric-Jan Wagenmakers 2

1 Netherlands Institute for Neuroscience, Royal Netherlands Academy of Arts and Sciences, Amsterdam, The Netherlands. 2 Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands. ✉ e-mail: c.keysers@nin.knaw.nl

Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis, or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect.
Neuroscientists need to know, and publish, whether a manipulation has no effect as much as whether it does. One may use drugs to block a candidate pathway. If the drug has an effect, that pathway is involved; if it does not, one would like to conclude that the pathway is not involved. Or one may alter activity in a brain region X and measure behavior B. If de-activating X changes B, X is involved in B; if B remains unchanged, one would like to conclude that X is not involved in B.
Neuroscience research is characterized by advanced measurement techniques and sophisticated experimental designs, but the data analyses almost always employ the standard framework of frequentist statistics, featuring P value null-hypothesis significance testing (NHST). NHST is arguably appropriate when one wants to quantify evidence against the null hypothesis (H0: there is no effect) and therefore for the presence of an effect (but see ref. 1); however, NHST is problematic when one wants to quantify evidence for the null hypothesis. It is notoriously difficult to establish whether non-significant results support the null hypothesis (i.e., yield evidence for absence) or are simply not informative (i.e., show absence of evidence; refs. 2–4). NHST biases us to emphasize positive effects, because those are the effects it equips us to quantify, and to ignore null findings. If we agree that the absence of an effect is important information, this bias is unacceptable. Here we aim to highlight how an alternative statistical framework, Bayesian inference, can resolve this problem in neuroscience practice.
We will first illustrate why it is problematic to quantify evidence for the null hypothesis based on the dominant frequentist approaches. We will then show how Bayesian statistics provides a way out of this predicament, through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP (ref. 5).
The P value predicament
When we conduct a t-test to compare two conditions A and B, a resulting P value below a critical threshold α shows that one is unlikely to encounter differences this extreme or more extreme if the experimental manipulation had no effect (H0: μA = μB). For a fixed sample size, the smaller the P value, the more evidence we have against H0. Fisher argued that a low P value signals that "either the null hypothesis is false, or an exceptionally rare event has occurred" (ref. 6). But what if we find no significant effect (for example, P = 0.3)? Apart from sampling variability (i.e., 'bad luck'), there are two fundamentally different causal explanations for a non-significant P value: the manipulation had a non-zero effect, but the sample size was too small to detect it (i.e., there was insufficient power); or the manipulation had no effect (i.e., the true effect is zero). When the sample size is small, either explanation is plausible. As the sample size grows, a non-significant P value increasingly suggests that the manipulation did not have an effect (or an effect so small that it is not meaningful). While a power analysis can help disentangle these alternatives, the relationship between sample size, power, P value and evidence for H0 is complex enough that we are rightly reticent to draw strong conclusions from a non-significant P value. This has been famously and elegantly phrased in the antimetabole: 'absence of evidence [read: the data are not informative, the design was underpowered] is not evidence of absence [read: the data provide support in favor of the null]' (ref. 7).
Intuitively, one may believe that if lower P values provide more evidence against H0, higher P values should provide more evidence in favor of H0. We would thus expect that if we simulate truly random data with no effect, high P values should be relatively frequent, especially with large sample sizes. This, however, is not the case. When we draw random samples from two identical distributions (i.e., where H0 is true; Fig. 1a, leftmost column), P < 0.05 is rare (as expected), but all P values are equally likely. As the sample size increases, and we thus intuitively have more evidence that the two distributions have the same mean, high P values do not become more frequent (Fig. 1a, leftmost column, comparing the top and bottom rows). Higher P values are thus not a reliable metric of more evidence for H0.
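To make the flat distribution of P values under H0 concrete, the following minimal sketch (Python with NumPy and SciPy; illustrative only, not the simulation code behind Fig. 1) draws samples with no true effect and tabulates the resulting P values per decile; the counts stay roughly uniform whether n = 10 or n = 100.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (10, 100):
    # 1,000 one-sample t-tests on data with no true effect (mu = 0).
    # A two-sided test is used here for simplicity; under H0 the P values
    # are uniform for one- and two-sided tests alike.
    p_values = [stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
                for _ in range(1000)]
    counts, _ = np.histogram(p_values, bins=10, range=(0, 1))
    print(n, counts / 1000)   # roughly 0.1 in every decile, regardless of n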
Hence, NHST leaves the neuroscientist in a peculiar predicament: significant P values indicate evidence against H0 (but see refs. 1,8), but non-significant P values do not allow us to conclude that the data support H0. This inherent limitation of P values impedes our ability to draw the important conclusion that a manipulation has no effect and hence that a particular molecular pathway or brain circuitry is not involved, or that a particular stimulus dimension does not matter for brain activity.
A Bayesian solution
In contrast to frequentist NHST, which focuses exclusively on the null hypothesis (H0), Bayesian hypothesis testing aims to quantify the relative plausibility of two rival hypotheses, H1 and H0 (Box 1).
Figure 2 shows an example of how evidence is computed, using a Bayesian approach, for the case of a t-test when the question of interest is whether an experimental manipulation has a positive effect. This translates into two rival hypotheses: the manipulation had no effect versus the manipulation increased the dependent variable. Rather than being expressed in raw values specific to a given experiment, the hypotheses are expressed using the population standardized effect size δ (with δ = (μA – μB)/σ). The sceptic's hypothesis, H0: δ = 0, states that the effect is absent, whereas the alternative hypothesis, H+: δ > 0, states that the effect is positive (Fig. 2a). Note that a 'one-tailed' H1 is denoted H+ to indicate the direction of the hypothesized effect. To quantify which hypothesis best predicts the data, we quantify the observed effect size d (d = (mA – mB)/s) in the data and transform it into a t-value, t = d × √n, because the distribution of t-values expected for any δ is well known. Next, we transform the qualitative hypotheses H0 and H+ into quantitative predictions about the probability of encountering every t-value using this t-distribution. This is achieved by assigning prior probability distributions to δ (Fig. 2b) and then computing the probability of each observable t based on these δ-value distributions (Fig. 2c). For the sceptic's H0: δ = 0, the distribution of effect sizes is simply a spike at δ = 0 (red in Fig. 2b), and this makes predictions about the likelihood of each observable t-value using the same distribution that is used in a frequentist t-test with n participants: the Student's t distribution with n – 2 degrees of freedom (red in Fig. 2c). For H+: δ > 0, we need to be specific about the probability of each possible positive δ to become specific about t. The one-tailed nature of our hypothesis is reflected in a truncated distribution, with negative values having zero probability under H+ (ref. 9, p. 283; note that two-tailed hypotheses are usually implemented by means of symmetrical distributions, for example, the dotted line in Fig. 3b). We also know that most neuroscience papers report effect sizes of δ < 1 (ref. 10), with smaller effect sizes being more common than larger effect sizes; this is reflected in a peak for small positive δ and low probability for δ > 1. Indeed, the very fact that we feel the need to perform a test in the first place corresponds to the presumption that the effect size must be fairly small (ref. 9). These considerations about the plausible direction and magnitude of the effect under H+ generate the prior distribution shown in blue in Fig. 2b (see the section "Default priors provide an objective anchor" for guidance on how to define this prior distribution). For each of the hypothesized δ values, we can make predictions about t using the non-central t distribution with non-centrality parameter δ√n. The mixture of these non-central t-distributions associated with each δ, weighted by the prior plausibility of that δ, predicts the probability of each possible t-value under H+ (blue in Fig. 2c).
When the data arrive (Fig. 2d), we first calculate the t-value for our data, which we will call t1, and then see where t1 falls on the t-distribution expected under H0 (red) and under H+ (blue). The traditional frequentist P value corresponds to the area to the right of t1 under the red distribution; note that the predictions from H+, indicated by the blue distribution, are entirely ignored in the frequentist approach. In contrast, for the Bayesian approach, we take the ordinates p(t1 | H0) and p(t1 | H+) and calculate the evidence that the data provide in favor of H+ over H0 as p(t1 | H+) ÷ p(t1 | H0) (Fig. 2e). At that specific t1 value, the ratio equals 4, indicating that our data were predicted four times better by H+ than by H0; we may conclude that our data support H+. The evidence, that is, the relative predictive performance of H0 versus H+, is known as the Bayes factor (refs. 9,11,12; Box 1). We abbreviate it as BF and use subscripts to denote which model is in the numerator versus the denominator; thus, BF+0 = p(t1 | H+) ÷ p(t1 | H0) and BF0+ = p(t1 | H0) ÷ p(t1 | H+).
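For readers who want to reproduce the logic of Fig. 2 numerically, the sketch below (Python with SciPy; an illustration under stated assumptions, not the authors' or JASP's implementation) computes BF+0 for a one-sample or paired t-test as the ratio of the predictive densities of the observed t-value under H+ and H0, using a half-Cauchy prior of width r = √2/2 on δ.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def bf_plus_zero(t_obs, n, r=np.sqrt(2) / 2):
    # BF+0 = p(t_obs | H+) / p(t_obs | H0) for a one-sample/paired design.
    df = n - 1
    # Under H0 (delta = 0), t follows the central Student's t distribution.
    p_t_h0 = stats.t.pdf(t_obs, df)
    # Under H+, mix non-central t densities over positive delta values,
    # weighted by a Cauchy(0, r) prior truncated at zero (half-Cauchy).
    def integrand(delta):
        prior = 2 * stats.cauchy.pdf(delta, loc=0, scale=r)
        likelihood = stats.nct.pdf(t_obs, df, delta * np.sqrt(n))
        return prior * likelihood
    p_t_hplus, _ = quad(integrand, 0, np.inf)
    return p_t_hplus / p_t_h0

# A large observed t supports H+; a t near zero yields BF+0 below 1.
print(bf_plus_zero(2.5, 20))   # evidence for H+
print(bf_plus_zero(0.1, 20))   # BF+0 < 1: the data lean towards H0

For a two-sample design, the degrees of freedom and the scaling of the non-centrality parameter change (to n1 + n2 - 2 and √(n1·n2/(n1 + n2)), respectively), but the construction is the same.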
If the t-value from our data were to be closer to 0, as exemplified by another hypothetical t-value, t2 (Fig. 2e), the ordinates of the red and blue distributions would be about equally high, indicating that the observed t2 is about equally likely to occur under H0 and H+; hence the predictive performance of H0 and H+ is about equal, the Bayes factor is near 1, and consequently we have absence of evidence. If the t-value were to fall at t3 (Fig. 2e), this value would be 4 times more likely to occur under H0 than under H+; consequently, BF+0 = 1/4, that is, BF0+ = 4, and we may conclude that our data support H0; in other words, we have some evidence of absence.
Thus, the P value of a frequentist approach has two logical states, significant versus not significant, which translate into evidence for H1 ("great, I found the effect") versus a state of suspended disbelief ("I did not find an effect, but it could be because I was unlucky, or because the effect does not exist, or because my sample size was too small"), whereas the BF has three qualitatively different logical states: BF10 > x ("great, I have compelling evidence for the effect"), 1/x < BF10 < x ("oops, my data are not sufficiently diagnostic") and BF10 < 1/x ("great, I have compelling evidence for the absence of the effect"). Here x is the researcher-defined target level of evidence.
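As a minimal illustration of this three-state logic (a sketch only; x = 3 here simply follows the conventional threshold discussed below, not a prescribed default), a helper such as the following maps a BF10 value onto the three qualitative conclusions.

def interpret_bf10(bf10, x=3):
    # x is the researcher-defined target level of evidence (e.g., 3 or 10).
    if bf10 > x:
        return "evidence for H1: the effect is present"
    if bf10 < 1 / x:
        return "evidence for H0: evidence of absence"
    return "data not sufficiently diagnostic: absence of evidence"

print(interpret_bf10(7.2))   # evidence for H1
print(interpret_bf10(0.2))   # evidence of absence
print(interpret_bf10(1.4))   # absence of evidence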
The BF should primarily be seen as a continuous measure of evidence. However, since larger deviations from 1 provide stronger evidence, Jeffreys proposed reference values to guide the interpretation of the strength of the evidence (ref. 9). These values were spaced out in exponential half steps of 10 (10^0.5 ≈ 3, 10^1 = 10, 10^1.5 ≈ 30, etc.), so as to be equidistant on a log scale. He then compared these values with critical values in frequentist t-tests (see Extended Data Fig. 1a for a modern equivalent) and χ2 tests, and declared: "Users of these tests speak of the 5 per cent point [P = 0.05] in much the same way as I should speak of the K = 10^1/2 [i.e., BF10 ≈ 3] point, and of the 1 per cent [P = 0.01] point as I should speak of the K = 10^1 point [i.e., BF10 = 10]; and for moderate numbers of observations the points are not very different" (ref. 9). These reference values remain in use: BF > 3 is considered moderate evidence for the hypothesis in the numerator (i.e., H1 if BF10 > 3), roughly similar to P < 0.05; BF > 10 is considered strong evidence, roughly similar to P < 0.01 (ref. 13). Because BF10 = 1/BF01, this also defines the bounds for evidence for the hypothesis in the denominator: BF < 1/3 is moderate and BF < 1/10 is strong evidence. BF values between 1/3 and 3 indicate that there is insufficient evidence to draw a conclusion for or against either hypothesis. While these guidelines enable us to reach somewhat discrete conclusions, the magnitude of the BF should be considered a continuous quantity, and the strength of the conclusions expressed in the discussion section of a paper should reflect the magnitude of the BF. For new discoveries, Jeffreys suggested that x = 10 is more appropriate than x = 3; however, each scientist and field will need to decide whether to privilege the sensitivity of the test for small samples or effects by using smaller x values such as 3, or to avoid false conclusions by using higher x values such as 10. Regardless, readers can judge the strength of the evidence directly from the numerical value of the BF, with a BF twice as high providing evidence twice as strong. In contrast, it can be difficult to interpret an actual P value as a strength of evidence: P = 0.01 does not provide five times as much evidence as P = 0.05.
Fig. 1 | P value of a t-test and BF+0 as a function of effect size and sample size. a, Each histogram shows the distribution of P values obtained from 1,000 one-tailed one-sample t-tests based on n random numbers drawn from a normal distribution with mean µ and s.d. = 1. To differentiate levels of significance, the first bin was split into multiple bins based on standard critical values. Note how, when there is an effect in the data (i.e., µ > 0, all but the leftmost column), increasing sample size (downwards) or effect size (rightwards) leads to a leftwards shift of the distribution: more evidence for an effect leads to lower P values. In this case, P values < 0.05 are considered hits and are shown in green, while P values > 0.05 are considered misses and shown in red. However, somewhat counterintuitively, the converse does not hold true: in the absence of an effect (µ = 0, leftmost column), increasing sample size does not lead to a rightward shift (increase) of the P values. Instead, the distribution is completely flat, with all P values equally likely (note that the distribution seems to thin out below 0.05, but this is because we subdivided the leftmost bin into several bins to resolve levels of significance). In this case, P < 0.05 represents false alarms, shown in red, and P > 0.05 represents correct rejections, shown in green. P values are thus not a symmetrical instrument: cases with much evidence for H1 (high effect size and sample size) give us quasi-certainty of finding a very low P value, whereas cases with much evidence for H0 (for example, µ = 0 with n = 100) do not make P values close to 1 highly likely; instead, any P value remains as likely as any other. b, Distribution of BF+0 values (using r = √2/2 for the width of the Cauchy prior on effect size) obtained from 1,000 t-tests based on n random numbers drawn from a normal distribution N(µ, 1) with mean µ and s.d. = 1. Each histogram has the same bounds, specified below the graphs, representing conventional limits for moderate and strong evidence. When an effect is absent (μ = 0, leftmost column), evidence of absence (green bars and percentages, BF+0 < 1/3) increases with increasing sample size, and the false alarm rate is well controlled. When an effect is present (μ > 0), evidence for a positive effect (BF+0 > 3, green bars and green percentages) increases with sample size and effect size, and misses (BF+0 < 1/3, red bars and red percentages) are rare (μ = 0.5) or absent (μ = 1.2 or 2). When percentages are not shown, they are 0% (red) or 100% (green). Data can be found at https://osf.io/md9kp/.
Crucially, the three-state system of the Bayes factor allows us to differentiate between evidence of absence and absence of evidence. This represents a fundamental conceptual step forward in the way we interpret data: instead of one outcome (i.e., P < α) that generates knowledge, we now have two (i.e., BF10 > x and BF01 > x).
Box 1 | Bayesian updating

The Bayesian formalism describes how an optimal observer updates beliefs in response to data. In the context of hypothesis testing, at the start, observers entertain a set of two or more rival accounts. In the context of a t-test, they would be called hypotheses H0 and H1; in the case of an ANOVA, they would be called models. Each is specified via parameters we can call θ, for example, the effect size δ in a t-test hypothesis or a regression parameter β in an ANOVA. Prior to looking at the data, the rival accounts have prior probabilities, and the parameter values within each account also have prior probabilities. At the level of the accounts, we may assume them to be equally believable a priori (for example, prior hypothesis probabilities p(H0) = p(H1) = 0.5). At the level of the parameters within each account, they are associated with prior parameter distributions (for example, H0: δ = 0, H1: δ ~ Cauchy; Fig. 2). When data become available, the probabilities are reallocated: accounts and parameters-within-accounts that predict the data relatively well receive a boost in credibility, whereas those that predict the data poorly suffer a decline (ref. 30). Note the similarity to models of reinforcement learning (ref. 31). Mathematically, this updating is done using Bayes' rule, as we describe below separately for parameters and accounts.
Updating parameter estimates

p(θ | data) = p(θ) × p(data | θ) / p(data),

where p(θ | data) represents the posterior beliefs about θ, p(θ) the prior beliefs about θ, and p(data | θ) / p(data) the predictive updating factor. Here the probability of each possible value of θ within an account after seeing the data (i.e., the posterior parameter beliefs) is calculated as the product of the prior probability of that value (i.e., the prior parameter beliefs) times the predictive updating factor. The latter reflects how likely the observed data are according to that particular parameter value, divided by the average predictive performance across all values of θ weighted by their prior probability, i.e., p(data) = ∫ p(data | θ) p(θ) dθ. This posterior parameter belief is the basis for the credible intervals (CIs) that a Bayesian analysis provides for the parameters conditional on a given model.
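To see the parameter-updating rule in action, the following minimal sketch (Python; a toy grid approximation under the simplifying assumption that the standard deviation is known and equal to 1, not the computation JASP performs) reallocates prior beliefs about the effect size δ after seeing the data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(0.4, 1.0, size=30)        # simulated data, true delta = 0.4

delta_grid = np.linspace(-2, 2, 401)        # candidate values of the parameter
prior = stats.cauchy.pdf(delta_grid, loc=0, scale=np.sqrt(2) / 2)
# Likelihood of the observed data for each candidate delta.
likelihood = np.array([stats.norm.pdf(data, loc=d, scale=1.0).prod()
                       for d in delta_grid])
posterior = prior * likelihood              # Bayes' rule, up to a constant
posterior /= np.trapz(posterior, delta_grid)   # divide by p(data) to normalize
print(delta_grid[np.argmax(posterior)])     # posterior mode, close to 0.4

Values of δ that predicted the data well gain credibility relative to the prior; values that predicted the data poorly lose credibility.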
Updating the plausibility of the rival accounts

For two rival accounts of the data (for example, H0 vs H1), Bayes' rule can best be written in the form of odds (ref. 32):

p(H0 | data) / p(H1 | data) = [p(H0) / p(H1)] × [p(data | H0) / p(data | H1)],

where the left-hand side represents the posterior odds for H0 vs H1, the first factor on the right the prior odds for H0 vs H1, and the second factor the predictive updating factor. This equation shows that the change from prior hypothesis odds to posterior hypothesis odds is brought about by the predictive updating factor, commonly known as the Bayes factor (ref. 12).
For instance, assume the rival hypotheses are equally plausible a priori (i.e., p(H0) = p(H1) = 0.5). The prior hypothesis odds are then equal to one. If the predictive updating factor is 10 (i.e., the observed data are 10 times more likely under H0 than under H1), the posterior odds are then also 10. Given that for mutually exclusive hypotheses p(H0) + p(H1) = 1, these odds mean that the data have increased the probability of H0 from 0.5 (the prior hypothesis probability) to 10/11 ≈ 0.91 (the posterior H0 probability).
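The same arithmetic can be written as a one-line conversion from a Bayes factor and prior odds to a posterior probability (a minimal sketch; the function name is ours and not part of any package).

def posterior_prob_h0(bf01, prior_odds=1.0):
    # Bayes' rule in odds form: posterior odds = prior odds x Bayes factor.
    posterior_odds = prior_odds * bf01
    return posterior_odds / (1 + posterior_odds)

print(posterior_prob_h0(10))   # 0.909..., matching the 10/11 example above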
The Bayes factor quantifies the degree to which the data warrant a change in beliefs, and it therefore represents the strength of evidence that the data provide for H0 vs H1. Note that this strength measure is symmetric: evidence may support H0 just as it may support H1; neither of the rival hypotheses enjoys a special status.
For a neuroscientist who wants to know whether or not their manipulation had an effect, the posterior odds might seem like the most obvious metric, as they reflect the plausibility of one hypothesis over another after considering the data. However, these posterior odds depend both on the evidence provided by the data (i.e., the Bayes factor) and on the prior odds. The prior odds capture subjective beliefs before the experiment and introduce an often-undesirable element of subjectivity that could bias the conclusions drawn from the posterior beliefs. Scientists who embrace a certain theoretical standpoint and those who do not might fiercely disagree on these prior odds while agreeing on the evidence, that is, the extent to which the data should change their beliefs. As beliefs are considered less valuable for scientific reporting than evidence, the data-informed Bayes factor is the less controversial and thus the favored metric to report.
There are three broad qualitative categories of Bayes factors. First, the Bayes factor may support H1; second, the Bayes factor may support H0; third, the Bayes factor may be near 1 and support neither of the two rival hypotheses. In the second case we have 'evidence of absence', and in the third case we have 'absence of evidence' (see also ref. 2). More fine-grained classification schemes have been proposed (ref. 16).
To develop an intuition for the continuous strength of evidence that a Bayes factor provides, one may use a probability wheel. Examples are shown in Fig. 3b. To construct the wheel, we have assumed that H0 and H1 are equally likely; the red part of the wheel is then the posterior probability for H1, and the blue part is the complementary probability for H0. Now pretend that the wheel is a pizza, with the red area covered with pepperoni and the blue area covered with mozzarella. Imagine that you poke your finger blindly onto the pizza and that it comes back covered in the non-dominant topping (in this case, pepperoni). How surprised are you? Your level of imagined surprise is an indication of the strength of evidence that a Bayes factor provides. We additionally compare the BF with traditional P values in Extended Data Fig. 1.