Journal ArticleDOI

Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence

29 Jun 2020-Nature Neuroscience (Nature Publishing Group)-Vol. 23, Iss: 7, pp 788-799
TL;DR: Shows why P values do not differentiate inconclusive null findings from those that provide important evidence for the absence of an effect, and provides a tutorial on how to use Bayesian hypothesis testing to overcome this issue.
Abstract: Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect.

Summary (2 min read)

Introduction

  • One may use drugs to block a candidate pathway.
  • NHST is arguably appropriate when one wants to quantify evidence against the null hypothesis (H0: there is no effect) and therefore for the presence of an effect (but see ref. 1); however, NHST is problematic when one wants to quantify evidence for the null hypothesis.
  • Here the authors aim to highlight how an alternative statistical framework—Bayesian inference—can resolve this problem in neuroscience practice.

The P value predicament

  • For a fixed sample size, the smaller the P, the more evidence the authors have against H0.
  • When sample size is small, either explanation is plausible.
  • When the authors draw random samples from two identical distributions (i.e., where H0 is true; Fig. 1a leftmost column), P < 0.05 is rare (as expected), but all P values are equally likely.
  • Higher P values are thus not a reliable metric for more evidence for H0.
  • Hence, NHST leaves the neuroscientist in a peculiar predicament: significant P values indicate evidence against H0 (but see refs. 1,8), but non-significant P values do not allow us to conclude that the data support H0.

Example of an ANOVA

  • The authors can also examine whether muscimol had a greater effect on ShockObs than on CS by assessing evidence for an interaction between group (saline vs muscimol) and condition (ShockObs vs CS)17,18.
  • When selecting the default option ‘across all models’, for each component, the BFincl (last column) is calculated as p(models with that factor | data) ÷ p(models without that factor | data) (a small numerical sketch follows this list).
  • Had the authors found a BFincl < 1/3, they would have had evidence of absence: that muscimol has the same effect on ShockObs and CS.
  • The authors recommend using these default parameter priors to increase the objectivity of the analyses and to provide a common frame of reference that ensures the direct comparability of Bayes factors from different experiments.
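
The inclusion Bayes factor just described can be illustrated in a few lines of Python. This is a toy sketch with made-up posterior model probabilities, not numbers from the paper; note that JASP's BFincl is usually described as the change from prior to posterior inclusion odds, so the sketch computes that change, which equals the posterior-odds ratio quoted above divided by the prior inclusion odds.

# Toy sketch of an 'across all models' inclusion Bayes factor for the
# group-by-condition interaction. Posterior model probabilities are invented
# for illustration only.
posterior = {
    "null": 0.05,
    "group": 0.10,
    "condition": 0.15,
    "group + condition": 0.30,
    "group + condition + group:condition": 0.40,
}
prior = {model: 1.0 / len(posterior) for model in posterior}  # equal prior model probabilities

def inclusion_bf(term, post, prior):
    """Change from prior to posterior inclusion odds for one model term."""
    post_odds = (sum(p for m, p in post.items() if term in m)
                 / sum(p for m, p in post.items() if term not in m))
    prior_odds = (sum(p for m, p in prior.items() if term in m)
                  / sum(p for m, p in prior.items() if term not in m))
    return post_odds / prior_odds

print(inclusion_bf("group:condition", posterior, prior))  # a value < 1/3 would be evidence of absence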

5. Bayesian inference allows researchers to monitor the results

  • As illustrated in Box 1 and Supplementary Fig. 1, the Bayesian predict–update cycle of learning continues indefinitely.
  • In an experimental setting, neuroscientists may decide to terminate data collection when the result is deemed compelling or when they have run out of time, money or patience8,36.
  • This means that experiments can be flexibly shortened or lengthened according to the evidence that has already been collected (a sketch of such monitoring follows this list).
  • If error control guarantees are put in place, such flexibility can reduce the required sample size by as much as 50%34,37.
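
As an illustration of such monitoring, the sketch below recomputes a default Bayes factor as simulated observations accumulate and stops once the evidence is deemed compelling. It is a minimal example, not the authors' JASP workflow; it assumes the third-party pingouin package for the default (JZS) Bayes factor, and in practice the stopping threshold would be fixed before data collection.

import numpy as np
from scipy import stats
import pingouin as pg  # assumed available; provides a default (JZS) Bayes factor

rng = np.random.default_rng(7)
data = rng.normal(0.4, 1.0, 100)  # hypothetical study with a true effect of d = 0.4

# Re-assess the evidence every 5 observations and stop when BF10 > 3 or BF10 < 1/3.
for n in range(10, len(data) + 1, 5):
    t = stats.ttest_1samp(data[:n], 0.0).statistic
    bf10 = float(pg.bayesfactor_ttest(t, n))
    if bf10 > 3 or bf10 < 1 / 3:
        print(f"stopping at n = {n}: BF10 = {bf10:.2f}")
        break
else:
    print("no compelling evidence after 100 observations")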

6. Bayes factor hypothesis testing allows researchers to include prior knowledge for a more diagnostic test

  • These distributions can be adjusted in light of relevant background information.
  • This is because Bayesian statistics can provide evidence for H0 and H1, whereas NHST can only provide evidence against H0.
  • For n > 20, the BF shows a mild upwards trend, and extending this trend shows that hundreds of animals would probably have to be added for the analysis to provide evidence for the presence of an effect (BF+0 > 3).

Concluding comments

  • Bayesian inference offers unique practical advantages for neuroscience (Box 2).
  • There is a bias toward publishing significant results, and the authors have become increasingly aware of the negative impact that the resulting P value hacking has on the progress and replicability of science.
  • Neuroscientists have been slow to take up Bayesian statistics, presumably out of a perception that Bayesian hypothesis testing is difficult to perform and interpret.
  • Supplementing frequentist approaches with Bayesian analyses will lead to richer data interpretations that allow more informative conclusions.

Data availability

  • All data and code can be downloaded at https://osf.io/md9kp/.

Competing interests

  • E.J.W. declares that he coordinates the development of the open-source software package JASP (https://jasp-stats.org), a non-commercial, publicly-funded effort to make Bayesian statistics accessible to a broader group of researchers and students.
  • The two black dashed lines mark BF+0 = 1, i.e., the line of no evidence, and BF+0 = 1/3, the bound for moderate evidence of absence.
  • The bottom row compares the predicted probability of finding particular t-values under the two models, and shows how values close to zero (i.e., small or no difference between the groups) are predicted more often by the Null Model than by the Group Model, while the opposite is true for large t-values.




Review Article
https://doi.org/10.1038/s41593-020-0660-4
Christian Keysers 1,2 ✉, Valeria Gazzola 1,2 and Eric-Jan Wagenmakers 2
1 Netherlands Institute for Neuroscience, Royal Netherlands Academy of Arts and Sciences, Amsterdam, The Netherlands. 2 Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands. ✉ e-mail: c.keysers@nin.knaw.nl
Neuroscientists would need to know and publish whether a manipulation does not have an effect as much as whether it does. One may use drugs to block a candidate pathway. If the drug has an effect, that pathway is involved; if it doesn't, one would like to conclude the pathway is not involved. Or one may alter activity in a brain region X and measure behavior B. If de-activating X changes B, X is involved in B; if B remains unchanged, one would like to conclude that X is not involved in B.

Neuroscience research is characterized by advanced measurement techniques and sophisticated experimental designs, but the data analyses almost always employ the standard framework of frequentist statistics, featuring P value null-hypothesis significance testing (NHST). NHST is arguably appropriate when one wants to quantify evidence against the null hypothesis (H0: there is no effect) and therefore for the presence of an effect (but see ref. 1); however, NHST is problematic when one wants to quantify evidence for the null hypothesis. It is notoriously difficult to establish whether non-significant results support the null hypothesis (i.e., yield evidence for absence) or are simply not informative (i.e., show absence of evidence2–4). NHST biases us to emphasize positive effects, because those are the effects it equips us to quantify, and to ignore null findings. If we agree that the absence of an effect is important information, isn't this bias unacceptable? Here we aim to highlight how an alternative statistical framework, Bayesian inference, can resolve this problem in neuroscience practice.

We will first illustrate why it is problematic to quantify evidence for the null hypothesis based on the dominant frequentist approaches. We will then show how Bayesian statistics provides a way out of this predicament through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP5.
The P value predicament
When we conduct a t-test to compare two conditions A and B, a resulting P value below a critical threshold α shows that one is unlikely to encounter differences this extreme or more if the experimental manipulation had no effect (H0: μA = μB). For a fixed sample size, the smaller the P, the more evidence we have against H0. Fisher argued that a low P value signals that "either the null hypothesis is false, or an exceptionally rare event has occurred."6 But what if we find no significant effect (for example, P = 0.3)? Apart from sampling variability (i.e., 'bad luck'), there are two fundamentally different causal explanations for a non-significant P value: the manipulation had a non-zero effect, but the sample size was too small to detect it (i.e., there was insufficient power); or the manipulation had no effect (i.e., the true effect is zero). When sample size is small, either explanation is plausible. As sample size grows, a non-significant P value increasingly suggests the manipulation did not have an effect (or an effect so small it is not meaningful). While a power analysis can help disentangle these alternatives, the relationship between sample size, power, P value and evidence for H0 is complex enough that we are rightly reticent to draw strong conclusions from a non-significant P value. This has been famously and elegantly phrased in the antimetabole: 'absence of evidence [read: the data are not informative, the design was underpowered] is not evidence of absence [read: the data provide support in favor of the null]'7.

Intuitively, one may believe that if lower P values provide more evidence against H0, higher P values should provide more evidence in favor of H0. We would thus expect that if we simulate truly random data with no effect, high P values should be relatively frequent, especially with large sample sizes. This, however, is not the case. When we draw random samples from two identical distributions (i.e., where H0 is true; Fig. 1a, leftmost column), P < 0.05 is rare (as expected), but all P values are equally likely. As sample size increases, and we thus intuitively have more evidence that the two distributions have the same mean, high P values do not become more frequent (Fig. 1a, leftmost column, comparing top and bottom rows). Higher P values are thus not a reliable metric for more evidence for H0.

Hence, NHST leaves the neuroscientist in a peculiar predicament: significant P values indicate evidence against H0 (but see refs. 1,8), but non-significant P values do not allow us to conclude that the data support H0. This inherent limitation of P values impedes our ability to draw the important conclusion that a manipulation has no effect and hence that a particular molecular pathway or brain circuitry is not involved or that a particular stimulus dimension does not matter for brain activity.
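
The flat distribution of P values under H0 is easy to verify. The following minimal simulation (our sketch, not the authors' code) mirrors the setup of Fig. 1a with one-tailed one-sample t-tests on draws from a normal distribution; it needs SciPy 1.6 or later for the alternative argument.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim = 1000

# One-tailed one-sample t-tests on n draws from N(mu, 1), as in Fig. 1a.
# Under H0 (mu = 0) the proportion of P values above any cut-off stays constant
# no matter how large n becomes; only under an effect (mu > 0) do P values shrink.
for mu in (0.0, 0.5):
    for n in (5, 100):
        pvals = np.array([
            stats.ttest_1samp(rng.normal(mu, 1.0, n), 0.0, alternative="greater").pvalue
            for _ in range(n_sim)
        ])
        print(f"mu = {mu}, n = {n}: "
              f"{np.mean(pvals < 0.05):.2f} below 0.05, {np.mean(pvals > 0.5):.2f} above 0.5")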

A Bayesian solution
In contrast to frequentist NHST, which focuses exclusively on the null hypothesis (H0), Bayesian hypothesis testing aims to quantify the relative plausibility of alternative hypotheses H1 and H0 (Box 1).
Figure 2 shows an example of how evidence is computed, using a Bayesian approach, for the case of a t-test when the question of interest is whether an experimental manipulation has a positive effect. This translates into two rival hypotheses: the manipulation had no effect versus the manipulation increased the dependent variable. Rather than expressing hypotheses in raw values specific to a given experiment, they are expressed using the population standardized effect size δ (with δ = (μA − μB)/σ). The sceptic's hypothesis, H0: δ = 0, states that the effect is absent, whereas the alternative hypothesis, H+: δ > 0, states that the effect is positive (Fig. 2a). Note that a 'one-tailed' H1 is denoted as H+ to indicate the direction of the hypothesized effect. To quantify which hypothesis best predicts the data, we quantify the observed effect size d (d = (mA − mB)/s) in the data and transform it into a t-value (t = d × √n), because the distribution of t-values expected for any δ is well known. Next, we transform the qualitative hypotheses H0 and H+ into quantitative predictions about the probability of encountering every t-value using this t-distribution. This is achieved by assigning prior probability distributions to δ (Fig. 2b), and then computing the probability of each observable t based on these δ-value distributions (Fig. 2c). For the sceptic's H0: δ = 0, the distribution of effect sizes is simply a spike at δ = 0 (red in Fig. 2b), and this makes predictions about the likelihood of each observable t-value using the same distribution that is used in a frequentist t-test with n participants: the Student's t distribution with n − 2 degrees of freedom (red in Fig. 2c). For H+: δ > 0, we need to be specific about the probability of each possible positive δ to become specific about t. The one-tailed nature of our hypothesis is reflected in a truncated distribution, with negative values having zero probability under H+ (ref. 9, p. 283; note that two-tailed hypotheses are usually implemented by means of symmetrical distributions, for example, the dotted line in Fig. 3b). We also know that most neuroscience papers report effect sizes of δ < 1 (ref. 10), with smaller effect sizes being more common than larger effect sizes; this is reflected in a peak for small positive δ and low probability for δ > 1. Indeed, that we feel that we need to perform a test in the first place corresponds to this presumption that the effect size must be fairly small9. These considerations about the plausible direction and magnitudes of the effect under H+ generate the prior distribution shown in blue in Fig. 2b (see section "Default priors provide an objective anchor" for guidance on how to define this prior distribution). For each of the hypothesized δ values, we can make predictions about t using the non-central t distribution with μ = δ. The mixture of these non-central t-distributions associated with each δ, weighted by the prior plausibility of that δ, predicts the probability of each possible t-value under H+ (blue in Fig. 2c). When the data arrive (Fig. 2d), we first calculate the t-value for our data, which we will call t1, and then see where t1 falls on the t-distribution expected under H0 (red) and under H+ (blue). The traditional frequentist P value corresponds to the area to the right of t1 on the red distribution; note that the predictions from H+, indicated by the blue distribution, are entirely ignored in the frequentist approach. In contrast, for the Bayesian approach, we take the ordinates p(t1 | H0) and p(t1 | H+) and calculate the evidence that the data provide in favor of H+ over H0 as p(t1 | H+) ÷ p(t1 | H0) (Fig. 2e). At that specific t1 value, the ratio equals 4, indicating that our data was predicted four times better by H+ than H0; we may conclude that our data supports H+. The evidence (the relative predictive performance of H0 versus H+) is known as the Bayes factor9,11,12 (Box 1). We abbreviate it as BF and use subscripts to denote which model is in the numerator versus the denominator; thus, BF+0 = p(t1 | H+) ÷ p(t1 | H0) and BF0+ = p(t1 | H0) ÷ p(t1 | H+).
If the t-value from our data were to be closer to 0, as exemplified by another hypothetical t-value, t2 (Fig. 2e), the ordinates of the red and blue distributions would be about equally high, indicating that the observed t2 is about equally likely to occur under H0 and H+; hence the predictive performance of H0 and H+ is about equal, the Bayes factor is near 1, and consequently we have absence of evidence. If the t-value were to fall at t3 (Fig. 2e), this value would be 4 times more likely to occur under H0 than under H+; consequently, BF+0 = 1/4, that is, BF0+ = 4, and we may conclude that our data support H0; in other words, we have some evidence of absence.
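
The calculation sketched in Fig. 2 can be reproduced numerically. The code below is our own illustration for a one-sample design (df = n − 1, non-centrality δ√n), not the authors' implementation: p(t | H0) is the central t density and p(t | H+) is a mixture of non-central t densities weighted by a half-Cauchy prior on δ, using the width r = √2/2 that is also used for Fig. 1b.

import numpy as np
from scipy import stats, integrate

def bf_plus0(t_obs, n, r=np.sqrt(2) / 2):
    """BF+0 = p(t | H+) / p(t | H0) for a one-sample design with n observations."""
    df = n - 1
    p_h0 = stats.t.pdf(t_obs, df)  # H0: delta = 0, central Student's t

    def p_t_given_delta(delta):
        # Non-central t density at t_obs, weighted by a half-Cauchy(0, r) prior on delta
        prior = 2.0 * stats.cauchy.pdf(delta, loc=0.0, scale=r)
        return stats.nct.pdf(t_obs, df, delta * np.sqrt(n)) * prior

    p_hplus, _ = integrate.quad(p_t_given_delta, 0.0, np.inf)
    return p_hplus / p_h0

print(bf_plus0(2.5, 20))  # BF+0 > 1: the data favor H+
print(bf_plus0(0.0, 20))  # BF+0 < 1: the data favor H0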
Thus, the P value of a frequentist approach has two logical states, significant versus not significant, which translate into evidence for H1 ("great, I found the effect") versus a state of suspended disbelief ("I did not find an effect, but it could be because I was unlucky, or because the effect does not exist, or because my sample size was too small"), whereas the BF has three qualitatively different logical states: BF10 > x ("great, I have compelling evidence for the effect"), 1/x < BF10 < x ("oops, my data are not sufficiently diagnostic") and BF10 < 1/x ("great, I have compelling evidence for the absence of the effect"). Here x is the researcher-defined target level of evidence. The BF should primarily be seen as a continuous measure of evidence. However, since larger deviations from 1 provide stronger evidence, Jeffreys proposed reference values to guide the interpretation of the strength of the evidence9. These values were spaced out in exponential half steps of 10 (10^0.5 ≈ 3, 10^1 = 10, 10^1.5 ≈ 30, etc.) to be equidistant on a log scale. He then compared these values with critical values in frequentist t-tests (see Extended Data Fig. 1a for a modern equivalent) and χ2 tests, and declared, "Users of these tests speak of the 5 per cent point [P = 0.05] in much the same way as I should speak of the K = 10^−1/2 [i.e., BF10 = 3] point, and of the 1 per cent [P = 0.01] point as I should speak of the K = 10^−1 point [i.e., BF10 = 10]; and for moderate numbers of observations the points are not very different."9
Fig. 1 | P value of a t-test and BF+0 as a function of effect size and sample size. a, Each histogram shows the distribution of P values obtained from 1,000 one-tailed one-sample t-tests based on n random numbers drawn from a normal distribution with mean µ and s.d. = 1. To differentiate levels of significance, the first bin was split into multiple bins based on standard critical values. Note how, when there is an effect in the data (i.e., µ > 0, all but the leftmost column), increasing sample size (downwards) or effect size (rightwards) leads to a leftwards shift of the distribution: more evidence for an effect leads to lower P values. In this case, P values <0.05 are considered hits and are shown in green, while P values >0.05 are considered misses and shown in red. However, somewhat counterintuitively, the converse does not hold true: in the absence of an effect (µ = 0, leftmost column), increasing sample size does not lead to a rightward shift (increase) of the P values. Instead the distribution is completely flat, with all P values equally likely (note that the distribution seems to thin out below 0.05, but this is because we subdivided the leftmost bin into several bins to resolve levels of significance). In this case, P < 0.05 represents false alarms, shown in red, and P > 0.05 represents correct rejections, shown in green. P values are thus not a symmetrical instrument: cases with much evidence for H1 (high effect size and sample size) give us quasi-certainty to find a very low P value, whereas cases with much evidence for H0 (for example, µ = 0 with n = 100) do not make P values close to 1 highly likely; instead, any P value remains as likely as any other. b, Distribution of BF+0 values (using r = √2/2 as the Cauchy width of the effect size prior) obtained from 1,000 t-tests based on n random numbers drawn from a normal distribution with mean µ and s.d. = 1. Each histogram has the same bounds specified below the graphs, representing conventional limits for moderate and strong evidence. When an effect is absent (μ = 0, leftmost column), evidence of absence (green bars and percentages, BF+0 < 1/3) increases with increasing sample size and the false alarm rate is well controlled. When an effect is present (μ > 0), evidence for a positive effect (BF+0 > 3, green bars and green percentages) increases with sample size and effect size, and misses (BF+0 < 1/3, red bars and red percentages) are rare (μ = 0.5) or absent (μ = 1.2 or 2). When percentages are not shown, they are 0% (red) or 100% (green). Data can be found at https://osf.io/md9kp/.
These reference values remain in use: BF > 3 is considered moderate evidence for the hypothesis in the numerator (i.e., H1 if BF10 > 3), roughly similar to P < 0.05; BF > 10 is considered strong evidence, roughly similar to P < 0.01 (ref. 13).
Because BF10 = 1/BF01, this also defines the bounds for evidence for the hypothesis in the denominator: BF < 1/3 is moderate and BF < 1/10 is strong evidence. BF values between 1/3 and 3 indicate that there is insufficient evidence to draw a conclusion for or against either hypothesis. While these guidelines enable us to reach somewhat discrete conclusions, the magnitude of the BF should be considered as a continuous quantity, and the strength of the conclusions expressed in the discussion section of a paper should reflect the magnitude of the BF. For new discoveries, Jeffreys suggested that x = 10 is more appropriate than x = 3; however, each scientist and field will need to decide whether to privilege the sensitivity of the test for small samples or effects by using smaller x values such as 3, or to avoid false conclusions by using higher x values such as 10. Regardless, readers can judge the strength of the evidence directly from the numerical value of BF, with a BF twice as high providing evidence twice as strong. In contrast, it can be difficult to interpret an actual P value as strength of evidence, as P = 0.01 does not provide five times as much evidence as P = 0.05.

Crucially, the three-state system of the Bayes factor allows us to differentiate between evidence of absence and absence of evidence. This represents a fundamental conceptual step forward in the way we interpret data: instead of one outcome (i.e., P < α) that generates knowledge, we now have two (i.e., BF10 > x and BF01 > x).
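
The three logical states translate directly into a decision rule; a trivial sketch, with the researcher-defined threshold x as an argument:

def interpret_bf10(bf10, x=3.0):
    """Map a Bayes factor BF10 onto the three qualitative states described above."""
    if bf10 > x:
        return "evidence for H1 (presence of an effect)"
    if bf10 < 1.0 / x:
        return "evidence for H0 (absence of an effect)"
    return "not sufficiently diagnostic (absence of evidence)"

print(interpret_bf10(4.0))   # evidence for H1
print(interpret_bf10(0.25))  # evidence for H0
print(interpret_bf10(1.2))   # absence of evidence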
Box 1 | Bayesian updating

The Bayesian formalism describes how an optimal observer updates beliefs in response to data. In the context of hypothesis testing, at the start, observers entertain a set of two or more rival accounts. In the context of a t-test, they would be called hypotheses H0 and H1; in the case of an ANOVA, they would be called models. Each is specified via parameters we can call θ, for example, the effect size δ in a t-test hypothesis or a regression parameter β in an ANOVA. Prior to looking at the data, the rival accounts have prior probabilities, and the parameter values within each account also have prior probabilities. At the level of the accounts, we may assume them to be equally believable a priori (for example, prior hypothesis probabilities p(H0) = p(H1) = 0.5). At the level of the parameters within each account, they are associated with prior parameter distributions (for example, H0: δ = 0, H1: δ ~ Cauchy; Fig. 2). When data become available, the probabilities are reallocated: accounts and parameters-within-accounts that predict the data relatively well receive a boost in credibility, whereas those that predict the data poorly suffer a decline30. Note the similarity to models of reinforcement learning31. Mathematically, this updating is done using Bayes' rule, as we describe below separately for parameters and accounts.

Updating parameter estimates

p(θ | data) = p(θ) × p(data | θ) / p(data)

where p(θ | data) denotes the posterior beliefs about θ, p(θ) the prior beliefs about θ, and p(data | θ) / p(data) the predictive updating factor. Here the probability of each possible value of θ within an account after seeing the data (i.e., the posterior parameter beliefs) is calculated as the product of the prior probability of that value (i.e., the parameter prior beliefs) times the predictive updating factor. The latter reflects how likely the observed data is according to that particular parameter value, divided by the average predictive performance across all values of θ weighted by their prior probability, i.e., p(data) = ∫ p(data | θ) p(θ) dθ. This posterior parameter belief is the basis for the credible intervals (CI) that the Bayesian analysis provides for the parameters conditional on a given model.

Updating the plausibility of the rival accounts

For two rival accounts of the data (for example, H0 vs H1), Bayes' rule can best be written in the form of odds32:

p(H0 | data) / p(H1 | data) = [p(H0) / p(H1)] × [p(data | H0) / p(data | H1)]

that is, the posterior odds for H0 vs H1 equal the prior odds for H0 vs H1 times the predictive updating factor. This equation shows that the change from prior hypothesis odds to posterior hypothesis odds is brought about by the predictive updating factor, commonly known as the Bayes factor12. For instance, assume the rival hypotheses are equally plausible a priori (i.e., p(H0) = p(H1) = 0.5). The prior hypothesis odds are then equal to one. If the predictive updating factor is 10 (i.e., the observed data is 10 times more likely under H0 than under H1), this means that the posterior odds are then also 10. Given that for mutually exclusive hypotheses p(H0) + p(H1) = 1, these odds mean that the data have increased the probability of H0 from 0.5 (the prior hypothesis probability) to 10/11 ≈ 0.91 (the posterior H0 probability).

The Bayes factor quantifies the degree to which the data warrant a change in beliefs, and it therefore represents the strength of evidence that the data provide for H0 vs H1. Note that this strength measure is symmetric: evidence may support H0 just as it may support H1; neither of the rival hypotheses enjoys a special status. For a neuroscientist who wants to know whether or not their manipulation had an effect, the posterior odds might seem like the most obvious metric, as they reflect the plausibility of one hypothesis over another after considering the data. However, these posterior odds depend both on the evidence provided by the data (i.e., the Bayes factor) and on the prior odds. The prior odds capture subjective beliefs before the experiment and introduce an often-undesirable element of subjectivity that could bias the conclusions drawn from the posterior beliefs. Scientists who embrace a certain theoretical standpoint and those who do not might fiercely disagree on these prior odds while agreeing on the evidence, that is, the extent to which the data should change their beliefs. As beliefs are considered less valuable for scientific reporting than evidence, the data-informed Bayes factor is the less controversial and thus favored metric to report.

There are three broad qualitative categories of Bayes factors. First, the Bayes factor may support H1; second, the Bayes factor may support H0; third, the Bayes factor may be near 1 and support neither of the two rival hypotheses. In the second case we have 'evidence of absence', and in the third case we have 'absence of evidence' (see also ref. 2). More fine-grained classification schemes have been proposed16.

To develop an intuition for the continuous strength of evidence that a Bayes factor provides, one may use a probability wheel. Examples are shown in Fig. 3b. To construct the wheel, we have assumed that H0 and H1 are equally likely; the red part in the wheel is then the posterior probability for H1, and the blue part is the complementary probability for H0. Now pretend that the wheel is a pizza, with the red area covered with pepperoni and the blue area covered with mozzarella. Imagine that you poke your finger blindly onto the pizza and that it comes back covered in the non-dominant topping (in this case, pepperoni). How surprised are you? Your level of imagined surprise is an indication of the strength of evidence that a Bayes factor provides. We additionally compare the BF with traditional P values in Extended Data Fig. 1.
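
The odds updating in Box 1 is a one-line calculation; the sketch below reproduces the 10/11 ≈ 0.91 example under equal prior probabilities (again an illustration, not code from the paper).

def posterior_p_h0(bf01, prior_p_h0=0.5):
    """Posterior probability of H0 given BF01 = p(data | H0) / p(data | H1)."""
    prior_odds = prior_p_h0 / (1.0 - prior_p_h0)
    posterior_odds = prior_odds * bf01
    return posterior_odds / (1.0 + posterior_odds)

print(posterior_p_h0(10.0))  # 0.909..., i.e. 10/11, as in Box 1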

Citations
More filters
Journal ArticleDOI
TL;DR: These guidelines are geared towards analyses performed with the open-source statistical software JASP, and most guidelines extend to Bayesian inference in general.
Abstract: Despite the increasing popularity of Bayesian inference in empirical research, few practical guidelines provide detailed recommendations for how to apply Bayesian procedures and interpret the results. Here we offer specific guidelines for four different stages of Bayesian statistical reasoning in a research setting: planning the analysis, executing the analysis, interpreting the results, and reporting the results. The guidelines for each stage are illustrated with a running example. Although the guidelines are geared towards analyses performed with the open-source statistical software JASP, most guidelines extend to Bayesian inference in general.

378 citations

14 Sep 2018
TL;DR: These guidelines are geared toward analyses performed with the open-source statistical software JASP, and most guidelines extend to Bayesian inference in general.
Abstract: Despite the increasing popularity of Bayesian inference in empirical research, few practical guidelines provide detailed recommendations for how to apply Bayesian procedures and interpret the results. Here we offer specific guidelines for four different stages of Bayesian statistical reasoning in a research setting: planning the analysis, executing the analysis, interpreting the results, and reporting the results. The guidelines for each stage are illustrated with a running example. Although the guidelines are geared toward analyses performed with the open-source statistical software JASP, most guidelines extend to Bayesian inference in general.

255 citations

Journal ArticleDOI
TL;DR: A stimulation technique that delivers mid-infrared light energy through opened skull or even non-invasively through thinned intact skull and can activate brain neurons in vivo without introducing any exogeneous gene is demonstrated with a great translational potential for activating brain neurons and boosting brain learning capability.
Abstract: Neurostimulant drugs or magnetic/electrical stimulation techniques can overcome attention deficits, but these drugs or techniques are weakly beneficial in boosting the learning capabilities of healthy subjects. Here, we report a stimulation technique, mid-infrared modulation (MIM), that delivers mid-infrared light energy through the opened skull or even non-invasively through a thinned intact skull and can activate brain neurons in vivo without introducing any exogeneous gene. Using c-Fos immunohistochemistry, in vivo single-cell electrophysiology and two-photon Ca2+ imaging in mice, we demonstrate that MIM significantly induces firing activities of neurons in the targeted cortical area. Moreover, mice that receive MIM targeting to the auditory cortex during an auditory associative learning task exhibit a faster learning speed (~50% faster) than control mice. Together, this non-invasive, opsin-free MIM technique is demonstrated with potential for modulating neuronal activity. Neurostimulant drugs or magnetic/electrical stimulation techniques have shown limited effects on learning capabilities of healthy subjects. The authors show that, without introducing an exogeneous gene, mid-infrared light can modulate firing activity of neurons in vivo and accelerate learning in mice.

37 citations

Journal ArticleDOI
TL;DR: Using traditional frequentist meta-analysis, the conclusion is that there is absence of evidence for a therapeutic effect, with a point estimate effect size of 0.05 (95% confidence interval -0.00 to 0.10, P = 0.055) as discussed by the authors.
Abstract: Numerous clinical trials of anti-amyloid beta (Aβ) immunotherapy in Alzheimer's disease have been performed. None of these have provided convincing evidence for beneficial effects. Using traditional frequentist meta-analysis, the conclusion is that there is absence of evidence for a therapeutic effect, with a point estimate effect size of 0.05 (95% confidence interval -0.00 to 0.10, P = .055). In addition, this non-significant effect equates to 0.4 points per year on the cognitive subscale of the Alzheimer's Disease Assessment Scale. This is well below the minimally clinically important difference. Bayesian meta-analysis of these trial data provides strong evidence of absence of a therapeutic effect, with a Bayes factor of 11.27 in favor of the null hypothesis, opposed to a Bayes factor of 0.09 in favor of a treatment effect. Bayesian analysis is particularly valuable in this context of repeatedly reported small, non-significant effect sizes in individual trials. Mechanisms other than removal of Aβ from the brain may be probed to slow progression of Alzheimer's disease.

30 citations

Journal ArticleDOI
TL;DR: In this article, the authors present an overview of the problems associated with undisclosed analytic flexibility, discuss why and how EEG researchers would benefit from adopting preregistration, provide guidelines and examples on how to preregister data preprocessing and analysis steps in typical ERP studies, and conclude by discussing possibilities and limitations of this open science practice.

26 citations

References
More filters
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations

Book
01 Jan 1939
TL;DR: In this paper, the authors introduce the concept of direct probabilities, approximate methods and simplifications, and significant importance tests for various complications, including one new parameter, and various complications for frequency definitions and direct methods.
Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations

Journal ArticleDOI
TL;DR: To facilitate use of the Bayes factor, an easy-to-use, Web-based program is provided that performs the necessary calculations and has better properties than other methods of inference that have been advocated in the psychological literature.
Abstract: Progress in science often comes from discovering invariances in relationships among variables; these invariances often correspond to null hypotheses. As is commonly known, it is not possible to state evidence for the null hypothesis in conventional significance testing. Here we highlight a Bayes factor alternative to the conventional t test that will allow researchers to express preference for either the null hypothesis or the alternative. The Bayes factor has a natural and straightforward interpretation, is based on reasonable assumptions, and has better properties than other methods of inference that have been advocated in the psychological literature. To facilitate use of the Bayes factor, we provide an easy-to-use, Web-based program that performs the necessary calculations.

3,012 citations


Journal ArticleDOI
Daniel J. Benjamin1, James O. Berger2, Magnus Johannesson1, Magnus Johannesson3, Brian A. Nosek4, Brian A. Nosek5, Eric-Jan Wagenmakers6, Richard A. Berk7, Kenneth A. Bollen8, Björn Brembs9, Lawrence D. Brown7, Colin F. Camerer10, David Cesarini11, David Cesarini12, Christopher D. Chambers13, Merlise A. Clyde2, Thomas D. Cook14, Thomas D. Cook15, Paul De Boeck16, Zoltan Dienes17, Anna Dreber3, Kenny Easwaran18, Charles Efferson19, Ernst Fehr20, Fiona Fidler21, Andy P. Field17, Malcolm R. Forster22, Edward I. George7, Richard Gonzalez23, Steven N. Goodman24, Edwin J. Green25, Donald P. Green26, Anthony G. Greenwald27, Jarrod D. Hadfield28, Larry V. Hedges15, Leonhard Held20, Teck-Hua Ho29, Herbert Hoijtink30, Daniel J. Hruschka31, Kosuke Imai32, Guido W. Imbens24, John P. A. Ioannidis24, Minjeong Jeon33, James Holland Jones34, Michael Kirchler35, David Laibson36, John A. List37, Roderick J. A. Little23, Arthur Lupia23, Edouard Machery38, Scott E. Maxwell39, Michael A. McCarthy21, Don A. Moore40, Stephen L. Morgan41, Marcus R. Munafò42, Shinichi Nakagawa43, Brendan Nyhan44, Timothy H. Parker45, Luis R. Pericchi46, Marco Perugini47, Jeffrey N. Rouder48, Judith Rousseau49, Victoria Savalei50, Felix D. Schönbrodt51, Thomas Sellke52, Betsy Sinclair53, Dustin Tingley36, Trisha Van Zandt16, Simine Vazire54, Duncan J. Watts55, Christopher Winship36, Robert L. Wolpert2, Yu Xie32, Cristobal Young24, Jonathan Zinman44, Valen E. Johnson18, Valen E. Johnson1 
University of Southern California1, Duke University2, Stockholm School of Economics3, University of Virginia4, Center for Open Science5, University of Amsterdam6, University of Pennsylvania7, University of North Carolina at Chapel Hill8, University of Regensburg9, California Institute of Technology10, Research Institute of Industrial Economics11, New York University12, Cardiff University13, Mathematica Policy Research14, Northwestern University15, Ohio State University16, University of Sussex17, Texas A&M University18, Royal Holloway, University of London19, University of Zurich20, University of Melbourne21, University of Wisconsin-Madison22, University of Michigan23, Stanford University24, Rutgers University25, Columbia University26, University of Washington27, University of Edinburgh28, National University of Singapore29, Utrecht University30, Arizona State University31, Princeton University32, University of California, Los Angeles33, Imperial College London34, University of Innsbruck35, Harvard University36, University of Chicago37, University of Pittsburgh38, University of Notre Dame39, University of California, Berkeley40, Johns Hopkins University41, University of Bristol42, University of New South Wales43, Dartmouth College44, Whitman College45, University of Puerto Rico46, University of Milan47, University of California, Irvine48, Paris Dauphine University49, University of British Columbia50, Ludwig Maximilian University of Munich51, Purdue University52, Washington University in St. Louis53, University of California, Davis54, Microsoft55
TL;DR: The default P-value threshold for statistical significance is proposed to be changed from 0.05 to 0.005 for claims of new discoveries in order to reduce uncertainty in the number of discoveries.
Abstract: We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.

1,586 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Using bayes factor hypothesis testing in neuroscience to establish evidence of absence" ?

The authors show why P values cannot differentiate inconclusive null findings from evidence for the absence of an effect, and provide a tutorial on using Bayes factor hypothesis testing (Bayesian t-tests and ANOVA in the open-source package JASP) to establish both evidence of absence and absence of evidence.