Zurich Open Repository and Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch

Year: 2018

On p-Values and Bayes Factors

Held, Leonhard; Ott, Manuela
DOI: https://doi.org/10.1146/annurev-statistics-031017-100307

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-148600
Journal Article
Accepted Version

Originally published at:
Held, Leonhard; Ott, Manuela (2018). On p-Values and Bayes Factors. Annual Review of Statistics and Its Application, 5(1):393-419.
On P-values and Bayes factors
Leonhard Held and Manuela Ott
Epidemiology, Biostatistics and Prevention Institute, University of Zurich,
Zurich, Switzerland, CH-8001; email: leonhard.held@uzh.ch, manuela.ott@uzh.ch
Copyright © YYYY by Annual Reviews. All rights reserved
Keywords
Bayes factor, evidence, minimum Bayes factor, objective Bayes, P-value, sample size
Abstract
The P-value quantifies the discrepancy between the data and a null hypothesis of interest, usually the assumption of no difference or no effect. A Bayesian approach allows the calibration of P-values by transforming them to direct measures of the evidence against the null hypothesis, so-called Bayes factors. We review the available literature in this area and consider two-sided significance tests for a point null hypothesis in more detail. We distinguish simple from local alternative hypotheses and contrast traditional Bayes factors based on the data with Bayes factors based on P-values or test statistics. A well-known finding is that the minimum Bayes factor, the smallest possible Bayes factor within a certain class of alternative hypotheses, provides less evidence against the null hypothesis than the corresponding P-value might suggest. It is less known that the relationship between P-values and minimum Bayes factors also depends on the sample size and on the dimension of the parameter of interest. We illustrate the transformation of P-values to minimum Bayes factors with two examples from clinical research.
1. INTRODUCTION
The P-value is the probability, under the assumption of no association or no effect (the null hypothesis $H_0$), of obtaining a result equal to or more extreme than what was actually observed (Goodman 2005). P-values for point null hypotheses still dominate most of the applied literature (Greenland & Poole 2013), despite the fact that P-values are commonly misused (Wasserstein & Lazar 2016; Matthews et al. 2017). Specifically, a quantitative interpretation of P-values beyond the notorious dichotomization into “significant” and “non-significant” has caused a lot of confusion, and misinterpretations are commonplace. Most prominent is the widespread belief that the P-value is the probability of a “chance finding”, i.e. the probability of the null hypothesis, but many other misinterpretations can also be found (Goodman 2008; Greenland et al. 2016).
P-value: the probability, under the assumption of no effect (the null hypothesis $H_0$), of obtaining a result equal to or more extreme than what was actually observed.
A first step towards a quantitative interpretation of P-values is a categorization into more than two levels, making a step away from the Neyman-Pearson hypothesis test paradigm to Fisher’s significance test. Cox & Donnelly (2011, Section 8.4) give the following guidelines to interpret P-values as measures of evidence against a null hypothesis $H_0$: if $p \simeq 0.1$ there is “a suggestion of evidence” against $H_0$; if $p \simeq 0.05$ there is “modest evidence” against $H_0$; if $p \simeq 0.01$ there is “strong evidence” against $H_0$. Bland (2015, Section 9.4) suggests a similar “rough and ready guide” with five levels, reproduced in Table 1.¹ Similar categories have been proposed in many other applied statistics textbooks, for example Ramsey & Schafer (2002).
However, such categorizations always carry a level of arbitrariness. In addition, P-values are only indirect measures of evidence: A P-value is computed under the assumption that the null hypothesis $H_0$ is true, so it is conditional on $H_0$. It does not allow for conclusions about the probability of $H_0$ given the data, which is usually of primary interest. More precisely, a P-value is a quantitative measure of discrepancy between the data and the point null hypothesis $H_0$ (Goodman 1999a). But, as Cox (2006, page 83) puts it, “conclusions expressed in terms of probability are on the face of it more powerful than those expressed indirectly via confidence intervals and p-values”. Such direct conclusions can be obtained by using Bayes factors. Assuming an alternative hypothesis $H_1$ has also been specified, the Bayes factor directly quantifies whether the data have increased or decreased the odds of $H_0$. A better approach than categorizing a P-value is thus to transform a P-value to a Bayes factor or a lower bound on a Bayes factor, a so-called minimum Bayes factor (Goodman 1999b). However, many ways to calibrate P-values have been proposed, and there is currently no consensus on how P-values should be transformed to Bayes factors.
First, there is an important distinction between tests for direction and tests for existence (Marsman & Wagenmakers 2017). Tests for direction investigate whether the parameter of interest is above or below a specific value, assuming that there is an effect. For example, a test for direction can be used to assess whether a treatment effect is positive or negative. Tests for direction are usually conducted with one-sided P-values and there is a close correspondence to the Bayesian approach based on the posterior probability that the effect is positive or negative. In fact, this posterior probability is often equal or approximately equal to the one-sided P-value, if a non-informative prior is used (Casella & Berger 1987). A simple example is given in Lee (2004, Section 4.2).
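The correspondence can be checked numerically in a minimal sketch. We assume, purely for illustration, a single normally distributed estimate `y` with known standard error `se` and an improper flat prior on $\theta$; the model, the function names and the numbers are our assumptions and are not taken from the cited references. The posterior is then $\theta \mid y \sim N(y, se^2)$, and the posterior probability of $\theta \leq 0$ coincides with the one-sided P-value for $H_0: \theta = 0$ against $\theta > 0$.

```python
from statistics import NormalDist


def one_sided_p_value(y: float, se: float) -> float:
    """One-sided P-value for H0: theta = 0 against theta > 0 (normal model)."""
    z = y / se
    return 1.0 - NormalDist().cdf(z)


def posterior_prob_nonpositive(y: float, se: float) -> float:
    """Pr(theta <= 0 | y) under a flat prior, where theta | y ~ N(y, se^2)."""
    return NormalDist(mu=y, sigma=se).cdf(0.0)


# Hypothetical estimate and standard error (illustrative values only).
y, se = 1.2, 0.6
p = one_sided_p_value(y, se)
post = posterior_prob_nonpositive(y, se)
print(f"one-sided P-value:  {p:.4f}")
print(f"Pr(theta <= 0 | y): {post:.4f}")
```

Both quantities agree up to floating-point error in this setting. The agreement relies on the flat prior and would not hold in general for an informative prior.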
One-sided P-value: based on the probabilities of extreme values in one pre-specified direction of a point null hypothesis.
¹ Note that the categories in the right column are shifted since Cox & Donnelly (2011) specify the amount of evidence of specific P-values ($p \simeq 0.1$, 0.05 and 0.01), which correspond to certain cutpoints in the categorization by Bland (2015).
                 Strength of evidence against $H_0$
P-value          Bland (2015)             Cox & Donnelly (2011)
> 0.1            Little or no evidence
                                          A suggestion of evidence
0.1 to 0.05      Weak evidence
                                          Modest evidence
0.05 to 0.01     Evidence
                                          Strong evidence
0.01 to 0.001    Strong evidence
                                          (not available)
< 0.001          Very strong evidence

Table 1 Categorization of P-values into levels of evidence against $H_0$
In contrast, tests for existence aim to summarize the evidence against the point null hypothesis of no effect. Tests for existence can be conducted with one-sided or two-sided P-values, but the correspondence of the P-value to the Bayesian posterior probability of the null is now lost and care has to be taken to transform P-values to Bayes factors.
Two-sided P-value: based on the probabilities of extreme values in both directions of a point null hypothesis.
In this paper we consider tests for existence. We will review different methods that have been proposed to calibrate P-values, identify problems with some of the proposed methods and give general recommendations on how to transform P-values to (minimum) Bayes factors. We will emphasize that this transformation depends on how the P-value has been calculated. Specifically, the sample size as well as the dimension of the parameter of interest matters. It also matters whether the P-value came from a study with a well-defined alternative hypothesis, or from a study used to generate possible hypotheses.
1.1. Bayes Factors
Consider a significance test for existence with a point null hypothesis $H_0: \theta = \theta_0$, where the parameter of interest $\theta$ may be a scalar or a vector. In many problems $\theta_0 = 0$, for example when testing if there is evidence for a difference $\theta$ between two treatment groups. The alternative hypothesis may be simple, i.e. $H_1: \theta = \theta_1 \neq \theta_0$, or composite, usually $H_1: \theta \neq \theta_0$. In the latter case, a Bayesian approach requires a prior distribution $f(\theta \mid H_1)$ to be specified. Local alternatives, represented by a unimodal symmetric prior distribution centered around the null value $\theta_0$, are the common choice. In contrast, non-local alternatives (Johnson & Rossell 2010) have zero probability mass in a neighborhood of $\theta_0$, with the simple alternative $H_1: \theta = \theta_1 \neq \theta_0$ being a special case.
Local alternatives: a unimodal symmetric prior distribution of alternatives centered around the null value.
The Bayes factor (BF) transforms the prior odds $\Pr(H_0)/\Pr(H_1)$ (where $\Pr(H_1) = 1 - \Pr(H_0)$) to the posterior odds $\Pr(H_0 \mid y)/\Pr(H_1 \mid y)$ in the light of the data $y$:

$$\frac{\Pr(H_0 \mid y)}{\Pr(H_1 \mid y)} = \mathrm{BF}(y) \cdot \frac{\Pr(H_0)}{\Pr(H_1)}. \tag{1}$$
The Bayes factor $\mathrm{BF}(y)$ is thus a direct quantitative measure of how the data $y$ have increased or decreased the odds of $H_0$, regardless of the actual value of the prior probability $\Pr(H_0)$. The Bayes factor (or its logarithm) is therefore often referred to as the “strength of evidence” or “weight of evidence” (Good 1950; Bernardo & Smith 2000). If necessary, we may add an index to $\mathrm{BF}(y)$, where $\mathrm{BF}_{01}(y)$ stands for “$H_0$ versus $H_1$”, so $\mathrm{BF}_{10}(y) = 1/\mathrm{BF}_{01}(y)$.
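To make the updating rule (1) concrete, the following minimal sketch converts a Bayes factor $\mathrm{BF}_{01}$ and a prior probability $\Pr(H_0)$ into the posterior probability $\Pr(H_0 \mid y)$. The function name and the numerical values are hypothetical and chosen only for illustration.

```python
def posterior_prob_h0(bf01: float, prior_prob_h0: float) -> float:
    """Posterior probability of H0 from BF_01 and Pr(H0), using equation (1)."""
    if not 0.0 < prior_prob_h0 < 1.0:
        raise ValueError("prior_prob_h0 must lie strictly between 0 and 1")
    if bf01 <= 0.0:
        raise ValueError("bf01 must be positive")
    prior_odds = prior_prob_h0 / (1.0 - prior_prob_h0)
    posterior_odds = bf01 * prior_odds  # equation (1)
    return posterior_odds / (1.0 + posterior_odds)


# The same Bayes factor applied to two different (hypothetical) prior probabilities.
for prior in (0.5, 0.8):
    print(f"Pr(H0) = {prior}: Pr(H0 | y) = {posterior_prob_h0(1 / 7, prior):.3f}")
```

The same Bayes factor multiplies the prior odds by the same amount whatever the prior probability, whereas the resulting posterior probability does depend on $\Pr(H_0)$.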
Bayes factor: compares the likelihood of the data $y$ under the null hypothesis $H_0$ to the likelihood under the alternative hypothesis $H_1$.
                   Strength of evidence against $H_0$
Bayes factor       Jeffreys (1961)    Goodman (1999b)           Held & Ott (2016)
1 to 1/3           Bare mention       Weak                      Weak
1/3 to 1/10        Substantial        Moderate                  Moderate
1/10 to 1/30       Strong             Moderate to strong        Substantial
1/30 to 1/100      Very strong        Strong to very strong     Strong
1/100 to 1/300     Decisive           (not available)           Very strong
< 1/300            Decisive           (not available)           Decisive

Table 2 Categorization of Bayes factors $\mathrm{BF} \leq 1$ into levels of evidence against $H_0$

In (1), the Bayes factor

$$\mathrm{BF}(y) = \frac{f(y \mid H_0)}{f(y \mid H_1)} \tag{2}$$

is the ratio of the likelihood $f(y \mid H_0) = f(y \mid \theta = \theta_0)$ of the observed data $y$ under the null hypothesis $H_0$ and the likelihood

$$f(y \mid H_1) = \int f(y \mid \theta)\, f(\theta \mid H_1)\, d\theta \tag{3}$$

under the alternative hypothesis $H_1$. For a simple alternative, (3) reduces to the ordinary likelihood $f(y \mid H_1) = f(y \mid \theta = \theta_1)$ and the Bayes factor (2) reduces to a likelihood ratio. In general (3) represents a marginal likelihood, i.e. the average likelihood $f(y \mid \theta)$ with respect to the prior distribution $f(\theta \mid H_1)$ for $\theta$ under the alternative $H_1$ (Kass & Raftery 1995). Note that the computation of the Bayes factor via (2) does not require the specification of the prior probability $\Pr(H_0)$.
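As an illustration of (2) and (3), the sketch below evaluates $\mathrm{BF}_{01}$ for a simple and for a local alternative. It assumes a deliberately simple setting that is ours and not taken from the text above: one normal estimate `y` with known standard error `se`, $\theta_0 = 0$, and a normal prior $\theta \mid H_1 \sim N(0, \tau^2)$ as the local alternative. In this conjugate case the marginal likelihood (3) has the closed form $y \mid H_1 \sim N(0, se^2 + \tau^2)$, which the code compares against a direct numerical approximation of the integral. All function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bf01_simple(y: float, se: float, theta1: float) -> float:
    """BF_01 for the simple alternative H1: theta = theta1 (a likelihood ratio)."""
    return NormalDist(0.0, se).pdf(y) / NormalDist(theta1, se).pdf(y)


def bf01_local_normal(y: float, se: float, tau: float) -> float:
    """BF_01 for the assumed local prior theta | H1 ~ N(0, tau^2).

    Under this prior the marginal likelihood (3) is y | H1 ~ N(0, se^2 + tau^2).
    """
    marginal = NormalDist(0.0, sqrt(se**2 + tau**2)).pdf(y)
    return NormalDist(0.0, se).pdf(y) / marginal


def marginal_likelihood_numeric(y, se, tau, n_grid=20001, width=10.0):
    """Riemann-sum approximation of the integral in (3), used as a sanity check."""
    prior = NormalDist(0.0, tau)
    lower = -width * tau
    step = 2.0 * width * tau / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        theta = lower + i * step
        total += NormalDist(theta, se).pdf(y) * prior.pdf(theta)
    return total * step


# Hypothetical estimate, standard error and prior standard deviation.
y, se, tau = 1.2, 0.6, 1.0
closed_form = NormalDist(0.0, sqrt(se**2 + tau**2)).pdf(y)
numeric = marginal_likelihood_numeric(y, se, tau)
print(f"marginal likelihood: closed form {closed_form:.6f}, numeric {numeric:.6f}")
print(f"BF_01, simple alternative theta1 = 1.0: {bf01_simple(y, se, 1.0):.3f}")
print(f"BF_01, local normal prior:              {bf01_local_normal(y, se, tau):.3f}")
```

Neither function takes $\Pr(H_0)$ as an argument, in line with the remark that (2) does not require the prior probability of the null hypothesis.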
Marginal likelihood: the average likelihood with respect to a prior distribution for alternative hypotheses.
In this paper we focus on the evidence against a point null hypothesis provided by small Bayes factors $\mathrm{BF}_{01} \leq 1$, such that Bayes factors lie in the same range as P-values, which facilitates comparisons. To categorize such Bayes factors, Held & Ott (2016) provided a six-grade scale reproduced in Table 2, which was proposed as a compromise of the grades proposed in Jeffreys (1961, Appendix B) and Goodman (1999b, Table 1 and 2) (also shown in Table 2).²
Communication of Bayes factors is of central importance. The categories shown in Table 2 are helpful in this respect, but there remains a level of arbitrariness in the definition of the category levels. Ideally, the Bayes factor itself should be reported, so a comprehensible format for Bayes factors is crucial. We recommend presenting Bayes factors as ratios, for example $\mathrm{BF}_{01} = 1/7$, since this underlines the symmetry of Bayes factors if numerator and denominator are exchanged, here $\mathrm{BF}_{10} = 7/1$. For Bayes factors smaller than 1/10, say, it is usually sufficient to report Bayes factors in the $1/x$ format, where $x$ is an integer. If the Bayes factor is larger, then we recommend using an additional decimal place for $x$, e.g. $\mathrm{BF} = 1/2.5$ or $\mathrm{BF} = 1/1.3$, to achieve better accuracy.
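The reporting recommendation and the Held & Ott (2016) column of Table 2 can be sketched as two small helper functions. The function names are hypothetical. Two details are our assumptions, since the text does not settle them: a Bayes factor lying exactly on a cutpoint is assigned to the weaker category, and a trailing ".0" is dropped so that 1/7 is printed as in the text.

```python
def format_bf01(bf01: float) -> str:
    """Format a Bayes factor BF_01 <= 1 as a ratio 1/x."""
    if not 0.0 < bf01 <= 1.0:
        raise ValueError("expected a Bayes factor in (0, 1]")
    x = 1.0 / bf01
    if bf01 < 1 / 10:
        # Small Bayes factors: an integer denominator is sufficient.
        return f"1/{round(x)}"
    # Larger Bayes factors: one decimal place (assumption: drop a trailing ".0").
    return "1/" + f"{x:.1f}".rstrip("0").rstrip(".")


def evidence_category(bf01: float) -> str:
    """Evidence against H0 on the six-grade scale of Held & Ott (2016), Table 2."""
    if not 0.0 < bf01 <= 1.0:
        raise ValueError("expected a Bayes factor in (0, 1]")
    lower_cutpoints = [
        (1 / 3, "weak"),
        (1 / 10, "moderate"),
        (1 / 30, "substantial"),
        (1 / 100, "strong"),
        (1 / 300, "very strong"),
    ]
    for lower, label in lower_cutpoints:
        if bf01 >= lower:  # assumption: a cutpoint belongs to the weaker category
            return label
    return "decisive"


for bf in (1 / 1.3, 1 / 2.5, 1 / 7, 1 / 40, 1 / 500):
    print(f"BF_01 = {format_bf01(bf)}: {evidence_category(bf)} evidence against H0")
```

Reporting the formatted ratio alongside the category keeps the actual Bayes factor visible, so the arbitrariness of the category levels matters less.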
The Bayes factor (2) is based on the data $y$, sometimes called a data-based Bayes factor (Held et al. 2015) to distinguish it from Bayes factors based on test statistics or P-values. Indeed, the step from a P-value $p$ to a Bayes factor is most easily accomplished by treating $p$ as the data $y$ in (2) to obtain a P-based Bayes factor based on the sampling distribution
² Jeffreys has actually used the slightly different cutpoints $(1/\sqrt{10})^a$, $a = 1, 2, 3, 4$, whereas Goodman has specified his evidence categories for Bayes factors of 1/5, 1/10, 1/20 and 1/100, which we have somewhat shifted to our cutpoints 1/3, 1/10, 1/30 and 1/100.