
Zurich Open Repository and Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch

Year: 2018

On p-Values and Bayes Factors

Held, Leonhard; Ott, Manuela

Abstract: The p-value quantifies the discrepancy between the data and a null hypothesis of interest, usually the assumption of no difference or no effect. A Bayesian approach allows the calibration of p-values by transforming them to direct measures of the evidence against the null hypothesis, so-called Bayes factors. We review the available literature in this area and consider two-sided significance tests for a point null hypothesis in more detail. We distinguish simple from local alternative hypotheses and contrast traditional Bayes factors based on the data with Bayes factors based on p-values or test statistics. A well-known finding is that the minimum Bayes factor, the smallest possible Bayes factor within a certain class of alternative hypotheses, provides less evidence against the null hypothesis than the corresponding p-value might suggest. It is less known that the relationship between p-values and minimum Bayes factors also depends on the sample size and on the dimension of the parameter of interest. We illustrate the transformation of p-values to minimum Bayes factors with two examples from clinical research.

DOI: https://doi.org/10.1146/annurev-statistics-031017-100307

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-148600
Journal Article
Accepted Version

Originally published at:
Held, Leonhard; Ott, Manuela (2018). On p-Values and Bayes Factors. Annual Review of Statistics and Its Application, 5(1):393-419.
DOI: https://doi.org/10.1146/annurev-statistics-031017-100307

On P-values and Bayes factors

Leonhard Held and Manuela Ott
Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland, CH-8001; email: leonhard.held@uzh.ch, manuela.ott@uzh.ch
Keywords
Bayes factor, evidence, minimum Bayes factor, objective Bayes,
P-value, sample size
Abstract
The P-value quantifies the discrepancy between the data and a null hypothesis of interest, usually the assumption of no difference or no effect. A Bayesian approach allows the calibration of P-values by transforming them to direct measures of the evidence against the null hypothesis, so-called Bayes factors. We review the available literature in this area and consider two-sided significance tests for a point null hypothesis in more detail. We distinguish simple from local alternative hypotheses and contrast traditional Bayes factors based on the data with Bayes factors based on P-values or test statistics. A well-known finding is that the minimum Bayes factor, the smallest possible Bayes factor within a certain class of alternative hypotheses, provides less evidence against the null hypothesis than the corresponding P-value might suggest. It is less known that the relationship between P-values and minimum Bayes factors also depends on the sample size and on the dimension of the parameter of interest. We illustrate the transformation of P-values to minimum Bayes factors with two examples from clinical research.

1. INTRODUCTION
The P-value is the probability, under the assumption of no association or no effect (the null hypothesis H_0), of obtaining a result equal to or more extreme than what was actually observed (Goodman 2005). P-values for point null hypotheses still dominate most of the applied literature (Greenland & Poole 2013), despite the fact that P-values are commonly misused (Wasserstein & Lazar 2016; Matthews et al. 2017). Specifically, a quantitative interpretation of P-values beyond the notorious dichotomization into “significant” and “non-significant” has caused a lot of confusion, and misinterpretations are commonplace. Most prominent is the widespread belief that the P-value is the probability of a “chance finding”, i.e. the probability of the null hypothesis, but many other misinterpretations can also be found (Goodman 2008; Greenland et al. 2016).
P-value: the probability, under the assumption of no effect (the null hypothesis H_0), of obtaining a result equal to or more extreme than what was actually observed.
A first step towards a quantitative interpretation of P-values is a categorization into more than two levels, moving away from the Neyman-Pearson hypothesis test paradigm towards Fisher’s significance test. Cox & Donnelly (2011, Section 8.4) give the following guidelines to interpret P-values as measures of evidence against a null hypothesis H_0: if p ≈ 0.1 there is “a suggestion of evidence” against H_0; if p ≈ 0.05 there is “modest evidence” against H_0; if p ≈ 0.01 there is “strong evidence” against H_0. Bland (2015, Section 9.4) suggests a similar “rough and ready guide” with five levels, reproduced in Table 1 (see Footnote 1). Similar categories have been proposed in many other applied statistics textbooks, for example Ramsey & Schafer (2002).
However, such categorizations always carry a level of arbitrariness. In addition, P-values are only indirect measures of evidence: a P-value is computed under the assumption that the null hypothesis H_0 is true, so it is conditional on H_0. It does not allow for conclusions about the probability of H_0 given the data, which is usually of primary interest. More precisely, a P-value is a quantitative measure of discrepancy between the data and the point null hypothesis H_0 (Goodman 1999a). But, as Cox (2006, page 83) puts it, “conclusions expressed in terms of probability are on the face of it more powerful than those expressed indirectly via confidence intervals and p-values”. Such direct conclusions can be obtained by using Bayes factors. Assuming an alternative hypothesis H_1 has also been specified, the Bayes factor directly quantifies whether the data have increased or decreased the odds of H_0. A better approach than categorizing a P-value is thus to transform it to a Bayes factor or to a lower bound on a Bayes factor, a so-called minimum Bayes factor (Goodman 1999b). But many ways to calibrate P-values have been proposed, and there is currently no consensus on how P-values should be transformed to Bayes factors.
First, there is an important distinction between tests for direction and tests for existence (Marsman & Wagenmakers 2017). Tests for direction investigate whether the parameter of interest is above or below a specific value, assuming that there is an effect. For example, a test for direction can be used to assess whether a treatment effect is positive or negative. Tests for direction are usually conducted with one-sided P-values, and there is a close correspondence to the Bayesian approach based on the posterior probability that the effect is positive or negative. In fact, this posterior probability is often equal or approximately equal to the one-sided P-value if a non-informative prior is used (Casella & Berger 1987). A simple example is given in Lee (2004, Section 4.2).
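To make this correspondence concrete, here is a minimal numerical sketch (our own illustration, not taken from the paper). It assumes a normal mean with known standard deviation and a flat non-informative prior; in this setting the one-sided P-value and the posterior probability that the effect lies in the null direction coincide exactly.

```python
# Minimal sketch (our illustration, not from the paper): for a normal mean
# with known standard deviation and a flat (non-informative) prior, the
# one-sided P-value for H0: theta <= 0 equals Pr(theta <= 0 | data).
import numpy as np
from scipy.stats import norm

def one_sided_p_and_posterior(y, sigma=1.0):
    """Return the one-sided P-value and Pr(theta <= 0 | y) under a flat prior."""
    n = len(y)
    ybar = np.mean(y)
    se = sigma / np.sqrt(n)
    p_one_sided = 1 - norm.cdf(ybar / se)          # P-value against H1: theta > 0
    post_prob = norm.cdf(0.0, loc=ybar, scale=se)  # posterior: theta | y ~ N(ybar, se^2)
    return p_one_sided, post_prob

rng = np.random.default_rng(1)
y = rng.normal(loc=0.3, scale=1.0, size=50)  # simulated data with a positive effect
print(one_sided_p_and_posterior(y))          # the two numbers agree
```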
One-sided P-value: based on the probabilities of extreme values in one pre-specified direction of a point null hypothesis.
Footnote 1: Note that the categories in the right column are shifted, since Cox & Donnelly (2011) specify the amount of evidence for specific P-values (p ≈ 0.1, 0.05 and 0.01), which correspond to certain cutpoints in the categorization by Bland (2015).

Table 1  Categorization of P-values into levels of evidence against H_0

                     Strength of evidence against H_0
P-value              Bland (2015)              Cox & Donnelly (2011)
> 0.1                Little or no evidence
                                               A suggestion of evidence
0.1 to 0.05          Weak evidence
                                               Modest evidence
0.05 to 0.01         Evidence
                                               Strong evidence
0.01 to 0.001        Strong evidence
                                               (not available)
< 0.001              Very strong evidence
In contrast, tests for existence aim to summarize the evidence against the point null hypothesis of no effect. Tests for existence can be conducted with one-sided or two-sided P-values, but the correspondence of the P-value to the Bayesian posterior probability of the null is now lost, and care has to be taken to transform P-values to Bayes factors.
Two-sided P-value: based on the probabilities of extreme values in both directions of a point null hypothesis.
In this paper we consider tests for existence. We will review different methods that have been proposed to calibrate P-values, identify problems with some of these methods, and give general recommendations on how to transform P-values to (minimum) Bayes factors. We will emphasize that this transformation depends on how the P-value has been calculated. Specifically, the sample size as well as the dimension of the parameter of interest matters. It also matters whether the P-value came from a study with a well-defined alternative hypothesis, or from a study used to generate possible hypotheses.
1.1. Bayes Factors
Consider a significance test for existence with a point null hypothesis H_0: θ = θ_0, where the parameter of interest θ may be a scalar or a vector. In many problems θ_0 = 0, for example when testing if there is evidence for a difference θ between two treatment groups. The alternative hypothesis may be simple, i.e. H_1: θ = θ_1 ≠ θ_0, or composite, usually H_1: θ ≠ θ_0. In the latter case, a Bayesian approach now requires a prior distribution f(θ | H_1) to be specified. Local alternatives, represented by a unimodal symmetric prior distribution centered around the null value θ_0, are the common choice. In contrast, non-local alternatives (Johnson & Rossell 2010) have zero probability mass in a neighborhood of θ_0, with the simple alternative H_1: θ = θ_1 ≠ θ_0 being a special case.
Local alternatives: a unimodal symmetric prior distribution of alternatives centered around the null value.
The Bayes factor (BF) transforms the prior odds Pr(H_0)/Pr(H_1) (where Pr(H_1) = 1 − Pr(H_0)) to the posterior odds Pr(H_0 | y)/Pr(H_1 | y) in the light of the data y:

    Pr(H_0 | y) / Pr(H_1 | y) = BF(y) · Pr(H_0) / Pr(H_1).    (1)
The Bayes factor BF(y) is thus a direct quantitative measure of how the data y have increased or decreased the odds of H_0, regardless of the actual value of the prior probability Pr(H_0). The Bayes factor (or its logarithm) is therefore often referred to as the “strength of evidence” or “weight of evidence” (Good 1950; Bernardo & Smith 2000). If necessary, we may add an index to BF(y), where BF_01(y) stands for “H_0 versus H_1”, so BF_10(y) = 1/BF_01(y).
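As a small numerical illustration (ours, not from the paper), equation (1) can be applied directly to turn a Bayes factor and a prior probability of H_0 into a posterior probability of H_0:

```python
# Sketch (our illustration) of equation (1): posterior odds = BF_01 x prior odds.
def posterior_prob_h0(bf01, prior_prob_h0=0.5):
    """Posterior probability of H0 given BF_01 = f(y | H0) / f(y | H1)."""
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    posterior_odds = bf01 * prior_odds        # equation (1)
    return posterior_odds / (1 + posterior_odds)

# A Bayes factor of 1/7 and prior odds of 1:1 give Pr(H0 | y) = 1/8.
print(posterior_prob_h0(1 / 7))  # 0.125
```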
Bayes factor: compares the likelihood of the data y under the null hypothesis H_0 to the likelihood under the alternative hypothesis H_1.
In (1), the Bayes factor is

    BF(y) = f(y | H_0) / f(y | H_1).    (2)

Table 2  Categorization of Bayes factors BF ≤ 1 into levels of evidence against H_0

                     Strength of evidence against H_0
Bayes factor         Jeffreys (1961)   Goodman (1999b)         Held & Ott (2016)
1 to 1/3             Bare mention      Weak                    Weak
1/3 to 1/10          Substantial       Moderate                Moderate
1/10 to 1/30         Strong            Moderate to strong      Substantial
1/30 to 1/100        Very strong       Strong to very strong   Strong
1/100 to 1/300       Decisive          (not available)         Very strong
< 1/300              Decisive          (not available)         Decisive
The Bayes factor (2) is the ratio of the likelihood f(y | H_0) = f(y | θ = θ_0) of the observed data y under the null hypothesis H_0 and the likelihood

    f(y | H_1) = ∫ f(y | θ) f(θ | H_1) dθ    (3)

under the alternative hypothesis H_1. For a simple alternative, (3) reduces to the ordinary likelihood f(y | H_1) = f(y | θ = θ_1) and the Bayes factor (2) reduces to a likelihood ratio. In general, (3) represents a marginal likelihood, i.e. the average likelihood f(y | θ) with respect to the prior distribution f(θ | H_1) for θ under the alternative H_1 (Kass & Raftery 1995). Note that the computation of the Bayes factor via (2) does not require the specification of the prior probability Pr(H_0).
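For the simplest case of a normal mean with known variance and a normal local prior under H_1, the marginal likelihood (3) has a closed form and the data-based Bayes factor (2) can be evaluated directly. The following sketch is our own illustration under these assumptions (the parameter values are hypothetical), not code from the paper.

```python
# Sketch (our illustration): data-based Bayes factor (2) for a normal mean
# with known variance. Under H0: theta = theta0, ybar ~ N(theta0, sigma^2/n);
# under the local alternative theta ~ N(theta0, tau^2), the marginal
# likelihood (3) gives ybar ~ N(theta0, sigma^2/n + tau^2).
import numpy as np
from scipy.stats import norm

def bf01_normal(ybar, n, sigma=1.0, tau=1.0, theta0=0.0):
    """Bayes factor BF_01 = f(ybar | H0) / f(ybar | H1)."""
    se = sigma / np.sqrt(n)
    f0 = norm.pdf(ybar, loc=theta0, scale=se)                       # likelihood under H0
    f1 = norm.pdf(ybar, loc=theta0, scale=np.sqrt(se**2 + tau**2))  # marginal likelihood under H1
    return f0 / f1

print(bf01_normal(ybar=0.4, n=50))  # about 1/7: evidence against H0
```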
Marginal likelihood: the average likelihood with respect to a prior distribution for alternative hypotheses.
In this paper we focus on the evidence against a point null hypothesis provided by small Bayes factors BF_01 ≤ 1, such that Bayes factors lie in the same range as P-values, which facilitates comparisons. To categorize such Bayes factors, Held & Ott (2016) provided a six-grade scale, reproduced in Table 2, which was proposed as a compromise between the grades proposed in Jeffreys (1961, Appendix B) and Goodman (1999b, Tables 1 and 2) (also shown in Table 2; see Footnote 2).
Communication of Bayes factors is of central importance. The categories shown in Table 2 are helpful in this respect, but there remains a level of arbitrariness in the definition of the category levels. Ideally, the Bayes factor itself should be reported, and comprehensive formatting of Bayes factors is now crucial. We recommend presenting Bayes factors as ratios, for example BF_01 = 1/7, since this underlines the symmetry of Bayes factors if numerator and denominator are exchanged, here BF_10 = 7/1. For Bayes factors smaller than 1/10, say, it is usually sufficient to report Bayes factors in the 1/x format, where x is an integer. If the Bayes factor is larger, then we recommend using an additional decimal place for x, e.g. BF = 1/2.5 or BF = 1/1.3, to achieve better accuracy.
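A small helper (our sketch, merely illustrating the reporting convention just described) makes the recommended 1/x format concrete:

```python
# Sketch (our illustration) of the recommended reporting format: an integer x
# for Bayes factors below 1/10, one decimal place for x otherwise.
def format_bf(bf01):
    """Format a Bayes factor BF_01 < 1 as a ratio '1/x'."""
    x = 1 / bf01
    return f"1/{x:.0f}" if bf01 < 1 / 10 else f"1/{x:.1f}"

print(format_bf(1 / 37.2))  # '1/37'
print(format_bf(0.4))       # '1/2.5'
```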
The Bayes factor (2) is based on the data y, sometimes called a data-based Bayes factor (Held et al. 2015) to distinguish it from Bayes factors based on test statistics or P-values. Indeed, the step from a P-value p to a Bayes factor is most easily accomplished by treating p as the data y in (2) to obtain a P-based Bayes factor based on the sampling distribution

Footnote 2: Jeffreys actually used the slightly different cutpoints (1/√10)^a, a = 1, 2, 3, 4, whereas Goodman specified his evidence categories for Bayes factors of 1/5, 1/10, 1/20 and 1/100, which we have somewhat shifted to our cutpoints 1/3, 1/10, 1/30 and 1/100.
