On p-Values and Bayes Factors

doi:10.1146/ANNUREV-STATISTICS-031017-100307


Year: 2018

Held, Leonhard; Ott, Manuela


DOI: https://doi.org/10.1146/annurev-statistics-031017-100307

Posted at the Zurich Open Repository and Archive, University of Zurich

ZORA URL: https://doi.org/10.5167/uzh-148600

Journal Article

Accepted Version

Originally published at:

Held, Leonhard; Ott, Manuela (2018). On p-Values and Bayes Factors. Annual Review of Statistics and Its Application, 5(1):393-419.

On P-values and Bayes factors

Leonhard Held and Manuela Ott

Epidemiology, Biostatistics and Prevention Institute, University of Zurich,

Zurich, Switzerland, CH-8001; email: leonhard.held@uzh.ch, manuela.ott@uzh.ch


Keywords

Bayes factor, evidence, minimum Bayes factor, objective Bayes, P-value, sample size

Abstract

The P-value quantifies the discrepancy between the data and a null hypothesis of interest, usually the assumption of no difference or no effect. A Bayesian approach allows P-values to be calibrated by transforming them to direct measures of the evidence against the null hypothesis, so-called Bayes factors. We review the available literature in this area and consider two-sided significance tests for a point null hypothesis in more detail. We distinguish simple from local alternative hypotheses and contrast traditional Bayes factors based on the data with Bayes factors based on P-values or test statistics. A well-known finding is that the minimum Bayes factor, the smallest possible Bayes factor within a certain class of alternative hypotheses, provides less evidence against the null hypothesis than the corresponding P-value might suggest. It is less known that the relationship between P-values and minimum Bayes factors also depends on the sample size and on the dimension of the parameter of interest. We illustrate the transformation of P-values to minimum Bayes factors with two examples from clinical research.


1. INTRODUCTION

The P-value is the probability, under the assumption of no association or no effect (the null hypothesis $H_0$), of obtaining a result equal to or more extreme than what was actually observed (Goodman 2005). P-values for point null hypotheses still dominate most of the applied literature (Greenland & Poole 2013), despite the fact that P-values are commonly misused (Wasserstein & Lazar 2016; Matthews et al. 2017). Specifically, a quantitative interpretation of P-values beyond the notorious dichotomization into “significant” and “non-significant” has caused a lot of confusion, and misinterpretations are commonplace. Most prominent is the widespread belief that the P-value is the probability of a “chance finding”, i.e. the probability of the null hypothesis, but many other misinterpretations can also be found (Goodman 2008; Greenland et al. 2016).

P-value: the probability, under the assumption of no effect (the null hypothesis $H_0$), of obtaining a result equal to or more extreme than what was actually observed.

A first step towards a quantitative interpretation of P-values is a categorization into more than two levels, which is a step away from the Neyman-Pearson hypothesis test paradigm towards Fisher's significance test. Cox & Donnelly (2011, Section 8.4) give the following guidelines for interpreting P-values as measures of evidence against a null hypothesis $H_0$: if $p \simeq 0.1$ there is “a suggestion of evidence” against $H_0$; if $p \simeq 0.05$ there is “modest evidence” against $H_0$; if $p \simeq 0.01$ there is “strong evidence” against $H_0$. Bland (2015, Section 9.4) suggests a similar “rough and ready guide” with five levels, reproduced in Table 1 (see footnote 1). Similar categories have been proposed in many other applied statistics textbooks, for example Ramsey & Schafer (2002).

However, such categorizations always carry a level of arbitrariness. In addition, P-values are only indirect measures of evidence: a P-value is computed under the assumption that the null hypothesis $H_0$ is true, so it is conditional on $H_0$. It does not allow for conclusions about the probability of $H_0$ given the data, which is usually of primary interest. More precisely, a P-value is a quantitative measure of discrepancy between the data and the point null hypothesis $H_0$ (Goodman 1999a). But, as Cox (2006, page 83) puts it, “conclusions expressed in terms of probability are on the face of it more powerful than those expressed indirectly via confidence intervals and p-values”. Such direct conclusions can be obtained by using Bayes factors. Assuming an alternative hypothesis $H_1$ has also been specified, the Bayes factor directly quantifies whether the data have increased or decreased the odds of $H_0$. A better approach than categorizing a P-value is thus to transform a P-value to a Bayes factor or a lower bound on a Bayes factor, a so-called minimum Bayes factor (Goodman 1999b). However, many ways to calibrate P-values have been proposed, and there is currently no consensus on how P-values should be transformed to Bayes factors.

First, there is an important distinction between tests for direction and tests for existence (Marsman & Wagenmakers 2017). Tests for direction investigate whether the parameter of interest is above or below a specific value, assuming that there is an effect. For example, a test for direction can be used to assess whether a treatment effect is positive or negative. Tests for direction are usually conducted with one-sided P-values and there is a close correspondence to the Bayesian approach based on the posterior probability that the effect is positive or negative. In fact, this posterior probability is often equal or approximately equal to the one-sided P-value, if a non-informative prior is used (Casella & Berger 1987). A simple example is given in Lee (2004, Section 4.2).

One-sided P-value: based on the probabilities of extreme values in one pre-specified direction of a point null hypothesis.
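The correspondence between one-sided P-values and posterior probabilities can be illustrated with a minimal sketch. It assumes a normal sample mean with known standard deviation and a flat (improper) prior on θ; this setting, the function names and the numbers are illustrative assumptions and are not taken from Casella & Berger (1987) or Lee (2004). Under these assumptions the posterior of θ is normal around the sample mean, and the posterior probability that the effect is not positive coincides with the one-sided P-value.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(ybar, sigma, n, theta0=0.0):
    """P-value for H0: theta = theta0 against theta > theta0 (sigma known)."""
    z = (ybar - theta0) / (sigma / sqrt(n))
    return 1.0 - NormalDist().cdf(z)


def posterior_prob_not_positive(ybar, sigma, n, theta0=0.0):
    """Pr(theta <= theta0 | data) under a flat prior on theta.

    With a flat prior the posterior is N(ybar, sigma^2 / n).
    """
    posterior = NormalDist(mu=ybar, sigma=sigma / sqrt(n))
    return posterior.cdf(theta0)


# Hypothetical data: mean difference 0.4, known sigma 1, n = 25 (z = 2).
p_one_sided = one_sided_p_value(ybar=0.4, sigma=1.0, n=25)
post_prob = posterior_prob_not_positive(ybar=0.4, sigma=1.0, n=25)
print(p_one_sided, post_prob)
```

The two numbers agree up to floating-point error, which is the sense in which a test for direction has a direct Bayesian reading in this simple setting.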

Footnote 1. Note that the categories in the right column are shifted since Cox & Donnelly (2011) specify the amount of evidence of specific P-values ($p \simeq 0.1$, 0.05 and 0.01), which correspond to certain cutpoints in the categorization by Bland (2015).


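| P-value | Strength of evidence against $H_0$: Bland (2015) | Strength of evidence against $H_0$: Cox & Donnelly (2011), at the lower cutpoint of the row |
|---|---|---|
| > 0.1 | Little or no evidence | A suggestion of evidence ($p \simeq 0.1$) |
| 0.1 to 0.05 | Weak evidence | Modest evidence ($p \simeq 0.05$) |
| 0.05 to 0.01 | Evidence | Strong evidence ($p \simeq 0.01$) |
| 0.01 to 0.001 | Strong evidence | (not available) |
| < 0.001 | Very strong evidence | (not available) |

Table 1: Categorization of P-values into levels of evidence against $H_0$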
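The Bland (2015) column of Table 1 can be read as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and assigning a P-value that lies exactly on a cutpoint to the category of stronger evidence is an assumption, since the table does not say how boundaries are handled.

```python
def bland_category(p):
    """Map a P-value to the Bland (2015) evidence category of Table 1.

    Boundary handling (p exactly equal to a cutpoint goes to the stronger
    category) is an assumption of this sketch.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P-value must lie in [0, 1]")
    if p > 0.1:
        return "Little or no evidence"
    if p > 0.05:
        return "Weak evidence"
    if p > 0.01:
        return "Evidence"
    if p > 0.001:
        return "Strong evidence"
    return "Very strong evidence"


for p in (0.2, 0.07, 0.03, 0.005, 0.0005):
    print(p, bland_category(p))
```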

In contrast, tests for existence aim to summarize the evidence against the point null hypothesis of no effect. Tests for existence can be conducted with one-sided or two-sided P-values, but the correspondence of the P-value to the Bayesian posterior probability of the null is now lost and care has to be taken when transforming P-values to Bayes factors.

Two-sided P-value: based on the probabilities of extreme values in both directions of a point null hypothesis.

In this paper we consider tests for existence. We will review different methods that have been proposed to calibrate P-values, identify problems with some of the proposed methods and give general recommendations on how to transform P-values to (minimum) Bayes factors. We will emphasize that this transformation depends on how the P-value has been calculated. Specifically, the sample size as well as the dimension of the parameter of interest matters. It also matters whether the P-value came from a study with a well-defined alternative hypothesis, or from a study used to generate possible hypotheses.

1.1. Bayes Factors

Consider a significance test for existence with a point null hypothesis $H_0: \theta = \theta_0$, where the parameter of interest $\theta$ may be a scalar or a vector. In many problems $\theta_0 = 0$, for example when testing if there is evidence for a difference $\theta$ between two treatment groups. The alternative hypothesis may be simple, i.e. $H_1: \theta = \theta_1 \neq \theta_0$, or composite, usually $H_1: \theta \neq \theta_0$. In the latter case, a Bayesian approach requires a prior distribution $f(\theta \mid H_1)$ to be specified. Local alternatives, represented by a unimodal symmetric prior distribution centered around the null value $\theta_0$, are the common choice. In contrast, non-local alternatives (Johnson & Rossell 2010) have zero probability mass in a neighborhood of $\theta_0$, with the simple alternative $H_1: \theta = \theta_1 \neq \theta_0$ being a special case.

Local alternatives: a unimodal symmetric prior distribution of alternatives centered around the null value.

The Bayes factor (BF) transforms the prior odds $\Pr(H_0)/\Pr(H_1)$ (where $\Pr(H_1) = 1 - \Pr(H_0)$) to the posterior odds $\Pr(H_0 \mid y)/\Pr(H_1 \mid y)$ in the light of the data $y$:

$$\frac{\Pr(H_0 \mid y)}{\Pr(H_1 \mid y)} = \mathrm{BF}(y) \cdot \frac{\Pr(H_0)}{\Pr(H_1)}. \tag{1}$$

The Bayes factor $\mathrm{BF}(y)$ is thus a direct quantitative measure of how the data $y$ have increased or decreased the odds of $H_0$, regardless of the actual value of the prior probability $\Pr(H_0)$. The Bayes factor (or its logarithm) is therefore often referred to as the “strength of evidence” or “weight of evidence” (Good 1950; Bernardo & Smith 2000). If necessary, we may add an index to $\mathrm{BF}(y)$, where $\mathrm{BF}_{01}(y)$ stands for “$H_0$ versus $H_1$”, so $\mathrm{BF}_{10}(y) = 1/\mathrm{BF}_{01}(y)$.

Bayes factor: compares the likelihood of the data $y$ under the null hypothesis $H_0$ to the likelihood under the alternative hypothesis $H_1$.
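The updating rule (1) can be sketched in a few lines. The function name and the example values are hypothetical; the code only restates (1), converting a prior probability of $H_0$ to odds, multiplying by the Bayes factor for $H_0$ versus $H_1$, and converting back to a probability.

```python
def posterior_prob_h0(bf01, prior_prob_h0):
    """Posterior probability of H0 from equation (1).

    bf01 is the Bayes factor for H0 versus H1, and Pr(H1) = 1 - Pr(H0).
    """
    if not 0.0 < prior_prob_h0 < 1.0:
        raise ValueError("the prior probability must lie strictly in (0, 1)")
    if bf01 <= 0.0:
        raise ValueError("a Bayes factor must be positive")
    prior_odds = prior_prob_h0 / (1.0 - prior_prob_h0)
    posterior_odds = bf01 * prior_odds  # equation (1)
    return posterior_odds / (1.0 + posterior_odds)


# The same Bayes factor applied to two different (hypothetical) priors.
for prior in (0.5, 0.8):
    print(prior, posterior_prob_h0(bf01=1 / 7, prior_prob_h0=prior))
```

The Bayes factor is the same in both calls; only the prior odds, and hence the posterior probability, differ.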

In (1), the Bayes factor

$$\mathrm{BF}(y) = \frac{f(y \mid H_0)}{f(y \mid H_1)} \tag{2}$$

is the ratio of the likelihood $f(y \mid H_0) = f(y \mid \theta = \theta_0)$ of the observed data $y$ under the null hypothesis $H_0$ and the likelihood

$$f(y \mid H_1) = \int f(y \mid \theta)\, f(\theta \mid H_1)\, d\theta \tag{3}$$

under the alternative hypothesis $H_1$. For a simple alternative, (3) reduces to the ordinary likelihood $f(y \mid H_1) = f(y \mid \theta = \theta_1)$ and the Bayes factor (2) reduces to a likelihood ratio. In general, (3) represents a marginal likelihood, i.e. the average likelihood $f(y \mid \theta)$ with respect to the prior distribution $f(\theta \mid H_1)$ for $\theta$ under the alternative $H_1$ (Kass & Raftery 1995). Note that the computation of the Bayes factor via (2) does not require the specification of the prior probability $\Pr(H_0)$.

| Bayes factor | Strength of evidence against $H_0$: Jeffreys (1961) | Goodman (1999b) | Held & Ott (2016) |
|---|---|---|---|
| 1 to 1/3 | Bare mention | Weak | Weak |
| 1/3 to 1/10 | Substantial | Moderate | Moderate |
| 1/10 to 1/30 | Strong | Moderate to strong | Substantial |
| 1/30 to 1/100 | Very strong | Strong to very strong | Strong |
| 1/100 to 1/300 | Decisive | (not available) | Very strong |
| < 1/300 | Decisive | (not available) | Decisive |

Table 2: Categorization of Bayes factors BF ≤ 1 into levels of evidence against $H_0$

Marginal likelihood: the average likelihood with respect to a prior distribution for alternative hypotheses.
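As a hedged illustration of (2) and (3), the sketch below assumes a single normally distributed estimate $y$ with known standard error and, under $H_1$, a normal prior centered at $\theta_0$ (one possible local alternative). The normal model, the prior standard deviation `tau`, the grid settings and all function names are assumptions made for this example. The marginal likelihood (3) is approximated by the trapezoidal rule and checked against the closed form available in this conjugate setting; the simple-alternative case is included to show the reduction to a likelihood ratio.

```python
from math import exp, sqrt
from statistics import NormalDist


def likelihood(y, theta, se):
    """Normal likelihood f(y | theta) of an estimate y with standard error se."""
    return NormalDist(mu=theta, sigma=se).pdf(y)


def marginal_likelihood(y, se, tau, theta0=0.0, grid_points=4001, width=10.0):
    """Equation (3) by the trapezoidal rule for a N(theta0, tau^2) prior.

    The grid spans theta0 +/- width * tau and is only adequate when se is
    not much smaller than the grid spacing (an assumption of this sketch).
    """
    prior = NormalDist(mu=theta0, sigma=tau)
    lower = theta0 - width * tau
    step = 2.0 * width * tau / (grid_points - 1)
    total = 0.0
    for i in range(grid_points):
        theta = lower + i * step
        weight = 0.5 if i in (0, grid_points - 1) else 1.0
        total += weight * likelihood(y, theta, se) * prior.pdf(theta)
    return total * step


def bf01_local(y, se, tau, theta0=0.0):
    """Bayes factor (2) with the marginal likelihood (3) in the denominator."""
    return likelihood(y, theta0, se) / marginal_likelihood(y, se, tau, theta0)


def bf01_local_closed_form(y, se, tau, theta0=0.0):
    """Same Bayes factor, using that y | H1 is N(theta0, se^2 + tau^2)."""
    marginal = NormalDist(mu=theta0, sigma=sqrt(se**2 + tau**2)).pdf(y)
    return likelihood(y, theta0, se) / marginal


def bf01_simple(y, se, theta1, theta0=0.0):
    """For a simple alternative theta = theta1, (2) is a likelihood ratio."""
    return likelihood(y, theta0, se) / likelihood(y, theta1, se)


numeric = bf01_local(y=2.5, se=1.0, tau=2.0)
exact = bf01_local_closed_form(y=2.5, se=1.0, tau=2.0)
ratio = bf01_simple(y=2.5, se=1.0, theta1=2.5)
print(numeric, exact)
print(ratio, exp(-0.5 * 2.5**2))  # simple alternative placed at the estimate
```

None of these functions takes a prior probability of $H_0$ as input, in line with the remark that (2) can be computed without specifying $\Pr(H_0)$.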

In this paper we focus on the evidence against a point null hypothesis provided by small Bayes factors $\mathrm{BF}_{01} \leq 1$, such that Bayes factors lie in the same range as P-values, which facilitates comparisons. To categorize such Bayes factors, Held & Ott (2016) provided a six-grade scale reproduced in Table 2, which was proposed as a compromise between the grades proposed in Jeffreys (1961, Appendix B) and Goodman (1999b, Tables 1 and 2) (also shown in Table 2; see footnote 2).
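The six-grade scale can be sketched as a lookup over the cutpoints 1/3, 1/10, 1/30, 1/100 and 1/300. The sketch assumes that the Held & Ott (2016) column of Table 2 reads Weak, Moderate, Substantial, Strong, Very strong and Decisive from top to bottom; the function name is hypothetical, and assigning a Bayes factor that lies exactly on a cutpoint to the weaker category is an assumption, since the table does not specify boundary handling.

```python
# (lower cutpoint, label) pairs, ordered from weakest to strongest evidence.
HELD_OTT_SCALE = [
    (1 / 3, "Weak"),
    (1 / 10, "Moderate"),
    (1 / 30, "Substantial"),
    (1 / 100, "Strong"),
    (1 / 300, "Very strong"),
]


def held_ott_category(bf01):
    """Evidence against H0 for a Bayes factor 0 < BF01 <= 1 (Table 2)."""
    if not 0.0 < bf01 <= 1.0:
        raise ValueError("this scale covers Bayes factors in (0, 1] only")
    for lower, label in HELD_OTT_SCALE:
        if bf01 >= lower:  # boundary values fall in the weaker category
            return label
    return "Decisive"


for bf in (1 / 2, 1 / 7, 1 / 20, 1 / 50, 1 / 200, 1 / 1000):
    print(f"1/{1 / bf:.0f}", held_ott_category(bf))
```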

Communication of Bayes factors is of central importance. The categories shown in Table 2 are helpful in this respect, but there remains a level of arbitrariness in the definition of the category levels. Ideally, the Bayes factor itself should be reported and comprehensive formatting of Bayes factors is now crucial. We recommend presenting Bayes factors as ratios, for example $\mathrm{BF}_{01} = 1/7$, since this underlines the symmetry of Bayes factors if numerator and denominator are exchanged, here $\mathrm{BF}_{10} = 7/1$. For Bayes factors smaller than 1/10, say, it is usually sufficient to report Bayes factors in the 1/x format, where x is an integer. If the Bayes factor is larger, then we recommend using an additional decimal place for x, e.g. BF = 1/2.5 or BF = 1/1.3, to achieve better accuracy.
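The reporting recommendation can be sketched as a small formatting helper. The function name is hypothetical, and treating 1/10 as a hard threshold is an assumption of this sketch; the text only suggests it as a rough guide ("smaller than 1/10, say").

```python
def format_bayes_factor(bf01):
    """Format a Bayes factor 0 < BF01 <= 1 in the 1/x style.

    Below 1/10, x is rounded to an integer; otherwise one decimal place is
    kept. The hard threshold at 1/10 is an assumption of this sketch.
    """
    if not 0.0 < bf01 <= 1.0:
        raise ValueError("this helper covers Bayes factors in (0, 1] only")
    x = 1.0 / bf01
    if bf01 < 1 / 10:
        return f"1/{round(x)}"
    return f"1/{x:.1f}"


for bf in (0.4, 1 / 1.3, 1 / 42, 0.0031):
    print(bf, "->", format_bayes_factor(bf))
```

The reciprocal $\mathrm{BF}_{10}$ can be read directly off the denominator, which is the symmetry the ratio format is meant to convey.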

The Bayes factor (2) is based on the data $y$, sometimes called a data-based Bayes factor (Held et al. 2015) to distinguish it from Bayes factors based on test statistics or P-values. Indeed, the step from a P-value $p$ to a Bayes factor is most easily accomplished by treating $p$ as the data $y$ in (2) to obtain a P-based Bayes factor based on the sampling distribution

Footnote 2. Jeffreys actually used the slightly different cutpoints $(1/\sqrt{10})^a$, $a = 1, 2, 3, 4$, whereas Goodman specified his evidence categories for Bayes factors of 1/5, 1/10, 1/20 and 1/100, which we have somewhat shifted to our cutpoints 1/3, 1/10, 1/30 and 1/100.

