
Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences.

TLDR
This contribution investigates the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant, and examines the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups.
Abstract
Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers' intuition to collect more data to push an indecisive result into a decisive region. In this contribution, we investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. In this procedure, which we call Sequential Bayes Factors (SBF), Bayes factors are computed until an a priori defined level of evidence is reached. This allows flexible sampling plans and is not dependent upon correct effect size guesses in an a priori power analysis. We investigated the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups. Compared with optimal NHST, the SBF design typically needs 50% to 70% smaller samples to reach a conclusion about the presence of an effect, while having the same or lower long-term rate of wrong inference. (PsycINFO Database Record)


Sequential Hypothesis Testing With Bayes Factors: Efficiently Testing
Mean Differences
Felix D. Schönbrodt
Ludwig-Maximilians-Universität München, Germany
Eric-Jan Wagenmakers
University of Amsterdam
Michael Zehetleitner
Ludwig-Maximilians-Universität München, Germany
Marco Perugini
University of Milan Bicocca
Unplanned optional stopping rules have been criticized for inflating Type I error rates under
the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research
practice is not uncommon, probably as it appeals to researchers' intuition to collect
more data in order to push an indecisive result into a decisive region. In this contribution we
investigate the properties of a procedure for Bayesian hypothesis testing that allows optional
stopping with unlimited multiple testing, even after each participant. In this procedure, which
we call Sequential Bayes Factors (SBF), Bayes factors are computed until an a priori defined
level of evidence is reached. This allows flexible sampling plans and is not dependent upon
correct effect size guesses in an a priori power analysis. We investigated the long-term rate
of misleading evidence, the average expected sample sizes, and the biasedness of effect size
estimates when an SBF design is applied to a test of mean differences between two groups.
Compared to optimal NHST, the SBF design typically needs 50% to 70% smaller samples to
reach a conclusion about the presence of an effect, while having the same or lower long-term
rate of wrong inference.
Manuscript accepted for publication in Psychological Methods.
doi:10.1037/met0000061
This article may not exactly replicate the final version published in the APA
journal. It is not the copy of record.
Keywords: Bayes factor, efficiency, hypothesis testing, optional stopping, sequential designs
The goal of science is to increase knowledge about the world. For this endeavor, scientists have to weigh the evidence of competing theories and hypotheses, for example: 'Does drug X help to cure cancer or not?', 'Which type of exercise, A or B, is more effective to reduce weight?', or 'Does maternal responsivity increase intelligence of the children?'. How do scientists come to conclusions concerning such competing hypotheses?

Author note: Felix D. Schönbrodt, Department of Psychology, Ludwig-Maximilians-Universität München, Germany. Acknowledgements: We thank Richard Morey and Jeff Rouder for assistance with Bayes-related R scripts, and Daniël Lakens and Alexander Ly for comments on a previous version. Reproducible analysis scripts for the simulations and analyses are available at the Open Science Framework (https://osf.io/qny5x/). Correspondence concerning this article should be addressed to Felix Schönbrodt, Leopoldstr. 13, 80802 München, Germany. Email: felix@nicebread.de. Phone: +49 89 2180 5217. Fax: +49 89 2180 99 5214.
The typical current procedure for hypothesis testing is a hybrid of what Sir Ronald Fisher, Jerzy Neyman, and Egon Pearson proposed in the early 20th century: the null-hypothesis significance test (NHST; for an accessible overview, see Dienes, 2008). It soon became the standard model for hypothesis testing in psychology, medicine, and most other disciplines that use statistics. However, NHST has been repeatedly criticized over the past decades, and in particular in recent years (e.g., Cumming, 2014; Kruschke, 2012; Rouder, Speckman, Sun, Morey, & Iverson, 2009). Despite these criticisms, it is the de facto standard in psychology, but it is not the only possible procedure for testing scientific hypotheses. The purpose of this paper is to propose an alternative procedure based on sequentially testing Bayes factors. This procedure, henceforward called the 'Sequential Bayes Factor (SBF) design', proposes to collect an initial sample and to compute a Bayes factor (BF). The BF quantifies the relative evidence in the data with respect to whether the data are better predicted by one hypothesis (e.g., a null hypothesis, 'there is no effect in the
population') or a competing hypothesis (e.g., 'there is a non-zero effect in the population'). Then, the sample size can be optionally increased and a new BF computed, until a predefined threshold of evidential strength is reached. A more detailed introduction to the procedure is given below. This procedure does not presume a predefined and fixed sample size, but rather accumulates data until a sufficient certainty about the presence or absence of an effect is achieved. Hence, the SBF design applies an optional stopping rule to the sampling plan. This procedure has been proposed several times (e.g., Dienes, 2008; Kass & Raftery, 1995; Lindley, 1957; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012) and has already been applied in experimental studies (Matzke et al., 2015; Wagenmakers et al., 2012; Wagenmakers et al., 2015).
From a Bayesian point of view, the interpretation of a study depends only on the data at hand, the priors, and the specific model of the data-generating process (i.e., the likelihood function). In contrast to frequentist approaches, it does not depend on the sampling intentions of the researcher, such as when to stop a study, or on outcomes from hypothetical other studies that have not been conducted (e.g., Berger & Wolpert, 1988; Dienes, 2011; Kruschke, 2012).
For planning a study, however, it makes sense also for Bayesians to investigate the outcomes of hypothetical studies by studying the properties of a Bayesian procedure under several conditions (Sanborn et al., 2014). The goal of this paper is to investigate such properties of the SBF design via Monte Carlo simulations. Throughout the paper, we will refer to the scenario of testing the hypothesis of a two-group mean difference, where H0: µ1 = µ2 and H1: µ1 ≠ µ2. The true effect size δ expresses a standardized mean difference in the population.
The paper is organized as follows. In the first section, we describe three research designs: NHST with a priori power analysis, group sequential designs, and Sequential Bayes Factors. In the second section, we describe three properties of the SBF design that are investigated in our simulations: (1) the long-term rate of misleading evidence (i.e., 'How often do I get strong evidence for an effect although there is none, or strong evidence for H0 although there is an effect?'), (2) the necessary sample size to get evidence of a certain strength (i.e., a Bayesian power analysis), and (3) the biasedness of the effect size estimates (i.e., 'Do empirical effect size estimates over- or underestimate the true effect on average?'). The third section reports the results of our simulations and shows how SBF performs on each of the three properties in comparison to the other two research designs. The fourth section gives some practical recommendations on how to compute Bayes factors and how to use the SBF design. Finally, the fifth section discusses advantages and disadvantages of the SBF design.
Three Research Designs
In the following sections, we will describe and discuss
three research designs: NHST with a priori power analysis,
group sequential designs, and Sequential Bayes Factors. For
illustration purposes, we introduce an empirical example to
which we apply each research design. We used open data
from the ManyLabs 1 project (Klein et al., 2014), specifically
the replication data of the retrospective gambler’s fallacy
study (Oppenheimer & Monin, 2009). The data are avail-
able at the Open Science Framework (https://osf.io/wx7ck/).
Theory predicts that participants will perceive unlikely outcomes to have come from longer sequences than more common outcomes. The original study investigated the scenario that participants observe a person rolling a die and see that two times (resp. three times) in a row the number '6' comes up. After observing three 6s in a row ('three-6' condition), participants thought that the person had been rolling the die for a longer time than after observing two 6s in a row ('two-6' condition). We chose this data set in favor of the NHST-PA method, as the population effect size (as estimated by the full sample of 5942 participants; d = 0.60, 95% CI [0.55; 0.65]) is very close to the effect size of the original study (d = 0.69). We drew random samples from the full pool of 5942 participants to simulate a fixed-n, a group sequential, and an SBF study.
The NHST Procedure With a Priori Power Analysis and
Some of Its Problems
In its current best-practice version (e.g., Cohen, 1988), the Neyman-Pearson procedure entails the following steps:
1. Estimate the expected effect size from the literature, or define the minimal meaningful effect size.
2. A priori, define the tolerated long-term rate of false positive decisions (usually α = 5%) and the tolerated long-term rate of false negative decisions (usually β between 5% and 20%).
3. Run an a priori power analysis, which gives the necessary sample size to detect an effect (i.e., to reject H0) within the limits of the defined error rates.
4. Optionally, for confirmatory research: Pre-register the study and the statistical analysis that will be conducted.
5. Run the study with the sample size that was obtained from the a priori power analysis.
6. Do the pre-defined analysis and compute a p value. Reject H0 if p < α. Report a point estimate and the confidence interval for the effect size.

Henceforward, this procedure will be called the NHST-PA
procedure (‘Null-Hypothesis Significance Test with a priori
Power Analysis’). This type of sampling plan is also called a
fixed-n design, as the sample size is predetermined and fixed.
Over the last years, psychology has seen a large debate about problems in current research practice. Many of these concern (intentionally or unintentionally) wrong applications of the NHST-PA procedure, such as too much flexibility in data analysis (Bakker, van Dijk, & Wicherts, 2012; Simmons, Nelson, & Simonsohn, 2011), or even outright fraud (Simonsohn, 2013). Other papers revive a general critique of the ritual of NHST (e.g., Cumming, 2014; Kline, 2004; Schmidt & Hunter, 1997; Wagenmakers, 2007), recognizing that they are to a large part a reformulation of older critiques (e.g., Cohen, 1994), which are in turn reformulations of even older articles (Bakan, 1966; Rozeboom, 1960) that themselves claim to be 'hardly original' (Bakan, 1966, p. 423).
The many theoretical arguments against NHST are not repeated here. We rather focus on three interconnected, practical problems with NHST that are partly inherent to the method and partly stem from an improper application of it: the dependence of NHST-PA's performance on the a priori effect size estimate, the problem of 'nearly significant' results, and the related temptation of optionally increasing the sample size.
Dependence of NHST-PA on the a priori effect size estimate. The efficiency and the quality of NHST-PA depend on how close the a priori effect size estimate is to the true effect size δ. If δ is smaller than the assumed effect size, the proportion of Type II errors will increase. For example, if δ is 25% smaller than expected, one does not have enough power to reliably detect the actually smaller effect, and Type II errors will rise from 5% to about 24%. This problem can be tackled using a safeguard power analysis (Perugini, Gallucci, & Costantini, 2014). This procedure takes into account that effect size point estimates are surrounded by confidence intervals. Hence, if a researcher wants to run a more conclusive test of whether an effect can be replicated, he or she is advised to aim for the lower end of the initial effect size interval in order to have enough statistical power, even when the point estimate is biased upwards. Depending on the accuracy of the published effect size, the safeguard effect size can be considerably lower than the point estimate of the effect size.
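To make this dependence concrete, here is a minimal R sketch (R is the language of the paper's analysis scripts, but this snippet is ours, not theirs), assuming the d = 0.69 scenario of the empirical example discussed below:

```r
# Sketch (R): how a 25% overestimate of the effect size inflates Type II errors.
# Assumed scenario (not taken from the paper's scripts): planned for d = 0.69,
# alpha = .05, power = 95%; the true effect turns out to be 25% smaller.
d_assumed <- 0.69
d_true    <- 0.75 * d_assumed

# Sample size per group prescribed by the a priori power analysis
n_planned <- ceiling(power.t.test(delta = d_assumed, sd = 1, sig.level = .05,
                                  power = .95)$n)

# Actual power (and Type II error rate) if the true effect is smaller than assumed
actual <- power.t.test(n = n_planned, delta = d_true, sd = 1, sig.level = .05)
c(n_planned = n_planned, power = actual$power, type_II = 1 - actual$power)
# Type II error rises from the nominal 5% to roughly 22-23%,
# close to the "about 24%" reported in the text.
```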
Inserting conservative effect sizes into an a priori power analysis helps against increased Type II errors, but it has its costs. If the original point estimate was indeed correct, going for a conservative effect size leads to sample sizes that are bigger than strictly needed. For example, if δ is 25% larger than expected, the sample size prescribed by a safeguard power analysis will be about 1.5 times higher than the optimal sample size. Under many conditions, this represents an advantage rather than a problem. In fact, a side benefit of using safeguard power analysis is that the parameter of interest will be estimated more precisely. Nonetheless, it can be argued to be statistically inefficient insofar as the sample size needed to reach a conclusion can be bigger than what would have been necessary.
Optimal efficiency can only be achieved when the a priori effect size estimate exactly matches the true effect size. Henceforward, we will use the label optimal NHST-PA for that ideal case, which can represent a benchmark condition of maximal efficiency under the NHST paradigm. In other words, this is how good NHST can get.
The 'p = .08 problem'. Whereas safeguard power analysis can be a good solution for an inappropriate a priori effect size estimate, it is not a solution for the 'almost significant' problem. Imagine you ran a study and obtained a p value of .08. What do you do? Probably based on their 'Bayesian Id's wishful thinking' (Gigerenzer, Krauss, & Vitouch, 2004), many researchers would label this finding, for example, as 'teetering on the brink of significance'.¹ By doing so, the p value is interpreted as an indicator of the strength of evidence against H0 (or for H1). This interpretation would be incorrect from a Neyman-Pearson perspective (Gigerenzer et al., 2004; Hubbard, 2011), but valid from a Fisherian perspective (Royall, 1997), which reflects the confusion in the literature about what p values are and what they are not.
Such 'nearly significant' p values are not an actual problem of a proper NHST; they are just a possible result of a statistical procedure. But as journals tend to reject non-significant results, a p value of .08 can pose a real practical problem and a conflict of interest for researchers.² By exploiting researcher degrees of freedom (Simmons et al., 2011), p values can be tweaked ('p-hacking'; Simonsohn, Nelson, & Simmons, 2014), and the current system provides incentives for p-hacking (Bakker et al., 2012).
¹ http://mchankins.wordpress.com/2013/04/21/still-not-significant-2/
² There have been recent calls for changes in editorial policies, in a way that studies with any p value can be published as long as they are well-powered (van Assen, van Aert, Nuijten, & Wicherts, 2014). Furthermore, several journals have started to accept registered reports, which publish results independent of their outcome (e.g., Chambers, 2013; Nosek & Lakens, 2014).

Optionally increasing the sample size: A typical questionable research practice. Faced with the 'p = .08 problem', a researcher's intuition could suggest increasing the sample size to see whether the p value drops below the .05 criterion. This intuition is correct from an accuracy point of view: more data lead to more precise estimates (e.g., Maxwell, Kelley, & Rausch, 2008; Schönbrodt & Perugini, 2013). According to John, Loewenstein, and Prelec (2012), optionally increasing the sample size when the results are not significant is one of the most common (questionable) research practices. Furthermore, Yu, Sprenger, Thomas, and Dougherty (2013) showed empirically which (incorrect) heuristics researchers used in their optional stopping practice. Adaptively increasing the sample size can also be framed as a form of multiple testing: one conducts an interim test, and based on the p value, data collection is either stopped (if p < .05) or the sample size is increased if the p value is in a promising region (e.g., if .05 < p < .10; Murayama, Pekrun, & Fiedler, 2013).
However, this practice of unplanned multiple testing is not allowed in the classical NHST paradigm, as it increases Type I error rates (Armitage, McPherson, & Rowe, 1969). Of course one can calculate statistics during data collection, but the results of these tests must not have any influence on optionally stopping data collection. If an interim test with optional stopping is performed, and the first test was done at a 5% level, a 5% Type I error has already been spent. It should be noted that the increase in Type I error is small for a single interim test when there is a promising result (it increases from 5% to 7.1%; cf. Murayama et al., 2013). However, the increase depends on how many interim tests are performed, and with enough interim tests the Type I error rate can be pushed towards 100% (Armitage et al., 1969; Proschan, Lan, & Wittes, 2006).
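This inflation is easy to demonstrate by simulation. The following R sketch is an illustration we added (not the paper's simulation code); with H0 true, it runs a t-test after every additional participant and stops at the first p < .05:

```r
# Sketch (R): Type I error inflation under unplanned optional stopping.
# H0 is true (no group difference); after an initial n = 20 per group, a
# t-test is run after each additional participant up to n = 100, and
# sampling stops as soon as p < .05.
set.seed(1)
n_sim <- 1000
false_positive <- replicate(n_sim, {
  x <- rnorm(100); y <- rnorm(100)   # pre-draw the maximum possible sample
  hit <- FALSE
  for (n in 20:100) {
    if (t.test(x[1:n], y[1:n])$p.value < .05) { hit <- TRUE; break }
  }
  hit
})
mean(false_positive)  # substantially above the nominal 5%
```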
The empirical example in the NHST-PA design. In
this section, we demonstrate how the NHST-PA procedure
would have been applied to the empirical example.
Method and participants. An a priori power analysis with an expected effect size of d = 0.69, a Type I error rate of 5%, and a statistical power of 95% resulted in a necessary sample size of n = 56 in each group.
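This power analysis can be reproduced with base R's power.t.test (a sketch we added; the authors' own scripts are available on the OSF and may differ in detail):

```r
# Sketch (R): a priori power analysis for the empirical example
# (expected d = 0.69, alpha = .05, power = .95, two-sided two-sample t-test).
ceiling(power.t.test(delta = 0.69, sd = 1, sig.level = 0.05,
                     power = 0.95, type = "two.sample",
                     alternative = "two.sided")$n)
# 56 participants per group, matching the sample size reported above
```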
Results. A t-test for independent groups rejected H0 (t(77.68) = 3.72, p < .001), indicating a significant group difference in the expected direction (two-6: M = 1.86, SD = 1.42; three-6: M = 3.54, SD = 3.05). The effect size in the sample was d = 0.70, 95% CI [0.32; 1.09].
Group Sequential Designs
Optionally increasing the sample size is considered a questionable research practice in the fixed-n design, as it increases the rate of false-positive results. If the interim tests are planned a priori, however, multiple testing is possible under the NHST paradigm. Several extensions of the NHST paradigm have been developed for that purpose. The most common sequential designs are called group sequential (GS) designs (e.g., Lai, Lavori, & Shih, 2012; Proschan et al., 2006).³ In a GS design, the number and the sample sizes of the interim tests (e.g., at n1 = 25, n2 = 50, and n3 = 75) and a final test (e.g., at nmax = 100) are planned a priori. The sample size spacings of the interim tests and the critical values for the test statistic at each stage are designed in a way that the overall Type I error rate is controlled at, say, 5%. If the test statistic exceeds an upper boundary at an interim test, data collection is stopped early, as the effect is strong enough that it is already reliably detected in the smaller sample ('stopping for efficacy'). If the test statistic falls short of the boundary, data collection is continued until the next interim test, or until the final test is due. Some GS designs also allow for 'stopping for futility', when the test statistic falls below a lower boundary; in this case it is unlikely that an effect could be detected even with the maximal sample size nmax. The more often interim tests are performed, the higher the maximal sample size must be in order to achieve the same power as a fixed-n design without interim tests. But if an effect exists, there is a considerable chance of stopping earlier than at nmax. Hence, on average, GS designs need fewer participants than NHST-PA with the same error rates.
If done correctly, GS designs can be a partial solution to the 'p = .08 problem'. However, all sequential designs based on NHST have one property in common: they have a limited number of tests, which in the case of GS designs has to be defined a priori. But what do you do when your final test results in p = .08? Once the final test is done, all Type I error has been spent, and the same problem arises again.
The example in the GS design. We demonstrate below how the GS procedure would have been applied to the empirical example:
Method and participants. We employed a group sequential design with four looks (three interim looks plus the final look), with a total Type I error rate of 5% and a statistical power of 95%. Necessary sample sizes and critical boundaries were computed using the default settings of the gsDesign package (Anderson, 2014). The planned sample sizes were n = 16, 31, 46, and 61 in each group for the first to the fourth look, with corresponding critical two-sided p values of .0016, .0048, .0147, and .0440.
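A rough sketch of how such boundaries can be obtained with the gsDesign package is shown below; the call and its defaults are our assumption rather than the authors' documented code, so the numbers need not match the reported boundaries exactly:

```r
# Sketch (R): a 4-look group sequential design for the empirical example.
# Assumptions (not from the paper's scripts): alpha = .05 two-sided
# (i.e., .025 one-sided), power = 95%, gsDesign's default spending functions.
library(gsDesign)

# Fixed-design sample size per group for d = 0.69
n_fix <- power.t.test(delta = 0.69, sd = 1, sig.level = .05, power = .95)$n

gs <- gsDesign(k = 4, alpha = 0.025, beta = 0.05, n.fix = n_fix)
ceiling(gs$n.I)              # per-group sample size at each look
2 * pnorm(-gs$upper$bound)   # nominal two-sided p-value boundaries for efficacy
```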
Results. The first and the second interim tests failed to reject H0 at the critical level (p1 = .0304; p2 = .0052). As the p value fell below the critical level at the third interim test (p3 = .0003), we rejected H0 and stopped sampling. Hence, the final sample consisted of n = 46 participants in each group (two-6: M = 1.71, SD = 1.48; three-6: M = 3.50, SD = 2.85).
Sequential Bayes Factors: An Alternative Hypothesis
Testing Procedure
Under the NHST paradigm it is not allowed to increase the sample size after you have run your (last planned) hypothesis test. This section elaborates on an alternative way of choosing between competing hypotheses that sets p values and NHST completely aside and allows unlimited multiple testing: Sequential Bayes Factors (SBF).
³ An accessible introduction to GS designs is provided by Lakens (2014), who also gives advice on how to plan GS designs in practice. Beyond GS designs, other sequential designs have been proposed, such as adaptive designs (e.g., Lai et al., 2012), or a flexible sequential strategy based on p values (Frick, 1998), which are not discussed here.

NHST focuses on how incompatible the actual data (or more extreme data) are with H0. In Bayesian hypothesis testing via BFs, in contrast, it is assessed whether the data at hand are more compatible with H0 or with an alternative hypothesis H1 (Berger, 2006; Dienes, 2011; Jeffreys, 1961; Wagenmakers, 2007). BFs provide a numerical value that quantifies how well a hypothesis predicts the empirical data relative to a competing hypothesis. Hence, the BF belongs to the larger family of likelihood ratio tests, and the SBF resembles the sequential probability ratio test proposed by Wald and Wolfowitz (1948). Formally, BFs are defined as:

BF10 = p(D | H1) / p(D | H0)    (1)
For example, if the BF10 is 4, this indicates: 'These empirical data D are 4 times more probable if H1 were true than if H0 were true'. A BF10 between 0 and 1, in contrast, indicates support for H0.
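For a concrete illustration, such a BF10 can be computed with the BayesFactor R package, which implements the default Bayes factor used later in this paper; the data below are simulated for illustration and are not taken from the empirical example:

```r
# Sketch (R): a default Bayes factor for a two-group mean difference
# (simulated data, for illustration only).
library(BayesFactor)
set.seed(123)
x <- rnorm(40, mean = 0.5)   # group 1, simulated true effect of d = 0.5
y <- rnorm(40, mean = 0.0)   # group 2

bf <- ttestBF(x = x, y = y)  # default JZS prior (rscale = sqrt(2)/2)
extractBF(bf)$bf             # BF10: evidence for H1 relative to H0
```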
BFs can be calculated once for a finalized data set. But it has also repeatedly been proposed to employ BFs in sequential designs with optional stopping rules, where sample sizes are increased until a BF of a certain size has been achieved (Dienes, 2008; Kass & Raftery, 1995; Lindley, 1957; Wagenmakers et al., 2012). While unplanned optional stopping is highly problematic for NHST, it is not a problem for Bayesian statistics. For example, Edwards, Lindman, and Savage (1963) state that 'the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience' (p. 193; see also Lindley, 1957).⁴
Although many authors agree about the theoretical advantages of BFs, until recently it was complicated and unclear how to compute a BF even for the simplest standard designs (Rouder, Morey, Speckman, & Province, 2012). Fortunately, over the last years BFs for several standard designs have been developed (e.g., Dienes, 2014; Gönen, Johnson, Lu, & Westfall, 2005; Kuiper, Klugkist, & Hoijtink, 2010; Morey & Rouder, 2011; Mulder, Hoijtink, & de Leeuw, 2012; Rouder et al., 2012, 2009). In the current simulations, we use the default Bayes factor proposed by Rouder et al. (2009). This BF tests H0: µ1 = µ2 against an H1 under which the effect size follows a Cauchy(r) distribution, where r is a scale parameter that controls the width of the Cauchy distribution.⁵ This prior distribution defines the plausibility of possible effect sizes under H1 (more details below).
The SBF procedure can be outlined as follows:
1. Define a priori a threshold which indicates the requested decisiveness of evidence, for example a BF10 of 10 for H1 and the reciprocal value of 1/10 for H0 (e.g., 'When the data are 10 times more likely under H1 than under H0, or vice versa, I stop sampling.'). Henceforward, these thresholds are referred to as the 'H0 boundary' and the 'H1 boundary'.
2. Choose a prior distribution for the effect sizes under H1. This distribution describes the plausibility that effects of certain sizes exist.
3. Optionally, for confirmatory research: Pre-register the study along with the predefined threshold and prior effect size distribution.
4. Run a minimal number of participants (e.g., nmin = 20 per group), increase the sample size as often as desired, and compute a BF at each stage (even after each participant).
5. As soon as one of the thresholds defined in step 1 is reached or exceeded (either the H0 boundary or the H1 boundary), stop sampling and report the final BF. As a Bayesian effect size estimate, report the mean and the highest posterior density (HPD) interval of the posterior distribution of the effect size estimate, or plot the entire posterior distribution. (A minimal R sketch of this sampling loop follows below.)
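The following sketch illustrates the stopping rule with the BayesFactor package, using the example values from the text (threshold 10, nmin = 20) and a simulated true effect of δ = 0.4; it is an illustration we added, not the authors' simulation code:

```r
# Sketch (R): Sequential Bayes Factor sampling loop (illustration only).
# Stop when BF10 >= 10 (H1 boundary) or BF10 <= 1/10 (H0 boundary).
library(BayesFactor)
set.seed(42)

threshold <- 10      # evidence threshold from step 1 (example value)
n_min     <- 20      # minimal sample size per group from step 4
d_true    <- 0.4     # simulated true effect size

x <- rnorm(n_min, mean = d_true)   # group 1
y <- rnorm(n_min, mean = 0)        # group 2

repeat {
  bf10 <- extractBF(ttestBF(x = x, y = y, rscale = sqrt(2) / 2))$bf
  if (bf10 >= threshold || bf10 <= 1 / threshold) break
  # otherwise: add one participant per group and test again
  x <- c(x, rnorm(1, mean = d_true))
  y <- c(y, rnorm(1, mean = 0))
}
c(n_per_group = length(x), BF10 = bf10)
```

In practice, the loop would of course collect real participants instead of drawing from rnorm; the stopping logic stays the same.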
Figure 1 shows some exemplary trajectories of how a BF10 could evolve with increasing sample size. The true effect size was δ = 0.4, and the threshold was set to 30 (resp. 1/30).
Selecting a threshold. As a guideline, verbal labels for BFs ('grades of evidence'; Jeffreys, 1961, p. 432) have been suggested (Jeffreys, 1961; Kass & Raftery, 1995; see also Lee & Wagenmakers, 2013). If 1 < BF < 3, the BF indicates anecdotal evidence, 3 < BF < 10 moderate evidence, 10 < BF < 30 strong evidence, and BF > 30 very strong evidence. (Kass & Raftery, 1995, suggest 20 as the threshold for 'strong evidence'.)
Selecting an effect size prior for H1. For the calculation of the BF, prior distributions must be specified, which quantify the plausibility of parameter values. In the default BF for t tests (Morey & Rouder, 2011, 2015; Rouder et al., 2009), which we employ here, the plausibility of effect sizes (expressed as Cohen's d) is modeled as a Cauchy distribution, which is called a JZS prior. The spread of the distribution can be adjusted with the scale parameter r. Figure 2 shows the Cauchy distributions for the three default values provided in the BayesFactor package (r = √2/2, 1, and √2). Higher r values lead to fatter tails, which correspond to a higher plausibility of large effect sizes under H1.
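To give a feel for the role of r, the following snippet (simulated data, added for illustration) computes the same default BF under the three scale values shown in Figure 2, passing them to the rscale argument of ttestBF:

```r
# Sketch (R): sensitivity of the default BF to the Cauchy scale parameter r.
library(BayesFactor)
set.seed(7)
x <- rnorm(50, mean = 0.4); y <- rnorm(50)   # simulated two-group data

sapply(c(sqrt(2) / 2, 1, sqrt(2)), function(r) {
  extractBF(ttestBF(x = x, y = y, rscale = r))$bf
})
# Wider priors (larger r) place more mass on large effects and therefore tend
# to yield somewhat smaller BF10 values for small-to-medium observed effects.
```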
The family of JZS priors was constructed based on general desiderata (Ly, Verhagen, & Wagenmakers, in press; e.g., Bayarri, Berger, Forte, & García-Donato, 2012; Jeffreys, 1961), without recourse to substantive knowledge about the specifics of the problem at hand, and in this sense it is an objective prior (Rouder et al.,

⁴ Recently, it has been debated whether BFs are also biased by optional stopping rules (Sanborn & Hills, 2013; Yu et al., 2013). For a rebuttal of these positions, see Rouder (2014), and also the reply by Sanborn et al. (2014).
⁵ The Cauchy distribution is a t distribution with one degree of freedom.

References
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Frequently Asked Questions (10)
Q1. What are the contributions in "Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences"?

Despite these criticisms, this research practice is not uncommon, probably as it appeals to researchers' intuition to collect more data in order to push an indecisive result into a decisive region. In this contribution the authors investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. The authors investigated the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between two groups. This article may not exactly replicate the final version published in the APA journal.

Among these, the authors wish to stress that it makes a commonly used procedure perfectly acceptable, which has been considered questionable so far: while in NHST this option is taboo, using the SBF it can be done without any guilt. Not only can it be done, but doing so results in a more efficient research strategy, provided that some rules are followed.

Due to the Bayesian shrinkage of early terminations, meta-analytic aggregations of multiple SBF studies underestimate the true effect size by 5-9%.

In order to keep simulation time manageable, the authors increased the sample in several step sizes: +1 participant until n = 100, +5 participants until n = 1000, +10 participants until n = 2500, +20 participants until n = 5000, and +50 participants from that point on. 

Run a minimal number of participants (e.g., nmin = 20 per group), increase sample size as often as desired and compute a BF at each stage (even after each participant). 

One of the most often-heard critiques of Bayesian approaches is about the necessity to choose a prior distribution of the parameters (e.g., Simmons et al., 2011). 

The choice of the minimal sample size before the optional stopping procedure is started is another parameter for fine-tuning the expected rate of misleading evidence.

The mean posterior effect size in the final sample was Cohen's d = 0.72, with a 95% highest posterior density (HPD) interval of [0.22; 1.21].

The minimum sample size was set to nmin = 20 in each group, and the critical BF10 for stopping the sequential sampling was set to 10 (resp. 1/10). 

But as journals tend to reject non-significant results, a p value of .08 can pose a real practical problem and a conflict of interest for researchers.