
Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences.

TLDR
This contribution investigates the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant, and examines the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups.
Abstract
Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers' intuition to collect more data to push an indecisive result into a decisive region. In this contribution, we investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. In this procedure, which we call Sequential Bayes Factors (SBF), Bayes factors are computed until an a priori defined level of evidence is reached. This allows flexible sampling plans and is not dependent upon correct effect size guesses in an a priori power analysis. We investigated the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups. Compared with optimal NHST, the SBF design typically needs 50% to 70% smaller samples to reach a conclusion about the presence of an effect, while having the same or lower long-term rate of wrong inference. (PsycINFO Database Record)


Sequential Hypothesis Testing With Bayes Factors: Efficiently Testing
Mean Differences
Felix D. Schönbrodt
Ludwig-Maximilians-Universität München, Germany
Eric-Jan Wagenmakers
University of Amsterdam
Michael Zehetleitner
Ludwig-Maximilians-Universität München, Germany
Marco Perugini
University of Milan Bicocca
Unplanned optional stopping rules have been criticized for inflating Type I error rates under
the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research
practice is not uncommon, probably as it appeals to researchers' intuition to collect
more data in order to push an indecisive result into a decisive region. In this contribution we
investigate the properties of a procedure for Bayesian hypothesis testing that allows optional
stopping with unlimited multiple testing, even after each participant. In this procedure, which
we call Sequential Bayes Factors (SBF), Bayes factors are computed until an a priori defined
level of evidence is reached. This allows flexible sampling plans and is not dependent upon
correct effect size guesses in an a priori power analysis. We investigated the long-term rate
of misleading evidence, the average expected sample sizes, and the biasedness of effect size
estimates when an SBF design is applied to a test of mean differences between two groups.
Compared to optimal NHST, the SBF design typically needs 50% to 70% smaller samples to
reach a conclusion about the presence of an effect, while having the same or lower long-term
rate of wrong inference.
Manuscript accepted for publication in Psychological Methods.
doi:10.1037/met0000061
This article may not exactly replicate the final version published in the APA
journal. It is not the copy of record.
Keywords: Bayes factor, efficiency, hypothesis testing, optional stopping, sequential designs
The goal of science is to increase knowledge about the world. For this endeavor, scientists have to weigh the evidence of competing theories and hypotheses, for example: 'Does drug X help to cure cancer or not?', 'Which type of exercise, A or B, is more effective to reduce weight?', or 'Does maternal responsivity increase intelligence of the children?'. How do scientists come to conclusions concerning such competing hypotheses?

Author note: Felix D. Schönbrodt, Department of Psychology, Ludwig-Maximilians-Universität München, Germany. Acknowledgements: We thank Richard Morey and Jeff Rouder for assistance with Bayes-related R scripts, and Daniël Lakens and Alexander Ly for comments on a previous version. Reproducible analysis scripts for the simulations and analyses are available at the Open Science Framework (https://osf.io/qny5x/). Correspondence concerning this article should be addressed to Felix Schönbrodt, Leopoldstr. 13, 80802 München, Germany. Email: felix@nicebread.de. Phone: +49 89 2180 5217. Fax: +49 89 2180 99 5214.
The typical current procedure for hypothesis testing is a hybrid of what Sir Ronald Fisher, Jerzy Neyman, and Egon Pearson proposed in the early 20th century: the null-hypothesis significance test (NHST; for an accessible overview, see Dienes, 2008). It soon became the standard model for hypothesis testing in psychology, medicine, and most other disciplines that use statistics. However, NHST has been repeatedly criticized over the past decades, and in particular in recent years (e.g., Cumming, 2014; Kruschke, 2012; Rouder, Speckman, Sun, Morey, & Iverson, 2009). Despite these criticisms, it is the de facto standard in psychology, but it is not the only possible procedure for testing scientific hypotheses. The purpose of this paper is to propose an alternative procedure based on sequentially testing Bayes factors. This procedure, henceforward called the 'Sequential Bayes Factor (SBF) design', proposes to collect an initial sample and to compute a Bayes factor (BF). The BF quantifies the relative evidence in the data with respect to whether the data are better predicted by one hypothesis (e.g., a null hypothesis, 'there is no effect in the
population') or a competing hypothesis (e.g., 'there is a non-zero effect in the population'). Then, the sample size can be optionally increased and a new BF computed, until a predefined threshold of evidential strength is reached. A more detailed introduction to the procedure is given below. This procedure does not presume a predefined and fixed sample size, but rather accumulates data until a sufficient certainty about the presence or absence of an effect is achieved. Hence, the SBF design applies an optional stopping rule to the sampling plan. This procedure has been proposed several times (e.g., Dienes, 2008; Kass & Raftery, 1995; Lindley, 1957; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012) and has already been applied in experimental studies (Matzke et al., 2015; Wagenmakers et al., 2012; Wagenmakers et al., 2015).
From a Bayesian point of view, the interpretation of a study depends only on the data at hand, the priors, and the specific model of the data-generating process (i.e., the likelihood function). In contrast to frequentist approaches, it does not depend on the sampling intentions of the researcher, such as when to stop a study, or on outcomes from hypothetical other studies that have not been conducted (e.g., Berger & Wolpert, 1988; Dienes, 2011; Kruschke, 2012).
For planning a study, however, it makes sense also for Bayesians to investigate the outcomes of hypothetical studies by studying the properties of a Bayesian procedure under several conditions (Sanborn et al., 2014). The goal of this paper is to investigate such properties of the SBF design via Monte Carlo simulations. Throughout the paper, we will refer to the scenario of testing the hypothesis of a two-group mean difference, where H0: µ1 = µ2 and H1: µ1 ≠ µ2. The true effect size δ expresses a standardized mean difference in the population.
The paper is organized as follows. In the first section, we describe three research designs: NHST with a priori power analysis, group sequential designs, and Sequential Bayes Factors. In the second section, we describe three properties of the SBF design that are investigated in our simulations: (1) the long-term rate of misleading evidence (i.e., 'How often do I get strong evidence for an effect although there is none, or strong evidence for H0 although there is an effect?'), (2) the necessary sample size to get evidence of a certain strength (i.e., a Bayesian power analysis), and (3) the biasedness of the effect size estimates (i.e., 'Do empirical effect size estimates over- or underestimate the true effect on average?'). The third section reports the results of our simulations and shows how SBF performs on each of the three properties in comparison to the other two research designs. The fourth section gives some practical recommendations on how to compute Bayes factors and how to use the SBF design. Finally, the fifth section discusses advantages and disadvantages of the SBF design.
Three Research Designs
In the following sections, we will describe and discuss
three research designs: NHST with a priori power analysis,
group sequential designs, and Sequential Bayes Factors. For
illustration purposes, we introduce an empirical example to
which we apply each research design. We used open data
from the ManyLabs 1 project (Klein et al., 2014), specifically
the replication data of the retrospective gambler’s fallacy
study (Oppenheimer & Monin, 2009). The data are avail-
able at the Open Science Framework (https://osf.io/wx7ck/).
Theory predicts that participants will perceive unlikely outcomes to have come from longer sequences than more common outcomes. The original study investigated the scenario that participants observe a person rolling a die and see that two times (resp. three times) in a row the number '6' comes up. After observing three 6s in a row ('three-6' condition), participants thought that the person had been rolling the die for a longer time than after observing two 6s in a row ('two-6' condition). We chose this data set in favor of the NHST-PA method, as the population effect size (as estimated by the full sample of 5942 participants; d = 0.60, 95% CI [0.55; 0.65]) is very close to the effect size of the original study (d = 0.69). We drew random samples from the full pool of 5942 participants to simulate a fixed-n, a group sequential, and an SBF study.
The NHST Procedure With a Priori Power Analysis and
Some of Its Problems
In its current best-practice version (e.g., Cohen, 1988), the Neyman-Pearson procedure entails the following steps:
1. Estimate the expected effect size from the literature, or define the minimal meaningful effect size.
2. A priori, define the tolerated long-term rate of false positive decisions (usually α = 5%) and the tolerated long-term rate of false negative decisions (usually β between 5% and 20%).
3. Run an a priori power analysis, which gives the necessary sample size to detect an effect (i.e., to reject H0) within the limits of the defined error rates.
4. Optionally, for confirmatory research: Pre-register the study and the statistical analysis that will be conducted.
5. Run the study with the sample size that was obtained from the a priori power analysis.
6. Do the pre-defined analysis and compute a p value. Reject H0 if p < α. Report a point estimate and the confidence interval for the effect size.

Henceforward, this procedure will be called the NHST-PA
procedure (‘Null-Hypothesis Significance Test with a priori
Power Analysis’). This type of sampling plan is also called a
fixed-n design, as the sample size is predetermined and fixed.
Over the last years, psychology has seen a large debate about problems in current research practice. Many of these concern (intentionally or unintentionally) wrong applications of the NHST-PA procedure, such as too much flexibility in data analysis (Bakker, van Dijk, & Wicherts, 2012; Simmons, Nelson, & Simonsohn, 2011), or even outright fraud (Simonsohn, 2013). Other papers revive a general critique of the ritual of NHST (e.g., Cumming, 2014; Kline, 2004; Schmidt & Hunter, 1997; Wagenmakers, 2007), recognizing that they are to a large part a reformulation of older critiques (e.g., Cohen, 1994), which are in turn reformulations of even older articles (Bakan, 1966; Rozeboom, 1960) that themselves claim to be 'hardly original' (Bakan, 1966, p. 423).
The many theoretical arguments against NHST are not repeated here. We rather focus on three interconnected, practical problems with NHST that are partly inherent to the method and partly stem from an improper application of it: the dependence of NHST-PA's performance on the a priori effect size estimate, the problem of 'nearly significant' results, and the related temptation of optionally increasing the sample size.
Dependence of NHST-PA on the a priori effect size estimate. The efficiency and the quality of NHST-PA depend on how close the a priori effect size estimate is to the true effect size δ. If δ is smaller than the assumed effect size, the proportion of Type II errors will increase. For example, if δ is 25% smaller than expected, one does not have enough power to reliably detect the actually smaller effect, and Type II errors will rise from 5% to about 24%. This problem can be tackled using a safeguard power analysis (Perugini, Gallucci, & Costantini, 2014). This procedure takes into account that effect size point estimates are surrounded by confidence intervals. Hence, if a researcher wants to run a more conclusive test of whether an effect can be replicated, he or she is advised to aim for the lower end of the initial effect size interval in order to have enough statistical power, even when the point estimate is biased upwards. Depending on the accuracy of the published effect size, the safeguard effect size can be considerably lower than the point estimate of the effect size.
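To make this dependence concrete, here is a minimal R sketch (R is the language of the paper's analysis scripts, but this snippet is ours, not theirs), assuming the d = 0.69 scenario of the empirical example discussed below:

```r
# Sketch (R): how a 25% overestimate of the effect size inflates Type II errors.
# Assumed scenario (not taken from the paper's scripts): planned for d = 0.69,
# alpha = .05, power = 95%; the true effect turns out to be 25% smaller.
d_assumed <- 0.69
d_true    <- 0.75 * d_assumed

# Sample size per group prescribed by the a priori power analysis
n_planned <- ceiling(power.t.test(delta = d_assumed, sd = 1, sig.level = .05,
                                  power = .95)$n)

# Actual power (and Type II error rate) if the true effect is smaller than assumed
actual <- power.t.test(n = n_planned, delta = d_true, sd = 1, sig.level = .05)
c(n_planned = n_planned, power = actual$power, type_II = 1 - actual$power)
# Type II error rises from the nominal 5% to roughly 22-23%,
# close to the "about 24%" reported in the text.
```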
Inserting conservative effect sizes into an a priori power analysis helps against increased Type II errors, but it has its costs. If the original point estimate was indeed correct, going for a conservative effect size leads to sample sizes that are bigger than strictly needed. For example, if δ is 25% larger than expected, the sample size prescribed by a safeguard power analysis will be about 1.5 times higher than the optimal sample size. Under many conditions, this represents an advantage rather than a problem. In fact, a side benefit of using safeguard power analysis is that the parameter of interest will be estimated more precisely. Nonetheless, it can be argued to be statistically inefficient insofar as the sample size needed to reach a conclusion can be bigger than what would have been necessary.
Optimal efficiency can only be achieved when the a priori effect size estimate exactly matches the true effect size. Henceforward, we will use the label optimal NHST-PA for that ideal case, which can represent a benchmark condition of maximal efficiency under the NHST paradigm. In other words, this is how good NHST can get.
The 'p = .08 problem'. Whereas safeguard power analysis can be a good solution for an inappropriate a priori effect size estimate, it is not a solution for the 'almost significant' problem. Imagine you ran a study and obtained a p value of .08. What do you do? Probably based on their 'Bayesian Id's wishful thinking' (Gigerenzer, Krauss, & Vitouch, 2004), many researchers would label this finding, for example, as 'teetering on the brink of significance'.¹ By doing so, the p value is interpreted as an indicator of the strength of evidence against H0 (or for H1). This interpretation would be incorrect from a Neyman-Pearson perspective (Gigerenzer et al., 2004; Hubbard, 2011), but valid from a Fisherian perspective (Royall, 1997), which reflects the confusion in the literature about what p values are and what they are not.
Such 'nearly significant' p values are not an actual problem of a proper NHST; they are just a possible result of a statistical procedure. But as journals tend to reject non-significant results, a p value of .08 can pose a real practical problem and a conflict of interest for researchers.² By exploiting researcher degrees of freedom (Simmons et al., 2011), p values can be tweaked ('p-hacking'; Simonsohn, Nelson, & Simmons, 2014), and the current system provides incentives for p-hacking (Bakker et al., 2012).
¹ http://mchankins.wordpress.com/2013/04/21/still-not-significant-2/
² There have been recent calls for changes in editorial policies, in a way that studies with any p value can be published as long as they are well-powered (van Assen, van Aert, Nuijten, & Wicherts, 2014). Furthermore, several journals have started to accept registered reports, which publish results independent of their outcome (e.g., Chambers, 2013; Nosek & Lakens, 2014).

Optionally increasing the sample size: A typical questionable research practice. Faced with the 'p = .08 problem', a researcher's intuition could suggest increasing the sample size to see whether the p value drops below the .05 criterion. This intuition is correct from an accuracy point of view: more data lead to more precise estimates (e.g., Maxwell, Kelley, & Rausch, 2008; Schönbrodt & Perugini, 2013). According to John, Loewenstein, and Prelec (2012), optionally increasing the sample size when the results are not significant is one of the most common (questionable) research practices. Furthermore, Yu, Sprenger, Thomas, and Dougherty (2013) showed empirically which (incorrect) heuristics researchers used in their optional stopping practice. Adaptively increasing the sample size can also be framed as a form of multiple testing: one conducts an interim test, and based on the p value, data collection is either stopped (if p < .05) or the sample size is increased if the p value is in a promising region (e.g., if .05 < p < .10; Murayama, Pekrun, & Fiedler, 2013).
However, this practice of unplanned multiple testing is not allowed in the classical NHST paradigm, as it increases Type I error rates (Armitage, McPherson, & Rowe, 1969). Of course one can calculate statistics during data collection, but the results of these tests must not have any influence on optionally stopping data collection. If an interim test with optional stopping is performed, and the first test was done at a 5% level, a 5% Type I error has already been spent. It should be noted that the increase in Type I error is small for a single interim test when there is a promising result (it increases from 5% to 7.1%; cf. Murayama et al., 2013). However, the increase depends on how many interim tests are performed, and with enough interim tests the Type I error rate can be pushed towards 100% (Armitage et al., 1969; Proschan, Lan, & Wittes, 2006).
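This inflation is easy to demonstrate by simulation. The following R sketch is an illustration we added (not the paper's simulation code); with H0 true, it runs a t-test after every additional participant and stops at the first p < .05:

```r
# Sketch (R): Type I error inflation under unplanned optional stopping.
# H0 is true (no group difference); after an initial n = 20 per group, a
# t-test is run after each additional participant up to n = 100, and
# sampling stops as soon as p < .05.
set.seed(1)
n_sim <- 1000
false_positive <- replicate(n_sim, {
  x <- rnorm(100); y <- rnorm(100)   # pre-draw the maximum possible sample
  hit <- FALSE
  for (n in 20:100) {
    if (t.test(x[1:n], y[1:n])$p.value < .05) { hit <- TRUE; break }
  }
  hit
})
mean(false_positive)  # substantially above the nominal 5%
```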
The empirical example in the NHST-PA design. In
this section, we demonstrate how the NHST-PA procedure
would have been applied to the empirical example.
Method and participants. An a priori power analysis with an expected effect size of d = 0.69, a Type I error rate of 5%, and a statistical power of 95% resulted in a necessary sample size of n = 56 in each group.
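This power analysis can be reproduced with base R's power.t.test (a sketch we added; the authors' own scripts are available on the OSF and may differ in detail):

```r
# Sketch (R): a priori power analysis for the empirical example
# (expected d = 0.69, alpha = .05, power = .95, two-sided two-sample t-test).
ceiling(power.t.test(delta = 0.69, sd = 1, sig.level = 0.05,
                     power = 0.95, type = "two.sample",
                     alternative = "two.sided")$n)
# 56 participants per group, matching the sample size reported above
```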
Results. A t-test for independent groups rejected H0 (t(77.68) = 3.72, p < .001), indicating a significant group difference in the expected direction (two-6: M = 1.86, SD = 1.42; three-6: M = 3.54, SD = 3.05). The effect size in the sample was d = 0.70, 95% CI [0.32; 1.09].
Group Sequential Designs
Optionally increasing the sample size is considered a questionable research practice in the fixed-n design, as it increases the rate of false-positive results. If the interim tests are planned a priori, however, multiple testing is possible under the NHST paradigm. Several extensions of the NHST paradigm have been developed for that purpose. The most common sequential designs are called group sequential (GS) designs (e.g., Lai, Lavori, & Shih, 2012; Proschan et al., 2006).³ In a GS design, the number and the sample sizes of the interim tests (e.g., at n1 = 25, n2 = 50, and n3 = 75) and a final test (e.g., at nmax = 100) are planned a priori. The sample size spacings of the interim tests and the critical values for the test statistic at each stage are designed in a way that the overall Type I error rate is controlled at, say, 5%. If the test statistic exceeds an upper boundary at an interim test, data collection is stopped early, as the effect is strong enough that it is already reliably detected in the smaller sample ('stopping for efficacy'). If the test statistic falls short of the boundary, data collection is continued until the next interim test, or until the final test is due. Some GS designs also allow for 'stopping for futility', when the test statistic falls below a lower boundary; in this case it is unlikely that an effect could be detected even with the maximal sample size nmax. The more often interim tests are performed, the higher the maximal sample size must be in order to achieve the same power as a fixed-n design without interim tests. But if an effect exists, there is a considerable chance of stopping earlier than at nmax. Hence, on average, GS designs need fewer participants than NHST-PA with the same error rates.
If done correctly, GS designs can be a partial solution to the 'p = .08 problem'. However, all sequential designs based on NHST have one property in common: they have a limited number of tests, which in the case of GS designs has to be defined a priori. But what do you do when your final test results in p = .08? Once the final test is done, all Type I error has been spent, and the same problem arises again.
The example in the GS design. We demonstrate below how the GS procedure would have been applied to the empirical example:
Method and participants. We employed a group sequential design with four looks (three interim looks plus the final look), with a total Type I error rate of 5% and a statistical power of 95%. Necessary sample sizes and critical boundaries were computed using the default settings of the gsDesign package (Anderson, 2014). The planned sample sizes were n = 16, 31, 46, and 61 in each group for the first to the fourth look, with corresponding critical two-sided p values of .0016, .0048, .0147, and .0440.
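A rough sketch of how such boundaries can be obtained with the gsDesign package is shown below; the call and its defaults are our assumption rather than the authors' documented code, so the numbers need not match the reported boundaries exactly:

```r
# Sketch (R): a 4-look group sequential design for the empirical example.
# Assumptions (not from the paper's scripts): alpha = .05 two-sided
# (i.e., .025 one-sided), power = 95%, gsDesign's default spending functions.
library(gsDesign)

# Fixed-design sample size per group for d = 0.69
n_fix <- power.t.test(delta = 0.69, sd = 1, sig.level = .05, power = .95)$n

gs <- gsDesign(k = 4, alpha = 0.025, beta = 0.05, n.fix = n_fix)
ceiling(gs$n.I)              # per-group sample size at each look
2 * pnorm(-gs$upper$bound)   # nominal two-sided p-value boundaries for efficacy
```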
Results. The first and the second interim tests failed to reject H0 at the critical level (p1 = .0304; p2 = .0052). As the p value fell below the critical level at the third interim test (p3 = .0003), we rejected H0 and stopped sampling. Hence, the final sample consisted of n = 46 participants in each group (two-6: M = 1.71, SD = 1.48; three-6: M = 3.50, SD = 2.85).
Sequential Bayes Factors: An Alternative Hypothesis
Testing Procedure
Under the NHST paradigm it is not allowed to increase the sample size after you have run your (last planned) hypothesis test. This section elaborates on an alternative way of choosing between competing hypotheses that sets p values and NHST completely aside and allows unlimited multiple testing: Sequential Bayes Factors (SBF).
³ An accessible introduction to GS designs is provided by Lakens (2014), who also gives advice on how to plan GS designs in practice. Beyond GS designs, other sequential designs have been proposed, such as adaptive designs (e.g., Lai et al., 2012), or a flexible sequential strategy based on p values (Frick, 1998), which are not discussed here.

NHST focuses on how incompatible the actual data (or more extreme data) are with H0. In Bayesian hypothesis testing via BFs, in contrast, it is assessed whether the data at hand are more compatible with H0 or with an alternative hypothesis H1 (Berger, 2006; Dienes, 2011; Jeffreys, 1961; Wagenmakers, 2007). BFs provide a numerical value that quantifies how well a hypothesis predicts the empirical data relative to a competing hypothesis. Hence, the BF belongs to the larger family of likelihood ratio tests, and the SBF resembles the sequential probability ratio test proposed by Wald and Wolfowitz (1948). Formally, BFs are defined as:

BF10 = p(D | H1) / p(D | H0)    (1)
For example, if the BF10 is 4, this indicates: 'These empirical data D are 4 times more probable if H1 were true than if H0 were true'. A BF10 between 0 and 1, in contrast, indicates support for H0.
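For a concrete illustration, such a BF10 can be computed with the BayesFactor R package, which implements the default Bayes factor used later in this paper; the data below are simulated for illustration and are not taken from the empirical example:

```r
# Sketch (R): a default Bayes factor for a two-group mean difference
# (simulated data, for illustration only).
library(BayesFactor)
set.seed(123)
x <- rnorm(40, mean = 0.5)   # group 1, simulated true effect of d = 0.5
y <- rnorm(40, mean = 0.0)   # group 2

bf <- ttestBF(x = x, y = y)  # default JZS prior (rscale = sqrt(2)/2)
extractBF(bf)$bf             # BF10: evidence for H1 relative to H0
```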
BFs can be calculated once for a finalized data set. But it has also repeatedly been proposed to employ BFs in sequential designs with optional stopping rules, where sample sizes are increased until a BF of a certain size has been achieved (Dienes, 2008; Kass & Raftery, 1995; Lindley, 1957; Wagenmakers et al., 2012). While unplanned optional stopping is highly problematic for NHST, it is not a problem for Bayesian statistics. For example, Edwards, Lindman, and Savage (1963) state that 'the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience' (p. 193; see also Lindley, 1957).⁴
Although many authors agree about the theoretical advantages of BFs, until recently it was complicated and unclear how to compute a BF even for the simplest standard designs (Rouder, Morey, Speckman, & Province, 2012). Fortunately, over the last years BFs for several standard designs have been developed (e.g., Dienes, 2014; Gönen, Johnson, Lu, & Westfall, 2005; Kuiper, Klugkist, & Hoijtink, 2010; Morey & Rouder, 2011; Mulder, Hoijtink, & de Leeuw, 2012; Rouder et al., 2012, 2009). In the current simulations, we use the default Bayes factor proposed by Rouder et al. (2009). This BF tests H0: µ1 = µ2 against an H1 under which the effect size follows a Cauchy(r) distribution, where r is a scale parameter that controls the width of the Cauchy distribution.⁵ This prior distribution defines the plausibility of possible effect sizes under H1 (more details below).
The SBF procedure can be outlined as follows:
1. Define a priori a threshold which indicates the requested decisiveness of evidence, for example a BF10 of 10 for H1 and the reciprocal value of 1/10 for H0 (e.g., 'When the data are 10 times more likely under H1 than under H0, or vice versa, I stop sampling.'). Henceforward, these thresholds are referred to as the 'H0 boundary' and the 'H1 boundary'.
2. Choose a prior distribution for the effect sizes under H1. This distribution describes the plausibility that effects of certain sizes exist.
3. Optionally, for confirmatory research: Pre-register the study along with the predefined threshold and prior effect size distribution.
4. Run a minimal number of participants (e.g., nmin = 20 per group), increase the sample size as often as desired, and compute a BF at each stage (even after each participant).
5. As soon as one of the thresholds defined in step 1 is reached or exceeded (either the H0 boundary or the H1 boundary), stop sampling and report the final BF. As a Bayesian effect size estimate, report the mean and the highest posterior density (HPD) interval of the posterior distribution of the effect size estimate, or plot the entire posterior distribution. (A minimal R sketch of this sampling loop follows below.)
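The following sketch illustrates the stopping rule with the BayesFactor package, using the example values from the text (threshold 10, nmin = 20) and a simulated true effect of δ = 0.4; it is an illustration we added, not the authors' simulation code:

```r
# Sketch (R): Sequential Bayes Factor sampling loop (illustration only).
# Stop when BF10 >= 10 (H1 boundary) or BF10 <= 1/10 (H0 boundary).
library(BayesFactor)
set.seed(42)

threshold <- 10      # evidence threshold from step 1 (example value)
n_min     <- 20      # minimal sample size per group from step 4
d_true    <- 0.4     # simulated true effect size

x <- rnorm(n_min, mean = d_true)   # group 1
y <- rnorm(n_min, mean = 0)        # group 2

repeat {
  bf10 <- extractBF(ttestBF(x = x, y = y, rscale = sqrt(2) / 2))$bf
  if (bf10 >= threshold || bf10 <= 1 / threshold) break
  # otherwise: add one participant per group and test again
  x <- c(x, rnorm(1, mean = d_true))
  y <- c(y, rnorm(1, mean = 0))
}
c(n_per_group = length(x), BF10 = bf10)
```

In practice, the loop would of course collect real participants instead of drawing from rnorm; the stopping logic stays the same.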
Figure 1 shows some exemplary trajectories of how a BF10 could evolve with increasing sample size. The true effect size was δ = 0.4, and the threshold was set to 30 (resp. 1/30).
Selecting a threshold. As a guideline, verbal labels for BFs ('grades of evidence'; Jeffreys, 1961, p. 432) have been suggested (Jeffreys, 1961; Kass & Raftery, 1995; see also Lee & Wagenmakers, 2013). If 1 < BF < 3, the BF indicates anecdotal evidence, 3 < BF < 10 moderate evidence, 10 < BF < 30 strong evidence, and BF > 30 very strong evidence. (Kass & Raftery, 1995, suggest 20 as the threshold for 'strong evidence'.)
Selecting an effect size prior for H1. For the calculation of the BF, prior distributions must be specified, which quantify the plausibility of parameter values. In the default BF for t tests (Morey & Rouder, 2011, 2015; Rouder et al., 2009), which we employ here, the plausibility of effect sizes (expressed as Cohen's d) is modeled as a Cauchy distribution, which is called a JZS prior. The spread of the distribution can be adjusted with the scale parameter r. Figure 2 shows the Cauchy distributions for the three default values provided in the BayesFactor package (r = √2/2, 1, and √2). Higher r values lead to fatter tails, which correspond to a higher plausibility of large effect sizes under H1.
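To give a feel for the role of r, the following snippet (simulated data, added for illustration) computes the same default BF under the three scale values shown in Figure 2, passing them to the rscale argument of ttestBF:

```r
# Sketch (R): sensitivity of the default BF to the Cauchy scale parameter r.
library(BayesFactor)
set.seed(7)
x <- rnorm(50, mean = 0.4); y <- rnorm(50)   # simulated two-group data

sapply(c(sqrt(2) / 2, 1, sqrt(2)), function(r) {
  extractBF(ttestBF(x = x, y = y, rscale = r))$bf
})
# Wider priors (larger r) place more mass on large effects and therefore tend
# to yield somewhat smaller BF10 values for small-to-medium observed effects.
```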
The family of JZS priors was constructed based on general desiderata (Ly, Verhagen, & Wagenmakers, in press; e.g., Bayarri, Berger, Forte, & García-Donato, 2012; Jeffreys, 1961), without recourse to substantive knowledge about the specifics of the problem at hand, and in this sense it is an objective prior (Rouder et al.,

⁴ Recently, it has been debated whether BFs are also biased by optional stopping rules (Sanborn & Hills, 2013; Yu et al., 2013). For a rebuttal of these positions, see Rouder (2014), and also the reply by Sanborn et al. (2014).
⁵ The Cauchy distribution is a t distribution with one degree of freedom.

References
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Frequently Asked Questions (10)
Q1. What are the contributions in "Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences"?

Despite these criticisms, this research practice is not uncommon, probably as it appeals to researchers' intuition to collect more data in order to push an indecisive result into a decisive region. In this contribution the authors investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. The authors investigated the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between two groups. This article may not exactly replicate the final version published in the APA journal.

Among these, the authors wish to stress that it makes a commonly used procedure perfectly acceptable, which has been considered questionable so far: while in NHST this option is taboo, using the SBF it can be done without any guilt. Not only can it be done, but doing so results in a more efficient research strategy, provided that some rules are followed.

Due to the Bayesian shrinkage of early terminations, meta-analytic aggregations of multiple SBF studies underestimate the true effect size by 5-9%.

In order to keep simulation time manageable, the authors increased the sample in several step sizes: +1 participant until n = 100, +5 participants until n = 1000, +10 participants until n = 2500, +20 participants until n = 5000, and +50 participants from that point on. 

Run a minimal number of participants (e.g., nmin = 20 per group), increase sample size as often as desired and compute a BF at each stage (even after each participant). 

One of the most often-heard critiques of Bayesian approaches is about the necessity to choose a prior distribution of the parameters (e.g., Simmons et al., 2011). 

The choice of the minimal sample size before the optional stopping procedure is started is another parameter for fine-tuning the expected rate of misleading evidence.

The mean posterior effect size in the final sample was Cohen's d = 0.72, with a 95% highest posterior density (HPD) interval of [0.22; 1.21].

The minimum sample size was set to nmin = 20 in each group, and the critical BF10 for stopping the sequential sampling was set to 10 (resp. 1/10). 

But as journals tend to reject non-significant results, a p value of .08 can pose a real practical problem and a conflict of interest for researchers.