Bernoulli, Vol. 7, No. 2 (April 2001), pp. 223–242. ISSN 1350–7265 © 2001 ISI/BS

An adaptive Metropolis algorithm

HEIKKI HAARIO^1, EERO SAKSMAN^1 and JOHANNA TAMMINEN^2

^1 Department of Mathematics, P.O. Box 4 (Yliopistonkatu 5), FIN-00014 University of Helsinki, Finland. E-mail: heikki.haario@helsinki.fi; eero.saksman@helsinki.fi
^2 Finnish Meteorological Institute, Geophysics Research, P.O. Box 503, FIN-00101 Helsinki, Finland. E-mail: johanna.tamminen@fmi.fi
A proper choice of a proposal distribution for Markov chain Monte Carlo methods, for example for the Metropolis–Hastings algorithm, is well known to be a crucial factor for the convergence of the algorithm. In this paper we introduce an adaptive Metropolis (AM) algorithm, where the Gaussian proposal distribution is updated along the process using the full information cumulated so far. Due to the adaptive nature of the process, the AM algorithm is non-Markovian, but we establish here that it has the correct ergodic properties. We also include the results of our numerical tests, which indicate that the AM algorithm competes well with traditional Metropolis–Hastings algorithms, and demonstrate that the AM algorithm is easy to use in practical computation.

Keywords: adaptive Markov chain Monte Carlo; comparison; convergence; ergodicity; Markov chain Monte Carlo; Metropolis–Hastings algorithm
1. Introduction
It is generally acknowledged that the choice of an effective proposal distribution for the random walk Metropolis algorithm, for example, is essential in order to obtain reasonable results by simulation in a limited amount of time. This choice concerns both the size and the spatial orientation of the proposal distribution, which are often very difficult to choose well since the target density is unknown (see Gelman et al. 1996; Gilks et al. 1995; 1998; Haario et al. 1999; Roberts et al. 1997). A possible remedy is provided by adaptive algorithms, which use the history of the process in order to 'tune' the proposal distribution suitably. This has previously been done (for instance) by assuming that the state space contains an atom. The adaptation is performed only at the times of recurrence to the atom in order to preserve the right ergodic properties (Gilks et al. 1998). The adaptation criteria are then obtained by monitoring the acceptance rate. A related and interesting self-regenerative version of adaptive Markov chain Monte Carlo (MCMC), based on introducing an auxiliary chain, is contained in the recent preprint of Sahu and Zhigljavsky (1999). For other versions of adaptive MCMC and related work, we refer to Evans (1991), Fishman (1996), Gelfand and Sahu (1994), Gilks and Roberts (1995) and Gilks et al. (1994), together with the references therein.
We introduce here an adaptive Metropolis (AM) algorithm which adapts continuously to the target distribution. Significantly, the adaptation affects both the size and the spatial orientation of the proposal distribution. Moreover, the new algorithm is straightforward to implement and use in practice. The definition of the AM algorithm is based on the classical
random walk Metropolis algorithm (Metropolis et al. 1953) and its modification, the AP algorithm, introduced in Haario et al. (1999). In the AP algorithm the proposal distribution is a Gaussian distribution centred on the current state, and the covariance is calculated from a fixed finite number of previous states. In the AM algorithm the covariance of the proposal distribution is calculated using all of the previous states. The method is easily implemented with no extra computational cost, since one may apply a simple recursion formula for the covariances involved.
An important advantage of the AM algorithm is that it starts using the cumulating information right at the beginning of the simulation. The rapid start of the adaptation ensures that the search becomes more effective at an early stage of the simulation, which diminishes the number of function evaluations needed.
To be more exact, assume that at time t the already sampled states of the AM chain are X_0, X_1, ..., X_t, some of which may be multiple. The new proposal distribution for the next candidate point is then a Gaussian distribution with mean at the current point X_t and covariance given by s_d R, where R is the covariance matrix determined by the spatial distribution of the states X_0, X_1, ..., X_t ∈ R^d. The scaling parameter s_d depends only on the dimension d of the vectors. This adaptation strategy forces the proposal distribution to approach an appropriately scaled Gaussian approximation of the target distribution, which increases the efficiency of the simulation. A more detailed description of the algorithm is given in Section 2 below.
One of the difficulties in constructing adaptive MCMC algorithms is to ensure that the algorithm maintains the correct ergodicity properties. We observe here (see also Haario et al. 1999) that the AP algorithm does not possess this property. Our main result, Theorem 1 below, verifies that the AM process does indeed have the correct ergodicity properties, assuming that the target density is bounded from above and has a bounded support. The AM chain is not Markovian, but we show that the asymptotic dependence between the elements of the chain is weak enough to apply known laws of large numbers for mixingales; see McLeish (1975) and (29) below for this notion. Similar results may also be proven for variants of the algorithm, where the covariance is computed from a suitably increasing segment of the near history.
Section 3 contains a detailed description of the AM algorithm as a stochastic process and the theorem on the ergodicity of the AM. The proof is based on an auxiliary result that is proven in Section 4. Finally, in Section 5 we present results from test simulations, where the AM algorithm is compared with traditional Metropolis–Hastings algorithms (Hastings 1970) by applying both linear and nonlinear, correlated and uncorrelated unimodal target distributions. Our tests seem to imply that AM performs at least as well as the traditional algorithms for which a nearly optimal proposal distribution has been given a priori.
2. Description of the algorithm
We assume that our target distribution is supported on the subset S ⊂ R^d, and that it has the (unscaled) density π(x) with respect to the Lebesgue measure on S. With a slight abuse of notation, we shall also denote the target distribution by π.

We now explain how the AM algorithm works. Recall from Section 1 that the basic idea is to update the proposal distribution by using the knowledge we have so far acquired about the target distribution. Otherwise the definition of the algorithm is identical to the usual Metropolis process.
Suppose, therefore, that at time t−1 we have sampled the states X_0, X_1, ..., X_{t−1}, where X_0 is the initial state. Then a candidate point Y is sampled from the (asymptotically symmetric) proposal distribution q_t(· | X_0, ..., X_{t−1}), which now may depend on the whole history (X_0, X_1, ..., X_{t−1}). The candidate point Y is accepted with probability

    α(X_{t−1}, Y) = min( 1, π(Y) / π(X_{t−1}) ),

in which case we set X_t = Y, and otherwise X_t = X_{t−1}. Observe that the chosen probability for the acceptance resembles the familiar acceptance probability of the Metropolis algorithm. However, here the choice for the acceptance probability is not based on symmetry (reversibility) conditions, since these cannot be satisfied in our case: the corresponding stochastic chain is no longer Markovian. For this reason we have to study the exactness of the simulation separately, and we do so in Section 3.
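In code, the accept/reject step just described can be sketched as follows (a minimal illustration of ours, not the authors' code; `pi` is a placeholder for the unscaled target density):

```python
import random

def accept_step(x_prev, y, pi, rng=random):
    """One accept-reject step: the candidate y is accepted with
    probability min(1, pi(y) / pi(x_prev)); otherwise the chain
    stays at x_prev."""
    ratio = pi(y) / pi(x_prev)
    if ratio >= 1.0 or rng.random() < ratio:
        return y        # accept: X_t = Y
    return x_prev       # reject: X_t = X_{t-1}
```

Note that only the ratio of target densities enters, so the normalizing constant of π is never needed.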
The proposal distribution q_t(· | X_0, ..., X_{t−1}) employed in the AM algorithm is a Gaussian distribution with mean at the current point X_{t−1} and covariance C_t = C_t(X_0, ..., X_{t−1}). Note that in the simulation only jumps into S are accepted, since we assume that the target distribution vanishes outside S.
The crucial thing regarding the adaptation is how the covariance of the proposal distribution depends on the history of the chain. In the AM algorithm this is solved by setting C_t = s_d cov(X_0, ..., X_{t−1}) + s_d ε I_d after an initial period, where s_d is a parameter that depends only on the dimension d and ε > 0 is a constant that we may choose very small compared to the size of S. Here I_d denotes the d-dimensional identity matrix. In order to start, we select an arbitrary, strictly positive definite, initial covariance C_0, according to our best prior knowledge (which may be quite poor). We select an index t_0 > 0 for the length of an initial period and define

    C_t = C_0                                       for t ≤ t_0,
    C_t = s_d cov(X_0, ..., X_{t−1}) + s_d ε I_d    for t > t_0.        (1)

The covariance C_t may be viewed as a function of t variables from R^d having values in uniformly positive definite matrices.
Recall the definition of the empirical covariance matrix determined by points x_0, ..., x_k ∈ R^d:

    cov(x_0, ..., x_k) = (1/k) ( Σ_{i=0}^{k} x_i x_i^T − (k+1) x̄_k x̄_k^T ),        (2)

where x̄_k = (1/(k+1)) Σ_{i=0}^{k} x_i and the elements x_i ∈ R^d are considered as column vectors.
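As a quick sanity check on formula (2), the helper below (our own illustrative code) evaluates its right-hand side directly; it agrees with the usual centred sample covariance (1/k) Σ (x_i − x̄_k)(x_i − x̄_k)^T:

```python
def emp_cov(xs):
    """Empirical covariance of points xs = [x_0, ..., x_k] (each a list
    of d coordinates), computed via (2):
    (1/k) * (sum_i x_i x_i^T - (k+1) * xbar_k xbar_k^T)."""
    k = len(xs) - 1
    d = len(xs[0])
    xbar = [sum(x[j] for x in xs) / (k + 1) for j in range(d)]
    return [[(sum(x[a] * x[b] for x in xs) - (k + 1) * xbar[a] * xbar[b]) / k
             for b in range(d)] for a in range(d)]
```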
So one obtains that in definition (1), for t ≥ t_0 + 1, the covariance C_t satisfies the recursion formula

    C_{t+1} = ((t−1)/t) C_t + (s_d/t) ( t X̄_{t−1} X̄_{t−1}^T − (t+1) X̄_t X̄_t^T + X_t X_t^T + ε I_d ).        (3)

This allows one to calculate C_t without too much computational cost, since the mean X̄_t also satisfies an obvious recursion formula.
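The recursion can be checked numerically. The sketch below (our own illustrative code, with hypothetical names) first updates the running mean and then applies (3); the result reproduces the batch value s_d cov(X_0, ..., X_t) + s_d ε I_d:

```python
def update_mean(xbar, x_new, t):
    """Running mean recursion: Xbar_t = (t * Xbar_{t-1} + X_t) / (t + 1)."""
    return [(t * m + xi) / (t + 1) for m, xi in zip(xbar, x_new)]

def update_cov(C, xbar_prev, xbar_new, x_new, t, s_d, eps):
    """Recursion (3): C_{t+1} from C_t, the old mean Xbar_{t-1},
    the new mean Xbar_t and the new state X_t."""
    d = len(x_new)
    return [[(t - 1) / t * C[a][b] + s_d / t * (
                 t * xbar_prev[a] * xbar_prev[b]
                 - (t + 1) * xbar_new[a] * xbar_new[b]
                 + x_new[a] * x_new[b]
                 + (eps if a == b else 0.0))
             for b in range(d)] for a in range(d)]
```

Each update costs O(d^2) regardless of t, which is the point of using the recursion instead of recomputing the covariance from the full history.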
The choice for the length of the initial segment t_0 > 0 is free, but the bigger it is chosen the more slowly the effect of the adaptation is felt. In a sense the size of t_0 reflects our trust in the initial covariance C_0. The role of the parameter ε is just to ensure that C_t will not become singular (see Remark 1 below). As a basic choice for the scaling parameter we have adopted the value s_d = (2.4)^2/d from Gelman et al. (1996), where it was shown that in a certain sense this choice optimizes the mixing properties of the Metropolis search in the case of Gaussian targets and Gaussian proposals.
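Putting Section 2 together, a complete AM run might look like the following sketch. This is our own illustrative NumPy implementation, not the authors' code: `log_pi` is a user-supplied unnormalized log-density, and for clarity the covariance is recomputed from scratch at each step rather than via the recursion (3).

```python
import numpy as np

def am_sampler(log_pi, x0, n_steps, C0, t0=100, eps=1e-6, seed=None):
    """Adaptive Metropolis sketch: Gaussian proposal centred at the
    current state, with covariance C0 during the initial period t <= t0
    and s_d * cov(X_0, ..., X_{t-1}) + s_d * eps * I_d afterwards, per (1)."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    s_d = 2.4 ** 2 / d                      # scaling from Gelman et al. (1996)
    chain = np.empty((n_steps + 1, d))
    chain[0] = x0
    for t in range(1, n_steps + 1):
        if t <= t0:
            C = C0
        else:
            # np.cov divides by t - 1, matching definition (2)
            C = s_d * np.cov(chain[:t].T) + s_d * eps * np.eye(d)
        y = rng.multivariate_normal(chain[t - 1], C)
        # accept with probability min(1, pi(y) / pi(x)), computed in log space
        if np.log(rng.random()) < log_pi(y) - log_pi(chain[t - 1]):
            chain[t] = y
        else:
            chain[t] = chain[t - 1]
    return chain
```

In production code one would maintain C_t and the running mean by the recursion (3) rather than calling np.cov at every step.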
Remark 1. In our test runs the covariance C_t has not had the tendency to degenerate. This has also been the case in our multimodal test examples. However, potential difficulties with ε = 0 (if any) are more likely to appear in the multimodal cases. In practical computations one presumably may utilize definition (1) with ε = 0, although the change is negligible if ε has already been chosen small enough. More importantly, we can prove the correct ergodicity property of the algorithm only under the assumption ε > 0; see Theorem 1 below.
Remark 2. In order to avoid the algorithm starting slowly it is possible to employ special
tricks. Naturally, if a priori knowledge (such as the maximum likelihood value or
approximate covariance of the target distribution) is available, it can be utilized in
choosing the initial state or the initial covariance C
0
. Also, in some cases it is advisable to
employ the greedy start procedure: during a short initial period one updates the proposal
using only the accepted states. Afterwards the AM is run as described above. Moreover,
during the early stage of the algorithm it is natural to require it to move at least a little. If it
has not moved enough in the course of a certain number of iterations, the proposal
distribution could be shrunk by some constant factor.
Remark 3. It is also possible to choose an integer n_0 > 1 and update the covariance every n_0th step only (again using the entire history). This saves computer time when generating the candidate points. There is again a simple recursion formula for the covariances C_t.
3. Ergodicity of the AM chain
In the AP algorithm, which was briefly described in Section 1, the covariance C_t was calculated from the last H states only, where H ≥ 2. This strategy has the undesirable consequence of bringing non-exactness into the simulation. There are several ways to see this. One may, for instance, study the Markov chain consisting of H-tuples of consecutive variables of the AP chain, to obtain the limit distribution for the AP by a suitable projection from the equilibrium distribution of this Markov chain. Simple examples in the case of finite state space for an analogous model show that the limiting distribution of the AP algorithm differs slightly from the target distribution. Numerical calculations in the continuous case indicate similar behaviour. An illustrating example of this phenomenon is presented in the Appendix.
It is our aim in this section to show that the AM algorithm has the right ergodic
properties and hence provides correct simulation of the target distribution.
Let us start by recalling some basic notions of the theory of stochastic processes that are needed later. We first define the set-up. Let (S, B, m) be a state space and denote by M(S) the set of finite measures on (S, B). The norm ||·|| on M(S) denotes the total variation norm. Let n ≥ 1 be a natural number. A map K_n : S^n × B → [0, 1] is a generalized transition probability on the set S if the map x ↦ K_n(x; A) is B^n-measurable for each A ∈ B, where x ∈ S^n, and K_n(x; ·) is a probability measure on (S, B) for each x ∈ S^n. In a natural way K_n defines a positive contraction from M(S^n) into M(S). A transition probability on S corresponds to the case n = 1 in the above definition.
We assume that a sequence of generalized transition probabilities (K_n)_{n=1}^∞ is given. Moreover, let μ_0 be a probability distribution (the initial distribution) on S. Then the sequence (K_n) and μ_0 determine uniquely the finite-dimensional distributions of the discrete-time stochastic process (chain) (X_n)_{n=0}^∞ on S via the formula

    P(X_0 ∈ A_0, X_1 ∈ A_1, ..., X_n ∈ A_n)
        = ∫_{y_0 ∈ A_0} μ_0(dy_0) ∫_{y_1 ∈ A_1} K_1(y_0; dy_1) ∫_{y_2 ∈ A_2} K_2(y_0, y_1; dy_2)
          ··· ∫_{y_n ∈ A_n} K_n(y_0, y_1, ..., y_{n−1}; dy_n).        (4)

In fact, it is directly verified that these distributions are consistent, and the theorem of Ionescu Tulcea (see Proposition V.1.1 of Neveu 1965) yields the existence of the chain (X_n) on S satisfying (4).
We shall now turn to the exact definition of the AM chain as a discrete-time stochastic process. We assume that the target distribution is supported on a bounded subset S ⊂ R^d, so that π(x) = 0 outside S. Thus we shall choose S to be our state space, equipped with the Borel σ-algebra B(S), and choose m to be the normalized Lebesgue measure on S. The target π has the (unscaled) density π(x) with respect to the Lebesgue measure on S. We also assume that the density is bounded from above on S: for some M < ∞ we have

    π(x) ≤ M    for x ∈ S.        (5)

Let C be a symmetric and strictly positive definite matrix on R^d and denote by N_C the density of the mean-zero Gaussian distribution on R^d with covariance C. Thus

    N_C(x) = (2π)^{−d/2} |C|^{−1/2} exp( −(1/2) x^T C^{−1} x ).        (6)
The Gaussian proposal transition probability corresponding to the covariance C satisfies

References

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953) Equation of state calculations by fast computing machines.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (eds) (1996) Markov Chain Monte Carlo in Practice.
Tierney, L. (1994) Markov chains for exploring posterior distributions.