Bernoulli, Vol. 7, No. 2 (April 2001), pp. 223–242. ISSN 1350–7265 © 2001 ISI/BS

An adaptive Metropolis algorithm

HEIKKI HAARIO^1, EERO SAKSMAN^1 and JOHANNA TAMMINEN^2

^1 Department of Mathematics, P.O. Box 4 (Yliopistonkatu 5), FIN-00014 University of Helsinki, Finland. E-mail: heikki.haario@helsinki.fi; eero.saksman@helsinki.fi
^2 Finnish Meteorological Institute, Geophysics Research, P.O. Box 503, FIN-00101 Helsinki, Finland. E-mail: johanna.tamminen@fmi.fi
A proper choice of a proposal distribution for Markov chain Monte Carlo methods, for example for the Metropolis–Hastings algorithm, is well known to be a crucial factor for the convergence of the algorithm. In this paper we introduce an adaptive Metropolis (AM) algorithm, where the Gaussian proposal distribution is updated along the process using the full information cumulated so far. Due to the adaptive nature of the process, the AM algorithm is non-Markovian, but we establish here that it has the correct ergodic properties. We also include the results of our numerical tests, which indicate that the AM algorithm competes well with traditional Metropolis–Hastings algorithms, and demonstrate that the AM algorithm is easy to use in practical computation.

Keywords: adaptive Markov chain Monte Carlo; comparison; convergence; ergodicity; Markov chain Monte Carlo; Metropolis–Hastings algorithm
1. Introduction
It is generally acknowledged that the choice of an effective proposal distribution for the random walk Metropolis algorithm, for example, is essential in order to obtain reasonable results by simulation in a limited amount of time. This choice concerns both the size and the spatial orientation of the proposal distribution, which are often very difficult to choose well since the target density is unknown (see Gelman et al. 1996; Gilks et al. 1995; 1998; Haario et al. 1999; Roberts et al. 1997). A possible remedy is provided by adaptive algorithms, which use the history of the process in order to 'tune' the proposal distribution suitably. This has previously been done (for instance) by assuming that the state space contains an atom. The adaptation is performed only at the times of recurrence to the atom in order to preserve the right ergodic properties (Gilks et al. 1998). The adaptation criteria are then obtained by monitoring the acceptance rate. A related and interesting self-regenerative version of adaptive Markov chain Monte Carlo (MCMC), based on introducing an auxiliary chain, is contained in the recent preprint of Sahu and Zhigljavsky (1999). For other versions of adaptive MCMC and related work, we refer to Evans (1991), Fishman (1996), Gelfand and Sahu (1994), Gilks and Roberts (1995) and Gilks et al. (1994), together with the references therein.
We introduce here an adaptive Metropolis (AM) algorithm which adapts continuously to the target distribution. Significantly, the adaptation affects both the size and the spatial orientation of the proposal distribution. Moreover, the new algorithm is straightforward to implement and use in practice. The definition of the AM algorithm is based on the classical
random walk Metropolis algorithm (Metropolis et al. 1953) and its modification, the AP algorithm, introduced in Haario et al. (1999). In the AP algorithm the proposal distribution is a Gaussian distribution centred on the current state, and the covariance is calculated from a fixed finite number of previous states. In the AM algorithm the covariance of the proposal distribution is calculated using all of the previous states. The method is easily implemented with no extra computational cost, since one may apply a simple recursion formula for the covariances involved.
An important advantage of the AM algorithm is that it starts using the cumulating information right at the beginning of the simulation. The rapid start of the adaptation ensures that the search becomes more effective at an early stage of the simulation, which diminishes the number of function evaluations needed.
To be more exact, assume that at time t the already sampled states of the AM chain are X_0, X_1, ..., X_t, some of which may be multiple. The new proposal distribution for the next candidate point is then a Gaussian distribution with mean at the current point X_t and covariance given by s_d R, where R is the covariance matrix determined by the spatial distribution of the states X_0, X_1, ..., X_t ∈ R^d. The scaling parameter s_d depends only on the dimension d of the vectors. This adaptation strategy forces the proposal distribution to approach an appropriately scaled Gaussian approximation of the target distribution, which increases the efficiency of the simulation. A more detailed description of the algorithm is given in Section 2 below.
One of the difficulties in constructing adaptive MCMC algorithms is to ensure that the algorithm maintains the correct ergodicity properties. We observe here (see also Haario et al. 1999) that the AP algorithm does not possess this property. Our main result, Theorem 1 below, verifies that the AM process does indeed have the correct ergodicity properties, assuming that the target density is bounded from above and has a bounded support. The AM chain is not Markovian, but we show that the asymptotic dependence between the elements of the chain is weak enough to apply known laws of large numbers for mixingales; see McLeish (1975) and (29) below for this notion. Similar results may also be proven for variants of the algorithm, where the covariance is computed from a suitably increasing segment of the near history.
Section 3 contains a detailed description of the AM algorithm as a stochastic process and the theorem on the ergodicity of the AM. The proof is based on an auxiliary result that is proven in Section 4. Finally, in Section 5 we present results from test simulations, where the AM algorithm is compared with traditional Metropolis–Hastings algorithms (Hastings 1970) by applying both linear and nonlinear, correlated and uncorrelated unimodal target distributions. Our tests seem to imply that AM performs at least as well as the traditional algorithms for which a nearly optimal proposal distribution has been given a priori.
2. Description of the algorithm
We assume that our target distribution is supported on the subset S ⊂ R^d, and that it has the (unscaled) density π(x) with respect to the Lebesgue measure on S. With a slight abuse of notation, we shall also denote the target distribution by π.

We now explain how the AM algorithm works. Recall from Section 1 that the basic idea is to update the proposal distribution by using the knowledge we have so far acquired about the target distribution. Otherwise the definition of the algorithm is identical to the usual Metropolis process.
Suppose, therefore, that at time t−1 we have sampled the states X_0, X_1, ..., X_{t−1}, where X_0 is the initial state. Then a candidate point Y is sampled from the (asymptotically symmetric) proposal distribution q_t(· | X_0, ..., X_{t−1}), which now may depend on the whole history (X_0, X_1, ..., X_{t−1}). The candidate point Y is accepted with probability

    α(X_{t−1}, Y) = min( 1, π(Y) / π(X_{t−1}) ),

in which case we set X_t = Y, and otherwise X_t = X_{t−1}. Observe that the chosen probability for the acceptance resembles the familiar acceptance probability of the Metropolis algorithm. However, here the choice for the acceptance probability is not based on symmetry (reversibility) conditions, since these cannot be satisfied in our case: the corresponding stochastic chain is no longer Markovian. For this reason we have to study the exactness of the simulation separately, and we do so in Section 3.
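In code, the accept/reject step just described can be sketched as follows (a minimal illustration of ours, not the authors' code; `pi` is a placeholder for the unscaled target density):

```python
import random

def accept_step(x_prev, y, pi, rng=random):
    """One accept-reject step: the candidate y is accepted with
    probability min(1, pi(y) / pi(x_prev)); otherwise the chain
    stays at x_prev."""
    ratio = pi(y) / pi(x_prev)
    if ratio >= 1.0 or rng.random() < ratio:
        return y        # accept: X_t = Y
    return x_prev       # reject: X_t = X_{t-1}
```

Note that only the ratio of target densities enters, so the normalizing constant of π is never needed.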
The proposal distribution q_t(· | X_0, ..., X_{t−1}) employed in the AM algorithm is a Gaussian distribution with mean at the current point X_{t−1} and covariance C_t = C_t(X_0, ..., X_{t−1}). Note that in the simulation only jumps into S are accepted, since we assume that the target distribution vanishes outside S.
The crucial thing regarding the adaptation is how the covariance of the proposal distribution depends on the history of the chain. In the AM algorithm this is solved by setting C_t = s_d cov(X_0, ..., X_{t−1}) + s_d ε I_d after an initial period, where s_d is a parameter that depends only on the dimension d and ε > 0 is a constant that we may choose very small compared to the size of S. Here I_d denotes the d-dimensional identity matrix. In order to start, we select an arbitrary, strictly positive definite, initial covariance C_0, according to our best prior knowledge (which may be quite poor). We select an index t_0 > 0 for the length of an initial period and define

    C_t = C_0                                       for t ≤ t_0,
    C_t = s_d cov(X_0, ..., X_{t−1}) + s_d ε I_d    for t > t_0.        (1)

The covariance C_t may be viewed as a function of t variables from R^d having values in uniformly positive definite matrices.
Recall the definition of the empirical covariance matrix determined by points x_0, ..., x_k ∈ R^d:

    cov(x_0, ..., x_k) = (1/k) ( Σ_{i=0}^{k} x_i x_i^T − (k+1) x̄_k x̄_k^T ),        (2)

where x̄_k = (1/(k+1)) Σ_{i=0}^{k} x_i and the elements x_i ∈ R^d are considered as column vectors.
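As a quick sanity check on formula (2), the helper below (our own illustrative code) evaluates its right-hand side directly; it agrees with the usual centred sample covariance (1/k) Σ (x_i − x̄_k)(x_i − x̄_k)^T:

```python
def emp_cov(xs):
    """Empirical covariance of points xs = [x_0, ..., x_k] (each a list
    of d coordinates), computed via (2):
    (1/k) * (sum_i x_i x_i^T - (k+1) * xbar_k xbar_k^T)."""
    k = len(xs) - 1
    d = len(xs[0])
    xbar = [sum(x[j] for x in xs) / (k + 1) for j in range(d)]
    return [[(sum(x[a] * x[b] for x in xs) - (k + 1) * xbar[a] * xbar[b]) / k
             for b in range(d)] for a in range(d)]
```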
So one obtains that in definition (1), for t ≥ t_0 + 1, the covariance C_t satisfies the recursion formula

    C_{t+1} = ((t−1)/t) C_t + (s_d/t) ( t X̄_{t−1} X̄_{t−1}^T − (t+1) X̄_t X̄_t^T + X_t X_t^T + ε I_d ).        (3)

This allows one to calculate C_t without too much computational cost, since the mean X̄_t also satisfies an obvious recursion formula.
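The recursion can be checked numerically. The sketch below (our own illustrative code, with hypothetical names) first updates the running mean and then applies (3); the result reproduces the batch value s_d cov(X_0, ..., X_t) + s_d ε I_d:

```python
def update_mean(xbar, x_new, t):
    """Running mean recursion: Xbar_t = (t * Xbar_{t-1} + X_t) / (t + 1)."""
    return [(t * m + xi) / (t + 1) for m, xi in zip(xbar, x_new)]

def update_cov(C, xbar_prev, xbar_new, x_new, t, s_d, eps):
    """Recursion (3): C_{t+1} from C_t, the old mean Xbar_{t-1},
    the new mean Xbar_t and the new state X_t."""
    d = len(x_new)
    return [[(t - 1) / t * C[a][b] + s_d / t * (
                 t * xbar_prev[a] * xbar_prev[b]
                 - (t + 1) * xbar_new[a] * xbar_new[b]
                 + x_new[a] * x_new[b]
                 + (eps if a == b else 0.0))
             for b in range(d)] for a in range(d)]
```

Each update costs O(d^2) regardless of t, which is the point of using the recursion instead of recomputing the covariance from the full history.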
The choice for the length of the initial segment t_0 > 0 is free, but the bigger it is chosen the more slowly the effect of the adaptation is felt. In a sense the size of t_0 reflects our trust in the initial covariance C_0. The role of the parameter ε is just to ensure that C_t will not become singular (see Remark 1 below). As a basic choice for the scaling parameter we have adopted the value s_d = (2.4)^2/d from Gelman et al. (1996), where it was shown that in a certain sense this choice optimizes the mixing properties of the Metropolis search in the case of Gaussian targets and Gaussian proposals.
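Putting Section 2 together, a complete AM run might look like the following sketch. This is our own illustrative NumPy implementation, not the authors' code: `log_pi` is a user-supplied unnormalized log-density, and for clarity the covariance is recomputed from scratch at each step rather than via the recursion (3).

```python
import numpy as np

def am_sampler(log_pi, x0, n_steps, C0, t0=100, eps=1e-6, seed=None):
    """Adaptive Metropolis sketch: Gaussian proposal centred at the
    current state, with covariance C0 during the initial period t <= t0
    and s_d * cov(X_0, ..., X_{t-1}) + s_d * eps * I_d afterwards, per (1)."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    s_d = 2.4 ** 2 / d                      # scaling from Gelman et al. (1996)
    chain = np.empty((n_steps + 1, d))
    chain[0] = x0
    for t in range(1, n_steps + 1):
        if t <= t0:
            C = C0
        else:
            # np.cov divides by t - 1, matching definition (2)
            C = s_d * np.cov(chain[:t].T) + s_d * eps * np.eye(d)
        y = rng.multivariate_normal(chain[t - 1], C)
        # accept with probability min(1, pi(y) / pi(x)), computed in log space
        if np.log(rng.random()) < log_pi(y) - log_pi(chain[t - 1]):
            chain[t] = y
        else:
            chain[t] = chain[t - 1]
    return chain
```

In production code one would maintain C_t and the running mean by the recursion (3) rather than calling np.cov at every step.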
Remark 1. In our test runs the covariance C_t has not had the tendency to degenerate. This has also been the case in our multimodal test examples. However, potential difficulties with ε = 0 (if any) are more likely to appear in the multimodal cases. In practical computations one presumably may utilize definition (1) with ε = 0, although the change is negligible if ε has already been chosen small enough. More importantly, we can prove the correct ergodicity property of the algorithm only under the assumption ε > 0; see Theorem 1 below.
Remark 2. In order to avoid the algorithm starting slowly it is possible to employ special
tricks. Naturally, if a priori knowledge (such as the maximum likelihood value or
approximate covariance of the target distribution) is available, it can be utilized in
choosing the initial state or the initial covariance C
0
. Also, in some cases it is advisable to
employ the greedy start procedure: during a short initial period one updates the proposal
using only the accepted states. Afterwards the AM is run as described above. Moreover,
during the early stage of the algorithm it is natural to require it to move at least a little. If it
has not moved enough in the course of a certain number of iterations, the proposal
distribution could be shrunk by some constant factor.
Remark 3. It is also possible to choose an integer n_0 > 1 and update the covariance every n_0th step only (again using the entire history). This saves computer time when generating the candidate points. There is again a simple recursion formula for the covariances C_t.
3. Ergodicity of the AM chain
In the AP algorithm, which was briefly described in Section 1, the covariance C_t was calculated from the last H states only, where H ≥ 2. This strategy has the undesirable consequence of bringing non-exactness into the simulation. There are several ways to see this. One may, for instance, study the Markov chain consisting of H-tuples of consecutive variables of the AP chain, to obtain the limit distribution for the AP by a suitable projection from the equilibrium distribution of this Markov chain. Simple examples in the case of finite state space for an analogous model show that the limiting distribution of the AP algorithm differs slightly from the target distribution. Numerical calculations in the continuous case indicate similar behaviour. An illustrating example of this phenomenon is presented in the Appendix.
It is our aim in this section to show that the AM algorithm has the right ergodic
properties and hence provides correct simulation of the target distribution.
Let us start by recalling some basic notions of the theory of stochastic processes that are needed later. We first define the set-up. Let (S, B, m) be a state space and denote by M(S) the set of finite measures on (S, B). The norm ||·|| on M(S) denotes the total variation norm. Let n ≥ 1 be a natural number. A map K_n : S^n × B → [0, 1] is a generalized transition probability on the set S if the map x ↦ K_n(x; A) is B^n-measurable for each A ∈ B, where x ∈ S^n, and K_n(x; ·) is a probability measure on (S, B) for each x ∈ S^n. In a natural way K_n defines a positive contraction from M(S^n) into M(S). A transition probability on S corresponds to the case n = 1 in the above definition.
We assume that a sequence of generalized transition probabilities (K_n)_{n=1}^∞ is given. Moreover, let μ_0 be a probability distribution (the initial distribution) on S. Then the sequence (K_n) and μ_0 determine uniquely the finite-dimensional distributions of the discrete-time stochastic process (chain) (X_n)_{n=0}^∞ on S via the formula

    P(X_0 ∈ A_0, X_1 ∈ A_1, ..., X_n ∈ A_n)
        = ∫_{y_0 ∈ A_0} μ_0(dy_0) ∫_{y_1 ∈ A_1} K_1(y_0; dy_1) ∫_{y_2 ∈ A_2} K_2(y_0, y_1; dy_2)
          ··· ∫_{y_n ∈ A_n} K_n(y_0, y_1, ..., y_{n−1}; dy_n).        (4)

In fact, it is directly verified that these distributions are consistent, and the theorem of Ionescu Tulcea (see Proposition V.1.1 of Neveu 1965) yields the existence of the chain (X_n) on S satisfying (4).
We shall now turn to the exact definition of the AM chain as a discrete-time stochastic process. We assume that the target distribution is supported on a bounded subset S ⊂ R^d, so that π(x) = 0 outside S. Thus we shall choose S to be our state space, equipped with the Borel σ-algebra B(S), and choose m to be the normalized Lebesgue measure on S. The target π has the (unscaled) density π(x) with respect to the Lebesgue measure on S. We also assume that the density is bounded from above on S: for some M < ∞ we have

    π(x) ≤ M    for x ∈ S.        (5)

Let C be a symmetric and strictly positive definite matrix on R^d and denote by N_C the density of the mean-zero Gaussian distribution on R^d with covariance C. Thus

    N_C(x) = (2π)^{−d/2} |C|^{−1/2} exp( −(1/2) x^T C^{−1} x ).        (6)
The Gaussian proposal transition probability corresponding to the covariance C satisfies

References

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953) Equation of state calculations by fast computing machines.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (eds) (1996) Markov Chain Monte Carlo in Practice.
Tierney, L. (1994) Markov chains for exploring posterior distributions.