IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. XX, NO. XX, 200X 1
Smoothing Algorithms for State-Space Models
Mark Briers, Arnaud Doucet, and Simon Maskell
Abstract
A prevalent problem in statistical signal processing, applied statistics, and time series analysis is the calculation of the smoothed
posterior distribution, which describes the uncertainty associated with a state, or a sequence of states, conditional on data from the
past, the present, and the future. The aim of this paper is to provide a rigorous foundation for the calculation, or approximation, of
such smoothed distributions, to facilitate a robust and efficient implementation. Through a cohesive and generic exposition of the
scientific literature we offer several novel extensions such that one can perform smoothing in the most general case. Experimental
results for: a Jump Markov Linear System; a comparison of particle smoothing methods; and parameter estimation using a particle
implementation of the EM algorithm, are provided.
Index Terms
State space smoothing, Hidden Markov Model, Kalman filter, Kalman smoother, Jump Markov Linear System, Particle filter,
Particle smoother, Parameter estimation.
I. INTRODUCTION
One is often interested in quantifying the uncertainty associated with an unobserved variable, X, given some information pertaining to that variable, Y. In a sequential context, this relates to the calculation of the posterior density,^1 p(x_{1:T} | y_{1:T}), where x_{1:T} = {x_1, . . . , x_T} and y_{1:T} = {y_1, . . . , y_T} are generic points in the path space of the signal and observation processes, and the discrete time index t ∈ N^+. To facilitate sequential estimation, a first-order Markovian state space model is often assumed [46]. Furthermore, the observations are assumed to be conditionally independent given the state X_t = x_t. One thus has mathematical expressions for the state transition and observation densities, as follows:

X_t | X_{t-1} = x_{t-1} ~ f(· | x_{t-1});   (1)
Y_t | X_t = x_t ~ g(· | x_t),   (2)

with an appropriately defined prior density, X_1 ~ µ(·).
For many state-space applications (e.g. tracking [4]) it is more convenient to compute the marginal posterior distribution of the state at a particular time instance conditional on a sequence of data, p(x_t | y_{1:l}). If l < t then this process is known as prediction; if l = t then it is commonly referred to as filtering; and if l > t then one is conducting the process of smoothing.
The main objective of this article is to present a generic exposition of the vast array of literature available in the scientific research community that attempts to solve the 'smoothing' problem. Moreover, we are able to offer several novel improvement strategies for the techniques available by identifying and solving problems which were previously ignored. We illustrate these novel algorithms on several examples.
The smoothing problem is commonly segmented into three problems: fixed-interval smoothing, where one is interested in calculating p(x_t | y_{1:T}) for all time indices t = 1, . . . , T; fixed-lag smoothing, where one calculates p(x_t | y_{1:t+L}), where L > 0 is some fixed value (this density is calculated on-line); and fixed-point smoothing, where p(x_t | y_{1:D}) is calculated for a fixed value t with D > t increasing.
The most common application of smoothing is the (off-line) fixed-interval smoothing algorithm. The authors assert that it
is possible to employ a single smoothing scheme, based on fixed-interval smoothing, to solve all three problems. Details
of this proposition are now provided.
Several schemes have been suggested to solve the fixed-lag smoothing problem, with a favoured technique being the augmentation of the states over the lag L. In general, however, one then has a computational cost which is exponential in the dimension of the state space, n_x. A more suitable technique would be to store the latest filtering distribution p(x_{T-1} | y_{1:T-1}), and calculate the new filtering distribution p(x_T | y_{1:T}) on the receipt of a new measurement. One is then able to employ a (traditionally considered) fixed-interval smoothing algorithm to calculate p(x_{T-L} | y_{1:T}). This gives rise to a computational cost which is linear in the length of the lag.
A similar situation occurs with fixed point smoothing, in that it is possible to calculate a distribution p(x_t | y_{1:D}) on the receipt of a new measurement by calculating a new filtering distribution at the latest time instance and stepping back through the lag. However, one can calculate a revised smoothed distribution without the need for recursing through the intermediate history of the states, using the same fixed-interval type smoothing algorithm repository, by calculating a distribution over the joint state (X_t, X_D). Note that while this method is computationally attractive and allows the methodology to be presented in a generic framework, particle based methods for fixed point smoothing would exhibit degeneracy as the number of data increased [11]. However, in most cases, the fixed-lag smoothing distribution converges exponentially fast to a fixed distribution, so it is possible to stop including new data.

Manuscript received XX; revised XX.
M. Briers is with the Advanced Signal and Information Processing group, QinetiQ Ltd, Malvern, UK, and Cambridge University Engineering Department, Cambridge, UK. Email: m.briers@signal.qinetiq.com, or mb511@eng.cam.ac.uk
A. Doucet is with Cambridge University Engineering Department, Cambridge, UK. Email: ad2@eng.cam.ac.uk
S. Maskell is with the Advanced Signal and Information Processing group, QinetiQ Ltd, Malvern, UK. Email: s.maskell@signal.qinetiq.com
^1 Appropriate nomenclature and notation can be substituted for the discrete analogue.
It is noted that one is also often interested in calculating the joint smoothing distribution, particularly in the fixed-interval (p(x_{1:T} | y_{1:T})) and fixed-lag (p(x_{T-L:T} | y_{1:T})) scenarios, and we are able to perform such calculations by exploiting the sequential nature of the problem, which is discussed in the sequel.
The format of the paper is as follows: the following section identifies the filtering and smoothing recursions for general state
space models and suggests a rigorous framework on which one can base smoothing algorithms. The subsequent five sections
then discuss the algorithmic implementation of such recursions under differing scenarios: finite state space hidden Markov
models, linear Gaussian state space models, jump Markov linear systems, analytic approximations for general state models,
and sequential Monte Carlo approximations for general state space models, respectively. Simulation results are presented in
Section VIII, with conclusions drawn in Section IX.
II. FILTERING AND SMOOTHING FOR GENERAL STATE SPACE MODELS
The derivations of the recursions used in the filtering, forward-backward smoothing, and two-filter smoothing procedures for
general state space models are now described. Note that by replacing the integrals with summations, and the notation used to
denote a probability density function with an appropriate term to denote a probability mass function, one arrives at expressions
for both the continuous and discrete state space case.
A. Filtering
Filtering is the term used to describe the process of recursively calculating the marginal posterior distribution, and is a prerequisite in all of the algorithms discussed herein. We therefore summarise the filtering procedure in each of the subsequent sections. The filtering recursion can be derived as follows:

p(x_t | y_{1:t}) = p(y_t | x_t, y_{1:t-1}) p(x_t | y_{1:t-1}) / p(y_t | y_{1:t-1})
               ∝ g(y_t | x_t) p(x_t | y_{1:t-1}),   (3)

where:

p(x_t | y_{1:t-1}) = ∫ f(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}.
One can see that this algorithm is recursive in nature; equation (3) relies solely upon the state transition and observation
densities, and the posterior density from the previous time instance.
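As an illustrative aside (not part of the original exposition), recursion (3) can be approximated by a bootstrap particle filter of the kind developed in the SMC sections later in the paper. The scalar linear-Gaussian model, its parameter values, and all function names below are assumptions chosen only so that the sketch is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar model (an assumption chosen for illustration):
#   X_t = a X_{t-1} + V_t,  V_t ~ N(0, q)   (transition density f)
#   Y_t = X_t + W_t,        W_t ~ N(0, r)   (observation density g)
a, q, r = 0.9, 1.0, 1.0

def simulate(T):
    """Draw one state trajectory and its observations from the model."""
    x, y = np.zeros(T), np.zeros(T)
    x[0] = rng.normal(0.0, 1.0)                      # X_1 ~ mu = N(0, 1)
    y[0] = x[0] + rng.normal(0.0, np.sqrt(r))
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.normal(0.0, np.sqrt(q))
        y[t] = x[t] + rng.normal(0.0, np.sqrt(r))
    return x, y

def bootstrap_filter(y, n=2000):
    """Approximate p(x_t | y_{1:t}) of eq. (3) by weighted particles."""
    T = len(y)
    means = np.zeros(T)
    particles = rng.normal(0.0, 1.0, n)              # sample from mu
    for t in range(T):
        if t > 0:
            # prediction: propagate particles through f(x_t | x_{t-1})
            particles = a * particles + rng.normal(0.0, np.sqrt(q), n)
        # update: weight each particle by the likelihood g(y_t | x_t)
        logw = -0.5 * (y[t] - particles) ** 2 / r
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = np.sum(w * particles)
        # multinomial resampling to avoid weight degeneracy
        particles = rng.choice(particles, size=n, p=w)
    return means

x, y = simulate(50)
est = bootstrap_filter(y)
mse = np.mean((est - x) ** 2)
```

With a few thousand particles, the estimated filtering means track the hidden states closely for this model.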
B. Forward-backward smoothing
It is possible to deduce the 'standard' forward-backward recursive expression for the marginal smoothed posterior distribution, p(x_t | y_{1:T}), as follows [1]:

p(x_t | y_{1:T}) = ∫ p(x_t, x_{t+1} | y_{1:T}) dx_{t+1}
               = ∫ p(x_{t+1} | y_{1:T}) p(x_t | x_{t+1}, y_{1:T}) dx_{t+1}
               = ∫ p(x_{t+1} | y_{1:T}) p(x_t | x_{t+1}, y_{1:t}) dx_{t+1}
               = p(x_t | y_{1:t}) ∫ [p(x_{t+1} | y_{1:T}) f(x_{t+1} | x_t) / p(x_{t+1} | y_{1:t})] dx_{t+1}.   (4)

It is thus possible to compute the filtered and predicted distributions in a forward (filtering) recursion of the algorithm (by calculating p(x_t | y_{1:t})), and then execute a backward recursion, with each smoothed distribution (p(x_t | y_{1:T})) relying upon the quantities calculated and the previous (in reverse time) smoothed distribution (p(x_{t+1} | y_{1:T})).
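In the linear Gaussian case (treated later in the paper), recursion (4) reduces to the well-known Rauch-Tung-Striebel smoother. The following sketch, with an assumed scalar model and parameter values, illustrates the forward (filtering) and backward (smoothing) passes:

```python
import numpy as np

# Linear-Gaussian special case of recursion (4): for X_t = a X_{t-1} + V_t,
# Y_t = X_t + W_t with Gaussian noise, (4) reduces to the RTS smoother.
# The scalar model below is an assumption chosen for illustration.
a, q, r = 0.9, 1.0, 1.0

def kalman_filter(y, m0=0.0, p0=1.0):
    """Forward pass: predicted and filtered Gaussian moments."""
    T = len(y)
    m = np.zeros(T)    # filtered means   E[x_t | y_{1:t}]
    p = np.zeros(T)    # filtered variances
    mp = np.zeros(T)   # predicted means  E[x_t | y_{1:t-1}]
    pp = np.zeros(T)   # predicted variances
    for t in range(T):
        mp[t] = a * m[t - 1] if t > 0 else m0
        pp[t] = a * a * p[t - 1] + q if t > 0 else p0
        k = pp[t] / (pp[t] + r)            # Kalman gain
        m[t] = mp[t] + k * (y[t] - mp[t])
        p[t] = (1.0 - k) * pp[t]
    return m, p, mp, pp

def rts_smoother(m, p, mp, pp):
    """Backward pass of (4) in closed form (RTS recursions)."""
    T = len(m)
    ms, ps = m.copy(), p.copy()            # smoothed moments, init at t = T
    for t in range(T - 2, -1, -1):
        c = p[t] * a / pp[t + 1]           # smoother gain
        ms[t] = m[t] + c * (ms[t + 1] - mp[t + 1])
        ps[t] = p[t] + c * c * (ps[t + 1] - pp[t + 1])
    return ms, ps

y = np.array([0.5, 1.2, -0.3, 0.8, 0.1])
m, p, mp, pp = kalman_filter(y)
ms, ps = rts_smoother(m, p, mp, pp)
```

Note that the smoothed and filtered quantities coincide at the final time index, and the smoothed variances never exceed the filtered ones, as expected from conditioning on more data.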

C. Two-filter smoothing
One can compute the marginal smoothed posterior distribution by combining the output of two (independent) filters, one that recurses in the forwards time direction and calculates p(x_t | y_{1:t-1}), the other in the backward time direction calculating p(y_{t:T} | x_t), to give the required distribution p(x_t | y_{1:T}) [6], [31], [32]. This is constructed as follows:

p(x_t | y_{1:T}) = p(x_t | y_{1:t-1}, y_{t:T})
               = p(x_t | y_{1:t-1}) p(y_{t:T} | y_{1:t-1}, x_t) / p(y_{t:T} | y_{1:t-1})
               ∝ p(x_t | y_{1:t-1}) p(y_{t:T} | x_t)   (5)
               ∝ p(x_t | y_{1:t}) p(y_{t+1:T} | x_t).
This form of smoothing reduces the complexity of the smoothing procedure for certain modelling assumptions.
A Backward Information Filter [37] can be used to calculate p(y_{t:T} | x_t) sequentially from p(y_{t+1:T} | x_{t+1}):

p(y_{t:T} | x_t) = ∫ p(y_t, y_{t+1:T}, x_{t+1} | x_t) dx_{t+1}
               = ∫ p(y_{t+1:T} | x_{t+1}) f(x_{t+1} | x_t) g(y_t | x_t) dx_{t+1}.   (6)
Note that p(y_{t:T} | x_t) is not a probability density function (pdf) in the argument x_t and thus its integral over x_t might not be finite. However, it is often more convenient in practice to propagate a pdf. Moreover, particle based methods can only be used to approximate finite measures. To cater for these cases, and thus ensure that the propagated quantity is integrable over x_t, we introduce an artificial prior distribution over x_t, denoted γ_t(x_t). This concept is now described.
Given the initial condition:

p̃(x_T | y_T) = g(y_T | x_T) γ_T(x_T) / p(y_T)

and defining the sequence of probability distributions:

p̃(x_{t:T} | y_{t:T}) ∝ γ_t(x_t) ∏_{i=t+1}^{T} f(x_i | x_{i-1}) ∏_{i=t}^{T} g(y_i | x_i),

where t < T, one can see that:

p(y_{t:T} | x_t) = ∫ · · · ∫ p(y_{t:T}, x_{t+1:T} | x_t) dx_{t+1:T}
 = ∫ · · · ∫ p(x_{t+1:T} | x_t) p(y_{t:T} | x_{t:T}) dx_{t+1:T}
 = ∫ · · · ∫ ∏_{i=t+1}^{T} f(x_i | x_{i-1}) ∏_{i=t}^{T} g(y_i | x_i) dx_{t+1:T}
 = ∫ · · · ∫ [γ_t(x_t) / γ_t(x_t)] ∏_{i=t+1}^{T} f(x_i | x_{i-1}) ∏_{i=t}^{T} g(y_i | x_i) dx_{t+1:T}
 ∝ ∫ · · · ∫ [p̃(x_{t:T} | y_{t:T}) / γ_t(x_t)] dx_{t+1:T}
 = p̃(x_t | y_{t:T}) / γ_t(x_t).   (7)
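Because every integral becomes a finite sum on a discrete state space, identity (7) can be verified directly by path enumeration. In the sketch below, the two-state model, the likelihood values, and the choice of γ are all illustrative assumptions:

```python
import numpy as np
from itertools import product

# Brute-force check of identity (7) on a tiny two-state chain: the backward
# information quantity p(y_{t:T} | x_t), obtained by enumerating every path
# x_{t+1:T}, is proportional to p_tilde(x_t | y_{t:T}) / gamma_t(x_t).
# Transition matrix, likelihoods, and gamma are assumptions for illustration.
nx, T, t0 = 2, 4, 1                      # check at time index t0 (0-based)
F = np.array([[0.7, 0.3],
              [0.4, 0.6]])               # F[i, j] = f(x_{k+1} = j | x_k = i)
G = np.array([[0.9, 0.2],
              [0.3, 0.8],
              [0.5, 0.5],
              [0.1, 0.6]])               # G[k, j] = g(y_k | x_k = j)
gamma = np.array([0.3, 0.7])             # artificial prior gamma_{t0}(x_{t0})

def path_weight(states, with_gamma):
    """Product of f and g terms along a path x_{t0:T}, optionally x gamma."""
    w = gamma[states[0]] if with_gamma else 1.0
    w *= G[t0, states[0]]
    for k in range(1, len(states)):
        w *= F[states[k - 1], states[k]] * G[t0 + k, states[k]]
    return w

lhs = np.zeros(nx)                       # p(y_{t0:T} | x_{t0}), eq. (6) unrolled
rhs = np.zeros(nx)                       # marginal of p_tilde(x_{t0:T} | y_{t0:T})
for i in range(nx):
    for path in product(range(nx), repeat=T - 1 - t0):
        lhs[i] += path_weight((i,) + path, with_gamma=False)
        rhs[i] += path_weight((i,) + path, with_gamma=True)
rhs /= rhs.sum()                         # normalise the p_tilde marginal
rhs /= gamma                             # divide out the artificial prior, eq. (7)
ratio = lhs / rhs                        # constant across states iff (7) holds
```

The ratio between the two sides is the same for every state, confirming the proportionality in (7) for an arbitrary strictly positive γ.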
This derivation allows one to construct a filtering-like recursion to determine the information quantity, whilst ensuring that one propagates finite measures:

p(y_{t:T} | x_t) = ∫ p(y_{t+1:T} | x_{t+1}) f(x_{t+1} | x_t) g(y_t | x_t) dx_{t+1}
 ∝ ∫ [p̃(x_{t+1} | y_{t+1:T}) / γ_{t+1}(x_{t+1})] f(x_{t+1} | x_t) g(y_t | x_t) dx_{t+1}
 = [g(y_t | x_t) / γ_t(x_t)] ∫ p̃(x_{t+1} | y_{t+1:T}) [f(x_{t+1} | x_t) γ_t(x_t) / γ_{t+1}(x_{t+1})] dx_{t+1}.

1) Prediction: The “prediction” stage can be defined as follows:

p̃(x_t | y_{t+1:T}) ≜ ∫ p̃(x_{t+1} | y_{t+1:T}) [f(x_{t+1} | x_t) γ_t(x_t) / γ_{t+1}(x_{t+1})] dx_{t+1}.   (8)
Note that (8) is only a probability measure if one defines the artificial prior as follows:

γ_t(x_t) ≜ γ_1(x_1)   for t = 1;
γ_t(x_t) ≜ ∫ · · · ∫ γ_1(x_1) ∏_{t'=1}^{t-1} f(x_{t'+1} | x_{t'}) dx_{1:t-1} = p(x_t)   for t ≥ 2;   (9)
which gives the (generally time inhomogeneous) backward Markov kernel:

γ(x_t | x_{t+1}) = f(x_{t+1} | x_t) γ_t(x_t) / γ_{t+1}(x_{t+1}),

where on substitution:

p̃(x_t | y_{t+1:T}) = ∫ p̃(x_{t+1} | y_{t+1:T}) γ(x_t | x_{t+1}) dx_{t+1}.
Note that the prior distribution γ_1(x_1) appearing in (9) is an arbitrary prior distribution and not necessarily that used in the filtering recursion. There are certain circumstances where employing this formulation with γ_1(x_1) ≠ µ(x_1) is useful. For example, if the Markov kernel f(x_{t+1} | x_t) admits an invariant distribution π(x), it is convenient to define γ_t(x_t) = π(x), even if µ(x_1) ≠ π(x), as then γ_t(x_t) = π(x) for all t ≥ 2. Moreover, γ(x_t | x_{t+1}) = f(x_t | x_{t+1}) if f is π-reversible. If one takes the actual (filtering) prior distribution µ(x_1) then one uses the abusive notation p(x_t | x_{t+1}) = γ(x_t | x_{t+1}). The use of such a prior distribution was (implicitly) used in references [6], [22], [35].
The calculation of (9) is not generally analytically tractable when one uses the prior µ(·). That is, in the general case the backward Markov kernel p(x_t | x_{t+1}) may not be known or even analytic, and one necessarily has to propagate a finite measure; the additional degree of freedom provided by (8) is therefore important in applications, as the experimental results show.
2) Update: The “update” stage is simply taken to be:

p̃(x_t | y_{t:T}) = g(y_t | x_t) p̃(x_t | y_{t+1:T}) / ∫ g(y_t | x'_t) p̃(x'_t | y_{t+1:T}) dx'_t,   (10)

which has been renormalised to be a probability measure.
One can therefore see that:

p(x_t | y_{1:T}) ∝ p(x_t | y_{1:t-1}) p(y_{t:T} | x_t)
 ∝ p(x_t | y_{1:t-1}) p̃(x_t | y_{t:T}) / γ_t(x_t).   (11)

Since the artificial prior distribution is introduced in (8), it needs to be removed when calculating the smoothing distribution, hence the factor γ_t^{-1}(x_t).
For clarity, we will refer to (5) as an independent two-filter smoother, and to (11) as an artificial two-filter smoother.
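On a finite state space, the artificial two-filter smoother of (8), (10) and (11) can be written out exactly, with γ_t taken as the prior marginal p(x_t) as in (9). The three-state model below is an illustrative assumption, and the result can be checked against the standard forward-backward smoother:

```python
import numpy as np

# Artificial two-filter smoother (eqs (8), (10), (11)) on a small finite
# state space, where all integrals become sums. The 3-state model below is
# an assumption for illustration; gamma_t is the prior marginal p(x_t),
# as in (9).
rng = np.random.default_rng(1)
nx, T = 3, 6
F = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])           # F[i, j] = f(x_{t+1} = j | x_t = i)
mu = np.array([0.5, 0.3, 0.2])            # prior on x_1
G = rng.uniform(0.1, 1.0, (T, nx))        # G[t, j] = g(y_t | x_t = j)

# gamma_t(x_t) = p(x_t): prior marginals, eq. (9)
gamma = np.zeros((T, nx))
gamma[0] = mu
for t in range(1, T):
    gamma[t] = gamma[t - 1] @ F

# forward filter: predicted p(x_t | y_{1:t-1}) and filtered p(x_t | y_{1:t})
pred = np.zeros((T, nx))
filt = np.zeros((T, nx))
for t in range(T):
    pred[t] = mu if t == 0 else filt[t - 1] @ F
    filt[t] = pred[t] * G[t]
    filt[t] /= filt[t].sum()

# backward filter: p_tilde(x_t | y_{t:T}), via eqs (8) and (10)
ptil = np.zeros((T, nx))
ptil[T - 1] = gamma[T - 1] * G[T - 1]     # initial condition
ptil[T - 1] /= ptil[T - 1].sum()
for t in range(T - 2, -1, -1):
    # prediction (8): gamma_t(i) * sum_j F[i, j] ptil_{t+1}(j) / gamma_{t+1}(j)
    pred_b = gamma[t] * (F @ (ptil[t + 1] / gamma[t + 1]))
    ptil[t] = G[t] * pred_b               # update (10), then renormalise
    ptil[t] /= ptil[t].sum()

# combination (11): p(x_t | y_{1:T}) ∝ p(x_t | y_{1:t-1}) ptil_t / gamma_t
smooth = pred * ptil / gamma
smooth /= smooth.sum(axis=1, keepdims=True)
```

In this discrete setting every sum is exact, so the combination (11) reproduces the forward-backward smoothed marginals to machine precision.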
3) Propagation of a probability measure: It is often algorithmically appealing to be able to propagate a probability measure, for example in the unscented information filter (as will be discussed later in the paper), rather than a finite measure as one would through the (general) use of equation (8). As stated, the calculation of (9) is not analytically tractable, could be computationally infeasible, or an approximation could be insufficient. Thus, it is convenient to define the joint distribution (a proposal q̃ multiplied by the corresponding importance weight):

p̃(x_t, x_{t+1} | y_{t:T}) ∝ p̃(x_{t+1} | y_{t+1:T}) q̃(x_t | x_{t+1}, y_t) × g(y_t | x_t) f(x_{t+1} | x_t) γ_t(x_t) / [γ_{t+1}(x_{t+1}) q̃(x_t | x_{t+1}, y_t)],   (12)

such that one is able to directly compute p(y_{t:T} | x_t, x_{t+1}) in an analogous manner to that done in (10). Further details will be provided on the use of (12) when deemed appropriate later in the paper.
D. Joint Distribution
As mentioned, it is possible to capitalise on the sequential nature of the problem when one constructs the joint distribution,
by considering the following factorisation[15]:
p(x_{1:T} | y_{1:T}) = p(x_T | y_{1:T}) ∏_{t=1}^{T-1} p(x_t | x_{t+1:T}, y_{1:T}).   (13)

It is possible to exploit the modelling assumptions (1)-(2) to yield the following simple result:
p(x_t | x_{t+1:T}, y_{1:T}) = p(x_t | x_{t+1}, y_{1:t})
 = p(x_t | y_{1:t}) f(x_{t+1} | x_t) / p(x_{t+1} | y_{1:t})
 ∝ p(x_t | y_{1:t}) f(x_{t+1} | x_t).

It is thus possible to calculate p(x_{1:T} | y_{1:T}) based solely on p(x_t | y_{1:t}) and the state transition density f(x_{t+1} | x_t), which are required for all values of t.
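Factorisation (13) also suggests a direct sampling scheme: draw x_T from the filtered distribution, then draw each x_t backwards from p(x_t | x_{t+1}, y_{1:t}) ∝ p(x_t | y_{1:t}) f(x_{t+1} | x_t). The finite-state sketch below (whose model values are illustrative assumptions) implements this backward-sampling idea:

```python
import numpy as np

# Backward sampling from factorisation (13): draw x_T ~ p(x_T | y_{1:T}),
# then x_t ~ p(x_t | x_{t+1}, y_{1:t}) ∝ p(x_t | y_{1:t}) f(x_{t+1} | x_t).
# A small finite-state model (an assumption for illustration) keeps every
# quantity exactly computable.
rng = np.random.default_rng(2)
nx, T = 3, 5
F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])          # F[i, j] = f(x_{t+1} = j | x_t = i)
mu = np.array([1 / 3, 1 / 3, 1 / 3])
G = rng.uniform(0.2, 1.0, (T, nx))       # G[t, j] = g(y_t | x_t = j)

# forward pass: filtered distributions p(x_t | y_{1:t})
filt = np.zeros((T, nx))
for t in range(T):
    pred = mu if t == 0 else filt[t - 1] @ F
    filt[t] = pred * G[t]
    filt[t] /= filt[t].sum()

def backward_sample():
    """Draw one trajectory x_{1:T} from p(x_{1:T} | y_{1:T}) via (13)."""
    x = np.zeros(T, dtype=int)
    x[T - 1] = rng.choice(nx, p=filt[T - 1])
    for t in range(T - 2, -1, -1):
        w = filt[t] * F[:, x[t + 1]]     # p(x_t | y_{1:t}) f(x_{t+1} | x_t)
        x[t] = rng.choice(nx, p=w / w.sum())
    return x

samples = np.array([backward_sample() for _ in range(4000)])
# empirical marginal of x_1 approximates the smoothed marginal p(x_1 | y_{1:T})
emp = np.bincount(samples[:, 0], minlength=nx) / len(samples)
```

The empirical marginal of the sampled trajectories at any time index converges to the corresponding smoothed marginal as the number of draws grows.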
The authors note that alternative factorisations exist [5], which can be constructed as the symmetrical relationship of (13) when one uses an artificial prior distribution defined by (9), as follows:

p(x_{1:T} | y_{1:T}) = p(x_1 | y_{1:T}) ∏_{t=2}^{T} p(x_t | x_{t-1}, y_{1:T}),

where

p(x_t | x_{t-1}, y_{1:T}) = p(x_t | x_{t-1}, y_{t:T})
 = p(y_{t:T} | x_t) f(x_t | x_{t-1}) / p(y_{t:T} | x_{t-1})
 ∝ p̃(x_t | y_{t:T}) γ(x_{t-1} | x_t) / p̃(x_{t-1} | y_{t:T}),

where γ(x_{t-1} | x_t) is the backward Markov kernel if γ_t(x_t) is defined through (9). These alternative factorisations, however, do not seem to bring any advantage over (13).
III. FINITE STATE SPACE HIDDEN MARKOV MODEL
Define X_t as a discrete-time, time-homogeneous, n_x-state, first-order Markov chain. By using a discrete analogue of the previous section, one is able to construct the following algorithmic arguments.
A. Filtering
Let α_{t-1}(x_{t-1}) ≜ p(X_{t-1} = x_{t-1} | y_{1:t-1}). The filtering recursion based on the algorithm described by Rabiner [42] is given as follows:

p(X_t = x_t | y_{1:t-1}) = Σ_{x_{t-1}} f(X_t = x_t | X_{t-1} = x_{t-1}) α_{t-1}(x_{t-1})

α_t(x_t) = p(X_t = x_t | y_{1:t-1}) g(y_t | X_t = x_t) / Σ_{x'_t} p(X_t = x'_t | y_{1:t-1}) g(y_t | X_t = x'_t)

with suitably defined initial conditions µ(X_1 = x_1), x_1 ∈ S = {1, . . . , n_x}.
B. Forward-backward smoothing
A solution that enables one to conduct two passes through the data, the first (forward) pass calculating α_t(j) and the second (backward) pass calculating the required quantity, p(X_t = x_t | y_{1:T}), is given as:

p(X_t = x_t | y_{1:T}) = α_t(x_t) Σ_{x_{t+1}} [p(X_{t+1} = x_{t+1} | y_{1:T}) f(X_{t+1} = x_{t+1} | X_t = x_t) / Σ_{x'_t} α_t(x'_t) f(X_{t+1} = x_{t+1} | X_t = x'_t)].

Surprisingly, this formula is not used very often, even though it avoids the need to rescale the results to prevent the numerical instabilities of the independent smoother which is used routinely in this context.
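The two passes above can be sketched directly on a small (assumed, purely illustrative) three-state model:

```python
import numpy as np

# Discrete forward-backward smoother: the forward pass computes alpha_t,
# the backward pass applies the self-normalised recursion displayed above.
# The 3-state model and its values are assumptions for illustration.
rng = np.random.default_rng(3)
nx, T = 3, 6
F = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])          # F[i, j] = f(X_{t+1} = j | X_t = i)
mu = np.array([0.4, 0.4, 0.2])
G = rng.uniform(0.1, 1.0, (T, nx))       # G[t, j] = g(y_t | X_t = j)

# forward pass: alpha_t(x_t) = p(X_t = x_t | y_{1:t})
alpha = np.zeros((T, nx))
for t in range(T):
    pred = mu if t == 0 else alpha[t - 1] @ F
    alpha[t] = pred * G[t]
    alpha[t] /= alpha[t].sum()

# backward pass: p(X_t = x_t | y_{1:T})
smooth = np.zeros((T, nx))
smooth[T - 1] = alpha[T - 1]
for t in range(T - 2, -1, -1):
    # denominator: sum_{x'_t} alpha_t(x'_t) f(x_{t+1} | x'_t), i.e. the
    # one-step predictive p(X_{t+1} | y_{1:t}), indexed by x_{t+1}
    denom = alpha[t] @ F
    smooth[t] = alpha[t] * (F @ (smooth[t + 1] / denom))
```

Because the recursion is self-normalised, each row of `smooth` sums to one without any extra rescaling, which is precisely the practical advantage noted above.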
