An Introduction to Hidden Markov Models
The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models based on Markov chains had not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960s and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
INTRODUCTION
ASSUME YOU ARE GIVEN the following problem. A real world process produces a sequence of observable symbols. The symbols could be discrete (outcomes of coin tossing experiments, characters from a finite alphabet, quantized vectors from a codebook, etc.) or continuous (speech samples, autocorrelation vectors, vectors of linear prediction coefficients, etc.). Your job is to build a signal model that explains and characterizes the occurrence of the observed symbols. If such a signal model is obtainable, it can then be used later to identify or recognize other sequences of observations.
In attacking such a problem, some fundamental decisions, guided by signal and system theory, must be made. For example, one must decide on the form of the model: linear or non-linear, time-varying or time-invariant, deterministic or stochastic. Depending on these decisions, as well as on other signal processing considerations, several possible signal models can be constructed.
To fix ideas, consider modelling a pure sinewave. If we have reason to believe that the observed symbols are from a pure sinewave, then all that would need to be measured is the amplitude, frequency, and perhaps phase of the sinewave, and an exact model, which explains the observed symbols, would result.
L. R. Rabiner
B. H. Juang

IEEE ASSP Magazine, January 1986
Consider next a somewhat more complicated signal, namely a sinewave imbedded in noise. The noise components of the signal make the modelling problem more complicated because, in order to properly estimate the sinewave parameters (amplitude, frequency, phase), one has to take into account the characteristics of the noise component.
In the above examples, we have assumed the sinewave part of the signal was stationary, i.e., not time varying. This may not be a realistic assumption. If, for example, the unknown process produces a sinewave with varying amplitude, then clearly a non-linear model, e.g. amplitude modulation, may be more appropriate. Similarly, if we assume that the frequency, instead of the amplitude, of the sinewave is changing, a frequency-modulation model might be most appropriate.
Linear system models
The concepts behind the above examples have been well studied in classical communication theory. The variety and types of real world processes, however, do not stop here. Linear system models, which model the observed symbols as the output of a linear system excited by an appropriate source, form another important class of processes for signal modeling and have proven useful for a wide variety of applications. For example, "short time" segments of speech signals can be effectively modeled as the output of an all-pole filter excited by appropriate sources with essentially a flat spectral envelope. The signal modeling technique in this case thus involves determination of the linear filter coefficients and, in some cases, the excitation parameters. Obviously, spectral analyses of other kinds also fall within this category.
One can further incorporate temporal variations of the signal into the linear system model by allowing the filter coefficients, or the excitation parameters, to change with time. In fact, many real world processes cannot be meaningfully modeled without considering such temporal variation. Speech signals are one example of such processes.
There are several ways to address the problem of modeling temporal variation of a signal. As mentioned above, within a "short time" period, some physical signals, such as speech, can be effectively modeled by a simple linear time-invariant system with the
0740-7467/86/0100-0004$01.00 © 1986 IEEE
appropriate excitation. The easiest way then to address the time-varying nature of the process is to view it as a direct concatenation of these smaller "short time" segments, each such segment being individually represented by a linear system model. In other words, the overall model is a synchronous sequence of symbols, where each of the symbols is a linear system model representing a short segment of the process. In a sense, this type of approach models the observed signal using representative tokens of the signal itself (or some suitably averaged set of such signals if we have multiple observations).
Time-varying processes
Modeling time-varying processes with the above approach assumes that every such short-time segment of observation is a unit with a prechosen duration. In general, however, there doesn't exist a precise procedure to decide what the unit duration should be so that both the time-invariant assumption holds and the short-time linear system models (as well as concatenation of the models) are meaningful. In most physical systems, the duration of a short-time segment is determined empirically.
In many processes, of course, one would neither expect the properties of the process to change synchronously with every unit analysis duration, nor observe drastic changes from each unit to the next except at certain instances. Making no further assumptions about the relationship between adjacent short-time models, and treating temporal variations, small or large, as "typical" phenomena in the observed signal, are key features in the above direct concatenation technique.
This template approach to signal modeling has proven to be quite useful and has been the basis of a wide variety of speech recognition systems.
There are good reasons to suspect, at this point, that the above approach, while useful, may not be the most efficient (in terms of computation, storage, parameters, etc.) technique as far as representation is concerned. Many real world processes seem to manifest a rather sequentially changing behavior; the properties of the process are usually held pretty steadily, except for minor fluctuations, for a certain period of time (or a number of the above-mentioned duration units), and then, at certain instances, change (gradually or rapidly) to another set of properties.
The opportunity for more efficient modeling can be exploited if we can first identify these periods of rather steady behavior, and then are willing to assume that the temporal variations within each of these steady periods are, in a sense, statistical. A more efficient representation may then be obtained by using a common short time model for each of the steady, or well-behaved, parts of the signal, along with some characterization of how one such period evolves to the next. This is how hidden Markov models (HMM) come about. Clearly, three problems have to be addressed: 1) how these steadily or distinctively behaving periods can be identified, 2) how the "sequentially" evolving nature of these periods can be characterized, and 3) what typical or common short time model should be chosen for each of these periods. Hidden Markov models successfully treat these problems under a probabilistic or statistical framework.
It is thus the purpose of this paper to explain what a hidden Markov model is, why it is appropriate for certain types of problems, and how it can be used in practice. In the next section, we illustrate hidden Markov models via some simple coin toss examples and outline the three fundamental problems associated with the modeling technique. We then discuss how these problems can be solved in Section III. We will not direct our general discussion to any one particular problem, but at the end of this paper we illustrate how HMM's are used via a couple of examples in speech recognition.
DEFINITION OF A HIDDEN MARKOV MODEL
An HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols. We illustrate HMM's with the following coin toss example.
Coin toss example
To understand the concept of the HMM, consider the following simplified example. You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment. The other person will not tell you anything about what he is doing exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden coin tossing experiments is performed, and you only observe the results of the coin tosses, i.e.

O1 O2 O3 ... OT

where H stands for heads and T stands for tails.
Given the above experiment, the problem is how do we build an HMM to explain the observed sequence of heads and tails. One possible model is shown in Fig. 1a. We call this the "1-fair coin" model. There are two states in the model, but each state is uniquely associated with either heads (state 1) or tails (state 2). Hence this model is not hidden because the observation sequence uniquely defines the state. The model represents a "fair coin" because the probability of generating a head (or a tail) following a head (or a tail) is 0.5; hence there is no bias on the current observation. This is a degenerate example and shows how independent trials, like tossing of a fair coin, can be interpreted as a set of sequential events. Of course, if the person behind the barrier is, in fact, tossing a single fair coin, this model should explain the outcomes very well.
A second possible HMM for explaining the observed sequence of coin toss outcomes is given in Fig. 1b. We call this model the "2-fair coin" model. There are again 2 states in the model, but neither state is uniquely associated with
[Figure 1 graphic. Output probabilities: (a) 1-fair coin model: state 1, P(H) = 1.0, P(T) = 0.0; state 2, P(H) = 0.0, P(T) = 1.0. (b) 2-fair coins model: both states, P(H) = P(T) = 0.5. (c) 2-biased coins model: state 1, P(H) = 0.75, P(T) = 0.25; state 2, P(H) = 0.25, P(T) = 0.75. (d) 3-biased coins model: state 1, P(H) = 0.6, P(T) = 0.4; state 2, P(H) = 0.25, P(T) = 0.75; state 3, P(H) = 0.45, P(T) = 0.55. All state transition probabilities in models (a)-(c) are 0.5.]
Figure 1. Models which can be used to explain the results of hidden coin tossing experiments. The simplest model, shown in part (a), consists of a single fair coin with the outcome heads corresponding to one state and tails to the other state. The model of part (b) corresponds to tossing two fair (unbiased) coins, with the first coin being used in state 1 and the second coin being used in state 2. An independent "fair" coin is used to decide which of the other two fair coins is flipped at each trial. The model of part (c) corresponds to tossing two biased coins, with the first coin heavily biased towards heads and the second coin heavily biased towards tails. Again a "fair" coin is used to decide which biased coin is tossed at each trial. Finally, the model of part (d) corresponds to the case of 3 biased coins being used.
either heads or tails. The probability of heads (or tails) in either state is 0.5. Also the probability of leaving (or remaining in) either state is 0.5. Thus, in this case, we can associate each state with a fair (unbiased) coin. Although the probabilities associated with remaining in, or leaving, either of the two states are all 0.5, a little thought should convince the reader that the statistics of the observable output sequences of the 2-fair coins model are independent of the state transitions. The reason for this is that this model is hidden (i.e. we cannot know exactly which fair coin (state) led to the observed heads or tails at each observation), and is essentially indistinguishable (in a statistical sense) from the 1-fair coin model of Fig. 1a.
Figures 1c and 1d show two more possible HMM's which can account for the observed sequence of heads and tails. The model of Fig. 1c, which we call the 2-biased coins model, has two states (corresponding to two different coins). In state 1, the coin is biased strongly towards heads. In state 2, the coin is biased strongly towards tails. The state transition probabilities are all equal to 0.5. This 2-biased coins model is a hidden Markov model which is distinguishable from the two previously discussed models. Interestingly, the reader should be able to convince himself that the long time statistics (e.g. average number of heads or tails) of the observation sequences from the HMM of Fig. 1c are the same as those from the models of Figs. 1a and 1b. This model is very appropriate if what is happening behind the barrier is as follows. The person has three coins, one fair and the other two biased according to the description in Fig. 1c. The two biased coins are associated with the two faces of the fair coin respectively. To report the outcome of every mysterious coin flip, the person behind the barrier first flips the fair coin to decide which biased coin to use, and then flips the chosen biased coin to obtain the result. With this model, we thus are able to look into and explain the above subtle characteristic changes (i.e. switching the biased coins).
The model of Fig. 1d, which we call the 3-biased coins model, has three states (corresponding to three different coins). In state 1 the coin is biased slightly towards heads; in state 2 the coin is biased strongly towards tails; in state 3 the coin is biased slightly towards tails. We have not specified values of the state transition probabilities in Fig. 1d; clearly the behavior of the observation sequences produced by such a model is strongly dependent on these transition probabilities. (To convince himself of this, the reader should consider two extreme cases, namely when the probability of remaining in state 3 is large (>0.95) or small (<0.05). Very different sequence statistics will result from these two extremes because of the strong bias of the coin associated with state 3.) As with the 2-biased coins model, some real scenario behind the barrier corresponding to such a model can be composed; the reader should find no difficulty doing this himself.
[Figure 2 graphic: N urns, each with its own probabilities Pr(R), Pr(B), Pr(Y) of drawing a red, blue, or yellow ball.]

Figure 2. An urn and ball model which illustrates the general case of a discrete symbol hidden Markov model. Each of N urns (the N states of the model) contains a large number of colored balls. The proportion of each colored ball in each urn is different, and is governed by the probability density of colors for each urn. The observations from the urn and ball model consist of announcing the color of the ball drawn at random from a selected urn, replacing the ball, and then choosing a new urn from which to select a ball according to the state transition density associated with the originally selected urn.

There are several important points to be learned from this discussion of how to model the outputs of the coin tossing experiment via HMM's. First we note that one of the most difficult parts of the modeling procedure is to decide on the size (the number of states) of the model. Without some a priori information, this decision often is difficult to make and could involve trial and error before settling on the most appropriate model size. Although we stopped at a 3-coin model for the above illustration, even this might be too small. How do we decide how many coins (states) are really needed in the model? The answer to this question is related to an even larger question, namely how do we choose model parameters (state transition probabilities, probabilities of heads and tails in each state) to optimize the model so that it best explains the observed outcome sequence? We will try to answer this question in the section on Solutions to the Three HMM Problems, as this is the key to the successful use of HMM's for real world problems. A final point concerns the size of the observation sequence.
If we are restricted to a small finite observation sequence we may not be able to reliably estimate the optimal model parameters. (Think of the case of actually using 10 coins but being given a set of only 50-100 observations.) Hence, in a sense, depending on the amount of model training data we are given, certain HMM's may not be statistically reliably different.
Elements of an HMM
We now explain the elements and the mechanism of the type of HMM's that we discuss in this paper:

1. There are a finite number, say N, of states in the model; we shall not rigorously define what a state is, but simply say that within a state the signal possesses some measurable, distinctive properties.
2. At each clock time, t, a new state is entered based upon a transition probability distribution which depends on the previous state (the Markovian property). (Note that the transition may be such that the process remains in the previous state.)
3. After each transition is made, an observation output symbol is produced according to a probability distribution which depends on the current state. This probability distribution is held fixed for the state regardless of when and how the state is entered. There are thus N such observation probability distributions which, of course, represent random variables or stochastic processes.
To fix ideas, let us consider the "urn and ball" model of Fig. 2. There are N urns, each filled with a large number of colored balls. There are M possible colors for each ball. The observation sequence is generated by initially choosing one of the N urns (according to an initial probability distribution), selecting a ball from the initial urn, recording its color, replacing the ball, and then choosing a new urn according to a transition probability distribution associated with the current urn. Thus a typical observation sequence might be:

clock time:           1   2   3   4   ...  T
urn (hidden) state:   q3  q1  q1  q2  ...  qN-2
color (observation):  R   B   Y   Y   ...  R
We now formally define the following model notation for a discrete observation HMM:

T = length of the observation sequence (total number of clock times)
N = number of states (urns) in the model
M = number of observation symbols (colors)
Q = {q1, q2, ..., qN}, states (urns)
V = {v1, v2, ..., vM}, discrete set of possible symbol observations (colors)
A = {aij}, aij = Pr(qj at t+1 | qi at t), state transition probability distribution
B = {bj(k)}, bj(k) = Pr(vk at t | qj at t), observation symbol probability distribution in state j
π = {πi}, πi = Pr(qi at t=1), initial state distribution
Using the model, an observation sequence, O = O1, O2, ..., OT, is generated as follows:
1. Choose an initial state, i1, according to the initial state distribution, π;
2. Set t = 1;
3. Choose Ot according to b_it(k), the symbol probability distribution in state it;
4. Choose it+1 according to {a_it,it+1}, it+1 = 1, 2, ..., N, the state transition probability distribution for state it;
5. Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.
We use the compact notation λ = (A, B, π) to represent an HMM. Specification of an HMM involves choice of the number of states, N, and the number of discrete symbols, M (we will briefly discuss continuous density HMM's at the end of this paper), and specification of the three probability densities A, B, and π.
If we try to specify the relative importance of the three densities, A, B, and π, then it should be clear that for most applications π is the least important (it represents only the initial conditions), and B is the most important (since it is directly related to the observed symbols). For some problems the distribution A is also quite important (recall the 3-biased coins model discussed earlier), whereas for other problems (e.g. isolated word recognition problems) it is of less importance.
The three problems for HMM's
HMM's
Given
the
form of
the
HMM
discussed
in
the
previous
section,
there
are
three
key problems of interest
that
must
be
solved for
the
model
to
be
useful
in
real world applica-
tions. These problems
are
the
following:
Problem 1 - Given the observation sequence O = O1, O2, ..., OT, and the model λ = (A, B, π), how do we compute Pr(O | λ), the probability of the observation sequence?

Problem 2 - Given the observation sequence O = O1, O2, ..., OT, how do we choose a state sequence I = i1, i2, ..., iT which is optimal in some meaningful sense?

Problem 3 - How do we adjust the model parameters λ = (A, B, π) to maximize Pr(O | λ)?
Problem 1 is the evaluation problem: given a model and a sequence of observations, how do we compute the probability that the observed sequence was produced by the model? We can also view the problem as: given a model and a sequence of observations, how do we "score" or evaluate the model? The latter viewpoint is very useful. If we think of the case in which we have several competing models (e.g. the four models of Fig. 1 for the coin tossing experiment), the solution to Problem 1 allows us to choose the model which best matches the observations.
Problem 2 is the one in which we attempt to uncover the hidden part of the model, i.e. the state sequence. This is a typical estimation problem. We usually use an optimality criterion to solve this problem as best as possible. Unfortunately, as we will see, there are several possible optimality criteria that can be imposed, and hence the choice of criterion is a strong function of the intended use
for the uncovered state sequence. A typical use of the recovered state sequence is to learn about the structure of the model, and to get average statistics, behavior, etc. within individual states.
Problem 3 is the one in which we attempt to optimize the model parameters so as to best describe how the observed sequence comes about. We call this a training sequence in this case, since it is used to train the model. The training problem is the crucial one for most applications of HMM's, since it allows us to optimally adapt model parameters to observed training data, i.e. to create the best models for real phenomena.
To fix ideas, consider the following speech recognition scheme. We want to design an N-state HMM for each word of a V-word vocabulary. Using vector quantization (VQ) techniques, we represent the speech signal by a sequence of VQ codebook symbols derived from an M-word codebook. Thus we start with a training sequence, for each vocabulary word, consisting of a number of repetitions of the spoken word (by one or more talkers). We use the solution to Problem 3 to optimally get model parameters for each word model. To develop an understanding of the physical meaning of the model states, we use the solution to Problem 2 to segment each of the word training sequences into states, and then study the observations occurring in each state. The result of this study may lead to further improvements on the model. We shall discuss this in later sections. Finally, to do recognition on an unknown word, we use the solution to Problem 1 to score each word model based upon the given test observation sequence, and select the word whose word model score is the highest.
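The recognition step of this scheme reduces to a maximization over word-model scores. The sketch below is our own; the `score` argument stands in for the solution to Problem 1, and the toy vocabulary and stand-in scorer are purely hypothetical:

```python
def recognize(observation_seq, word_models, score):
    """Return the vocabulary word whose HMM best explains the observations.

    word_models: dict mapping each word to its trained model (the output of
    the solution to Problem 3).
    score(O, model): returns Pr(O | model), i.e. the solution to Problem 1.
    """
    return max(word_models,
               key=lambda w: score(observation_seq, word_models[w]))

# Toy usage with a stand-in scorer (a real one would evaluate Pr(O | lambda)):
models = {"yes": "model_yes", "no": "model_no"}
fake_score = lambda O, m: 0.9 if m == "model_yes" else 0.1
print(recognize(["v3", "v1", "v7"], models, fake_score))   # prints: yes
```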
We now present the formal mathematical solutions to each of the three fundamental problems for HMM's. As we shall see, these three problems may be linked together under our probabilistic framework.
SOLUTIONS TO THE THREE HMM PROBLEMS
Problem 1
We wish to calculate the probability of the observation sequence, O, given the model λ. The most straightforward way of doing this is through enumerating every possible state sequence of length T (the number of observations). For every fixed state sequence I = i1 i2 ... iT, the probability of the observation sequence O is Pr(O | I, λ), where

Pr(O | I, λ) = b_i1(O1) b_i2(O2) ... b_iT(OT).

The probability of such a state sequence I, on the other hand, is

Pr(I | λ) = π_i1 a_i1,i2 a_i2,i3 ... a_iT-1,iT.
The joint probability of O and I, i.e., the probability that O and I occur simultaneously, is simply the product of the above two terms, Pr(O, I | λ) = Pr(O | I, λ) Pr(I | λ). The probability of O is then obtained by summing this joint probability over all possible state sequences:

Pr(O | λ) = sum over all state sequences I of Pr(O | I, λ) Pr(I | λ).