
An introduction to hidden Markov models

01 Jan 1986-IEEE Assp Magazine (IEEE)-Vol. 3, Iss: 1, pp 4-16
TL;DR: The purpose of this tutorial paper is to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
Abstract: The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.

Summary (1 min read)

Summary

  • A more efficient representation may then be obtained by using a common short time model for each of the steady, or well-behaved parts of the signal, along with some characterization of how one such period evolves to the next.
  • The model of Fig. 1c, which the authors call the 2-biased coins model, has two states (corresponding to two different coins).
  • The observations from the urn and ball model consist of announcing the color of the ball drawn at random from a selected urn.
  • The authors now explain the elements and the mechanism of the type of HMM's that they discuss in this paper: 1. There are a finite number, say N, of states in the model; they shall not rigorously define what a state is but simply say that within a state the signal possesses some measurable, distinctive properties.
  • Problem 3 is the one in which the authors attempt to optimize the model parameters so as to best describe how the observed sequence comes about.
  • Finally the authors point out that all the formulas presented in this paper for a single observation sequence can be modified to handle the case of multiple observation sequences.
  • For word recognition where the starting and ending points of the utterance are approximately known, it is found to be advantageous to use the above mentioned left-to-right models, particularly as shown in Fig. 6b.
  • Presently he is engaged in research on speech recognition and digital signal processing techniques at Bell Laboratories, Murray Hill.


An Introduction to Hidden Markov Models

The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
INTRODUCTION

Assume you are given the following problem. A real world process produces a sequence of observable symbols. The symbols could be discrete (outcomes of coin tossing experiments, characters from a finite alphabet, quantized vectors from a code book, etc.) or continuous (speech samples, autocorrelation vectors, vectors of linear prediction coefficients, etc.). Your job is to build a signal model that explains and characterizes the occurrence of the observed symbols. If such a signal model is obtainable, it then can be used later to identify or recognize other sequences of observations.

In attacking such a problem, some fundamental decisions, guided by signal and system theory, must be made. For example, one must decide on the form of the model: linear or non-linear, time-varying or time-invariant, deterministic or stochastic. Depending on these decisions, as well as other signal processing considerations, several possible signal models can be constructed.

To fix ideas, consider modelling a pure sinewave. If we have reason to believe that the observed symbols are from a pure sinewave, then all that would need to be measured is the amplitude, frequency and perhaps phase of the sinewave, and an exact model, which explains the observed symbols, would result.
IEEE ASSP MAGAZINE, JANUARY 1986

L. R. Rabiner
B. H. Juang
Consider next a somewhat more complicated signal, namely a sinewave embedded in noise. The noise components of the signal make the modelling problem more complicated because in order to properly estimate the sinewave parameters (amplitude, frequency, phase) one has to take into account the characteristics of the noise component.

In the above examples, we have assumed the sinewave part of the signal was stationary, i.e. not time varying. This may not be a realistic assumption. If, for example, the unknown process produces a sinewave with varying amplitude, then clearly a non-linear model, e.g. amplitude modulation, may be more appropriate. Similarly, if we assume that the frequency, instead of the amplitude, of the sinewave is changing, a frequency-modulation model might be most appropriate.
Linear system models

The concepts behind the above examples have been well studied in classical communication theory. The variety and types of real world processes, however, do not stop here. Linear system models, which model the observed symbols as the output of a linear system excited by an appropriate source, form another important class of processes for signal modeling and have proven useful for a wide variety of applications. For example, "short time" segments of speech signals can be effectively modeled as the output of an all-pole filter excited by appropriate sources with essentially a flat spectral envelope. The signal modeling technique, in this case, thus involves determination of the linear filter coefficients and, in some cases, the excitation parameters. Obviously, spectral analyses of other kinds also fall within this category.
One can further incorporate temporal variations of the signal into the linear system model by allowing the filter coefficients, or the excitation parameters, to change with time. In fact, many real world processes cannot be meaningfully modeled without considering such temporal variation. Speech signals are one example of such processes.
There are several ways to address the problem of modeling temporal variation of a signal. As mentioned above, within a "short time" period, some physical signals, such as speech, can be effectively modeled by a simple linear time-invariant system with the appropriate excitation. The easiest way then to address the time-varying nature of the process is to view it as a direct concatenation of these smaller "short time" segments, each such segment being individually represented by a linear system model. In other words, the overall model is a synchronous sequence of symbols where each of the symbols is a linear system model representing a short segment of the process. In a sense this type of approach models the observed signal using representative tokens of the signal itself (or some suitably averaged set of such signals if we have multiple observations).
Time-varying processes

Modeling time-varying processes with the above approach assumes that every such short-time segment of observation is a unit with a prechosen duration. In general, however, there doesn't exist a precise procedure to decide what the unit duration should be so that both the time-invariant assumption holds, and the short-time linear system models (as well as concatenation of the models) are meaningful. In most physical systems, the duration of a short-time segment is determined empirically. In many processes, of course, one would neither expect the properties of the process to change synchronously with every unit analysis duration, nor observe drastic changes from each unit to the next except at certain instances. Making no further assumptions about the relationship between adjacent short-time models, and treating temporal variations, small or large, as "typical" phenomena in the observed signal, are key features in the above direct concatenation technique. This template approach to signal modeling has proven to be quite useful and has been the basis of a wide variety of speech recognition systems.
There are good reasons to suspect, at this point, that the above approach, while useful, may not be the most efficient (in terms of computation, storage, parameters, etc.) technique as far as representation is concerned. Many real world processes seem to manifest a rather sequentially changing behavior; the properties of the process are usually held pretty steadily, except for minor fluctuations, for a certain period of time (or a number of the above-mentioned duration units), and then, at certain instances, change (gradually or rapidly) to another set of properties. The opportunity for more efficient modeling can be exploited if we can first identify these periods of rather steady behavior, and then are willing to assume that the temporal variations within each of these steady periods are, in a sense, statistical. A more efficient representation may then be obtained by using a common short time model for each of the steady, or well-behaved, parts of the signal, along with some characterization of how one such period evolves to the next. This is how hidden Markov models (HMM) come about. Clearly, three problems have to be addressed: 1) how these steadily or distinctively behaving periods can be identified, 2) how the "sequentially" evolving nature of these periods can be characterized, and 3) what typical or common short time model should be chosen for each of these periods. Hidden Markov models successfully treat these problems under a probabilistic or statistical framework.
It is thus the purpose of this paper to explain what a hidden Markov model is, why it is appropriate for certain types of problems, and how it can be used in practice. In the next section, we illustrate hidden Markov models via some simple coin toss examples and outline the three fundamental problems associated with the modeling technique. We then discuss how these problems can be solved in Section III. We will not direct our general discussion to any one particular problem, but at the end of this paper we illustrate how HMM's are used via a couple of examples in speech recognition.
DEFINITION OF A HIDDEN MARKOV MODEL

An HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols. We illustrate HMM's with the following coin toss example.
Coin toss example

To understand the concept of the HMM, consider the following simplified example. You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment. The other person will not tell you anything about what he is doing exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden coin tossing experiments is performed, and you only observe the results of the coin tosses, i.e.

O1 O2 O3 ... OT

where H stands for heads and T stands for tails. Given the above experiment, the problem is how do we build an HMM to explain the observed sequence of heads and tails.
One possible model is shown in Fig. 1a. We call this the "1-fair coin" model. There are two states in the model, but each state is uniquely associated with either heads (state 1) or tails (state 2). Hence this model is not hidden because the observation sequence uniquely defines the state. The model represents a "fair coin" because the probability of generating a head (or a tail) following a head (or a tail) is 0.5; hence there is no bias on the current observation. This is a degenerate example and shows how independent trials, like tossing of a fair coin, can be interpreted as a set of sequential events. Of course, if the person behind the barrier is, in fact, tossing a single fair coin, this model should explain the outcomes very well.
A second possible HMM for explaining the observed sequence of coin toss outcomes is given in Fig. 1b. We call this model the "2-fair coins" model. There are again 2 states in the model, but neither state is uniquely associated with either heads or tails. The probability of heads (or tails) in either state is 0.5. Also the probability of leaving (or remaining in) either state is 0.5. Thus, in this case, we can associate each state with a fair (unbiased) coin. Although the probabilities associated with remaining in, or leaving, either of the two states are all 0.5, a little thought should convince the reader that the statistics of the observable output sequences of the 2-fair coins model are independent of the state transitions. The reason for this is that this model is hidden (i.e. we cannot know exactly which fair coin (state) led to the observed heads or tails at each observation), but is essentially indistinguishable (in a statistical sense) from the 1-fair coin model of Fig. 1a.

[Figure 1 shows the four models; in parts (a)-(c) all state transition probabilities between states 1 and 2 are 0.5:
(a) 1-fair coin model: P(H) = 1.0, P(T) = 0.0 in state 1; P(H) = 0.0, P(T) = 1.0 in state 2.
(b) 2-fair coins model: P(H) = P(T) = 0.5 in both states.
(c) 2-biased coins model: P(H) = 0.75, P(T) = 0.25 in state 1; P(H) = 0.25, P(T) = 0.75 in state 2.
(d) 3-biased coins model: states 1, 2, 3 with P(H) = 0.6, 0.25, 0.45 and P(T) = 0.4, 0.75, 0.55.]

Figure 1. Models which can be used to explain the results of hidden coin tossing experiments. The simplest model, shown in part (a), consists of a single fair coin with the outcome heads corresponding to one state and tails to the other state. The model of part (b) corresponds to tossing two fair (unbiased) coins, with the first coin being used in state 1 and the second coin being used in state 2. An independent "fair" coin is used to decide which of the other two fair coins is flipped at each trial. The model of part (c) corresponds to tossing two biased coins, with the first coin being heavily biased towards heads, and the second coin heavily biased towards tails. Again a "fair" coin is used to decide which biased coin is tossed at each trial. Finally the model of part (d) corresponds to the case of 3 biased coins being used.
Figures 1c and 1d show two more possible HMM's which can account for the observed sequence of heads and tails. The model of Fig. 1c, which we call the 2-biased coins model, has two states (corresponding to two different coins). In state 1, the coin is biased strongly towards heads. In state 2, the coin is biased strongly towards tails. The state transition probabilities are all equal to 0.5. This 2-biased coins model is a hidden Markov model which is distinguishable from the two previously discussed models. Interestingly, the reader should be able to convince himself that the long time statistics (e.g. average number of heads or tails) of the observation sequences from the HMM of Fig. 1c are the same as those from the models of Figs. 1a and 1b. This model is very appropriate if what is happening behind the barrier is as follows. The person has three coins, one fair and the other two biased according to the description in Fig. 1c. The two biased coins are associated with the two faces of the fair coin respectively. To report the outcome of every mysterious coin flip, the person behind the barrier first flips the fair coin to decide which biased coin to use, and then flips the chosen biased coin to obtain the result. With this model, we thus are able to look into and explain the above subtle characteristic changes (i.e. switching the biased coins).
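The claim that the long-time head statistics of the 2-biased coins model match those of the fair-coin models is easy to check numerically. The sketch below is illustrative only; the coin biases are the values from Fig. 1c, and since all transition probabilities are 0.5, each trial amounts to picking a state with a fair coin and then flipping that state's biased coin:

```python
import random

def simulate_two_biased_coins(num_flips, seed=0):
    """Simulate the 2-biased coins model of Fig. 1c and return the head fraction."""
    rng = random.Random(seed)
    p_heads = {1: 0.75, 2: 0.25}  # bias of the coin associated with each state
    heads = 0
    for _ in range(num_flips):
        # All transition probabilities are 0.5, so the next state is a fair-coin choice.
        state = 1 if rng.random() < 0.5 else 2
        if rng.random() < p_heads[state]:
            heads += 1
    return heads / num_flips

frac = simulate_two_biased_coins(100_000)
# The long-run head fraction is close to 0.5, just as for the fair-coin models.
assert abs(frac - 0.5) < 0.01
```

The individual flips are not independent draws from a single biased coin, yet the marginal head probability is 0.5 * 0.75 + 0.5 * 0.25 = 0.5, which is why the long-time averages cannot distinguish this model from Figs. 1a and 1b.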
The model of Fig. 1d, which we call the 3-biased coins model, has three states (corresponding to three different coins). In state 1 the coin is biased slightly towards heads; in state 2 the coin is biased strongly toward tails; in state 3 the coin is biased slightly toward tails. We have not specified values of the state transition probabilities in Fig. 1d; clearly the behavior of the observation sequences produced by such a model is strongly dependent on these transition probabilities. (To convince himself of this, the reader should consider two extreme cases, namely when the probability of remaining in state 3 is large (>0.95), or small (<0.05). Very different sequence statistics will result from these two extremes because of the strong bias of the coin associated with state 3.) As with the 2-biased coins model, some real scenario behind the barrier, corresponding to such a model, can be composed; the reader should find no difficulty doing this himself.
There are several important points to be learned from this discussion of how to model the outputs of the coin tossing experiment via HMM's. First we note that one of the most difficult parts of the modeling procedure is to decide on the size (the number of states) of the model. Without some a priori information, this decision often is difficult to make and could involve trial and error before settling on the most appropriate model size. Although we stopped at a 3-coin model for the above illustration, even this might be too small. How do we decide on how many coins (states) are really needed in the model? The answer to this question is related to an even larger question, namely how do we choose model parameters (state transition probabilities, probabilities of heads and tails in each state) to optimize the model so that it best explains the observed outcome sequence. We will try to answer this question in the section on Solutions to the Three HMM Problems, as this is the key to the successful use of HMM's for real world problems. A final point concerns the size of the observation sequence. If we are restricted to a small finite observation sequence we may not be able to reliably estimate the optimal model parameters. (Think of the case of actually using 10 coins but being given a set of 50-100 observations.) Hence, in a sense, depending on the amount of model training data we are given, certain HMM's may not be statistically reliably different.

[Figure 2 shows N urns, each characterized by its own color probabilities Pr(R), Pr(B), and Pr(Y); the numerical values are not legible in this copy.]

Figure 2. An urn and ball model which illustrates the general case of a discrete symbol hidden Markov model. Each of N urns (the N states of the model) contains a large number of colored balls. The proportion of each colored ball in each urn is different, and is governed by the probability density of colors for each urn. The observations from the urn and ball model consist of announcing the color of the ball drawn at random from a selected urn, replacing the ball, and then choosing a new urn from which to select a ball according to the state transition density associated with the originally selected urn.
Elements of an HMM

We now explain the elements and the mechanism of the type of HMM's that we discuss in this paper:

1. There are a finite number, say N, of states in the model; we shall not rigorously define what a state is but simply say that within a state the signal possesses some measurable, distinctive properties.

2. At each clock time, t, a new state is entered based upon a transition probability distribution which depends on the previous state (the Markovian property). (Note that the transition may be such that the process remains in the previous state.)

3. After each transition is made, an observation output symbol is produced according to a probability distribution which depends on the current state. This probability distribution is held fixed for the state regardless of when and how the state is entered. There are thus N such observation probability distributions which, of course, represent random variables or stochastic processes.
To fix ideas, let us consider the "urn and ball" model of Fig. 2. There are N urns, each filled with a large number of colored balls. There are M possible colors for each ball. The observation sequence is generated by initially choosing one of the N urns (according to an initial probability distribution), selecting a ball from the initial urn, recording its color, replacing the ball, and then choosing a new urn according to a transition probability distribution associated with the current urn. Thus a typical observation sequence might be:

    clock time:           1   2   3   4   ...  T
    urn (hidden) state:   q3  q1  q1  q2  ...  qN-2
    color (observation):  R   B   Y   Y   ...  R
We now formally define the following model notation for a discrete observation HMM:

T = length of the observation sequence (total number of clock times)
N = number of states (urns) in the model
M = number of observation symbols (colors)
Q = {q1, q2, ..., qN}, states (urns)
V = {v1, v2, ..., vM}, discrete set of possible symbol observations (colors)
A = {aij}, aij = Pr(qj at t+1 | qi at t), state transition probability distribution
B = {bj(k)}, bj(k) = Pr(vk at t | qj at t), observation symbol probability distribution in state j
π = {πi}, πi = Pr(qi at t = 1), initial state distribution
Using the model, an observation sequence, O = O1 O2 ... OT, is generated as follows:

1. Choose an initial state, i1, according to the initial state distribution, π;
2. Set t = 1;
3. Choose Ot according to b_{i_t}(k), the symbol probability distribution in state i_t;
4. Choose i_{t+1} according to {a_{i_t, i_{t+1}}}, i_{t+1} = 1, 2, ..., N, the state transition probability distribution for state i_t;
5. Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.
We use the compact notation λ = (A, B, π) to represent an HMM. Specification of an HMM involves choice of the number of states, N, and the number of discrete symbols, M (we will briefly discuss continuous density HMM's at the end of this paper), and specification of the three probability densities A, B, and π.

If we try to specify the relative importance of the three densities, A, B, and π, then it should be clear that for most applications π is the least important (this represents initial conditions), and B is the most important (since it is directly related to the observed symbols). For some problems the distribution A is also quite important (recall the 3-biased coins model discussed earlier), whereas for other problems (e.g. isolated word recognition problems) it is of less importance.
The three problems for HMM's

Given the form of the HMM discussed in the previous section, there are three key problems of interest that must be solved for the model to be useful in real world applications. These problems are the following:
Problem 1 - Given the observation sequence O = O1, O2, ..., OT, and the model λ = (A, B, π), how do we compute Pr(O | λ), the probability of the observation sequence?

Problem 2 - Given the observation sequence O = O1, O2, ..., OT, how do we choose a state sequence I = i1, i2, ..., iT which is optimal in some meaningful sense?

Problem 3 - How do we adjust the model parameters λ = (A, B, π) to maximize Pr(O | λ)?
Problem 1 is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the observed sequence was produced by the model? We can also view the problem as: given a model and a sequence of observations, how do we "score" or evaluate the model? The latter viewpoint is very useful. If we think of the case in which we have several competing models (e.g. the four models of Fig. 1 for the coin tossing experiment), the solution to Problem 1 allows us to choose the model which best matches the observations.
Problem 2 is the one in which we attempt to uncover the hidden part of the model, i.e. the state sequence. This is a typical estimation problem. We usually use an optimality criterion to solve this problem as best as possible. Unfortunately, as we will see, there are several possible optimality criteria that can be imposed and hence the choice of criterion is a strong function of the intended use for the uncovered state sequence. A typical use of the recovered state sequence is to learn about the structure of the model, and to get average statistics, behavior, etc. within individual states.

Problem 3 is the one in which we attempt to optimize the model parameters so as to best describe how the observed sequence comes about. We call this a training sequence in this case since it is used to train the model. The training problem is the crucial one for most applications of HMM's since it allows us to optimally adapt model parameters to observed training data, i.e. to create best models for real phenomena.
To fix ideas, consider the following speech recognition scheme. We want to design an N-state HMM for each word of a V-word vocabulary. Using vector quantization (VQ) techniques, we represent the speech signal by a sequence of VQ codebook symbols derived from an M-word codebook. Thus we start with a training sequence, for each vocabulary word, consisting of a number of repetitions of the spoken word (by one or more talkers). We use the solution to Problem 3 to optimally get model parameters for each word model. To develop an understanding of the physical meaning of the model states, we use the solution to Problem 2 to segment each of the word training sequences into states, and then study the observations occurring in each state. The result of this study may lead to further improvements on the model. We shall discuss this in later sections. Finally to do recognition on an unknown word, we use the solution to Problem 1 to score each word model based upon the given test observation sequence, and select the word whose word model score is the highest.
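The recognition step of the scheme above is a simple argmax over per-word model scores. In the sketch below, `score_word_model` is a hypothetical stand-in for the solution to Problem 1 (computing Pr(O | λ)); the stub scorer and the two-word vocabulary are made up so the control flow can be shown end to end:

```python
def recognize(observation_sequence, word_models, score_word_model):
    """Return the vocabulary word whose model scores the test sequence highest."""
    best_word, best_score = None, float("-inf")
    for word, model in word_models.items():
        score = score_word_model(model, observation_sequence)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical stub: pretend each "model" is a preferred VQ symbol and the
# score counts how often that symbol occurs in the observation sequence.
word_models = {"yes": 0, "no": 1}
stub_score = lambda model, obs: sum(1 for o in obs if o == model)

assert recognize([0, 0, 1, 0], word_models, stub_score) == "yes"
```

In a real system the stub would be replaced by the Problem 1 evaluation of each word's trained HMM against the VQ symbol sequence of the test utterance.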
We now present the formal mathematical solutions to each of the three fundamental problems for HMM's. And, as we shall see, these three problems may be linked together under our probabilistic framework.

SOLUTIONS TO THE THREE HMM PROBLEMS
Problem 1

We wish to calculate the probability of the observation sequence O, given the model λ. The most straightforward way of doing this is through enumerating every possible state sequence of length T (the number of observations). For every fixed state sequence I = i1 i2 ... iT, the probability of the observation sequence O is Pr(O | I, λ), where

Pr(O | I, λ) = b_{i1}(O1) b_{i2}(O2) ... b_{iT}(OT).

The probability of such a state sequence I, on the other hand, is

Pr(I | λ) = π_{i1} a_{i1 i2} a_{i2 i3} ... a_{i(T-1) iT}.

The joint probability of O and I, i.e., the probability that O and I occur simultaneously, is simply the product of the above two terms, Pr(O, I | λ) = Pr(O | I, λ) Pr(I | λ). The probability of O then is obtained by summing this joint probability over all possible state sequences:

Pr(O | λ) = Σ_{all I} Pr(O | I, λ) Pr(I | λ)
          = Σ_{i1, i2, ..., iT} π_{i1} b_{i1}(O1) a_{i1 i2} b_{i2}(O2) ... a_{i(T-1) iT} b_{iT}(OT).
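The enumeration just described can be coded directly. This sketch computes Pr(O | λ) by brute force over all N^T state sequences (exponential in T, which is why more efficient procedures are needed in practice); the small 2-state, 2-symbol model is a made-up example:

```python
from itertools import product

def prob_observation(A, B, pi, O):
    """Compute Pr(O | lambda) by enumerating every state sequence of length T."""
    N, T = len(pi), len(O)
    total = 0.0
    for states in product(range(N), repeat=T):  # every state sequence I
        # Pr(I | lambda) = pi_{i1} * a_{i1 i2} * ... * a_{i(T-1) iT}
        p_states = pi[states[0]]
        for t in range(1, T):
            p_states *= A[states[t - 1]][states[t]]
        # Pr(O | I, lambda) = b_{i1}(O1) * b_{i2}(O2) * ... * b_{iT}(OT)
        p_obs = 1.0
        for t in range(T):
            p_obs *= B[states[t]][O[t]]
        total += p_states * p_obs
    return total

# Made-up 2-state, 2-symbol model for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]

# Sanity check: Pr(O | lambda) summed over every possible observation
# sequence of a fixed length must equal 1.
total = sum(prob_observation(A, B, pi, list(O)) for O in product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-9
```

The sanity check works because the joint distribution over (O, I) is proper, so marginalizing out I and then summing over all O of length T must give 1.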

Citations
More filters
Journal ArticleDOI
Lawrence R. Rabiner1
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described. >

21,819 citations

Book ChapterDOI
TL;DR: The chapter discusses two important directions of research to improve learning algorithms: the dynamic node generation, which is used by the cascade correlation algorithm; and designing learning algorithms where the choice of parameters is not an issue.
Abstract: Publisher Summary This chapter provides an account of different neural network architectures for pattern recognition. A neural network consists of several simple processing elements called neurons. Each neuron is connected to some other neurons and possibly to the input nodes. Neural networks provide a simple computing paradigm to perform complex recognition tasks in real time. The chapter categorizes neural networks into three types: single-layer networks, multilayer feedforward networks, and feedback networks. It discusses the gradient descent and the relaxation method as the two underlying mathematical themes for deriving learning algorithms. A lot of research activity is centered on learning algorithms because of their fundamental importance in neural networks. The chapter discusses two important directions of research to improve learning algorithms: the dynamic node generation, which is used by the cascade correlation algorithm; and designing learning algorithms where the choice of parameters is not an issue. It closes with the discussion of performance and implementation issues.

13,033 citations

Journal ArticleDOI
TL;DR: Both optimal and suboptimal Bayesian algorithms for nonlinear/non-Gaussian tracking problems, with a focus on particle filters are reviewed.
Abstract: Increasingly, for many application areas, it is becoming important to include elements of nonlinearity and non-Gaussianity in order to model accurately the underlying dynamics of a physical system. Moreover, it is typically crucial to process data on-line as it arrives, both from the point of view of storage costs as well as for rapid adaptation to changing signal characteristics. In this paper, we review both optimal and suboptimal Bayesian algorithms for nonlinear/non-Gaussian tracking problems, with a focus on particle filters. Particle filters are sequential Monte Carlo methods based on point mass (or "particle") representations of probability densities, which can be applied to any state-space model and which generalize the traditional Kalman filtering methods. Several variants of the particle filter such as SIR, ASIR, and RPF are introduced within a generic framework of the sequential importance sampling (SIS) algorithm. These are discussed and compared with the standard EKF through an illustrative example.

11,409 citations


Cites background from "An introduction to hidden Markov mo..."

  • ...Hidden Markov model (HMM) filters [30], [35], [36], [39] are an application of such approximate grid-based methods in a fixed-interval smoothing context and have been used extensively in speech processing....


Book
01 Jan 2009
TL;DR: The motivations for and principles of learning algorithms for deep architectures are discussed, in particular those that exploit unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, as building blocks for constructing deeper models such as Deep Belief Networks.
Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

7,767 citations

BookDOI
01 Jan 2001
TL;DR: This book presents the first comprehensive treatment of Monte Carlo techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection.
Abstract: Monte Carlo methods are revolutionizing the on-line analysis of data in fields as diverse as financial modeling, target tracking and computer vision. These methods, appearing under the names of bootstrap filters, condensation, optimal Monte Carlo filters, particle filters and survival of the fittest, have made it possible to solve numerically many complex, non-standard problems that were previously intractable. This book presents the first comprehensive treatment of these techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection, computer vision, semiconductor design, population biology, dynamic Bayesian networks, and time series analysis. This will be of great value to students, researchers and practitioners, who have some basic knowledge of probability. Arnaud Doucet received the Ph.D. degree from the University of Paris-XI Orsay in 1997. From 1998 to 2000, he conducted research at the Signal Processing Group of Cambridge University, UK. He is currently an assistant professor at the Department of Electrical Engineering of Melbourne University, Australia. His research interests include Bayesian statistics, dynamic models and Monte Carlo methods. Nando de Freitas obtained a Ph.D. degree in information engineering from Cambridge University in 1999. He is presently a research associate with the artificial intelligence group of the University of California at Berkeley. His main research interests are in Bayesian statistics and the application of on-line and batch Monte Carlo methods to machine learning. Neil Gordon obtained a Ph.D. in Statistics from Imperial College, University of London in 1993.
He is with the Pattern and Information Processing group at the Defence Evaluation and Research Agency in the United Kingdom. His research interests are in time series, statistical data analysis, and pattern recognition with a particular emphasis on target tracking and missile guidance.

6,574 citations

References
More filters
Journal ArticleDOI
01 Mar 1973
TL;DR: This paper gives a tutorial exposition of the Viterbi algorithm and of how it is implemented and analyzed, and increasing use of the algorithm in a widening variety of areas is foreseen.
Abstract: The Viterbi algorithm (VA) is a recursive optimal solution to the problem of estimating the state sequence of a discrete-time finite-state Markov process observed in memoryless noise. Many problems in areas such as digital communications can be cast in this form. This paper gives a tutorial exposition of the algorithm and of how it is implemented and analyzed. Applications to date are reviewed. Increasing use of the algorithm in a widening variety of areas is foreseen.
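The recursive solution the paper describes can be sketched in a few lines for a discrete HMM. The log-space formulation and the toy two-state parameters in the example are illustrative assumptions:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete HMM, computed in log space.

    pi:  (N,) initial state probabilities
    A:   (N, N) transition probabilities, A[i][j] = P(state j | state i)
    B:   (N, M) emission probabilities,  B[i][k] = P(symbol k | state i)
    obs: sequence of observation symbol indices
    """
    logpi, logA, logB = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    delta = logpi + logB[:, obs[0]]      # best log-prob of a path ending in each state
    psi = np.zeros((T, N), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA   # scores[i, j]: best path into i, then i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # backtrack from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

For example, with A = B = [[0.9, 0.1], [0.1, 0.9]] and uniform initial probabilities, the observation sequence [0, 0, 1, 1] decodes to the state sequence [0, 0, 1, 1].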

5,995 citations

Journal ArticleDOI
TL;DR: An inequality is proved showing that a certain growth transformation can only increase the value of a homogeneous polynomial with nonnegative coefficients on a product of probability simplices, with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology.
Abstract: 1. Summary. The object of this note is to prove the theorem below and sketch two applications, one to statistical estimation for (probabilistic) functions of Markov processes [1] and one to Blakley's model for ecology [4]. 2. Result. THEOREM. Let P(x) = P({x_{ij}}) be a polynomial with nonnegative coefficients, homogeneous of degree d in its variables {x_{ij}}. Let x = {x_{ij}} be any point of the domain D: x_{ij} >= 0, sum_{j=1}^{q_i} x_{ij} = 1, i = 1, ..., p, j = 1, ..., q_i. For x = {x_{ij}} in D let ℑ(x) = ℑ{x_{ij}} denote the point of D whose (i, j) coordinate is ℑ(x)_{ij} = x_{ij} (∂P/∂x_{ij}) / sum_{j=1}^{q_i} x_{ij} (∂P/∂x_{ij}). Then P(ℑ(x)) > P(x) unless ℑ(x) = x.
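The growth transformation in this theorem maps each x_{ij} to x_{ij} ∂P/∂x_{ij}, renormalized over its row, and is guaranteed never to decrease P. A small numeric check on a single probability row, where the polynomial is an arbitrary illustrative choice rather than one from the paper:

```python
# Baum-Eagon growth transform on a single probability row (the p = 1 case).

def growth_transform(x, dP):
    """Map x_j to x_j * dP/dx_j, renormalized so the row sums to 1."""
    g = dP(x)
    denom = sum(xj * gj for xj, gj in zip(x, g))
    return [xj * gj / denom for xj, gj in zip(x, g)]

# P(x0, x1) = 2*x0^2*x1 + x1^3: homogeneous of degree 3, nonnegative coefficients
P = lambda x: 2 * x[0] ** 2 * x[1] + x[1] ** 3
dP = lambda x: [4 * x[0] * x[1], 2 * x[0] ** 2 + 3 * x[1] ** 2]

x = [0.5, 0.5]
for _ in range(5):
    x_next = growth_transform(x, dP)
    assert P(x_next) >= P(x)  # the theorem guarantees monotone ascent
    x = x_next
```

Iterating this transform is the mechanism behind Baum-Welch reestimation, where P is the likelihood of an observation sequence viewed as a polynomial in the HMM parameters.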

1,145 citations

Journal ArticleDOI
TL;DR: This paper presents several of the salient theoretical and practical issues associated with modeling a speech signal as a probabilistic function of a (hidden) Markov chain, and focuses on a particular class of Markov models, which are especially appropriate for isolated word recognition.
Abstract: In this paper we present several of the salient theoretical and practical issues associated with modeling a speech signal as a probabilistic function of a (hidden) Markov chain. First we give a concise review of the literature with emphasis on the Baum-Welch algorithm. This is followed by a detailed discussion of three issues not treated in the literature: alternatives to the Baum-Welch algorithm; critical facets of the implementation of the algorithms, with emphasis on their numerical properties; and behavior of Markov models on certain artificial but realistic problems. Special attention is given to a particular class of Markov models, which we call “left-to-right” models. This class of models is especially appropriate for isolated word recognition. The results of the application of these methods to an isolated word, speaker-independent speech recognition experiment are given in a companion paper.
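The quantity the Baum-Welch algorithm iteratively increases, the probability of the observations given the model, is computed with the forward recursion. A minimal sketch for a discrete HMM (the function name and example parameters are assumptions for illustration):

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | model) via the forward recursion for a discrete HMM.

    pi: (N,) initial probabilities; A: (N, N) transitions;
    B: (N, M) emissions; obs: sequence of symbol indices.
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction step of the recursion
    return float(alpha.sum())
```

A numerically robust implementation would rescale alpha at every step to avoid underflow on long sequences, which is one of the implementation issues this paper emphasizes.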

1,060 citations

Journal ArticleDOI
Frederick Jelinek
01 Apr 1976
TL;DR: Experimental results are presented that indicate the power of the methods and concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding.
Abstract: Statistical methods useful in automatic recognition of continuous speech are described. They concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding. Experimental results are presented that indicate the power of the methods.

1,024 citations

Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.