
SIAM J. Optimization, Vol. 6, No. 3, pp. 807-822, August 1996
© 1996 Society for Industrial and Applied Mathematics

INCREMENTAL LEAST SQUARES METHODS AND THE EXTENDED KALMAN FILTER*

DIMITRI P. BERTSEKAS†
Abstract. In this paper we propose and analyze nonlinear least squares methods which process the data incrementally, one data block at a time. Such methods are well suited for large data sets and real time operation and have received much attention in the context of neural network training problems. We focus on the extended Kalman filter, which may be viewed as an incremental version of the Gauss-Newton method. We provide a nonstochastic analysis of its convergence properties, and we discuss variants aimed at accelerating its convergence.

Key words. optimization, least squares, Kalman filter

AMS subject classifications. 93E11, 90C30, 65K10
1. Introduction. We consider least squares problems of the form

(1)   minimize   f(x) = ||g(x)||^2 = Σ_{i=1}^m ||g_i(x)||^2
      subject to x ∈ ℝ^n,

where g is a continuously differentiable function with component functions g_1, ..., g_m, where g_i : ℝ^n → ℝ^{r_i}. Here we write ||z|| for the usual Euclidean norm of a vector z, that is, ||z|| = √(z'z), where the prime denotes transposition. We also write ∇g_i for the n × r_i gradient matrix of g_i and ∇g for the n × (r_1 + ... + r_m) gradient matrix of g.
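As a concrete illustration of the cost in (1), the following sketch uses hypothetical linear data blocks g_i(x) = z_i − C_i x (anticipating Section 2) and checks that summing the squared block norms matches the squared norm of the stacked function g:

```python
import numpy as np

# Hypothetical linear data blocks g_i(x) = z_i - C_i x, used only to
# illustrate the cost in (1): f(x) = ||g(x)||^2 = sum_i ||g_i(x)||^2.
rng = np.random.default_rng(0)
m, n = 5, 3
blocks = [(rng.standard_normal(2), rng.standard_normal((2, n))) for _ in range(m)]

def g_i(x, i):
    z, C = blocks[i]
    return z - C @ x

def f(x):
    # Sum of squared block norms; equals ||g(x)||^2 for the stacked g = (g_1, ..., g_m)
    return sum(np.linalg.norm(g_i(x, i)) ** 2 for i in range(m))

x = np.zeros(n)
stacked = np.concatenate([g_i(x, i) for i in range(m)])
assert np.isclose(f(x), np.linalg.norm(stacked) ** 2)
```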
Least squares problems very often arise in contexts where the functions g_i correspond to measurements that we are trying to fit with a model parameterized by x. Motivated by this context, we refer to each component g_i as a data block, and we refer to the entire function g = (g_1, ..., g_m) as the data set.
One of the most common iterative methods for solving least squares problems is the Gauss-Newton method, given by

(2)   x^{k+1} = x^k − α^k (∇g(x^k)∇g(x^k)')^{-1} ∇g(x^k) g(x^k),

where α^k is a positive stepsize, and we assume that the n × n matrix ∇g(x^k)∇g(x^k)' is invertible. The case α^k = 1 corresponds to the pure form of the method, where x^{k+1} is obtained by linearizing g at the current iterate x^k and minimizing the norm of the linearized function, that is,

(3)   x^{k+1} = arg min_x ||g(x^k) + ∇g(x^k)'(x − x^k)||^2   if α^k = 1.
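A minimal sketch of iteration (2), assuming the paper's convention that ∇g is the transpose of the usual Jacobian. The linear model g(x) = z − Cx and all data are hypothetical; for a linear model a single pure step (α^k = 1) reaches the least squares minimizer, in line with (3):

```python
import numpy as np

# Gauss-Newton step (2) on a hypothetical linear model g(x) = z - C x.
rng = np.random.default_rng(1)
n, r = 3, 6
C = rng.standard_normal((r, n))
z = rng.standard_normal(r)
g = lambda x: z - C @ x      # residual function
grad_g = lambda x: -C.T      # n x r gradient matrix ∇g(x) (transpose of Jacobian)

def gauss_newton_step(x, alpha=1.0):
    G = grad_g(x)
    # x - α (∇g ∇g')^{-1} ∇g g(x); ∇g ∇g' assumed invertible as in the text
    return x - alpha * np.linalg.solve(G @ G.T, G @ g(x))

x1 = gauss_newton_step(np.zeros(n))
x_star = np.linalg.lstsq(C, z, rcond=None)[0]
assert np.allclose(x1, x_star)   # one pure step solves the linear problem
```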
In problems where there are many data blocks, the Gauss-Newton method may be ineffective because the size of the data set makes each iteration very costly. For such problems it may be much better to use an incremental method that does not wait to process the entire data set before updating x, as discussed in [Ber95]. Instead, the method cycles through the data blocks in sequence and updates the estimate of x after each data block is processed. A further advantage is that estimates of x become available as data is accumulated, making the approach suitable for real time operation.

*Received by the editors May 27, 1994; accepted for publication (in revised form) April 4, 1995. This research was supported by NSF grant 9300494-DMI.
†Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (dimitrib@mit.edu).

Downloaded 06/26/13 to 18.7.29.240. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
Such methods include the Widrow-Hoff least-mean-square (LMS) algorithm [WiH60], [WiS85] for the case where the data blocks are linear, and other steepest-descent-like methods for nonlinear data blocks that have been used extensively for the training of neural networks under the generic name of backpropagation methods.
A cycle through the data set of a typical example of such a method starts with a vector x^k and generates x^{k+1} according to

x^{k+1} = ψ_m,

where ψ_m is obtained at the last step of the recursion

ψ_i = ψ_{i−1} − α^k ∇g_i(ψ_{i−1}) g_i(ψ_{i−1}),   i = 1, ..., m,

where α^k is a positive stepsize and ψ_0 = x^k.
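One cycle of this recursion can be sketched as follows, again with hypothetical linear data blocks g_i(x) = z_i − C_i x; the factor of 2 in ∇||g_i||^2 = 2∇g_i g_i is absorbed into the stepsize:

```python
import numpy as np

# Incremental steepest-descent (backpropagation-style) cycles on hypothetical
# linear data blocks, with a diminishing stepsize α^k = O(1/k).
rng = np.random.default_rng(2)
m, n = 20, 3
blocks = [(rng.standard_normal(2), rng.standard_normal((2, n))) for _ in range(m)]

def incremental_cycle(x, alpha):
    psi = x.copy()                      # ψ_0 = x^k
    for z, C in blocks:                 # i = 1, ..., m
        grad = -C.T @ (z - C @ psi)     # ∇g_i(ψ_{i-1}) g_i(ψ_{i-1})
        psi = psi - alpha * grad
    return psi                          # x^{k+1} = ψ_m

def cost(x):
    return sum(np.linalg.norm(z - C @ x) ** 2 for z, C in blocks)

x = np.zeros(n)
for k in range(200):
    x = incremental_cycle(x, alpha=0.01 / (1 + k))   # diminishing stepsize
```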
Backpropagation methods are often effective, and they are supported by stochastic [PoT73], [Lju77], [KuC78], [Pol87], [BeT89], [Whi89a], [Whi89b], [Gai93], [BeT96] as well as deterministic convergence analyses [Luo91], [Gri93], [LuT93], [MaS94], [Man93], [BeT96]. The main difference between stochastic and deterministic methods of analysis is that the former apply to an infinite data set (one with an infinite number of data blocks) satisfying some statistical assumptions, while the latter apply to a finite data set.
There are also parallel asynchronous versions of backpropagation methods and corresponding stochastic [Tsi84], [TBA86], [BeT89], [Gai93] as well as deterministic convergence results [Tsi84], [TBA86], [BeT89], [MaS94].
However, backpropagation methods typically have a slow convergence rate not only because they are first-order steepest-descent-like methods, but also because they require a diminishing stepsize α^k = O(1/k) for convergence. If α^k is instead taken to be a small constant, an oscillation within each data cycle typically arises, as shown in [Luo91].
In this paper we focus on methods that combine the advantages of backpropagation methods for large data sets with the often superior convergence rate of the Gauss-Newton method. We thus consider an incremental version of the Gauss-Newton method, which operates in cycles through the data blocks. The (k + 1)st cycle starts with a vector x^k and a positive semidefinite matrix H^k to be defined later, then updates x via a Gauss-Newton-like iteration aimed at minimizing

λ(x − x^k)'H^k(x − x^k) + ||g_1(x)||^2,

where λ is a scalar with 0 < λ ≤ 1, then updates x via a Gauss-Newton-like iteration aimed at minimizing

λ^2(x − x^k)'H^k(x − x^k) + λ||g_1(x)||^2 + ||g_2(x)||^2,

and similarly continues, with the ith step consisting of a Gauss-Newton-like iteration aimed at minimizing the weighted partial sum

λ^i(x − x^k)'H^k(x − x^k) + Σ_{j=1}^i λ^{i−j} ||g_j(x)||^2.

In particular, given x^k, the (k + 1)st cycle sequentially generates the vectors

(4)   ψ_i = arg min_{x ∈ ℝ^n} { λ^i(x − x^k)'H^k(x − x^k) + Σ_{j=1}^i λ^{i−j} ||g̃_j(x, ψ_{j−1})||^2 },   i = 1, ..., m,

and sets

(5)   x^{k+1} = ψ_m,

where g̃_j(x, ψ_{j−1}) are the linearized functions

(6)   g̃_j(x, ψ_{j−1}) = g_j(ψ_{j−1}) + ∇g_j(ψ_{j−1})'(x − ψ_{j−1}),

and ψ_0 is the estimate of x at the end of the kth cycle:

(7)   ψ_0 = x^k.

As will be seen later, the quadratic minimizations above can be efficiently implemented using the recursive Kalman filter formulas.
The most common version of the preceding algorithm is obtained when the matrices H^k are updated by the recursion

(8)   H^{k+1} = λ^m H^k + Σ_{j=1}^m λ^{m−j} ∇g_j(ψ_{j−1}) ∇g_j(ψ_{j−1})'.

Then for λ = 1 and H^0 = 0, the method becomes the well-known extended Kalman filter (EKF for short) specialized to the case where the state of the underlying dynamical system stays constant and the measurement equation is nonlinear.
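A sketch of the cycle (4)-(8) for λ = 1, using the recursive Kalman filter implementation promised in the text. The measurement model z_i = tanh(C_i x) and all data are hypothetical, chosen only to make the blocks nonlinear; H^0 is taken as a tiny multiple of the identity so that H_1 is invertible:

```python
import numpy as np

# One EKF cycle per (4)-(8): each linearized block enters through a
# Kalman-filter update. Hypothetical noiseless data z_i = tanh(C_i x_true).
rng = np.random.default_rng(3)
m, n = 10, 2
x_true = np.array([0.3, -0.2])
Cs = [rng.standard_normal((2, n)) for _ in range(m)]
zs = [np.tanh(C @ x_true) for C in Cs]

def g(i, x):                             # data block g_i(x)
    return zs[i] - np.tanh(Cs[i] @ x)

def jac(i, x):                           # r_i x n Jacobian, i.e., ∇g_i(x)'
    return -(1.0 - np.tanh(Cs[i] @ x) ** 2)[:, None] * Cs[i]

def cost(x):
    return sum(np.linalg.norm(g(i, x)) ** 2 for i in range(m))

def ekf_cycle(x, H, lam=1.0):
    psi = x.copy()                       # ψ_0 = x^k, cf. eq. (7)
    for i in range(m):
        J = jac(i, psi)                  # linearization (6) at ψ_{i-1}
        H = lam * H + J.T @ J            # accumulates H^{k+1}, cf. eq. (8)
        # Kalman update for (4): ψ_i = ψ_{i-1} - H_i^{-1} ∇g_i(ψ_{i-1}) g_i(ψ_{i-1})
        psi = psi - np.linalg.solve(H, J.T @ g(i, psi))
    return psi, H                        # x^{k+1} = ψ_m, cf. eq. (5)

x, H = np.zeros(n), 1e-10 * np.eye(n)    # tiny H^0 ≈ 0 keeps H_1 invertible
for _ in range(50):
    x, H = ekf_cycle(x, H)
```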
The EKF was originally conceived as a method for estimating parameters from nonlinear measurements that are generated in real time. The basic idea of the method is to linearize each new measurement around the current value of the estimate and treat the measurement as if it were linear (cf. eq. (4)). The estimate is then corrected to account for the new (linearized) measurement using the convenient Kalman filter formulas (see Lemma 1). The algorithm considered here cycles repeatedly through the data set and is sometimes called the iterated extended Kalman filter.
For the problem of estimating the state of a dynamic system, a cycle through the data set involves solving a problem of smoothing the estimate of the state trajectory before starting a new cycle (see, e.g., [Bel94]). The matrix H^k has the meaning of the inverse of an approximate error covariance of the estimate x^k.
In the case λ < 1, the effect of old data blocks is discounted, and successive estimates produced by the method tend to change more rapidly. In this way one may obtain a faster rate of progress of the method, and this is the main motivation for considering λ < 1.
The EKF has been used extensively in a variety of control and estimation applications (see, e.g., [AWT69], [Jaz70], [Meh71], [THS77], [AnM79], [WeM80]) and has also been suggested for the training of neural networks (see, e.g., [WaT90] and [RRK92]). The version of the algorithm (4)-(8) with λ < 1 has also been proposed by Davidon [Dav76]. Unaware of the earlier work in the control and estimation literature, Davidon described the qualitative behavior of the method together with favorable computational experience for problems with large data sets, but gave no convergence

analysis. The first convergence analysis of the EKF was given by Ljung [Lju79], who, assuming λ = 1, used a stochastic formulation (i.e., an infinite data set) and the ODE approach of [Lju77] to prove satisfactory convergence properties for a version of the EKF that is closely related to the one considered here (Theorem 6.1 of [Lju79], which assumes a stationary measurement equation and additive noise). Ljung also showed that the EKF, when applied to more complex models where the underlying dynamic system is linear but its dynamics depend on x, exhibits complex behavior, including the possible convergence to biased estimates. For such models he suggested the use of a different formulation of the least squares problem involving the innovations process (see also [Urs80]).
The algorithms and analysis of the present paper apply to any type of deterministic least squares problem, and thus also apply to Ljung's innovations-based formulation.
A deterministic analysis of the EKF method (4)-(8), where λ < 1, was given in Pappas's Master's thesis [Pap82]. He considered only the special case where min_x ||g(x)|| = 0 and showed that the EKF converges locally to a nonsingular solution of the system g(x) = 0 at a rate that is linear with convergence ratio λ^m. He also argued by example that when λ < 1 and min_x ||g(x)|| > 0, the iterates ψ_i produced by the EKF within each cycle generally oscillate with a "size" of oscillation that diminishes as λ approaches 1.
The purpose of this paper is to provide a deterministic analysis of the convergence properties of the EKF for the general case where min_x ||g(x)|| is not necessarily zero. Our analysis is complicated by the lack of an explicit stepsize in the algorithm. In the case where λ = 1 we show that the limit points of the sequence {x^k} generated by the EKF are stationary points of the least squares problem. The idea of the proof is to show that the method involves an implicit stepsize of order O(1/k) and then to apply arguments similar to those used by Tsitsiklis [Tsi84] and Tsitsiklis, Bertsekas, and Athans [TBA86] in their analyses of asynchronous distributed gradient methods, and by Mangasarian and Solodov [MaS94] in their convergence proof of an asynchronous parallel backpropagation method.
To improve the rate of convergence of the method, which is sublinear and typically slow, we suggest a convergent and empirically faster variant where λ is initially less than 1 and is progressively increased toward 1.
In addition to dealing more naturally with the case of a finite data set, a nice aspect of the deterministic analysis is that it decouples the stochastic modeling of the data generation process from the algorithmic solution of the least squares problem. In other words, the EKF discussed here will (typically) find a least squares solution even if the least squares formulation is inappropriate for the corresponding real parameter estimation problem. This is a valuable insight because it is sometimes thought that convergence of the EKF depends on the validity of the underlying stochastic model assumptions.
2. The EKF. When the data blocks are linear functions, it takes a single pure Gauss-Newton iteration to find the least squares estimate. This iteration can be implemented as an incremental algorithm, the Kalman filter, which we now describe.

Assume that the functions g_i are linear and of the form

(9)   g_i(x) = z_i − C_i x,

where z_i ∈ ℝ^{r_i} are given vectors and C_i are given r_i × n matrices. Let us consider the incremental method that generates the vectors

(10)   ψ_i = arg min_{x ∈ ℝ^n} Σ_{j=1}^i λ^{i−j} ||z_j − C_j x||^2,   i = 1, ..., m.

Then the method can be recursively implemented, as shown by the following well-known proposition (see, e.g., [AnM79]).

PROPOSITION 1 (Kalman filter). Assuming that the matrix C_1'C_1 is positive definite, the least squares estimates

ψ_i = arg min_{x ∈ ℝ^n} Σ_{j=1}^i λ^{i−j} ||z_j − C_j x||^2,   i = 1, ..., m,

can be generated by the algorithm

(11)   ψ_i = ψ_{i−1} + H_i^{-1} C_i'(z_i − C_i ψ_{i−1}),   i = 1, ..., m,

where ψ_0 is an arbitrary vector, and the positive-definite matrices H_i are generated by

(12)   H_i = λH_{i−1} + C_i'C_i,   i = 1, ..., m,

with H_0 = 0. More generally, for all i and τ < i, we have

(13)   ψ_i = ψ_τ + H_i^{-1} Σ_{j=τ+1}^i λ^{i−j} C_j'(z_j − C_j ψ_τ).
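Proposition 1 is easy to check numerically. The following sketch uses hypothetical data with r_i = 4 ≥ n so that C_1'C_1 is positive definite, runs the recursion (11)-(12), and compares ψ_m against a direct weighted least squares solve:

```python
import numpy as np

# Recursion (11)-(12) versus a direct solve of min_x Σ_j λ^{m-j} ||z_j - C_j x||^2.
rng = np.random.default_rng(4)
m, n, lam = 6, 3, 0.9
Cs = [rng.standard_normal((4, n)) for _ in range(m)]   # r_i = 4 > n, C_1'C_1 pos. def.
zs = [rng.standard_normal(4) for _ in range(m)]

psi = rng.standard_normal(n)     # ψ_0 is arbitrary; it cancels at i = 1
H = np.zeros((n, n))             # H_0 = 0
for i in range(m):
    H = lam * H + Cs[i].T @ Cs[i]                                    # eq. (12)
    psi = psi + np.linalg.solve(H, Cs[i].T @ (zs[i] - Cs[i] @ psi))  # eq. (11)

# Direct weighted least squares via row scaling by sqrt(λ^{m-j})
A = np.vstack([lam ** ((m - 1 - j) / 2) * Cs[j] for j in range(m)])
b = np.concatenate([lam ** ((m - 1 - j) / 2) * zs[j] for j in range(m)])
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(psi, x_star)
```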
The proof of Proposition 1 is obtained by using the following lemma involving two data blocks, the straightforward proof of which is omitted.

LEMMA 1. Let ẑ_1, ẑ_2 be given vectors and F_1, F_2 be given matrices such that F_1'F_1 is positive definite. Then the vectors

(14)   ψ_1 = arg min_{x ∈ ℝ^n} ||ẑ_1 − F_1 x||^2

and

(15)   ψ_2 = arg min_{x ∈ ℝ^n} { ||ẑ_1 − F_1 x||^2 + ||ẑ_2 − F_2 x||^2 }

are also given by

(16)   ψ_1 = ψ_0 + (F_1'F_1)^{-1} F_1'(ẑ_1 − F_1 ψ_0)

and

(17)   ψ_2 = ψ_1 + (F_1'F_1 + F_2'F_2)^{-1} F_2'(ẑ_2 − F_2 ψ_1),

where ψ_0 is an arbitrary vector.
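Lemma 1 can likewise be verified numerically on hypothetical data: formulas (16)-(17) must reproduce the minimizers (14) and (15):

```python
import numpy as np

# Check that the incremental formulas (16)-(17) match direct least squares.
rng = np.random.default_rng(5)
n = 3
F1, F2 = rng.standard_normal((5, n)), rng.standard_normal((4, n))
zh1, zh2 = rng.standard_normal(5), rng.standard_normal(4)
psi0 = rng.standard_normal(n)            # ψ_0 arbitrary

# (16): ψ_1 = ψ_0 + (F1'F1)^{-1} F1'(ẑ_1 - F1 ψ_0)
psi1 = psi0 + np.linalg.solve(F1.T @ F1, F1.T @ (zh1 - F1 @ psi0))
# (17): ψ_2 = ψ_1 + (F1'F1 + F2'F2)^{-1} F2'(ẑ_2 - F2 ψ_1)
psi2 = psi1 + np.linalg.solve(F1.T @ F1 + F2.T @ F2, F2.T @ (zh2 - F2 @ psi1))

# Direct minimizers of (14) and (15)
d1 = np.linalg.lstsq(F1, zh1, rcond=None)[0]
d2 = np.linalg.lstsq(np.vstack([F1, F2]), np.concatenate([zh1, zh2]), rcond=None)[0]
assert np.allclose(psi1, d1) and np.allclose(psi2, d2)
```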
The proof of eqs. (12) and (13) of Proposition 1 follows by applying Lemma 1 with the correspondences ψ_1 ↔ ψ_τ, ψ_2 ↔ ψ_i, and the stacked, scaled data

(18)   ẑ_1 = (λ^{(i−1)/2} z_1', ..., λ^{(i−τ)/2} z_τ')',   F_1 = (λ^{(i−1)/2} C_1', ..., λ^{(i−τ)/2} C_τ')',
       ẑ_2 = (λ^{(i−τ−1)/2} z_{τ+1}', ..., z_i')',   F_2 = (λ^{(i−τ−1)/2} C_{τ+1}', ..., C_i')'.
