
SIAM J. Optimization, Vol. 6, No. 3, pp. 807-822, August 1996
© 1996 Society for Industrial and Applied Mathematics

INCREMENTAL LEAST SQUARES METHODS AND THE EXTENDED KALMAN FILTER*

DIMITRI P. BERTSEKAS†
Abstract. In this paper we propose and analyze nonlinear least squares methods which process the data incrementally, one data block at a time. Such methods are well suited for large data sets and real time operation and have received much attention in the context of neural network training problems. We focus on the extended Kalman filter, which may be viewed as an incremental version of the Gauss-Newton method. We provide a nonstochastic analysis of its convergence properties, and we discuss variants aimed at accelerating its convergence.

Key words. optimization, least squares, Kalman filter

AMS subject classifications. 93E11, 90C30, 65K10
1. Introduction. We consider least squares problems of the form

(1)   minimize   f(x) = ||g(x)||^2 = Σ_{i=1}^m ||g_i(x)||^2
      subject to x ∈ ℝ^n,

where g is a continuously differentiable function with component functions g_1, ..., g_m, where g_i : ℝ^n → ℝ^{r_i}. Here we write ||z|| for the usual Euclidean norm of a vector z, that is, ||z|| = √(z'z), where the prime denotes transposition. We also write ∇g_i for the n × r_i gradient matrix of g_i and ∇g for the n × (r_1 + ... + r_m) gradient matrix of g.
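As a concrete illustration of the cost in (1), the following sketch uses hypothetical linear data blocks g_i(x) = z_i − C_i x (anticipating Section 2) and checks that summing the squared block norms matches the squared norm of the stacked function g:

```python
import numpy as np

# Hypothetical linear data blocks g_i(x) = z_i - C_i x, used only to
# illustrate the cost in (1): f(x) = ||g(x)||^2 = sum_i ||g_i(x)||^2.
rng = np.random.default_rng(0)
m, n = 5, 3
blocks = [(rng.standard_normal(2), rng.standard_normal((2, n))) for _ in range(m)]

def g_i(x, i):
    z, C = blocks[i]
    return z - C @ x

def f(x):
    # Sum of squared block norms; equals ||g(x)||^2 for the stacked g = (g_1, ..., g_m)
    return sum(np.linalg.norm(g_i(x, i)) ** 2 for i in range(m))

x = np.zeros(n)
stacked = np.concatenate([g_i(x, i) for i in range(m)])
assert np.isclose(f(x), np.linalg.norm(stacked) ** 2)
```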
Least squares problems very often arise in contexts where the functions g_i correspond to measurements that we are trying to fit with a model parameterized by x. Motivated by this context, we refer to each component g_i as a data block, and we refer to the entire function g = (g_1, ..., g_m) as the data set.
One of the most common iterative methods for solving least squares problems is the Gauss-Newton method, given by

(2)   x^{k+1} = x^k − α^k (∇g(x^k)∇g(x^k)')^{-1} ∇g(x^k) g(x^k),

where α^k is a positive stepsize, and we assume that the n × n matrix ∇g(x^k)∇g(x^k)' is invertible. The case α^k = 1 corresponds to the pure form of the method, where x^{k+1} is obtained by linearizing g at the current iterate x^k and minimizing the norm of the linearized function, that is,

(3)   x^{k+1} = arg min_x ||g(x^k) + ∇g(x^k)'(x − x^k)||^2   if α^k = 1.
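A minimal sketch of iteration (2), assuming the paper's convention that ∇g is the transpose of the usual Jacobian. The linear model g(x) = z − Cx and all data are hypothetical; for a linear model a single pure step (α^k = 1) reaches the least squares minimizer, in line with (3):

```python
import numpy as np

# Gauss-Newton step (2) on a hypothetical linear model g(x) = z - C x.
rng = np.random.default_rng(1)
n, r = 3, 6
C = rng.standard_normal((r, n))
z = rng.standard_normal(r)
g = lambda x: z - C @ x      # residual function
grad_g = lambda x: -C.T      # n x r gradient matrix ∇g(x) (transpose of Jacobian)

def gauss_newton_step(x, alpha=1.0):
    G = grad_g(x)
    # x - α (∇g ∇g')^{-1} ∇g g(x); ∇g ∇g' assumed invertible as in the text
    return x - alpha * np.linalg.solve(G @ G.T, G @ g(x))

x1 = gauss_newton_step(np.zeros(n))
x_star = np.linalg.lstsq(C, z, rcond=None)[0]
assert np.allclose(x1, x_star)   # one pure step solves the linear problem
```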
In problems where there are many data blocks, the Gauss-Newton method may be ineffective because the size of the data set makes each iteration very costly. For such problems it may be much better to use an incremental method that does not wait to process the entire data set before updating x, as discussed in [Ber95]. Instead, the method cycles through the data blocks in sequence and updates the estimate of x after each data block is processed. A further advantage is that estimates of x become available as data is accumulated, making the approach suitable for real time operation.

*Received by the editors May 27, 1994; accepted for publication (in revised form) April 4, 1995. This research was supported by NSF grant 9300494-DMI.
†Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (dimitrib@mit.edu).

Downloaded 06/26/13 to 18.7.29.240. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
Such methods include the Widrow-Hoff least-mean-square (LMS) algorithm [WiH60], [WiS85] for the case where the data blocks are linear, and other steepest-descent-like methods for nonlinear data blocks that have been used extensively for the training of neural networks under the generic name of backpropagation methods.
A cycle through the data set of a typical example of such a method starts with a vector x^k and generates x^{k+1} according to

x^{k+1} = ψ_m,

where ψ_m is obtained at the last step of the recursion

ψ_i = ψ_{i−1} − α^k ∇g_i(ψ_{i−1}) g_i(ψ_{i−1}),   i = 1, ..., m,

where α^k is a positive stepsize and ψ_0 = x^k.
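One cycle of this recursion can be sketched as follows, again with hypothetical linear data blocks g_i(x) = z_i − C_i x; the factor of 2 in ∇||g_i||^2 = 2∇g_i g_i is absorbed into the stepsize:

```python
import numpy as np

# Incremental steepest-descent (backpropagation-style) cycles on hypothetical
# linear data blocks, with a diminishing stepsize α^k = O(1/k).
rng = np.random.default_rng(2)
m, n = 20, 3
blocks = [(rng.standard_normal(2), rng.standard_normal((2, n))) for _ in range(m)]

def incremental_cycle(x, alpha):
    psi = x.copy()                      # ψ_0 = x^k
    for z, C in blocks:                 # i = 1, ..., m
        grad = -C.T @ (z - C @ psi)     # ∇g_i(ψ_{i-1}) g_i(ψ_{i-1})
        psi = psi - alpha * grad
    return psi                          # x^{k+1} = ψ_m

def cost(x):
    return sum(np.linalg.norm(z - C @ x) ** 2 for z, C in blocks)

x = np.zeros(n)
for k in range(200):
    x = incremental_cycle(x, alpha=0.01 / (1 + k))   # diminishing stepsize
```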
Backpropagation methods are often effective, and they are supported by stochastic [PoT73], [Lju77], [KuC78], [Pol87], [BeT89], [Whi89a], [Whi89b], [Gai93], [BeT96] as well as deterministic convergence analyses [Luo91], [Gri93], [LuT93], [MaS94], [Man93], [BeT96]. The main difference between stochastic and deterministic methods of analysis is that the former apply to an infinite data set (one with an infinite number of data blocks) satisfying some statistical assumptions, while the latter apply to a finite data set.
There are also parallel asynchronous versions of backpropagation methods and corresponding stochastic [Tsi84], [TBA86], [BeT89], [Gai93] as well as deterministic convergence results [Tsi84], [TBA86], [BeT89], [MaS94].
However, backpropagation methods typically have a slow convergence rate not only because they are first-order steepest-descent-like methods, but also because they require a diminishing stepsize α^k = O(1/k) for convergence. If α^k is instead taken to be a small constant, an oscillation within each data cycle typically arises, as shown in [Luo91].
In this paper we focus on methods that combine the advantages of backpropagation methods for large data sets with the often superior convergence rate of the Gauss-Newton method. We thus consider an incremental version of the Gauss-Newton method, which operates in cycles through the data blocks. The (k + 1)st cycle starts with a vector x^k and a positive semidefinite matrix H^k to be defined later, then updates x via a Gauss-Newton-like iteration aimed at minimizing

λ(x − x^k)'H^k(x − x^k) + ||g_1(x)||^2,

where λ is a scalar with 0 < λ ≤ 1, then updates x via a Gauss-Newton-like iteration aimed at minimizing

λ^2(x − x^k)'H^k(x − x^k) + λ||g_1(x)||^2 + ||g_2(x)||^2,

and similarly continues, with the ith step consisting of a Gauss-Newton-like iteration aimed at minimizing the weighted partial sum

λ^i(x − x^k)'H^k(x − x^k) + Σ_{j=1}^i λ^{i−j} ||g_j(x)||^2.

In particular, given x^k, the (k + 1)st cycle sequentially generates the vectors

(4)   ψ_i = arg min_{x ∈ ℝ^n} { λ^i(x − x^k)'H^k(x − x^k) + Σ_{j=1}^i λ^{i−j} ||g̃_j(x, ψ_{j−1})||^2 },   i = 1, ..., m,

and sets

(5)   x^{k+1} = ψ_m,

where g̃_j(x, ψ_{j−1}) are the linearized functions

(6)   g̃_j(x, ψ_{j−1}) = g_j(ψ_{j−1}) + ∇g_j(ψ_{j−1})'(x − ψ_{j−1}),

and ψ_0 is the estimate of x at the end of the kth cycle:

(7)   ψ_0 = x^k.

As will be seen later, the quadratic minimizations above can be efficiently implemented using the recursive Kalman filter formulas.
The most common version of the preceding algorithm is obtained when the matrices H^k are updated by the recursion

(8)   H^{k+1} = λ^m H^k + Σ_{j=1}^m λ^{m−j} ∇g_j(ψ_{j−1}) ∇g_j(ψ_{j−1})'.

Then for λ = 1 and H^0 = 0, the method becomes the well-known extended Kalman filter (EKF for short) specialized to the case where the state of the underlying dynamical system stays constant and the measurement equation is nonlinear.
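A sketch of the cycle (4)-(8) for λ = 1, using the recursive Kalman filter implementation promised in the text. The measurement model z_i = tanh(C_i x) and all data are hypothetical, chosen only to make the blocks nonlinear; H^0 is taken as a tiny multiple of the identity so that H_1 is invertible:

```python
import numpy as np

# One EKF cycle per (4)-(8): each linearized block enters through a
# Kalman-filter update. Hypothetical noiseless data z_i = tanh(C_i x_true).
rng = np.random.default_rng(3)
m, n = 10, 2
x_true = np.array([0.3, -0.2])
Cs = [rng.standard_normal((2, n)) for _ in range(m)]
zs = [np.tanh(C @ x_true) for C in Cs]

def g(i, x):                             # data block g_i(x)
    return zs[i] - np.tanh(Cs[i] @ x)

def jac(i, x):                           # r_i x n Jacobian, i.e., ∇g_i(x)'
    return -(1.0 - np.tanh(Cs[i] @ x) ** 2)[:, None] * Cs[i]

def cost(x):
    return sum(np.linalg.norm(g(i, x)) ** 2 for i in range(m))

def ekf_cycle(x, H, lam=1.0):
    psi = x.copy()                       # ψ_0 = x^k, cf. eq. (7)
    for i in range(m):
        J = jac(i, psi)                  # linearization (6) at ψ_{i-1}
        H = lam * H + J.T @ J            # accumulates H^{k+1}, cf. eq. (8)
        # Kalman update for (4): ψ_i = ψ_{i-1} - H_i^{-1} ∇g_i(ψ_{i-1}) g_i(ψ_{i-1})
        psi = psi - np.linalg.solve(H, J.T @ g(i, psi))
    return psi, H                        # x^{k+1} = ψ_m, cf. eq. (5)

x, H = np.zeros(n), 1e-10 * np.eye(n)    # tiny H^0 ≈ 0 keeps H_1 invertible
for _ in range(50):
    x, H = ekf_cycle(x, H)
```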
The EKF was originally conceived as a method for estimating parameters from nonlinear measurements that are generated in real time. The basic idea of the method is to linearize each new measurement around the current value of the estimate and treat the measurement as if it were linear (cf. eq. (4)). The estimate is then corrected to account for the new (linearized) measurement using the convenient Kalman filter formulas (see Lemma 1). The algorithm considered here cycles repeatedly through the data set and is sometimes called the iterated extended Kalman filter.
For the problem of estimating the state of a dynamic system, a cycle through the data set involves solving a problem of smoothing the estimate of the state trajectory before starting a new cycle (see, e.g., [Bel94]). The matrix H^k has the meaning of the inverse of an approximate error covariance of the estimate x^k.
In the case λ < 1, the effect of old data blocks is discounted, and successive estimates produced by the method tend to change more rapidly. In this way one may obtain a faster rate of progress of the method, and this is the main motivation for considering λ < 1.
The EKF has been used extensively in a variety of control and estimation applications (see, e.g., [AWT69], [Jaz70], [Meh71], [THS77], [AnM79], [WeM80]) and has also been suggested for the training of neural networks (see, e.g., [WaT90] and [RRK92]). The version of the algorithm (4)-(8) with λ < 1 has also been proposed by Davidon [Dav76]. Unaware of the earlier work in the control and estimation literature, Davidon described the qualitative behavior of the method together with favorable computational experience for problems with large data sets, but gave no convergence

analysis. The first convergence analysis of the EKF was given by Ljung [Lju79], who, assuming λ = 1, used a stochastic formulation (i.e., an infinite data set) and the ODE approach of [Lju77] to prove satisfactory convergence properties for a version of the EKF that is closely related to the one considered here (Theorem 6.1 of [Lju79], which assumes a stationary measurement equation and additive noise). Ljung also showed that the EKF, when applied to more complex models where the underlying dynamic system is linear but its dynamics depend on x, exhibits complex behavior, including the possible convergence to biased estimates. For such models he suggested the use of a different formulation of the least squares problem involving the innovations process (see also [Urs80]).
The algorithms and analysis of the present paper apply to any type of deterministic least squares problem, and thus also apply to Ljung's innovations-based formulation.
A deterministic analysis of the EKF method (4)-(8), where λ < 1, was given in Pappas's Master's thesis [Pap82]. He considered only the special case where min_x ||g(x)|| = 0 and showed that the EKF converges locally to a nonsingular solution of the system g(x) = 0 at a rate that is linear with convergence ratio λ^m. He also argued by example that when λ < 1 and min_x ||g(x)|| > 0, the iterates ψ_i produced by the EKF within each cycle generally oscillate with a "size" of oscillation that diminishes as λ approaches 1.
The purpose of this paper is to provide a deterministic analysis of the convergence properties of the EKF for the general case where min_x ||g(x)|| is not necessarily zero. Our analysis is complicated by the lack of an explicit stepsize in the algorithm. In the case where λ = 1 we show that the limit points of the sequence {x^k} generated by the EKF are stationary points of the least squares problem. The idea of the proof is to show that the method involves an implicit stepsize of order O(1/k) and then to apply arguments similar to those used by Tsitsiklis [Tsi84] and Tsitsiklis, Bertsekas, and Athans [TBA86] in their analyses of asynchronous distributed gradient methods, and by Mangasarian and Solodov [MaS94] in their convergence proof of an asynchronous parallel backpropagation method.
To improve the rate of convergence of the method, which is sublinear and typically slow, we suggest a convergent and empirically faster variant where λ is initially less than 1 and is progressively increased toward 1.
In addition to dealing more naturally with the case of a finite data set, a nice aspect of the deterministic analysis is that it decouples the stochastic modeling of the data generation process from the algorithmic solution of the least squares problem. In other words, the EKF discussed here will (typically) find a least squares solution even if the least squares formulation is inappropriate for the corresponding real parameter estimation problem. This is a valuable insight because it is sometimes thought that convergence of the EKF depends on the validity of the underlying stochastic model assumptions.
2. The EKF. When the data blocks are linear functions, it takes a single pure Gauss-Newton iteration to find the least squares estimate. This iteration can be implemented as an incremental algorithm, the Kalman filter, which we now describe.

Assume that the functions g_i are linear and of the form

(9)   g_i(x) = z_i − C_i x,

where z_i ∈ ℝ^{r_i} are given vectors and C_i are given r_i × n matrices. Let us consider the incremental method that generates the vectors

(10)   ψ_i = arg min_{x ∈ ℝ^n} Σ_{j=1}^i λ^{i−j} ||z_j − C_j x||^2,   i = 1, ..., m.

Then the method can be recursively implemented, as shown by the following well-known proposition (see, e.g., [AnM79]).

PROPOSITION 1 (Kalman filter). Assuming that the matrix C_1'C_1 is positive definite, the least squares estimates

ψ_i = arg min_{x ∈ ℝ^n} Σ_{j=1}^i λ^{i−j} ||z_j − C_j x||^2,   i = 1, ..., m,

can be generated by the algorithm

(11)   ψ_i = ψ_{i−1} + H_i^{-1} C_i'(z_i − C_i ψ_{i−1}),   i = 1, ..., m,

where ψ_0 is an arbitrary vector, and the positive-definite matrices H_i are generated by

(12)   H_i = λH_{i−1} + C_i'C_i,   i = 1, ..., m,

with H_0 = 0. More generally, for all i and τ < i, we have

(13)   ψ_i = ψ_τ + H_i^{-1} Σ_{j=τ+1}^i λ^{i−j} C_j'(z_j − C_j ψ_τ).
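Proposition 1 is easy to check numerically. The following sketch uses hypothetical data with r_i = 4 ≥ n so that C_1'C_1 is positive definite, runs the recursion (11)-(12), and compares ψ_m against a direct weighted least squares solve:

```python
import numpy as np

# Recursion (11)-(12) versus a direct solve of min_x Σ_j λ^{m-j} ||z_j - C_j x||^2.
rng = np.random.default_rng(4)
m, n, lam = 6, 3, 0.9
Cs = [rng.standard_normal((4, n)) for _ in range(m)]   # r_i = 4 > n, C_1'C_1 pos. def.
zs = [rng.standard_normal(4) for _ in range(m)]

psi = rng.standard_normal(n)     # ψ_0 is arbitrary; it cancels at i = 1
H = np.zeros((n, n))             # H_0 = 0
for i in range(m):
    H = lam * H + Cs[i].T @ Cs[i]                                    # eq. (12)
    psi = psi + np.linalg.solve(H, Cs[i].T @ (zs[i] - Cs[i] @ psi))  # eq. (11)

# Direct weighted least squares via row scaling by sqrt(λ^{m-j})
A = np.vstack([lam ** ((m - 1 - j) / 2) * Cs[j] for j in range(m)])
b = np.concatenate([lam ** ((m - 1 - j) / 2) * zs[j] for j in range(m)])
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(psi, x_star)
```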
The proof of Proposition 1 is obtained by using the following lemma involving two data blocks, the straightforward proof of which is omitted.

LEMMA 1. Let ẑ_1, ẑ_2 be given vectors and F_1, F_2 be given matrices such that F_1'F_1 is positive definite. Then the vectors

(14)   ψ_1 = arg min_{x ∈ ℝ^n} ||ẑ_1 − F_1 x||^2

and

(15)   ψ_2 = arg min_{x ∈ ℝ^n} { ||ẑ_1 − F_1 x||^2 + ||ẑ_2 − F_2 x||^2 }

are also given by

(16)   ψ_1 = ψ_0 + (F_1'F_1)^{-1} F_1'(ẑ_1 − F_1 ψ_0)

and

(17)   ψ_2 = ψ_1 + (F_1'F_1 + F_2'F_2)^{-1} F_2'(ẑ_2 − F_2 ψ_1),

where ψ_0 is an arbitrary vector.
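Lemma 1 can likewise be verified numerically on hypothetical data: formulas (16)-(17) must reproduce the minimizers (14) and (15):

```python
import numpy as np

# Check that the incremental formulas (16)-(17) match direct least squares.
rng = np.random.default_rng(5)
n = 3
F1, F2 = rng.standard_normal((5, n)), rng.standard_normal((4, n))
zh1, zh2 = rng.standard_normal(5), rng.standard_normal(4)
psi0 = rng.standard_normal(n)            # ψ_0 arbitrary

# (16): ψ_1 = ψ_0 + (F1'F1)^{-1} F1'(ẑ_1 - F1 ψ_0)
psi1 = psi0 + np.linalg.solve(F1.T @ F1, F1.T @ (zh1 - F1 @ psi0))
# (17): ψ_2 = ψ_1 + (F1'F1 + F2'F2)^{-1} F2'(ẑ_2 - F2 ψ_1)
psi2 = psi1 + np.linalg.solve(F1.T @ F1 + F2.T @ F2, F2.T @ (zh2 - F2 @ psi1))

# Direct minimizers of (14) and (15)
d1 = np.linalg.lstsq(F1, zh1, rcond=None)[0]
d2 = np.linalg.lstsq(np.vstack([F1, F2]), np.concatenate([zh1, zh2]), rcond=None)[0]
assert np.allclose(psi1, d1) and np.allclose(psi2, d2)
```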
The proof of eqs. (12) and (13) of Proposition 1 follows by applying Lemma 1 with the correspondences ψ_1 ↔ ψ_τ, ψ_2 ↔ ψ_i, and the stacked, scaled data

(18)   ẑ_1 = (λ^{(i−1)/2} z_1', ..., λ^{(i−τ)/2} z_τ')',   F_1 = (λ^{(i−1)/2} C_1', ..., λ^{(i−τ)/2} C_τ')',
       ẑ_2 = (λ^{(i−τ−1)/2} z_{τ+1}', ..., z_i')',   F_2 = (λ^{(i−τ−1)/2} C_{τ+1}', ..., C_i')'.
