
Kalman Filter-based Algorithms for Estimating Depth from Image Sequences

TL;DR: A new, pixel-based (iconic) algorithm estimates depth and depth uncertainty at each pixel and incrementally refines these estimates over time; it can serve as a useful and general framework for low-level dynamic vision.
Abstract
Using known camera motion to estimate depth from image sequences is an important problem in robot vision. Many applications of depth-from-motion, including navigation and manipulation, require algorithms that can estimate depth in an on-line, incremental fashion. This requires a representation that records the uncertainty in depth estimates and a mechanism that integrates new measurements with existing depth estimates to reduce the uncertainty over time. Kalman filtering provides this mechanism. Previous applications of Kalman filtering to depth-from-motion have been limited to estimating depth at the location of a sparse set of features. In this paper, we introduce a new, pixel-based (iconic) algorithm that estimates depth and depth uncertainty at each pixel and incrementally refines these estimates over time. We describe the algorithm and contrast its formulation and performance to that of a feature-based Kalman filtering algorithm. We compare the performance of the two approaches by analyzing their theoretical convergence rates, by conducting quantitative experiments with images of a flat poster, and by conducting qualitative experiments with images of a realistic outdoor-scene model. The results show that the new method is an effective way to extract depth from lateral camera translations. This approach can be extended to incorporate general motion and to integrate other sources of information, such as stereo. The algorithms we have developed, which combine Kalman filtering with iconic descriptions of depth, therefore can serve as a useful and general framework for low-level dynamic vision.



International Journal of Computer Vision, 3, 209-236 (1989)
© 1989 Kluwer Academic Publishers. Manufactured in The Netherlands.
Kalman Filter-based Algorithms for Estimating Depth from Image Sequences
LARRY MATTHIES AND TAKEO KANADE
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213;
Schlumberger Palo Alto Research, 3340 Hillview Ave., Palo Alto, CA 94304
RICHARD SZELISKI
Digital Equipment Corporation, 1 Kendall Square, Building 700, Cambridge, MA 02139
1 Introduction
Using known camera motion to estimate depth from image sequences is important in many applications of computer vision to robot navigation and manipulation. In these applications, depth-from-motion can be used by itself, as part of a multimodal sensing strategy, or as a way to guide stereo matching. Many applications require a depth estimation algorithm that operates in an on-line, incremental fashion. To develop such an algorithm, we require a depth representation that includes not only the current depth estimate, but also an estimate of the uncertainty in the current depth estimate.
Previous work [3, 5, 9, 10, 16, 17, 25] has identified Kalman filtering as a viable framework for this problem, because it incorporates representations of uncertainty and provides a mechanism for incrementally reducing uncertainty over time. To date, applications of this framework have largely been restricted to estimating the positions of a sparse set of trackable features, such as points or line segments. While this is adequate for many robotics applications, it requires reliable feature extraction and it fails to describe large areas of the image. Another line of work has addressed the problem of extracting dense displacement or depth estimates from image sequences. However, these previous approaches have either been restricted to two-frame analysis [1] or have used batch processing of the image sequence, for example via spatiotemporal filtering [11].
In this paper we introduce a new, pixel-based (iconic) approach to incremental depth estimation and compare it mathematically and experimentally to a feature-based approach we developed previously [16]. The new approach represents depth and depth variance at every pixel and uses Kalman filtering to extrapolate and update the pixel-based depth representation. The algorithm uses correlation to measure the optical flow and to estimate the variance in the flow, then uses the known camera motion to convert the flow field into a depth map. It then uses the Kalman filter to generate an updated depth map from a weighted combination of the new measurements and the prior depth estimates. Regularization is employed to smooth the depth map

and to fill in the underconstrained areas. The resulting algorithm is parallel, uniform, and can take advantage of mesh-connected or multiresolution (pyramidal) processing architectures.
The remainder of this paper is structured as follows. In the next section, we give a brief review of Kalman filtering and introduce our overall approach to Kalman filtering of depth. Next, we review the equations of motion, present a simple camera model, and examine the potential accuracy of the method by analyzing its sensitivity to the direction of camera motion. We then describe our new, pixel-based depth-from-motion algorithm and review the formulation of the feature-based algorithm. Next, we analyze the theoretical accuracy of both methods, compare them both to the theoretical accuracy of stereo matching, and verify this analysis experimentally using images of a flat scene. We then show the performance of both methods on images of realistic outdoor scene models. In the final section, we discuss the promise and the problems involved in extending the method to arbitrary motion. We also conclude that the ideas and results presented apply directly to the much broader problem of integrating depth information from multiple sources.
2 Estimation Framework
The depth-from-motion algorithms described in this paper use image sequences with small frame-to-frame camera motion [4]. Small motion minimizes the correspondence problem between successive images, but sacrifices depth resolution because of the small baseline between consecutive image pairs. This problem can be overcome by integrating information over the course of the image sequence. For many applications, it is desirable to process the images incrementally by generating updated depth estimates after each new image is acquired, instead of processing many images together in a batch. The incremental approach offers real-time operation and requires less storage, since only the current estimates of depth and depth uncertainty need to be stored.

The Kalman filter is a powerful technique for doing incremental, real-time estimation in dynamic systems. It allows for the integration of information over time and is robust with respect to both system and sensor noise. In this section, we first present the notation and the equations of the Kalman filter, along with a simple example. We then sketch the application of this framework to motion-sequence processing and discuss those parts of the framework that are common to both the iconic and the feature-based algorithms. The details of these algorithms are given in sections 4 and 5, respectively.
2.1. Kalman Filter
The Kalman filter is a Bayesian estimation technique used to track stochastic dynamic systems being observed with noisy sensors. The filter is based on three separate probabilistic models, as shown in table 1. The first model, the system model, describes the evolution over time of the current state vector $u_t$. The transition between states is characterized by the known transition matrix $\Phi_t$ and the addition of Gaussian noise with a covariance $Q_t$. The second model, the measurement (or sensor) model, relates the measurement vector $d_t$ to the current state through a measurement matrix $H_t$ and the addition of Gaussian noise with a covariance $R_t$. The third model, the prior model, describes the knowledge about the system state $\hat{u}_0$ and its covariance $P_0$ before the first measurement is taken. The sensor and process noise are assumed to be uncorrelated.
Table 1. Kalman filter equations.

Models:
  system model:         $u_t = \Phi_{t-1} u_{t-1} + \eta_t, \quad \eta_t \sim N(0, Q_t)$
  measurement model:    $d_t = H_t u_t + \xi_t, \quad \xi_t \sim N(0, R_t)$
  prior model:          $E[u_0] = \hat{u}_0, \quad \mathrm{Cov}[u_0] = P_0$
  (other assumptions):  $E[\eta_t \xi_t^T] = 0$

Prediction phase:
  state estimate extrapolation:    $\hat{u}_t^- = \Phi_{t-1} \hat{u}_{t-1}^+$
  state covariance extrapolation:  $P_t^- = \Phi_{t-1} P_{t-1}^+ \Phi_{t-1}^T + Q_{t-1}$

Update phase:
  state estimate update:    $\hat{u}_t^+ = \hat{u}_t^- + K_t [d_t - H_t \hat{u}_t^-]$
  state covariance update:  $P_t^+ = [I - K_t H_t] P_t^-$
  Kalman gain matrix:       $K_t = P_t^- H_t^T [H_t P_t^- H_t^T + R_t]^{-1}$

For the ball-tracking example, the state vector consists of the ball's 3D position and velocity together with a trailing constant 1 that injects gravity into the dynamics, and the system and measurement matrices are

$$\Phi_t = \begin{bmatrix} 1 & 0 & 0 & \Delta t & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & \Delta t & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & \Delta t & 0 \\ 0 & 0 & 0 & 1-\beta & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1-\beta & 0 & -g\Delta t \\ 0 & 0 & 0 & 0 & 0 & 1-\beta & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

$$H_t = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
which maps the state $u$ to the measurement $d$. The uncertainty in the sensed ball position can be modeled by a $2 \times 2$ covariance matrix $R_t$.
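As a concrete illustration, these model matrices can be built in a few lines of NumPy. This is a sketch, not code from the paper; the values of $\Delta t$, the drag coefficient $\beta$, $g$, and the entries of $R_t$ are assumed:

```python
import numpy as np

dt, beta, g = 1.0 / 30.0, 0.05, 9.8   # illustrative constants, not from the paper

# State: [x, y, z, vx, vy, vz, 1]; the trailing constant 1 injects gravity.
Phi = np.eye(7)
Phi[0:3, 3:6] = dt * np.eye(3)            # positions integrate velocities
Phi[3:6, 3:6] = (1.0 - beta) * np.eye(3)  # velocities decay with drag beta
Phi[4, 6] = -g * dt                       # gravity pulls on the vertical velocity

# Only the 2D sensed ball position (x, y) is observed.
H = np.zeros((2, 7))
H[0, 0] = H[1, 1] = 1.0

R = np.diag([0.01, 0.01])  # 2x2 measurement noise covariance (assumed values)
```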
Once the system, measurement, and prior models have been specified (i.e., the upper third of table 1), the Kalman filter algorithm follows from the formulation in the lower two thirds of table 1. The algorithm operates in two phases: extrapolation (prediction) and update (correction). At time $t$, the previous state and covariance estimates, $\hat{u}_{t-1}^+$ and $P_{t-1}^+$, are extrapolated to predict the current state $\hat{u}_t^-$ and covariance $P_t^-$. The predicted covariance is used to compute the new Kalman gain matrix $K_t$ and the updated covariance matrix $P_t^+$. Finally, the measurement residual $d_t - H_t\hat{u}_t^-$ is weighted by the gain matrix $K_t$ and added to the predicted state $\hat{u}_t^-$ to yield the updated state $\hat{u}_t^+$. A block diagram for the Kalman filter is given in figure 1.
Fig. 1. Kalman filter block diagram.
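The two phases of table 1 translate directly into code. A minimal sketch (the function and variable names are ours, not the paper's):

```python
import numpy as np

def kalman_predict(u, P, Phi, Q):
    """Extrapolate the state estimate and covariance (prediction phase)."""
    u_pred = Phi @ u
    P_pred = Phi @ P @ Phi.T + Q
    return u_pred, P_pred

def kalman_update(u_pred, P_pred, d, H, R):
    """Correct the prediction with measurement d (update phase)."""
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain matrix
    u_new = u_pred + K @ (d - H @ u_pred)   # weighted measurement residual
    P_new = (np.eye(len(u_pred)) - K @ H) @ P_pred
    return u_new, P_new
```

One predict/update cycle runs per measurement; for the ball example above, `d` would be the sensed image position at each frame.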
2.2. Application to Depth from Motion
To apply the Kalman filter estimation framework to the depth-from-motion problem, we specialize each of the three models (system, measurement, and prior) and define the implementations of the extrapolation and update stages. This section briefly previews how these components are chosen for the two depth-from-motion algorithms described in this paper. The details of the implementation are left to sections 4 and 5.
The first step in designing a Kalman filter is to specify the elements of the state vector. The iconic depth-from-motion algorithm estimates the depth at each pixel in the current image, so the state vector in this case is the entire depth map.¹ Thus, the diagonal elements of the state covariance matrix $P_t$ are the variances of the depth estimates at each pixel. As discussed shortly, we implicitly use off-diagonal elements of the inverse covariance matrix $P_t^{-1}$ as part of the update stage of the filter, but do not explicitly model them anywhere in the algorithm because of the large size of the matrix. For the feature-based approach, which tracks edge elements through the image sequence, the state consists of a 3D position vector for each feature. We model the full covariance matrix of each individual feature, but treat separate features as independent.
The system model in both approaches is based on the same motion equations (section 3.1), but the implementations of the extrapolation and update stages differ because of the differences in the underlying representations. For the iconic method, the extrapolation stage uses the depth map estimated for the current frame, together with knowledge of the camera motion, to predict the depth and depth variance at each pixel in the next frame. Similarly, the update stage uses measurements of depth at each pixel to update the depth and variance estimates at each pixel. For the feature-based method, the extrapolation stage predicts the position vector and covariance matrix of each feature for the next image, then uses measurements of the image coordinates of the feature to update the position vector and the covariance matrix. Details of the measurement models for each algorithm will be discussed later.
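Since the iconic method retains only the diagonal of $P_t$, its update stage reduces to an independent scalar Kalman filter at each pixel. A minimal sketch of that per-pixel combination (array names are ours):

```python
import numpy as np

def iconic_update(d_pred, var_pred, d_meas, var_meas):
    """Scalar Kalman update applied at every pixel of a disparity map.

    d_pred, var_pred: predicted disparity map and its variance map.
    d_meas, var_meas: new disparity measurements and their variances.
    All arguments are 2D arrays of the same shape.
    """
    gain = var_pred / (var_pred + var_meas)    # per-pixel Kalman gain
    d_new = d_pred + gain * (d_meas - d_pred)  # weighted combination
    var_new = (1.0 - gain) * var_pred          # uncertainty shrinks over time
    return d_new, var_new
```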
Finally, the prior model can be used to embed prior knowledge about the scene. For the iconic method, for example, smoothness constraints requiring nearby image points to have similar disparity can be modeled easily by off-diagonal elements of the inverse of the prior covariance matrix $P_0$ [29]. Our algorithm incorporates this knowledge as part of a smoothing operation that follows the state update stage. Similar concepts may be applicable to modeling figural continuity [20, 24] in the edge-tracking approach, that is, the constraint that connected edges must match connected edges; however, we have not pursued this possibility.

¹Our actual implementation uses inverse depth (called "disparity"); see section 4.
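As an illustration of how such smoothness terms enter $P_0^{-1}$ (a sketch of the idea, not the paper's implementation), consider a 1D disparity signal with first-difference penalties; `weight` is an assumed parameter:

```python
import numpy as np

def smoothness_precision(n, weight):
    """Inverse prior covariance encoding weight * (d[i+1] - d[i])^2 penalties.

    Each first-difference term adds to two diagonal entries and two
    off-diagonal entries of the precision (inverse covariance) matrix,
    coupling neighboring disparity estimates.
    """
    P_inv = np.zeros((n, n))
    for i in range(n - 1):
        P_inv[i, i] += weight
        P_inv[i + 1, i + 1] += weight
        P_inv[i, i + 1] -= weight
        P_inv[i + 1, i] -= weight
    return P_inv
```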
3 Motion Equations and Camera Model
Our system and measurement models are based on the equations relating scene depth and camera motion to the induced image flow. In this section, we review these equations for an idealized camera (focal length = 1) and show how to use a simple calibration model to relate the idealized equations to real cameras. We also derive an expression for the relative uncertainty in depth estimates obtained from lateral versus forward camera translation. This expression shows concretely the effects of camera motion on depth uncertainty and reinforces the need for modeling the uncertainty in computed depth.
3.1. Equations of Motion
If the inter-frame camera motion is sufficiently small, the resulting optical flow can be expressed to a good approximation in terms of the instantaneous camera velocity [6, 13, 33]. We will specify this in terms of a translational velocity $T$ and an angular velocity $R$. In the camera coordinate frame (figure 2), the motion of a 3D point $P$ is described by the equation

$$\frac{dP}{dt} = -T - R \times P$$

Expanding this into components yields

$$\frac{dX}{dt} = -T_x - R_y Z + R_z Y$$
$$\frac{dY}{dt} = -T_y - R_z X + R_x Z \qquad [1]$$
$$\frac{dZ}{dt} = -T_z - R_x Y + R_y X$$

Now, projecting $(X, Y, Z)$ onto an ideal, unit focal length image,

$$x = \frac{X}{Z}, \qquad y = \frac{Y}{Z}$$

taking the derivatives of $(x, y)$ with respect to time, and substituting in from equation (1) leads to the familiar equations of optical flow [33]:

$$\Delta x = \frac{-T_x + x T_z}{Z} + x y R_x - (1 + x^2) R_y + y R_z$$
$$\Delta y = \frac{-T_y + y T_z}{Z} + (1 + y^2) R_x - x y R_y - x R_z \qquad [2]$$
These equations relate the depth $Z$ of the point to the camera motion $T$, $R$ and the induced image displacements or optical flow $[\Delta x\ \Delta y]^T$. We will use these equations to measure depth, given the camera motion and optical flow, and to predict the change in the depth map between frames. Note that parameterizing (2) in terms of the inverse depth $d = 1/Z$ makes the equations linear in the "depth" variable. Since this leads to a simpler estimation formulation, we will use this parameterization in the balance of the paper.
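Equation (2), written with the inverse depth $d$, evaluates directly to code; the linearity in $d$ is visible in the two translational terms. A sketch (the function name is ours):

```python
def flow_from_motion(x, y, d, T, R):
    """Image flow at ideal coordinates (x, y) per equation (2).

    d is the inverse depth 1/Z; T = (Tx, Ty, Tz) is the translational
    velocity and R = (Rx, Ry, Rz) the angular velocity.
    """
    Tx, Ty, Tz = T
    Rx, Ry, Rz = R
    dx = (-Tx + x * Tz) * d + x * y * Rx - (1 + x ** 2) * Ry + y * Rz
    dy = (-Ty + y * Tz) * d + (1 + y ** 2) * Rx - x * y * Ry - x * Rz
    return dx, dy
```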
Fig. 2. Camera model. CP is the center of projection.
3.2. Camera Model
Relating the ideal flow equations to real measurements requires a camera model. If optical distortions are not severe, a pin-hole camera model will suffice. In this paper we adopt a model similar to that originated by Sobel [n] (figure 2). This model specifies the origin $(c_x, c_y)$ of the image coordinate system and a pair of scale factors $(s_x, s_y)$ that combine the focal length and image aspect ratio. Denoting the actual image coordinates with a subscript $a$, the projection onto the actual image is summarized by the equation

$$\begin{bmatrix} x_a \\ y_a \\ 1 \end{bmatrix} = \frac{CP}{Z}, \qquad C = \begin{bmatrix} s_x & 0 & c_x \\ 0 & s_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad [3]$$

$C$ is known as the collimation matrix. Thus, the ideal image coordinates $(x, y)$ are related to the actual image coordinates by

$$x_a = s_x x + c_x$$
$$y_a = s_y y + c_y$$

Equations in the balance of the paper will primarily use ideal image coordinates for clarity. These equations can be re-expressed in terms of actual coordinates using the transformations above.
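The resulting conversions between ideal and actual coordinates are one-liners; a sketch using the model parameters above (function names are ours):

```python
def ideal_to_actual(x, y, sx, sy, cx, cy):
    """Apply the collimation model: ideal -> actual image coordinates."""
    return sx * x + cx, sy * y + cy

def actual_to_ideal(xa, ya, sx, sy, cx, cy):
    """Invert the collimation model: actual -> ideal image coordinates."""
    return (xa - cx) / sx, (ya - cy) / sy
```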
3.3. Sensitivity Analysis
Before describing our Kalman filter algorithms, we will analyze the effect of different camera motions on the uncertainty in depth estimates. Given specific descriptions of real cameras and scenes, we can obtain bounds on the estimation accuracy of depth-from-motion algorithms using perturbation or covariance analysis techniques based on first-order Taylor expansions [8]. For example, if we solve the motion equations for the inverse depth $d$ in terms of the optical flow, camera motion, and camera model,

$$d = F(\Delta x, \Delta y, T, R, c_x, c_y, s_x, s_y) \qquad [4]$$

then the uncertainty in depth arising from uncertainty in flow, motion, and calibration can be expressed by

$$\delta d = J_f\,\delta f + J_m\,\delta m + J_c\,\delta c$$

where $J_f$, $J_m$, and $J_c$ are the Jacobians of (4) with respect to the flow, motion, and calibration parameters, respectively, and $\delta f$, $\delta m$, and $\delta c$ are perturbations of the respective parameters. We will use this methodology to draw some conclusions about the relative accuracy of depth estimates obtained from different classes of motion.
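This first-order propagation can be sketched numerically: approximate the Jacobian of $F$ by finite differences and propagate the parameter covariance. Here `F`, the parameter ordering, and `eps` are our assumptions for illustration:

```python
import numpy as np

def propagate_variance(F, params, cov, eps=1e-6):
    """First-order variance of the scalar d = F(params).

    Estimates the Jacobian J of F by central differences, then returns
    var(d) ~= J @ cov @ J (a first-order Taylor expansion).
    """
    p = np.asarray(params, dtype=float)
    J = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = eps
        J[i] = (F(p + step) - F(p - step)) / (2.0 * eps)
    return float(J @ cov @ J)
```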
It is well known that camera rotation provides no depth information. Furthermore, for a translating camera, the accuracy of depth estimates increases with increasing distance of image features from the focus of expansion (FOE), the point in the image where the translation vector $T$ pierces the image. This implies that the 'best' translations are parallel to the image plane and that the 'worst' are forward along the camera axis. We will give a short derivation that demonstrates the relative accuracy obtainable from forward and lateral camera translation. The effects of measurement uncertainty on depth-from-motion calculations are also examined in [26].
For clarity, we consider only one-dimensional flow induced by translation along the $X$ or $Z$ axes. For an ideal camera, lateral motion induces the flow

$$\Delta x_l = \frac{-T_x}{Z} \qquad [6]$$

whereas forward motion induces the flow

$$\Delta x_f = \frac{x T_z}{Z} \qquad [7]$$

The inverse depth (or disparity) in each case is

$$d_l = \frac{-\Delta x_l}{T_x}, \qquad d_f = \frac{\Delta x_f}{x T_z}$$

Therefore, perturbations of $\delta x_l$ and $\delta x_f$ in the flow measurements $\Delta x_l$ and $\Delta x_f$ yield the following perturbations in the disparity estimates:

$$\delta d_l = \frac{-\delta x_l}{T_x}, \qquad \delta d_f = \frac{\delta x_f}{x T_z} \qquad [8]$$

These equations give the error in the inverse depth as a function of the error in the measured image displacement, the amount of camera motion, and the position of the feature in the field of view. Since we are interested in comparing forward and lateral motions, a good way to visualize these equations is to plot the relative depth uncertainty, $\delta d_f/\delta d_l$. Assuming that the flow perturbations $\delta x_l$ and $\delta x_f$ are equal, the relative uncertainty is

$$\frac{\delta d_f}{\delta d_l} = \frac{T_x}{x T_z}$$

The image coordinate $x$ indicates where the object appears in the field of view. Figure 3 shows that $x$ equals the tangent of the angle $\theta$ between the object and the camera axis. The formula for the relative uncertainty is thus

$$\frac{\delta d_f}{\delta d_l} = \frac{T_x}{T_z \tan \theta} \qquad [9]$$
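Equation (9) in code, convenient for plotting the relative uncertainty across the field of view (a sketch; the function name is ours):

```python
import math

def relative_uncertainty(Tx, Tz, theta):
    """delta_d_f / delta_d_l = Tx / (Tz * tan(theta)), per equation (9).

    The ratio grows without bound as theta -> 0: near the focus of
    expansion, forward translation yields far less depth accuracy than
    an equal lateral translation.
    """
    return Tx / (Tz * math.tan(theta))
```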

Citations
Journal ArticleDOI

A taxonomy and evaluation of dense two-frame stereo correspondence algorithms

TL;DR: This paper has designed a stand-alone, flexible C++ implementation that enables the evaluation of individual components and that can easily be extended to include new algorithms.
Journal ArticleDOI

CONDENSATION—Conditional Density Propagation for Visual Tracking

TL;DR: The Condensation algorithm uses “factored sampling”, previously applied to the interpretation of static images, in which the probability distribution of possible interpretations is represented by a randomly generated set.
Journal ArticleDOI

Object tracking: A survey

TL;DR: The goal of this article is to review the state-of-the-art tracking methods, classify them into different categories, identify new trends, and discuss the important issues related to tracking, including the use of appropriate image features, selection of motion models, and detection of objects.
Book

Computer Vision: Algorithms and Applications

TL;DR: Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images and takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene.
Journal ArticleDOI

Shape and motion from image streams under orthography: a factorization method

TL;DR: In this paper, the singular value decomposition (SVD) technique is used to factor the measurement matrix into two matrices which represent object shape and camera rotation respectively, and two of the three translation components are computed in a preprocessing stage.
References
Journal ArticleDOI

A Computational Approach to Edge Detection

TL;DR: There is a natural uncertainty principle between detection and localization performance, which are the two main goals, and with this principle a single operator shape is derived which is optimal at any scale.
Journal ArticleDOI

Determining optical flow

TL;DR: In this paper, a method for finding the optical flow pattern is presented which assumes that the apparent velocity of the brightness pattern varies smoothly almost everywhere in the image, and an iterative implementation is shown which successfully computes the Optical Flow for a number of synthetic image sequences.
Book

Applied Optimal Estimation

Arthur Gelb
TL;DR: This is the first book on optimal estimation that places its major emphasis on practical applications, treating the subject more from an engineering than a mathematical orientation; the theory and practice of optimal estimation are presented.
BookDOI

Spacecraft attitude determination and control

TL;DR: In this paper, the first comprehensive presentation of data, theory, and practice in attitude analysis is presented, including orthographic globe projections to eliminate confusion in vector drawings and a presentation of new geometrical procedures for mission analysis and attitude accuracy studies which can eliminate many complex simulations.