206 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 19, NO. 3, MARCH 1997
A Paraperspective Factorization Method for Shape and Motion Recovery
Conrad J. Poelman and Takeo Kanade, Fellow, IEEE
Abstract—The factorization method, first developed by Tomasi and Kanade, recovers both the shape of an object and its motion from a sequence of images, using many images and tracking many feature points to obtain highly redundant feature position information. The method robustly processes the feature trajectory information using singular value decomposition (SVD), taking advantage of the linear algebraic properties of orthographic projection. However, an orthographic formulation limits the range of motions the method can accommodate. Paraperspective projection, first introduced by Ohta, is a projection model that closely approximates perspective projection by modeling several effects not modeled under orthographic projection, while retaining linear algebraic properties. Our paraperspective factorization method can be applied to a much wider range of motion scenarios, including image sequences containing motion toward the camera and aerial image sequences of terrain taken from a low-altitude airplane.

Index Terms—Motion analysis, shape recovery, factorization method, three-dimensional vision, image sequence analysis, singular value decomposition.
1 INTRODUCTION
Recovering the geometry of a scene and the motion of
the camera from a stream of images is an important
task in a variety of applications, including navigation, robotic manipulation, and aerial cartography. While this is possible in principle, traditional methods have failed to produce reliable results in many situations [2].
Tomasi and Kanade [13], [14] developed a robust and efficient method for accurately recovering the shape and motion of an object from a sequence of images, called the factorization method. It achieves its accuracy and robustness by applying a well-understood numerical computation, the singular value decomposition (SVD), to a large number of images and feature points, and by directly computing shape without computing the depth as an intermediate step. The method was tested on a variety of real and synthetic images, and was shown to perform well even for distant objects, where traditional triangulation-based approaches tend to perform poorly.
The Tomasi-Kanade factorization method, however, assumed an orthographic projection model. The applicability of the method is therefore limited to image sequences created from certain types of camera motions. The orthographic model contains no notion of the distance from the camera to the object. As a result, shape reconstruction from image sequences containing large translations toward or away from the camera often produces deformed object shapes, as the method tries to explain the size differences in the images by creating size differences in the object. The method also supplies no estimation of translation along the camera's optical axis, which limits its usefulness for certain tasks.
There exist several perspective approximations which capture more of the effects of perspective projection while remaining linear. Scaled orthographic projection, sometimes referred to as "weak perspective" [5], accounts for the scaling effect of an object as it moves towards and away from the camera. Paraperspective projection, first introduced by Ohta [6] and named by Aloimonos [1], accounts for the scaling effect as well as the different angle from which an object is viewed as it moves in a direction parallel to the image plane.
In this paper, we present a factorization method based on the paraperspective projection model. The paraperspective factorization method is still fast, and robust with respect to noise. It can be applied to a wider realm of situations than the original factorization method, such as sequences containing significant depth translation or containing objects close to the camera, and can be used in applications where it is important to recover the distance to the object in each image, such as navigation.
We begin by describing our camera and world reference frames and introduce the mathematical notation that we use. We review the original factorization method as defined in [13], presenting it in a slightly different manner in order to make its relation to the paraperspective method more apparent. We then present our paraperspective factorization method, followed by a description of a perspective refinement step. We conclude with the results of several experiments which demonstrate the practicality of our system.
2 PROBLEM DESCRIPTION
In a shape-from-motion problem, we are given a sequence of F images taken from a camera that is moving relative to an object. Assume for the time being that we locate P prominent feature points in the first image, and track these
————————————————
C.J. Poelman is with the Satellite Assessment Center (WSAT), USAF Phillips Laboratory, Albuquerque, NM 87117-5776. E-mail: poelmanc@plk.af.mil.
T. Kanade is with the School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890. E-mail: tk@cs.cmu.edu.
Manuscript received June 15, 1994; revised Jan. 10, 1996. Recommended for acceptance by S. Peleg.
points from each image to the next, recording the coordinates $(u_{fp}, v_{fp})$ of each point p in each image f. Each feature point p that we track corresponds to a single world point, located at position $\mathbf{s}_p$ in some fixed world coordinate system. Each image f was taken at some camera orientation, which we describe by the orthonormal unit vectors $\mathbf{i}_f$, $\mathbf{j}_f$, and $\mathbf{k}_f$, where $\mathbf{i}_f$ and $\mathbf{j}_f$ correspond to the x and y axes of the camera's image plane, and $\mathbf{k}_f$ points along the camera's line of sight. We describe the position of the camera in each frame f by the vector $\mathbf{t}_f$ indicating the camera's focal point. This formulation is illustrated in Fig. 1.
Fig. 1. Coordinate system.
The result of the feature tracker is a set of P feature point coordinates $(u_{fp}, v_{fp})$ for each of the F frames of the image sequence. From this information, our goal is to estimate the shape of the object as $\hat{\mathbf{s}}_p$ for each object point, and the motion of the camera as $\hat{\mathbf{i}}_f$, $\hat{\mathbf{j}}_f$, $\hat{\mathbf{k}}_f$, and $\hat{\mathbf{t}}_f$ for each frame in the sequence.
3 THE ORTHOGRAPHIC FACTORIZATION METHOD
This section presents a summary of the orthographic factorization method developed by Tomasi and Kanade. A more detailed description of the method can be found in [13].
3.1 Orthographic Projection
The orthographic projection model assumes that rays are projected from an object point along the direction parallel to the camera's optical axis, so that they strike the image plane orthogonally, as illustrated in Fig. 2. A point p whose location is $\mathbf{s}_p$ will be observed in frame f at image coordinates $(u_{fp}, v_{fp})$, where

$$u_{fp} = \mathbf{i}_f \cdot (\mathbf{s}_p - \mathbf{t}_f) \qquad v_{fp} = \mathbf{j}_f \cdot (\mathbf{s}_p - \mathbf{t}_f) \qquad (1)$$

These equations can be rewritten as

$$u_{fp} = \mathbf{m}_f \cdot \mathbf{s}_p + x_f \qquad v_{fp} = \mathbf{n}_f \cdot \mathbf{s}_p + y_f \qquad (2)$$

where

$$x_f = -(\mathbf{t}_f \cdot \mathbf{i}_f) \qquad y_f = -(\mathbf{t}_f \cdot \mathbf{j}_f) \qquad (3)$$

$$\mathbf{m}_f = \mathbf{i}_f \qquad \mathbf{n}_f = \mathbf{j}_f \qquad (4)$$
Fig. 2. Orthographic projection in two dimensions. Dotted lines indicate
perspective projection.
3.2 Decomposition
All of the feature point coordinates $(u_{fp}, v_{fp})$ are entered in a $2F \times P$ measurement matrix W:

$$W = \begin{bmatrix} u_{11} & \cdots & u_{1P} \\ \vdots & & \vdots \\ u_{F1} & \cdots & u_{FP} \\ v_{11} & \cdots & v_{1P} \\ \vdots & & \vdots \\ v_{F1} & \cdots & v_{FP} \end{bmatrix} \qquad (5)$$
Each column of the measurement matrix contains the observations for a single point, while each row contains the observed u-coordinates or v-coordinates for a single frame. Equation (2) for all points and frames can now be combined into the single matrix equation

$$W = MS + T\,[1 \cdots 1] \qquad (6)$$

where M is the $2F \times 3$ motion matrix whose rows are the $\mathbf{m}_f$ and $\mathbf{n}_f$ vectors, S is the $3 \times P$ shape matrix whose columns are the $\mathbf{s}_p$ vectors, and T is the $2F \times 1$ translation vector whose elements are the $x_f$ and $y_f$.
Up to this point, Tomasi and Kanade placed no restrictions on the location of the world origin, except that it be stationary with respect to the object. Without loss of generality, they position the world origin at the center of mass of the object, denoted by $\mathbf{c}$, so that

$$\mathbf{c} = \frac{1}{P}\sum_{p=1}^{P}\mathbf{s}_p = \mathbf{0} \qquad (7)$$

Because the sum of any row of S is zero, the sum of any row i of W is $PT_i$. This enables them to compute the ith element of the translation vector T directly from W, simply by averaging the ith row of the measurement matrix. The translation is then subtracted from W, leaving a "registered" measurement matrix $W^* = W - T\,[1 \cdots 1]$. Because $W^*$ is the product of a $2F \times 3$ motion matrix M and a $3 \times P$ shape
matrix S, its rank is at most three. When noise is present in the input, $W^*$ will not be exactly of rank three, so the Tomasi-Kanade factorization method uses the SVD to find the best rank three approximation to $W^*$, factoring it into the product

$$W^* = \hat{M}\hat{S} \qquad (8)$$
3.3 Normalization
The decomposition of (8) is only determined up to a linear transformation. Any non-singular $3 \times 3$ matrix A and its inverse could be inserted between $\hat{M}$ and $\hat{S}$, and their product would still equal $W^*$. Thus the actual motion and shape are given by

$$M = \hat{M}A \qquad S = A^{-1}\hat{S} \qquad (9)$$

with the appropriate $3 \times 3$ invertible matrix A selected. The correct A can be determined using the fact that the rows of the motion matrix M (which are the $\mathbf{m}_f$ and $\mathbf{n}_f$ vectors) represent the camera axes, and therefore they must be of a certain form. Since $\mathbf{i}_f$ and $\mathbf{j}_f$ are unit vectors, we see from (4) that

$$|\mathbf{m}_f|^2 = 1 \qquad |\mathbf{n}_f|^2 = 1 \qquad (10)$$

and because they are orthogonal,

$$\mathbf{m}_f \cdot \mathbf{n}_f = 0 \qquad (11)$$

Equations (10) and (11) give us 3F equations which we call the metric constraints. Using these constraints, we solve for the $3 \times 3$ matrix A which, when multiplied by $\hat{M}$, produces the motion matrix M that best satisfies these constraints. Once the matrix A has been found, the shape and motion are computed from (9).
4 THE PARAPERSPECTIVE FACTORIZATION METHOD
The Tomasi-Kanade factorization method was shown to be computationally inexpensive and highly accurate, but its use of an orthographic projection assumption limited the method's applicability. For example, the method does not produce accurate results when there is significant translation along the camera's optical axis, because orthography does not account for the fact that an object appears larger when it is closer to the camera. We must model this and other perspective effects in order to successfully recover shape and motion in a wider range of situations. We choose an approximation to perspective projection known as paraperspective projection, which was introduced by Ohta et al. [6] in order to solve a shape from texture problem. Although the paraperspective projection equations are more complex than those for orthography, their basic form is the same, enabling us to develop a method analogous to that developed by Tomasi and Kanade.
4.1 Paraperspective Projection
Paraperspective projection closely approximates perspective projection by modeling both the scaling effect (closer objects appear larger than distant ones) and the position effect (objects in the periphery of the image are viewed from a different angle than those near the center of projection [1]) while retaining the linear properties of orthographic projection. Paraperspective projection is related to, but distinct from, the affine camera model, as described in Appendix A. The paraperspective projection of an object onto an image, illustrated in Fig. 3, involves two steps.
1) An object point is projected along the direction of the line connecting the focal point of the camera to the object's center of mass, onto a hypothetical image plane parallel to the real image plane and passing through the object's center of mass.
2) The point is then projected onto the real image plane using perspective projection. Because the hypothetical plane is parallel to the real image plane, this is equivalent to simply scaling the point coordinates by the ratio of the camera focal length and the distance between the two planes.¹
In general, the projection of a point $\mathbf{p}$ along direction $\mathbf{r}$, onto the plane with normal $\mathbf{n}$ and distance from the origin d, is given by the equation

$$\mathbf{p}' = \mathbf{p} - \frac{\mathbf{p}\cdot\mathbf{n} - d}{\mathbf{r}\cdot\mathbf{n}}\,\mathbf{r} \qquad (12)$$
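As a quick numerical check, (12) translates directly into a few lines of NumPy. The function name and the sample point, direction, and plane below are our own illustrative choices, not from the paper.

```python
import numpy as np

def project_along(p, r, n, d):
    """Project point p along direction r onto the plane {x : x . n = d},
    following the general projection equation (12)."""
    return p - ((np.dot(p, n) - d) / np.dot(r, n)) * r

# Example: project the point (1, 2, 5) along -z onto the plane z = 1;
# only the z-coordinate changes, giving (1, 2, 1).
p_proj = project_along(np.array([1.0, 2.0, 5.0]),
                       np.array([0.0, 0.0, -1.0]),
                       np.array([0.0, 0.0, 1.0]),
                       1.0)
```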
In frame f, each object point $\mathbf{s}_p$ is projected along the direction $\mathbf{c} - \mathbf{t}_f$ (which is the direction from the camera's focal point to the object's center of mass) onto the plane defined by normal $\mathbf{k}_f$ and distance from the origin $\mathbf{c}\cdot\mathbf{k}_f$. The result $\mathbf{s}'_{fp}$ of this projection is

$$\mathbf{s}'_{fp} = \mathbf{s}_p - \frac{\mathbf{s}_p\cdot\mathbf{k}_f - \mathbf{c}\cdot\mathbf{k}_f}{(\mathbf{c}-\mathbf{t}_f)\cdot\mathbf{k}_f}\,(\mathbf{c}-\mathbf{t}_f) \qquad (13)$$
The perspective projection of this point onto the image plane is given by subtracting $\mathbf{t}_f$ from $\mathbf{s}'_{fp}$ to give the position of the point in the camera's coordinate system, and then scaling the result by the ratio of the camera's focal length l to the depth to the object's center of mass $z_f$. Adjusting for the aspect ratio a and projection center $(o_x, o_y)$ yields the coordinates of the projection in the image plane,

$$u_{fp} = \frac{l}{z_f}\left(\mathbf{i}_f\cdot(\mathbf{s}'_{fp}-\mathbf{t}_f)\right) + o_x \qquad v_{fp} = \frac{la}{z_f}\left(\mathbf{j}_f\cdot(\mathbf{s}'_{fp}-\mathbf{t}_f)\right) + o_y$$
$$\text{where}\quad z_f = (\mathbf{c}-\mathbf{t}_f)\cdot\mathbf{k}_f \qquad (14)$$
Substituting (13) into (14) and simplifying gives the general paraperspective equations for $u_{fp}$ and $v_{fp}$:
1. The scaled orthographic projection model (also known as "weak perspective") is similar to paraperspective projection, except that the direction of the initial projection in Step 1 is parallel to the camera's optical axis rather than parallel to the line connecting the object's center of mass to the camera's focal point. This model captures the scaling effect of perspective projection, but not the position effect, as explained in Appendix B.

$$u_{fp} = \frac{l}{z_f}\left\{\left[\mathbf{i}_f - \frac{\mathbf{i}_f\cdot(\mathbf{c}-\mathbf{t}_f)}{z_f}\,\mathbf{k}_f\right]\cdot(\mathbf{s}_p-\mathbf{c}) + (\mathbf{c}-\mathbf{t}_f)\cdot\mathbf{i}_f\right\} + o_x$$
$$v_{fp} = \frac{la}{z_f}\left\{\left[\mathbf{j}_f - \frac{\mathbf{j}_f\cdot(\mathbf{c}-\mathbf{t}_f)}{z_f}\,\mathbf{k}_f\right]\cdot(\mathbf{s}_p-\mathbf{c}) + (\mathbf{c}-\mathbf{t}_f)\cdot\mathbf{j}_f\right\} + o_y \qquad (15)$$
We simplify these equations by assuming unit focal length, unit aspect ratio, and (0, 0) center of projection. This requires that the image coordinates $(u_{fp}, v_{fp})$ be adjusted to account for these camera parameters before commencing shape and motion recovery.
Fig. 3. Paraperspective projection in two dimensions. Dotted lines indicate perspective projection. → indicates parallel lines.
In [3] the factorization approach is extended to handle multiple objects moving separately, which requires each object to be projected based on its own mass center. However, since this paper addresses the single object case, we can further simplify our equations by placing the world origin at the object's center of mass so that by definition

$$\mathbf{c} = \frac{1}{P}\sum_{p=1}^{P}\mathbf{s}_p = \mathbf{0} \qquad (16)$$
This reduces (15) to

$$u_{fp} = \frac{1}{z_f}\left\{\left[\mathbf{i}_f + \frac{\mathbf{t}_f\cdot\mathbf{i}_f}{z_f}\,\mathbf{k}_f\right]\cdot\mathbf{s}_p - \mathbf{t}_f\cdot\mathbf{i}_f\right\}$$
$$v_{fp} = \frac{1}{z_f}\left\{\left[\mathbf{j}_f + \frac{\mathbf{t}_f\cdot\mathbf{j}_f}{z_f}\,\mathbf{k}_f\right]\cdot\mathbf{s}_p - \mathbf{t}_f\cdot\mathbf{j}_f\right\} \qquad (17)$$
These equations can be rewritten as

$$u_{fp} = \mathbf{m}_f\cdot\mathbf{s}_p + x_f \qquad v_{fp} = \mathbf{n}_f\cdot\mathbf{s}_p + y_f \qquad (18)$$

where

$$z_f = -\mathbf{t}_f\cdot\mathbf{k}_f \qquad (19)$$

$$x_f = -\frac{\mathbf{t}_f\cdot\mathbf{i}_f}{z_f} \qquad y_f = -\frac{\mathbf{t}_f\cdot\mathbf{j}_f}{z_f} \qquad (20)$$

$$\mathbf{m}_f = \frac{\mathbf{i}_f - x_f\mathbf{k}_f}{z_f} \qquad \mathbf{n}_f = \frac{\mathbf{j}_f - y_f\mathbf{k}_f}{z_f} \qquad (21)$$
Notice that (18) has a form identical to its counterpart for orthographic projection, (2), although the corresponding definitions of $x_f$, $y_f$, $\mathbf{m}_f$, and $\mathbf{n}_f$ differ. This enables us to perform the basic decomposition of the matrix in the same manner that Tomasi and Kanade did for orthographic projection.
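As a sanity check on this correspondence, the snippet below builds one camera pose, computes $x_f$, $y_f$, $\mathbf{m}_f$, $\mathbf{n}_f$ from (19)-(21), and verifies that the linear form (18) reproduces the two-step paraperspective projection of Section 4.1. The pose and points are arbitrary illustrative values, with the world origin at the object's center of mass and unit focal length:

```python
import numpy as np

# One camera pose: orthonormal axes i_f, j_f, k_f, with the focal point t_f
# placed well away from the object roughly along the line of sight.
theta = 0.3
i_f = np.array([np.cos(theta), 0.0, np.sin(theta)])
j_f = np.array([0.0, 1.0, 0.0])
k_f = np.cross(i_f, j_f)
t_f = -12.0 * k_f + np.array([0.2, -0.1, 0.3])

z_f = -np.dot(t_f, k_f)               # (19)
x_f = -np.dot(t_f, i_f) / z_f         # (20)
y_f = -np.dot(t_f, j_f) / z_f
m_f = (i_f - x_f * k_f) / z_f         # (21)
n_f = (j_f - y_f * k_f) / z_f

rng = np.random.default_rng(0)
for s_p in rng.normal(size=(5, 3)):
    # Two-step model: project onto the plane through the mass center (13),
    # then apply perspective projection (14), with c = 0 and l = 1.
    s_proj = s_p - (np.dot(s_p, k_f) / z_f) * (-t_f)
    u_two_step = np.dot(i_f, s_proj - t_f) / z_f
    v_two_step = np.dot(j_f, s_proj - t_f) / z_f
    # Linear form (18) must agree exactly.
    assert np.isclose(u_two_step, np.dot(m_f, s_p) + x_f)
    assert np.isclose(v_two_step, np.dot(n_f, s_p) + y_f)
```

With exact geometry the two forms agree to machine precision; the magnitude and dot-product relations of Section 4.3 follow from the same definitions.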
4.2 Paraperspective Decomposition
We can combine (18), for all points p from 1 to P, and all frames f from 1 to F, into the single matrix equation

$$\begin{bmatrix} u_{11} & \cdots & u_{1P} \\ \vdots & & \vdots \\ u_{F1} & \cdots & u_{FP} \\ v_{11} & \cdots & v_{1P} \\ \vdots & & \vdots \\ v_{F1} & \cdots & v_{FP} \end{bmatrix} = \begin{bmatrix} \mathbf{m}_1 \\ \vdots \\ \mathbf{m}_F \\ \mathbf{n}_1 \\ \vdots \\ \mathbf{n}_F \end{bmatrix}\begin{bmatrix} \mathbf{s}_1 & \cdots & \mathbf{s}_P \end{bmatrix} + \begin{bmatrix} x_1 \\ \vdots \\ x_F \\ y_1 \\ \vdots \\ y_F \end{bmatrix}[1 \cdots 1] \qquad (22)$$

or in short

$$W = MS + T\,[1 \cdots 1] \qquad (23)$$

where W is the $2F \times P$ measurement matrix, M is the $2F \times 3$ motion matrix, S is the $3 \times P$ shape matrix, and T is the $2F \times 1$ translation vector.
Using (16) and (18), we can write

$$\sum_{p=1}^{P} u_{fp} = \mathbf{m}_f\cdot\left(\sum_{p=1}^{P}\mathbf{s}_p\right) + P x_f = P x_f$$
$$\sum_{p=1}^{P} v_{fp} = \mathbf{n}_f\cdot\left(\sum_{p=1}^{P}\mathbf{s}_p\right) + P y_f = P y_f \qquad (24)$$

Therefore we can compute $x_f$ and $y_f$, which are the elements of the translation vector T, immediately from the image data as

$$x_f = \frac{1}{P}\sum_{p=1}^{P} u_{fp} \qquad y_f = \frac{1}{P}\sum_{p=1}^{P} v_{fp} \qquad (25)$$

Once we know the translation vector T, we subtract it from W, giving the registered measurement matrix

$$W^* = W - T\,[1 \cdots 1] = MS \qquad (26)$$

Since $W^*$ is the product of two matrices each of rank at most three, $W^*$ has rank at most three, just as it did in the orthographic projection case. If there is noise present, the rank of $W^*$ will not be exactly three, but by computing the SVD of

$W^*$ and only retaining the largest three singular values, we can factor it into

$$W^* = \hat{M}\hat{S} \qquad (27)$$

where $\hat{M}$ is a $2F \times 3$ matrix and $\hat{S}$ is a $3 \times P$ matrix. Using the SVD to perform this factorization guarantees that the product $\hat{M}\hat{S}$ is the best possible rank three approximation to $W^*$, in the sense that it minimizes the sum of squared differences between corresponding elements of $W^*$ and $\hat{M}\hat{S}$.
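Assuming the tracked coordinates are stacked into a $2F \times P$ NumPy array W as in (22), the registration and rank-three factorization of (25)-(27) can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def register_and_factor(W):
    """Subtract per-row means (25), register W (26), and use the SVD to form
    the best rank-three factorization W* ~ M_hat @ S_hat of (27)."""
    T = W.mean(axis=1)                       # x_f and y_f stacked, per (25)
    W_star = W - T[:, None]                  # registered measurement matrix
    U, s, Vt = np.linalg.svd(W_star, full_matrices=False)
    M_hat = U[:, :3] * s[:3]                 # 2F x 3
    S_hat = Vt[:3, :]                        # 3 x P
    return M_hat, S_hat, T

# Noise-free synthetic data of the form (23): after registration the matrix
# has rank at most three, so the factorization reproduces it exactly.
rng = np.random.default_rng(1)
F, P = 10, 20
M = rng.normal(size=(2 * F, 3))
S = rng.normal(size=(3, P))
S -= S.mean(axis=1, keepdims=True)           # mass center at origin, (16)
W = M @ S + rng.normal(size=(2 * F, 1))      # per-row translation term
M_hat, S_hat, T = register_and_factor(W)
```

With noisy input the same call returns the closest rank-three fit in the least-squares sense; the factors still differ from the true M and S by the invertible matrix A that the normalization step resolves.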
4.3 Paraperspective Normalization
Just as in the orthographic case, the decomposition of $W^*$ into the product of $\hat{M}$ and $\hat{S}$ by (27) is only determined up to a linear transformation matrix A. Again, we determine this matrix A by observing that the rows of the motion matrix M (the $\mathbf{m}_f$ and $\mathbf{n}_f$ vectors) must be of a certain form. Taking advantage of the fact that $\mathbf{i}_f$, $\mathbf{j}_f$, and $\mathbf{k}_f$ are unit vectors, from (21) we observe that

$$|\mathbf{m}_f|^2 = \frac{1+x_f^2}{z_f^2} \qquad |\mathbf{n}_f|^2 = \frac{1+y_f^2}{z_f^2} \qquad (28)$$
We know the values of $x_f$ and $y_f$ from our initial registration step, but we do not know the value of the depth $z_f$. Thus we cannot impose individual constraints on the magnitudes of $\mathbf{m}_f$ and $\mathbf{n}_f$ as was done in the orthographic factorization method. However, we can adopt the following constraint on the magnitudes of $\mathbf{m}_f$ and $\mathbf{n}_f$:

$$\frac{|\mathbf{m}_f|^2}{1+x_f^2} = \frac{|\mathbf{n}_f|^2}{1+y_f^2} \;\left(= \frac{1}{z_f^2}\right) \qquad (29)$$
(29)
In the case of orthographic projection, one constraint on
m
f
and
n
f
was that they each have unit magnitude, as re-
quired by (10). In the above paraperspective case, we sim-
ply require that their magnitudes be in a certain ratio.
There is also a constraint on the angle relationship of
m
f
and
n
f
. From (21), and the knowledge that
i
f
,
j
f
, and
k
f
are orthogonal unit vectors,
mn
ikjk
ff
fff
f
fff
f
ff
f
x
z
y
z
xy
z
◊=
-
-
=
2
(30)
The problem with this constraint is that, again, $z_f$ is unknown. We could use either of the two values given in (29) for $1/z_f^2$, but in the presence of noisy input data the two will not be exactly equal, so we use the average of the two quantities. We choose the arithmetic mean over the geometric mean or some other measure in order to keep the solution of these constraints linear. Thus our second constraint becomes

$$\mathbf{m}_f\cdot\mathbf{n}_f = x_f y_f\,\frac{1}{2}\left(\frac{|\mathbf{m}_f|^2}{1+x_f^2} + \frac{|\mathbf{n}_f|^2}{1+y_f^2}\right) \qquad (31)$$
This is the paraperspective version of the orthographic constraint given by (11), which required that the dot product of $\mathbf{m}_f$ and $\mathbf{n}_f$ be zero.

Equations (29) and (31) are homogeneous constraints, which could be trivially satisfied by the solution $\mathbf{m}_f = \mathbf{n}_f = \mathbf{0}$ for all f, or M = 0. To avoid this solution, we impose the additional constraint

$$|\mathbf{m}_1| = 1 \qquad (32)$$

This does not affect the final solution except by a scaling factor.
Equations (29), (31), and (32) give us 2F + 1 equations, which are the paraperspective version of the metric constraints. We compute the $3 \times 3$ matrix A such that $M = \hat{M}A$ best satisfies these metric constraints in the least sum-of-squares error sense. This is a simple problem because the constraints are linear in the six unique elements of the symmetric $3 \times 3$ matrix $Q = AA^T$. We use the metric constraints to compute Q, compute its Jacobi transformation $Q = L\Lambda L^T$, where $\Lambda$ is the diagonal eigenvalue matrix, and as long as Q is positive definite, $A = (L\Lambda^{1/2})^T$. A non-positive-definite Q indicates that unmodeled distortion has overwhelmed the third singular value of the measurement matrix, due possibly to noise, perspective effects, insufficient rotational motion, a planar object shape, or a combination of these effects.
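A minimal sketch of this step, under our reading of the constraints: each of (29), (31), and (32) is linear in the six unique elements of $Q = AA^T$, so we stack them into an overdetermined linear system, solve it by least squares, and factor Q by eigendecomposition. The code takes $A = L\Lambda^{1/2}$, one square root satisfying $AA^T = Q$; any other square root differs only by a rotation of the recovered coordinate frame. Function and variable names are our own.

```python
import numpy as np

def quad_coeffs(a, b):
    """Coefficients of the bilinear form a^T Q b in the six unique elements
    of the symmetric matrix Q, ordered (Q11, Q12, Q13, Q22, Q23, Q33)."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def solve_metric_constraints(M_hat, x, y):
    """Build the 2F+1 paraperspective metric constraints (29), (31), (32),
    solve for Q = A A^T in the least-squares sense, and recover A by
    eigendecomposition (requires Q to be positive definite)."""
    F = len(x)
    rows, rhs = [], []
    for f in range(F):
        m, n = M_hat[f], M_hat[F + f]            # u-row and v-row of frame f
        cm = quad_coeffs(m, m) / (1 + x[f] ** 2)
        cn = quad_coeffs(n, n) / (1 + y[f] ** 2)
        rows.append(cm - cn)                                          # (29)
        rhs.append(0.0)
        rows.append(quad_coeffs(m, n) - x[f] * y[f] * (cm + cn) / 2)  # (31)
        rhs.append(0.0)
    rows.append(quad_coeffs(M_hat[0], M_hat[0]))                      # (32)
    rhs.append(1.0)
    q = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    lam, L = np.linalg.eigh(Q)
    if np.any(lam <= 0):
        raise ValueError("Q is not positive definite")
    return L * np.sqrt(lam)                      # A with A @ A.T == Q
```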
4.4 Paraperspective Motion Recovery
Once the matrix A has been determined, we compute the shape matrix $S = A^{-1}\hat{S}$ and the motion matrix $M = \hat{M}A$. For each frame f, we now need to recover the camera orientation vectors $\hat{\mathbf{i}}_f$, $\hat{\mathbf{j}}_f$, and $\hat{\mathbf{k}}_f$ from the vectors $\mathbf{m}_f$ and $\mathbf{n}_f$, which are the rows of the matrix M. From (21) we see that

$$\hat{\mathbf{i}}_f = z_f\mathbf{m}_f + x_f\hat{\mathbf{k}}_f \qquad \hat{\mathbf{j}}_f = z_f\mathbf{n}_f + y_f\hat{\mathbf{k}}_f \qquad (33)$$
From this and the knowledge that $\hat{\mathbf{i}}_f$, $\hat{\mathbf{j}}_f$, and $\hat{\mathbf{k}}_f$ must be orthonormal, we determine that

$$\hat{\mathbf{i}}_f\times\hat{\mathbf{j}}_f = (z_f\mathbf{m}_f + x_f\hat{\mathbf{k}}_f)\times(z_f\mathbf{n}_f + y_f\hat{\mathbf{k}}_f) = \hat{\mathbf{k}}_f$$
$$|\hat{\mathbf{i}}_f| = |z_f\mathbf{m}_f + x_f\hat{\mathbf{k}}_f| = 1 \qquad |\hat{\mathbf{j}}_f| = |z_f\mathbf{n}_f + y_f\hat{\mathbf{k}}_f| = 1 \qquad (34)$$
Again, we do not know a value for $z_f$, but using the relations specified in (29) and the additional knowledge that $|\hat{\mathbf{k}}_f| = 1$, (34) can be reduced to

$$G_f\hat{\mathbf{k}}_f = H_f \qquad (35)$$

where

$$G_f = \begin{bmatrix} (\tilde{\mathbf{m}}_f\times\tilde{\mathbf{n}}_f)^T \\ \tilde{\mathbf{m}}_f^T \\ \tilde{\mathbf{n}}_f^T \end{bmatrix} \qquad H_f = \begin{bmatrix} 1 \\ -x_f \\ -y_f \end{bmatrix} \qquad (36)$$

References
B.D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," Proc. Seventh Int'l Joint Conf. Artificial Intelligence, 1981.
W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge Univ. Press.
W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in FORTRAN: The Art of Scientific Computing, second ed. Cambridge Univ. Press, 1992.