International Journal of Computer Vision 56(3), 179–194, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Twist Based Acquisition and Tracking of Animal and Human Kinematics
CHRISTOPH BREGLER
Computer Science Department, Stanford University, Stanford, CA 94305, USA
(Present address: Computer Science Department, Courant Institute, Media Research Lab, 719 Broadway, 12th Floor, New York, NY 10003, USA.)
chris.bregler@nyu.edu
JITENDRA MALIK
Computer Science Department, University of California at Berkeley, Berkeley, CA 94720, USA
malik@cs.berkeley.edu
KATHERINE PULLEN
Physics Department, Stanford University, Stanford, CA 94305, USA
pullen@graphics.stanford.edu
Received December 14, 1999; Revised May 27, 2003; Accepted May 30, 2003
Abstract. This paper demonstrates a new visual motion estimation technique that is able to recover high degree-of-freedom articulated human body configurations in complex video sequences. We introduce the use and integration of a mathematical technique, the product of exponential maps and twist motions, into a differential motion estimation. This results in solving simple linear systems, and enables us to robustly recover the kinematic degrees of freedom in noisy and complex self-occluded configurations. A new factorization technique lets us also recover the kinematic chain model itself. We are able to track several human walk cycles, several wallaby hop cycles, and two walk cycles of the famous movements of Eadweard Muybridge's motion studies from the last century. To the best of our knowledge, this is the first computer vision based system that is able to process such challenging footage.
Keywords: human tracking, motion capture, kinematic chains, twists, exponential maps
1. Introduction
The estimation of image motion without any domain
constraints is an underconstrained problem. Therefore
all proposed motion estimation algorithms involve
additional constraints about the assumed motion
structure. One class of motion estimation techniques
is based on parametric algorithms (Bergen et al.,
1992). These techniques rely on solving a highly
overconstrained system of linear equations. For exam-
ple, if an image patch could be modeled as a planar
surface, an affine motion model with low degrees of
freedom (6 DOF) can be estimated. Measurements
over many pixel locations have to comply with this
motion model. Noise in image features and ambiguous
motion patterns can be overcome by measurements
from features at other image locations. If the motion
can be approximated by this simple motion model,
sub-pixel accuracy can be achieved.
Problems occur if the motion of such a patch is not
well described by the assumed motion model. Others
have shown how to extend this approach to multiple
independently moving motion areas (Jepson and Black,
1993; Ayer and Sawhney, 1995; Weiss and Adelson, 1995).
For each area, this approach still has the advantage that
a large number of measurements are incorporated into
a low-DOF linear motion estimation. Problems occur
if some of the areas do not have a large number of
pixel locations or have mostly noisy or ambiguous mo-
tion measurements. One example is the measurement
of human body motion. Each body segment can be ap-
proximated by one rigid moving object. Unfortunately,
in standard video sequences the areas of such body seg-
ments are very small, the motion of leg and arm seg-
ments is ambiguous in certain directions (for exam-
ple parallel to the boundaries), and deforming clothes
cause noisy measurements.
If we increase the ratio between the number of mea-
surements and the degrees of freedom, the motion
estimation will be more robust. This can be done us-
ing additional constraints. Body segments don’t move
independently; they are attached by body joints. This
reduces the number of free parameters dramatically. A
convenient way of describing these additional domain
constraints is the twist and product of exponential map
formalism for kinematic chains (Murray et al., 1994).
The motion of one body segment can be described as
the motion of the previous segment in a kinematic chain
and an angular motion around a body joint. This adds
just a single DOF for each additional segment in the
chain. In addition, the exponential map formulation
makes it possible to relate the image motion vectors
linearly to the angular velocity.
Others have modeled the human body with rigid seg-
ments connected at joints (Hogg, 1983; Rohr, 1993;
Rehg and Kanade, 1995; Gavrila and Davis, 1995;
Goncalves et al., 1995; Clergue et al., 1995; Ju et al.,
1996; Kakadiaris and Metaxas, 1996), but use differ-
ent representations and features (for example Denavit-
Hartenberg and edge detection). The introduction of
twists and product of exponential maps into region-
based motion estimation simplifies the estimation dra-
matically and leads to robust tracking results. Besides
tracking, we also outline how to fine-tune the kine-
matic model itself. Here the ratio between the number
of measurements and the degrees of freedom is even
larger, because we can optimize over a complete image
sequence.
Alternative solutions to tracking of human bodies
were proposed by Wren et al. (1995) in tracking color
blobs, and by Davis and Bobick (1997) in using motion
templates. Nonrigid models were proposed by Pentland
and Horowitz (1991), Blake et al. (1995), Black and
Yacoob (1995) and Black et al. (1997).
Section 2 introduces the new motion tracking and
kinematic model acquisition framework and its mathe-
matical formulation, Section 3 details our experiments,
and we discuss the results and future directions in
Section 4.
The tracking technique of this paper has been pre-
sented in a shorter conference proceeding version in
Bregler and Malik (1998). The new model acquisition
technique has not been published previously.
2. Motion Estimation
We first describe a commonly used region-based mo-
tion estimation framework (Bergen et al., 1992;
Shi and Tomasi, 1994), and then describe the ex-
tension to kinematic chain constraints (Murray et al.,
1994).
2.1. Preliminaries
Assuming that changes in image intensity are only due
to translation of local image intensity, a parametric im-
age motion between consecutive time frames t and t +1
can be described by the following equation:
$$I(x + u_x(x, y),\; y + u_y(x, y),\; t + 1) = I(x, y, t) \quad (1)$$
$I(x, y, t)$ is the image intensity. The motion model $u(x, y) = [u_x(x, y), u_y(x, y)]^T$ describes the pixel displacement dependent on location $(x, y)$ and model parameters $\phi$. For example, a 2D affine motion model with parameters $\phi = [a_1, a_2, a_3, a_4, d_x, d_y]^T$ is defined as

$$u(x, y) = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} d_x \\ d_y \end{bmatrix} \quad (2)$$
The first-order Taylor series expansion of (1) leads
to the commonly used gradient formulation (Lucas and
Kanade, 1981):
$$I_t(x, y) + [I_x(x, y), I_y(x, y)] \cdot u(x, y) = 0 \quad (3)$$

$I_t(x, y)$ is the temporal image gradient and $[I_x(x, y), I_y(x, y)]$ is the spatial image gradient at location $(x, y)$. Assuming a motion model of $K$ degrees of freedom (in the case of the affine model $K = 6$) and a region of $N > K$ pixels, we can write an over-constrained set of $N$ equations. For the case that the motion model is linear (as in the affine case), we can write the set of equations in matrix form (see Bergen et al., 1992 for details):

$$H \cdot \phi + \vec{z} = \vec{0} \quad (4)$$

where $H \in \mathbb{R}^{N \times K}$ and $\vec{z} \in \mathbb{R}^{N}$. The least squares solution to (3) is:

$$\phi = -(H^T \cdot H)^{-1} \cdot H^T \cdot \vec{z} \quad (5)$$
Because (4) is the first-order Taylor series lineariza-
tion of (1), we linearize around the new solution and it-
erate. This is done by warping the image I (t +1) using
the motion model parameters φ found by (5). Based
on the re-warped image we compute the new image
gradients (3). Repeating this process is equivalent to a
Newton-Raphson style minimization.
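As a concrete illustration of the warping step in this loop, here is a minimal NumPy/SciPy sketch (the function name and interpolation conventions are ours, not from the paper):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_affine(img, phi):
    """Warp I(t+1) back toward I(t) using the current affine estimate (Eq. (2))."""
    a1, a2, a3, a4, dx, dy = phi
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    ux = a1 * xs + a2 * ys + dx          # u_x(x, y)
    uy = a3 * xs + a4 * ys + dy          # u_y(x, y)
    # Bilinear sampling at the displaced positions
    return map_coordinates(img, [ys + uy, xs + ux], order=1)
```

Each iteration estimates an incremental $\phi$ from the gradients of the re-warped image, composes it with the running estimate, and repeats until the increment is negligible.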
A convenient representation of the shape of an image region is a probability mask $w(x, y) \in [0, 1]$. $w(x, y) = 1$ declares that pixel $(x, y)$ is part of the region. Equation (5) can be modified such that it weights the contribution of pixel location $(x, y)$ according to $w(x, y)$:

$$\phi = -((W \cdot H)^T \cdot H)^{-1} \cdot (W \cdot H)^T \cdot \vec{z} \quad (6)$$

$W$ is an $N \times N$ diagonal matrix with $W(i, i) = w(x_i, y_i)$. We assume for now that we know the exact
shape of the region. For example, if we want to estimate
the motion parameters for a human body part, we sup-
ply a weight matrix W that defines the image support
map of that specific body part, and run this estimation
technique for several iterations. Section 2.4 describes
how we can estimate the shape of the support maps as
well.
Tracking over multiple frames can be achieved by
applying this optimization technique successively over
the complete image sequence.
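Putting Eqs. (2), (3) and (6) together, a single weighted estimation step fits in a few lines. The sketch below assumes precomputed gradient images and uses our own function name; it is an illustration, not the authors' implementation:

```python
import numpy as np

def estimate_affine_motion(Ix, Iy, It, w):
    """One weighted least-squares step for the affine model, Eqs. (3) and (6).

    Ix, Iy, It: spatial and temporal gradient images.
    w: support map w(x, y) in [0, 1].
    Returns phi = [a1, a2, a3, a4, dx, dy].
    """
    ys, xs = np.nonzero(w > 0)
    Ixs, Iys, Its = Ix[ys, xs], Iy[ys, xs], It[ys, xs]
    # Row i of H holds the coefficients of Eq. (3) for pixel i:
    # It + Ix*(a1*x + a2*y + dx) + Iy*(a3*x + a4*y + dy) = 0
    H = np.stack([Ixs * xs, Ixs * ys, Iys * xs, Iys * ys, Ixs, Iys], axis=1)
    z = Its
    WH = w[ys, xs][:, None] * H          # W is diagonal, so W @ H row-scales H
    return -np.linalg.solve(WH.T @ H, WH.T @ z)   # Eq. (6)
```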
2.2. Twists and the Product of Exponential Formula
In the following we develop a motion model u(x, y)
for a 3D kinematic chain under scaled orthographic
projection and show how these domain constraints can
be incorporated into one linear system similar to (6). φ
will represent the 3D pose and angle configuration of
such a kinematic chain and can be tracked in the same
fashion as already outlined for simpler motion models.
2.2.1. 3D Pose. The pose of an object relative to the camera frame can be represented as a rigid body transformation in $\mathbb{R}^3$ using homogeneous coordinates (we will use the notation from Murray et al. (1994)):

$$q_c = G \cdot q_o \quad \text{with} \quad G = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} & d_x \\ r_{2,1} & r_{2,2} & r_{2,3} & d_y \\ r_{3,1} & r_{3,2} & r_{3,3} & d_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (7)$$
$q_o = [x_o, y_o, z_o, 1]^T$ is a point in the object frame and $q_c = [x_c, y_c, z_c, 1]^T$ is the corresponding point in the camera frame. Using scaled orthographic projection with scale $s$, the point $q_c$ in the camera frame gets projected into the image point $[x_{im}, y_{im}]^T = s \cdot [x_c, y_c]^T$.
The 3D translation $[d_x, d_y, d_z]^T$ can be arbitrary, but the rotation matrix

$$R = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{3,1} & r_{3,2} & r_{3,3} \end{bmatrix} \in SO(3) \quad (8)$$

has only 3 degrees of freedom. Therefore the rigid body transformation $G \in SE(3)$ has a total of 6 degrees of freedom.
Our goal is to find a model of the image motion
that is parameterized by 6 degrees of freedom for the
3D rigid motion and the scale factor s for scaled ortho-
graphic projection. Euler angles are commonly used to
constrain the rotation matrix to SO(3), but they suffer
from singularities and don’t lead to a simple formula-
tion in the optimization procedure (for example Basu
et al. (1996) propose a 3D ellipsoidal tracker based on
Euler angles). In contrast, the twist representation pro-
vides a more elegant solution (Murray et al., 1994) and
leads to a very simple linear representation of the mo-
tion model. It is based on the observation that every
rigid motion can be represented as a rotation around a
3D axis and a translation along this axis. A twist $\xi$ has two representations: (a) a 6D vector, or (b) a $4 \times 4$ matrix with the upper $3 \times 3$ component as a skew-symmetric matrix:

$$\xi = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} \quad \text{or} \quad \hat{\xi} = \begin{bmatrix} 0 & -\omega_z & \omega_y & v_1 \\ \omega_z & 0 & -\omega_x & v_2 \\ -\omega_y & \omega_x & 0 & v_3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \quad (9)$$

$\omega$ is a 3D unit vector that points in the direction of the rotation axis. The amount of rotation is specified with a scalar angle $\theta$ that is multiplied by the twist: $\xi\theta$. The $v$ component determines the location of the rotation axis and the amount of translation along this axis. It can be shown that for any arbitrary $G \in SE(3)$ there exists a twist representation $\xi \in \mathbb{R}^6$. See Murray et al. (1994) for more formal properties and a detailed geometric interpretation. It is convenient to drop the $\theta$ coefficient by relaxing the constraint that $\omega$ is unit length. Therefore $\xi \in \mathbb{R}^6$.
A twist can be converted into the $G$ representation with the following exponential map:

$$G = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} & d_x \\ r_{2,1} & r_{2,2} & r_{2,3} & d_y \\ r_{3,1} & r_{3,2} & r_{3,3} & d_z \\ 0 & 0 & 0 & 1 \end{bmatrix} = e^{\hat{\xi}} = I + \hat{\xi} + \frac{\hat{\xi}^2}{2!} + \frac{\hat{\xi}^3}{3!} + \cdots \quad (10)$$
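Numerically, Eq. (10) is a matrix exponential, so a short sketch suffices (we use scipy.linalg.expm rather than the truncated series; function names are ours):

```python
import numpy as np
from scipy.linalg import expm

def twist_hat(xi):
    """4x4 matrix form of a twist xi = [v1, v2, v3, wx, wy, wz], Eq. (9)."""
    v1, v2, v3, wx, wy, wz = xi
    return np.array([[0.0, -wz,  wy, v1],
                     [ wz, 0.0, -wx, v2],
                     [-wy,  wx, 0.0, v3],
                     [0.0, 0.0, 0.0, 0.0]])

def twist_to_G(xi):
    """Rigid body transform G = exp(xi_hat) of Eq. (10)."""
    return expm(twist_hat(xi))
```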
2.2.2. Twist Motion Model. At this point we would
like to track the 3D pose of a rigid object under
scaled orthographic projection. We will extend this
formulation in the next section to a kinematic chain
representation. The pose of an object is defined as
$[s, \xi^T]^T = [s, v_1, v_2, v_3, \omega_x, \omega_y, \omega_z]^T$. A point $q_o$ in the object frame is projected to the image location $[x_{im}, y_{im}]$ with:

$$\begin{bmatrix} x_{im} \\ y_{im} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot s \cdot e^{\hat{\xi}} \cdot q_o = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot q_c \quad (11)$$

$s$ is the scale change of the scaled orthographic projection. The image motion of point $[x_{im}, y_{im}]$ from time $t$ to time $t + 1$ is:

$$\begin{aligned} \begin{bmatrix} u_x \\ u_y \end{bmatrix} &= \begin{bmatrix} x_{im}(t+1) - x_{im}(t) \\ y_{im}(t+1) - y_{im}(t) \end{bmatrix} \\ &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \left( s(t+1) \cdot e^{\hat{\xi}(t+1)} - s(t) \cdot e^{\hat{\xi}(t)} \right) \cdot q_o \\ &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \left( (1 + s) \cdot e^{\hat{\xi}} - I \right) \cdot s(t) \cdot e^{\hat{\xi}(t)} \cdot q_o \\ &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \left( (1 + s) \cdot e^{\hat{\xi}} - I \right) \cdot q_c \end{aligned} \quad (12)$$

with

$$e^{\hat{\xi}(t+1)} = e^{\hat{\xi}} \cdot e^{\hat{\xi}(t)}, \qquad s(t+1) = s(t) \cdot (1 + s), \qquad q_c = s(t) \cdot e^{\hat{\xi}(t)} \cdot q_o \quad (13)$$

Using the first order Taylor expansion from (10) we can approximate:

$$(1 + s) \cdot e^{\hat{\xi}} \approx (1 + s) \cdot I + (1 + s) \cdot \hat{\xi} \quad (14)$$

and can rewrite (12) as:

$$\begin{bmatrix} u_x \\ u_y \end{bmatrix} = \begin{bmatrix} s & -\omega_z & \omega_y & v_1 \\ \omega_z & s & -\omega_x & v_2 \end{bmatrix} \cdot q_c \quad (15)$$

with $\xi = [v_1, v_2, v_3, \omega_x, \omega_y, \omega_z]^T$.

$\phi = [s, v_1, v_2, \omega_x, \omega_y, \omega_z]^T$ codes the relative scale and twist motion from time $t$ to $t + 1$. Note that (15) does not include $v_3$. Translation in the $Z$ direction of the camera frame is not measurable under scaled orthographic projection.
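In code, Eq. (15) is a single $2 \times 4$ matrix applied to each camera-frame point. The following NumPy sketch (our naming, not the paper's) makes the linearity in $\phi$ explicit:

```python
import numpy as np

def twist_flow(phi, qc):
    """Predicted image motion of camera-frame points under Eq. (15).

    phi: [s, v1, v2, wx, wy, wz], the relative scale and twist from t to t+1.
    qc:  (N, 4) array of homogeneous camera-frame points [x, y, z, 1].
    Returns an (N, 2) array of flow vectors [ux, uy]. v3 does not appear:
    translation in Z is unobservable under scaled orthography.
    """
    s, v1, v2, wx, wy, wz = phi
    A = np.array([[s,  -wz,  wy, v1],
                  [wz,   s, -wx, v2]])
    return qc @ A.T
```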
2.2.3. 3D Geometric Model. Equation (15) describes the image motion of a point $[x_{im}, y_{im}]$ in terms of the motion parameters $\phi$ and the corresponding 3D point $q_c$ in the camera frame. As previously defined in Eq. (7), $q_c$ is a homogeneous vector $[x, y, z, 1]^T$. It is the point that intersects the camera ray of the image point $[x_{im}, y_{im}]$ with the 3D model. The 3D model is given by the user (for example a cylinder, superquadric, or polygonal model) or is estimated by an initialization procedure that we will describe below. The pose of the 3D model is defined by $G(t) = s(t) \cdot e^{\hat{\xi}(t)}$. We assume $G(t)$ is the correct pose estimate for image frame $I(x, y, t)$ (the estimation result of this algorithm over the previous time frame). Since we assume scaled orthographic projection (11), $[x_{im}, y_{im}] = [x, y]$. We only need to determine $z$. In this paper we approximate the body segments by ellipsoidal 3D blobs. The 3D blobs are defined in the object frame. The following quadratic equation is the implicit function for the ellipsoidal surface with length $1/a_x$, $1/a_y$, $1/a_z$ along the $x$, $y$, $z$ axes and centered around $M = [m_x, m_y, m_z, 1]^T$:

$$(q_o - M)^T \cdot \begin{bmatrix} a_x^2 & 0 & 0 & 0 \\ 0 & a_y^2 & 0 & 0 \\ 0 & 0 & a_z^2 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \cdot (q_o - M) = 1 \quad (16)$$
Since $q_o = G^{-1} \cdot q_c = G^{-1} \cdot [x_{im}, y_{im}, z, 1]^T$, we can write the implicit function in the camera frame as:

$$\left( G^{-1} \cdot \begin{bmatrix} x_{im} \\ y_{im} \\ z \\ 1 \end{bmatrix} - M \right)^{T} \cdot \begin{bmatrix} a_x^2 & 0 & 0 & 0 \\ 0 & a_y^2 & 0 & 0 \\ 0 & 0 & a_z^2 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \cdot \left( G^{-1} \cdot \begin{bmatrix} x_{im} \\ y_{im} \\ z \\ 1 \end{bmatrix} - M \right) = 1 \quad (17)$$
Therefore $z$ is the solution of this quadratic Eq. (17). For image points that are inside the blob it has two (closed-form) solutions; we pick the smaller one (the $z$ value that is closer to the camera). Using (17) we can calculate the $q_c$ points for all points inside the blob. For points outside the blob it has no solution, and those points are not part of the estimation setup.
For more complex 3D shape models, the z cal-
culation can be replaced by standard graphics ray-
casting algorithms. We have not implemented this
generalization yet.
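Along each orthographic viewing ray, Eq. (17) reduces to a scalar quadratic in $z$. A sketch of that closed-form depth lookup (helper name and argument conventions are ours, assumed for illustration):

```python
import numpy as np

def blob_depth(x_im, y_im, G, M, a):
    """Solve Eq. (17) for z along the orthographic ray through (x_im, y_im).

    G: 4x4 blob pose, M: ellipsoid center as homogeneous [mx, my, mz, 1],
    a: [ax, ay, az] inverse axis lengths. Returns the smaller root
    (the z closer to the camera) or None if the point is outside the blob.
    """
    Ginv = np.linalg.inv(G)
    D = np.diag([a[0]**2, a[1]**2, a[2]**2, 0.0])
    # q_o(z) = Ginv @ [x_im, y_im, z, 1] - M is linear in z: q_o = p + z * r
    r = Ginv[:, 2]                                   # coefficient of z
    p = Ginv @ np.array([x_im, y_im, 0.0, 1.0]) - M  # z = 0 part
    # (p + z r)^T D (p + z r) = 1  ->  A z^2 + B z + C = 0
    A = r @ D @ r
    B = 2.0 * (p @ D @ r)
    C = p @ D @ p - 1.0
    disc = B * B - 4.0 * A * C
    if disc < 0:
        return None                                  # ray misses the blob
    return (-B - np.sqrt(disc)) / (2.0 * A)
```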
2.2.4. Combining 3D Motion and Geometric Model. Inserting (15) into (3) leads to the following equation for each point $[x_i, y_i]$ inside the blob:

$$I_t + I_x \cdot [s, -\omega_z, \omega_y, v_1] \cdot q_c + I_y \cdot [\omega_z, s, -\omega_x, v_2] \cdot q_c = 0$$

$$I_t(i) + H_i \cdot [s, v_1, v_2, \omega_x, \omega_y, \omega_z]^T = 0 \quad (18)$$

with

$$H_i = [I_x \cdot x_i + I_y \cdot y_i,\; I_x,\; I_y,\; -I_y \cdot z_i,\; I_x \cdot z_i,\; -I_x \cdot y_i + I_y \cdot x_i] \in \mathbb{R}^{1 \times 6}$$

$$I_t := I_t(x_i, y_i), \quad I_x := I_x(x_i, y_i), \quad I_y := I_y(x_i, y_i)$$
For $N$ pixel positions we have $N$ equations of the form (18). This can be written in matrix form:

$$H \cdot \phi + \vec{z} = \vec{0} \quad (19)$$

with

$$H = \begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{bmatrix} \quad \text{and} \quad \vec{z} = \begin{bmatrix} I_t(x_1, y_1) \\ I_t(x_2, y_2) \\ \vdots \\ I_t(x_N, y_N) \end{bmatrix}$$

Finding the least-squares solution (the 3D twist motion $\phi$) for this equation is done using (6).
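Putting Eqs. (17), (18), (19) and (6) together, the per-segment twist update has the same shape as the affine estimator sketched earlier. Again a sketch under our own naming, with gradients and per-pixel depth assumed precomputed:

```python
import numpy as np

def estimate_twist_motion(Ix, Iy, It, support, depth):
    """Stack Eq. (18) rows into Eq. (19) and solve with the weights of Eq. (6).

    support: weight map w(x, y) of the body segment.
    depth:   z_i for every supported pixel (from Eq. (17)).
    Returns phi = [s, v1, v2, wx, wy, wz].
    """
    ys, xs = np.nonzero(support > 0)
    Ixs, Iys = Ix[ys, xs], Iy[ys, xs]
    zs = depth[ys, xs]
    H = np.stack([Ixs * xs + Iys * ys,      # s
                  Ixs,                      # v1
                  Iys,                      # v2
                  -Iys * zs,                # wx
                  Ixs * zs,                 # wy
                  -Ixs * ys + Iys * xs],    # wz
                 axis=1)
    z = It[ys, xs]
    WH = support[ys, xs][:, None] * H
    return -np.linalg.solve(WH.T @ H, WH.T @ z)   # Eq. (6)
```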
2.2.5. Kinematic Chain as a Product of Exponentials. So far we have parameterized the 3D pose and motion of a body segment by the 6 parameters of a twist $\xi$. Points on this body segment in a canonical object frame are transformed into a camera frame by the mapping $G_0 = e^{\hat{\xi}}$. Assume that a second body segment is attached to the first segment with a joint. The joint can be defined by an axis of rotation in the object frame. We define this rotation axis in the object frame by a 3D unit vector $\omega_1$ along the axis, and a point $q_1$ on the axis (Fig. 1). This is a revolute joint, and can be modeled by a twist (Murray et al., 1994).

[Figure 1. Kinematic chain defined by twists.]
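Although the chain derivation is cut off here, the construction it leads to is the standard product-of-exponentials map of Murray et al. (1994): a revolute joint with unit axis $\omega$ through a point $q$ has twist $\xi = [-\omega \times q, \omega]$, and a segment moves with the base pose composed with one exponential per upstream joint. A minimal sketch under that convention (function names ours, assumed for illustration):

```python
import numpy as np
from scipy.linalg import expm

def twist_hat(xi):
    """4x4 matrix form of a twist (see Eq. (9))."""
    v1, v2, v3, wx, wy, wz = xi
    return np.array([[0.0, -wz,  wy, v1],
                     [ wz, 0.0, -wx, v2],
                     [-wy,  wx, 0.0, v3],
                     [0.0, 0.0, 0.0, 0.0]])

def revolute_twist(omega, q):
    """Twist of a revolute joint with unit axis omega through point q."""
    return np.concatenate([-np.cross(omega, q), omega])

def segment_pose(xi0, joints, thetas):
    """Pose of a chain segment: base pose e^xi0_hat times one exponential
    per upstream revolute joint (omega, q) with angle theta."""
    G = expm(twist_hat(xi0))
    for (omega, q), theta in zip(joints, thetas):
        G = G @ expm(twist_hat(revolute_twist(omega, q)) * theta)
    return G
```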

References

Lucas, B.D. and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679.

Murray, R.M., Li, Z., and Sastry, S.S. 1994. A Mathematical Introduction to Robotic Manipulation. CRC Press.

Wren, C.R., Azarbayejani, A., Darrell, T., and Pentland, A.P. 1997. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785.