International Journal of Computer Vision 56(3), 179–194, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Twist Based Acquisition and Tracking of Animal and Human Kinematics
CHRISTOPH BREGLER
Computer Science Department, Stanford University, Stanford, CA 94305, USA
(Present address: Computer Science Department, Courant Institute, Media Research Lab, 719 Broadway, 12th Floor, New York, NY 10003, USA.)
chris.bregler@nyu.edu
JITENDRA MALIK
Computer Science Department, University of California at Berkeley, Berkeley, CA 94720, USA
malik@cs.berkeley.edu
KATHERINE PULLEN
Physics Department, Stanford University, Stanford, CA 94305, USA
pullen@graphics.stanford.edu
Received December 14, 1999; Revised May 27, 2003; Accepted May 30, 2003
Abstract. This paper demonstrates a new visual motion estimation technique that is able to recover high degree-of-freedom articulated human body configurations in complex video sequences. We introduce the use and integration of a mathematical technique, the product of exponential maps and twist motions, into a differential motion estimation. This results in solving simple linear systems, and enables us to robustly recover the kinematic degrees of freedom in noisy and complex self-occluded configurations. A new factorization technique lets us also recover the kinematic chain model itself. We are able to track several human walk cycles, several wallaby hop cycles, and two walk cycles of the famous movements of Eadweard Muybridge's motion studies from the last century. To the best of our knowledge, this is the first computer vision based system that is able to process such challenging footage.
Keywords: human tracking, motion capture, kinematic chains, twists, exponential maps
1. Introduction
The estimation of image motion without any domain
constraints is an underconstrained problem. Therefore
all proposed motion estimation algorithms involve
additional constraints about the assumed motion
structure. One class of motion estimation techniques
is based on parametric algorithms (Bergen et al.,
1992). These techniques rely on solving a highly
overconstrained system of linear equations. For exam-
ple, if an image patch could be modeled as a planar
surface, an affine motion model with low degrees of
freedom (6 DOF) can be estimated. Measurements
over many pixel locations have to comply with this
motion model. Noise in image features and ambiguous
motion patterns can be overcome by measurements
from features at other image locations. If the motion
can be approximated by this simple motion model,
sub-pixel accuracy can be achieved.
Problems occur if the motion of such a patch is not
well described by the assumed motion model. Others
have shown how to extend this approach to multiple
independently moving motion areas (Jepson and Black,
1993; Ayer and Sawhney, 1995; Weiss and Adelson, 1995).
For each area, this approach still has the advantage that
a large number of measurements are incorporated into
a low-DOF linear motion estimation. Problems occur
if some of the areas do not have a large number of
pixel locations or have mostly noisy or ambiguous mo-
tion measurements. One example is the measurement
of human body motion. Each body segment can be ap-
proximated by one rigid moving object. Unfortunately,
in standard video sequences the areas of such body seg-
ments are very small, the motion of leg and arm seg-
ments is ambiguous in certain directions (for exam-
ple parallel to the boundaries), and deforming clothes
cause noisy measurements.
If we increase the ratio between the number of mea-
surements and the degrees of freedom, the motion
estimation will be more robust. This can be done us-
ing additional constraints. Body segments don’t move
independently; they are attached by body joints. This
reduces the number of free parameters dramatically. A
convenient way of describing these additional domain
constraints is the twist and product of exponential map
formalism for kinematic chains (Murray et al., 1994).
The motion of one body segment can be described as
the motion of the previous segment in a kinematic chain
and an angular motion around a body joint. This adds
just a single DOF for each additional segment in the
chain. In addition, the exponential map formulation
makes it possible to relate the image motion vectors
linearly to the angular velocity.
Others have modeled the human body with rigid seg-
ments connected at joints (Hogg, 1983; Rohr, 1993;
Rehg and Kanade, 1995; Gavrila and Davis, 1995;
Goncalves et al., 1995; Clergue et al., 1995; Ju et al.,
1996; Kakadiaris and Metaxas, 1996), but use differ-
ent representations and features (for example Denavit-
Hartenberg and edge detection). The introduction of
twists and product of exponential maps into region-
based motion estimation simplifies the estimation dra-
matically and leads to robust tracking results. Besides
tracking, we also outline how to fine-tune the kine-
matic model itself. Here the ratio between the number
of measurements and the degrees of freedom is even
larger, because we can optimize over a complete image
sequence.
Alternative solutions to tracking of human bodies
were proposed by Wren et al. (1995) in tracking color
blobs, and by Davis and Bobick (1997) in using motion
templates. Nonrigid models were proposed by Pentland
and Horowitz (1991), Blake et al. (1995), Black and
Yacoob (1995) and Black et al. (1997).
Section 2 introduces the new motion tracking and
kinematic model acquisition framework and its mathe-
matical formulation, Section 3 details our experiments,
and we discuss the results and future directions in
Section 4.
The tracking technique of this paper has been pre-
sented in a shorter conference proceeding version in
Bregler and Malik (1998). The new model acquisition
technique has not been published previously.
2. Motion Estimation
We first describe a commonly used region-based mo-
tion estimation framework (Bergen et al., 1992;
Shi and Tomasi, 1994), and then describe the ex-
tension to kinematic chain constraints (Murray et al.,
1994).
2.1. Preliminaries
Assuming that changes in image intensity are only due
to translation of local image intensity, a parametric im-
age motion between consecutive time frames t and t +1
can be described by the following equation:
$$I(x + u_x(x, y),\; y + u_y(x, y),\; t + 1) = I(x, y, t) \quad (1)$$
$I(x, y, t)$ is the image intensity. The motion model $u(x, y) = [u_x(x, y), u_y(x, y)]^T$ describes the pixel displacement dependent on location $(x, y)$ and model parameters $\phi$. For example, a 2D affine motion model with parameters $\phi = [a_1, a_2, a_3, a_4, d_x, d_y]^T$ is defined as

$$u(x, y) = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} d_x \\ d_y \end{bmatrix} \quad (2)$$
The first-order Taylor series expansion of (1) leads
to the commonly used gradient formulation (Lucas and
Kanade, 1981):
$$I_t(x, y) + [I_x(x, y), I_y(x, y)] \cdot u(x, y) = 0 \quad (3)$$

$I_t(x, y)$ is the temporal image gradient and $[I_x(x, y), I_y(x, y)]$ is the spatial image gradient at location $(x, y)$. Assuming a motion model of $K$ degrees of freedom (in the case of the affine model $K = 6$) and a region of $N > K$ pixels, we can write an over-constrained set of $N$ equations. For the case that the motion model is linear (as in the affine case), we can write the set of equations in matrix form (see Bergen et al., 1992 for details):

$$H \cdot \phi + \vec{z} = \vec{0} \quad (4)$$

where $H \in \mathbb{R}^{N \times K}$ and $\vec{z} \in \mathbb{R}^{N}$. The least squares solution to (3) is:

$$\phi = -(H^T \cdot H)^{-1} \cdot H^T \cdot \vec{z} \quad (5)$$
Because (4) is the first-order Taylor series lineariza-
tion of (1), we linearize around the new solution and it-
erate. This is done by warping the image I (t +1) using
the motion model parameters φ found by (5). Based
on the re-warped image we compute the new image
gradients (3). Repeating this process is equivalent to a
Newton-Raphson style minimization.
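As a concrete illustration of the warping step in this loop, here is a minimal NumPy/SciPy sketch (the function name and interpolation conventions are ours, not from the paper):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_affine(img, phi):
    """Warp I(t+1) back toward I(t) using the current affine estimate (Eq. (2))."""
    a1, a2, a3, a4, dx, dy = phi
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    ux = a1 * xs + a2 * ys + dx          # u_x(x, y)
    uy = a3 * xs + a4 * ys + dy          # u_y(x, y)
    # Bilinear sampling at the displaced positions
    return map_coordinates(img, [ys + uy, xs + ux], order=1)
```

Each iteration estimates an incremental $\phi$ from the gradients of the re-warped image, composes it with the running estimate, and repeats until the increment is negligible.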
A convenient representation of the shape of an image region is a probability mask $w(x, y) \in [0, 1]$. $w(x, y) = 1$ declares that pixel $(x, y)$ is part of the region. Equation (5) can be modified such that it weights the contribution of pixel location $(x, y)$ according to $w(x, y)$:

$$\phi = -((W \cdot H)^T \cdot H)^{-1} \cdot (W \cdot H)^T \cdot \vec{z} \quad (6)$$

$W$ is an $N \times N$ diagonal matrix with $W(i, i) = w(x_i, y_i)$. We assume for now that we know the exact
shape of the region. For example, if we want to estimate
the motion parameters for a human body part, we sup-
ply a weight matrix W that defines the image support
map of that specific body part, and run this estimation
technique for several iterations. Section 2.4 describes
how we can estimate the shape of the support maps as
well.
Tracking over multiple frames can be achieved by
applying this optimization technique successively over
the complete image sequence.
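Putting Eqs. (2), (3) and (6) together, a single weighted estimation step fits in a few lines. The sketch below assumes precomputed gradient images and uses our own function name; it is an illustration, not the authors' implementation:

```python
import numpy as np

def estimate_affine_motion(Ix, Iy, It, w):
    """One weighted least-squares step for the affine model, Eqs. (3) and (6).

    Ix, Iy, It: spatial and temporal gradient images.
    w: support map w(x, y) in [0, 1].
    Returns phi = [a1, a2, a3, a4, dx, dy].
    """
    ys, xs = np.nonzero(w > 0)
    Ixs, Iys, Its = Ix[ys, xs], Iy[ys, xs], It[ys, xs]
    # Row i of H holds the coefficients of Eq. (3) for pixel i:
    # It + Ix*(a1*x + a2*y + dx) + Iy*(a3*x + a4*y + dy) = 0
    H = np.stack([Ixs * xs, Ixs * ys, Iys * xs, Iys * ys, Ixs, Iys], axis=1)
    z = Its
    WH = w[ys, xs][:, None] * H          # W is diagonal, so W @ H row-scales H
    return -np.linalg.solve(WH.T @ H, WH.T @ z)   # Eq. (6)
```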
2.2. Twists and the Product of Exponential Formula
In the following we develop a motion model u(x, y)
for a 3D kinematic chain under scaled orthographic
projection and show how these domain constraints can
be incorporated into one linear system similar to (6). φ
will represent the 3D pose and angle configuration of
such a kinematic chain and can be tracked in the same
fashion as already outlined for simpler motion models.
2.2.1. 3D Pose. The pose of an object relative to the camera frame can be represented as a rigid body transformation in $\mathbb{R}^3$ using homogeneous coordinates (we will use the notation from Murray et al. (1994)):

$$q_c = G \cdot q_o \quad \text{with} \quad G = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} & d_x \\ r_{2,1} & r_{2,2} & r_{2,3} & d_y \\ r_{3,1} & r_{3,2} & r_{3,3} & d_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (7)$$
$q_o = [x_o, y_o, z_o, 1]^T$ is a point in the object frame and $q_c = [x_c, y_c, z_c, 1]^T$ is the corresponding point in the camera frame. Using scaled orthographic projection with scale $s$, the point $q_c$ in the camera frame gets projected into the image point $[x_{im}, y_{im}]^T = s \cdot [x_c, y_c]^T$.
The 3D translation $[d_x, d_y, d_z]^T$ can be arbitrary, but the rotation matrix

$$R = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{3,1} & r_{3,2} & r_{3,3} \end{bmatrix} \in SO(3) \quad (8)$$

has only 3 degrees of freedom. Therefore the rigid body transformation $G \in SE(3)$ has a total of 6 degrees of freedom.
Our goal is to find a model of the image motion
that is parameterized by 6 degrees of freedom for the
3D rigid motion and the scale factor s for scaled ortho-
graphic projection. Euler angles are commonly used to
constrain the rotation matrix to SO(3), but they suffer
from singularities and don’t lead to a simple formula-
tion in the optimization procedure (for example Basu
et al. (1996) propose a 3D ellipsoidal tracker based on
Euler angles). In contrast, the twist representation pro-
vides a more elegant solution (Murray et al., 1994) and
leads to a very simple linear representation of the mo-
tion model. It is based on the observation that every
rigid motion can be represented as a rotation around a
3D axis and a translation along this axis. A twist $\xi$ has two representations: (a) a 6D vector, or (b) a $4 \times 4$ matrix with the upper $3 \times 3$ component as a skew-symmetric matrix:

$$\xi = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} \quad \text{or} \quad \hat{\xi} = \begin{bmatrix} 0 & -\omega_z & \omega_y & v_1 \\ \omega_z & 0 & -\omega_x & v_2 \\ -\omega_y & \omega_x & 0 & v_3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \quad (9)$$

$\omega$ is a 3D unit vector that points in the direction of the rotation axis. The amount of rotation is specified with a scalar angle $\theta$ that is multiplied by the twist: $\xi\theta$. The $v$ component determines the location of the rotation axis and the amount of translation along this axis. It can be shown that for any arbitrary $G \in SE(3)$ there exists a twist representation $\xi \in \mathbb{R}^6$. See Murray et al. (1994) for more formal properties and a detailed geometric interpretation. It is convenient to drop the $\theta$ coefficient by relaxing the constraint that $\omega$ is unit length. Therefore $\xi \in \mathbb{R}^6$.
A twist can be converted into the $G$ representation with the following exponential map:

$$G = \begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3} & d_x \\ r_{2,1} & r_{2,2} & r_{2,3} & d_y \\ r_{3,1} & r_{3,2} & r_{3,3} & d_z \\ 0 & 0 & 0 & 1 \end{bmatrix} = e^{\hat{\xi}} = I + \hat{\xi} + \frac{\hat{\xi}^2}{2!} + \frac{\hat{\xi}^3}{3!} + \cdots \quad (10)$$
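Numerically, Eq. (10) is a matrix exponential, so a short sketch suffices (we use scipy.linalg.expm rather than the truncated series; function names are ours):

```python
import numpy as np
from scipy.linalg import expm

def twist_hat(xi):
    """4x4 matrix form of a twist xi = [v1, v2, v3, wx, wy, wz], Eq. (9)."""
    v1, v2, v3, wx, wy, wz = xi
    return np.array([[0.0, -wz,  wy, v1],
                     [ wz, 0.0, -wx, v2],
                     [-wy,  wx, 0.0, v3],
                     [0.0, 0.0, 0.0, 0.0]])

def twist_to_G(xi):
    """Rigid body transform G = exp(xi_hat) of Eq. (10)."""
    return expm(twist_hat(xi))
```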
2.2.2. Twist Motion Model. At this point we would
like to track the 3D pose of a rigid object under
scaled orthographic projection. We will extend this
formulation in the next section to a kinematic chain
representation. The pose of an object is defined as
$[s, \xi^T]^T = [s, v_1, v_2, v_3, \omega_x, \omega_y, \omega_z]^T$. A point $q_o$ in the object frame is projected to the image location $[x_{im}, y_{im}]$ with:

$$\begin{bmatrix} x_{im} \\ y_{im} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot s \cdot e^{\hat{\xi}} \cdot q_o = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot q_c \quad (11)$$

$s$ is the scale change of the scaled orthographic projection. The image motion of point $[x_{im}, y_{im}]$ from time $t$ to time $t + 1$ is:

$$\begin{aligned} \begin{bmatrix} u_x \\ u_y \end{bmatrix} &= \begin{bmatrix} x_{im}(t+1) - x_{im}(t) \\ y_{im}(t+1) - y_{im}(t) \end{bmatrix} \\ &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \left( s(t+1) \cdot e^{\hat{\xi}(t+1)} - s(t) \cdot e^{\hat{\xi}(t)} \right) \cdot q_o \\ &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \left( (1 + s) \cdot e^{\hat{\xi}} - I \right) \cdot s(t) \cdot e^{\hat{\xi}(t)} \cdot q_o \\ &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \left( (1 + s) \cdot e^{\hat{\xi}} - I \right) \cdot q_c \end{aligned} \quad (12)$$

with

$$e^{\hat{\xi}(t+1)} = e^{\hat{\xi}} \cdot e^{\hat{\xi}(t)}, \qquad s(t+1) = s(t) \cdot (1 + s), \qquad q_c = s(t) \cdot e^{\hat{\xi}(t)} \cdot q_o \quad (13)$$

Using the first order Taylor expansion from (10) we can approximate:

$$(1 + s) \cdot e^{\hat{\xi}} \approx (1 + s) \cdot I + (1 + s) \cdot \hat{\xi} \quad (14)$$

and can rewrite (12) as:

$$\begin{bmatrix} u_x \\ u_y \end{bmatrix} = \begin{bmatrix} s & -\omega_z & \omega_y & v_1 \\ \omega_z & s & -\omega_x & v_2 \end{bmatrix} \cdot q_c \quad (15)$$

with $\xi = [v_1, v_2, v_3, \omega_x, \omega_y, \omega_z]^T$.

$\phi = [s, v_1, v_2, \omega_x, \omega_y, \omega_z]^T$ codes the relative scale and twist motion from time $t$ to $t + 1$. Note that (15) does not include $v_3$. Translation in the $Z$ direction of the camera frame is not measurable under scaled orthographic projection.
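In code, Eq. (15) is a single $2 \times 4$ matrix applied to each camera-frame point. The following NumPy sketch (our naming, not the paper's) makes the linearity in $\phi$ explicit:

```python
import numpy as np

def twist_flow(phi, qc):
    """Predicted image motion of camera-frame points under Eq. (15).

    phi: [s, v1, v2, wx, wy, wz], the relative scale and twist from t to t+1.
    qc:  (N, 4) array of homogeneous camera-frame points [x, y, z, 1].
    Returns an (N, 2) array of flow vectors [ux, uy]. v3 does not appear:
    translation in Z is unobservable under scaled orthography.
    """
    s, v1, v2, wx, wy, wz = phi
    A = np.array([[s,  -wz,  wy, v1],
                  [wz,   s, -wx, v2]])
    return qc @ A.T
```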
2.2.3. 3D Geometric Model. Equation (15) describes the image motion of a point $[x_{im}, y_{im}]$ in terms of the motion parameters $\phi$ and the corresponding 3D point $q_c$ in the camera frame. As previously defined in Eq. (7), $q_c$ is a homogeneous vector $[x, y, z, 1]^T$. It is the point that intersects the camera ray of the image point $[x_{im}, y_{im}]$ with the 3D model. The 3D model is given by the user (for example a cylinder, superquadric, or polygonal model) or is estimated by an initialization procedure that we will describe below. The pose of the 3D model is defined by $G(t) = s(t) \cdot e^{\hat{\xi}(t)}$. We assume $G(t)$ is the correct pose estimate for image frame $I(x, y, t)$ (the estimation result of this algorithm over the previous time frame). Since we assume scaled orthographic projection (11), $[x_{im}, y_{im}] = [x, y]$. We only need to determine $z$. In this paper we approximate the body segments by ellipsoidal 3D blobs. The 3D blobs are defined in the object frame. The following quadratic equation is the implicit function for the ellipsoidal surface with length $1/a_x$, $1/a_y$, $1/a_z$ along the $x$, $y$, $z$ axes and centered around $M = [m_x, m_y, m_z, 1]^T$:

$$(q_o - M)^T \cdot \begin{bmatrix} a_x^2 & 0 & 0 & 0 \\ 0 & a_y^2 & 0 & 0 \\ 0 & 0 & a_z^2 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \cdot (q_o - M) = 1 \quad (16)$$
Since $q_o = G^{-1} \cdot q_c = G^{-1} \cdot [x_{im}, y_{im}, z, 1]^T$, we can write the implicit function in the camera frame as:

$$\left( G^{-1} \cdot \begin{bmatrix} x_{im} \\ y_{im} \\ z \\ 1 \end{bmatrix} - M \right)^{T} \cdot \begin{bmatrix} a_x^2 & 0 & 0 & 0 \\ 0 & a_y^2 & 0 & 0 \\ 0 & 0 & a_z^2 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \cdot \left( G^{-1} \cdot \begin{bmatrix} x_{im} \\ y_{im} \\ z \\ 1 \end{bmatrix} - M \right) = 1 \quad (17)$$
Therefore $z$ is the solution of this quadratic Eq. (17). For image points that are inside the blob it has two (closed-form) solutions; we pick the smaller one (the $z$ value that is closer to the camera). Using (17) we can calculate the $q_c$ points for all points inside the blob. For points outside the blob it has no solution, and those points are not part of the estimation setup.
For more complex 3D shape models, the z cal-
culation can be replaced by standard graphics ray-
casting algorithms. We have not implemented this
generalization yet.
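Along each orthographic viewing ray, Eq. (17) reduces to a scalar quadratic in $z$. A sketch of that closed-form depth lookup (helper name and argument conventions are ours, assumed for illustration):

```python
import numpy as np

def blob_depth(x_im, y_im, G, M, a):
    """Solve Eq. (17) for z along the orthographic ray through (x_im, y_im).

    G: 4x4 blob pose, M: ellipsoid center as homogeneous [mx, my, mz, 1],
    a: [ax, ay, az] inverse axis lengths. Returns the smaller root
    (the z closer to the camera) or None if the point is outside the blob.
    """
    Ginv = np.linalg.inv(G)
    D = np.diag([a[0]**2, a[1]**2, a[2]**2, 0.0])
    # q_o(z) = Ginv @ [x_im, y_im, z, 1] - M is linear in z: q_o = p + z * r
    r = Ginv[:, 2]                                   # coefficient of z
    p = Ginv @ np.array([x_im, y_im, 0.0, 1.0]) - M  # z = 0 part
    # (p + z r)^T D (p + z r) = 1  ->  A z^2 + B z + C = 0
    A = r @ D @ r
    B = 2.0 * (p @ D @ r)
    C = p @ D @ p - 1.0
    disc = B * B - 4.0 * A * C
    if disc < 0:
        return None                                  # ray misses the blob
    return (-B - np.sqrt(disc)) / (2.0 * A)
```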
2.2.4. Combining 3D Motion and Geometric Model. Inserting (15) into (3) leads to the following equation for each point $[x_i, y_i]$ inside the blob:

$$I_t + I_x \cdot [s, -\omega_z, \omega_y, v_1] \cdot q_c + I_y \cdot [\omega_z, s, -\omega_x, v_2] \cdot q_c = 0$$

$$I_t(i) + H_i \cdot [s, v_1, v_2, \omega_x, \omega_y, \omega_z]^T = 0 \quad (18)$$

with

$$H_i = [I_x \cdot x_i + I_y \cdot y_i,\; I_x,\; I_y,\; -I_y \cdot z_i,\; I_x \cdot z_i,\; -I_x \cdot y_i + I_y \cdot x_i] \in \mathbb{R}^{1 \times 6}$$

$$I_t := I_t(x_i, y_i), \quad I_x := I_x(x_i, y_i), \quad I_y := I_y(x_i, y_i)$$
For $N$ pixel positions we have $N$ equations of the form (18). This can be written in matrix form:

$$H \cdot \phi + \vec{z} = \vec{0} \quad (19)$$

with

$$H = \begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{bmatrix} \quad \text{and} \quad \vec{z} = \begin{bmatrix} I_t(x_1, y_1) \\ I_t(x_2, y_2) \\ \vdots \\ I_t(x_N, y_N) \end{bmatrix}$$

Finding the least-squares solution (the 3D twist motion $\phi$) for this equation is done using (6).
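Putting Eqs. (17), (18), (19) and (6) together, the per-segment twist update has the same shape as the affine estimator sketched earlier. Again a sketch under our own naming, with gradients and per-pixel depth assumed precomputed:

```python
import numpy as np

def estimate_twist_motion(Ix, Iy, It, support, depth):
    """Stack Eq. (18) rows into Eq. (19) and solve with the weights of Eq. (6).

    support: weight map w(x, y) of the body segment.
    depth:   z_i for every supported pixel (from Eq. (17)).
    Returns phi = [s, v1, v2, wx, wy, wz].
    """
    ys, xs = np.nonzero(support > 0)
    Ixs, Iys = Ix[ys, xs], Iy[ys, xs]
    zs = depth[ys, xs]
    H = np.stack([Ixs * xs + Iys * ys,      # s
                  Ixs,                      # v1
                  Iys,                      # v2
                  -Iys * zs,                # wx
                  Ixs * zs,                 # wy
                  -Ixs * ys + Iys * xs],    # wz
                 axis=1)
    z = It[ys, xs]
    WH = support[ys, xs][:, None] * H
    return -np.linalg.solve(WH.T @ H, WH.T @ z)   # Eq. (6)
```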
2.2.5. Kinematic Chain as a Product of Exponentials. So far we have parameterized the 3D pose and motion of a body segment by the 6 parameters of a twist $\xi$. Points on this body segment in a canonical object frame are transformed into a camera frame by the mapping $G_0 = e^{\hat{\xi}}$. Assume that a second body segment is attached to the first segment with a joint. The joint can be defined by an axis of rotation in the object frame. We define this rotation axis in the object frame by a 3D unit vector $\omega_1$ along the axis, and a point $q_1$ on the axis (Fig. 1). This is a revolute joint, and can be modeled by a twist (Murray et al., 1994).

[Figure 1. Kinematic chain defined by twists.]
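Although the chain derivation is cut off here, the construction it leads to is the standard product-of-exponentials map of Murray et al. (1994): a revolute joint with unit axis $\omega$ through a point $q$ has twist $\xi = [-\omega \times q, \omega]$, and a segment moves with the base pose composed with one exponential per upstream joint. A minimal sketch under that convention (function names ours, assumed for illustration):

```python
import numpy as np
from scipy.linalg import expm

def twist_hat(xi):
    """4x4 matrix form of a twist (see Eq. (9))."""
    v1, v2, v3, wx, wy, wz = xi
    return np.array([[0.0, -wz,  wy, v1],
                     [ wz, 0.0, -wx, v2],
                     [-wy,  wx, 0.0, v3],
                     [0.0, 0.0, 0.0, 0.0]])

def revolute_twist(omega, q):
    """Twist of a revolute joint with unit axis omega through point q."""
    return np.concatenate([-np.cross(omega, q), omega])

def segment_pose(xi0, joints, thetas):
    """Pose of a chain segment: base pose e^xi0_hat times one exponential
    per upstream revolute joint (omega, q) with angle theta."""
    G = expm(twist_hat(xi0))
    for (omega, q), theta in zip(joints, thetas):
        G = G @ expm(twist_hat(revolute_twist(omega, q)) * theta)
    return G
```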

References

Lucas, B.D. and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679.

Murray, R.M., Li, Z., and Sastry, S.S. 1994. A Mathematical Introduction to Robotic Manipulation. CRC Press.

Wren, C.R., Azarbayejani, A., Darrell, T., and Pentland, A.P. 1997. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785.