International Journal of Computer Vision 56(3), 179–194, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Twist Based Acquisition and Tracking of Animal and Human Kinematics
CHRISTOPH BREGLER*
Computer Science Department, Stanford University, Stanford, CA 94305, USA
chris.bregler@nyu.edu
(*Present address: Computer Science Dept., Courant Institute, Media Research Lab, 719 Broadway, 12th Floor, New York, NY 10003, USA.)
JITENDRA MALIK
Computer Science Department, University of California at Berkeley, Berkeley, CA 94720, USA
malik@cs.berkeley.edu
KATHERINE PULLEN
Physics Department, Stanford University, Stanford, CA 94305, USA
pullen@graphics.stanford.edu
Received December 14, 1999; Revised May 27, 2003; Accepted May 30, 2003
Abstract. This paper demonstrates a new visual motion estimation technique that is able to recover high degree-of-
freedom articulated human body configurations in complex video sequences. We introduce the use and integration of
a mathematical technique, the product of exponential maps and twist motions, into a differential motion estimation.
This results in solving simple linear systems, and enables us to recover robustly the kinematic degrees-of-freedom
in noise and complex self occluded configurations. A new factorization technique lets us also recover the kinematic
chain model itself. We are able to track several human walk cycles, several wallaby hop cycles, and two walk
cycles of the famous movements of Eadweard Muybridge’s motion studies from the last century. To the best of our
knowledge, this is the first computer vision based system that is able to process such challenging footage.
Keywords: human tracking, motion capture, kinematic chains, twists, exponential maps
1. Introduction
The estimation of image motion without any domain
constraints is an underconstrained problem. Therefore
all proposed motion estimation algorithms involve
additional constraints about the assumed motion
structure. One class of motion estimation techniques
is based on parametric algorithms (Bergen et al.,
1992). These techniques rely on solving a highly
overconstrained system of linear equations. For exam-
ple, if an image patch could be modeled as a planar
surface, an affine motion model with low degrees of
freedom (6 DOF) can be estimated. Measurements
over many pixel locations have to comply with this
motion model. Noise in image features and ambiguous
motion patterns can be overcome by measurements
from features at other image locations. If the motion
can be approximated by this simple motion model,
sub-pixel accuracy can be achieved.
Problems occur if the motion of such a patch is not
well described by the assumed motion model. Others
have shown how to extend this approach to multiple
independently moving motion areas (Jepson and Black,
1993; Ayer and Sawhney, 1995; Weiss and Adelson, 1995).
For each area, this approach still has the advantage that
a large number of measurements are incorporated into a low DOF linear motion estimation. Problems occur
if some of the areas do not have a large number of
pixel locations or have mostly noisy or ambiguous mo-
tion measurements. One example is the measurement
of human body motion. Each body segment can be ap-
proximated by one rigid moving object. Unfortunately,
in standard video sequences the areas of such body seg-
ments are very small, the motion of leg and arm seg-
ments is ambiguous in certain directions (for exam-
ple parallel to the boundaries), and deforming clothes
cause noisy measurements.
If we increase the ratio between the number of mea-
surements and the degrees of freedom, the motion
estimation will be more robust. This can be done us-
ing additional constraints. Body segments don’t move
independently; they are attached by body joints. This
reduces the number of free parameters dramatically. A
convenient way of describing these additional domain
constraints is the twist and product of exponential map
formalism for kinematic chains (Murray et al., 1994).
The motion of one body segment can be described as
the motion of the previous segment in a kinematic chain
and an angular motion around a body joint. This adds
just a single DOF for each additional segment in the
chain. In addition, the exponential map formulation
makes it possible to relate the image motion vectors
linearly to the angular velocity.
Others have modeled the human body with rigid seg-
ments connected at joints (Hogg, 1983; Rohr, 1993;
Rehg and Kanade, 1995; Gavrila and Davis, 1995;
Goncalves et al., 1995; Clergue et al., 1995; Ju et al.,
1996; Kakadiaris and Metaxas, 1996), but use differ-
ent representations and features (for example Denavit-
Hartenberg and edge detection). The introduction of
twists and product of exponential maps into region-
based motion estimation simplifies the estimation dra-
matically and leads to robust tracking results. Besides
tracking, we also outline how to fine-tune the kine-
matic model itself. Here the ratio between the number
of measurements and the degrees of freedom is even
larger, because we can optimize over a complete image
sequence.
Alternative solutions to tracking of human bodies
were proposed by Wren et al. (1995) in tracking color
blobs, and by Davis and Bobick (1997) in using motion
templates. Nonrigid models were proposed by Pentland
and Horowitz (1991), Blake et al. (1995), Black and
Yacoob (1995) and Black et al. (1997).
Section 2 introduces the new motion tracking and
kinematic model acquisition framework and its mathe-
matical formulation, Section 3 details our experiments,
and we discuss the results and future directions in
Section 4.
The tracking technique of this paper has been presented in a shorter conference proceedings version (Bregler and Malik, 1998). The new model acquisition technique has not been published previously.
2. Motion Estimation
We first describe a commonly used region-based mo-
tion estimation framework (Bergen et al., 1992;
Shi and Tomasi, 1994), and then describe the ex-
tension to kinematic chain constraints (Murray et al.,
1994).
2.1. Preliminaries
Assuming that changes in image intensity are only due
to translation of local image intensity, a parametric im-
age motion between consecutive time frames t and t +1
can be described by the following equation:
$$I(x + u_x(x, y),\; y + u_y(x, y),\; t + 1) = I(x, y, t) \qquad (1)$$
$I(x, y, t)$ is the image intensity. The motion model $u(x, y) = [u_x(x, y), u_y(x, y)]^T$ describes the pixel displacement dependent on location $(x, y)$ and model parameters $\phi$. For example, a 2D affine motion model with parameters $\phi = [a_1, a_2, a_3, a_4, d_x, d_y]^T$ is defined as
$$u(x, y) = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} d_x \\ d_y \end{pmatrix} \qquad (2)$$
The first-order Taylor series expansion of (1) leads
to the commonly used gradient formulation (Lucas and
Kanade, 1981):
$$I_t(x, y) + [I_x(x, y), I_y(x, y)] \cdot u(x, y) = 0 \qquad (3)$$
$I_t(x, y)$ is the temporal image gradient and $[I_x(x, y), I_y(x, y)]$ is the spatial image gradient at location $(x, y)$. Assuming a motion model of $K$ degrees of freedom (in case of the affine model $K = 6$) and a region of $N > K$ pixels, we can write an over-constrained set of $N$ equations.
For the case that the motion model is linear (as in the affine case), we can write the set of equations in matrix form (see Bergen et al., 1992 for details):

$$H \cdot \phi + \vec{z} = \vec{0} \qquad (4)$$

where $H \in \mathbb{R}^{N \times K}$ and $\vec{z} \in \mathbb{R}^N$. The least squares solution to (4) is:

$$\phi = -(H^T \cdot H)^{-1} \cdot H^T \cdot \vec{z} \qquad (5)$$
Because (4) is the first-order Taylor series lineariza-
tion of (1), we linearize around the new solution and it-
erate. This is done by warping the image I (t +1) using
the motion model parameters φ found by (5). Based
on the re-warped image we compute the new image
gradients (3). Repeating this process is equivalent to a
Newton-Raphson style minimization.
A convenient representation of the shape of an image region is a probability mask $w(x, y) \in [0, 1]$. $w(x, y) = 1$ declares that pixel $(x, y)$ is part of the region. Equation (5) can be modified such that it weights the contribution of pixel location $(x, y)$ according to $w(x, y)$:
$$\phi = -((W \cdot H)^T \cdot H)^{-1} \cdot (W \cdot H)^T \cdot \vec{z} \qquad (6)$$

$W$ is an $N \times N$ diagonal matrix with $W(i, i) = w(x_i, y_i)$. We assume for now that we know the exact
shape of the region. For example, if we want to estimate
the motion parameters for a human body part, we sup-
ply a weight matrix W that defines the image support
map of that specific body part, and run this estimation
technique for several iterations. Section 2.4 describes
how we can estimate the shape of the support maps as
well.
Tracking over multiple frames can be achieved by
applying this optimization technique successively over
the complete image sequence.
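To make Section 2.1 concrete, the following Python sketch performs one iteration of the weighted estimation (3)–(6) for the affine model (2). It is a minimal illustration with our own naming, not the authors' code; a full tracker would wrap it in the warp-and-iterate loop described above, typically with a coarse-to-fine pyramid.

```python
import numpy as np

def estimate_affine_motion(I0, I1, w):
    """One Gauss-Newton step of weighted affine motion estimation,
    following Eqs. (3)-(6). I0, I1: grayscale frames at t and t+1.
    w: support map in [0, 1] weighting each pixel's contribution.
    Returns phi = [a1, a2, a3, a4, dx, dy]."""
    # Spatial gradients for Eq. (3); It is the temporal gradient.
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    ys, xs = np.mgrid[0:I0.shape[0], 0:I0.shape[1]]
    x, y = xs.ravel().astype(float), ys.ravel().astype(float)
    Ixf, Iyf, zvec = Ix.ravel(), Iy.ravel(), It.ravel()
    # Each row of H stacks the affine model (2) into the gradient
    # constraint (3): It + [Ix*x, Ix*y, Iy*x, Iy*y, Ix, Iy] . phi = 0
    H = np.stack([Ixf * x, Ixf * y, Iyf * x, Iyf * y, Ixf, Iyf], axis=1)
    W = w.ravel()
    WH = H * W[:, None]          # W @ H without forming the N x N matrix
    # Weighted least squares, Eq. (6)
    return -np.linalg.solve(WH.T @ H, WH.T @ zvec)
```

In practice the estimate is refined by warping $I(t+1)$ with the current parameters and repeating, which is the Newton-Raphson style iteration described above.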
2.2. Twists and the Product of Exponential Formula
In the following we develop a motion model u(x, y)
for a 3D kinematic chain under scaled orthographic
projection and show how these domain constraints can
be incorporated into one linear system similar to (6). φ
will represent the 3D pose and angle configuration of
such a kinematic chain and can be tracked in the same
fashion as already outlined for simpler motion models.
2.2.1. 3D Pose. The pose of an object relative to the camera frame can be represented as a rigid body transformation in $\mathbb{R}^3$ using homogeneous coordinates (we will use the notation from Murray et al. (1994)):

$$q_c = G \cdot q_o \quad \text{with} \quad G = \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} & d_x \\ r_{2,1} & r_{2,2} & r_{2,3} & d_y \\ r_{3,1} & r_{3,2} & r_{3,3} & d_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (7)$$
$q_o = [x_o, y_o, z_o, 1]^T$ is a point in the object frame and $q_c = [x_c, y_c, z_c, 1]^T$ is the corresponding point in the camera frame. Using scaled orthographic projection with scale $s$, the point $q_c$ in the camera frame gets projected into the image point $[x_{im}, y_{im}]^T = s \cdot [x_c, y_c]^T$.
The 3D translation $[d_x, d_y, d_z]^T$ can be arbitrary, but the rotation matrix

$$R = \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{3,1} & r_{3,2} & r_{3,3} \end{pmatrix} \in SO(3) \qquad (8)$$

has only 3 degrees of freedom. Therefore the rigid body transformation $G \in SE(3)$ has a total of 6 degrees of freedom.
Our goal is to find a model of the image motion
that is parameterized by 6 degrees of freedom for the
3D rigid motion and the scale factor s for scaled ortho-
graphic projection. Euler angles are commonly used to
constrain the rotation matrix to SO(3), but they suffer
from singularities and don’t lead to a simple formula-
tion in the optimization procedure (for example Basu
et al. (1996) propose a 3D ellipsoidal tracker based on
Euler angles). In contrast, the twist representation pro-
vides a more elegant solution (Murray et al., 1994) and
leads to a very simple linear representation of the mo-
tion model. It is based on the observation that every
rigid motion can be represented as a rotation around a
3D axis and a translation along this axis. A twist ξ has
two representations: (a) a 6D vector, or (b) a 4×4 matrix
with the upper 3 × 3 component as a skew-symmetric
matrix:
$$\xi = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \\ \omega_x \\ \omega_y \\ \omega_z \end{pmatrix} \quad \text{or} \quad \hat{\xi} = \begin{pmatrix} 0 & -\omega_z & \omega_y & v_1 \\ \omega_z & 0 & -\omega_x & v_2 \\ -\omega_y & \omega_x & 0 & v_3 \\ 0 & 0 & 0 & 0 \end{pmatrix} \qquad (9)$$

$\omega$ is a 3D unit vector that points in the direction of the rotation axis. The amount of rotation is specified with a scalar angle $\theta$ that is multiplied by the twist: $\xi\theta$. The $v$ component determines the location of the rotation axis and the amount of translation along this axis. It can be shown that for any arbitrary $G \in SE(3)$ there exists a $\xi \in \mathbb{R}^6$ twist representation. See Murray et al. (1994) for more formal properties and a detailed geometric interpretation. It is convenient to drop the $\theta$ coefficient by relaxing the constraint that $\omega$ is unit length. Therefore $\xi \in \mathbb{R}^6$.
A twist can be converted into the $G$ representation with the following exponential map:

$$G = \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} & d_x \\ r_{2,1} & r_{2,2} & r_{2,3} & d_y \\ r_{3,1} & r_{3,2} & r_{3,3} & d_z \\ 0 & 0 & 0 & 1 \end{pmatrix} = e^{\hat{\xi}} = I + \hat{\xi} + \frac{\hat{\xi}^2}{2!} + \frac{\hat{\xi}^3}{3!} + \cdots \qquad (10)$$
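Numerically, Eq. (10) can be evaluated with any matrix exponential routine (Rodrigues' formula gives a closed form). A small sketch with our own helper names, using scipy:

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """Map a 6D twist xi = [v1, v2, v3, wx, wy, wz] to its 4x4
    matrix form of Eq. (9)."""
    v, w = xi[:3], xi[3:]
    return np.array([
        [0.0,  -w[2],  w[1], v[0]],
        [w[2],  0.0,  -w[0], v[1]],
        [-w[1], w[0],  0.0,  v[2]],
        [0.0,   0.0,   0.0,  0.0]])

# Example: rotation about the z-axis through the point (1, 0, 0),
# i.e. xi = [-omega x q, omega] with omega = (0,0,1), q = (1,0,0).
xi = np.array([0.0, -1.0, 0.0, 0.0, 0.0, 1.0])
theta = np.pi / 2
G = expm(hat(xi) * theta)   # rigid body transform of Eq. (10)
# Points on the rotation axis stay fixed: G @ [1, 0, 0, 1] = [1, 0, 0, 1]
```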
2.2.2. Twist Motion Model. At this point we would like to track the 3D pose of a rigid object under scaled orthographic projection. We will extend this formulation in the next section to a kinematic chain representation. The pose of an object is defined as $[s, \xi^T]^T = [s, v_1, v_2, v_3, \omega_x, \omega_y, \omega_z]^T$. A point $q_o$ in the object frame is projected to the image location $[x_{im}, y_{im}]^T$ with:

$$\begin{pmatrix} x_{im} \\ y_{im} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot s \cdot e^{\hat{\xi}} \cdot q_o = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot q_c \qquad (11)$$
$s$ is the scale change of the scaled orthographic projection. The image motion of point $[x_{im}, y_{im}]$ from time $t$ to time $t + 1$ is:

$$\begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} x_{im}(t+1) - x_{im}(t) \\ y_{im}(t+1) - y_{im}(t) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \left( s(t+1) \cdot e^{\hat{\xi}(t+1)} - s(t) \cdot e^{\hat{\xi}(t)} \right) \cdot q_o$$
$$= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \left( (1 + \Delta s) \cdot e^{\Delta\hat{\xi}} - I \right) \cdot s(t) \cdot e^{\hat{\xi}(t)} \cdot q_o = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \left( (1 + \Delta s) \cdot e^{\Delta\hat{\xi}} - I \right) \cdot q_c \qquad (12)$$

with

$$e^{\hat{\xi}(t+1)} = e^{\Delta\hat{\xi}} \cdot e^{\hat{\xi}(t)}, \qquad s(t+1) = s(t) \cdot (1 + \Delta s), \qquad q_c = s(t) \cdot e^{\hat{\xi}(t)} \cdot q_o \qquad (13)$$
Using the first order Taylor expansion from (10) we can approximate:

$$(1 + \Delta s) \cdot e^{\Delta\hat{\xi}} \approx (1 + \Delta s) \cdot I + (1 + \Delta s) \cdot \Delta\hat{\xi} \qquad (14)$$
and, dropping second-order terms, can rewrite (12) as:

$$\begin{pmatrix} u_x \\ u_y \end{pmatrix} = \begin{pmatrix} \Delta s & -\omega_z & \omega_y & v_1 \\ \omega_z & \Delta s & -\omega_x & v_2 \end{pmatrix} \cdot q_c \qquad (15)$$

with $\Delta\xi = [v_1, v_2, v_3, \omega_x, \omega_y, \omega_z]^T$. The parameter vector $\phi = [\Delta s, v_1, v_2, \omega_x, \omega_y, \omega_z]^T$ codes the relative scale and twist motion from time $t$ to $t + 1$. Note that (15) does not include $v_3$. Translation in the $Z$ direction of the camera frame is not measurable under scaled orthographic projection.
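In code, the linearized forward model (15) is a single $2 \times 4$ matrix applied to each camera-frame point; a toy sketch with our own names:

```python
import numpy as np

def predict_flow(phi, qc):
    """Image motion of Eq. (15). phi = [ds, v1, v2, wx, wy, wz];
    qc: homogeneous camera-frame point [x, y, z, 1]."""
    ds, v1, v2, wx, wy, wz = phi
    A = np.array([[ds, -wz,  wy, v1],
                  [wz,  ds, -wx, v2]])
    return A @ qc   # (u_x, u_y)
```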
2.2.3. 3D Geometric Model. Equation (15) describes the image motion of a point $[x_{im}, y_{im}]$ in terms of the motion parameters $\phi$ and the corresponding 3D point $q_c$ in the camera frame. As previously defined in Eq. (7), $q_c$ is a homogeneous vector $[x, y, z, 1]^T$. It is the point that intersects the camera ray of the image point $[x_{im}, y_{im}]$ with the 3D model. The 3D model is given by the user (for example a cylinder, superquadric, or polygonal model) or is estimated by an initialization procedure that we will describe below. The pose of the 3D model is defined by $G(t) = s(t) \cdot e^{\hat{\xi}(t)}$. We assume $G(t)$ is the correct pose estimate for image frame $I(x, y, t)$ (the estimation result of this algorithm over the previous time frame). Since we assume scaled orthographic projection (11), $[x_{im}, y_{im}] = [x, y]$. We only need to determine $z$. In this paper we approximate the body segments by ellipsoidal 3D blobs. The 3D blobs are defined in the object frame. The following quadratic equation is the implicit function for the ellipsoidal surface with length $1/a_x$, $1/a_y$, $1/a_z$ along the $x$, $y$, $z$ axes and centered around $M = [m_x, m_y, m_z, 1]^T$:

$$(q_o - M)^T \cdot \begin{pmatrix} a_x^2 & 0 & 0 & 0 \\ 0 & a_y^2 & 0 & 0 \\ 0 & 0 & a_z^2 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \cdot (q_o - M) = 1 \qquad (16)$$
Since $q_o = G^{-1} \cdot q_c = G^{-1} \cdot [x_{im}, y_{im}, z, 1]^T$ we can write the implicit function in the camera frame as:

$$\left( G^{-1} \begin{pmatrix} x_{im} \\ y_{im} \\ z \\ 1 \end{pmatrix} - M \right)^T \cdot \begin{pmatrix} a_x^2 & 0 & 0 & 0 \\ 0 & a_y^2 & 0 & 0 \\ 0 & 0 & a_z^2 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \cdot \left( G^{-1} \begin{pmatrix} x_{im} \\ y_{im} \\ z \\ 1 \end{pmatrix} - M \right) = 1 \qquad (17)$$
Therefore $z$ is the solution of the quadratic Eq. (17). For image points that are inside the blob it has two (closed-form) solutions. We pick the smaller solution (the $z$ value that is closer to the camera). Using (17) we can calculate $q_c$ for all points inside the blob. For points outside the blob it has no solution. Those points will not be part of the estimation setup.

For more complex 3D shape models, the $z$ calculation can be replaced by standard graphics ray-casting algorithms. We have not implemented this generalization yet.
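Solving (17) for $z$ reduces to a scalar quadratic, since $z$ enters $G^{-1} q_c$ linearly. The sketch below (our own illustrative code, not the paper's) writes $G^{-1}[x_{im}, y_{im}, z, 1]^T - M = p + z\,d$ and solves the resulting quadratic in $z$:

```python
import numpy as np

def blob_depth(x_im, y_im, G, M, a):
    """Solve Eq. (17) for z at image point (x_im, y_im).
    G: 4x4 pose matrix, M: homogeneous blob center [mx, my, mz, 1],
    a = [ax, ay, az]. Returns the smaller root (closer to the camera)
    or None if the ray misses the blob."""
    A = np.diag([a[0]**2, a[1]**2, a[2]**2, 0.0])
    Ginv = np.linalg.inv(G)
    # G^{-1} [x, y, z, 1]^T - M is linear in z: p + z * d
    p = Ginv @ np.array([x_im, y_im, 0.0, 1.0]) - M
    d = Ginv[:, 2]
    # Quadratic (d^T A d) z^2 + 2 (p^T A d) z + (p^T A p - 1) = 0
    qa, qb, qc = d @ A @ d, 2.0 * (p @ A @ d), p @ A @ p - 1.0
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0:                  # no real solution: point outside the blob
        return None
    return (-qb - np.sqrt(disc)) / (2.0 * qa)
```

Because the last diagonal entry of the quadric matrix is zero, the homogeneous components of $p$ and $d$ drop out of the quadratic.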
2.2.4. Combining 3D Motion and Geometric Model. Inserting (15) into (3) leads to the following equation for each point $[x_i, y_i]$ inside the blob:

$$I_t + I_x \cdot [\Delta s, -\omega_z, \omega_y, v_1] \cdot q_c + I_y \cdot [\omega_z, \Delta s, -\omega_x, v_2] \cdot q_c = 0$$
$$I_t(i) + H_i \cdot [\Delta s, v_1, v_2, \omega_x, \omega_y, \omega_z]^T = 0 \qquad (18)$$

with

$$H_i = [I_x \cdot x_i + I_y \cdot y_i,\; I_x,\; I_y,\; -I_y \cdot z_i,\; I_x \cdot z_i,\; -I_x \cdot y_i + I_y \cdot x_i] \in \mathbb{R}^{1 \times 6}$$

and $I_t := I_t(x_i, y_i)$, $I_x := I_x(x_i, y_i)$, $I_y := I_y(x_i, y_i)$.
For $N$ pixel positions we have $N$ equations of the form (18). This can be written in matrix form:

$$H \cdot \phi + \vec{z} = \vec{0} \qquad (19)$$

with

$$H = \begin{pmatrix} H_1 \\ H_2 \\ \vdots \\ H_N \end{pmatrix} \quad \text{and} \quad \vec{z} = \begin{pmatrix} I_t(x_1, y_1) \\ I_t(x_2, y_2) \\ \vdots \\ I_t(x_N, y_N) \end{pmatrix}$$

Finding the least-squares solution (3D twist motion $\phi$) for this equation is done using (6).
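Under the assumption that $q_c = [x_i, y_i, z_i, 1]^T$ has been computed for every supporting pixel (Section 2.2.3), the rows $H_i$ of (18) and the weighted solve of (6) take only a few lines. A sketch with our own naming, not the paper's code:

```python
import numpy as np

def estimate_twist_motion(Ix, Iy, It, qc, w):
    """Solve Eqs. (18)-(19) for phi = [ds, v1, v2, wx, wy, wz].
    Ix, Iy, It: per-pixel gradients (length N); qc: N x 4 array of
    camera-frame points [x, y, z, 1]; w: per-pixel support weights."""
    x, y, z = qc[:, 0], qc[:, 1], qc[:, 2]
    # One row H_i per pixel, Eq. (18)
    H = np.stack([Ix * x + Iy * y,
                  Ix,
                  Iy,
                  -Iy * z,
                  Ix * z,
                  -Ix * y + Iy * x], axis=1)
    WH = H * w[:, None]
    # Weighted least squares as in Eq. (6)
    return -np.linalg.solve(WH.T @ H, WH.T @ It)
```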
2.2.5. Kinematic Chain as a Product of Exponentials. So far we have parameterized the 3D pose and motion of a body segment by the 6 parameters of a twist $\xi$. Points on this body segment in a canonical object frame are transformed into a camera frame by the mapping $G_0 = e^{\hat{\xi}}$. Assume that a second body segment is attached to the first segment with a joint. The joint can be defined by an axis of rotation in the object frame. We define this rotation axis in the object frame by a 3D unit vector $\omega_1$ along the axis, and a point $q_1$ on the axis (Fig. 1). This is a revolute joint, and can be modeled by a twist (Murray et al., 1994):

Figure 1. Kinematic chain defined by twists.
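The excerpt ends here, before the equation this sentence introduces. For orientation, the standard construction from Murray et al. (1994) builds the revolute-joint twist from the axis direction $\omega_1$ and a point $q_1$ on the axis as $\xi_1 = [-\omega_1 \times q_1,\; \omega_1]^T$; a small sketch (our own code):

```python
import numpy as np

def revolute_twist(omega, q):
    """Twist of a revolute joint with unit axis direction omega
    through point q (Murray et al., 1994): xi = [-omega x q, omega]."""
    omega = np.asarray(omega, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.concatenate([-np.cross(omega, q), omega])

# Example: a knee-like joint rotating about the x-axis through (0, 0, 0.5)
xi1 = revolute_twist([1.0, 0.0, 0.0], [0.0, 0.0, 0.5])
```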


References

Lucas, B.D. and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679.

Murray, R.M., Li, Z., and Sastry, S.S. 1994. A Mathematical Introduction to Robotic Manipulation. CRC Press.

Wren, C.R., Azarbayejani, A., Darrell, T., and Pentland, A.P. 1997. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785.