Regularized Bundle-Adjustment to Model Heads
from Image Sequences without Calibration Data
P. Fua
Computer Graphics Lab (LIG)
Swiss Federal Institute of Technology (EPFL)
CH-1015 Lausanne
Switzerland
Pascal.Fua@epfl.ch
International Journal of Computer Vision, 38(2), July 2000
Abstract
We address the structure-from-motion problem in the context of head modeling from video sequences
for which calibration data is not available. This task is made challenging by the fact that correspondences
are difficult to establish due to lack of texture and that a quasi-Euclidean representation is required for
realism.
We have developed an approach based on regularized bundle-adjustment. It takes advantage of our
rough knowledge of the head’s shape, in the form of a generic face model. It allows us to recover
relative head-motion and epipolar geometry accurately and consistently enough to exploit a previously-
developed stereo-based approach to head modeling. In this way, complete and realistic head models can
be acquired with a cheap and entirely passive sensor, such as an ordinary video camera, with minimal
manual intervention.
We chose to demonstrate and evaluate our technique mainly in the context of head-modeling. We do
so because it is the application for which all the tools required to perform the complete reconstruction
are available to us. We will, however, argue that the approach is generic and could be applied to other
tasks, such as body modeling, for which generic facetized models exist.
1 Introduction
In earlier work, we have proposed an approach to fitting complex head animation models, including ears
and hair, to registered stereo pairs and triplets. Here, we extend this approach so that it can take advantage
of image sequences taken with a single camera, without requiring calibration data.
Our challenge, here, is to solve the structure from motion problem in a case where
- Correspondences are hard to establish and can be expected to be neither precise nor reliable due to lack of texture.
- A Euclidean or quasi-Euclidean [Beardsley et al., 1997] reconstruction is required for realism.
- The motion is far from being optimal for most of the auto-calibration techniques that have been developed in recent years.
To overcome these difficulties, we have developed an approach based on bundle-adjustment that takes
advantage of our rough knowledge of the face’s shape, in the form of a generic face model, to introduce
regularization constraints. This has allowed us to robustly estimate the relative head motion. The resulting
image registration is accurate enough to use a simple correlation-based stereo algorithm to derive 3–D
information from the data. We can then fit a 3–D facial animation mask [Kalra et al., 1992] using our earlier work [Fua and Miccio, 1998, Fua and Miccio, 1999].
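Schematically, such a regularized bundle-adjustment minimizes a reprojection term plus a model-based penalty. The precise objective is given later in the paper; the formula below is only a sketch, and its symbols ($x_{ij}$, $\pi$, $\mathcal{M}$, $\lambda$) are introduced here purely for illustration:
\[
E\bigl(\{S_i\},\{C_j\}\bigr) \;=\; \sum_{i,j} \bigl\| x_{ij} - \pi(C_j, S_i) \bigr\|^2 \;+\; \lambda \sum_i d\bigl(S_i,\mathcal{M}\bigr)^2 ,
\]
where $x_{ij}$ is the measured image position of point $i$ in image $j$, $\pi(C_j, S_i)$ its projection given the camera parameters $C_j$, $d(S_i,\mathcal{M})$ a distance from the reconstructed point $S_i$ to the generic face model $\mathcal{M}$, and $\lambda$ a regularization weight.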
We chose to demonstrate and evaluate our technique mainly in the context of head-modeling because
it is the application for which we have all the tools required to perform the complete reconstruction task.
However, the difficulties discussed above are not specific to head modeling and are pervasive. In that sense
the solution we propose is generic: It is applicable to any modeling problem for which a rough shape model
is available.
Our contribution is a robust algorithm that takes advantage of our generic knowledge of the shape of the
object to be reconstructed, in this work a head, to effectively recover both motion and shape even though
the images typically exhibit little texture and are therefore hard to match. Furthermore, this technique has
been fully integrated into a complete approach that goes from images to high-quality models with very
little manual intervention. Thus, we can create realistic and sophisticated animation models using a cheap,
entirely passive and readily available sensor.
As more and more people have video cameras attached to their computers, our approach will make it possible
to quickly produce clones for video-conferencing purposes. It will also allow the exploitation of ordinary
movies to reconstruct the faces of actors or famous people who cannot easily be scanned using active tech-
niques, for example because they are unavailable or long dead.
In the remainder of this paper, we first describe related approaches to relative-motion recovery and head
modeling. We then introduce our own approach to registration and demonstrate its robustness using real
video sequences. Next, we show reconstructions obtained by using these motion estimates to register the
images; deriving 3–D information by treating consecutive images as stereo pairs; and, in the end, fitting the
animation mask to the 3–D data. Finally, we use synthetic and Monte Carlo simulations to show that the
assumptions we make in this paper can be expected to hold for typical camera configurations. Our earlier
fitting procedure [Fua and Miccio, 1999] is described briefly in the appendix.
2 Related Work
2.1 Bundle-Adjustment and Autocalibration
Bundle-adjustment is a well-established technique in the photogrammetric community [Gruen and Beyer, 1992]. However, it is typically used in contexts, such as mapping or close-range photogrammetry, where reliable
and precise correspondences can be established. Also, because it involves nonlinear optimization, it requires
good initialization for proper convergence.
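For concreteness, a standard bundle-adjustment of this kind can be viewed as a nonlinear least-squares problem over the camera parameters and the 3–D point positions. The sketch below is only illustrative and is not the paper's implementation: the function names, the rotation-vector parameterization, the single shared focal length, the use of scipy, and the assumption that every point is visible in every image are all ours.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points, rvecs, tvecs, focal):
    """Project the 3-D points into every camera; returns an (n_cams, n_pts, 2) array."""
    projections = []
    for rvec, tvec in zip(rvecs, tvecs):
        R = Rotation.from_rotvec(rvec).as_matrix()
        cam = points @ R.T + tvec                             # world frame -> camera frame
        projections.append(focal * cam[:, :2] / cam[:, 2:])   # perspective division
    return np.stack(projections)

def residuals(params, n_cams, n_pts, observations, focal):
    """Stacked reprojection errors; 'params' packs all rotations, translations and points."""
    rvecs = params[:3 * n_cams].reshape(n_cams, 3)
    tvecs = params[3 * n_cams:6 * n_cams].reshape(n_cams, 3)
    points = params[6 * n_cams:].reshape(n_pts, 3)
    return (project(points, rvecs, tvecs, focal) - observations).ravel()

# observations: (n_cams, n_pts, 2) measured image points; x0: rough initial estimate.
# refined = least_squares(residuals, x0, args=(n_cams, n_pts, observations, focal))

Because the problem is nonlinear, the quality of x0 matters, which is precisely the initialization issue mentioned above.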
Lately, it has been increasingly used in the computer vision community to refine the output of auto-
calibration techniques. There again, however, most results have been demonstrated in man-made environ-
ments where feature points can be reliably extracted and matched across images. One cannot assume that
those results carry over directly in the case of ill-textured objects and low quality correspondences.

These auto-calibration techniques have been the object of a tremendous amount of work [Faugeras et al., 1992, Hartley et al., 1992, Luong and Viéville, 1996, Triggs, 1997, Pollefeys et al., 1998] and effective methods to derive the epipolar geometry and the trifocal tensor from point correspondences have been devised [Zhang et al., 1995, Fitzgibbon and Zisserman, 1998]. However, most of these methods assume that it is possible to run an interest operator, such as a corner detector [Pollefeys et al., 1998, Fitzgibbon and Zisserman, 1998], to extract from one of the images a sufficiently large number of points that can then be
reliably matched in the other images. However, when using images such as the ones shown in Figure 1, we
cannot depend on such interest points because faces exhibit too little texture. We must expect that whatever
points we extract can only be matched with relatively little precision and a high probability of error.
Figure 1: Input video sequences. (a,b) Five out of nine consecutive images from short video sequences of two different people. The images are of size 376 × 258 and 488 × 208, respectively.
Autocalibration algorithms tend to be sensitive to such errors, as illustrated by Figure 2. We treated
three consecutive images in the video sequence of Figure 1(b) as two independent stereo pairs that share
the central image and ran Zhang’s image matcher [Zhang et al., 1995] independently on both image pairs.
The resulting epipolar geometry is depicted by Figure 2(b,d). These images were acquired in our lab with a
relatively long focal length and the head motion was close to being horizontal. Consequently, the epipolar
lines should also be almost horizontal and the epipoles should be very far away. The epipolar geometry of
Figure 2(b,d) is, therefore, clearly wrong. Of course, we want to stress that this example is not meant to belittle in any way the quality of Zhang’s algorithm, which has been acknowledged as one of the best of its kind; it also has the great merit of being freely available on the web, and we are prepared to make ours similarly available for testing and comparison purposes. Visual inspection of the correspondences shows very few mismatches. But, as for most algorithms in this class, even relatively minor matching errors can create major problems.
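The sanity check used above, estimating the fundamental matrix from the matches and then inspecting the epipoles and the orientation of the epipolar lines, can be sketched as follows. This is our illustration using OpenCV and is not part of the paper’s pipeline; the function name and the RANSAC thresholds are placeholders.

import numpy as np
import cv2

def check_epipolar_geometry(pts1, pts2):
    """pts1, pts2: (N, 2) arrays of corresponding image points, N >= 8."""
    pts1 = np.float32(pts1)
    pts2 = np.float32(pts2)
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    # The epipole in the first image is the right null vector of F (F e = 0);
    # a near-zero homogeneous coordinate means the epipole is (almost) at infinity.
    _, _, Vt = np.linalg.svd(F)
    ex, ey, ew = Vt[-1]
    # Epipolar lines in the second image induced by the points of the first one.
    lines = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
    slopes = -lines[:, 0] / (lines[:, 1] + 1e-12)   # a line is a*x + b*y + c = 0
    print("epipole of image 1 (homogeneous):", (ex, ey, ew))
    print("median |slope| of the epipolar lines:", np.median(np.abs(slopes)))
    return F

For the long-focal-length, near-horizontal motion described above, one would expect the median slope to be small and the homogeneous coordinate of the epipole to be close to zero.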
To some extent, this problem can be alleviated by using more than two images at a time [Beardsley et al., 1997]. However, in our case, this approach can only be of limited use because typical short sequences
of moving faces, such as the ones shown in Figure 1, often fail to exhibit rotational motion about two truly
independent axes. As a result, the corresponding camera geometries are close to being degenerate for these
methods [Sturm, 1997, Zisserman et al., 1998].

Figure 2: Computing the epipolar geometry without and with a model. (a,b) Running Zhang’s algorithm [Zhang et al., 1995] on two consecutive images of the video sequence of Figure 1(b). The matches, shown as numbered crosses, are mostly correct. However, the epipolar geometry, depicted by the solid lines, is not. (c,d) The output of an independent run of Zhang’s system on a different image pair. (e,f,g) The epipolar geometry recovered by the algorithm described in this paper. The lines in (e,g) are the epipolar lines that correspond to the crosses in (f).
In short, while the structure-from-motion problem is well understood from a theoretical point of view,
model-free techniques are too sensitive to noise to be directly applicable both to our specific problem and to
all modeling tasks that involve the difficulties described in the introduction.
In the case of head tracking, a generic 2–D face model can be used to roughly estimate pose from appearance [Lanitis et al., 1995]. However, for 3–D reconstruction purposes and more precise estimation, using a 3–D model is, in general, more effective. Jebara and Pentland [1997] introduce shape constraints based on allowable deformation modes derived from a collection of Cyberware™ scans of real heads. When such a database is available, this certainly is an effective approach. However, to make it fully general, one would require a large number of instances of the target object, making such constraints difficult to derive in practice.
By contrast, in this work, we will show that a simple, and easily obtainable, facetized model can be used
to derive effective shape constraints. These are key to a practical solution of our reconstruction problem. As
shown in Figure 2(e,f,g), by using these constraints, we can recover a consistent epipolar geometry. This is
crucial for us because we treat consecutive images in a sequence as stereo pairs that provide the 3-D data
required to compute the results of Section 4.
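One simple way to picture how a facetized model yields such shape constraints is sketched below: each reconstructed 3–D point is associated with a nearby facet of the generic model, and its distance to that facet’s plane enters the objective as an extra residual. The nearest-centroid association and the function names are our simplifications, not necessarily the paper’s exact attachment scheme.

import numpy as np
from scipy.spatial import cKDTree

def facet_planes(vertices, faces):
    """Return the centroids and unit normals of the model's triangular facets."""
    tri = vertices[faces]                                    # (n_faces, 3, 3)
    centroids = tri.mean(axis=1)
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    return centroids, normals

def model_distance_residuals(points, vertices, faces):
    """Signed point-to-plane distances to the (approximately) nearest facet."""
    centroids, normals = facet_planes(vertices, faces)
    _, idx = cKDTree(centroids).query(points)                # nearest facet by centroid
    return np.einsum('ij,ij->i', points - centroids[idx], normals[idx])

Suitably weighted, these residuals would simply be appended to the reprojection errors of the least-squares problem sketched in Section 2.1.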

2.2 Head Modeling
In recent years much work has been devoted to modeling faces from image and range data. There are
many effective approaches to recovering face geometry. They rely on stereo [Devernay and Faugeras, 1994, Fua and Leclerc, 1995], shading [Leclerc and Bobick, 1991, Samaras and Metaxas, 1998], structured light [Proesmans et al., 1996], silhouettes [Tang and Huang, 1996] or low-intensity lasers. However, if
the goal is to fit a full animation model to the data, recovering the head as a simple triangulated mesh does
not suffice. To be suitable for animation, such a model must have a large number of degrees of freedom.
Some approaches use very clean data—the kind produced by a laser scanner or structured light—to instantiate them [Lee et al., 1995]. Among approaches that rely on image data alone, many require extensive manual intervention, such as supplying silhouettes in orthogonal images [Lee and Thalmann, 1998] or point correspondences in multiple images [Pighin et al., 1998].
Successful approaches to automating the fitting process have involved the use of optical flow [DeCarlo and Metaxas, 1998] or appearance-based techniques [Kang, 1997] to overcome the fact that faces have little
texture and that, as a result, automatically and reliably establishing correspondences is difficult. This latter
technique is closely related to ours because head shape and camera motion are recovered simultaneously.
However, the optical flow approach avoids the “correspondence problem” at the cost of making assumptions
about constant illumination of the face that may be violated as the head moves. This tends to limit the range
of images that can be used, especially if the lighting is not diffuse.
More recently, another extremely impressive appearance-based approach that uses a sophisticated sta-
tistical head model has been proposed [Blanz and Vetter, 1999]. This model has been learned from a large database of human heads and its parameters can be adjusted so that it can synthesize images that closely resemble the input image or images. While the results are outstanding even when only one image is used,
the recovered shape cannot be guaranteed to be correct unless more than one is used. Because the model is
Euclidean, initial camera parameters must be supplied when dealing with uncalibrated imagery. Therefore,
the technique proposed here could be used to initialize the Blanz & Vetter system in an automated fashion.
In other words, if we had had their model, we could have used it to develop the technique described here.
However, for practical reasons, it was not available. Instead, we used the model described below.
2.3 Face Model
In this work, we use the facial animation model that has been developed at the University of Geneva and EPFL [Kalra et al., 1992]. It can produce the different facial expressions arising from speech and emotions. Its multilevel configuration reduces complexity and provides independent control for each level. At
the lowest level, a deformation controller simulates muscle actions using rational free form deformations.
At a higher level, the controller produces animations corresponding to abstract entities such as speech and
emotions.
The corresponding skin surface is shown in its rest position in Figure 3(a,b). We will refer to it as the
surface triangulation. Our goal is to deform the surface without changing its topology. This is important
because the facial animation software depends on the model’s topology and its configuration files must be
recomputed every time it is changed, which is hard to do on an automated basis.
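As a minimal illustration of this constraint, the surface triangulation can be thought of as a fixed face list plus a mutable vertex array: fitting only ever moves vertices, so the connectivity that the animation software relies on never changes. The class below is an illustrative data structure of ours, not the actual animation software.

import numpy as np

class SurfaceTriangulation:
    """Triangulated surface with fixed connectivity and movable vertices."""

    def __init__(self, vertices, faces):
        self.vertices = np.asarray(vertices, dtype=float)   # (n_vertices, 3): free to move
        self.faces = np.asarray(faces, dtype=int)           # (n_faces, 3): never edited

    def deform(self, displacements):
        """Apply per-vertex displacements; the face list, i.e. the topology, is untouched."""
        displacements = np.asarray(displacements, dtype=float)
        assert displacements.shape == self.vertices.shape
        self.vertices = self.vertices + displacements
        return self

Any fitting or regularization step then only updates the vertex positions, so the configuration files that depend on the face list remain valid.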
