Regularized Bundle-Adjustment to Model Heads
from Image Sequences without Calibration Data
P. Fua
Computer Graphics Lab (LIG)
Swiss Federal Institute of Technology (EPFL)
CH-1015 Lausanne
Switzerland
Pascal.Fua@epfl.ch
International Journal of Computer Vision, 38(2), July 2000
Abstract
We address the structure-from-motion problem in the context of head modeling from video sequences
for which calibration data is not available. This task is made challenging by the fact that correspondences
are difficult to establish due to lack of texture and that a quasi-Euclidean representation is required for
realism.
We have developed an approach based on regularized bundle-adjustment. It takes advantage of our
rough knowledge of the head’s shape, in the form of a generic face model. It allows us to recover
relative head-motion and epipolar geometry accurately and consistently enough to exploit a previously-
developed stereo-based approach to head modeling. In this way, complete and realistic head models can
be acquired with a cheap and entirely passive sensor, such as an ordinary video camera, with minimal
manual intervention.
We chose to demonstrate and evaluate our technique mainly in the context of head-modeling. We do
so because it is the application for which all the tools required to perform the complete reconstruction
are available to us. We will, however, argue that the approach is generic and could be applied to other
tasks, such as body modeling, for which generic facetized models exist.
1 Introduction
In earlier work, we have proposed an approach to fitting complex head animation models, including ears
and hair, to registered stereo pairs and triplets. Here, we extend this approach so that it can take advantage
of image sequences taken with a single camera, without requiring calibration data.
Our challenge, here, is to solve the structure from motion problem in a case where
- Correspondences are hard to establish and can be expected to be neither precise nor reliable due to lack of texture.
- A Euclidean or quasi-Euclidean [Beardsley et al., 1997] reconstruction is required for realism.
- The motion is far from being optimal for most of the auto-calibration techniques that have been developed in recent years.
To overcome these difficulties, we have developed an approach based on bundle-adjustment that takes
advantage of our rough knowledge of the face’s shape, in the form of a generic face model, to introduce
regularization constraints. This has allowed us to robustly estimate the relative head motion. The resulting
image registration is accurate enough to use a simple correlation-based stereo algorithm to derive 3–D
information from the data. We can then fit a 3–D facial animation mask [Kalra et al., 1992] using our earlier work [Fua and Miccio, 1998, Fua and Miccio, 1999].
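Schematically, such a regularized bundle-adjustment minimizes a reprojection term plus a model-based penalty. The precise objective is given later in the paper; the formula below is only a sketch, and its symbols ($x_{ij}$, $\pi$, $\mathcal{M}$, $\lambda$) are introduced here purely for illustration:
\[
E\bigl(\{S_i\},\{C_j\}\bigr) \;=\; \sum_{i,j} \bigl\| x_{ij} - \pi(C_j, S_i) \bigr\|^2 \;+\; \lambda \sum_i d\bigl(S_i,\mathcal{M}\bigr)^2 ,
\]
where $x_{ij}$ is the measured image position of point $i$ in image $j$, $\pi(C_j, S_i)$ its projection given the camera parameters $C_j$, $d(S_i,\mathcal{M})$ a distance from the reconstructed point $S_i$ to the generic face model $\mathcal{M}$, and $\lambda$ a regularization weight.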
We chose to demonstrate and evaluate our technique mainly in the context of head-modeling because
it is the application for which we have all the tools required to perform the complete reconstruction task.
However, the difficulties discussed above are not specific to head modeling and are pervasive. In that sense
the solution we propose is generic: It is applicable to any modeling problem for which a rough shape model
is available.
Our contribution is a robust algorithm that takes advantage of our generic knowledge of the shape of the
object to be reconstructed, in this work a head, to effectively recover both motion and shape even though
the images typically exhibit little texture and are therefore hard to match. Furthermore, this technique has
been fully integrated into a complete approach that goes from images to high-quality models with very
little manual intervention. Thus, we can create realistic and sophisticated animation models using a cheap,
entirely passive and readily available sensor.
As more and more people have video cameras attached to their computers, our approach will make it possible
to quickly produce clones for video-conferencing purposes. It will also allow the exploitation of ordinary
movies to reconstruct the faces of actors or famous people who cannot easily be scanned using active tech-
niques, for example because they are unavailable or long dead.
In the remainder of this paper, we first describe related approaches to relative-motion recovery and head
modeling. We then introduce our own approach to registration and demonstrate its robustness using real
video sequences. Next, we show reconstructions obtained by using these motion estimates to register the
images; deriving 3–D information by treating consecutive images as stereo pairs; and, in the end, fitting the
animation mask to the 3–D data. Finally, we use synthetic and Monte Carlo simulations to show that the
assumptions we make in this paper can be expected to hold for typical camera configurations. Our earlier
fitting procedure [Fua and Miccio, 1999] is described briefly in the appendix.
2 Related Work
2.1 Bundle-Adjustment and Autocalibration
Bundle-adjustment is a well-established technique in the photogrammetric community [Gruen and Beyer, 1992]. However, it is typically used in contexts, such as mapping or close-range photogrammetry, where reliable
and precise correspondences can be established. Also, because it involves nonlinear optimization, it requires
good initialization for proper convergence.
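For concreteness, a standard bundle-adjustment of this kind can be viewed as a nonlinear least-squares problem over the camera parameters and the 3–D point positions. The sketch below is only illustrative and is not the paper's implementation: the function names, the rotation-vector parameterization, the single shared focal length, the use of scipy, and the assumption that every point is visible in every image are all ours.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points, rvecs, tvecs, focal):
    """Project the 3-D points into every camera; returns an (n_cams, n_pts, 2) array."""
    projections = []
    for rvec, tvec in zip(rvecs, tvecs):
        R = Rotation.from_rotvec(rvec).as_matrix()
        cam = points @ R.T + tvec                             # world frame -> camera frame
        projections.append(focal * cam[:, :2] / cam[:, 2:])   # perspective division
    return np.stack(projections)

def residuals(params, n_cams, n_pts, observations, focal):
    """Stacked reprojection errors; 'params' packs all rotations, translations and points."""
    rvecs = params[:3 * n_cams].reshape(n_cams, 3)
    tvecs = params[3 * n_cams:6 * n_cams].reshape(n_cams, 3)
    points = params[6 * n_cams:].reshape(n_pts, 3)
    return (project(points, rvecs, tvecs, focal) - observations).ravel()

# observations: (n_cams, n_pts, 2) measured image points; x0: rough initial estimate.
# refined = least_squares(residuals, x0, args=(n_cams, n_pts, observations, focal))

Because the problem is nonlinear, the quality of x0 matters, which is precisely the initialization issue mentioned above.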
Lately, it has been increasingly used in the computer vision community to refine the output of auto-
calibration techniques. There again, however, most results have been demonstrated in man-made environ-
ments where feature points can be reliably extracted and matched across images. One cannot assume that
those results carry over directly in the case of ill-textured objects and low quality correspondences.

These auto-calibration techniques have been the object of a tremendous amount of work [Faugeras et al., 1992, Hartley et al., 1992, Luong and Viéville, 1996, Triggs, 1997, Pollefeys et al., 1998] and effective methods to derive the epipolar geometry and the trifocal tensor from point correspondences have been devised [Zhang et al., 1995, Fitzgibbon and Zisserman, 1998]. However, most of these methods assume that it is possible to run an interest operator, such as a corner detector [Pollefeys et al., 1998, Fitzgibbon and Zisserman, 1998], to extract from one of the images a sufficiently large number of points that can then be
reliably matched in the other images. However, when using images such as the ones shown in Figure 1, we
cannot depend on such interest points because faces exhibit too little texture. We must expect that whatever
points we extract can only be matched with relatively little precision and a high probability of error.
Figure 1: Input video sequences. (a,b) Five out of nine consecutive images from short video sequences of two different people. The images are of size 376 × 258 and 488 × 208, respectively.
Autocalibration algorithms tend to be sensitive to such errors, as illustrated by Figure 2. We treated
three consecutive images in the video sequence of Figure 1(b) as two independent stereo pairs that share
the central image and ran Zhang’s image matcher [Zhang et al., 1995] independently on both image pairs.
The resulting epipolar geometry is depicted by Figure 2(b,d). These images were acquired in our lab with a
relatively long focal length and the head motion was close to being horizontal. Consequently, the epipolar
lines should also be almost horizontal and the epipoles should be very far away. The epipolar geometry of
Figure 2(b,d) is, therefore, clearly wrong. Of course, we want to stress that this example is not meant to belittle in any way the quality of Zhang’s algorithm, which has been acknowledged as one of the best of its kind; it also has the great merit of being freely available on the web, and we are prepared to make ours similarly available for testing and comparison purposes. Visual inspection of the correspondences shows very few mismatches. But, as for most algorithms in this class, even relatively minor matching errors can create major problems.
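The sanity check used above, estimating the fundamental matrix from the matches and then inspecting the epipoles and the orientation of the epipolar lines, can be sketched as follows. This is our illustration using OpenCV and is not part of the paper’s pipeline; the function name and the RANSAC thresholds are placeholders.

import numpy as np
import cv2

def check_epipolar_geometry(pts1, pts2):
    """pts1, pts2: (N, 2) arrays of corresponding image points, N >= 8."""
    pts1 = np.float32(pts1)
    pts2 = np.float32(pts2)
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    # The epipole in the first image is the right null vector of F (F e = 0);
    # a near-zero homogeneous coordinate means the epipole is (almost) at infinity.
    _, _, Vt = np.linalg.svd(F)
    ex, ey, ew = Vt[-1]
    # Epipolar lines in the second image induced by the points of the first one.
    lines = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
    slopes = -lines[:, 0] / (lines[:, 1] + 1e-12)   # a line is a*x + b*y + c = 0
    print("epipole of image 1 (homogeneous):", (ex, ey, ew))
    print("median |slope| of the epipolar lines:", np.median(np.abs(slopes)))
    return F

For the long-focal-length, near-horizontal motion described above, one would expect the median slope to be small and the homogeneous coordinate of the epipole to be close to zero.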
To some extent, this problem can be alleviated by using more than two images at a time [Beardsley et al., 1997]. However, in our case, this approach can only be of limited use because typical short sequences
of moving faces, such as the ones shown in Figure 1, often fail to exhibit rotational motion about two truly
independent axes. As a result, the corresponding camera geometries are close to being degenerate for these
methods [Sturm, 1997, Zisserman et al., 1998].

Figure 2: Computing the epipolar geometry without and with a model. (a,b) Running Zhang’s algorithm [Zhang et al., 1995] on two consecutive images of the video sequence of Figure 1(b). The matches, shown as numbered crosses, are mostly correct. However, the epipolar geometry, depicted by the solid lines, is not. (c,d) The output of an independent run of Zhang’s system on a different image pair. (e,f,g) The epipolar geometry recovered by the algorithm described in this paper. The lines in (e,g) are the epipolar lines that correspond to the crosses in (f).
In short, while the structure-from-motion problem is well understood from a theoretical point of view,
model-free techniques are too sensitive to noise to be directly applicable both to our specific problem and to
all modeling tasks that involve the difficulties described in the introduction.
In the case of head tracking, a generic 2–D face model can be used to roughly estimate pose from appearance [Lanitis et al., 1995]. However, for 3–D reconstruction purposes and more precise estimation, using a 3–D model is, in general, more effective. Jebara and Pentland [1997] introduce shape constraints based on allowable deformation modes derived from a collection of Cyberware™ scans of real heads. When such a database is available, this certainly is an effective approach. However, to make it fully general, one would require a large number of instances of the target object, making such constraints difficult to derive in practice.
By contrast, in this work, we will show that a simple, and easily obtainable, facetized model can be used
to derive effective shape constraints. These are key to a practical solution of our reconstruction problem. As
shown in Figure 2(e,f,g), by using these constraints, we can recover a consistent epipolar geometry. This is
crucial for us because we treat consecutive images in a sequence as stereo pairs that provide the 3-D data
required to compute the results of Section 4.
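One simple way to picture how a facetized model yields such shape constraints is sketched below: each reconstructed 3–D point is associated with a nearby facet of the generic model, and its distance to that facet’s plane enters the objective as an extra residual. The nearest-centroid association and the function names are our simplifications, not necessarily the paper’s exact attachment scheme.

import numpy as np
from scipy.spatial import cKDTree

def facet_planes(vertices, faces):
    """Return the centroids and unit normals of the model's triangular facets."""
    tri = vertices[faces]                                    # (n_faces, 3, 3)
    centroids = tri.mean(axis=1)
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    return centroids, normals

def model_distance_residuals(points, vertices, faces):
    """Signed point-to-plane distances to the (approximately) nearest facet."""
    centroids, normals = facet_planes(vertices, faces)
    _, idx = cKDTree(centroids).query(points)                # nearest facet by centroid
    return np.einsum('ij,ij->i', points - centroids[idx], normals[idx])

Suitably weighted, these residuals would simply be appended to the reprojection errors of the least-squares problem sketched in Section 2.1.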

2.2 Head Modeling
In recent years much work has been devoted to modeling faces from image and range data. There are
many effective approaches to recovering face geometry. They rely on stereo [Devernay and Faugeras, 1994, Fua and Leclerc, 1995], shading [Leclerc and Bobick, 1991, Samaras and Metaxas, 1998], structured light [Proesmans et al., 1996], silhouettes [Tang and Huang, 1996] or low-intensity lasers. However, if
the goal is to fit a full animation model to the data, recovering the head as a simple triangulated mesh does
not suffice. To be suitable for animation, such a model must have a large number of degrees of freedom.
Some approaches use very clean data—the kind produced by a laser scanner or structured light—to instantiate them [Lee et al., 1995]. Among approaches that rely on image data alone, many require extensive manual intervention, such as supplying silhouettes in orthogonal images [Lee and Thalmann, 1998] or point correspondences in multiple images [Pighin et al., 1998].
Successful approaches to automating the fitting process have involved the use of optical flow [DeCarlo and Metaxas, 1998] or appearance-based techniques [Kang, 1997] to overcome the fact that faces have little
texture and that, as a result, automatically and reliably establishing correspondences is difficult. This latter
technique is closely related to ours because head shape and camera motion are recovered simultaneously.
However, the optical flow approach avoids the “correspondence problem” at the cost of making assumptions
about constant illumination of the face that may be violated as the head moves. This tends to limit the range
of images that can be used, especially if the lighting is not diffuse.
More recently, another extremely impressive appearance-based approach that uses a sophisticated sta-
tistical head model has been proposed [Blanz and Vetter, 1999]. This model has been learned from a large database of human heads and its parameters can be adjusted so that it can synthesize images that closely resemble the input image or images. While the results are outstanding even when only one image is used,
the recovered shape cannot be guaranteed to be correct unless more than one is used. Because the model is
Euclidean, initial camera parameters must be supplied when dealing with uncalibrated imagery. Therefore,
the technique proposed here could be used to initialize the Blanz & Vetter system in an automated fashion.
In other words, if we had had their model, we could have used it to develop the technique described here.
However, for practical reasons, it was not available. Instead, we used the model described below.
2.3 Face Model
In this work, we use the facial animation model that has been developed at the University of Geneva and EPFL [Kalra et al., 1992]. It can produce the different facial expressions arising from speech and emotions. Its multilevel configuration reduces complexity and provides independent control for each level. At
the lowest level, a deformation controller simulates muscle actions using rational free form deformations.
At a higher level, the controller produces animations corresponding to abstract entities such as speech and
emotions.
The corresponding skin surface is shown in its rest position in Figure 3(a,b). We will refer to it as the
surface triangulation. Our goal is to deform the surface without changing its topology. This is important
because the facial animation software depends on the model’s topology and its configuration files must be
recomputed every time it is changed, which is hard to do on an automated basis.
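As a minimal illustration of this constraint, the surface triangulation can be thought of as a fixed face list plus a mutable vertex array: fitting only ever moves vertices, so the connectivity that the animation software relies on never changes. The class below is an illustrative data structure of ours, not the actual animation software.

import numpy as np

class SurfaceTriangulation:
    """Triangulated surface with fixed connectivity and movable vertices."""

    def __init__(self, vertices, faces):
        self.vertices = np.asarray(vertices, dtype=float)   # (n_vertices, 3): free to move
        self.faces = np.asarray(faces, dtype=int)           # (n_faces, 3): never edited

    def deform(self, displacements):
        """Apply per-vertex displacements; the face list, i.e. the topology, is untouched."""
        displacements = np.asarray(displacements, dtype=float)
        assert displacements.shape == self.vertices.shape
        self.vertices = self.vertices + displacements
        return self

Any fitting or regularization step then only updates the vertex positions, so the configuration files that depend on the face list remain valid.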
