
Active Appearance Models
Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor

Abstract—We describe a new method of matching statistical models of
appearance to images. A set of model parameters control modes of shape and
gray-level variation learned from a training set. We construct an efficient iterative
matching algorithm by learning the relationship between perturbations in the
model parameters and the induced image errors.

Index Terms—Appearance models, deformable templates, model matching.
1 INTRODUCTION
The "interpretation through synthesis" approach has received
considerable attention over the past few years [3], [6], [11], [14].
The aim is to "explain" novel images by generating synthetic
images that are as similar as possible, using a parameterized model
of appearance. One motivation is to achieve robust segmentation
by using the model to constrain solutions to be valid examples of
the class of images modeled. A suitable model also provides a
basis for a broad range of applications by coding the appearance of
a given image in terms of a compact set of parameters that are
useful for higher-level interpretation of the scene.
Suitable methods of modeling photo-realistic appearance have
been described previously, e.g., by Edwards et al. [3], Jones and
Poggio [9], or Vetter [14]. When applied to images of complex and
variable objects (e.g., faces), these models typically require a large
number of parameters (50-100). In order to interpret novel images,
an efficient method of finding the best match between model and
image is required. Direct optimization in this high-dimensional
space is possible using standard methods but it is slow [9] and/or
liable to become trapped in local minima.
We have shown previously [11], [3] that a statistical model of
appearance can be matched to an image in two steps: First, an
Active Shape Model is matched to boundary features in the image,
then a separate eigenface model is used to reconstruct the texture
(in the graphics sense) in a shape-normalized frame. This approach
is not, however, guaranteed to give an optimal fit of the
appearance model to the image because small errors in the match
of the shape model can result in a shape-normalized texture map
that cannot be reconstructed correctly using the eigenface model.
In this paper, we show an efficient direct optimization approach
that matches shape and texture simultaneously, resulting in an
algorithm that is rapid, accurate, and robust. In our method, we do
not attempt to solve a general optimization problem each time we
wish to fit the model to a new image. Instead, we exploit the fact
that the optimization problem is similar each time so we can learn
these similarities offline. This allows us to find directions of rapid
convergence even though the search space has very high
dimensionality. The approach has similarities with "difference
decomposition" tracking algorithms [7], [8], [13], but rather than
tracking a single deforming object we match a model which can fit
a whole class of objects.
2 BACKGROUND
Several authors have described methods for matching deformable
models of shape and appearance to novel images. Nastar et al. [12]
describe a model of shape and intensity variations using a
3D deformable model of the intensity landscape. They use a
closest point surface matching algorithm to perform the fitting,
which tends to be sensitive to the initialization. Lanitis et al. [11]
and Edwards et al. [3] use a boundary finding algorithm (an
"Active Shape Model") to find the best shape, then use this to
match a model of image texture. Poggio et al. [6], [9] use an optical
flow algorithm to match shape and texture models iteratively, and
Vetter [14] uses a general purpose optimization method to match
photorealistic human face models to images. The last two
approaches are slow because of the high dimensionality of the
problem and the expense of testing the quality of fit of the model
against the image.
Fast model matching algorithms have been developed in the
tracking community. Gleicher [7] describes a method of tracking
objects by allowing a single template to deform under a variety of
transformations (affine, projective, etc.). He chooses the parameters
to minimize a sum of squares measure and essentially precom-
putes derivatives of the difference vector with respect to the
parameters of the transformation. Hager and Belhumeur [8]
describe a similar approach, but include robust kernels and
models of illumination variation. Sclaroff and Isidoro [13] extend
the approach to track objects which deform, modeling deformation
using the low energy modes of a finite element model of the target.
The approach has been used to track heads [10] using a rigid
cylindrical model of the head.
The Active Appearance Models described below are an
extension of this approach [4], [1]. Rather than tracking a particular
object, our models of appearance can match to any of a class of
deformable objects (e.g., any face with any expression, rather than
one person's face with a particular expression). To match rapidly,
we precompute the derivatives of the residual of the match
between the model and the target image and use them to compute
the update steps in an iterative matching algorithm. In order to
make the algorithm insensitive to changes in overall intensity, the
residuals are computed in a normalized reference frame.
3 MODELING APPEARANCE
Following the approach of Edwards et al. [3], our statistical
appearance models are generated by combining a model of shape
variation with a model of texture variation. By "texture" we mean
the pattern of intensities or colors across an image patch. To build a
model, we require a training set of annotated images where
corresponding points have been marked on each example. For
instance, to build a face model, we require face images marked
with points defining the main features (Fig. 1). We apply
Procrustes analysis to align the sets of points (each represented
as a vector, x) and build a statistical shape model [2]. We then
warp each training image so the points match those of the mean
shape, obtaining a "shape-free patch" (Fig. 1). This is raster
scanned into a texture vector, $g$, which is normalized by applying a
linear transformation, $g \to (g - \mu_g \mathbf{1}) / \sigma_g$, where $\mathbf{1}$ is a vector of
ones and $\mu_g$, $\sigma_g^2$ are the mean and variance of the elements of $g$. After
normalization, $g^T \mathbf{1} = 0$ and $|g| = 1$. Eigen-analysis is applied to
build a texture model. Finally, the correlations between shape and
texture are learned to generate a combined appearance model (see
[3] for details).
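As a concrete illustration, the following is a minimal sketch of the texture normalization and eigen-analysis steps, assuming the shape-free patches have already been raster-scanned into the rows of a matrix; all names are hypothetical and details (e.g., the exact variance threshold) are assumptions:

```python
import numpy as np

def normalize_texture(g):
    # Linear transformation described above: zero the mean of the elements,
    # then rescale. Dividing by the vector norm (rather than the per-element
    # standard deviation, which differs only by a factor of sqrt(n)) enforces
    # g^T 1 = 0 and |g| = 1 exactly.
    g = g - g.mean()
    return g / np.linalg.norm(g)

def build_texture_model(G, var_kept=0.95):
    # G: one normalized texture vector per row. PCA via the SVD keeps
    # enough modes to explain the requested fraction of variance.
    g_bar = G.mean(axis=0)
    U, s, Vt = np.linalg.svd(G - g_bar, full_matrices=False)
    var = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(var), var_kept)) + 1
    return g_bar, Vt[:k].T   # mean texture and a Q_g-like matrix of modes
```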
T.F. Cootes and C.J. Taylor are with the Division of Imaging Science and
Biomedical Engineering, University of Manchester, M13 9PT, UK.
E-mail: {t.cootes, chris.taylor}@man.ac.uk.
G.J. Edwards is with Image-Metrics Ltd., Regent House, Heaton Lane,
Stockport, Cheshire, SK4 1BS, UK.
E-mail: gareth.edwards@image-metrics.com.
Manuscript received 5 Apr. 2000; revised 10 Nov. 2000; accepted 7 Jan. 2001.
Recommended for acceptance by M. Shah.

The appearance model has parameters, $c$, controlling the shape
and texture (in the model frame) according to

$$x = \bar{x} + Q_s c, \qquad g = \bar{g} + Q_g c, \tag{1}$$

where $\bar{x}$ is the mean shape, $\bar{g}$ the mean texture in a mean-shaped
patch, and $Q_s$, $Q_g$ are matrices describing the modes of variation
derived from the training set.
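In code, instantiating a model example from appearance parameters is a single affine map per (1); a minimal sketch, assuming $\bar{x}$, $\bar{g}$, $Q_s$, and $Q_g$ have been learned as above (names hypothetical):

```python
import numpy as np

def instantiate(c, x_bar, g_bar, Qs, Qg):
    # Equation (1): shape and texture in the model frame are linear in c.
    x = x_bar + Qs @ c   # model points (flattened x1, y1, x2, y2, ...)
    g = g_bar + Qg @ c   # shape-normalized texture vector
    return x, g
```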
A shape, $X$, in the image frame can be generated by
applying a suitable transformation to the points, $x$: $X = S_t(x)$.
Typically, $S_t$ will be a similarity transformation described by a
scaling, $s$, an in-plane rotation, $\theta$, and a translation $(t_x, t_y)$. For
linearity, we represent the scaling and rotation as $(s_x, s_y)$,
where $s_x = s\cos\theta - 1$ and $s_y = s\sin\theta$. The pose parameter vector
$t = (s_x, s_y, t_x, t_y)^T$ is then zero for the identity transformation
and $S_{t+\delta t}(x) \approx S_t(S_{\delta t}(x))$.
The texture in the image frame is generated by applying a scaling
and offset to the intensities, $g_{im} = T_u(g) = (u_1 + 1)g + u_2 \mathbf{1}$, where
$u$ is the vector of transformation parameters, defined so that $u = 0$ is
the identity transformation and $T_{u+\delta u}(g) \approx T_u(T_{\delta u}(g))$.
A full reconstruction is given by generating the texture in a
mean shaped patch, then warping it so that the model points lie on
the image points, X. In experiments described below, we used a
triangulation-based interpolation method to define a continuous
mapping.
In practice, we construct a multiresolution model based on a
Gaussian image pyramid. This allows a multiresolution approach
to matching, leading to improved speed and robustness. For each
level of the pyramid, we build a separate texture model, sampling
a number of pixels proportional to the area (in pixels) of the target
region in the image. Thus, the model at level 1 has one quarter the
number of pixels of that at level 0. For simplicity, we use the same
set of landmarks at every level and the same shape model. Though,
strictly, the number of points should be reduced at lower resolutions,
any blurring of detail is accounted for by the texture model. The
appearance model at a given level is then computed given the
global shape model and the particular texture model for that level.
3.1 An Appearance Model for Faces
We used the method described above to build a model of facial
appearance. We used a training set of 400 images of faces, each
labeled with 68 points around the main features (Fig. 1), and
sampled approximately 10,000 intensity values from the facial
region. From this, we generated an appearance model which
required 55 parameters to explain 95 percent of the observed
variation. Fig. 2 shows the effect of varying the first four
parameters of $c$, showing changes in identity, pose, and
expression. Note the correlation between shape and intensity
variation.
4 ACTIVE APPEARANCE MODEL SEARCH
We now turn to the central problem addressed by this paper: We
have a new image to be interpreted, a full appearance model as
described above and a rough initial estimate of the position,
orientation, and scale at which this model should be placed in an
image. We require an efficient scheme for adjusting the model
parameters, so that a synthetic example is generated, which
matches the image as closely as possible.
4.1 Modeling Image Residuals
The appearance model parameters, $c$, and shape transformation
parameters, $t$, define the position of the model points in the image
frame, $X$, which gives the shape of the image patch to be
represented by the model. During matching, we sample the pixels
in this region of the image, $g_{im}$, and project into the texture model
frame, $g_s = T_u^{-1}(g_{im})$. The current model texture is given by
$g_m = \bar{g} + Q_g c$. The current difference between model and image
(measured in the normalized texture frame) is thus
Fig. 1. A labeled training image gives a shape-free patch and a set of points.
Fig. 2. Effect of varying the first four facial appearance model parameters, $c_1$–$c_4$, by $\pm 3$ standard deviations from the mean.

rpg
s
ÿ g
m
; 2
where p are the parameters of the model, p
T
c
T
jt
T
ju
T
.
A simple scalar measure of difference is the sum of squares of
elements of r, Epr
T
r. A first order Taylor expansion of (2)
gives
rp prp
@r
@p
p; 3
where the ijth element of matrix
@r
@p
is
dr
i
dp
j
.
Suppose during matching our current residual is $r$. We wish to
choose $\delta p$ so as to minimize $|r(p + \delta p)|^2$. By equating (3) to zero,
we obtain the RMS solution,

$$\delta p = -R\,r(p) \quad \text{where} \quad R = \left( \frac{\partial r}{\partial p}^T \frac{\partial r}{\partial p} \right)^{-1} \frac{\partial r}{\partial p}^T. \tag{4}$$
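Computationally, (4) is just the pseudo-inverse of the Jacobian; a minimal sketch, assuming `J` holds an estimate of $\partial r / \partial p$ (one column per parameter):

```python
import numpy as np

def precompute_R(J):
    # R = (J^T J)^{-1} J^T, the least-squares gain matrix of (4).
    # Solving the normal equations avoids forming the inverse explicitly.
    return np.linalg.solve(J.T @ J, J.T)

# At search time the predicted parameter update is then simply:
#     delta_p = -R @ r
```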
In a standard optimization scheme, it would be necessary to
recalculate $\frac{\partial r}{\partial p}$ at every step, an expensive operation. However, we
assume that, since it is being computed in a normalized reference
frame, it can be considered approximately fixed. We can thus
estimate it once from a training set. We estimate $\frac{\partial r}{\partial p}$ by numeric
differentiation. The $j$th column is computed by systematically
displacing each parameter from the known optimal value on
typical images and computing a weighted average of the residuals.
To improve the robustness of the measured estimate, residuals at
displacements of differing magnitudes, $\delta p_{jk}$, are measured (typically
up to 0.5 standard deviations, $\sigma_j$, for each parameter, $p_j$) and
summed with a weighting proportional to $\exp(-\delta p_{jk}^2 / 2\sigma_j^2) / \delta p_{jk}$. We
then precompute $R$ and use it in all subsequent searches with the
model.
Images used in the calculation of $\frac{\partial r}{\partial p}$ can either be examples
from the training set or synthetic images generated using the
appearance model itself. Where synthetic images are used, one can
either use a suitable (e.g., random) background, or detect the
areas of the model which overlap the background and remove
those pixels from the model-building process. The latter makes the
final relationship more independent of the background. Where the
background is predictable (e.g., medical images), this is not
necessary.
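The sketch below shows one plausible reading of this numeric differentiation scheme; `sample_residual` is a hypothetical callback that resamples a training image and returns the residual $r$ for given parameters, and the displacement fractions are assumptions:

```python
import numpy as np

def estimate_jacobian(sample_residual, p_opt, sigmas, steps=(0.125, 0.25, 0.5)):
    # p_opt: known optimal parameters for a training image.
    # sigmas: per-parameter standard deviations from the training set.
    r0 = sample_residual(p_opt)
    J = np.zeros((r0.size, p_opt.size))
    for j in range(p_opt.size):
        col, w_sum = np.zeros(r0.size), 0.0
        for frac in steps:
            for sign in (1.0, -1.0):
                dp = sign * frac * sigmas[j]   # displacements up to 0.5 sigma
                p = p_opt.copy()
                p[j] += dp
                w = np.exp(-dp**2 / (2.0 * sigmas[j]**2))
                col += w * (sample_residual(p) - r0) / dp
                w_sum += w
        J[:, j] = col / w_sum                  # weighted-average column
    return J
```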
4.2 Iterative Model Refinement
Using (4), we can suggest a correction to make in the model
parameters based on a measured residual r. This allows us to
construct an iterative algorithm for solving our optimization
problem. Given a current estimate of the model parameters, $c$,
the pose, $t$, the texture transformation, $u$, and the image sample at
the current estimate, $g_{im}$, one step of the iterative procedure is as
follows:

1. Project the texture sample into the texture model frame using $g_s = T_u^{-1}(g_{im})$.
2. Evaluate the error vector, $r = g_s - g_m$, and the current error, $E = |r|^2$.
3. Compute the predicted displacements, $\delta p = -R\,r(p)$.
4. Update the model parameters $p \to p + k\,\delta p$, where initially $k = 1$.
5. Calculate the new points, $X'$, and model frame texture, $g_m'$.
6. Sample the image at the new points to obtain $g_{im}'$.
7. Calculate a new error vector, $r' = T_{u'}^{-1}(g_{im}') - g_m'$.
8. If $|r'|^2 < E$, then accept the new estimate; otherwise, try at $k = 0.5$, $k = 0.25$, etc.
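Put together, one iteration might look like the sketch below; the sampling and warping helpers (`sample_image`, `project_texture`, `model_texture`) are hypothetical stand-ins for the machinery described above, and the parameter vector `p` bundles $c$, $t$, and $u$:

```python
def aam_iteration(p, R, sample_image, project_texture, model_texture):
    # One step of the iterative refinement above, with step-length halving.
    g_s = project_texture(sample_image(p), p)   # steps 1-2
    r = g_s - model_texture(p)
    E = r @ r
    delta_p = -R @ r                            # step 3
    for k in (1.0, 0.5, 0.25, 0.125):           # steps 4-8
        p_new = p + k * delta_p
        r_new = (project_texture(sample_image(p_new), p_new)
                 - model_texture(p_new))
        if r_new @ r_new < E:
            return p_new, True                  # improved estimate accepted
    return p, False                             # no improvement: converged
```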
This procedure is repeated until no improvement is made to the
error, $|r|^2$, and convergence is assumed. In practice, we use a
multiresolution implementation, in which we start at a coarse
resolution and iterate to convergence at each level before
projecting the current solution to the next level of the model. This
is more efficient and can converge to the correct solution from
further away than searching at a single resolution. The complexity
of the algorithm is $O(n_{pixels}\,n_{modes})$ at a given level. Essentially, each
iteration involves sampling $n_{pixels}$ points from the image, then
multiplying by an $n_{modes} \times n_{pixels}$ matrix.
In earlier work [4], [1], we described a training algorithm based
on regressing random displacement vectors against residual error
vectors. Because of the linear assumptions being made, in the limit
this would give the same result as above. However, in practice, we
have found the above method to be faster and more reliable.
5 AN ACTIVE APPEARANCE MODEL FOR FACES
We applied the Active Appearance Model training algorithm to the
face model described in Section 3.1. The elements of a row of the
matrix R give the weights for each texture sample point for a given
model parameter. Figs. 3 and 4 show the first two modes of the
appearance model and the corresponding weight images. The
areas which exhibit the largest variations for the mode are assigned
the largest weights by the training process.
We used the face AAM to search for faces in previously unseen
images. Fig. 5 shows frames from an AAM search for two faces,
each starting with the mean model displaced from the true face
center. In each case, a good result is found. The search takes less
than one second on a 400 MHz Pentium II PC.
6 QUANTITATIVE EVALUATION
To obtain a quantitative evaluation of the performance of the
AAM algorithm, we trained a model on 100 hand-labeled face
images and tested it on a different set of 100 labeled images.
Each face was about 200 pixels wide. A variety of different
people and expressions were included. The AAM was trained
by systematically displacing the model parameters on the first
10 images of the training set.
On each test image, we systematically displaced the model from
the true position by $\pm 10$ percent of the face width in $x$ and $y$, and
changed its scale by $\pm 10$ percent. We then ran the multiresolution
search, starting with the mean appearance parameters.
2,700 searches were run in total. Of these, 13 percent failed to
converge to a satisfactory result (the mean point position error was
greater than 5 percent of face width per point). Of those that did
converge, the RMS error between the model points and the target
points was about 0.8 percent of face width. The mean magnitude of
the final texture error vector in the normalized frame, relative to
that of the best model fit given the marked points, was 0.84 (sd:
0.2), suggesting that the algorithm is locating a better result than
that provided by the marked points. This is to be expected since
Fig. 3. First mode and displacement weights.
Fig. 4. Second mode and displacement weights.

the algorithm is minimizing the texture error vector and will
compromise the shape if that leads to an overall improvement in
texture match.
Fig. 6 shows the mean intensity error per pixel (for an image
using 256 gray-levels) against the number of iterations, averaged
over a set of searches at a single resolution. The model was initially
displaced by 5 percent of the face width. The dotted line gives the
mean reconstruction error using the hand-marked landmark
points, suggesting a good result is obtained by the search.
Fig. 7 shows the proportion of 100 multiresolution searches
which converged correctly given various starting positions dis-
placed by up to 50 pixels (25 percent of face width) in x and y. The
model displays good results with up to 20 pixels (10 percent of the
face width) initial displacement.
7 DISCUSSION AND CONCLUSIONS
We have demonstrated an iterative scheme for matching a
Statistical Appearance Model to new images. The method makes
use of learned correlation between errors in model parameters and
the resulting residual texture errors. Given a reasonable initial
starting position, the search converges rapidly and reliably. The
approach is valuable for locating examples of any class of objects
that can be adequately represented using the statistical appearance
model described above. This includes objects which exhibit shape
and texture variability for which correspondences can be defined
between different examples. For instance, the method cannot be
used for tree-like structures with varying numbers of branches, but
can be used for various organs which exhibit shape variation but
not a change in topology. We have tested the method extensively
with images of faces and with medical images [1], [5] and have
obtained encouraging results in other domains. Although an
AAM search is slightly slower than Active Shape Model search [2],
the procedure tends to be more robust than ASM search alone,
since all the image evidence is used.
The algorithm has been described for gray-level images but can
be extended to color images simply by sampling each color at each
sample point (e.g., three values per pixel for an RGB image). The
nature of the search algorithm also makes it suitable for tracking
objects in image sequences, where it can be shown to give robust
results [5].
ACKNOWLEDGMENTS
Dr. Cootes is funded under an EPSRC Advanced Fellowship
Grant. During much of this work, Dr. Edwards was funded by an
EPSRC CASE award with BT Research Laboratories.
Fig. 5. Multiresolution search from displaced position using face model.
Fig. 6. Mean intensity error as search progresses. Dotted line is the mean error of
the best fit to the landmarks.
Fig. 7. Proportion of searches which converged from different initial
displacements.

REFERENCES
[1] T.F. Cootes, G.J. Edwards, and C.J. Taylor, "Active Appearance Models," Proc. Fifth European Conf. Computer Vision, H. Burkhardt and B. Neumann, eds., vol. 2, pp. 484-498, 1998.
[2] T.F. Cootes, C.J. Taylor, D. Cooper, and J. Graham, "Active Shape Models—Their Training and Application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan. 1995.
[3] G. Edwards, A. Lanitis, C. Taylor, and T. Cootes, "Statistical Models of Face Images—Improving Specificity," Image and Vision Computing, vol. 16, pp. 203-211, 1998.
[4] G. Edwards, C.J. Taylor, and T.F. Cootes, "Interpreting Face Images Using Active Appearance Models," Proc. Third Int'l Conf. Automatic Face and Gesture Recognition, pp. 300-305, 1998.
[5] G.J. Edwards, C.J. Taylor, and T.F. Cootes, "Face Recognition Using the Active Appearance Model," Proc. Fifth European Conf. Computer Vision, H. Burkhardt and B. Neumann, eds., vol. 2, pp. 581-595, 1998.
[6] T. Ezzat and T. Poggio, "Facial Analysis and Synthesis Using Image-Based Models," Proc. Second Int'l Conf. Automatic Face and Gesture Recognition, pp. 116-121, 1997.
[7] M. Gleicher, "Projective Registration with Difference Decomposition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1997.
[8] G. Hager and P. Belhumeur, "Efficient Region Tracking with Parametric Models of Geometry and Illumination," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025-1039, Oct. 1998.
[9] M.J. Jones and T. Poggio, "Multidimensional Morphable Models: A Framework for Representing and Matching Object Classes," Int'l J. Computer Vision, vol. 29, no. 2, pp. 107-131, 1998.
[10] M. La Cascia, S. Sclaroff, and V. Athitsos, "Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture Mapped 3D Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 4, pp. 322-336, Apr. 2000.
[11] A. Lanitis, C.J. Taylor, and T.F. Cootes, "Automatic Interpretation and Coding of Face Images Using Flexible Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 743-756, July 1997.
[12] C. Nastar, B. Moghaddam, and A. Pentland, "Generalized Image Matching: Statistical Learning of Physically-Based Deformations," Computer Vision and Image Understanding, vol. 65, no. 2, pp. 179-191, 1997.
[13] S. Sclaroff and J. Isidoro, "Active Blobs," Proc. Sixth Int'l Conf. Computer Vision, pp. 1146-1153, 1998.
[14] T. Vetter, "Learning Novel Views to a Single Face Image," Proc. Second Int'l Conf. Automatic Face and Gesture Recognition, pp. 22-27, 1996.