
Automatic face recognition for film character retrieval in feature-length films

Ognjen Arandjelović and Andrew Zisserman
CVPR 2005, Vol. 1, pp. 860-867
TL;DR: It is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%), using a recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment.


This is the published version:
Arandjelovic, Ognjen and Zisserman, A. 2005, Automatic face recognition for film character retrieval in feature-length films, in CVPR 2005: Proceedings of the Computer Vision and Pattern Recognition Conference 2005, IEEE, Piscataway, New Jersey, pp. 860-867.
Available from Deakin Research Online:
http://hdl.handle.net/10536/DRO/DU:30058433
Reproduced with the kind permission of the copyright owner.
Copyright: 2005, IEEE

Automatic Face Recognition for Film Character Retrieval in Feature-Length Films
Ognjen Arandjelović    Andrew Zisserman
Engineering Department, University of Oxford, UK
E-mail: oa214@cam.ac.uk, az@robots.ox.ac.uk
Abstract
The objective of this work is to recognize all the frontal faces of a character in the closed world of a movie or situation comedy, given a small number of query faces. This is challenging because faces in a feature-length film are relatively uncontrolled with a wide variability of scale, pose, illumination, and expressions, and also may be partially occluded. We develop a recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment. In particular there are three areas of novelty: (i) we suppress the background surrounding the face, enabling the maximum area of the face to be retained for recognition rather than a subset; (ii) we include a pose refinement step to optimize the registration between the test image and face exemplar; and (iii) we use robust distance to a sub-space to allow for partial occlusion and expression change. The method is applied and evaluated on several feature-length films. It is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%).
1. Introduction
The problem of automatic face recognition (AFR) concerns matching a detected (roughly localized) face against a database of known faces with associated identities. Although very intuitive to humans, this task still poses a significant challenge to computer methods despite the vast amount of research behind it; see [2, 19] for surveys. Much AFR research has concentrated on the user authentication paradigm. In contrast, we consider the content-based multimedia retrieval setup: our aim is to retrieve, and rank by confidence, film shots based on the presence of specific actors. A query to the system consists of the user choosing the person of interest in one or more keyframes. Possible applications include rapid DVD browsing or multimedia-oriented web search.
Figure 1. Automatically detected faces in a typical frame from the feature-length film “Groundhog day”. The background is cluttered; pose, expression and illumination are highly variable.
We proceed from the face detection stage, assuming localized faces. Face detection technology is fairly mature and a number of reliable face detectors have been built, see [13, 16, 18]. We use a local implementation of the method of Schneiderman and Kanade [16] and consider a face to be correctly detected if both eyes and the mouth are visible, see Figure 1. In a typical feature-length film we obtain 2000-5000 face images, which result from a cast of 10-20 primary and secondary characters.
Problem challenges. A number of factors other than identity influence the way a face appears in an image. Lighting conditions, and especially light angle, drastically change the appearance of a face [1]. Facial expressions, including closed or partially closed eyes, complicate the problem, as does head pose. Partial occlusions, whether caused by objects in front of the face, a change of hair style, or a newly grown beard or moustache, also cause problems. Films therefore provide an uncontrolled, realistic working environment for face recognition algorithms.
Method overview. Our approach consists of computing a numerical value, a distance, expressing the degree of belief that two face images belong to the same person. A low distance, ideally zero, signifies that the images are of the same person, whilst a large one signifies that they are of different people.
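To make the retrieval setup concrete, the following minimal sketch shows how such a pairwise distance could drive shot ranking. It is illustrative only: the function and variable names are ours, not the paper's, and a plain Euclidean distance stands in for the robust measure developed later.

```python
import numpy as np

def rank_shots(query_signatures, shot_faces, distance):
    """Rank shots by the best match between any face in the shot and any query face.

    query_signatures : list of 2D arrays, signature images of the queried actor
    shot_faces       : dict mapping shot_id -> list of signature images in that shot
    distance         : callable d(a, b) -> float, low means "same person"
    """
    scores = {}
    for shot_id, faces in shot_faces.items():
        if faces:
            # A shot's confidence is its single best (lowest) query-to-face distance.
            scores[shot_id] = min(distance(q, f) for q in query_signatures for f in faces)
    return sorted(scores.items(), key=lambda kv: kv[1])  # most confident shots first

# Placeholder metric; the paper develops a robust subspace distance instead.
euclidean = lambda a, b: float(np.linalg.norm(a - b))
```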

Figure 2. The effects of the imaging conditions – illumination (a), pose (b) and expression (c) – on the appearance of a face are dramatic and present the main difficulty for AFR.
The method involves computing a series of transformations of the original image, each aimed at removing the effects of a particular extrinsic imaging factor. The end result is a signature image of a person, which depends mainly on the person's identity (and expression) and can be readily classified. This is summarized in Figure 3 and Algorithm 1.
1.1. Previous Work
Little work in the literature addresses AFR in a setup similar to ours. Fitzgibbon and Zisserman [11] investigated face clustering in feature films, though without explicitly using facial features for registration. Berg et al. [3] consider the problem of clustering detected frontal faces extracted from web news pages. In a similar manner to us, affine registration with an underlying SVM-based facial feature detector is used for face rectification. The classification is then performed in a Kernel PCA space using combined image and contextual text-based features. The problem we consider is more difficult in two respects: (i) the variation in imaging conditions in films is typically greater than in newspaper photographs, and (ii) we do not use any type of information other than visual cues (i.e. no text). The difference in difficulty is apparent by comparing the examples in [3] with those used for evaluation in Section 3. For example, in [3] the face image size is restricted to be at least 86 × 86 pixels, whilst a significant number of faces we use are of lower resolution.
Everingham and Zisserman [9] consider AFR in situation comedies. However, rather than using facial feature detection, a quasi-3D model of the head is used to correct for varying pose. Temporal information via shot tracking is exploited to enrich the training corpus. In contrast, we do not use any temporal information, and the use of local features (Section 2.1) allows us to compare two face images in spite of partial occlusions (Section 2.5).
Algorithm 1 Method overview
Input: novel image $I$, training signature image $S_r$.
Output: distance $d(I, S_r)$.
1: Facial feature localization: $\{x_i\} \leftarrow I$
2: Pose effects, registration by affine warping: $I_R = f(I, \{x_i\}, S_r)$
3: Background clutter, face outline detection: $I_F = I_R \cdot \mathrm{mask}(I_R)$
4: Illumination effects, band-pass filtering: $S = I_F * B$
5: Pose effects, registration refinement: $S_f = \hat{f}(I_F, S_r)$
6: Occlusion effects, robust distance measure: $d(I, S_r) = \| S_r - S_f \|$
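Read as code, the cascade is a straight pipeline. The sketch below mirrors Algorithm 1's control flow with identity stand-ins for the stages; all function bodies here are placeholder assumptions rather than the authors' implementations, with only the band-pass step made concrete using the filter from Section 2.4.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Identity stand-ins so the pipeline runs end to end; the real stages are
# described in Sections 2.1-2.5 of the paper.
def detect_facial_features(I):
    return [(20.0, 30.0), (50.0, 30.0), (35.0, 60.0)]   # dummy eyes + mouth

def affine_register(I, xs, S_r):
    return I                                            # Section 2.2

def face_mask(I_R):
    return np.ones_like(I_R)                            # Section 2.3

def band_pass(I_F):
    # Difference of Gaussians, as in equation (10).
    return gaussian_filter(I_F, 0.5) - gaussian_filter(I_F, 8)

def refine_registration(S, S_r):
    return S                                            # step 5, pose refinement

def robust_distance(S_r, S_f):
    return float(np.linalg.norm(S_r - S_f))             # robust version: Section 2.5

def face_distance(I, S_r):
    """Algorithm 1: each stage normalizes one extrinsic factor before comparison."""
    xs = detect_facial_features(I)      # 1: localize eyes and mouth
    I_R = affine_register(I, xs, S_r)   # 2: pose normalization by affine warping
    I_F = I_R * face_mask(I_R)          # 3: background clutter suppression
    S = band_pass(I_F)                  # 4: illumination normalization
    S_f = refine_registration(S, S_r)   # 5: registration refinement
    return robust_distance(S_r, S_f)    # 6: occlusion-tolerant distance
```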
[Figure 3 diagram: Original Image → (facial features, affine warp) → Normalized Pose → (background removal, using a probabilistic model of the face outline) → Background Clutter Removed → (band-pass filter) → Normalized Illumination → Face Signature Image; the feature detectors are SVM classifiers trained on features training data.]
Figure 3. Face representation: each step in the cascade produces a result invariant to a specific extrinsic factor.
2. Method Details
In the proposed framework, the first step in processing a face image is the normalization of the subject's pose, i.e. registration. After the face detection stage, faces are only roughly localized and aligned – more sophisticated registration methods are needed to correct for the effects of varying pose. One way of doing this is to “lock onto” the characteristic facial points. In our method, these facial points are the locations of the mouth and the eyes.

Figure 4. The difficulties of facial feature detection: without context, distinguishing features under low resolution and bad illumination is a hard task even for a human. Shown are a mouth and an eye that, although easily recognized within the context of the whole image, are very similar in isolation.
2.1. Facial Feature Detection
In the proposed algorithm, Support Vector Machines¹ (SVMs) [6, 17] are used for facial feature detection. For a related approach see [3]; alternative methods include pictorial structures [10] or the method of Cristinacce et al. [7]. We represent each facial feature, i.e. the image patch surrounding it, by a feature vector. An SVM with a set of parameters (kernel type, its bandwidth and a regularization constant) is then trained on a part of the training data and its performance iteratively optimized on the remainder. The final detector is evaluated by a one-time run on unseen data.
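The train/validate protocol above can be made concrete with a small grid search. The paper used LibSVM; the sketch below uses scikit-learn's SVC (which wraps LibSVM) purely to illustrate the idea, and the grid values and split ratio are our assumptions, not the paper's.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_feature_detector(X, y):
    """Fit an SVM patch classifier, tuning (C, gamma) on a held-out split.

    X : (n_samples, n_features) patch feature vectors; y : {0, 1} labels.
    """
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    best, best_acc = None, -1.0
    for C in (0.1, 1.0, 10.0):            # regularization constant
        for gamma in (1e-3, 1e-2, 1e-1):  # RBF kernel bandwidth
            clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_fit, y_fit)
            acc = clf.score(X_val, y_val)
            if acc > best_acc:
                best, best_acc = clf, acc
    # Final evaluation should be a one-time run on unseen data, as in the paper.
    return best
```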
2.1.1 Training
For training we use manually localized facial features in a set of 300 randomly chosen faces from the feature-length film “Groundhog day”. Examples are extracted by taking rectangular image patches centred at feature locations (see Figures 4 and 5). We represent each patch $I \in \mathbb{R}^{N \times M}$ with a feature vector $\mathbf{v} \in \mathbb{R}^{2NM}$ combining appearance and gradient information (we used $N = 17$ and $M = 21$):

$$v_A(Ny + x) = I(x, y) \qquad (1)$$

$$v_G(Ny + x) = |\nabla I(x, y)| \qquad (2)$$

$$\mathbf{v} = \begin{bmatrix} \mathbf{v}_A \\ \mathbf{v}_G \end{bmatrix} \qquad (3)$$
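A direct transcription of (1)-(3). If the patch is stored as an (M, N) array with patch[y, x] = I(x, y), row-major flattening gives exactly the index Ny + x of equations (1) and (2); that array-layout convention is our assumption.

```python
import numpy as np

def patch_feature_vector(patch):
    """Stack raw appearance with gradient magnitude, as in equations (1)-(3).

    patch : 2D greyscale array of shape (M, N); patch[y, x] = I(x, y).
            The paper uses N = 17, M = 21.
    """
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)           # finite-difference image gradients
    v_A = patch.flatten()                 # appearance: v_A(N*y + x) = I(x, y)
    v_G = np.hypot(gx, gy).flatten()      # gradient magnitude: |grad I(x, y)|
    return np.concatenate([v_A, v_G])     # v = [v_A; v_G], equation (3)
```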
Local information. In the proposed method, implicit local information is included for increased robustness. This is done by complementing the image appearance vector $\mathbf{v}_A$ with the greyscale intensity gradient vector $\mathbf{v}_G$, as in (3).
Synthetic data. For robust classification, it is important that training data sets are representative of the whole spaces that are discriminated between. In uncontrolled imaging conditions, the appearance of facial features exhibits a lot of variation, requiring an appropriately large training corpus. This makes the approach with manual feature extraction impractical. In our method, a large portion of the training data (1500 out of 1800 training examples) was synthetically generated. Seeing that the surface of the face is smooth and roughly fronto-parallel, its 3D motion produces locally affine-like effects in the image plane. Therefore, we synthesize training examples by applying random affine perturbations to the manually detected set.

¹We used the LibSVM implementation freely available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Figure 5. A subset of the data (1800 examples in total) used to train the eye detector. Notice the low resolution and the importance of the surrounding image context for precise localization.
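A sketch of this augmentation, assuming OpenCV for warping. The perturbation magnitudes, patch size and sample count are illustrative guesses rather than the paper's values, and no bounds checking is done near the image border.

```python
import numpy as np
import cv2

def synthesize_examples(image, centre, n=5, patch_size=(17, 21), rng=None):
    """Generate training patches by small random affine perturbations of the
    image around a manually marked feature location (x, y)."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    pw, ph = patch_size
    patches = []
    for _ in range(n):
        # Identity plus small random shear/scale, and a small random translation.
        A = np.eye(2) + rng.uniform(-0.1, 0.1, size=(2, 2))
        t = rng.uniform(-1.5, 1.5, size=2)
        M = np.hstack([A, t[:, None]]).astype(np.float32)
        warped = cv2.warpAffine(image, M, (w, h))
        # The marked feature location moves under the same affine map.
        cx, cy = A @ np.asarray(centre, float) + t
        x0, y0 = int(cx) - pw // 2, int(cy) - ph // 2
        patches.append(warped[y0:y0 + ph, x0:x0 + pw])
    return patches
```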
2.1.2 SVM-based Feature Detector
SVMs only provide a classification decision for individual feature vectors, but no associated probabilistic information. Therefore, performing classification on all image patches produces as a result a binary image (a feature is either present or not at a particular location) from which only one feature location needs to be selected.

Our method is based on the observation that, due to the robustness to noise of SVMs, the binary image output consists of connected components of positive classifications (we will refer to these as clusters), see Figure 6. We use a prior on feature locations to focus on the cluster of interest. Priors corresponding to the 3 features are assumed to be independent and Gaussian (2D, with full covariance matrices) and are learnt from the training corpus of 300 manually localized features described in Section 2.1.1. We then consider the total ‘evidence’ for a feature within each cluster:

$$\int_{\mathbf{x} \in S} P(\mathbf{x})\, d\mathbf{x} \qquad (4)$$

where $S$ is a cluster and $P(\mathbf{x})$ the Gaussian prior on the facial feature location. An unbiased feature location estimate with $\sigma \approx 1.5$ pixels was obtained by choosing the mean of the cluster with the largest evidence as the final feature location, see Figures 6 and 7.
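A discrete version of the evidence integral in (4): the integral over a cluster becomes a sum of the Gaussian prior over the cluster's pixels. The library calls are standard scipy; the function name and interface are ours.

```python
import numpy as np
from scipy.ndimage import label
from scipy.stats import multivariate_normal

def select_feature_location(binary_map, prior_mean, prior_cov):
    """Pick the feature location as the mean of the positive-classification
    cluster with the largest total prior 'evidence', cf. equation (4)."""
    labels, n_clusters = label(binary_map)        # connected components
    prior = multivariate_normal(prior_mean, prior_cov)
    best_mean, best_evidence = None, -np.inf
    for k in range(1, n_clusters + 1):
        ys, xs = np.nonzero(labels == k)
        pts = np.stack([xs, ys], axis=1)          # (x, y) coordinates in cluster k
        evidence = prior.pdf(pts).sum()           # discrete sum replacing the integral
        if evidence > best_evidence:
            best_evidence, best_mean = evidence, pts.mean(axis=0)
    return best_mean  # (x, y), or None if there were no positive classifications
```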
2.2. Registration
In the proposed method dense point correspondences are implicitly or explicitly used for background clutter removal, partial occlusion detection and signature image comparison (Sections 2.3-2.5). To this end, images of faces are affine warped to have salient facial features aligned. The six transformation parameters are uniquely determined from three pairs of point correspondences between the detected facial features (the eyes and the mouth) and their canonical locations. In contrast to global appearance-based methods (e.g. [5, 8]) our approach is more robust to partial occlusion. It is summarized in Algorithm 2, with typical results shown in Figure 8.

Figure 6. Efficient SVM-based eye detection. 1: A prior on the feature location restricts the search region. 2: Only 25% of the locations are initially classified. 3: Morphological dilation is used to approximate the dense classification result from the sparse output.
Algorithm 2 Face Registration
Input: canonical facial feature locations $\mathbf{x}_{can}$, face image $I$, facial feature locations $\mathbf{x}_{in}$.
Output: registered image $I_{reg}$.
1: Estimate the affine warp matrix: $A(\mathbf{x}_{can}, \mathbf{x}_{in})$
2: Compute eigenvalues of $A$: $\{\lambda_1, \lambda_2\} = \mathrm{eig}(A)$
3: Impose prior on shear and rescaling by $A$:
   if $|A| \in [0.9, 1.1]$ and $\lambda_{1,2} \in [0.6, 1.3]$ then
4:   Warp the image: $I_{reg} = f(I; A)$
5: else
6:   Face detector false positive: Report(“I is not a face”)
7: end if
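Algorithm 2 in code, under two assumptions we flag explicitly: $|A|$ is read as the determinant of the 2×2 linear part, and possibly complex eigenvalues are compared by magnitude (the page does not say how rotations are handled). The thresholds are those of Algorithm 2; everything else is a sketch.

```python
import numpy as np
import cv2

def register_face(image, x_in, x_can, out_size=(80, 80)):
    """Affine-warp so the eyes and mouth land on canonical locations,
    rejecting warps with implausible scale/shear as detector false positives.

    x_in, x_can : 3x2 arrays of (x, y) points (left eye, right eye, mouth).
    """
    src = np.asarray(x_in, np.float32)
    dst = np.asarray(x_can, np.float32)
    M = cv2.getAffineTransform(src, dst)   # 2x3 matrix, fixed uniquely by 3 pairs
    A = M[:, :2]
    lam = np.abs(np.linalg.eigvals(A))     # eigenvalue magnitudes (assumption)
    det = abs(np.linalg.det(A))            # reading |A| as det(A) (assumption)
    # Prior on shear and rescaling, thresholds as in Algorithm 2.
    if not (0.9 <= det <= 1.1 and np.all((lam >= 0.6) & (lam <= 1.3))):
        return None                        # "I is not a face"
    return cv2.warpAffine(image, M, out_size)
```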
2.3. Background Removal
The bounding box of a face, supplied by the face detector, typically contains significant background clutter. To realize a reliable comparison of two faces, segmentation into foreground (i.e. face) and background regions has to be performed. We show that the face outline can be robustly detected by combining a learnt prior on the face shape with a set of measurements of intensity discontinuity.

Figure 7. Automatically detected facial features: high accuracy is achieved in spite of wide variation in facial expression, pose, illumination and the presence of facial wear (glasses).
In detecting the face outline, we only consider points confined to a discrete mesh corresponding to angles equally spaced at $\Delta\alpha$ and radii at $\Delta r$, see Figure 9 (a). At each mesh point we measure the image intensity gradient in the radial direction: if its magnitude is locally maximal and greater than a threshold $t$, we assign it a constant high probability, and a constant low probability otherwise, see Figure 9 (a, b). Let $\mathbf{m}_i$ be the vector of probabilities corresponding to discrete radius values at angle $\alpha_i = i\Delta\alpha$, and $r_i$ the boundary location at the same angle. We seek the maximum a posteriori estimate of the boundary radii:
$$\{r_i\} = \arg\max_{\{r_i\}} P(r_1, \ldots, r_N \mid \mathbf{m}_1, \ldots, \mathbf{m}_N) \qquad (5)$$
$$= \arg\max_{\{r_i\}} P(\mathbf{m}_1, \ldots, \mathbf{m}_N \mid r_1, \ldots, r_N)\, P(r_1, \ldots, r_N)$$
We make the Naïve Bayes assumption for the first term in (5), whereas the second term we assume to be a first-order Markov chain. Formally:

$$P(\mathbf{m}_1, \ldots, \mathbf{m}_N \mid r_1, \ldots, r_N) = \prod_{i=1}^{N} P(\mathbf{m}_i \mid r_i), \qquad P(r_1, \ldots, r_N) = P(r_1) \prod_{i=2}^{N} P(r_i \mid r_{i-1})$$
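Under these assumptions the MAP radii can be computed exactly by dynamic programming over the angles. The sketch below is a Viterbi-style solver; it treats the contour as an open chain for simplicity and takes the emission and transition log-probabilities as inputs, since the learnt prior itself is not reproduced on this page.

```python
import numpy as np

def map_boundary(log_m, log_pairwise):
    """MAP estimate of boundary radii {r_i} under a first-order Markov prior.

    log_m        : (N_angles, N_radii) log-likelihoods log P(m_i | r_i)
    log_pairwise : (N_radii, N_radii) transition matrix log P(r_i | r_{i-1})
    """
    N, R = log_m.shape
    score = log_m[0].copy()                   # assumes a uniform prior on r_1
    back = np.zeros((N, R), dtype=int)
    for i in range(1, N):
        # cand[p, n] = best score ending in radius p, then stepping to radius n.
        cand = score[:, None] + log_pairwise
        back[i] = np.argmax(cand, axis=0)
        score = cand[back[i], np.arange(R)] + log_m[i]
    # Backtrack the best path.
    r = np.empty(N, dtype=int)
    r[-1] = int(np.argmax(score))
    for i in range(N - 1, 0, -1):
        r[i - 1] = back[i, r[i]]
    return r
```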

References
- V. N. Vapnik. The Nature of Statistical Learning Theory. Springer.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
- P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision.
- G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill.
- P. Viola and M. J. Jones. Robust real-time face detection (conference version).
Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "Automatic face recognition for film character retrieval in feature-length films"?

The objective of this work is to recognize all the frontal faces of a character in the closed world of a movie or situation comedy, given a small number of query faces. 

The main research direction the authors intend to pursue in the future is the development of a flexible model for learning person-specific manifolds, for example due to facial expression changes. The authors are very grateful to Mark Everingham for a number of helpful discussions and suggestions, and Krystian Mikolajczyk and Cordelia Schmid of INRIA Grenoble who supplied face detection code. 

The proposed approach achieved a reduction of 33% in the expected within-class signature image distance, while the effect on between-class distances was found to be statistically insignificant. 

Foreground/background segmentation produces a binary mask image $M$. As well as masking the corresponding face image $I_R$ (see Figure 10), the authors smoothly suppress image information around the boundary to achieve robustness to small errors in its localization:

$$M_F = M * \exp\left(-\left(\frac{r(x, y)}{4}\right)^2\right) \qquad (8)$$

$$I_F(x, y) = I_R(x, y)\, M_F(x, y) \qquad (9)$$
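A minimal sketch of the mask feathering in (8)-(9). Reading the ∗ in (8) as convolution, the kernel exp(−(r/4)²) is, up to normalization, a Gaussian with σ = 2√2; both that reading and the use of a normalized Gaussian filter below are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def feather_and_apply_mask(I_R, M, sigma=2 * 2 ** 0.5):
    """Soften a binary face mask and apply it, cf. equations (8)-(9).

    exp(-(r/4)^2) = exp(-r^2 / (2 * (2*sqrt(2))^2)), hence sigma = 2*sqrt(2).
    A normalized filter keeps the feathered mask M_F within [0, 1].
    """
    M_F = gaussian_filter(M.astype(float), sigma)   # equation (8): feathered mask
    return I_R * M_F                                # equation (9): masked face image
```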

The score of ρ = 1.0 can be seen to correspond to orderings which correctly cluster all the data (all the in-class faces are recalled first), 0.0 to those that invert the classes (the in-class faces are recalled last), while 0.5 is the expected score of a random ordering. 

The proposed approach of systematically removing particular imaging distortions – pose, background clutter, illumination and partial occlusion – has been demonstrated to consistently achieve high recall and precision rates.

The last step in processing a face image to produce its signature is the removal of illumination effects. Noting that these produce mostly slowly varying, low spatial frequency variations [11], the authors normalize for their effects by band-pass filtering, see Figure 3:

$$S = I_F * G_{\sigma=0.5} - I_F * G_{\sigma=8} \qquad (10)$$

This defines the signature image $S$. In Sections 2.1-2.4, a cascade of transformations applied to face images was described, producing a signature image insensitive to illumination, pose and background clutter.
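Equation (10) is a difference-of-Gaussians band-pass filter. A direct transcription using scipy; the σ values are the paper's, while the function name is ours.

```python
from scipy.ndimage import gaussian_filter

def signature_image(I_F):
    """Band-pass filter of equation (10): subtracting a wide Gaussian blur removes
    slowly varying illumination, while the narrow blur suppresses pixel noise."""
    return gaussian_filter(I_F, sigma=0.5) - gaussian_filter(I_F, sigma=8.0)
```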