
Automatic face recognition for film character retrieval in feature-length films

Ognjen Arandjelović and Andrew Zisserman
CVPR 2005, Vol. 1, pp. 860-867
TL;DR: It is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%), using a recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment.


This is the published version:
Arandjelovic, Ognjen and Zisserman, A. 2005, Automatic face recognition for film character retrieval in feature-length films, in CVPR 2005: Proceedings of the Computer Vision and Pattern Recognition Conference 2005, IEEE, Piscataway, New Jersey, pp. 860-867.
Available from Deakin Research Online:
http://hdl.handle.net/10536/DRO/DU:30058433
Reproduced with the kind permission of the copyright owner.
Copyright: 2005, IEEE

Automatic Face Recognition for Film Character Retrieval in Feature-Length Films
Ognjen Arandjelović    Andrew Zisserman
Engineering Department, University of Oxford, UK
E-mail: oa214@cam.ac.uk, az@robots.ox.ac.uk
Abstract
The objective of this work is to recognize all the frontal faces of a character in the closed world of a movie or situation comedy, given a small number of query faces. This is challenging because faces in a feature-length film are relatively uncontrolled with a wide variability of scale, pose, illumination, and expressions, and also may be partially occluded. We develop a recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment. In particular there are three areas of novelty: (i) we suppress the background surrounding the face, enabling the maximum area of the face to be retained for recognition rather than a subset; (ii) we include a pose refinement step to optimize the registration between the test image and face exemplar; and (iii) we use robust distance to a sub-space to allow for partial occlusion and expression change. The method is applied and evaluated on several feature-length films. It is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%).
1. Introduction
The problem of automatic face recognition (AFR) concerns matching a detected (roughly localized) face against a database of known faces with associated identities. Although very intuitive to humans, this task still poses a significant challenge to computer methods despite the vast amount of research behind it; see [2, 19] for surveys. Much AFR research has concentrated on the user authentication paradigm. In contrast, we consider the content-based multimedia retrieval setup: our aim is to retrieve, and rank by confidence, film shots based on the presence of specific actors. A query to the system consists of the user choosing the person of interest in one or more keyframes. Possible applications include rapid DVD browsing or multimedia-oriented web search.
Figure 1. Automatically detected faces in a typical frame from the feature-length film “Groundhog day”. The background is cluttered; pose, expression and illumination are highly variable.
We proceed from the face detection stage, assuming localized faces. Face detection technology is fairly mature and a number of reliable face detectors have been built, see [13, 16, 18]. We use a local implementation of the method of Schneiderman and Kanade [16] and consider a face to be correctly detected if both eyes and the mouth are visible, see Figure 1. In a typical feature-length film we obtain 2000-5000 face images, which result from a cast of 10-20 primary and secondary characters.
Problem challenges. A number of factors other than identity influence the way a face appears in an image. Lighting conditions, and especially light angle, drastically change the appearance of a face [1]. Facial expressions, including closed or partially closed eyes, complicate the problem, as does head pose. Partial occlusions, whether caused by objects in front of the face, a change of hair style, or a newly grown beard or moustache, also cause problems. Films therefore provide an uncontrolled, realistic working environment for face recognition algorithms.
Method overview. Our approach consists of computing a numerical value, a distance, expressing the degree of belief that two face images belong to the same person. A low distance, ideally zero, signifies that the images are of the same person, whilst a large one signifies that they are of different people.
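To make the retrieval setup concrete, the following minimal sketch shows how such a pairwise distance could drive shot ranking. It is illustrative only: the function and variable names are ours, not the paper's, and a plain Euclidean distance stands in for the robust measure developed later.

```python
import numpy as np

def rank_shots(query_signatures, shot_faces, distance):
    """Rank shots by the best match between any face in the shot and any query face.

    query_signatures : list of 2D arrays, signature images of the queried actor
    shot_faces       : dict mapping shot_id -> list of signature images in that shot
    distance         : callable d(a, b) -> float, low means "same person"
    """
    scores = {}
    for shot_id, faces in shot_faces.items():
        if faces:
            # A shot's confidence is its single best (lowest) query-to-face distance.
            scores[shot_id] = min(distance(q, f) for q in query_signatures for f in faces)
    return sorted(scores.items(), key=lambda kv: kv[1])  # most confident shots first

# Placeholder metric; the paper develops a robust subspace distance instead.
euclidean = lambda a, b: float(np.linalg.norm(a - b))
```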

Figure 2. The effects of the imaging conditions – illumination (a), pose (b) and expression (c) – on the appearance of a face are dramatic and present the main difficulty for AFR.
The method involves computing a series of transformations of the original image, each aimed at removing the effects of a particular extrinsic imaging factor. The end result is a signature image of a person, which depends mainly on the person's identity (and expression) and can be readily classified. This is summarized in Figure 3 and Algorithm 1.
1.1. Previous Work
Little work in the literature addresses AFR in a setup similar to ours. Fitzgibbon and Zisserman [11] investigated face clustering in feature films, though without explicitly using facial features for registration. Berg et al. [3] consider the problem of clustering detected frontal faces extracted from web news pages. In a similar manner to us, affine registration with an underlying SVM-based facial feature detector is used for face rectification. The classification is then performed in a Kernel PCA space using combined image and contextual text-based features. The problem we consider is more difficult in two respects: (i) the variation in imaging conditions in films is typically greater than in newspaper photographs, and (ii) we do not use any type of information other than visual cues (i.e. no text). The difference in difficulty is apparent by comparing the examples in [3] with those used for evaluation in Section 3. For example, in [3] the face image size is restricted to be at least 86 × 86 pixels, whilst a significant number of faces we use are of lower resolution.
Everingham and Zisserman [9] consider AFR in situation comedies. However, rather than using facial feature detection, a quasi-3D model of the head is used to correct for varying pose. Temporal information via shot tracking is exploited to enrich the training corpus. In contrast, we do not use any temporal information, and the use of local features (Section 2.1) allows us to compare two face images in spite of partial occlusions (Section 2.5).
Algorithm 1 Method overview
Input: novel image $I$, training signature image $S_r$.
Output: distance $d(I, S_r)$.
1: Facial feature localization: $\{x_i\} \leftarrow I$
2: Pose effects, registration by affine warping: $I_R = f(I, \{x_i\}, S_r)$
3: Background clutter, face outline detection: $I_F = I_R \cdot \mathrm{mask}(I_R)$
4: Illumination effects, band-pass filtering: $S = I_F * B$
5: Pose effects, registration refinement: $S_f = \hat{f}(I_F, S_r)$
6: Occlusion effects, robust distance measure: $d(I, S_r) = \| S_r - S_f \|$
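Read as code, the cascade is a straight pipeline. The sketch below mirrors Algorithm 1's control flow with identity stand-ins for the stages; all function bodies here are placeholder assumptions rather than the authors' implementations, with only the band-pass step made concrete using the filter from Section 2.4.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Identity stand-ins so the pipeline runs end to end; the real stages are
# described in Sections 2.1-2.5 of the paper.
def detect_facial_features(I):
    return [(20.0, 30.0), (50.0, 30.0), (35.0, 60.0)]   # dummy eyes + mouth

def affine_register(I, xs, S_r):
    return I                                            # Section 2.2

def face_mask(I_R):
    return np.ones_like(I_R)                            # Section 2.3

def band_pass(I_F):
    # Difference of Gaussians, as in equation (10).
    return gaussian_filter(I_F, 0.5) - gaussian_filter(I_F, 8)

def refine_registration(S, S_r):
    return S                                            # step 5, pose refinement

def robust_distance(S_r, S_f):
    return float(np.linalg.norm(S_r - S_f))             # robust version: Section 2.5

def face_distance(I, S_r):
    """Algorithm 1: each stage normalizes one extrinsic factor before comparison."""
    xs = detect_facial_features(I)      # 1: localize eyes and mouth
    I_R = affine_register(I, xs, S_r)   # 2: pose normalization by affine warping
    I_F = I_R * face_mask(I_R)          # 3: background clutter suppression
    S = band_pass(I_F)                  # 4: illumination normalization
    S_f = refine_registration(S, S_r)   # 5: registration refinement
    return robust_distance(S_r, S_f)    # 6: occlusion-tolerant distance
```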
[Figure 3 diagram: Original Image → (facial features, affine warp) → Normalized Pose → (background removal, using a probabilistic model of the face outline) → Background Clutter Removed → (band-pass filter) → Normalized Illumination → Face Signature Image; the feature detectors are SVM classifiers trained on features training data.]
Figure 3. Face representation: each step in the cascade produces a result invariant to a specific extrinsic factor.
2. Method Details
In the proposed framework, the first step in processing a face image is the normalization of the subject's pose, i.e. registration. After the face detection stage, faces are only roughly localized and aligned – more sophisticated registration methods are needed to correct for the effects of varying pose. One way of doing this is to “lock onto” the characteristic facial points. In our method, these facial points are the locations of the mouth and the eyes.

Figure 4. The difficulties of facial feature detection: without context, distinguishing features under low resolution and bad illumination is a hard task even for a human. Shown are a mouth and an eye that, although easily recognized within the context of the whole image, are very similar in isolation.
2.1. Facial Feature Detection
In the proposed algorithm, Support Vector Machines¹ (SVMs) [6, 17] are used for facial feature detection. For a related approach see [3]; alternative methods include pictorial structures [10] or the method of Cristinacce et al. [7]. We represent each facial feature, i.e. the image patch surrounding it, by a feature vector. An SVM with a set of parameters (kernel type, its bandwidth and a regularization constant) is then trained on a part of the training data and its performance iteratively optimized on the remainder. The final detector is evaluated by a one-time run on unseen data.
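The train/validate protocol above can be made concrete with a small grid search. The paper used LibSVM; the sketch below uses scikit-learn's SVC (which wraps LibSVM) purely to illustrate the idea, and the grid values and split ratio are our assumptions, not the paper's.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_feature_detector(X, y):
    """Fit an SVM patch classifier, tuning (C, gamma) on a held-out split.

    X : (n_samples, n_features) patch feature vectors; y : {0, 1} labels.
    """
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    best, best_acc = None, -1.0
    for C in (0.1, 1.0, 10.0):            # regularization constant
        for gamma in (1e-3, 1e-2, 1e-1):  # RBF kernel bandwidth
            clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_fit, y_fit)
            acc = clf.score(X_val, y_val)
            if acc > best_acc:
                best, best_acc = clf, acc
    # Final evaluation should be a one-time run on unseen data, as in the paper.
    return best
```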
2.1.1 Training
For training we use manually localized facial features in a set of 300 randomly chosen faces from the feature-length film “Groundhog day”. Examples are extracted by taking rectangular image patches centred at feature locations (see Figures 4 and 5). We represent each patch $I \in \mathbb{R}^{N \times M}$ with a feature vector $\mathbf{v} \in \mathbb{R}^{2NM}$ combining appearance and gradient information (we used $N = 17$ and $M = 21$):

$$v_A(Ny + x) = I(x, y) \qquad (1)$$

$$v_G(Ny + x) = |\nabla I(x, y)| \qquad (2)$$

$$\mathbf{v} = \begin{bmatrix} \mathbf{v}_A \\ \mathbf{v}_G \end{bmatrix} \qquad (3)$$
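A direct transcription of (1)-(3). If the patch is stored as an (M, N) array with patch[y, x] = I(x, y), row-major flattening gives exactly the index Ny + x of equations (1) and (2); that array-layout convention is our assumption.

```python
import numpy as np

def patch_feature_vector(patch):
    """Stack raw appearance with gradient magnitude, as in equations (1)-(3).

    patch : 2D greyscale array of shape (M, N); patch[y, x] = I(x, y).
            The paper uses N = 17, M = 21.
    """
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)           # finite-difference image gradients
    v_A = patch.flatten()                 # appearance: v_A(N*y + x) = I(x, y)
    v_G = np.hypot(gx, gy).flatten()      # gradient magnitude: |grad I(x, y)|
    return np.concatenate([v_A, v_G])     # v = [v_A; v_G], equation (3)
```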
Local information. In the proposed method, implicit local information is included for increased robustness. This is done by complementing the image appearance vector $\mathbf{v}_A$ with the greyscale intensity gradient vector $\mathbf{v}_G$, as in (3).
Synthetic data. For robust classification, it is important that training data sets are representative of the whole spaces that are discriminated between. In uncontrolled imaging conditions, the appearance of facial features exhibits a lot of variation, requiring an appropriately large training corpus. This makes the approach with manual feature extraction impractical. In our method, a large portion of the training data (1500 out of 1800 training examples) was synthetically generated. Seeing that the surface of the face is smooth and roughly fronto-parallel, its 3D motion produces locally affine-like effects in the image plane. Therefore, we synthesize training examples by applying random affine perturbations to the manually detected set.

¹We used the LibSVM implementation freely available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Figure 5. A subset of the data (1800 examples in total) used to train the eye detector. Notice the low resolution and the importance of the surrounding image context for precise localization.
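A sketch of this augmentation, assuming OpenCV for warping. The perturbation magnitudes, patch size and sample count are illustrative guesses rather than the paper's values, and no bounds checking is done near the image border.

```python
import numpy as np
import cv2

def synthesize_examples(image, centre, n=5, patch_size=(17, 21), rng=None):
    """Generate training patches by small random affine perturbations of the
    image around a manually marked feature location (x, y)."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    pw, ph = patch_size
    patches = []
    for _ in range(n):
        # Identity plus small random shear/scale, and a small random translation.
        A = np.eye(2) + rng.uniform(-0.1, 0.1, size=(2, 2))
        t = rng.uniform(-1.5, 1.5, size=2)
        M = np.hstack([A, t[:, None]]).astype(np.float32)
        warped = cv2.warpAffine(image, M, (w, h))
        # The marked feature location moves under the same affine map.
        cx, cy = A @ np.asarray(centre, float) + t
        x0, y0 = int(cx) - pw // 2, int(cy) - ph // 2
        patches.append(warped[y0:y0 + ph, x0:x0 + pw])
    return patches
```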
2.1.2 SVM-based Feature Detector
SVMs only provide a classification decision for individual feature vectors, but no associated probabilistic information. Therefore, performing classification on all image patches produces as a result a binary image (a feature is either present or not at a particular location) from which only one feature location needs to be selected.

Our method is based on the observation that, due to the robustness to noise of SVMs, the binary image output consists of connected components of positive classifications (we will refer to these as clusters), see Figure 6. We use a prior on feature locations to focus on the cluster of interest. Priors corresponding to the 3 features are assumed to be independent and Gaussian (2D, with full covariance matrices) and are learnt from the training corpus of 300 manually localized features described in Section 2.1.1. We then consider the total ‘evidence’ for a feature within each cluster:

$$\int_{\mathbf{x} \in S} P(\mathbf{x})\, d\mathbf{x} \qquad (4)$$

where $S$ is a cluster and $P(\mathbf{x})$ the Gaussian prior on the facial feature location. An unbiased feature location estimate with $\sigma \approx 1.5$ pixels was obtained by choosing the mean of the cluster with the largest evidence as the final feature location, see Figures 6 and 7.
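A discrete version of the evidence integral in (4): the integral over a cluster becomes a sum of the Gaussian prior over the cluster's pixels. The library calls are standard scipy; the function name and interface are ours.

```python
import numpy as np
from scipy.ndimage import label
from scipy.stats import multivariate_normal

def select_feature_location(binary_map, prior_mean, prior_cov):
    """Pick the feature location as the mean of the positive-classification
    cluster with the largest total prior 'evidence', cf. equation (4)."""
    labels, n_clusters = label(binary_map)        # connected components
    prior = multivariate_normal(prior_mean, prior_cov)
    best_mean, best_evidence = None, -np.inf
    for k in range(1, n_clusters + 1):
        ys, xs = np.nonzero(labels == k)
        pts = np.stack([xs, ys], axis=1)          # (x, y) coordinates in cluster k
        evidence = prior.pdf(pts).sum()           # discrete sum replacing the integral
        if evidence > best_evidence:
            best_evidence, best_mean = evidence, pts.mean(axis=0)
    return best_mean  # (x, y), or None if there were no positive classifications
```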
2.2. Registration
In the proposed method dense point correspondences are implicitly or explicitly used for background clutter removal, partial occlusion detection and signature image comparison (Sections 2.3-2.5). To this end, images of faces are affine warped to have salient facial features aligned. The six transformation parameters are uniquely determined from three pairs of point correspondences between the detected facial features (the eyes and the mouth) and their canonical locations. In contrast to global appearance-based methods (e.g. [5, 8]) our approach is more robust to partial occlusion. It is summarized in Algorithm 2, with typical results shown in Figure 8.

Figure 6. Efficient SVM-based eye detection. 1: A prior on the feature location restricts the search region. 2: Only 25% of the locations are initially classified. 3: Morphological dilation is used to approximate the dense classification result from the sparse output.
Algorithm 2 Face Registration
Input: canonical facial feature locations $\mathbf{x}_{can}$, face image $I$, facial feature locations $\mathbf{x}_{in}$.
Output: registered image $I_{reg}$.
1: Estimate the affine warp matrix: $A(\mathbf{x}_{can}, \mathbf{x}_{in})$
2: Compute eigenvalues of $A$: $\{\lambda_1, \lambda_2\} = \mathrm{eig}(A)$
3: Impose prior on shear and rescaling by $A$:
   if $|A| \in [0.9, 1.1]$ and $\lambda_{1,2} \in [0.6, 1.3]$ then
4:   Warp the image: $I_{reg} = f(I; A)$
5: else
6:   Face detector false positive: Report(“I is not a face”)
7: end if
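Algorithm 2 in code, under two assumptions we flag explicitly: $|A|$ is read as the determinant of the 2×2 linear part, and possibly complex eigenvalues are compared by magnitude (the page does not say how rotations are handled). The thresholds are those of Algorithm 2; everything else is a sketch.

```python
import numpy as np
import cv2

def register_face(image, x_in, x_can, out_size=(80, 80)):
    """Affine-warp so the eyes and mouth land on canonical locations,
    rejecting warps with implausible scale/shear as detector false positives.

    x_in, x_can : 3x2 arrays of (x, y) points (left eye, right eye, mouth).
    """
    src = np.asarray(x_in, np.float32)
    dst = np.asarray(x_can, np.float32)
    M = cv2.getAffineTransform(src, dst)   # 2x3 matrix, fixed uniquely by 3 pairs
    A = M[:, :2]
    lam = np.abs(np.linalg.eigvals(A))     # eigenvalue magnitudes (assumption)
    det = abs(np.linalg.det(A))            # reading |A| as det(A) (assumption)
    # Prior on shear and rescaling, thresholds as in Algorithm 2.
    if not (0.9 <= det <= 1.1 and np.all((lam >= 0.6) & (lam <= 1.3))):
        return None                        # "I is not a face"
    return cv2.warpAffine(image, M, out_size)
```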
2.3. Background Removal
The bounding box of a face, supplied by the face detector, typically contains significant background clutter. To realize a reliable comparison of two faces, segmentation into foreground (i.e. face) and background regions has to be performed. We show that the face outline can be robustly detected by combining a learnt prior on the face shape with a set of measurements of intensity discontinuity.

Figure 7. Automatically detected facial features: high accuracy is achieved in spite of wide variation in facial expression, pose, illumination and the presence of facial wear (glasses).
In detecting the face outline, we only consider points confined to a discrete mesh corresponding to angles equally spaced at $\Delta\alpha$ and radii at $\Delta r$, see Figure 9 (a). At each mesh point we measure the image intensity gradient in the radial direction: if its magnitude is locally maximal and greater than a threshold $t$, we assign it a constant high probability, and a constant low probability otherwise, see Figure 9 (a, b). Let $\mathbf{m}_i$ be the vector of probabilities corresponding to discrete radius values at angle $\alpha_i = i\Delta\alpha$, and $r_i$ the boundary location at the same angle. We seek the maximum a posteriori estimate of the boundary radii:
$$\{r_i\} = \arg\max_{\{r_i\}} P(r_1, \ldots, r_N \mid \mathbf{m}_1, \ldots, \mathbf{m}_N) \qquad (5)$$
$$= \arg\max_{\{r_i\}} P(\mathbf{m}_1, \ldots, \mathbf{m}_N \mid r_1, \ldots, r_N)\, P(r_1, \ldots, r_N)$$
We make the Naïve Bayes assumption for the first term in (5), whereas the second term we assume to be a first-order Markov chain. Formally:

$$P(\mathbf{m}_1, \ldots, \mathbf{m}_N \mid r_1, \ldots, r_N) = \prod_{i=1}^{N} P(\mathbf{m}_i \mid r_i), \qquad P(r_1, \ldots, r_N) = P(r_1) \prod_{i=2}^{N} P(r_i \mid r_{i-1})$$
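Under these assumptions the MAP radii can be computed exactly by dynamic programming over the angles. The sketch below is a Viterbi-style solver; it treats the contour as an open chain for simplicity and takes the emission and transition log-probabilities as inputs, since the learnt prior itself is not reproduced on this page.

```python
import numpy as np

def map_boundary(log_m, log_pairwise):
    """MAP estimate of boundary radii {r_i} under a first-order Markov prior.

    log_m        : (N_angles, N_radii) log-likelihoods log P(m_i | r_i)
    log_pairwise : (N_radii, N_radii) transition matrix log P(r_i | r_{i-1})
    """
    N, R = log_m.shape
    score = log_m[0].copy()                   # assumes a uniform prior on r_1
    back = np.zeros((N, R), dtype=int)
    for i in range(1, N):
        # cand[p, n] = best score ending in radius p, then stepping to radius n.
        cand = score[:, None] + log_pairwise
        back[i] = np.argmax(cand, axis=0)
        score = cand[back[i], np.arange(R)] + log_m[i]
    # Backtrack the best path.
    r = np.empty(N, dtype=int)
    r[-1] = int(np.argmax(score))
    for i in range(N - 1, 0, -1):
        r[i - 1] = back[i, r[i]]
    return r
```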

References
- V. N. Vapnik. The Nature of Statistical Learning Theory. Springer.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
- P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision.
- G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill.
- P. Viola and M. J. Jones. Robust real-time face detection (conference version).
Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "Automatic face recognition for film character retrieval in feature-length films"?

The objective of this work is to recognize all the frontal faces of a character in the closed world of a movie or situation comedy, given a small number of query faces. 

The main research direction the authors intend to pursue in the future is the development of a flexible model for learning person-specific manifolds, for example due to facial expression changes. The authors are very grateful to Mark Everingham for a number of helpful discussions and suggestions, and Krystian Mikolajczyk and Cordelia Schmid of INRIA Grenoble who supplied face detection code. 

The proposed approach achieved a reduction of 33% in the expected within-class signature image distance, while the effect on between-class distances was found to be statistically insignificant. 

Foreground/background segmentation produces a binary mask image $M$. As well as masking the corresponding face image $I_R$ (see Figure 10), the authors smoothly suppress image information around the boundary to achieve robustness to small errors in its localization:

$$M_F = M * \exp\left(-\left(\frac{r(x, y)}{4}\right)^2\right) \qquad (8)$$

$$I_F(x, y) = I_R(x, y)\, M_F(x, y) \qquad (9)$$
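A minimal sketch of the mask feathering in (8)-(9). Reading the ∗ in (8) as convolution, the kernel exp(−(r/4)²) is, up to normalization, a Gaussian with σ = 2√2; both that reading and the use of a normalized Gaussian filter below are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def feather_and_apply_mask(I_R, M, sigma=2 * 2 ** 0.5):
    """Soften a binary face mask and apply it, cf. equations (8)-(9).

    exp(-(r/4)^2) = exp(-r^2 / (2 * (2*sqrt(2))^2)), hence sigma = 2*sqrt(2).
    A normalized filter keeps the feathered mask M_F within [0, 1].
    """
    M_F = gaussian_filter(M.astype(float), sigma)   # equation (8): feathered mask
    return I_R * M_F                                # equation (9): masked face image
```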

The score of ρ = 1.0 can be seen to correspond to orderings which correctly cluster all the data (all the in-class faces are recalled first), 0.0 to those that invert the classes (the in-class faces are recalled last), while 0.5 is the expected score of a random ordering. 

The proposed approach of systematically removing particular imaging distortions – pose, background clutter, illumination and partial occlusion – has been demonstrated to consistently achieve high recall and precision rates.

The last step in processing a face image to produce its signature is the removal of illumination effects. Noting that these produce mostly slowly varying, low spatial frequency variations [11], the authors normalize for their effects by band-pass filtering, see Figure 3:

$$S = I_F * G_{\sigma=0.5} - I_F * G_{\sigma=8} \qquad (10)$$

This defines the signature image $S$. In Sections 2.1-2.4, a cascade of transformations applied to face images was described, producing a signature image insensitive to illumination, pose and background clutter.
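Equation (10) is a difference-of-Gaussians band-pass filter. A direct transcription using scipy; the σ values are the paper's, while the function name is ours.

```python
from scipy.ndimage import gaussian_filter

def signature_image(I_F):
    """Band-pass filter of equation (10): subtracting a wide Gaussian blur removes
    slowly varying illumination, while the narrow blur suppresses pixel noise."""
    return gaussian_filter(I_F, sigma=0.5) - gaussian_filter(I_F, sigma=8.0)
```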