
Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition

Author(s): Wong, Yongkang; Chen, Shaokang; Mau, Sandra; Sanderson, Conrad; Lovell, Brian C.
Published: 2011
Conference Title: CVPR 2011 WORKSHOPS
Version: Accepted Manuscript (AM)
DOI: https://doi.org/10.1109/cvprw.2011.5981881
Copyright Statement: © 2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Downloaded from: http://hdl.handle.net/10072/401031
Griffith Research Online: https://research-repository.griffith.edu.au

Patch-based Probabilistic Image Quality Assessment for
Face Selection and Improved Video-based Face Recognition
Yongkang Wong, Shaokang Chen, Sandra Mau, Conrad Sanderson, Brian C. Lovell
NICTA, PO Box 6020, St Lucia, QLD 4067, Australia
The University of Queensland, School of ITEE, QLD 4072, Australia
Abstract
In video-based face recognition, face images are typically captured over multiple frames in uncontrolled conditions, where head pose, illumination, shadowing, motion blur and focus change over the sequence. Additionally, inaccuracies in face localisation can also introduce scale and alignment variations. Using all face images, including images of poor quality, can actually degrade face recognition performance. While one solution is to use only the ‘best’ subset of images, current face selection techniques are incapable of simultaneously handling all of the abovementioned issues. We propose an efficient patch-based face image quality assessment algorithm which quantifies the similarity of a face image to a probabilistic face model, representing an ‘ideal’ face. Image characteristics that affect recognition are taken into account, including variations in geometric alignment (shift, rotation and scale), sharpness, head pose and cast shadows. Experiments on the FERET and PIE datasets show that the proposed algorithm is able to identify images which are simultaneously the most frontal, aligned, sharp and well illuminated. Further experiments on a new video surveillance dataset (termed ChokePoint) show that the proposed method provides better face subsets than existing face selection techniques, leading to significant improvements in recognition accuracy.
1. Introduction
Video-based identity inference in surveillance conditions is challenging due to a variety of factors, including the subjects’ motion, the uncontrolled nature of the subjects, variable lighting, and poor quality CCTV video recordings. This results in issues for face recognition such as low resolution, blurry images (due to motion or loss of focus), large pose variations, and low contrast [14, 28, 36, 41]. While recent face recognition algorithms can handle faces with moderately challenging illumination conditions [15, 17, 24, 28], strong illumination variations (causing cast shadows [30] and self-shadowing) remain problematic [31].
One approach to overcome the impact of poor quality images is to assume that such images are outliers in a sequence. This includes approaches like exemplar extraction using clustering techniques (eg. k-means clustering [13]) and statistical model approaches for outlier removal [6]. However, these approaches are not likely to work when most of the images in the sequence have poor quality, since the good quality images would then be classified as outliers.
Another approach is explicit subset selection, where a face quality assessment is automatically made on each image, either to remove poor quality face images, or to select a subset comprised of high quality images [10, 20, 21, 33]. This improves recognition performance, with the additional benefit of reducing the overall computational load during feature extraction and matching [19]. The challenge in this approach is finding a good definition of “face quality”.
Several face image standards have been proposed for face quality assessment (eg. ISO/IEC 19794-5 [1] and ICAO 9303 [2]). In these standards, quality can be divided into: (i) image specific qualities such as sharpness, contrast and compression artifacts, and (ii) face specific qualities such as face geometry, pose, eye detectability and illumination angles.

Based in part on the above standards, many approaches have been proposed to analyse various face and image properties. For example, face pose estimation using tree structured multiple pose estimators [39], and face alignment estimation using template matching [7]. Asymmetry analysis has been proposed to simultaneously estimate two qualities: out-of-plane rotation and non-frontal illumination [10, 29, 40].
Since face recognition performance is simultaneously impacted by multiple factors, being able to detect one or two qualities is insufficient for robust subset selection. One approach to simultaneously detect multiple quality characteristics is through a fusion of individual face and image quality measurements. Nasrollahi and Moeslund [21] proposed a weighted quality fusion approach to combine out-of-plane rotation, sharpness, brightness, and image resolution qualities. Rua et al. [26] proposed a similar quality assessment approach, using asymmetry analysis and two sharpness measurements. Hsu et al. [16] proposed to learn fusion parameters on multiple quality scores to achieve maximum correlation with matching scores between face pairs.

Another proposed fusion approach uses a Bayesian network to model the relationships among qualities, image features and matching scores [22]. The main drawbacks of the above fusion approaches are:
- Fusion-based approaches only perform as well as their individual classifiers. For example, if a pose estimation algorithm requires accurate facial feature localisation, the whole fusion framework will fail in the cases where that pose algorithm fails (such as in low resolution CCTV footage) [35].
- As various properties are measured individually and have different influences on face quality, it may be difficult to combine them into a single quality score for the purposes of image selection.
- As multiple classifiers are involved, fusion approaches are typically more time consuming and hence may not be suitable for real-time surveillance applications.
- Since face matching scores are heavily dependent on system-specific details (including the input features, matching algorithms and training images), quality assessment approaches that learn a fusion model based on match scores end up being closely tied to the particular system configuration and hence need to be retrained for each system.
Simultaneously detecting multiple quality characteristics can also be accomplished by learning a generic model to define the ‘ideal’ quality. Luo [18] proposed a learning based approach where the quality model is trained to correlate with manually labelled quality scores. However, given the subjective nature of human labelling, and the fact that humans may not know what characteristics work best for automatic face recognition algorithms, this approach may not generate the best quality model for face recognition.
In this paper we propose a straightforward and effective patch-based face quality assessment algorithm, targeted towards handling images obtained in surveillance conditions. It quantifies the similarity of a given face to a probabilistic face model, representing an ‘ideal’ face, via patch-based local analysis. Without resorting to fusion, the proposed algorithm outputs a single score for each image, with the score simultaneously reflecting the degree of alignment errors, pose variations, shadowing, and image sharpness (underlying resolution). Localisation of facial features (ie. eyes, nose, mouth) is not required.
We continue the paper as follows. In Section 2 we describe the proposed quality assessment algorithm. Still image and video datasets used in the experiments are briefly described in Section 3. Extensive performance comparisons against existing techniques are given in Section 4 (on still images) and Section 5 (on surveillance videos). The main findings are discussed in Section 6.
2. Probabilistic Face Quality Assessment
The proposed algorithm is comprised of five steps: (1) pixel-based image normalisation, (2) patch extraction and normalisation, (3) feature extraction from each patch, (4) local probability calculation, and (5) overall quality score generation via integration of local probabilities. These steps are elaborated below; an illustrative code sketch follows the list.
1. For a given image I, we perform non-linear pre-processing (log transform) to reduce the dynamic range of the data. Following [9], the normalised image $I_{\log}$ is calculated using:

$I_{\log}(r, c) = \ln[\, I(r, c) + 1 \,] \quad (1)$

where I(r, c) is the pixel intensity located at (r, c). Logarithm normalisation amplifies low intensity pixels and compresses high intensity pixels. This property is helpful in reducing the intensity differences between skin tones.
2. The transformed image $I_{\log}$ is divided into N overlapping blocks (patches). Each block $b_i$ has a size of n × n pixels and overlaps neighbouring blocks by t pixels. To accommodate contrast variations between face images, each patch is normalised to have zero mean and unit variance [37].
3. From each block, a 2D Discrete Cosine Transform (DCT) feature vector is extracted [11]. Excluding the 0-th DCT component (as it carries no information after the previous normalisation), the top d low frequency components are retained. The low frequency components retain generic facial textures [12], while largely omitting person-specific information. At the same time, cast shadows [37] as well as variations in pose and alignment can alter the local textures.
4. For each block location i, the probability of the corresponding feature vector $\mathbf{x}_i$ is calculated using a location specific probabilistic model:

$p(\mathbf{x}_i \,|\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \dfrac{\exp\!\left[ -\tfrac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_i)^{\mathsf{T}} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_i) \right]}{(2\pi)^{d/2} \, |\boldsymbol{\Sigma}_i|^{1/2}} \quad (2)$

where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are the mean and covariance matrix of a normal distribution. The model for each location is trained using a pool of frontal faces with frontal illumination and neutral expression. All of the training face images are first scaled and aligned to a fixed size, with each eye located at a fixed location. We emphasise that during testing, the faces do not need to be aligned.
5. By assuming that the model for each location is independent, an overall probabilistic quality score Q for image I, comprised of N blocks, is calculated using:

$Q(I) = \sum_{i=1}^{N} \log \, p(\mathbf{x}_i \,|\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \quad (3)$
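To make the above steps concrete, the following is a minimal Python sketch of the per-patch machinery (steps 1 to 4). It is our own illustrative reconstruction, not the authors' code: the helper names, the zigzag-style DCT coefficient ordering, and the small diagonal loading term added for numerical stability are all assumptions.

```python
import numpy as np
from scipy.fft import dctn  # multidimensional type-II DCT

def extract_patches(img_log, n=8, t=7):
    """Step 2: slide an n x n window over the log-transformed image,
    with t pixels of overlap between neighbouring blocks (stride n - t)."""
    stride = n - t
    rows, cols = img_log.shape
    patches = []
    for r in range(0, rows - n + 1, stride):
        for c in range(0, cols - n + 1, stride):
            patches.append(img_log[r:r + n, c:c + n])
    return patches

def patch_features(patch, d=3):
    """Steps 2-3: zero mean / unit variance normalisation, then the top d
    low-frequency 2D DCT coefficients, excluding the 0-th component."""
    p = patch - patch.mean()
    std = p.std()
    if std > 1e-12:                      # guard against flat patches
        p = p / std
    coeffs = dctn(p, norm='ortho')
    # Illustrative low-frequency ordering over the top-left of the DCT
    # grid; (0, 0) is the 0-th coefficient and is skipped.
    order = [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1)]
    return np.array([coeffs[i, j] for i, j in order[:d]])

def train_location_models(train_features):
    """Step 4 (training): fit one Gaussian per block location. Entry i of
    train_features is a (num_faces x d) matrix of vectors for location i."""
    models = []
    for X in train_features:
        mu = X.mean(axis=0)
        # Diagonal loading keeps the covariance invertible (our safeguard).
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models.append((mu, sigma))
    return models

def log_gaussian(x, mu, sigma):
    """Step 4 (testing): log of the multivariate normal density, Eqn. (2)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (maha + x.size * np.log(2.0 * np.pi) + logdet)
```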

The resulting quality score represents the probabilistic similarity of a given face to an “ideal” face (as represented by a set of training images). A higher quality score reflects better image quality.
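Building on the helpers above, the overall score of Eqn. (3) and the face selection use-case could then be sketched as follows. Again, this is an illustrative reconstruction; select_best_faces and its parameter N are our own names, mirroring the subset selection experiments reported later.

```python
def quality_score(img, models, n=8, t=7, d=3):
    """Q(I) from Eqn. (3): the sum of per-location Gaussian
    log-probabilities of the patch feature vectors. Assumes the models
    were trained with the same patch configuration (n, t, d)."""
    img_log = np.log(img.astype(np.float64) + 1.0)   # step 1, Eqn. (1)
    patches = extract_patches(img_log, n=n, t=t)     # step 2
    return sum(log_gaussian(patch_features(p, d=d), mu, sigma)
               for p, (mu, sigma) in zip(patches, models))

def select_best_faces(face_images, models, N=5):
    """Rank a face set by quality and keep the N highest-scoring images."""
    scores = [quality_score(f, models) for f in face_images]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [face_images[i] for i in ranked[:N]]
```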
3. Face Datasets
In this section, we briefly describe the FERET, PIE and ChokePoint face datasets, as well as their setup for our experiments.

FERET [23] and PIE [32] are used to analyse how accurately the proposed quality assessment algorithm selects the best quality images with several desired characteristics, compared to other existing methods. In total, there are 1124 unique subjects in the training phase and 1263 subjects in the test phase.

The ChokePoint dataset contains surveillance videos. It is used to study the improvement in verification performance gained from subset selection, using the proposed quality method as well as other approaches.
3.1. Setup of Still Image Datasets: FERET and PIE
To study the performance of the proposed method in terms of correctly selecting images with desired characteristics, we simulated blurring as well as four alignment errors using images from the ‘fb’ subset of FERET. Experiments with pose variations (out-of-plane rotation) used dedicated subsets from FERET and PIE. Experiments with cast shadows used the illumination subset of PIE.
The generated alignment errors¹ are: horizontal shift and vertical shift (using displacements of 0, ±2, ±4, ±6, ±8 pixels), in-plane rotation (using rotations of 0°, ±10°, ±20°, ±30°), and scale variations (using scaling factors of 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3). For sharpness variations, each original image is first downscaled to three sizes (48 × 48, 32 × 32 and 16 × 16 pixels) then rescaled to the baseline size of 64 × 64 pixels. See Fig. 1 for examples.
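As an illustration of how such variations can be generated, the sketch below uses scipy.ndimage; the interpolation settings and border handling are our assumptions, since the paper does not specify them, and cropping or padding the scaled result back to 64 × 64 pixels is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def shift_face(img, dx=0, dy=0):
    """Horizontal/vertical shift by (dx, dy) pixels."""
    return ndimage.shift(img, (dy, dx), mode='nearest')

def rotate_face(img, angle_deg=0.0):
    """In-plane rotation, keeping the original image size."""
    return ndimage.rotate(img, angle_deg, reshape=False, mode='nearest')

def scale_face(img, factor=1.0):
    """Scale change, e.g. factor in {0.7, 0.8, ..., 1.3}."""
    return ndimage.zoom(img, factor)

def blur_face(img, small=32, base=64):
    """Sharpness variation: downscale to small x small (48, 32 or 16),
    then rescale back to the baseline base x base size."""
    down = ndimage.zoom(img, small / img.shape[0])
    return ndimage.zoom(down, base / down.shape[0])
```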
FERET provides the dedicated ‘b’ subset with pose variations, containing out-of-plane rotations of 0°, ±15°, ±25°, ±40°, ±60°. PIE also provides a dedicated subset with pose variations, though with a smaller set of rotations (0°, ±22.5°, ±45°, ±67.5°).
The illumination subset of PIE was used to assess performance in various cast shadow conditions. In our experiments, we divided the frontal view images into six subsets² based on the angle of the corresponding light source. Subset 1 has the most frontal light sources, while subset 6 has the largest light source angles (54°–67°). See Fig. 2 for examples.

¹ The generated alignment errors are representative of real-life characteristics of automatic face localisation/detection algorithms [25].
² Subset 1: light sources 8, 11, 20; Subset 2: light sources 6, 7, 9, 12, 19, 21; Subset 3: light sources 5, 10, 13, 14; Subset 4: light sources 18, 22; Subset 5: light sources 4, 15; Subset 6: light sources 2, 3, 16, 17.
Figure 1. Examples of simulated image variations on FERET: aligned, horizontal shift, vertical shift, in-plane rotation, scale change, and blurring.
Figure 2. Examples from PIE with strong directed illumination, causing self-shadowing. Light source angles: subset 1: 0°; subset 2: 16°–21°; subset 3: 31°–32°; subset 4: 37°–38°; subset 5: 44°–47°; subset 6: 54°–67°.
3.2. Surveillance Videos: ChokePoint Dataset
We collected a new video dataset³, termed ChokePoint, designed for experiments in person identification/verification under real-world surveillance conditions using existing technologies. An array of three cameras was placed above several portals (natural choke points in terms of pedestrian traffic) to capture subjects walking through each portal in a natural way (see Figs. 3 and 4).
While a person is walking through a portal, a sequence of face images (ie. a face set) can be captured. Faces in such sets will have variations in terms of illumination conditions, pose, sharpness, as well as misalignment due to automatic face localisation/detection [25, 28]. Due to the three camera configuration, one of the cameras is likely to capture a face set where a subset of the faces is near-frontal.
The dataset consists of 25 subjects (19 male and 6 female) in portal 1 and 29 subjects (23 male and 6 female) in portal 2. In total, it consists of 48 video sequences and 64,204 face images. Each sequence was named according to the recording conditions (eg. P2E_S1_C3), where P, S, and C stand for portal, sequence and camera, respectively. E and L indicate subjects either entering or leaving the portal. The numbers indicate the respective portal, sequence and camera label. For example, P2L_S1_C3 indicates that the recording was done in Portal 2, with people leaving the portal, and captured by camera 3 in the first recorded sequence.
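Since the naming convention is fully regular, sequence names can be parsed mechanically; the small helper below is our own illustration, not part of the dataset distribution.

```python
import re

def parse_sequence_name(name):
    """Split a ChokePoint sequence name such as 'P2L_S1_C3' into portal,
    direction (E = entering, L = leaving), sequence and camera."""
    m = re.fullmatch(r'P(\d+)([EL])_S(\d+)_C(\d+)', name)
    if m is None:
        raise ValueError(f'not a ChokePoint sequence name: {name!r}')
    portal, direction, sequence, camera = m.groups()
    return {'portal': int(portal),
            'direction': 'entering' if direction == 'E' else 'leaving',
            'sequence': int(sequence),
            'camera': int(camera)}

# parse_sequence_name('P2L_S1_C3')
# -> {'portal': 2, 'direction': 'leaving', 'sequence': 1, 'camera': 3}
```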
In this paper, all the experiments were performed with the video-to-video verification protocol. In this protocol, video sequences are divided into two groups (G1 and G2), where each group plays the role of development set and evaluation set in turn. Parameters can first be learned on the development set and then applied to the evaluation set. The average verification rate is used for reporting results. In our experiments we selected the frontal view cameras (shown in Table 1). In each group, each sequence takes a turn as the gallery, with the leftover sequences becoming the probes; a sketch of the resulting evaluation loop is given below.
³ http://arma.sourceforge.net/chokepoint/
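The protocol can be sketched as follows, using the group listings from Table 1. The verify callable is a placeholder for whichever feature extraction and matching pipeline is under test; everything else follows the description above.

```python
import numpy as np

G1 = ['P1E_S1_C1', 'P1E_S2_C2', 'P2E_S2_C2', 'P2E_S1_C3',
      'P1L_S1_C1', 'P1L_S2_C2', 'P2L_S2_C2', 'P2L_S1_C1']
G2 = ['P1E_S3_C3', 'P1E_S4_C1', 'P2E_S4_C2', 'P2E_S3_C1',
      'P1L_S3_C3', 'P1L_S4_C1', 'P2L_S4_C2', 'P2L_S3_C3']

def evaluate_group(sequences, verify):
    """Each sequence takes a turn as the gallery, with the leftover
    sequences as probes; verify(gallery, probes) is assumed to return
    a verification rate for that split."""
    rates = [verify(seq, sequences[:i] + sequences[i + 1:])
             for i, seq in enumerate(sequences)]
    return float(np.mean(rates))

def protocol_result(verify):
    """Average verification rate, with each group taking a turn as the
    evaluation set (parameters being tuned on the other group)."""
    return 0.5 * (evaluate_group(G1, verify) + evaluate_group(G2, verify))
```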

Figure 3. An example of the recording setup used for the ChokePoint dataset (panels: camera rig, camera 1, camera 2, camera 3). A camera rig contains 3 cameras placed just above a door, used for simultaneously recording the entry of a person from 3 viewpoints. The variations between viewpoints allow for variations in walking directions, facilitating the capture of a near-frontal face by one of the cameras.

Figure 4. Example shots from the ChokePoint dataset, showing portals with various backgrounds.
Table 1. ChokePoint video-to-video verification protocol. Sequences are divided into two groups (G1 and G2). Listed sequences contain faces with the most frontal pose view. P, S, and C stand for portal, sequence and camera, respectively. E and L indicate subjects entering or leaving the portal. The numbers indicate the respective portal, sequence and camera label. For example, P2L_S1_C3 indicates that the recording was done in Portal 2, with people leaving the portal, and captured by camera 3 in the first recorded sequence.

G1: P1E_S1_C1, P1E_S2_C2, P2E_S2_C2, P2E_S1_C3, P1L_S1_C1, P1L_S2_C2, P2L_S2_C2, P2L_S1_C1
G2: P1E_S3_C3, P1E_S4_C1, P2E_S4_C2, P2E_S3_C1, P1L_S3_C3, P1L_S4_C1, P2L_S4_C2, P2L_S3_C3
4. Experiments on Still Images
In this section, we evaluate how well the proposed quality assessment method can identify the best quality faces when presented with both good and poor quality faces. The proposed method was compared with: (i) a score fusion method using pixel-based asymmetry analysis and two sharpness analyses (denoted as Asym_shrp) [26], (ii) asymmetry analysis with Gabor features (denoted as Gabor_asym) [29], and (iii) the classical Distance From Face Space (DFFS) method [5].

The ‘fa’ subset of FERET, containing frontal faces with frontal illumination and neutral expression, was used to train the location specific probabilistic models in the proposed method. The ‘fa’ subset was also used to select the decision threshold for rejecting “poor” quality images. The ‘fa’ subset was not used for any other purposes.
Based on preliminary experiments, closely cropped face images were scaled to 64 × 64 pixels, and the block size was set to 8 × 8 pixels, with a 7 pixel overlap between neighbouring blocks. The preliminary experiments also suggested that using just 3 DCT coefficients was sufficient. This configuration was used in all experiments. The quality assessment methods were implemented with the aid of the Armadillo C++ library [27].
4.1. Quality Assessment of Faces with Variations in Alignment, Scale and Sharpness

In this experiment we evaluated the efficacy of each method in detecting the best aligned images within a set of images that have a particular image variation. For example, out of the set of faces with rotations of 0°, ±10°, ±20°, ±30°, we measured the percentage of 0° faces that were labelled as “high” quality.
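This measurement reduces to a small amount of bookkeeping; the sketch below is our own phrasing of it, where is_target flags the unperturbed (e.g. 0°) images and threshold is the “high” quality decision threshold chosen on the ‘fa’ subset.

```python
import numpy as np

def selection_accuracy(scores, is_target, threshold):
    """Percentage of target images (e.g. the 0-degree faces within a set
    that also contains perturbed variants) labelled as 'high' quality,
    i.e. whose quality score exceeds the decision threshold."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    labelled_high = scores > threshold
    return 100.0 * labelled_high[is_target].mean()
```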
Results for variations in shift, rotation and scale, shown in Table 2, indicate that the proposed method consistently achieved the best or near-best performance across most of the variations. The results on the six PIE illumination subsets indicate that even in the presence of cast shadows, the proposed method can achieve good results, with the exception of images with scale changes. Averaging over all variations, the proposed method achieved the best results.
The asymmetry-based analysis methods (Gabor_asym and Asym_shrp) could not reliably detect vertical alignment errors and scale variations. Gabor_asym also performed poorly at detecting images with various sharpness variations. Asym_shrp addressed this by combining asymmetry analysis with two image sharpness measurements. Despite that, the overall performance of Asym_shrp was still poor.

The performance of DFFS on alignment errors was consistent but generally lower than that of the proposed method. Notably, DFFS failed to detect the images with the best sharpness.

References
- W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld: Face recognition: A literature survey. ACM Computing Surveys, 2003.
- P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
- T. Ahonen, A. Hadid, M. Pietikäinen: Face recognition with local binary patterns. European Conference on Computer Vision (ECCV), 2004.
- T. Sim, S. Baker, M. Bsat: The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
FERET [23] and PIE [32] are used to analyse how accurate the proposed quality assessment algorithm is for correctly selecting best quality images with several desired characteristics, compared to other existing methods.