
Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition

Author(s): Wong, Yongkang; Chen, Shaokang; Mau, Sandra; Sanderson, Conrad; Lovell, Brian C.
Published: 2011
Conference Title: CVPR 2011 WORKSHOPS
Version: Accepted Manuscript (AM)
DOI: https://doi.org/10.1109/cvprw.2011.5981881
Copyright Statement: © 2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Downloaded from: http://hdl.handle.net/10072/401031
Griffith Research Online: https://research-repository.griffith.edu.au

Patch-based Probabilistic Image Quality Assessment for
Face Selection and Improved Video-based Face Recognition
Yongkang Wong, Shaokang Chen, Sandra Mau, Conrad Sanderson, Brian C. Lovell
NICTA, PO Box 6020, St Lucia, QLD 4067, Australia
The University of Queensland, School of ITEE, QLD 4072, Australia
Abstract
In video-based face recognition, face images are typically captured over multiple frames in uncontrolled conditions, where head pose, illumination, shadowing, motion blur and focus change over the sequence. Additionally, inaccuracies in face localisation can also introduce scale and alignment variations. Using all face images, including images of poor quality, can actually degrade face recognition performance. While one solution is to use only the ‘best’ subset of images, current face selection techniques are incapable of simultaneously handling all of the abovementioned issues. We propose an efficient patch-based face image quality assessment algorithm which quantifies the similarity of a face image to a probabilistic face model, representing an ‘ideal’ face. Image characteristics that affect recognition are taken into account, including variations in geometric alignment (shift, rotation and scale), sharpness, head pose and cast shadows. Experiments on the FERET and PIE datasets show that the proposed algorithm is able to identify images which are simultaneously the most frontal, aligned, sharp and well illuminated. Further experiments on a new video surveillance dataset (termed ChokePoint) show that the proposed method provides better face subsets than existing face selection techniques, leading to significant improvements in recognition accuracy.
1. Introduction
Video-based identity inference in surveillance conditions is challenging due to a variety of factors, including the subjects’ motion, the uncontrolled nature of the subjects, variable lighting, and poor quality CCTV video recordings. This results in issues for face recognition such as low resolution, blurry images (due to motion or loss of focus), large pose variations, and low contrast [14, 28, 36, 41]. While recent face recognition algorithms can handle faces with moderately challenging illumination conditions [15, 17, 24, 28], strong illumination variations (causing cast shadows [30] and self-shadowing) remain problematic [31].
One approach to overcome the impact of poor quality images is to assume that such images are outliers in a sequence. This includes approaches like exemplar extraction using clustering techniques (eg. k-means clustering [13]) and statistical model approaches for outlier removal [6]. However, these approaches are not likely to work when most of the images in the sequence have poor quality, since the good quality images would then be classified as outliers.
Another approach is explicit subset selection, where a face quality assessment is automatically made on each image, either to remove poor quality face images, or to select a subset comprised of high quality images [10, 20, 21, 33]. This improves recognition performance, with the additional benefit of reducing the overall computational load during feature extraction and matching [19]. The challenge in this approach is finding a good definition of “face quality”.
Several face image standards have been proposed for face quality assessment (eg. ISO/IEC 19794-5 [1] and ICAO 9303 [2]). In these standards, quality can be divided into: (i) image specific qualities such as sharpness, contrast and compression artifacts, and (ii) face specific qualities such as face geometry, pose, eye detectability and illumination angles.

Based in part on the above standards, many approaches have been proposed to analyse various face and image properties. For example, face pose estimation using tree structured multiple pose estimators [39], and face alignment estimation using template matching [7]. Asymmetry analysis has been proposed to simultaneously estimate two qualities: out-of-plane rotation and non-frontal illumination [10, 29, 40].
Since face recognition performance is simultaneously impacted by multiple factors, being able to detect one or two qualities is insufficient for robust subset selection. One approach to simultaneously detect multiple quality characteristics is through a fusion of individual face and image quality measurements. Nasrollahi and Moeslund [21] proposed a weighted quality fusion approach to combine out-of-plane rotation, sharpness, brightness, and image resolution qualities. Rua et al. [26] proposed a similar quality assessment approach, using asymmetry analysis and two sharpness measurements. Hsu et al. [16] proposed to learn fusion parameters on multiple quality scores to achieve maximum correlation with matching scores between face pairs.

Another proposed fusion approach uses a Bayesian network to model the relationships among qualities, image features and matching scores [22]. The main drawbacks of the above fusion approaches are:
- Fusion-based approaches only perform as well as their individual classifiers. For example, if a pose estimation algorithm requires accurate facial feature localisation, the whole fusion framework will fail in the cases where that pose algorithm fails (such as in low resolution CCTV footage) [35].
- As various properties are measured individually and have different influences on face quality, it may be difficult to combine them into a single quality score for the purposes of image selection.
- As multiple classifiers are involved, fusion approaches are typically more time consuming and hence may not be suitable for real-time surveillance applications.
- Since face matching scores are heavily dependent on system-specific details (including the input features, matching algorithms and training images), quality assessment approaches that learn a fusion model based on match scores end up being closely tied to the particular system configuration and hence need to be retrained for each system.
Simultaneously detecting multiple quality characteristics can also be accomplished by learning a generic model to define the ‘ideal’ quality. Luo [18] proposed a learning based approach where the quality model is trained to correlate with manually labelled quality scores. However, given the subjective nature of human labelling, and the fact that humans may not know what characteristics work best for automatic face recognition algorithms, this approach may not generate the best quality model for face recognition.
In this paper we propose a straightforward and effective patch-based face quality assessment algorithm, targeted towards handling images obtained in surveillance conditions. It quantifies the similarity of a given face to a probabilistic face model, representing an ‘ideal’ face, via patch-based local analysis. Without resorting to fusion, the proposed algorithm outputs a single score for each image, with the score simultaneously reflecting the degree of alignment errors, pose variations, shadowing, and image sharpness (underlying resolution). Localisation of facial features (ie. eyes, nose, mouth) is not required.
We continue the paper as follows. In Section 2 we describe the proposed quality assessment algorithm. Still image and video datasets used in the experiments are briefly described in Section 3. Extensive performance comparisons against existing techniques are given in Section 4 (on still images) and Section 5 (on surveillance videos). The main findings are discussed in Section 6.
2. Probabilistic Face Quality Assessment
The proposed algorithm is comprised of five steps: (1) pixel-based image normalisation, (2) patch extraction and normalisation, (3) feature extraction from each patch, (4) local probability calculation, and (5) overall quality score generation via integration of local probabilities. These steps are elaborated below; an illustrative code sketch follows the list.
1. For a given image I, we perform non-linear pre-processing (log transform) to reduce the dynamic range of the data. Following [9], the normalised image $I_{\log}$ is calculated using:

$I_{\log}(r, c) = \ln[\, I(r, c) + 1 \,] \quad (1)$

where I(r, c) is the pixel intensity located at (r, c). Logarithm normalisation amplifies low intensity pixels and compresses high intensity pixels. This property is helpful in reducing the intensity differences between skin tones.
2. The transformed image $I_{\log}$ is divided into N overlapping blocks (patches). Each block $b_i$ has a size of n × n pixels and overlaps neighbouring blocks by t pixels. To accommodate contrast variations between face images, each patch is normalised to have zero mean and unit variance [37].
3. From each block, a 2D Discrete Cosine Transform (DCT) feature vector is extracted [11]. Excluding the 0-th DCT component (as it carries no information after the previous normalisation), the top d low frequency components are retained. The low frequency components retain generic facial textures [12], while largely omitting person-specific information. At the same time, cast shadows [37] as well as variations in pose and alignment can alter the local textures.
4. For each block location i, the probability of the corresponding feature vector $\mathbf{x}_i$ is calculated using a location specific probabilistic model:

$p(\mathbf{x}_i \,|\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \dfrac{\exp\!\left[ -\tfrac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_i)^{\mathsf{T}} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_i) \right]}{(2\pi)^{d/2} \, |\boldsymbol{\Sigma}_i|^{1/2}} \quad (2)$

where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are the mean and covariance matrix of a normal distribution. The model for each location is trained using a pool of frontal faces with frontal illumination and neutral expression. All of the training face images are first scaled and aligned to a fixed size, with each eye located at a fixed location. We emphasise that during testing, the faces do not need to be aligned.
5. By assuming that the model for each location is independent, an overall probabilistic quality score Q for image I, comprised of N blocks, is calculated using:

$Q(I) = \sum_{i=1}^{N} \log \, p(\mathbf{x}_i \,|\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \quad (3)$
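To make the above steps concrete, the following is a minimal Python sketch of the per-patch machinery (steps 1 to 4). It is our own illustrative reconstruction, not the authors' code: the helper names, the zigzag-style DCT coefficient ordering, and the small diagonal loading term added for numerical stability are all assumptions.

```python
import numpy as np
from scipy.fft import dctn  # multidimensional type-II DCT

def extract_patches(img_log, n=8, t=7):
    """Step 2: slide an n x n window over the log-transformed image,
    with t pixels of overlap between neighbouring blocks (stride n - t)."""
    stride = n - t
    rows, cols = img_log.shape
    patches = []
    for r in range(0, rows - n + 1, stride):
        for c in range(0, cols - n + 1, stride):
            patches.append(img_log[r:r + n, c:c + n])
    return patches

def patch_features(patch, d=3):
    """Steps 2-3: zero mean / unit variance normalisation, then the top d
    low-frequency 2D DCT coefficients, excluding the 0-th component."""
    p = patch - patch.mean()
    std = p.std()
    if std > 1e-12:                      # guard against flat patches
        p = p / std
    coeffs = dctn(p, norm='ortho')
    # Illustrative low-frequency ordering over the top-left of the DCT
    # grid; (0, 0) is the 0-th coefficient and is skipped.
    order = [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1)]
    return np.array([coeffs[i, j] for i, j in order[:d]])

def train_location_models(train_features):
    """Step 4 (training): fit one Gaussian per block location. Entry i of
    train_features is a (num_faces x d) matrix of vectors for location i."""
    models = []
    for X in train_features:
        mu = X.mean(axis=0)
        # Diagonal loading keeps the covariance invertible (our safeguard).
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models.append((mu, sigma))
    return models

def log_gaussian(x, mu, sigma):
    """Step 4 (testing): log of the multivariate normal density, Eqn. (2)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (maha + x.size * np.log(2.0 * np.pi) + logdet)
```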

The resulting quality score represents the probabilistic similarity of a given face to an “ideal” face (as represented by a set of training images). A higher quality score reflects better image quality.
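Building on the helpers above, the overall score of Eqn. (3) and the face selection use-case could then be sketched as follows. Again, this is an illustrative reconstruction; select_best_faces and its parameter N are our own names, mirroring the subset selection experiments reported later.

```python
def quality_score(img, models, n=8, t=7, d=3):
    """Q(I) from Eqn. (3): the sum of per-location Gaussian
    log-probabilities of the patch feature vectors. Assumes the models
    were trained with the same patch configuration (n, t, d)."""
    img_log = np.log(img.astype(np.float64) + 1.0)   # step 1, Eqn. (1)
    patches = extract_patches(img_log, n=n, t=t)     # step 2
    return sum(log_gaussian(patch_features(p, d=d), mu, sigma)
               for p, (mu, sigma) in zip(patches, models))

def select_best_faces(face_images, models, N=5):
    """Rank a face set by quality and keep the N highest-scoring images."""
    scores = [quality_score(f, models) for f in face_images]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [face_images[i] for i in ranked[:N]]
```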
3. Face Datasets
In this section, we briefly describe the FERET, PIE and ChokePoint face datasets, as well as their setup for our experiments.

FERET [23] and PIE [32] are used to analyse how accurately the proposed quality assessment algorithm selects the best quality images with several desired characteristics, compared to other existing methods. In total, there are 1124 unique subjects in the training phase and 1263 subjects in the test phase.

The ChokePoint dataset contains surveillance videos. It is used to study the improvement in verification performance gained from subset selection, using the proposed quality method as well as other approaches.
3.1. Setup of Still Image Datasets: FERET and PIE
To study the performance of the proposed method in terms of correctly selecting images with desired characteristics, we simulated blurring as well as four alignment errors using images from the ‘fb’ subset of FERET. Experiments with pose variations (out-of-plane rotation) used dedicated subsets from FERET and PIE. Experiments with cast shadows used the illumination subset of PIE.
The generated alignment errors¹ are: horizontal shift and vertical shift (using displacements of 0, ±2, ±4, ±6, ±8 pixels), in-plane rotation (using rotations of 0°, ±10°, ±20°, ±30°), and scale variations (using scaling factors of 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3). For sharpness variations, each original image is first downscaled to three sizes (48 × 48, 32 × 32 and 16 × 16 pixels) then rescaled to the baseline size of 64 × 64 pixels. See Fig. 1 for examples.
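As an illustration of how such variations can be generated, the sketch below uses scipy.ndimage; the interpolation settings and border handling are our assumptions, since the paper does not specify them, and cropping or padding the scaled result back to 64 × 64 pixels is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def shift_face(img, dx=0, dy=0):
    """Horizontal/vertical shift by (dx, dy) pixels."""
    return ndimage.shift(img, (dy, dx), mode='nearest')

def rotate_face(img, angle_deg=0.0):
    """In-plane rotation, keeping the original image size."""
    return ndimage.rotate(img, angle_deg, reshape=False, mode='nearest')

def scale_face(img, factor=1.0):
    """Scale change, e.g. factor in {0.7, 0.8, ..., 1.3}."""
    return ndimage.zoom(img, factor)

def blur_face(img, small=32, base=64):
    """Sharpness variation: downscale to small x small (48, 32 or 16),
    then rescale back to the baseline base x base size."""
    down = ndimage.zoom(img, small / img.shape[0])
    return ndimage.zoom(down, base / down.shape[0])
```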
FERET provides the dedicated ‘b’ subset with pose variations, containing out-of-plane rotations of 0°, ±15°, ±25°, ±40°, ±60°. PIE also provides a dedicated subset with pose variations, though with a smaller set of rotations (0°, ±22.5°, ±45°, ±67.5°).
The illumination subset of PIE was used to assess performance in various cast shadow conditions. In our experiments, we divided the frontal view images into six subsets² based on the angle of the corresponding light source. Subset 1 has the most frontal light sources, while subset 6 has the largest light source angles (54°–67°). See Fig. 2 for examples.

¹ The generated alignment errors are representative of real-life characteristics of automatic face localisation/detection algorithms [25].
² Subset 1: light sources 8, 11, 20; Subset 2: light sources 6, 7, 9, 12, 19, 21; Subset 3: light sources 5, 10, 13, 14; Subset 4: light sources 18, 22; Subset 5: light sources 4, 15; Subset 6: light sources 2, 3, 16, 17.
Figure 1. Examples of simulated image variations on FERET: aligned, horizontal shift, vertical shift, in-plane rotation, scale change, and blurring.
Figure 2. Examples from PIE with strong directed illumination, causing self-shadowing. Light source angles: subset 1: 0°; subset 2: 16°–21°; subset 3: 31°–32°; subset 4: 37°–38°; subset 5: 44°–47°; subset 6: 54°–67°.
3.2. Surveillance Videos: ChokePoint Dataset
We collected a new video dataset³, termed ChokePoint, designed for experiments in person identification/verification under real-world surveillance conditions using existing technologies. An array of three cameras was placed above several portals (natural choke points in terms of pedestrian traffic) to capture subjects walking through each portal in a natural way (see Figs. 3 and 4).
While a person is walking through a portal, a sequence of face images (ie. a face set) can be captured. Faces in such sets will have variations in terms of illumination conditions, pose, sharpness, as well as misalignment due to automatic face localisation/detection [25, 28]. Due to the three camera configuration, one of the cameras is likely to capture a face set where a subset of the faces is near-frontal.
The dataset consists of 25 subjects (19 male and 6 female) in portal 1 and 29 subjects (23 male and 6 female) in portal 2. In total, it consists of 48 video sequences and 64,204 face images. Each sequence was named according to the recording conditions (eg. P2E_S1_C3), where P, S, and C stand for portal, sequence and camera, respectively. E and L indicate subjects either entering or leaving the portal. The numbers indicate the respective portal, sequence and camera label. For example, P2L_S1_C3 indicates that the recording was done in Portal 2, with people leaving the portal, and captured by camera 3 in the first recorded sequence.
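Since the naming convention is fully regular, sequence names can be parsed mechanically; the small helper below is our own illustration, not part of the dataset distribution.

```python
import re

def parse_sequence_name(name):
    """Split a ChokePoint sequence name such as 'P2L_S1_C3' into portal,
    direction (E = entering, L = leaving), sequence and camera."""
    m = re.fullmatch(r'P(\d+)([EL])_S(\d+)_C(\d+)', name)
    if m is None:
        raise ValueError(f'not a ChokePoint sequence name: {name!r}')
    portal, direction, sequence, camera = m.groups()
    return {'portal': int(portal),
            'direction': 'entering' if direction == 'E' else 'leaving',
            'sequence': int(sequence),
            'camera': int(camera)}

# parse_sequence_name('P2L_S1_C3')
# -> {'portal': 2, 'direction': 'leaving', 'sequence': 1, 'camera': 3}
```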
In this paper, all the experiments were performed with the video-to-video verification protocol. In this protocol, video sequences are divided into two groups (G1 and G2), where each group plays the role of development set and evaluation set in turn. Parameters can first be learned on the development set and then applied to the evaluation set. The average verification rate is used for reporting results. In our experiments we selected the frontal view cameras (shown in Table 1). In each group, each sequence takes a turn as the gallery, with the leftover sequences becoming the probes; a sketch of the resulting evaluation loop is given below.
³ http://arma.sourceforge.net/chokepoint/
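The protocol can be sketched as follows, using the group listings from Table 1. The verify callable is a placeholder for whichever feature extraction and matching pipeline is under test; everything else follows the description above.

```python
import numpy as np

G1 = ['P1E_S1_C1', 'P1E_S2_C2', 'P2E_S2_C2', 'P2E_S1_C3',
      'P1L_S1_C1', 'P1L_S2_C2', 'P2L_S2_C2', 'P2L_S1_C1']
G2 = ['P1E_S3_C3', 'P1E_S4_C1', 'P2E_S4_C2', 'P2E_S3_C1',
      'P1L_S3_C3', 'P1L_S4_C1', 'P2L_S4_C2', 'P2L_S3_C3']

def evaluate_group(sequences, verify):
    """Each sequence takes a turn as the gallery, with the leftover
    sequences as probes; verify(gallery, probes) is assumed to return
    a verification rate for that split."""
    rates = [verify(seq, sequences[:i] + sequences[i + 1:])
             for i, seq in enumerate(sequences)]
    return float(np.mean(rates))

def protocol_result(verify):
    """Average verification rate, with each group taking a turn as the
    evaluation set (parameters being tuned on the other group)."""
    return 0.5 * (evaluate_group(G1, verify) + evaluate_group(G2, verify))
```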

Figure 3. An example of the recording setup used for the ChokePoint dataset (panels: camera rig, camera 1, camera 2, camera 3). A camera rig contains 3 cameras placed just above a door, used for simultaneously recording the entry of a person from 3 viewpoints. The variations between viewpoints allow for variations in walking directions, facilitating the capture of a near-frontal face by one of the cameras.

Figure 4. Example shots from the ChokePoint dataset, showing portals with various backgrounds.
Table 1. ChokePoint video-to-video verification protocol. Sequences are divided into two groups (G1 and G2). Listed sequences contain faces with the most frontal pose view. P, S, and C stand for portal, sequence and camera, respectively. E and L indicate subjects entering or leaving the portal. The numbers indicate the respective portal, sequence and camera label. For example, P2L_S1_C3 indicates that the recording was done in Portal 2, with people leaving the portal, and captured by camera 3 in the first recorded sequence.

G1: P1E_S1_C1, P1E_S2_C2, P2E_S2_C2, P2E_S1_C3, P1L_S1_C1, P1L_S2_C2, P2L_S2_C2, P2L_S1_C1
G2: P1E_S3_C3, P1E_S4_C1, P2E_S4_C2, P2E_S3_C1, P1L_S3_C3, P1L_S4_C1, P2L_S4_C2, P2L_S3_C3
4. Experiments on Still Images
In this section, we evaluate how well the proposed quality assessment method can identify the best quality faces when presented with both good and poor quality faces. The proposed method was compared with: (i) a score fusion method using pixel-based asymmetry analysis and two sharpness analyses (denoted as Asym_shrp) [26], (ii) asymmetry analysis with Gabor features (denoted as Gabor_asym) [29], and (iii) the classical Distance From Face Space (DFFS) method [5].

The ‘fa’ subset of FERET, containing frontal faces with frontal illumination and neutral expression, was used to train the location specific probabilistic models in the proposed method. The ‘fa’ subset was also used to select the decision threshold for rejecting “poor” quality images. The ‘fa’ subset was not used for any other purposes.
Based on preliminary experiments, closely cropped face images were scaled to 64 × 64 pixels, and the block size was set to 8 × 8 pixels, with a 7 pixel overlap between neighbouring blocks. The preliminary experiments also suggested that using just 3 DCT coefficients was sufficient. This configuration was used in all experiments. The quality assessment methods were implemented with the aid of the Armadillo C++ library [27].
4.1. Quality Assessment of Faces with Variations in Alignment, Scale and Sharpness

In this experiment we evaluated the efficacy of each method in detecting the best aligned images within a set of images that have a particular image variation. For example, out of the set of faces with rotations of 0°, ±10°, ±20°, ±30°, we measured the percentage of 0° faces that were labelled as “high” quality.
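This measurement reduces to a small amount of bookkeeping; the sketch below is our own phrasing of it, where is_target flags the unperturbed (e.g. 0°) images and threshold is the “high” quality decision threshold chosen on the ‘fa’ subset.

```python
import numpy as np

def selection_accuracy(scores, is_target, threshold):
    """Percentage of target images (e.g. the 0-degree faces within a set
    that also contains perturbed variants) labelled as 'high' quality,
    i.e. whose quality score exceeds the decision threshold."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    labelled_high = scores > threshold
    return 100.0 * labelled_high[is_target].mean()
```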
Results for variations in shift, rotation and scale, shown in Table 2, indicate that the proposed method consistently achieved the best or near-best performance across most of the variations. The results on the six PIE illumination subsets indicate that even in the presence of cast shadows, the proposed method can achieve good results, with the exception of images with scale changes. Averaging over all variations, the proposed method achieved the best results.
The asymmetry-based analysis methods (Gabor_asym and Asym_shrp) could not reliably detect vertical alignment errors and scale variations. Gabor_asym also performed poorly at detecting images with various sharpness variations. Asym_shrp addressed this by combining asymmetry analysis with two image sharpness measurements. Despite that, the overall performance of Asym_shrp was still poor.

The performance of DFFS on alignment errors was consistent but generally lower than that of the proposed method. Notably, DFFS failed to detect the images with the best sharpness.

References
- W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld: Face recognition: A literature survey. ACM Computing Surveys, 2003.
- P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
- T. Ahonen, A. Hadid, M. Pietikäinen: Face recognition with local binary patterns. European Conference on Computer Vision (ECCV), 2004.
- T. Sim, S. Baker, M. Bsat: The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
FERET [23] and PIE [32] are used to analyse how accurate the proposed quality assessment algorithm is for correctly selecting best quality images with several desired characteristics, compared to other existing methods.