Heterogeneous Face Recognition Using Kernel Prototype Similarities
Summary (6 min read)
1 INTRODUCTION
- An emerging topic in face recognition is matching between heterogeneous image modalities.
- Coined heterogeneous face recognition (HFR) [1], the scenario offers potential solutions to many difficult face recognition scenarios.
- While heterogeneous face recognition can involve matching between any two imaging modalities, the majority of scenarios involve a gallery dataset consisting of visible light photographs.
- Probe images can be of any other modality, though the practical scenarios of interest to us are infrared images (NIR and thermal) and hand-drawn facial sketches.
- When a subject’s face can only be acquired in nighttime environments, the use of infrared imaging may be the only modality for acquiring a useful face image of the subject.
2.1 Heterogeneous Face Recognition
- A flurry of research has emerged providing solutions to various heterogeneous face recognition problems.
- This began with sketch recognition using viewed sketches, and has continued into other modalities such as near-infrared (NIR) and forensic sketches.
- Klare and Jain followed this work on NIR to VIS face recognition by also incorporating SIFT feature descriptors and an RS-LDA scheme [10].
2.2 Kernel Prototype Representation
- The core of the proposed approach involves using a relational feature representation for face images (illustrated in Fig. 2).
- One key to their framework is that each prototype has a pattern for each image modality.
- Kernel PCA [21] and Kernel LDA [22], [23] approaches to face recognition have used a similar approach, where a face is represented as the kernel similarity to a collection of prototype images in a high-dimensional space.
- The biometric indexing scheme by Gyaourova and Ross used similarity scores to a fixed set of references in the face and fingerprint modality [24].
- These prior works differ from the proposed method because only a single prototype is used per training subject.
2.3 Proposed Method
- The proposed method presents a new approach to heterogeneous face recognition, and extends existing methods in face recognition.
- Unlike previous feature-based methods, where an image descriptor invariant to changes between the two HFR modalities was needed, the proposed framework only needs descriptors that are effective within each domain.
- The accuracy of the HFR system is improved using a random subspace framework in conjunction with linear discriminant analysis (LDA), as described in Section 5.
- While the authors demonstrate the strength of the proposed framework on many different HFR scenarios, the parameters controlling the framework are the same across all tested scenarios.
3 IMAGE PREPROCESSING AND REPRESENTATION
- All face images are initially represented using a featurebased representation.
- The use of local feature descriptors has been argued to closely resemble the postulated representation of the human visual processing system [26], and they have been shown to be well suited for face recognition [27].
3.1 Geometric Normalization
- The first step in representing face images using feature descriptors is to geometrically normalize the face images with respect to the location of the eyes.
- This step reduces the effect of scale, rotation, and translation variations.
- The eye locations for the face images from all modalities are automatically estimated using Cognitec’s FaceVACS SDK [28].
- The only exceptions are the thermal face images where the eyes are manually located for both the proposed method and the FaceVACS baseline.
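Normalizing a face with respect to the detected eye coordinates amounts to estimating a similarity transform (rotation, scale, translation) that maps the two eyes to fixed canonical positions. The sketch below illustrates that idea with plain NumPy; the canonical eye coordinates are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def eye_alignment_transform(left_eye, right_eye, target_left=(30.0, 45.0),
                            target_right=(98.0, 45.0)):
    """Return a 2x3 similarity-transform matrix mapping the detected eye
    coordinates onto fixed canonical positions, removing the effects of
    scale, in-plane rotation, and translation."""
    src = np.array([left_eye, right_eye], dtype=float)
    dst = np.array([target_left, target_right], dtype=float)
    # Recover rotation angle and scale from the inter-eye vectors.
    d_src = src[1] - src[0]
    d_dst = dst[1] - dst[0]
    scale = np.linalg.norm(d_dst) / np.linalg.norm(d_src)
    angle = np.arctan2(d_dst[1], d_dst[0]) - np.arctan2(d_src[1], d_src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    A = np.array([[c, -s], [s, c]])          # scaled rotation
    t = dst[0] - A @ src[0]                  # translation fixing the left eye
    return np.hstack([A, t[:, None]])        # 2x3 affine matrix

# Hypothetical detected eye locations in an input image.
M = eye_alignment_transform((100, 120), (160, 130))
p = M @ np.array([100, 120, 1.0])            # left eye in the aligned frame
```

Applying `M` to every pixel (e.g., via a warp routine) yields the geometrically normalized face crop.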
3.2 Image Filtering
- Face images are filtered with three different image filters.
- These filters are intended to help compensate both for intensity variations within an image domain (such as nonuniform illumination changes) and for appearance variations between image domains.
- The second aspect is of particular importance for the direct random subspace (D-RS) framework (see Section 6).
- The three image filters used are as follows.
3.2.2 Center-Surround Divisive Normalization (CSDN)
- Meyers and Wolf [30] introduced the center-surround divisive normalization filter in conjunction with their biologically inspired face recognition framework.
- The CSDN filter divides the value of each pixel by the mean pixel value in the s × s neighborhood surrounding the pixel.
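The CSDN operation described above is a one-liner given a local-mean filter. A minimal sketch with SciPy follows; the neighborhood size `s` is illustrative, not the paper's value.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def csdn(image, s=17, eps=1e-6):
    """Center-surround divisive normalization: divide each pixel by the
    mean intensity of its s x s neighborhood. `s` and `eps` are
    illustrative choices, not the paper's parameters."""
    image = image.astype(float)
    local_mean = uniform_filter(image, size=s, mode='reflect')
    return image / (local_mean + eps)   # eps guards against division by zero
```

On a constant image the output is (approximately) all ones, since every pixel equals its local mean; on real faces the filter suppresses smooth illumination gradients.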
3.2.3 Gaussian
- The Gaussian smoothing filter has long been used in image processing applications to remove noise contained in high spatial frequencies while retaining the remainder of the signal.
- The width of the Gaussian filter was held fixed in their implementation.
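The noise-suppression behavior described above can be seen directly: smoothing white noise with a Gaussian kernel sharply reduces its energy. The `sigma` value below is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
noisy = rng.normal(0.0, 1.0, size=(64, 64))   # pure high-frequency noise
smoothed = gaussian_filter(noisy, sigma=2.0)  # sigma is an illustrative width
# Smoothing attenuates the high-spatial-frequency content, so the
# pixel-value spread of `smoothed` is much smaller than that of `noisy`.
```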
3.3 Local Descriptor Representation
- Once an image is geometrically normalized and filtered using one of the three filters, local feature descriptors are extracted from uniformly distributed patches across the face.
- The authors use two different feature descriptors to represent the face image: the SIFT descriptor [14] and Local Binary Patterns [13].
- LBP features have a longer history of successful use in face recognition.
- Each patch overlaps its vertical and horizontal neighbors by 16 pixels.
- Using uniform patterns at eight sampling locations, as described by Ojala et al. [13], the LBP descriptor yields a 59-dimensional feature descriptor.
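The 59 dimensions come from the "uniform pattern" construction: of the 256 possible 8-bit LBP codes, the 58 codes with at most two 0/1 transitions in the circular pattern each get their own histogram bin, and all remaining codes share one bin. A small sketch verifying that count:

```python
import numpy as np

def uniform_lbp_bins():
    """Map each 8-bit LBP code to one of 59 histogram bins: the 58 'uniform'
    patterns (at most two 0/1 transitions around the circle) each get their
    own bin; all non-uniform patterns share a single final bin."""
    bins = {}
    next_bin = 0
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:
            bins[code] = next_bin
            next_bin += 1
    nonuniform_bin = next_bin   # shared bin for everything else
    lookup = {c: bins.get(c, nonuniform_bin) for c in range(256)}
    return lookup, next_bin + 1

lookup, n_bins = uniform_lbp_bins()
# n_bins == 59: 58 uniform patterns plus one shared non-uniform bin.
```

A per-patch descriptor is then just a 59-bin histogram of `lookup[code]` over the patch's LBP codes.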
4.1 Prototype Representation
- The heterogeneous prototype framework begins with images from the probe and gallery modalities represented by (possibly different) feature descriptors for each of the N image patches, as described in the previous section.
- The cosine kernel was chosen because it resulted in consistently higher accuracy on all tested scenarios compared to the radial basis function kernel and the polynomial kernel.
- Additionally, because the feature vectors φ_P(P) and φ_G(G) are measures of the similarity between a test image and the prototype training images, the feature spaces used for similarity computation do not have to be the same for the probe and gallery modalities.
- For example, φ_P^{Fc,Ds}(I) denotes the prototype similarity of image I when represented using the CSDN image filter (Fc) and SIFT descriptors (Ds).
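The core representation step above reduces to computing the cosine-kernel similarity between an image's feature vector and each prototype from the matching modality. A minimal sketch, with random vectors standing in for real face descriptors:

```python
import numpy as np

def cosine_kernel(x, y):
    """Cosine similarity between two feature vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def prototype_representation(x, prototypes):
    """Represent feature vector x by its cosine-kernel similarity to each
    prototype. Probe images are compared to probe-modality prototypes and
    gallery images to gallery-modality prototypes, so the raw feature
    spaces of the two modalities need not coincide."""
    return np.array([cosine_kernel(x, p) for p in prototypes])

rng = np.random.default_rng(1)
protos_probe = rng.normal(size=(100, 128))   # 100 illustrative prototypes
phi = prototype_representation(rng.normal(size=128), protos_probe)
# phi is a 100-D vector of similarities, each in [-1, 1].
```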
4.2 Discriminant Analysis
- After representing the images in the training set T in the aforementioned prototype representation, the authors next learn linear subspaces using linear discriminant analysis (LDA) [33] to enhance the discriminative capabilities of the prototype representation φ(·). LDA (and its variants) has consistently demonstrated its ability to improve the accuracy of various recognition algorithms through feature extraction and dimensionality reduction.
- The authors learn the linear projection matrix W by following the conventional approach for high-dimensional data, namely, by first applying PCA, followed by LDA [33].
- In all experiments, the PCA step was used to retain 99.0 percent of the variance.
- Next, the within-class and between-class scatter matrices, S_W and S_B, are computed from the PCA-projected training data W_1^T X.
- Letting μ denote the mean of X, the final representation for an unseen probe or gallery image I under the prototype framework is W^T(φ(I) − μ). Subsequent uses of W in this work assume the appropriate removal of the mean from φ(I), for terseness.
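The PCA-then-LDA pipeline can be sketched with scikit-learn. The synthetic data, class counts, and dimensions below are illustrative; the only detail taken from the paper is retaining 99 percent of the variance in the PCA step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for prototype-similarity vectors: 20 subjects
# (classes), 4 samples each, 50-D features.
rng = np.random.default_rng(2)
n_subjects, per_subject, dim = 20, 4, 50
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(per_subject, dim))
               for i in range(n_subjects)])
y = np.repeat(np.arange(n_subjects), per_subject)

pca = PCA(n_components=0.99)    # retain 99% of the variance, as in the paper
Z = pca.fit_transform(X)        # PCA also removes the mean
lda = LinearDiscriminantAnalysis()
lda.fit(Z, y)                   # lda.scalings_ plays the role of W
projected = lda.transform(Z)    # at most (n_subjects - 1) dimensions
```

Unseen probe and gallery images would be pushed through the same `pca.transform` followed by `lda.transform` before matching.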
5.1 Motivation
- The proposed heterogeneous prototype framework uses training data to define the prototypes and to learn the linear subspace projection matrix W .
- When applying a prototype representation to face recognition, a large number of classes (or subjects) and features are present.
- Most are designed to handle deficiencies in the subspace W , such as dual-space LDA [34] and direct LDA [37].
- These methods do not address the issue of too few prototypes for an expressive representation.
- Their approach combined random subspaces and bagging by sampling both features and training instances.
5.2 Prototype Random Subspaces
- The prototype random subspace framework uses B different bags (or samples) of the N face patches.
- Let f(I, b) denote the concatenation of the descriptors from the randomly selected patch indices in bag b.
- For terseness, the authors have omitted the superscripts F and D in the preceding equations.
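The bagging step is simply repeated sampling of patch-index subsets: each bag selects a random subset of the N face patches, and a separate prototype representation plus LDA subspace is trained per bag, with per-bag scores fused at match time. A sketch of the sampling, with an illustrative patch count and sampling fraction:

```python
import numpy as np

def sample_bags(n_patches, n_bags, fraction, seed=0):
    """Draw B random bags of patch indices without replacement within each
    bag. `n_patches` and `fraction` here are illustrative values, not the
    paper's exact configuration."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(fraction * n_patches)))
    return [rng.choice(n_patches, size=k, replace=False)
            for _ in range(n_bags)]

bags = sample_bags(n_patches=154, n_bags=30, fraction=0.1)
# 30 bags, each selecting a distinct random subset of ~15 patch indices.
```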
5.4 Score Level Fusion
- The proposed framework naturally lends itself to fusion of the different feature representations.
- Given one image filter F and two feature descriptors D1 and D2, one can utilize the following sum of similarity scores between probe image P and gallery image G: S_{F,D1}^{F,D1}(P,G) + S_{F,D2}^{F,D2}(P,G) + S_{F,D2}^{F,D1}(P,G) + S_{F,D1}^{F,D2}(P,G). Min-max score normalization is performed prior to fusion.
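Min-max normalization followed by sum fusion is straightforward to sketch. Each vector below holds one matcher's scores for a single probe against a small illustrative gallery; the numbers are made up.

```python
import numpy as np

def min_max(scores):
    """Rescale a score vector to [0, 1] (min-max normalization)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def fuse(score_sets):
    """Min-max normalize each matcher's gallery scores, then sum them."""
    return np.sum([min_max(s) for s in score_sets], axis=0)

# Illustrative scores from two representations (e.g., LBP- and SIFT-based),
# one entry per gallery image; index 1 is the true mate in this toy example.
s_lbp = np.array([0.2, 0.9, 0.4])
s_sift = np.array([10.0, 30.0, 25.0])
fused = fuse([s_lbp, s_sift])
```

Normalizing first matters because the two matchers' raw score ranges differ by orders of magnitude; without it the larger-scaled matcher would dominate the sum.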
6.1 Commercial Matcher
- The accuracy of the proposed prototype random subspace framework is compared against Cognitec’s FaceVACS [28].
- Comparing the accuracy of their system against a leading COTS FRS offers an unbiased baseline of a state-of-the-art commercial matcher on each HFR scenario.
- FaceVACS was chosen because it is considered one of the best commercial face matchers and, in their internal tests, it excels at HFR scenarios (with respect to other commercial matchers).
- The accuracy of FaceVACS on NIR to VIS [10] and Viewed Sketch to VIS [9] performed on par with some previously published HFR methods.
6.2 Direct Random Subspaces
- In addition to a commercial face recognition system, the proposed prototype recognition system is also compared against a recognition system that directly measures the difference between probe and gallery images using a common feature descriptor representation.
- The random subspace framework from [10] is used as the baseline because it is the most similar to the proposed prototype framework, thus helping to isolate the difference between using kernel prototype similarities versus directly measuring the similarity.
- Further, because most of the datasets tested in Section 7 are in the public domain, the proposed framework may also be compared against any other published method on these datasets.
- This follows from the fact that f^{F,D1}(I) and f^{F,D2}(I) generally have different dimensionality and also have different interpretations.
- D-RS will be used in conjunction with the six filter/ descriptor representations presented in Section 3 (SIFT+DoG, MLBP+CSDN, etc.).
7 EXPERIMENTS
- The results provided are based on the following parameter values: a patch sampling fraction of α = 0.1 and B = 30 bags.
- A cosine kernel was used to compute the prototype similarity and 99.0 percent of the variance was retained in the PCA step of LDA.
7.1 Databases
- Five different matching scenarios are tested in this paper: four heterogeneous face recognition scenarios and one standard face recognition scenario.
- Example images from each HFR dataset can be found in Fig. 1.
- Results shown on each dataset are the average and standard deviation of five random splits of training and testing subjects.
- In every experiment, no subject that was used in training was used for testing.
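The evaluation protocol described above (five random, subject-disjoint train/test partitions whose results are averaged) can be sketched as follows; the subject counts match Dataset 1, and the function name is a hypothetical helper, not code from the paper.

```python
import numpy as np

def subject_disjoint_splits(n_subjects, n_train, n_splits=5, seed=0):
    """Generate random train/test partitions in which no subject appears
    in both sets; accuracy is then averaged over the splits."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n_subjects)
        splits.append((perm[:n_train], perm[n_train:]))
    return splits

# Dataset 1 sizes: 200 subjects, 133 for training, 67 for testing.
splits = subject_disjoint_splits(n_subjects=200, n_train=133)
train, test = splits[0]
```

Splitting by subject (rather than by image) is what guarantees the "no training subject is used for testing" condition.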
7.1.1 Dataset 1—Near-Infrared to Visible (Fig. 1a)
- The first dataset consists of 200 subjects with probe images captured in the near-infrared spectrum (~780-1,100 nm) and gallery images captured in the visible spectrum.
- Portions of this dataset are publicly available for download.
- Only one NIR and one VIS image per subject are used, making the scenario more difficult than previous experiments, which benefited from multiple images per subject in training.
- The data was split as follows: n_t = 133 subjects were used for the training set T, and the remaining 67 subjects were used for testing.
7.1.2 Dataset 2—Thermal to Visible (Fig. 1b)
- The second dataset is a private dataset collected by the Pinellas County Sheriff’s Office (PCSO) and consists of 1,000 subjects with thermal infrared probe images and visible (mug shot) gallery images.
- The thermal infrared images were collected using a FLIR Recon III ObservIR camera, which has sensitivity in the range of 3-5 μm.
- The data was split as follows: n_t = 667 subjects were used for training set T and the remaining 333 subjects were used for testing.
7.1.3 Dataset 3—Viewed Sketch to Visible (Fig. 1c)
- The third dataset is the CUHK sketch dataset, which was used by Tang and Wang [3], [5].
- The CUHK dataset consists of 606 subjects with a viewed sketch image for probe and a visible photograph for gallery.
- The 606 subjects were split to form a training set T with n_t = 404 subjects, and the remaining 202 subjects were used for testing.
7.1.4 Dataset 4—Forensic Sketch to Visible (Fig. 1d)
- The fourth and final heterogeneous face dataset consists of real-world forensic sketches and mug shot photos of 159 subjects.
- Forensic sketches are drawn by an artist based only on an eye witness description of the subject.
- The forensic sketch dataset is a collection of images from Gibson [45], Taylor [46], the Michigan State Police, and the Pinellas County Sheriff’s Office.
- Forensic sketches contain incomplete information regarding the subject and are one of the most difficult HFR scenarios because the sketches often do not closely resemble the photograph.
- The number of subjects used in T is 106, and 53 subjects are used for the test set.
7.1.5 Dataset 5: Standard Face Recognition
- A fifth nonheterogeneous (i.e., homogeneous) dataset is used to demonstrate the ability of the proposed approach to operate in standard face recognition scenarios as well.
- The dataset consists of one probe and one gallery photograph of 876 subjects, where 117 subjects were from the AR dataset [43], 294 subjects were from the XM2VTS dataset [44], 193 subjects were from the FERET dataset [47], and 272 subjects were from a private dataset collected at the University of Notre Dame.
7.1.6 Enlarged Gallery
- A collection of 10,000 mug shot images from 10,000 different subjects was used in certain experiments to increase the size of the gallery.
- These mug shot images were provided by the Pinellas County Sheriff’s Office.
- Any experiment using these additional images will have a gallery with the number of testing subjects plus 10,000 images.
- Experiments with a large gallery are meant to present results that more closely resemble real-world face retrieval scenarios that would occur in forensic and intelligence applications of heterogeneous face recognition.
7.2 Results
- Fig. 6 lists the rank retrieval results of P-RS, D-RS, and FaceVACS for each dataset using the additional 10,000 gallery images for each experiment.
- Regardless, the improved accuracy using a smaller training set of subjects clearly demonstrates the value of the proposed P-RS method.
- The lower accuracy of P-RS compared to D-RS on the forensic sketch dataset can be attributed to two factors.
- As shown, the recognition accuracy generally saturates around 100 prototypes.
- Using the standard face dataset, Fig. 10a compares the accuracy of P-RS, D-RS, and FaceVACS.
8 SUMMARY
- A method for heterogeneous face recognition, called Prototype Random Subspaces, is proposed.
- Probe and gallery images are initially filtered with three different image filters, and two different local feature descriptors are then extracted.
- A training set acts as a set of prototypes in which each prototype subject has an image in both the gallery and probe modalities.
- Results were compared against a leading commercial face recognition engine.
- Tailoring the P-RS parameters and learning weighted fusion schemes for each HFR scenario separately should offer further accuracy improvements.
ACKNOWLEDGMENTS
- The authors would like to thank Scott McCallum and the rest of his team at the Pinellas County Sheriff's Office, and Captain Greg Michaud from the Michigan State Police for their gracious support of this research.
- They would also like to thank Rong Jin and Serhat Bucak for their feedback on this research.
- This manuscript benefited from the valuable observations provided in the review process.
- Anil Jain’s research was partially supported by the World Class University (WCU) program funded by the Ministry of Education, Science and Technology through the National Research Foundation of Korea (R31-10008).