Heterogeneous Face Recognition Using Kernel Prototype Similarities
Summary (6 min read)
1 INTRODUCTION
- An emerging topic in face recognition is matching between heterogeneous image modalities.
- Coined heterogeneous face recognition (HFR) [1], the scenario offers potential solutions to many difficult face recognition scenarios.
- While heterogeneous face recognition can involve matching between any two imaging modalities, the majority of scenarios involve a gallery dataset consisting of visible light photographs.
- Probe images can be of any other modality, though the practical scenarios of interest to us are infrared images (NIR and thermal) and hand-drawn facial sketches.
- When a subject’s face can only be acquired in nighttime environments, the use of infrared imaging may be the only modality for acquiring a useful face image of the subject.
2.1 Heterogeneous Face Recognition
- A flurry of research has emerged providing solutions to various heterogeneous face recognition problems.
- This began with sketch recognition using viewed sketches, and has continued into other modalities such as near-infrared (NIR) and forensic sketches.
- Klare and Jain followed this work on NIR to VIS face recognition by also incorporating SIFT feature descriptors and an RS-LDA scheme [10].
2.2 Kernel Prototype Representation
- The core of the proposed approach involves using a relational feature representation for face images (illustrated in Fig. 2).
- One key to their framework is that each prototype has a pattern for each image modality.
- Kernel PCA [21] and Kernel LDA [22], [23] approaches to face recognition have used a similar approach, where a face is represented as the kernel similarity to a collection of prototype images in a high-dimensional space.
- The biometric indexing scheme by Gyaourova and Ross used similarity scores to a fixed set of references in the face and fingerprint modality [24].
- These prior works differ from the proposed method because only a single prototype is used per training subject.
2.3 Proposed Method
- The proposed method presents a new approach to heterogeneous face recognition, and extends existing methods in face recognition.
- Unlike previous feature-based methods, where an image descriptor invariant to changes between the two HFR modalities was needed, the proposed framework only needs descriptors that are effective within each domain.
- The accuracy of the HFR system is improved using a random subspace framework in conjunction with linear discriminant analysis (LDA), as described in Section 5.
- While the authors demonstrate the strength of the proposed framework on many different HFR scenarios, the parameters controlling the framework are the same across all tested scenarios.
3 IMAGE PREPROCESSING AND REPRESENTATION
- All face images are initially represented using a featurebased representation.
- The use of local feature descriptors has been argued to closely resemble the postulated representation of the human visual processing system [26], and they have been shown to be well suited for face recognition [27].
3.1 Geometric Normalization
- The first step in representing face images using feature descriptors is to geometrically normalize the face images with respect to the location of the eyes.
- This step reduces the effect of scale, rotation, and translation variations.
- The eye locations for the face images from all modalities are automatically estimated using Cognitec’s FaceVACS SDK [28].
- The only exceptions are the thermal face images where the eyes are manually located for both the proposed method and the FaceVACS baseline.
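Normalizing a face with respect to the detected eye coordinates amounts to estimating a similarity transform (rotation, scale, translation) that maps the two eyes to fixed canonical positions. The sketch below illustrates that idea with plain NumPy; the canonical eye coordinates are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def eye_alignment_transform(left_eye, right_eye, target_left=(30.0, 45.0),
                            target_right=(98.0, 45.0)):
    """Return a 2x3 similarity-transform matrix mapping the detected eye
    coordinates onto fixed canonical positions, removing the effects of
    scale, in-plane rotation, and translation."""
    src = np.array([left_eye, right_eye], dtype=float)
    dst = np.array([target_left, target_right], dtype=float)
    # Recover rotation angle and scale from the inter-eye vectors.
    d_src = src[1] - src[0]
    d_dst = dst[1] - dst[0]
    scale = np.linalg.norm(d_dst) / np.linalg.norm(d_src)
    angle = np.arctan2(d_dst[1], d_dst[0]) - np.arctan2(d_src[1], d_src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    A = np.array([[c, -s], [s, c]])          # scaled rotation
    t = dst[0] - A @ src[0]                  # translation fixing the left eye
    return np.hstack([A, t[:, None]])        # 2x3 affine matrix

# Hypothetical detected eye locations in an input image.
M = eye_alignment_transform((100, 120), (160, 130))
p = M @ np.array([100, 120, 1.0])            # left eye in the aligned frame
```

Applying `M` to every pixel (e.g., via a warp routine) yields the geometrically normalized face crop.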
3.2 Image Filtering
- Face images are filtered with three different image filters.
- These filters are intended to help compensate both for intensity variations within an image domain (such as nonuniform illumination changes) and for appearance variations between image domains.
- The second aspect is of particular importance for the direct random subspace (D-RS) framework (see Section 6).
- The three image filters used are as follows.
3.2.2 Center-Surround Divisive Normalization (CSDN)
- Meyers and Wolf [30] introduced the center-surround divisive normalization filter in conjunction with their biologically inspired face recognition framework.
- The CSDN filter divides the value of each pixel by the mean pixel value in the s × s neighborhood surrounding the pixel.
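The CSDN operation described above is a one-liner given a local-mean filter. A minimal sketch with SciPy follows; the neighborhood size `s` is illustrative, not the paper's value.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def csdn(image, s=17, eps=1e-6):
    """Center-surround divisive normalization: divide each pixel by the
    mean intensity of its s x s neighborhood. `s` and `eps` are
    illustrative choices, not the paper's parameters."""
    image = image.astype(float)
    local_mean = uniform_filter(image, size=s, mode='reflect')
    return image / (local_mean + eps)   # eps guards against division by zero
```

On a constant image the output is (approximately) all ones, since every pixel equals its local mean; on real faces the filter suppresses smooth illumination gradients.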
3.2.3 Gaussian
- The Gaussian smoothing filter has long been used in image processing applications to remove noise contained in high spatial frequencies while retaining the remainder of the signal.
- The width of the Gaussian filter was held fixed in their implementation.
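The noise-suppression behavior described above can be seen directly: smoothing white noise with a Gaussian kernel sharply reduces its energy. The `sigma` value below is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
noisy = rng.normal(0.0, 1.0, size=(64, 64))   # pure high-frequency noise
smoothed = gaussian_filter(noisy, sigma=2.0)  # sigma is an illustrative width
# Smoothing attenuates the high-spatial-frequency content, so the
# pixel-value spread of `smoothed` is much smaller than that of `noisy`.
```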
3.3 Local Descriptor Representation
- Once an image is geometrically normalized and filtered using one of the three filters, local feature descriptors are extracted from uniformly distributed patches across the face.
- The authors use two different feature descriptors to represent the face image: the SIFT descriptor [14] and Local Binary Patterns [13].
- LBP features have a longer history of successful use in face recognition.
- Each patch overlaps its vertical and horizontal neighbors by 16 pixels.
- Using uniform patterns at eight sampling locations, as described by Ojala et al. [13], the LBP descriptor yields a 59-dimensional feature descriptor.
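The 59 dimensions come from the "uniform pattern" construction: of the 256 possible 8-bit LBP codes, the 58 codes with at most two 0/1 transitions in the circular pattern each get their own histogram bin, and all remaining codes share one bin. A small sketch verifying that count:

```python
import numpy as np

def uniform_lbp_bins():
    """Map each 8-bit LBP code to one of 59 histogram bins: the 58 'uniform'
    patterns (at most two 0/1 transitions around the circle) each get their
    own bin; all non-uniform patterns share a single final bin."""
    bins = {}
    next_bin = 0
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:
            bins[code] = next_bin
            next_bin += 1
    nonuniform_bin = next_bin   # shared bin for everything else
    lookup = {c: bins.get(c, nonuniform_bin) for c in range(256)}
    return lookup, next_bin + 1

lookup, n_bins = uniform_lbp_bins()
# n_bins == 59: 58 uniform patterns plus one shared non-uniform bin.
```

A per-patch descriptor is then just a 59-bin histogram of `lookup[code]` over the patch's LBP codes.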
4.1 Prototype Representation
- The heterogeneous prototype framework begins with images from the probe and gallery modalities represented by (possibly different) feature descriptors for each of the N image patches, as described in the previous section.
- The cosine kernel was chosen because it resulted in consistently higher accuracy on all tested scenarios compared to the radial basis function kernel and the polynomial kernel.
- Additionally, because the feature vectors φ_P(P) and φ_G(G) are measures of the similarity between a test image and the prototype training images, the feature spaces used for similarity computation do not have to be the same for the probe and gallery modalities.
- For example, φ_P^{Fc,Ds}(I) denotes the prototype similarity of image I when represented using the CSDN image filter (Fc) and SIFT descriptors (Ds).
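The core representation step above reduces to computing the cosine-kernel similarity between an image's feature vector and each prototype from the matching modality. A minimal sketch, with random vectors standing in for real face descriptors:

```python
import numpy as np

def cosine_kernel(x, y):
    """Cosine similarity between two feature vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def prototype_representation(x, prototypes):
    """Represent feature vector x by its cosine-kernel similarity to each
    prototype. Probe images are compared to probe-modality prototypes and
    gallery images to gallery-modality prototypes, so the raw feature
    spaces of the two modalities need not coincide."""
    return np.array([cosine_kernel(x, p) for p in prototypes])

rng = np.random.default_rng(1)
protos_probe = rng.normal(size=(100, 128))   # 100 illustrative prototypes
phi = prototype_representation(rng.normal(size=128), protos_probe)
# phi is a 100-D vector of similarities, each in [-1, 1].
```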
4.2 Discriminant Analysis
- After representing the images in the training set T in the aforementioned prototype representation, the authors next learn linear subspaces using linear discriminant analysis (LDA) [33] to enhance the discriminative capabilities of the prototype representation φ(·). LDA (and its variants) has consistently demonstrated its ability to improve the accuracy of various recognition algorithms through feature extraction and dimensionality reduction.
- The authors learn the linear projection matrix W by following the conventional approach for high-dimensional data, namely, by first applying PCA, followed by LDA [33].
- In all experiments, the PCA step was used to retain 99.0 percent of the variance.
- Next, the within-class and between-class scatter matrices, S_W and S_B, are computed from the PCA-projected training data W_1^T X.
- Letting μ denote the mean of X, the final representation for an unseen probe or gallery image I under the prototype framework is W^T(φ(I) − μ). Subsequent uses of W in this work assume the appropriate removal of the mean from φ(I), for terseness.
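The PCA-then-LDA pipeline can be sketched with scikit-learn. The synthetic data, class counts, and dimensions below are illustrative; the only detail taken from the paper is retaining 99 percent of the variance in the PCA step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for prototype-similarity vectors: 20 subjects
# (classes), 4 samples each, 50-D features.
rng = np.random.default_rng(2)
n_subjects, per_subject, dim = 20, 4, 50
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(per_subject, dim))
               for i in range(n_subjects)])
y = np.repeat(np.arange(n_subjects), per_subject)

pca = PCA(n_components=0.99)    # retain 99% of the variance, as in the paper
Z = pca.fit_transform(X)        # PCA also removes the mean
lda = LinearDiscriminantAnalysis()
lda.fit(Z, y)                   # lda.scalings_ plays the role of W
projected = lda.transform(Z)    # at most (n_subjects - 1) dimensions
```

Unseen probe and gallery images would be pushed through the same `pca.transform` followed by `lda.transform` before matching.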
5.1 Motivation
- The proposed heterogeneous prototype framework uses training data to define the prototypes and to learn the linear subspace projection matrix W .
- When applying a prototype representation to face recognition, a large number of classes (or subjects) and features are present.
- Most are designed to handle deficiencies in the subspace W , such as dual-space LDA [34] and direct LDA [37].
- These methods do not address the issue of too few prototypes for an expressive representation.
- Their approach combined random subspaces and bagging by sampling both features and training instances.
5.2 Prototype Random Subspaces
- The prototype random subspace framework uses B different bags (or samples) of the N face patches.
- Let f(I, b) denote the concatenation of the descriptors from the randomly selected patch indices in bag b.
- For terseness, the authors have omitted the superscripts F and D in the preceding equations.
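The bagging step is simply repeated sampling of patch-index subsets: each bag selects a random subset of the N face patches, and a separate prototype representation plus LDA subspace is trained per bag, with per-bag scores fused at match time. A sketch of the sampling, with an illustrative patch count and sampling fraction:

```python
import numpy as np

def sample_bags(n_patches, n_bags, fraction, seed=0):
    """Draw B random bags of patch indices without replacement within each
    bag. `n_patches` and `fraction` here are illustrative values, not the
    paper's exact configuration."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(fraction * n_patches)))
    return [rng.choice(n_patches, size=k, replace=False)
            for _ in range(n_bags)]

bags = sample_bags(n_patches=154, n_bags=30, fraction=0.1)
# 30 bags, each selecting a distinct random subset of ~15 patch indices.
```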
5.4 Score Level Fusion
- The proposed framework naturally lends itself to fusion of the different feature representations.
- Given one image filter F and two feature descriptors D1 and D2, one can utilize the following sum of similarity scores between probe image P and gallery image G: S_{F,D1}^{F,D1}(P,G) + S_{F,D2}^{F,D2}(P,G) + S_{F,D2}^{F,D1}(P,G) + S_{F,D1}^{F,D2}(P,G). Min-max score normalization is performed prior to fusion.
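Min-max normalization followed by sum fusion is straightforward to sketch. Each vector below holds one matcher's scores for a single probe against a small illustrative gallery; the numbers are made up.

```python
import numpy as np

def min_max(scores):
    """Rescale a score vector to [0, 1] (min-max normalization)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def fuse(score_sets):
    """Min-max normalize each matcher's gallery scores, then sum them."""
    return np.sum([min_max(s) for s in score_sets], axis=0)

# Illustrative scores from two representations (e.g., LBP- and SIFT-based),
# one entry per gallery image; index 1 is the true mate in this toy example.
s_lbp = np.array([0.2, 0.9, 0.4])
s_sift = np.array([10.0, 30.0, 25.0])
fused = fuse([s_lbp, s_sift])
```

Normalizing first matters because the two matchers' raw score ranges differ by orders of magnitude; without it the larger-scaled matcher would dominate the sum.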
6.1 Commercial Matcher
- The accuracy of the proposed prototype random subspace framework is compared against Cognitec’s FaceVACS [28].
- Comparing the accuracy of their system against a leading COTS FRS offers an unbiased baseline of a state-of-the-art commercial matcher on each HFR scenario.
- FaceVACS was chosen because it is considered one of the best commercial face matchers and, in their internal tests, it excels at HFR scenarios (with respect to other commercial matchers).
- The accuracy of FaceVACS on NIR to VIS [10] and Viewed Sketch to VIS [9] performed on par with some previously published HFR methods.
6.2 Direct Random Subspaces
- In addition to a commercial face recognition system, the proposed prototype recognition system is also compared against a recognition system that directly measures the difference between probe and gallery images using a common feature descriptor representation.
- The random subspace framework from [10] is used as the baseline because it is the most similar to the proposed prototype framework, thus helping to isolate the difference between using kernel prototype similarities versus directly measuring the similarity.
- Further, because most of the datasets tested in Section 7 are in the public domain, the proposed framework may also be compared against any other published method on these datasets.
- This follows from the fact that f^{F,D1}(I) and f^{F,D2}(I) generally have different dimensionality and also have different interpretations.
- D-RS will be used in conjunction with the six filter/ descriptor representations presented in Section 3 (SIFT+DoG, MLBP+CSDN, etc.).
7 EXPERIMENTS
- The results provided are based on the following parameter values: a patch sampling fraction of α = 0.1 and B = 30 bags.
- A cosine kernel was used to compute the prototype similarity and 99.0 percent of the variance was retained in the PCA step of LDA.
7.1 Databases
- Five different matching scenarios are tested in this paper: four heterogeneous face recognition scenarios and one standard face recognition scenario.
- Example images from each HFR dataset can be found in Fig. 1.
- Results shown on each dataset are the average and standard deviation of five random splits of training and testing subjects.
- In every experiment, no subject that was used in training was used for testing.
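The evaluation protocol described above (five random, subject-disjoint train/test partitions whose results are averaged) can be sketched as follows; the subject counts match Dataset 1, and the function name is a hypothetical helper, not code from the paper.

```python
import numpy as np

def subject_disjoint_splits(n_subjects, n_train, n_splits=5, seed=0):
    """Generate random train/test partitions in which no subject appears
    in both sets; accuracy is then averaged over the splits."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n_subjects)
        splits.append((perm[:n_train], perm[n_train:]))
    return splits

# Dataset 1 sizes: 200 subjects, 133 for training, 67 for testing.
splits = subject_disjoint_splits(n_subjects=200, n_train=133)
train, test = splits[0]
```

Splitting by subject (rather than by image) is what guarantees the "no training subject is used for testing" condition.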
7.1.1 Dataset 1—Near-Infrared to Visible (Fig. 1a)
- The first dataset consists of 200 subjects with probe images captured in the near-infrared spectrum (~780-1,100 nm) and gallery images captured in the visible spectrum.
- Portions of this dataset are publicly available for download.
- Only one NIR and one VIS image per subject are used, making the scenario more difficult than previous experiments, which benefited from multiple images per subject in training.
- The data was split as follows: n_t = 133 subjects were used for the training set T, and the remaining 67 subjects were used for testing.
7.1.2 Dataset 2—Thermal to Visible (Fig. 1b)
- The second dataset is a private dataset collected by the Pinellas County Sheriff’s Office (PCSO) and consists of 1,000 subjects with thermal infrared probe images and visible (mug shot) gallery images.
- The thermal infrared images were collected using a FLIR Recon III ObservIR camera, which has sensitivity in the range of 3-5 μm.
- The data was split as follows: n_t = 667 subjects were used for training set T and the remaining 333 subjects were used for testing.
7.1.3 Dataset 3—Viewed Sketch to Visible (Fig. 1c)
- The third dataset is the CUHK sketch dataset, which was used by Tang and Wang [3], [5].
- The CUHK dataset consists of 606 subjects with a viewed sketch image for probe and a visible photograph for gallery.
- The 606 subjects were split to form a training set T with n_t = 404 subjects, and the remaining 202 subjects were used for testing.
7.1.4 Dataset 4—Forensic Sketch to Visible (Fig. 1d)
- The fourth and final heterogeneous face dataset consists of real-world forensic sketches and mug shot photos of 159 subjects.
- Forensic sketches are drawn by an artist based only on an eye witness description of the subject.
- The forensic sketch dataset is a collection of images from Gibson [45], Taylor [46], the Michigan State Police, and the Pinellas County Sheriff’s Office.
- Forensic sketches contain incomplete information regarding the subject and are one of the most difficult HFR scenarios because the sketches often do not closely resemble the photograph.
- The number of subjects used in T is 106, and 53 subjects are used for the test set.
7.1.5 Dataset 5: Standard Face Recognition
- A fifth nonheterogeneous (i.e., homogeneous) dataset is used to demonstrate the ability of the proposed approach to operate in standard face recognition scenarios as well.
- The dataset consists of one probe and one gallery photograph of 876 subjects, where 117 subjects were from the AR dataset [43], 294 subjects were from the XM2VTS dataset [44], 193 subjects were from the FERET dataset [47], and 272 subjects were from a private dataset collected at the University of Notre Dame.
7.1.6 Enlarged Gallery
- A collection of 10,000 mug shot images from 10,000 different subjects was used in certain experiments to increase the size of the gallery.
- These mug shot images were provided by the Pinellas County Sheriff’s Office.
- Any experiment using these additional images will have a gallery with the number of testing subjects plus 10,000 images.
- Experiments with a large gallery are meant to present results that more closely resemble real-world face retrieval scenarios that would occur in forensic and intelligence applications of heterogeneous face recognition.
7.2 Results
- Fig. 6 lists the rank retrieval results of P-RS, D-RS, and FaceVACS for each dataset using the additional 10,000 gallery images for each experiment.
- Regardless, the improved accuracy using a smaller training set of subjects clearly demonstrates the value of the proposed P-RS method.
- The lower accuracy of P-RS compared to D-RS on the forensic sketch dataset can be attributed to two factors.
- As shown, the recognition accuracy generally saturates around 100 prototypes.
- Using the standard face dataset, Fig. 10a compares the accuracy of P-RS, D-RS, and FaceVACS.
8 SUMMARY
- A method for heterogeneous face recognition, called Prototype Random Subspaces, is proposed.
- Probe and gallery images are initially filtered with three different image filters, and two different local feature descriptors are then extracted.
- A training set acts as a set of prototypes in which each prototype subject has an image in both the gallery and probe modalities.
- Results were compared against a leading commercial face recognition engine.
- Tailoring the P-RS parameters and learning weighted fusion schemes for each HFR scenario separately should offer further accuracy improvements.
ACKNOWLEDGMENTS
- The authors would like to thank Scott McCallum and the rest of his team at the Pinellas County Sheriff's Office, and Captain Greg Michaud from the Michigan State Police for their gracious support of this research.
- They would also like to thank Rong Jin and Serhat Bucak for their feedback on this research.
- This manuscript benefited from the valuable observations provided in the review process.
- Anil Jain’s research was partially supported by the World Class University (WCU) program funded by the Ministry of Education, Science and Technology through the National Research Foundation of Korea (R31-10008).