
Showing papers on "Face detection" published in 2016


Journal ArticleDOI
TL;DR: Zhang et al. propose a deep cascaded multitask framework that exploits the inherent correlation between face detection and alignment to boost the performance of both, leveraging a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark locations in a coarse-to-fine manner.
Abstract: Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations, and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this letter, we propose a deep cascaded multitask framework that exploits the inherent correlation between detection and alignment to boost their performance. In particular, our framework leverages a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark locations in a coarse-to-fine manner. In addition, we propose a new online hard sample mining strategy that further improves the performance in practice. Our method achieves superior accuracy over state-of-the-art techniques on the challenging Face Detection Data Set and Benchmark (FDDB) and WIDER FACE benchmarks for face detection, and on the Annotated Facial Landmarks in the Wild (AFLW) benchmark for face alignment, while keeping real-time performance.

3,980 citations
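
The online hard sample mining strategy described in the abstract above is easy to reproduce: in each mini-batch, the per-sample losses are sorted and only the hardest fraction (the letter reports using the top 70%) contributes to the backward pass. A minimal PyTorch sketch, assuming an unreduced per-sample loss as input; the keep ratio is taken from the paper's description:

    import torch

    def ohem_loss(per_sample_losses: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
        """Online hard sample mining: back-propagate only the hardest samples.
        per_sample_losses: shape (N,), unreduced loss for each sample in the batch."""
        n_keep = max(1, int(per_sample_losses.numel() * keep_ratio))
        hard, _ = torch.topk(per_sample_losses, n_keep)  # largest losses = hardest samples
        return hard.mean()

    # Usage: loss = ohem_loss(F.cross_entropy(logits, labels, reduction="none"))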


Journal ArticleDOI
TL;DR: A deep cascaded multitask framework that exploits the inherent correlation between detection and alignment to boost their performance, achieving superior accuracy over state-of-the-art techniques on challenging face detection and alignment benchmarks.

Abstract: Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark locations in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment, while keeping real-time performance.

1,982 citations


Proceedings ArticleDOI
24 Oct 2016
TL;DR: A novel class of attacks is defined: physically realizable, inconspicuous attacks that allow an attacker to evade recognition or impersonate another individual, together with a systematic method to generate such attacks automatically by printing a pair of eyeglass frames.
Abstract: Machine learning is enabling myriad innovations, including new algorithms for cancer diagnosis and self-driving cars. The broad use of machine learning makes it important to understand the extent to which machine-learning algorithms are subject to attack, particularly when used in applications where physical security or safety is at risk. In this paper, we focus on facial biometric systems, which are widely used in surveillance and access control. We define and investigate a novel class of attacks: attacks that are physically realizable and inconspicuous, and allow an attacker to evade recognition or impersonate another individual. We develop a systematic method to automatically generate such attacks, which are realized through printing a pair of eyeglass frames. When worn by the attacker whose image is supplied to a state-of-the-art face-recognition algorithm, the eyeglasses allow her to evade being recognized or to impersonate another individual. Our investigation focuses on white-box face-recognition systems, but we also demonstrate how similar techniques can be used in black-box scenarios, as well as to avoid face detection.

1,466 citations
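
The core of such an attack can be sketched as gradient-based optimization of the recognizer's loss, with the perturbation confined to a binary mask shaped like the eyeglass frames. A minimal sketch under stated assumptions, not the authors' implementation: `model` is any differentiable classifier taking a (1, 3, H, W) image, `mask` marks the frame pixels, and the optimizer settings are illustrative:

    import torch
    import torch.nn.functional as F

    def eyeglass_attack(model, image, target, mask, steps=100, lr=0.01):
        """Optimize a perturbation restricted to `mask` (1 = frame pixels) so that
        `model` classifies the image as `target` (a LongTensor of shape (1,))."""
        delta = torch.zeros_like(image, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)
        for _ in range(steps):
            adv = (image + mask * delta).clamp(0, 1)    # only frame pixels change
            loss = F.cross_entropy(model(adv), target)  # impersonation objective;
            opt.zero_grad()                             # negate it for dodging instead
            loss.backward()
            opt.step()
        return (image + mask * delta).clamp(0, 1).detach()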


Proceedings ArticleDOI
27 Jun 2016
TL;DR: There is a gap between current face detection performance and real-world requirements; the WIDER FACE dataset, 10 times larger than existing datasets, is introduced with rich annotations, including occlusions, poses, event categories, and face bounding boxes.

Abstract: Face detection is one of the most studied topics in the computer vision community. Much of this progress has been driven by the availability of face detection benchmark datasets. We show that there is a gap between current face detection performance and real-world requirements. To facilitate future face detection research, we introduce the WIDER FACE dataset, which is 10 times larger than existing datasets. The dataset contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the proposed dataset are extremely challenging due to large variations in scale, pose and occlusion. Furthermore, we show that the WIDER FACE dataset is an effective training source for face detection. We benchmark several representative detection systems, providing an overview of state-of-the-art performance, and propose a solution to deal with large scale variation. Finally, we discuss common failure cases that are worth further investigation.

1,458 citations


Proceedings ArticleDOI
TL;DR: UnitBox introduces an Intersection over Union ($IoU$) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit.
Abstract: In present object detection systems, deep convolutional neural networks (CNNs) are utilized to predict bounding boxes of object candidates, and have gained performance advantages over traditional region proposal methods. However, existing deep CNN methods assume the object bounds to be four independent variables, which can be regressed by the $\ell_2$ loss separately. Such an oversimplified assumption is contrary to the well-established observation that these variables are correlated, resulting in less accurate localization. To address this issue, we first introduce a novel Intersection over Union ($IoU$) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit. By taking advantage of the $IoU$ loss and deep fully convolutional networks, the UnitBox is introduced, which performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges quickly. We apply UnitBox to the face detection task and achieve the best performance among all published methods on the FDDB benchmark.

781 citations
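
The loss itself is compact: at each pixel the network predicts the distances (t, b, l, r) to the four sides of the box, and the loss is the negative log of the IoU between the predicted and ground-truth boxes, computed directly from those distances. A sketch following the paper's formulation; the epsilon and the (N, 4) batch layout are assumptions:

    import torch

    def iou_loss(pred, target, eps=1e-6):
        """UnitBox-style IoU loss. pred/target: (N, 4) non-negative distances
        (top, bottom, left, right) from a pixel to the box sides."""
        t, b, l, r = pred.unbind(dim=1)
        tg, bg, lg, rg = target.unbind(dim=1)
        area_p = (t + b) * (l + r)                 # predicted box area
        area_g = (tg + bg) * (lg + rg)             # ground-truth box area
        ih = torch.min(t, tg) + torch.min(b, bg)   # intersection height
        iw = torch.min(l, lg) + torch.min(r, rg)   # intersection width
        inter = ih * iw
        union = area_p + area_g - inter
        return -torch.log((inter + eps) / (union + eps)).mean()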


Proceedings ArticleDOI
01 Oct 2016
TL;DR: A novel Intersection over Union (IoU) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit, and the UnitBox, which performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges quickly.

Abstract: In present object detection systems, deep convolutional neural networks (CNNs) are utilized to predict bounding boxes of object candidates, and have gained performance advantages over traditional region proposal methods. However, existing deep CNN methods assume the object bounds to be four independent variables, which can be regressed by the $\ell_2$ loss separately. Such an oversimplified assumption is contrary to the well-established observation that these variables are correlated, resulting in less accurate localization. To address this issue, we first introduce a novel Intersection over Union (IoU) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit. By taking advantage of the IoU loss and deep fully convolutional networks, the UnitBox is introduced, which performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges quickly. We apply UnitBox to the face detection task and achieve the best performance among all published methods on the FDDB benchmark.

521 citations


Journal ArticleDOI
TL;DR: This paper introduces a novel and appealing approach for detecting face spoofing using a colour texture analysis that exploits the joint colour-texture information from the luminance and the chrominance channels by extracting complementary low-level feature descriptions from different colour spaces.
Abstract: Research on non-intrusive software-based face spoofing detection schemes has been mainly focused on the analysis of the luminance information of the face images, hence discarding the chroma component, which can be very useful for discriminating fake faces from genuine ones. This paper introduces a novel and appealing approach for detecting face spoofing using a colour texture analysis. We exploit the joint colour-texture information from the luminance and the chrominance channels by extracting complementary low-level feature descriptions from different colour spaces. More specifically, the feature histograms are computed over each image band separately. Extensive experiments on the three most challenging benchmark data sets, namely, the CASIA face anti-spoofing database, the replay-attack database, and the MSU mobile face spoof database, showed excellent results compared with the state of the art. More importantly, unlike most of the methods proposed in the literature, our proposed approach is able to achieve stable performance across all the three benchmark data sets. The promising results of our cross-database evaluation suggest that the facial colour texture representation is more stable in unknown conditions compared with its gray-scale counterparts.

449 citations
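
The recipe the abstract above describes is straightforward to prototype: convert the face crop to luminance-chrominance spaces such as YCbCr and HSV, compute a texture histogram per channel, and concatenate. A minimal sketch, assuming uniform LBP as the low-level descriptor and illustrative parameter choices; the paper evaluates several descriptors and color spaces:

    import cv2
    import numpy as np
    from skimage.feature import local_binary_pattern

    def color_texture_descriptor(bgr_face, P=8, R=1):
        """Concatenate per-channel uniform-LBP histograms from YCbCr and HSV."""
        feats = []
        for space in (cv2.COLOR_BGR2YCrCb, cv2.COLOR_BGR2HSV):
            for channel in cv2.split(cv2.cvtColor(bgr_face, space)):
                lbp = local_binary_pattern(channel, P, R, method="uniform")
                hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
                feats.append(hist)
        return np.concatenate(feats)  # input to a binary genuine/spoof classifier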


Journal ArticleDOI
TL;DR: A new taxonomy of automatic RGB, 3D, thermal and multimodal facial expression analysis is defined, encompassing all steps from face detection to facial expression recognition, and the state-of-the-art methods are described and classified accordingly.

Abstract: Facial expressions are an important way through which humans interact socially. Building a system capable of automatically recognizing facial expressions from images and video has been an intense field of study in recent years. Interpreting such expressions remains challenging, and much research is needed about the way they relate to human affect. This paper presents a general overview of automatic RGB, 3D, thermal and multimodal facial expression analysis. We define a new taxonomy for the field, encompassing all steps from face detection to facial expression recognition, and describe and classify the state-of-the-art methods accordingly. We also present the important datasets and the benchmarking of the most influential methods. We conclude with a general discussion about trends, important questions and future lines of research.

357 citations


Posted Content
TL;DR: A general overview of automatic RGB, 3D, thermal and multimodal facial expression analysis, defining a new taxonomy for the field that encompasses all steps from face detection to facial expression recognition.

Abstract: Facial expressions are an important way through which humans interact socially. Building a system capable of automatically recognizing facial expressions from images and video has been an intense field of study in recent years. Interpreting such expressions remains challenging, and much research is needed about the way they relate to human affect. This paper presents a general overview of automatic RGB, 3D, thermal and multimodal facial expression analysis. We define a new taxonomy for the field, encompassing all steps from face detection to facial expression recognition, and describe and classify the state-of-the-art methods accordingly. We also present the important datasets and the benchmarking of the most influential methods. We conclude with a general discussion about trends, important questions and future lines of research.

340 citations


Book ChapterDOI
08 Oct 2016
TL;DR: The authors propose a domain-specific data augmentation method to enrich an existing dataset with important facial appearance variations by manipulating the faces it contains; the synthesis is also used when matching query images represented by standard convolutional neural networks.

Abstract: Face recognition capabilities have recently made extraordinary leaps. Though this progress is at least partially due to ballooning training set sizes – huge numbers of face images downloaded and labeled for identity – it is not clear if the formidable task of collecting so many images is truly necessary. We propose a far more accessible means of increasing training data sizes for face recognition systems: domain-specific data augmentation. We describe novel methods of enriching an existing dataset with important facial appearance variations by manipulating the faces it contains. This synthesis is also used when matching query images represented by standard convolutional neural networks. The effect of training and testing with synthesized images is tested on the LFW and IJB-A (verification and identification) benchmarks and on Janus CS2. The performance obtained by our approach matches state-of-the-art results reported by systems trained on millions of downloaded images.

338 citations


Posted Content
TL;DR: A Neural Aggregation Network (NAN) for video face recognition, whose aggregation module consists of two attention blocks that adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them.
Abstract: This paper presents a Neural Aggregation Network (NAN) for video face recognition. The network takes a face video or face image set of a person with a variable number of face images as its input, and produces a compact, fixed-dimension feature representation for recognition. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which maps each face image to a feature vector. The aggregation module consists of two attention blocks which adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them. Due to the attention mechanism, the aggregation is invariant to the image order. Our NAN is trained with a standard classification or verification loss without any extra supervision signal, and we found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces. The experiments on IJB-A, YouTube Face, Celebrity-1000 video face recognition benchmarks show that it consistently outperforms naive aggregation methods and achieves the state-of-the-art accuracy.
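
In equation form, an attention block scores each frame embedding against a query, e_k = q · f_k, turns the scores into softmax weights, and returns the convex combination r = sum_k a_k f_k; the second block derives its query from the first block's output. A numpy sketch of that description; the parameter shapes and initialization are assumptions:

    import numpy as np

    def attention_block(features, q):
        """features: (K, D) frame embeddings, q: (D,) query. Returns (D,) aggregate."""
        e = features @ q                        # significance score per frame
        a = np.exp(e - e.max()); a /= a.sum()   # softmax weights (order-invariant)
        return a @ features                     # convex combination of embeddings

    def neural_aggregation(features, q0, W, b):
        r0 = attention_block(features, q0)      # block 1: learned, content-independent query
        q1 = np.tanh(W @ r0 + b)                # block 2: query adapted to this face set
        return attention_block(features, q1)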

Journal ArticleDOI
TL;DR: A new image feature called Normalized Pixel Difference (NPD) is proposed, computed as the difference-to-sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology; it is scale invariant, bounded, and able to reconstruct the original image.

Abstract: We propose a method to address challenges in unconstrained face detection, such as arbitrary pose variations and occlusions. First, a new image feature called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed as the difference-to-sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology. The new feature is scale invariant, bounded, and able to reconstruct the original image. Second, we propose a deep quadratic tree to learn the optimal subset of NPD features and their combinations, so that complex face manifolds can be partitioned by the learned rules. This way, only a single soft-cascade classifier is needed to handle unconstrained face detection. Furthermore, we show that the NPD features can be efficiently obtained from a lookup table, and the detection template can be easily scaled, making the proposed face detector very fast. Experimental results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the proposed method achieves state-of-the-art performance in detecting unconstrained faces with arbitrary pose variations and occlusions in cluttered scenes.
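
The feature has a closed form, f(x, y) = (x - y) / (x + y) with f(0, 0) defined as 0, so it is bounded in [-1, 1] and invariant to scaling both pixel values. Because 8-bit pixels take only 256 values, every possible pair can be precomputed, which is the lookup-table trick the abstract mentions. A small sketch:

    import numpy as np

    def npd(x, y):
        """Normalized Pixel Difference: (x - y) / (x + y), defined as 0 at x = y = 0."""
        return 0.0 if x + y == 0 else (x - y) / (x + y)

    # Precompute all 256 x 256 pixel-value pairs once; detection then reduces
    # to table lookups on pairs of pixel intensities.
    I, J = np.meshgrid(np.arange(256.0), np.arange(256.0), indexing="ij")
    NPD_TABLE = np.where(I + J == 0, 0.0, (I - J) / (I + J + 1e-12))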

Proceedings ArticleDOI
01 Oct 2016
TL;DR: A learning-based approach for reconstructing a 3D face from a single image, based on a convolutional neural network (CNN) that extracts the face geometry directly from the image.

Abstract: Fast and robust three-dimensional reconstruction of facial geometric structure from a single image is a challenging task with numerous applications. Here, we introduce a learning-based approach for reconstructing a three-dimensional face from a single image. Recent face recovery methods rely on accurate localization of key characteristic points. In contrast, the proposed approach is based on a Convolutional Neural Network (CNN) which extracts the face geometry directly from its image. Although such deep architectures outperform other models in complex computer vision problems, training them properly requires a large dataset of annotated examples. In the case of three-dimensional faces, currently there are no large-volume datasets, and acquiring such big data is a tedious task. As an alternative, we propose to generate random, yet nearly photo-realistic, facial images for which the geometric form is known. The suggested model successfully recovers facial shapes from real images, even for faces with extreme expressions and under various lighting conditions.

Posted Content
TL;DR: In this article, a multi-task learning framework was proposed for simultaneous face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition using a single deep convolutional neural network.
Abstract: We present a multi-purpose algorithm for simultaneous face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition using a single deep convolutional neural network (CNN). The proposed method employs a multi-task learning framework that regularizes the shared parameters of the CNN and builds a synergy among different domains and tasks. Extensive experiments show that the network has a better understanding of the face and achieves state-of-the-art results for most of these tasks.

Proceedings ArticleDOI
27 Jun 2016
TL;DR: It is shown that the back-propagation algorithm used in training CNNs can be naturally used in training a CNN cascade, and how joint training can be conducted on a naive CNN cascade as well as on the more sophisticated region proposal network (RPN) and Fast R-CNN.

Abstract: Cascades have been widely used in face detection, where classifiers with low computation cost can first be used to reject most of the background while keeping the recall. The cascade in detection was popularized by the seminal Viola-Jones framework and then widely used in other pipelines, such as DPM and CNN. However, to the best of our knowledge, most previous detection methods use the cascade in a greedy manner, where previous stages in the cascade are fixed when training a new stage, so the optimizations of different CNNs are isolated. In this paper, we propose joint training to achieve end-to-end optimization for a CNN cascade. We show that the back-propagation algorithm used in training CNNs can be naturally used in training a CNN cascade. We present how joint training can be conducted on a naive CNN cascade and on the more sophisticated region proposal network (RPN) and Fast R-CNN. Experiments on face detection benchmarks verify the advantages of joint training.

Book ChapterDOI
Dong Chen, Gang Hua, Fang Wen, Jian Sun
08 Oct 2016
TL;DR: A new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, addresses the challenge of large pose variations in real-world face detection and achieves state-of-the-art detection accuracy on several public benchmarks.

Abstract: Large pose variations remain a challenge in real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along with associated facial landmarks. The candidate regions are then warped by mapping the detected facial landmarks to their canonical positions, to better normalize the face patterns. The second stage, an RCNN, then verifies whether the warped candidate regions are valid faces. We conduct end-to-end learning of the cascaded network, including optimizing the canonical positions of the facial landmarks. This supervised learning of the transformations automatically selects the best scale to differentiate face/non-face patterns. By combining feature maps from both stages of the network, we achieve state-of-the-art detection accuracy on several public benchmarks. For real-time performance, we run the cascaded network only on regions of interest produced by a boosting cascade face detector. Our detector runs at 30 FPS on a single CPU core for a VGA-resolution image.
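
The warping step, mapping detected landmarks to canonical positions with a similarity transform before verification, can be approximated with OpenCV. A sketch under stated assumptions: the canonical coordinates are fixed placeholders here, whereas the paper learns them end-to-end:

    import cv2
    import numpy as np

    def warp_to_canonical(image, landmarks, canonical, size=(96, 96)):
        """Estimate a similarity transform sending `landmarks` (N, 2, N >= 2)
        onto `canonical` (N, 2), then warp the candidate region with it."""
        M, _ = cv2.estimateAffinePartial2D(
            np.asarray(landmarks, np.float32), np.asarray(canonical, np.float32))
        return cv2.warpAffine(image, M, size)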

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work proposes a novel deep face recognition framework to learn age-invariant deep face features through a carefully designed CNN model, and is the first attempt to show the effectiveness of deep CNNs in advancing the state of the art in AIFR.

Abstract: While considerable progress has been made on face recognition, age-invariant face recognition (AIFR) still remains a major challenge in real-world applications of face recognition systems. The major difficulty of AIFR arises from the fact that facial appearance is subject to significant intra-personal changes caused by the aging process over time. In order to address this problem, we propose a novel deep face recognition framework to learn age-invariant deep face features through a carefully designed CNN model. To the best of our knowledge, this is the first attempt to show the effectiveness of deep CNNs in advancing the state of the art in AIFR. Extensive experiments are conducted on several public-domain face aging datasets (MORPH Album2, FGNET, and CACD-VS) to demonstrate the effectiveness of the proposed model over the state of the art. We also verify the excellent generalization of our new model on the famous LFW dataset.

Book ChapterDOI
08 Oct 2016
TL;DR: This paper presents a method for face detection in the wild, which integrates a ConvNet and a 3D mean face model in an end-to-end multi-task discriminative learning framework and addresses two issues in adapting state-of-the-art generic object detection ConvNets.
Abstract: This paper presents a method for face detection in the wild, which integrates a ConvNet and a 3D mean face model in an end-to-end multi-task discriminative learning framework. The 3D mean face model is predefined and fixed (e.g., we used the one provided in the AFLW dataset). The ConvNet consists of two components: (i) The face proposal component computes face bounding box proposals via estimating facial key-points and the 3D transformation (rotation and translation) parameters for each predicted key-point w.r.t. the 3D mean face model. (ii) The face verification component computes detection results by pruning and refining proposals based on facial key-point configuration pooling. The proposed method addresses two issues in adapting state-of-the-art generic object detection ConvNets (e.g., Faster R-CNN) for face detection: (i) One is to eliminate the heuristic design of predefined anchor boxes in the region proposal network (RPN) by exploiting a 3D mean face model. (ii) The other is to replace the generic RoI (Region-of-Interest) pooling layer with a configuration pooling layer to respect underlying object structures. The multi-task loss consists of three terms: the classification Softmax loss and the location smooth $\ell_1$ losses of both the facial key-points and the face bounding boxes. In experiments, our ConvNet is trained on the AFLW dataset only and tested on the FDDB benchmark with fine-tuning and on the AFW benchmark without fine-tuning. The proposed method obtains very competitive state-of-the-art performance on the two benchmarks.

Posted Content
TL;DR: In this article, a multi-task learning algorithm is proposed for simultaneous face detection, landmark localization, pose estimation and gender recognition using deep convolutional neural networks (CNNs).

Abstract: We present an algorithm for simultaneous face detection, landmark localization, pose estimation and gender recognition using deep convolutional neural networks (CNNs). The proposed method, called HyperFace, fuses the intermediate layers of a deep CNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks, which boosts their individual performances. Additionally, we propose two variants of HyperFace: (1) HyperFace-ResNet, which builds on the ResNet-101 model and achieves a significant improvement in performance, and (2) Fast-HyperFace, which uses a high-recall fast face detector for generating region proposals to improve the speed of the algorithm. Extensive experiments show that the proposed models are able to capture both global and local information in faces and perform significantly better than many competitive algorithms on each of these four tasks.

Proceedings ArticleDOI
01 Sep 2016
TL;DR: This work proposes a novel scheme to detect morphed face images based on facial micro-textures extracted using statistically independent filters that are trained on natural images.
Abstract: Widespread deployment of Automatic Border Control (ABC) along with electronic Machine Readable Travel Documents (eMRTD) for person verification has enabled a prominent use case of face biometrics in border control applications. Many countries issue eMRTD passports on the basis of a printed biometric face photo submitted by the applicant. Some countries offer web portals for passport renewal, where citizens can upload their face photo. These applications allow the possibility of the photo being altered to beautify the appearance of the data subject, or being morphed to conceal the applicant's identity. Specifically, if an eMRTD passport is issued with a morphed facial image, two or more data subjects, likely the known applicant and one or more unknown companions, can use such a passport to pass border control. In this work we propose a novel scheme to detect morphed face images based on facial micro-textures extracted using statistically independent filters that are trained on natural images. Given a face image, the proposed method obtains a micro-texture variation using Binarized Statistical Image Features (BSIF), and the decision is made using a linear Support Vector Machine (SVM). This is the first work on detecting morphed face images. Extensive experiments carried out on a large-scale database of 450 morphed face images, created using 110 unique subjects of different ethnicities, ages, and genders, indicate superior performance.
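
The detection pipeline is two standard stages: a BSIF texture histogram per face, then a linear SVM. A sketch of that pipeline; `bsif_histogram` is a hypothetical stand-in here, since the BSIF filters are distributed as a separate pre-trained filter package rather than a common Python library, while the classifier uses scikit-learn:

    import numpy as np
    from sklearn.svm import LinearSVC

    def bsif_histogram(gray_face):
        """Hypothetical helper: convolve the face with pre-trained statistically
        independent (ICA) filters, binarize the responses into a code image,
        and return its normalized histogram."""
        raise NotImplementedError  # stands in for the BSIF descriptor

    def train_morph_detector(face_images, labels):
        """labels: 1 = morphed, 0 = bona fide face image."""
        X = np.stack([bsif_histogram(f) for f in face_images])
        return LinearSVC(C=1.0).fit(X, labels)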

Journal ArticleDOI
TL;DR: The functional similarities of the face-selective areas in humans and macaques indicate that they are homologous; the face system's error-prone performance with unfamiliar faces has important implications for consequential situations such as eyewitness identification and policing.
Abstract: Primate face processing depends on a distributed network of interlinked face-selective areas composed of face-selective neurons. In both humans and macaques, the network is divided into a ventral stream and a dorsal stream, and the functional similarities of the areas in humans and macaques indicate they are homologous. Neural correlates for face detection, holistic processing, face space, and other key properties of human face processing have been identified at the single neuron level, and studies providing causal evidence have established firmly that face-selective brain areas are central to face processing. These mechanisms give rise to our highly accurate familiar face recognition but also to our error-prone performance with unfamiliar faces. This limitation of the face system has important implications for consequential situations such as eyewitness identification and policing.

Journal ArticleDOI
TL;DR: A fully automatic face normalization and recognition system robust to most common face variations in unconstrained environments and improves the performance of AAM fitting by initializing the AAM with estimates of the locations of the facial landmarks obtained by a method based on flexible mixture of parts.
Abstract: Highlights: We present a fully automatic face normalization and recognition system. It normalizes face images for both in-plane and out-of-plane pose variations. The performance of AAM fitting is improved using a novel initialization technique. HOG and Gabor features are fused using CCA to obtain more discriminative features. The proposed system recognizes non-frontal faces using only a single gallery sample. Single-sample face recognition has become an important problem because of the limitations on the availability of gallery images. In many real-world applications such as passport or driver license identification, there is only a single facial image per subject available. The variations between the single gallery face image and the probe face images, captured in unconstrained environments, make single-sample face recognition even more difficult. In this paper, we present a fully automatic face recognition system robust to most common face variations in unconstrained environments. Our proposed system is capable of recognizing faces from non-frontal views and under different illumination conditions using only a single gallery sample per subject. It normalizes the face images for both in-plane and out-of-plane pose variations using an enhanced technique based on active appearance models (AAMs). We improve the performance of AAM fitting not only by training it with in-the-wild images and using a powerful optimization technique, but also by initializing the AAM with estimates of the locations of the facial landmarks obtained by a method based on a flexible mixture of parts. The proposed initialization technique results in a significant improvement of AAM fitting to non-frontal poses and makes the normalization process robust, fast and reliable. Owing to the proper alignment of the face images made possible by this approach, we can use local feature descriptors, such as Histograms of Oriented Gradients (HOG), for matching. The use of HOG features makes the system robust against illumination variations. In order to improve the discriminating information content of the feature vectors, we also extract Gabor features from the normalized face images and fuse them with HOG features using Canonical Correlation Analysis (CCA). Experimental results on various databases outperform the state-of-the-art methods and show the effectiveness of our proposed method in the normalization and recognition of face images obtained in unconstrained environments.
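
The CCA fusion step described above projects the HOG and Gabor descriptors onto maximally correlated subspaces and concatenates the projections. A minimal sketch with scikit-image and scikit-learn; the descriptor parameters and the small Gabor filter bank are illustrative assumptions, and n_components must not exceed the number of training samples or either feature dimension:

    import numpy as np
    from skimage.feature import hog
    from skimage.filters import gabor
    from sklearn.cross_decomposition import CCA

    def hog_gabor_features(gray_face):
        h = hog(gray_face, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        g = np.concatenate([gabor(gray_face, frequency=f)[0].ravel()
                            for f in (0.1, 0.2, 0.4)])  # real parts of a tiny filter bank
        return h, g

    def cca_fuse(H, G, n_components=64):
        """H, G: (num_samples, dim) HOG and Gabor matrices from aligned faces."""
        cca = CCA(n_components=n_components).fit(H, G)
        Hc, Gc = cca.transform(H, G)
        return np.hstack([Hc, Gc])  # fused, more discriminative descriptor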

Proceedings ArticleDOI
31 Oct 2016
TL;DR: This paper presents the implementation details of the proposed solution to the Emotion Recognition in the Wild 2016 Challenge, in the category of video-based emotion recognition, which achieves 59.42% validation accuracy and improves the competition baseline of 38.81%.
Abstract: This paper presents the implementation details of the proposed solution to the Emotion Recognition in the Wild 2016 Challenge, in the category of video-based emotion recognition. The proposed approach takes the video stream from the audio-video trimmed clips provided by the challenge as input and produces the emotion label corresponding to this video sequence. This output is encoded as one out of seven classes: the six basic emotions (Anger, Disgust, Fear, Happiness, Sad, Surprise) and Neutral. Overall, the system consists of several pipelined modules: face detection, image pre-processing, deep feature extraction, feature encoding and, finally, an SVM classification. This system achieves 59.42% validation accuracy, surpassing the competition baseline of 38.81%. With regard to test data, our system achieves 56.66% recognition rate, also improving the competition baseline of 40.47%.

Proceedings ArticleDOI
07 Mar 2016
TL;DR: A novel representation of face recognition using multiple pose-aware deep learning models achieves better results than the state-of-the-art on IARPA's CS2 and NIST's IJB-A in both verification and identification tasks.
Abstract: We introduce our method and system for face recognition using multiple pose-aware deep learning models. In our representation, a face image is processed by several pose-specific deep convolutional neural network (CNN) models to generate multiple pose-specific features. 3D rendering is used to generate multiple face poses from the input image. Sensitivity of the recognition system to pose variations is reduced since we use an ensemble of pose-specific CNN features. The paper presents extensive experimental results on the effect of landmark detection, CNN layer selection and pose model selection on the performance of the recognition pipeline. Our novel representation achieves better results than the state-of-the-art on IARPA's CS2 and NIST's IJB-A in both verification and identification (i.e. search) tasks.

Journal ArticleDOI
TL;DR: A new partial face recognition approach to recognize persons of interest from their partial faces using a robust point set matching method, where both the textural information and geometrical information of local features are explicitly used for matching simultaneously.
Abstract: Over the past three decades, a number of face recognition methods have been proposed in computer vision, and most of them use holistic face images for person identification. In many real-world scenarios, especially unconstrained environments, human faces might be occluded by other objects, and it is difficult to obtain fully holistic face images for recognition. To address this, we propose a new partial face recognition approach to recognize persons of interest from their partial faces. Given a gallery image and a probe face patch, we first detect keypoints and extract their local textural features. Then, we propose a robust point set matching method to discriminatively match the two extracted local feature sets, where both the textural information and the geometrical information of the local features are explicitly used for matching simultaneously. Finally, the similarity of two faces is computed as the distance between the two aligned feature sets. Experimental results on four public face datasets show the effectiveness of the proposed approach.

Book ChapterDOI
08 Oct 2016
TL;DR: Experimental results show that CNNs trained on visible spectrum images can be used to obtain results that are on par or improve over the state-of-the-art for heterogeneous recognition with near-infrared images and sketches.
Abstract: Heterogeneous face recognition aims to recognize faces across different sensor modalities. Typically, gallery images are normal visible spectrum images, and probe images are infrared images or sketches. Recently significant improvements in visible spectrum face recognition have been obtained by CNNs learned from very large training datasets. In this paper, we are interested in the question to what extent the features from a CNN pre-trained on visible spectrum face images can be used to perform heterogeneous face recognition. We explore different metric learning strategies to reduce the discrepancies between the different modalities. Experimental results show that we can use CNNs trained on visible spectrum images to obtain results that are on par or improve over the state-of-the-art for heterogeneous recognition with near-infrared images and sketches.

Proceedings ArticleDOI
07 Mar 2016
TL;DR: In this article, a bilinear CNN (B-CNN) was applied to the IARPA Janus Benchmark A (IJB-A) for face recognition.
Abstract: The recent explosive growth in convolutional neural network (CNN) research has produced a variety of new architectures for deep learning. One intriguing new architecture is the bilinear CNN (B-CNN), which has shown dramatic performance gains on certain fine-grained recognition problems [15]. We apply this new CNN to the challenging new face recognition benchmark, the IARPA Janus Benchmark A (IJB-A) [12]. It features faces from a large number of identities in challenging real-world conditions. Because the face images were not identified automatically using a computerized face detection system, it does not have the bias inherent in such a database. We demonstrate the performance of the B-CNN model beginning from an AlexNet-style network pre-trained on ImageNet. We then show results for fine-tuning using a moderate-sized and public external database, FaceScrub [17]. We also present results with additional fine-tuning on the limited training data provided by the protocol. In each case, the fine-tuned bilinear model shows substantial improvements over the standard CNN. Finally, we demonstrate how a standard CNN pre-trained on a large face database, the recently released VGG-Face model [20], can be converted into a B-CNN without any additional feature training. This B-CNN improves upon the CNN performance on the IJB-A benchmark, achieving 89.5% rank-1 recall.
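
Bilinear pooling, the operation that distinguishes the B-CNN, takes the outer product of two feature maps at each spatial location, sum-pools over locations, and applies signed square-root and L2 normalization. A numpy sketch of that operation; the flattened (H*W, D) layout is an assumption:

    import numpy as np

    def bilinear_pool(fa, fb):
        """fa: (H*W, Da), fb: (H*W, Db) feature maps from two CNN streams
        (or the same stream twice). Returns the (Da*Db,) pooled descriptor."""
        x = (fa.T @ fb).ravel()                  # sum of outer products over locations
        x = np.sign(x) * np.sqrt(np.abs(x))      # signed square-root normalization
        return x / (np.linalg.norm(x) + 1e-12)   # L2 normalization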

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This study reveals the limited modeling capacity of the common boosted trees model, motivating a need for architectural changes in order to compete with multi-level and very deep architectures.
Abstract: We aim to study the modeling limitations of the commonly employed boosted decision trees classifier. Inspired by the success of large, data-hungry visual recognition models (e.g. deep convolutional neural networks), this paper focuses on the relationship between modeling capacity of the weak learners, dataset size, and dataset properties. A set of novel experiments on the Caltech Pedestrian Detection benchmark results in the best known performance among non-CNN techniques while operating at fast run-time speed. Furthermore, the performance is on par with deep architectures (9.71% log-average miss rate), while using only HOG+LUV channels as features. The conclusions from this study are shown to generalize over different object detection domains as demonstrated on the FDDB face detection benchmark (93.37% accuracy). Despite the impressive performance, this study reveals the limited modeling capacity of the common boosted trees model, motivating a need for architectural changes in order to compete with multi-level and very deep architectures.

Journal ArticleDOI
TL;DR: A novel method to automatically produce approximately axis-symmetrical virtual face images that is mathematically very tractable and quite easy to implement and verified in comparison with state-of-the-art dictionary learning algorithms.

Proceedings ArticleDOI
01 Sep 2016
TL;DR: In this article, a unique non-commercial dataset, the University of Maryland Active Authentication Dataset 02 (UMDAA-02) for multi-modal user authentication research is introduced.
Abstract: In this paper, automated user verification techniques for smartphones are investigated. A unique non-commercial dataset, the University of Maryland Active Authentication Dataset 02 (UMDAA-02) for multi-modal user authentication research, is introduced. This paper focuses on three sensors (front camera, touch sensor and location service) while providing a general description of other modalities. Benchmark results for face detection, face verification, touch-based user identification and location-based next-place prediction are presented, which indicate that more robust methods fine-tuned to the mobile platform are needed to achieve satisfactory verification accuracy. The dataset will be made available to the research community to promote additional research.