
Showing papers on "Facial recognition system published in 2016"


Book ChapterDOI
08 Oct 2016
TL;DR: This paper proposes a new supervision signal for the face recognition task, called center loss, which simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Abstract: Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state of the art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs that achieve, as far as possible, the two key learning objectives of inter-class dispersion and intra-class compactness, which are essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public-domain face benchmark) under the protocol of small training set (under 500,000 images and under 20,000 persons), significantly improving the previous results and setting a new state of the art for both face recognition and face verification tasks.

3,464 citations
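
As a rough illustration of the center-loss idea described in the abstract above, here is a minimal PyTorch sketch. The class name, the random initialization, and the example weight are illustrative assumptions, not the authors' released code; in the paper the centers are maintained with a dedicated per-mini-batch update rule, whereas here they are simplified to ordinary learnable parameters.

```python
# Minimal sketch of a center loss (assumed names; centers as plain learnable parameters).
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))  # one center per identity

    def forward(self, features, labels):
        # features: (batch, feat_dim); labels: (batch,) integer class ids
        batch_centers = self.centers[labels]                 # center of each sample's class
        return 0.5 * ((features - batch_centers) ** 2).sum(dim=1).mean()

# Joint supervision as described above:
# total_loss = cross_entropy(logits, labels) + lam * CenterLoss(n_ids, 512)(feats, labels),
# where lam (e.g. 0.003, an example value) trades intra-class compactness against the softmax term.
```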


Book ChapterDOI
Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, Jianfeng Gao
08 Oct 2016
TL;DR: In this article, the authors proposed a benchmark task to recognize one million celebrities from their face images, using all face images of each individual that can be collected from the web as training data.
Abstract: In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, using all face images of each individual that can be collected from the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation and improve the recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide a concrete measurement set, an evaluation protocol, and training data. We also present our experimental setup in detail and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.

1,346 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video) that addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling and re-renders the manipulated output video in a photo-realistic fashion.
Abstract: We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where YouTube videos are reenacted in real time.

1,011 citations


01 Jan 2016
TL;DR: It is shown that OpenFace provides near-human accuracy on the LFW benchmark, and a new classification benchmark for mobile scenarios is presented; the paper is intended for non-experts interested in using OpenFace and provides a light introduction to the deep neural network techniques the authors use.
Abstract: Cameras are becoming ubiquitous in the Internet of Things (IoT) and can use face recognition technology to improve context. There is a large accuracy gap between today's publicly available face recognition systems and the state-of-the-art private face recognition systems. This paper presents our OpenFace face recognition library that bridges this accuracy gap. We show that OpenFace provides near-human accuracy on the LFW benchmark and present a new classification benchmark for mobile scenarios. This paper is intended for non-experts interested in using OpenFace and provides a light introduction to the deep neural network techniques we use. We released OpenFace in October 2015 as an open source library under the Apache 2.0 license. It is available at http://cmusatyalab.github.io/openface/. This research was supported by the National Science Foundation (NSF) under grant number CNS-1518865. Additional support was provided by Crown Castle, the Conklin Kistler family fund, Google, the Intel Corporation, and Vodafone. NVIDIA's academic hardware grant provided the Tesla K40 GPU used in all of our experiments. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and should not be attributed to their employers or funding sources.

827 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: A Deep Relative Distance Learning (DRDL) method is proposed which exploits a two-branch deep convolutional network to project raw vehicle images into a Euclidean space where distance can be used directly to measure the similarity of any two vehicles.
Abstract: The explosive growth in the use of surveillance cameras for public security highlights the importance of vehicle search in large-scale image or video databases. However, compared with person re-identification or face recognition, the vehicle search problem has long been neglected by researchers in the vision community. This paper focuses on an interesting but challenging problem, vehicle re-identification (a.k.a. precise vehicle search). We propose a Deep Relative Distance Learning (DRDL) method which exploits a two-branch deep convolutional network to project raw vehicle images into a Euclidean space where distance can be used directly to measure the similarity of any two vehicles. To further facilitate future research on this problem, we also present a carefully organized large-scale image database, "VehicleID", which includes multiple images of the same vehicle captured by different real-world cameras in a city. We evaluate our DRDL method on our VehicleID dataset and another recently released vehicle model classification dataset, "CompCars", in three sets of experiments: vehicle re-identification, vehicle model verification and vehicle retrieval. Experimental results show that our method achieves promising results and outperforms several state-of-the-art approaches.

689 citations
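
To make the relative-distance idea above concrete, here is a hedged PyTorch sketch: a toy embedding network trained with a plain triplet margin loss, standing in for the paper's two-branch network and coupled-clusters loss. All layer sizes and names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: learn an embedding where Euclidean distance reflects vehicle identity.
import torch
import torch.nn as nn

embed = nn.Sequential(                       # toy stand-in for the deep convolutional branch
    nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 128))
triplet = nn.TripletMarginLoss(margin=1.0)   # pulls same-vehicle pairs closer than different ones

anchor, positive, negative = (torch.randn(8, 3, 64, 64) for _ in range(3))
loss = triplet(embed(anchor), embed(positive), embed(negative))
loss.backward()

# At query time, the similarity of two vehicle images is simply the distance between
# their embeddings: torch.dist(embed(img_a), embed(img_b)).
```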


Journal ArticleDOI
TL;DR: This work proposes a Multi-view Discriminant Analysis (MvDA) approach, which seeks a single discriminant common space for multiple views in a non-pairwise manner by jointly learning multiple view-specific linear transforms.
Abstract: In many computer vision systems, the same object can be observed at varying viewpoints or even by different sensors, which brings in the challenging demand for recognizing objects from distinct, even heterogeneous, views. In this work we propose a Multi-view Discriminant Analysis (MvDA) approach, which seeks a single discriminant common space for multiple views in a non-pairwise manner by jointly learning multiple view-specific linear transforms. Specifically, our MvDA is formulated to jointly solve the multiple linear transforms by optimizing a generalized Rayleigh quotient, i.e., maximizing the between-class variations and minimizing the within-class variations, both intra-view and inter-view, in the common space. By reformulating this problem as a ratio trace problem, the multiple linear transforms are obtained analytically and simultaneously through generalized eigenvalue decomposition. Furthermore, inspired by the observation that different views share similar data structures, a constraint is introduced to enforce the view-consistency of the multiple linear transforms. The proposed method is evaluated on three tasks: face recognition across pose, photo versus sketch face recognition, and visible light versus near-infrared face recognition, on the Multi-PIE, CUFSF and HFB databases respectively. Extensive experiments show that our MvDA achieves significant improvements compared with the best known results.

610 citations
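
The ratio-trace reformulation mentioned above reduces to a generalized eigenvalue problem of the form S_b w = lambda * S_w w. Below is a minimal SciPy sketch of that final step using random stand-in scatter matrices; building the actual MvDA between-class and within-class scatters jointly over all views is omitted, so this shows only the generic solution technique.

```python
# Solve max_W tr((W^T S_w W)^{-1} (W^T S_b W)) via generalized eigendecomposition (sketch).
import numpy as np
from scipy.linalg import eigh

d, k = 50, 10                                                # feature dim, common-space dim
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d)); S_b = A @ A.T               # stand-in between-class scatter
B = rng.standard_normal((d, d)); S_w = B @ B.T + np.eye(d)   # stand-in within-class scatter (regularized)

eigvals, eigvecs = eigh(S_b, S_w)                            # solves S_b w = eigval * S_w w
W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]                # top-k eigenvectors span the common space
```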


Posted Content
Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, Jianfeng Gao
TL;DR: A benchmark task to recognize one million celebrities from their face images, using all face images of each individual that can be collected from the web as training data, which could lead to one of the largest classification problems in computer vision.
Abstract: In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, using all face images of each individual that can be collected from the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation and improve the recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide a concrete measurement set, an evaluation protocol, and training data. We also present our experimental setup in detail and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.

568 citations


Book ChapterDOI
01 Jan 2016
TL;DR: A review of the contributions to LFW for which the authors have provided results to the curators, together with the cross-cutting topic of alignment and how it is used in various methods.
Abstract: In 2007, Labeled Faces in the Wild was released in an effort to spur research in face recognition, specifically for the problem of face verification with unconstrained images. Since that time, more than 50 papers have been published that improve upon this benchmark in some respect. A remarkably wide variety of innovative methods have been developed to overcome the challenges presented in this database. As performance on some aspects of the benchmark approaches 100% accuracy, it seems appropriate to review this progress, derive what general principles we can from these works, and identify key future challenges in face recognition. In this survey, we review the contributions to LFW for which the authors have provided results to the curators (results found on the LFW results web page). We also review the cross-cutting topic of alignment and how it is used in various methods. We end with a brief discussion of recent databases designed to challenge the next generation of face recognition algorithms.

464 citations


Journal ArticleDOI
TL;DR: An efficient face spoof detection system on an Android smartphone, based on the analysis of image distortion in spoof face images, is proposed, and an unconstrained smartphone spoof attack database containing more than 1000 subjects is built.
Abstract: With the wide deployment of face recognition systems in applications from deduplication to mobile device unlocking, security against face spoofing attacks requires increased attention; such attacks can be easily launched via printed photos, video replays, and 3D masks of a face. We address the problem of face spoof detection against print (photo) and replay (photo or video) attacks based on the analysis of image distortion (e.g., surface reflection, moiré pattern, color distortion, and shape deformation) in spoof face images (or video frames). The application domain of interest is smartphone unlock, given that a growing number of smartphones have face unlock and mobile payment capabilities. We build an unconstrained smartphone spoof attack database (MSU USSA) containing more than 1000 subjects. Both the print and replay attacks are captured using the front and rear cameras of a Nexus 5 smartphone. We analyze the image distortion of the print and replay attacks using different 1) intensity channels (R, G, B, and grayscale); 2) image regions (entire image, detected face, and facial component between nose and chin); and 3) feature descriptors. We develop an efficient face spoof detection system on an Android smartphone. Experimental results on the public-domain Idiap Replay-Attack, CASIA FASD, and MSU-MFSD databases, and the MSU USSA database show that the proposed approach is effective in face spoof detection for both cross-database and intra-database testing scenarios. User studies of our Android face spoof detection system involving 20 participants show that the proposed approach works very well in real application scenarios.

375 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that DCP outperforms state-of-the-art local descriptors for both face identification and face verification tasks, and that the best performance is achieved on the challenging LFW and FRGC 2.0 databases by deploying MDML-DCPs in a simple recognition scheme.
Abstract: To perform unconstrained face recognition robust to variations in illumination, pose and expression, this paper presents a new scheme to extract "Multi-Directional Multi-Level Dual-Cross Patterns" (MDML-DCPs) from face images. Specifically, the MDML-DCPs scheme exploits the first derivative of the Gaussian operator to reduce the impact of differences in illumination and then computes the DCP feature at both the holistic and component levels. DCP is a novel face image descriptor inspired by the unique textural structure of human faces. It is computationally efficient, only doubling the cost of computing local binary patterns, yet is extremely robust to pose and expression variations. MDML-DCPs comprehensively yet efficiently encodes the invariant characteristics of a face image from multiple levels into patterns that are highly discriminative of inter-personal differences but robust to intra-personal variations. Experimental results on the FERET, CAS-PEAL-R1, FRGC 2.0, and LFW databases indicate that DCP outperforms state-of-the-art local descriptors (e.g., LBP, LTP, LPQ, POEM, tLBP, and LGXP) for both face identification and face verification tasks. More impressively, the best performance is achieved on the challenging LFW and FRGC 2.0 databases by deploying MDML-DCPs in a simple recognition scheme.

344 citations


Book ChapterDOI
08 Oct 2016
TL;DR: In this paper, the authors propose a domain specific data augmentation method to enrich an existing dataset with important facial appearance variations by manipulating the faces it contains, which is also used when matching query images represented by standard convolutional neural networks.
Abstract: Face recognition capabilities have recently made extraordinary leaps. Though this progress is at least partially due to ballooning training set sizes (huge numbers of face images downloaded and labeled for identity), it is not clear if the formidable task of collecting so many images is truly necessary. We propose a far more accessible means of increasing training data sizes for face recognition systems: domain-specific data augmentation. We describe novel methods of enriching an existing dataset with important facial appearance variations by manipulating the faces it contains. This synthesis is also used when matching query images represented by standard convolutional neural networks. The effect of training and testing with synthesized images is tested on the LFW and IJB-A (verification and identification) benchmarks and Janus CS2. The performances obtained by our approach match state-of-the-art results reported by systems trained on millions of downloaded images.

Journal ArticleDOI
TL;DR: A robust optical flow method is applied to micro-expression video clips, and the MDMO feature, an ROI-based, normalized statistical feature that considers both local statistical motion information and its spatial location, is shown to achieve better performance than two state-of-the-art baseline features.
Abstract: Micro-expressions are brief facial movements characterized by short duration, involuntariness and low intensity. Recognition of spontaneous facial micro-expressions is a great challenge. In this paper, we propose a simple yet effective Main Directional Mean Optical-flow (MDMO) feature for micro-expression recognition. We apply a robust optical flow method to micro-expression video clips and partition the facial area into regions of interest (ROIs) based partially on action units. MDMO is an ROI-based, normalized statistical feature that considers both local statistical motion information and its spatial location. One significant characteristic of MDMO is that its feature dimension is small: the length of an MDMO feature vector is 36 × 2 = 72, where 36 is the number of ROIs. Furthermore, to reduce the influence of noise due to head movements, we propose an optical-flow-driven method to align all frames of a micro-expression video clip. Finally, an SVM classifier with the proposed MDMO feature is adopted for micro-expression recognition. Experimental results on three spontaneous micro-expression databases, namely SMIC, CASME and CASME II, show that MDMO achieves better performance than two state-of-the-art baseline features, i.e., LBP-TOP and HOOF.
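
A rough sketch of the per-ROI computation behind MDMO is shown below: the flow vectors inside one ROI are binned by direction, the dominant ("main directional") bin is kept, and its mean flow becomes a 2-D descriptor, so 36 ROIs give the 72-dimensional feature. Flow estimation, ROI definition, temporal averaging and normalization are assumed to be given; this is a simplified reading, not the authors' implementation.

```python
# Sketch: main-directional mean flow for a single ROI.
import numpy as np

def mdmo_roi(flow_u, flow_v, n_bins=8):
    """flow_u, flow_v: per-pixel optical-flow components inside one ROI (any shape)."""
    u, v = np.ravel(flow_u), np.ravel(flow_v)
    angles = np.arctan2(v, u) % (2 * np.pi)
    bins = (angles / (2 * np.pi) * n_bins).astype(int) % n_bins
    main = np.bincount(bins, minlength=n_bins).argmax()      # dominant direction bin
    mask = bins == main
    return np.array([u[mask].mean(), v[mask].mean()])        # 2-D descriptor for this ROI

# Concatenating mdmo_roi(...) over the 36 ROIs (averaged across a clip's frames and
# normalized) yields the 72-dimensional MDMO vector fed to the SVM classifier.
```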

Proceedings ArticleDOI
01 Jun 2016
TL;DR: This paper proposes a face alignment method for large-pose face images by combining the powerful cascaded CNN regressor method and a 3DMM, formulating face alignment as a 3DMM fitting problem in which the camera projection matrix and 3D shape parameters are estimated by a cascade of CNN-based regressors.
Abstract: Large-pose face alignment is a very challenging problem in computer vision, and it is a prerequisite for many important vision tasks, e.g., face recognition and 3D face reconstruction. Recently, there have been a few attempts to solve this problem, but more research is still needed to achieve highly accurate results. In this paper, we propose a face alignment method for large-pose face images by combining the powerful cascaded CNN regressor method and a 3DMM. We formulate face alignment as a 3DMM fitting problem, where the camera projection matrix and 3D shape parameters are estimated by a cascade of CNN-based regressors. The dense 3D shape allows us to design pose-invariant appearance features for effective CNN learning. Extensive experiments are conducted on the challenging databases (AFLW and AFW), with comparison to the state of the art.

Posted Content
TL;DR: A Neural Aggregation Network (NAN) for video face recognition is proposed, whose aggregation module consists of two attention blocks that adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them.
Abstract: This paper presents a Neural Aggregation Network (NAN) for video face recognition. The network takes a face video or a face image set of a person, with a variable number of face images, as its input, and produces a compact, fixed-dimension feature representation for recognition. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which maps each face image to a feature vector. The aggregation module consists of two attention blocks which adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them. Due to the attention mechanism, the aggregation is invariant to the image order. Our NAN is trained with a standard classification or verification loss without any extra supervision signal, and we found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces. Experiments on the IJB-A, YouTube Faces and Celebrity-1000 video face recognition benchmarks show that it consistently outperforms naive aggregation methods and achieves state-of-the-art accuracy.
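
Below is a minimal PyTorch sketch of the attention-based pooling idea described above: per-frame feature vectors are scored by a learned query, softmax-normalized, and combined, so the aggregate lies in the convex hull of the inputs and is invariant to frame order. The module name, the single attention block, and the feature dimension are assumptions for illustration, not the paper's full two-block network.

```python
# Sketch: order-invariant attention aggregation of per-frame face features.
import torch
import torch.nn as nn

class AttentionAggregate(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.q = nn.Parameter(torch.zeros(feat_dim))     # learned query; zeros start as plain averaging

    def forward(self, feats):                            # feats: (num_frames, feat_dim)
        weights = torch.softmax(feats @ self.q, dim=0)   # one non-negative weight per frame, summing to 1
        return weights @ feats                           # convex combination of the frame features

agg = AttentionAggregate()
video_feats = torch.randn(30, 128)                       # e.g. 30 frames of one face track
pooled = agg(video_feats)                                # single 128-d representation for recognition
```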

Proceedings ArticleDOI
01 Jun 2016
TL;DR: A method to push the frontiers of unconstrained face recognition in the wild by using multiple pose-specific models and rendered face images, called Pose-Aware Models (PAMs), which achieve remarkably better performance than commercial products and, surprisingly, also outperform methods that are specifically fine-tuned on the target dataset.
Abstract: We propose a method to push the frontiers of unconstrained face recognition in the wild, focusing on the problem of extreme pose variations. As opposed to current techniques, which either expect a single model to learn pose invariance through massive amounts of training data or normalize images to a single frontal pose, our method explicitly tackles pose variation by using multiple pose-specific models and rendered face images. We leverage deep Convolutional Neural Networks (CNNs) to learn discriminative representations we call Pose-Aware Models (PAMs) using 500K images from the CASIA WebFace dataset. We present a comparative evaluation on the new IARPA Janus Benchmark A (IJB-A) and PIPA datasets. On these datasets PAMs achieve remarkably better performance than commercial products and, surprisingly, also outperform methods that are specifically fine-tuned on the target dataset.

Journal ArticleDOI
TL;DR: This work proposes a hybrid convolutional network-Restricted Boltzmann Machine model for face verification in the wild that directly learns relational visual features, which indicate identity similarities, from the raw pixels of face pairs.
Abstract: This paper proposes a hybrid convolutional network (ConvNet)-Restricted Boltzmann Machine (RBM) model for face verification. A key contribution of this work is to learn high-level relational visual features with rich identity similarity information. The deep ConvNets in our model start by extracting local relational visual features from two face images in comparison, which are further processed through multiple layers to extract high-level and global relational features. To keep enough discriminative information, we use the last hidden layer neuron activations of the ConvNet as features for face verification instead of those of the output layer. To characterize face similarities from different aspects, we concatenate the features extracted from different face region pairs by different deep ConvNets. The resulting high-dimensional relational features are classified by an RBM for face verification. After pre-training each ConvNet and the RBM separately, the entire hybrid network is jointly optimized to further improve the accuracy. Various aspects of the ConvNet structures, relational features, and face verification classifiers are investigated. Our model achieves the state-of-the-art face verification performance on the challenging LFW dataset under both the unrestricted protocol and the setting when outside data is allowed to be used for training.

Journal ArticleDOI
TL;DR: The inherent difficulties in PIFR are discussed and a comprehensive review of established techniques are presented, that is, pose-robust feature extraction approaches, multiview subspace learning approaches, face synthesis approaches, and hybrid approaches.
Abstract: The capacity to recognize faces under varied poses is a fundamental human ability that presents a unique challenge for computer vision systems. Compared to frontal face recognition, which has been intensively studied and has gradually matured in the past few decades, Pose-Invariant Face Recognition (PIFR) remains a largely unsolved problem. However, PIFR is crucial to realizing the full potential of face recognition for real-world applications, since face recognition is intrinsically a passive biometric technology for recognizing uncooperative subjects. In this article, we discuss the inherent difficulties in PIFR and present a comprehensive review of established techniques. Existing PIFR methods can be grouped into four categories, that is, pose-robust feature extraction approaches, multiview subspace learning approaches, face synthesis approaches, and hybrid approaches. The motivations, strategies, pros/cons, and performance of representative approaches are described and compared. Moreover, promising directions for future research are discussed.

Journal ArticleDOI
Tong Zhang, Wenming Zheng, Zhen Cui, Yuan Zong, Jingwei Yan, Keyu Yan
TL;DR: A novel deep neural network (DNN)-driven feature learning method is proposed and applied to multi-view facial expression recognition (FER), and the experimental results show that the algorithm outperforms the state-of-the-art methods.
Abstract: In this paper, a novel deep neural network (DNN)-driven feature learning method is proposed and applied to multi-view facial expression recognition (FER). In this method, scale invariant feature transform (SIFT) features corresponding to a set of landmark points are first extracted from each facial image. Then, a feature matrix consisting of the extracted SIFT feature vectors is used as input data and sent to a well-designed DNN model for learning optimal discriminative features for expression classification. The proposed DNN model employs several layers to characterize the corresponding relationship between the SIFT feature vectors and their corresponding high-level semantic information. By training the DNN model, we are able to learn a set of optimal features that are well suited to classifying facial expressions across different facial views. To evaluate the effectiveness of the proposed method, two non-frontal facial expression databases, namely BU-3DFE and Multi-PIE, are used, and the experimental results show that our algorithm outperforms the state-of-the-art methods.

Book ChapterDOI
08 Oct 2016
TL;DR: This work presents a novel peak-piloted deep network (PPDN) that uses a sample with peak expression to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject.
Abstract: Objective functions for training of deep networks for face-related recognition tasks, such as facial expression recognition (FER), usually consider each sample independently. In this work, we present a novel peak-piloted deep network (PPDN) that uses a sample with peak expression (easy sample) to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject. The expression evolving process from non-peak expression to peak expression can thus be implicitly embedded in the network to achieve the invariance to expression intensities. A special-purpose back-propagation procedure, peak gradient suppression (PGS), is proposed for network training. It drives the intermediate-layer feature responses of non-peak expression samples towards those of the corresponding peak expression samples, while avoiding the inverse. This avoids degrading the recognition capability for samples of peak expression due to interference from their non-peak expression counterparts. Extensive comparisons on two popular FER datasets, Oulu-CASIA and CK+, demonstrate the superiority of the PPDN over state-of-the-art FER methods, as well as the advantages of both the network structure and the optimization strategy. Moreover, it is shown that PPDN is a general architecture, extensible to other tasks by proper definition of peak and non-peak samples. This is validated by experiments that show state-of-the-art performance on pose-invariant face recognition, using the Multi-PIE dataset.
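
The peak-piloted supervision with peak gradient suppression (PGS) described above can be sketched very compactly: the non-peak sample's intermediate features are pulled towards those of the peak sample of the same subject and expression, while the peak branch is detached so no gradient flows back through it. The function name and the loss weight are placeholders, assumed for illustration.

```python
# Sketch of peak-piloted supervision with peak gradient suppression.
import torch
import torch.nn.functional as F

def peak_piloted_loss(feat_nonpeak, feat_peak):
    # detach() implements PGS: only the non-peak branch receives gradients from this term
    return F.mse_loss(feat_nonpeak, feat_peak.detach())

# total_loss = cls_loss_nonpeak + cls_loss_peak + lambda_pp * peak_piloted_loss(f_np, f_p)
```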

Posted Content
TL;DR: In this article, a multi-task learning framework was proposed for simultaneous face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition using a single deep convolutional neural network.
Abstract: We present a multi-purpose algorithm for simultaneous face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition using a single deep convolutional neural network (CNN). The proposed method employs a multi-task learning framework that regularizes the shared parameters of the CNN and builds a synergy among different domains and tasks. Extensive experiments show that the network gains a better understanding of faces and achieves state-of-the-art results for most of these tasks.

Journal ArticleDOI
TL;DR: A sequential three-way decision method for cost-sensitive face recognition and a series of image granulation methods based on two-dimensional subspace projection methods, which simulate a sequential decision strategy from rough granule to precise granule.
Abstract: Many previous studies on face recognition attempted to seek a precise classifier achieving a low misclassification error, based on the assumption that all misclassification costs are the same. In many real-world scenarios, however, this assumption is not reasonable due to imbalanced misclassification costs and insufficient high-quality facial image information. To address this issue, we propose a sequential three-way decision method for cost-sensitive face recognition. The proposed method is based on a formal description of granular computing and develops a sequential strategy in the decision process. In each decision step, it seeks a decision that minimizes the misclassification cost rather than the misclassification error, and it incorporates the boundary decision into the decision set so that a delayed decision can be made when the available high-quality facial image information is insufficient for a precise decision. To describe the granular information of the facial image in the three-way decision steps, we develop a series of image granulation methods based on two-dimensional subspace projection methods, including 2DPCA, 2DLDA and 2DLPP. The sequential three-way decisions and granulation methods provide a plausible simulation of human decision making in face recognition, following a sequential decision strategy from rough granules to precise granules. Experiments were conducted on two popular facial image databases, which validate the effectiveness of the proposed methods.
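
An illustrative sketch of the sequential three-way decision rule described above is given below: accept or reject when the match probability is decisive, otherwise defer (boundary decision) and move to finer-granularity image information. The thresholds and the probability input are hypothetical examples; in the paper they would be derived from the misclassification costs.

```python
# Sketch: cost-sensitive three-way decision step (hypothetical thresholds).
def three_way_decision(p_same, alpha=0.85, beta=0.35):
    """p_same: posterior probability that the probe matches the claimed identity."""
    if p_same >= alpha:
        return "accept"     # positive region: acceptance has the lowest expected cost
    if p_same <= beta:
        return "reject"     # negative region: rejection has the lowest expected cost
    return "defer"          # boundary region: delay the decision, acquire a more precise granule

print(three_way_decision(0.6))   # -> "defer": proceed to the next, finer granulation level
```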

Proceedings ArticleDOI
27 Feb 2016
TL;DR: The Surrey Face Model is presented, a multi-resolution 3D Morphable Model that is made available to the public for non-commercial purposes and a lightweight open-source C++ library designed with simplicity and ease of integration as its foremost goals.
Abstract: 3D Morphable Face Models are a powerful tool in computer vision. They consist of a PCA model of face shape and colour information and allow a 3D face to be reconstructed from a single 2D image. 3D Morphable Face Models are used for 3D head pose estimation, face analysis, face recognition, and, more recently, facial landmark detection and tracking. However, they are not as widely used as 2D methods, since the process of building and using a 3D model is much more involved. In this paper, we present the Surrey Face Model, a multi-resolution 3D Morphable Model that we make available to the public for non-commercial purposes. The model contains different mesh resolution levels and landmark point annotations as well as metadata for texture remapping. Accompanying the model is a lightweight open-source C++ library designed with simplicity and ease of integration as its foremost goals. In addition to basic functionality, it contains pose estimation and face frontalisation algorithms. With the tools presented in this paper, we aim to close two gaps. First, by offering different model resolution levels and fast fitting functionality, we enable the use of a 3D Morphable Model in time-critical applications like tracking. Second, the software library makes it easy for the community to adopt the 3D Morphable Face Model in their research, and it offers a public place for collaboration.

Proceedings Article
12 Feb 2016
TL;DR: This work addresses model compression for face recognition, where the learned knowledge of a large teacher network or its ensemble is utilized as supervision to train a compact student network by leveraging the essential characteristics of the learned face representation.
Abstract: The recent advanced face recognition systems were built on large Deep Neural Networks (DNNs) or their ensembles, which have millions of parameters. However, the expensive computation of DNNs makes their deployment difficult on mobile and embedded devices. This work addresses model compression for face recognition, where the learned knowledge of a large teacher network or its ensemble is utilized as supervision to train a compact student network. Unlike previous works that represent the knowledge by softened label probabilities, which are difficult to fit, we represent the knowledge by using the neurons at the higher hidden layer, which preserve as much information as the label probabilities but are more compact. By leveraging the essential characteristics (domain knowledge) of the learned face representation, a neuron selection method is proposed to choose neurons that are most relevant to face recognition. Using the selected neurons as supervision to mimic the single networks of DeepID2+ and DeepID3, which are state-of-the-art face recognition systems, a compact student with a simple network structure achieves better verification accuracy on LFW than its teachers, respectively. When using an ensemble of DeepID2+ as teacher, a mimicked student is able to outperform it and achieves a 51.6× compression ratio and a 90× speed-up in inference, making this cumbersome model applicable on portable devices.
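
A minimal sketch of the mimicking objective described above is shown below: the student regresses the teacher's selected higher-hidden-layer activations rather than softened labels. The teacher and student feature tensors and the index tensor are placeholders; the neuron-selection step itself (choosing which teacher neurons to keep) is not shown.

```python
# Sketch: knowledge transfer by regressing selected teacher neurons (assumed names).
import torch
import torch.nn.functional as F

def mimic_loss(student_feats, teacher_feats, selected_idx):
    """Regress the student's features onto the teacher's selected hidden-layer neurons."""
    target = teacher_feats[:, selected_idx].detach()   # teacher is fixed during mimicking
    return F.mse_loss(student_feats, target)

# selected_idx would come from the paper's neuron-selection step (neurons most relevant to
# face recognition); the student's output dimension must match len(selected_idx).
```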

Book ChapterDOI
08 Oct 2016
TL;DR: In this paper, a mixed objective optimization network (MOON) is proposed to address the multi-label imbalance problem by introducing a loss function that mixes multiple task objectives with domain adaptive re-weighting of propagated loss.
Abstract: Attribute recognition, particularly facial, extracts many labels for each image. While some multi-task vision problems can be decomposed into separate tasks and stages, e.g., training independent models for each task, for a growing set of problems joint optimization across all tasks has been shown to improve performance. We show that for deep convolutional neural network (DCNN) facial attribute extraction, multi-task optimization is better. Unfortunately, it can be difficult to apply joint optimization to DCNNs when training data is imbalanced, and re-balancing multi-label data directly is structurally infeasible, since adding/removing data to balance one label will change the sampling of the other labels. This paper addresses the multi-label imbalance problem by introducing a novel mixed objective optimization network (MOON) with a loss function that mixes multiple task objectives with domain adaptive re-weighting of propagated loss. Experiments demonstrate that not only does MOON advance the state of the art in facial attribute recognition, but it also outperforms independently trained DCNNs using the same data. When using facial attributes for the LFW face recognition task, we show that our balanced (domain adapted) network outperforms the unbalanced trained network.
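
To make the mixed multi-label objective above concrete, here is a hedged PyTorch sketch of a per-attribute re-weighted binary cross-entropy loss. Deriving the weights from training-set positive rates is an assumption used for illustration; it stands in for, but is not identical to, the paper's domain-adaptive re-weighting scheme.

```python
# Sketch: imbalance-aware multi-label loss in the spirit of MOON.
import torch

def weighted_multilabel_loss(logits, targets, pos_rate):
    # logits, targets: (batch, num_attributes), targets are 0/1 floats
    # pos_rate: (num_attributes,) fraction of positive labels per attribute, in (0, 1)
    w_pos = 1.0 - pos_rate                   # up-weight rare positive labels
    w_neg = pos_rate                         # up-weight rare negative labels
    weights = targets * w_pos + (1 - targets) * w_neg
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, weight=weights, reduction="mean")
```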

Book ChapterDOI
14 Oct 2016
TL;DR: A robust representation integrating deep texture features and a face movement cue (eye blink) is proposed as a countermeasure against presentation attacks such as printed photos and replays.
Abstract: With the wide application of user authentication based on face recognition, face spoof attacks against face recognition systems are drawing increasing attention. While emerging approaches to face anti-spoofing have been reported in recent years, most of them are limited to non-realistic intra-database testing scenarios rather than cross-database testing scenarios. We propose a robust representation integrating deep texture features and a face movement cue, eye blink, as countermeasures against presentation attacks such as printed photos and replays. We learn deep texture features from both aligned facial images and whole frames, and use a frame-difference-based approach for eye-blink detection. A face video clip is classified as live if it is categorized as live using both cues. Cross-database testing on public-domain face databases shows that the proposed approach significantly outperforms the state of the art.

Proceedings ArticleDOI
26 May 2016
TL;DR: This work presents a novel approach to Facial Action Unit detection using a combination of Convolutional and Bi-directional Long Short-Term Memory Neural Networks (CNN-BLSTM), which jointly learns shape, appearance and dynamics in a deep learning manner.
Abstract: Spontaneous facial expression recognition under uncontrolled conditions is a hard task. It depends on multiple factors including shape, appearance and dynamics of the facial features, all of which are adversely affected by environmental noise and low intensity signals typical of such conditions. In this work, we present a novel approach to Facial Action Unit detection using a combination of Convolutional and Bi-directional Long Short-Term Memory Neural Networks (CNN-BLSTM), which jointly learns shape, appearance and dynamics in a deep learning manner. In addition, we introduce a novel way to encode shape features using binary image masks computed from the locations of facial landmarks. We show that the combination of dynamic CNN features and Bi-directional Long Short-Term Memory excels at modelling the temporal information. We thoroughly evaluate the contributions of each component in our system and show that it achieves state-of-the-art performance on the FERA-2015 Challenge dataset.

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work proposes a novel deep face recognition framework to learn age-invariant deep face features through a carefully designed CNN model, and is the first attempt to show the effectiveness of deep CNNs in advancing the state of the art in AIFR.
Abstract: While considerable progress has been made on face recognition, age-invariant face recognition (AIFR) still remains a major challenge in real-world applications of face recognition systems. The major difficulty of AIFR arises from the fact that facial appearance is subject to significant intra-personal changes caused by the aging process over time. In order to address this problem, we propose a novel deep face recognition framework to learn age-invariant deep face features through a carefully designed CNN model. To the best of our knowledge, this is the first attempt to show the effectiveness of deep CNNs in advancing the state of the art in AIFR. Extensive experiments are conducted on several public-domain face aging datasets (MORPH Album2, FGNET, and CACD-VS) to demonstrate the effectiveness of the proposed model over the state of the art. We also verify the excellent generalization of our new model on the famous LFW dataset.

Journal ArticleDOI
TL;DR: It is demonstrated that people vary in systematic ways, and that this variability is idiosyncratic: the dimensions of variability in one face do not generalize well to another. This framework provides an explanation for various effects in face recognition.

Journal ArticleDOI
TL;DR: A new rolling bearing fault diagnosis method based on short-time Fourier transform and a stacked sparse autoencoder is proposed; this method analyzes sound signals, and is compared with empirical mode decomposition, the Teager energy operator, and a stacked sparse autoencoder using vibration signals to verify the performance and effectiveness of the proposed method.
Abstract: The main challenge of fault diagnosis lies in finding good fault features. A deep learning network has the ability to automatically learn good characteristics from input data in an unsupervised fashion, and its unique layer-wise pretraining and fine-tuning using the backpropagation strategy can solve the difficulties of training deep multilayer networks. Stacked sparse autoencoders or other deep architectures have shown excellent performance in speech recognition, face recognition, text classification, image recognition, and other application domains. Thus far, however, there have been very few research studies on deep learning in fault diagnosis. In this paper, a new rolling bearing fault diagnosis method that is based on short-time Fourier transform and stacked sparse autoencoder is first proposed; this method analyzes sound signals. After spectrograms are obtained by short-time Fourier transform, stacked sparse autoencoder is employed to automatically extract the fault features, and softmax regression is adopted as the method for classifying the fault modes. The proposed method, when applied to sound signals that are obtained from a rolling bearing test rig, is compared with empirical mode decomposition, Teager energy operator, and stacked sparse autoencoder when using vibration signals to verify the performance and effectiveness of the proposed method.
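
A compact sketch of the front end described above follows: a short-time Fourier transform turns the bearing sound signal into a spectrogram, and an autoencoder learns features from the flattened magnitudes. The single layer with an L1 penalty is a stand-in for the paper's stacked sparse autoencoder (which uses layer-wise pretraining and a KL-divergence sparsity term), and the sampling rate, sizes and penalty weight are assumptions.

```python
# Sketch: STFT spectrogram + one sparse autoencoder layer for fault-feature learning.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import stft

fs = 12_000                                        # assumed sampling rate of the sound signal
signal = np.random.randn(fs)                       # stand-in for one second of recorded sound
_, _, Z = stft(signal, fs=fs, nperseg=256)         # spectrogram via short-time Fourier transform
x = torch.tensor(np.abs(Z).ravel(), dtype=torch.float32)

enc = nn.Sequential(nn.Linear(x.numel(), 100), nn.Sigmoid())
dec = nn.Linear(100, x.numel())
h = enc(x)
loss = nn.functional.mse_loss(dec(h), x) + 1e-3 * h.abs().mean()   # reconstruction + sparsity penalty
loss.backward()
# The learned codes h would then feed a softmax-regression classifier over the fault modes.
```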

Book ChapterDOI
08 Oct 2016
TL;DR: The proposed method iteratively and alternately applies two sets of cascaded regressors, one for updating 2D landmarks and the other for updating reconstructed pose-expression-normalized (PEN) 3D face shape, to simultaneously solve the two problems of face alignment and3D face reconstruction from an input 2D face image of arbitrary poses and expressions.
Abstract: We present an approach to simultaneously solve the two problems of face alignment and 3D face reconstruction from an input 2D face image of arbitrary pose and expression. The proposed method iteratively and alternately applies two sets of cascaded regressors, one for updating 2D landmarks and the other for updating the reconstructed pose-expression-normalized (PEN) 3D face shape. The 3D face shape and the landmarks are correlated via a 3D-to-2D mapping matrix. In each iteration, an adjustment to the landmarks is first estimated via a landmark regressor, and this landmark adjustment is also used to estimate a 3D face shape adjustment via a shape regressor. The 3D-to-2D mapping is then computed based on the adjusted 3D face shape and 2D landmarks, and it further refines the 2D landmarks. An effective algorithm is devised to learn these regressors from a training dataset of paired annotated 3D face shapes and 2D face images. Compared with existing methods, the proposed method can fully automatically generate PEN 3D face shapes in real time from a single 2D face image and locate both visible and invisible 2D landmarks. Extensive experiments show that the proposed method achieves state-of-the-art accuracy in both face alignment and 3D face reconstruction, and benefits face recognition owing to its reconstructed PEN 3D face shapes.