
Showing papers on "Facial recognition system published in 2015"


Proceedings ArticleDOI
07 Jun 2015
TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, and achieves state-of-the-art face recognition performance using only 128 bytes per face.
Abstract: Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

8,289 citations
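
As an aside for readers who want to see what "using the embeddings as feature vectors" means in practice, here is a minimal Python/NumPy sketch of verification and identification with 128-D embeddings of the kind FaceNet produces. The distance threshold and function names are illustrative placeholders, not values from the paper.

```python
import numpy as np

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 1.1) -> bool:
    """Same-person decision by thresholding the Euclidean distance between two
    L2-normalized 128-D face embeddings; the threshold must be tuned on a
    validation set, 1.1 is only a placeholder."""
    return np.linalg.norm(emb_a - emb_b) < threshold

def identify(query: np.ndarray, gallery: np.ndarray, labels: list) -> str:
    """Nearest-neighbour identification against a gallery of enrolled embeddings."""
    dists = np.linalg.norm(gallery - query, axis=1)
    return labels[int(np.argmin(dists))]
```

With embeddings as features, clustering likewise reduces to standard techniques (e.g. k-means over the embedding vectors).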


Proceedings ArticleDOI
01 Jan 2015
TL;DR: It is shown how a very large scale dataset can be assembled by a combination of automation and human in the loop, and the trade off between data purity and time is discussed.
Abstract: The goal of this paper is face recognition – from either a single photograph or from a set of faces tracked in a video. Recent progress in this area has been due to two factors: (i) end to end learning for the task using a convolutional neural network (CNN), and (ii) the availability of very large scale training datasets. We make two contributions: first, we show how a very large scale dataset (2.6M images, over 2.6K people) can be assembled by a combination of automation and human in the loop, and discuss the trade off between data purity and time; second, we traverse through the complexities of deep network training and face recognition to present methods and procedures to achieve comparable state of the art results on the standard LFW and YTF face benchmarks.

5,308 citations


Proceedings ArticleDOI
TL;DR: FaceNet as discussed by the authors uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches, and achieves state-of-the-art face recognition performance using only 128 bytes per face.
Abstract: Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face. On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets. We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

4,560 citations
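
The abstract above describes training with a triplet loss and online triplet mining. The PyTorch sketch below shows the standard form of both ideas; the batch construction, margin value, and fallback rule for batches without semi-hard negatives are simplifications rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull anchor-positive pairs together and push anchor-negative pairs apart
    by at least `margin` (0.2 is a placeholder) on L2-normalized embeddings."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared Euclidean distances
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

def mine_semi_hard_negatives(anchor, positive, candidates, margin=0.2):
    """Online mining sketch: for each anchor pick a negative that is farther away
    than the positive but still within the margin; fall back to the hardest
    (closest) negative when no semi-hard one exists in the batch."""
    d_pos = (anchor - positive).pow(2).sum(dim=1, keepdim=True)   # (B, 1)
    d_neg = torch.cdist(anchor, candidates).pow(2)                # (B, N)
    semi_hard = (d_neg > d_pos) & (d_neg < d_pos + margin)
    masked = torch.where(semi_hard, d_neg, torch.full_like(d_neg, float("inf")))
    idx = torch.where(semi_hard.any(dim=1), masked.argmin(dim=1), d_neg.argmin(dim=1))
    return candidates[idx]
```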


Journal ArticleDOI
TL;DR: Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state-of-the-art features either prefixed, highly hand-crafted, or carefully learned [by deep neural networks (DNNs)].
Abstract: In this paper, we propose a very simple deep learning network for image classification that is based on very basic data processing components: 1) cascaded principal component analysis (PCA); 2) binary hashing; and 3) blockwise histograms. In the proposed architecture, the PCA is employed to learn multistage filter banks. This is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus called the PCA network (PCANet) and can be extremely easily and efficiently designed and learned. For comparison and to provide a better understanding, we also introduce and study two simple variations of PCANet: 1) RandNet and 2) LDANet. They share the same topology as PCANet, but their cascaded filters are either randomly selected or learned from linear discriminant analysis. We have extensively tested these basic networks on many benchmark visual data sets for different tasks, including Labeled Faces in the Wild (LFW) for face verification; the MultiPIE, Extended Yale B, AR, Facial Recognition Technology (FERET) data sets for face recognition; and MNIST for hand-written digit recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state-of-the-art features either prefixed, highly hand-crafted, or carefully learned [by deep neural networks (DNNs)]. Even more surprisingly, the model sets new records for many classification tasks on the Extended Yale B, AR, and FERET data sets and on MNIST variations. Additional experiments on other public data sets also demonstrate the potential of PCANet to serve as a simple but highly competitive baseline for texture classification and object recognition.

1,034 citations
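
For readers unfamiliar with the PCANet pipeline, the following Python sketch walks through a single stage of it: PCA filters learned from mean-removed patches, convolution, binary hashing by sign, and blockwise histograms. The paper stacks two PCA stages and tunes patch, filter, and block sizes per dataset; the values here are placeholders.

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.decomposition import PCA
from sklearn.feature_extraction.image import extract_patches_2d

def learn_pca_filters(images, k=7, n_filters=8, patches_per_image=200, seed=0):
    """Stage-1 PCANet filters: principal components of mean-removed k*k patches."""
    rng = np.random.RandomState(seed)
    patches = np.vstack([
        extract_patches_2d(img, (k, k), max_patches=patches_per_image,
                           random_state=rng).reshape(-1, k * k)
        for img in images
    ])
    patches = patches - patches.mean(axis=1, keepdims=True)   # remove patch mean
    pca = PCA(n_components=n_filters).fit(patches)
    return pca.components_.reshape(n_filters, k, k)

def pcanet_feature(img, filters, block=16, stride=8):
    """Convolve, binarize by sign, pack the binary maps into an integer map,
    then pool blockwise histograms (single-stage sketch of the pipeline)."""
    maps = np.stack([convolve2d(img, f, mode='same') for f in filters])
    weights = 2 ** np.arange(len(filters))[:, None, None]
    code_map = ((maps > 0).astype(np.int64) * weights).sum(axis=0)  # values in [0, 2^L)
    n_bins = 2 ** len(filters)
    hists = []
    for i in range(0, code_map.shape[0] - block + 1, stride):
        for j in range(0, code_map.shape[1] - block + 1, stride):
            blk = code_map[i:i + block, j:j + block]
            hists.append(np.bincount(blk.ravel(), minlength=n_bins))
    return np.concatenate(hists)
```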


Posted Content
TL;DR: This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition, rebuilt from stacked convolution and inception layers proposed in VGG net and GoogLeNet to make them suitable to face recognition.
Abstract: The state-of-the-art of face recognition has been significantly advanced by the emergence of deep learning. Very deep neural networks recently achieved great success on general object recognition because of their superb learning capacity. This motivates us to investigate their effectiveness on face recognition. This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition. These two architectures are rebuilt from stacked convolution and inception layers proposed in VGG net and GoogLeNet to make them suitable to face recognition. Joint face identification-verification supervisory signals are added to both intermediate and final feature extraction layers during training. An ensemble of the proposed two architectures achieves 99.53% LFW face verification accuracy and 96.0% LFW rank-1 face identification accuracy, respectively. A further discussion of LFW face verification result is given in the end.

908 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: Baseline accuracies for both face detection and face recognition from commercial and open source algorithms demonstrate the challenge offered by this new unconstrained benchmark.
Abstract: Rapid progress in unconstrained face recognition has resulted in a saturation in recognition accuracy for current benchmark datasets. While important for early progress, a chief limitation in most benchmark datasets is the use of a commodity face detector to select face imagery. The implication of this strategy is restricted variations in face pose and other confounding factors. This paper introduces the IARPA Janus Benchmark A (IJB-A), a publicly available media in the wild dataset containing 500 subjects with manually localized face images. Key features of the IJB-A dataset are: (i) full pose variation, (ii) joint use for face recognition and face detection benchmarking, (iii) a mix of images and videos, (iv) wider geographic variation of subjects, (v) protocols supporting both open-set identification (1∶N search) and verification (1∶1 comparison), (vi) an optional protocol that allows modeling of gallery subjects, and (vii) ground truth eye and nose locations. The dataset has been developed using 1,501,267 crowd-sourced annotations. Baseline accuracies for both face detection and face recognition from commercial and open source algorithms demonstrate the challenge offered by this new unconstrained benchmark.

738 citations


Journal ArticleDOI
TL;DR: An efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA) that outperforms the state-of-the-art methods in spoof detection and highlights the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.
Abstract: Automatic face recognition is now widely used in applications ranging from deduplication of identity to authentication of mobile payment. This popularity of face recognition has raised concerns about face spoof attacks (also known as biometric sensor presentation attacks), where a photo or video of an authorized person’s face could be used to gain access to facilities or services. While a number of face spoof detection techniques have been proposed, their generalization ability has not been adequately addressed. We propose an efficient and rather robust face spoof detection algorithm based on image distortion analysis (IDA). Four different features (specular reflection, blurriness, chromatic moment, and color diversity) are extracted to form the IDA feature vector. An ensemble classifier, consisting of multiple SVM classifiers trained for different face spoof attacks (e.g., printed photo and replayed video), is used to distinguish between genuine (live) and spoof faces. The proposed approach is extended to multiframe face spoof detection in videos using a voting-based scheme. We also collect a face spoof database, MSU mobile face spoofing database (MSU MFSD), using two mobile devices (Google Nexus 5 and MacBook Air) with three types of spoof attacks (printed photo, replayed video with iPhone 5S, and replayed video with iPad Air). Experimental results on two public-domain face spoof databases (Idiap REPLAY-ATTACK and CASIA FASD), and the MSU MFSD database show that the proposed approach outperforms the state-of-the-art methods in spoof detection. Our results also highlight the difficulty in separating genuine and spoof faces, especially in cross-database and cross-device scenarios.

716 citations
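
A rough sketch of the overall recipe (image-distortion-style features, an ensemble of SVMs, and frame-level voting) is given below in Python with OpenCV and scikit-learn. The four feature proxies are deliberately simplified stand-ins for the paper's specular-reflection, blurriness, chromatic-moment, and color-diversity descriptors, and all thresholds are placeholders.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def ida_like_features(face_bgr):
    """Illustrative stand-ins for the four IDA cues; the paper's exact feature
    definitions are considerably more involved than these simple proxies."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    blurriness = cv2.Laplacian(gray, cv2.CV_64F).var()        # low variance -> blurry
    specular = float((gray > 240).mean())                      # fraction of near-saturated pixels
    chroma_moments = np.concatenate([hsv.mean(axis=(0, 1)), hsv.std(axis=(0, 1))])
    color_diversity = len(np.unique((face_bgr // 32).reshape(-1, 3), axis=0))
    return np.hstack([blurriness, specular, chroma_moments, color_diversity])

def train_ensemble(live_feats, spoof_feats_by_type):
    """One live-vs-spoof SVM per attack type (e.g. print, replay)."""
    models = []
    for spoof_feats in spoof_feats_by_type:
        X = np.vstack([live_feats, spoof_feats])
        y = np.hstack([np.zeros(len(live_feats)), np.ones(len(spoof_feats))])
        models.append(SVC(kernel='rbf', probability=True).fit(X, y))
    return models

def is_spoof_video(frames, models, frame_threshold=0.5):
    """Multi-frame extension: majority vote of per-frame ensemble decisions."""
    votes = []
    for frame in frames:
        f = ida_like_features(frame).reshape(1, -1)
        scores = [m.predict_proba(f)[0, 1] for m in models]
        votes.append(max(scores) > frame_threshold)   # flag if any spoof model fires
    return np.mean(votes) > 0.5
```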


Proceedings ArticleDOI
Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, Junmo Kim
07 Dec 2015
TL;DR: A deep learning technique, which is regarded as a tool to automatically extract useful features from raw data, is adopted and is combined using a new integration method in order to boost the performance of the facial expression recognition.
Abstract: Temporal information has useful features for recognizing facial expressions. However, to manually design useful features requires a lot of effort. In this paper, to reduce this effort, a deep learning technique, which is regarded as a tool to automatically extract useful features from raw data, is adopted. Our deep network is based on two different models. The first deep network extracts temporal appearance features from image sequences, while the other deep network extracts temporal geometry features from temporal facial landmark points. These two models are combined using a new integration method in order to boost the performance of the facial expression recognition. Through several experiments, we show that the two models cooperate with each other. As a result, we achieve superior performance to other state-of-the-art methods in the CK+ and Oulu-CASIA databases. Furthermore, we show that our new integration method gives more accurate results than traditional methods, such as a weighted summation and a feature concatenation method.

668 citations


Journal ArticleDOI
TL;DR: This paper provides a comprehensive analysis of facial representations by uncovering their advantages and limitations, and elaborate on the type of information they encode and how they deal with the key challenges of illumination variations, registration errors, head-pose variations, occlusions, and identity bias.
Abstract: Automatic affect analysis has attracted great interest in various contexts including the recognition of action units and basic or non-basic emotions. In spite of major efforts, there are several open questions on what the important cues to interpret facial expressions are and how to encode them. In this paper, we review the progress across a range of affect recognition applications to shed light on these fundamental questions. We analyse the state-of-the-art solutions by decomposing their pipelines into fundamental components, namely face registration, representation, dimensionality reduction and recognition. We discuss the role of these components and highlight the models and new trends that are followed in their design. Moreover, we provide a comprehensive analysis of facial representations by uncovering their advantages and limitations; we elaborate on the type of information they encode and discuss how they deal with the key challenges of illumination variations, registration errors, head-pose variations, occlusions, and identity bias. This survey allows us to identify open issues and to define future directions for designing real-world affect recognition systems.

601 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: In this paper, a Pose-based Convolutional Neural Network descriptor (P-CNN) is proposed for action recognition in videos, which aggregates motion and appearance information along tracks of human body parts.
Abstract: This work targets human action recognition in video. While recent methods typically represent actions by statistics of local video features, here we argue for the importance of a representation derived from human pose. To this end we propose a new Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition. The descriptor aggregates motion and appearance information along tracks of human body parts. We investigate different schemes of temporal aggregation and experiment with P-CNN features obtained both for automatically estimated and manually annotated human poses. We evaluate our method on the recent and challenging JHMDB and MPII Cooking datasets. For both datasets our method shows consistent improvement over the state of the art.

596 citations
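
The core of the P-CNN descriptor is the temporal aggregation of per-part, per-frame CNN features. The NumPy sketch below shows max/min aggregation of static features and of their temporal differences, roughly in the spirit of the paper; part tracking and the CNN feature extraction itself are assumed to have been done upstream, and the dictionary layout is an assumption for illustration.

```python
import numpy as np

def pcnn_aggregate(part_features):
    """Aggregate per-frame part descriptors over time: max and min pooling of the
    static features and of their temporal differences, concatenated per part.

    part_features: dict mapping part name -> array of shape (T, D)
    """
    video_desc = []
    for name, feats in part_features.items():
        diffs = np.diff(feats, axis=0) if len(feats) > 1 else np.zeros_like(feats)
        video_desc.extend([
            feats.max(axis=0), feats.min(axis=0),   # static aggregation
            diffs.max(axis=0), diffs.min(axis=0),   # dynamic aggregation
        ])
    return np.concatenate(video_desc)
```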


Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work explores the simpler approach of using a single, unmodified, 3D surface as an approximation to the shape of all input faces, and shows that this leads to a straightforward, efficient and easy to implement method for frontalization.
Abstract: “Frontalization” is the process of synthesizing frontal facing views of faces appearing in single unconstrained photos. Recent reports have suggested that this process may substantially boost the performance of face recognition systems. It does so by transforming the challenging problem of recognizing faces viewed from unconstrained viewpoints into the easier problem of recognizing faces in constrained, forward-facing poses. Previous frontalization methods did this by attempting to approximate 3D facial shapes for each query image. We observe that 3D face shape estimation from unconstrained photos may be a harder problem than frontalization and can potentially introduce facial misalignments. Instead, we explore the simpler approach of using a single, unmodified, 3D surface as an approximation to the shape of all input faces. We show that this leads to a straightforward, efficient and easy to implement method for frontalization. More importantly, it produces aesthetic new frontal views and is surprisingly effective when used for face recognition and gender estimation.
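
To make the single-reference-surface idea concrete, here is a hedged Python/OpenCV sketch: camera pose is estimated from 2D-3D landmark correspondences with solvePnP, the dense reference surface is projected into the query photo, and the sampled colors are written back at the surface's frontal-view coordinates. Soft-symmetry handling, occlusion reasoning, and interpolation are omitted, and all reference-model inputs are assumed to be precomputed.

```python
import cv2
import numpy as np

def frontalize(query_img, query_landmarks_2d, ref_landmarks_3d,
               ref_surface_3d, ref_surface_uv, out_shape=(320, 320)):
    """Hard-frontalization sketch with one fixed 3D reference surface.

    query_landmarks_2d : (K, 2) detected facial landmarks in the query photo
    ref_landmarks_3d   : (K, 3) the same landmarks on the generic 3D model
    ref_surface_3d     : (N, 3) dense points of the generic model surface
    ref_surface_uv     : (N, 2) where each surface point falls in the frontal view
    All reference inputs are assumed to come from a precomputed generic model.
    """
    h, w = query_img.shape[:2]
    K = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)  # rough intrinsics
    ok, rvec, tvec = cv2.solvePnP(ref_landmarks_3d.astype(np.float64),
                                  query_landmarks_2d.astype(np.float64), K, None)
    proj, _ = cv2.projectPoints(ref_surface_3d.astype(np.float64), rvec, tvec, K, None)
    proj = proj.reshape(-1, 2)

    frontal = np.zeros(out_shape + query_img.shape[2:], dtype=query_img.dtype)
    uv = np.round(ref_surface_uv).astype(int)
    xy = np.round(proj).astype(int)
    valid = (xy[:, 0] >= 0) & (xy[:, 0] < w) & (xy[:, 1] >= 0) & (xy[:, 1] < h)
    frontal[uv[valid, 1], uv[valid, 0]] = query_img[xy[valid, 1], xy[valid, 0]]
    return frontal
```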

Proceedings ArticleDOI
Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, Stan Z. Li
07 Jun 2015
TL;DR: A High-fidelity Pose and Expression Normalization (HPEN) method with 3D Morphable Model (3DMM) which can automatically generate a natural face image in frontal pose and neutral expression and an inpainting method based on Poisson editing to fill the invisible region caused by self-occlusion is proposed.
Abstract: Pose and expression normalization is a crucial step to recover the canonical view of faces under arbitrary conditions, so as to improve the face recognition performance. An ideal normalization method is desired to be automatic, database independent and high-fidelity, where the face appearance should be preserved with little artifact and information loss. However, most normalization methods fail to satisfy one or more of the goals. In this paper, we propose a High-fidelity Pose and Expression Normalization (HPEN) method with 3D Morphable Model (3DMM) which can automatically generate a natural face image in frontal pose and neutral expression. Specifically, we firstly make a landmark marching assumption to describe the non-correspondence between 2D and 3D landmarks caused by pose variations and propose a pose adaptive 3DMM fitting algorithm. Secondly, we mesh the whole image into a 3D object and eliminate the pose and expression variations using an identity preserving 3D transformation. Finally, we propose an inpainting method based on Poisson editing to fill the invisible region caused by self-occlusion. Extensive experiments on Multi-PIE and LFW demonstrate that the proposed method significantly improves face recognition performance and outperforms state-of-the-art methods in both constrained and unconstrained environments.

Journal ArticleDOI
TL;DR: An automated learning-free facial landmark detection technique has been proposed, which achieves performance similar to that of other state-of-the-art landmark detection methods, yet requires significantly less execution time.
Abstract: Extraction of discriminative features from salient facial patches plays a vital role in effective facial expression recognition. The accurate detection of facial landmarks improves the localization of the salient patches on face images. This paper proposes a novel framework for expression recognition by using appearance features of selected facial patches. A few prominent facial patches, depending on the position of facial landmarks, are extracted which are active during emotion elicitation. These active patches are further processed to obtain the salient patches which contain discriminative features for classification of each pair of expressions, thereby selecting different facial patches as salient for different pairs of expression classes. One-against-one classification method is adopted using these features. In addition, an automated learning-free facial landmark detection technique has been proposed, which achieves performance similar to that of other state-of-the-art landmark detection methods, yet requires significantly less execution time. The proposed method is found to perform consistently well at different resolutions, hence providing a solution for expression recognition in low resolution images. Experiments on CK+ and JAFFE facial expression databases show the effectiveness of the proposed system.

Journal ArticleDOI
TL;DR: A compact binary face descriptor (CBFD) feature learning method for face representation and recognition that reduces the modality gap of heterogeneous faces at the feature level to make the method applicable to heterogeneous face recognition.
Abstract: Binary feature descriptors such as local binary patterns (LBP) and its variations have been widely used in many face recognition systems due to their excellent robustness and strong discriminative power. However, most existing binary face descriptors are hand-crafted, which require strong prior knowledge to engineer them by hand. In this paper, we propose a compact binary face descriptor (CBFD) feature learning method for face representation and recognition. Given each face image, we first extract pixel difference vectors (PDVs) in local patches by computing the difference between each pixel and its neighboring pixels. Then, we learn a feature mapping to project these pixel difference vectors into low-dimensional binary vectors in an unsupervised manner, where 1) the variance of all binary codes in the training set is maximized, 2) the loss between the original real-valued codes and the learned binary codes is minimized, and 3) binary codes evenly distribute at each learned bin, so that the redundancy information in PDVs is removed and compact binary codes are obtained. Lastly, we cluster and pool these binary codes into a histogram feature as the final representation for each face image. Moreover, we propose a coupled CBFD (C-CBFD) method by reducing the modality gap of heterogeneous faces at the feature level to make our method applicable to heterogeneous face recognition. Extensive experimental results on five widely used face datasets show that our methods outperform state-of-the-art face descriptors.
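
The sketch below illustrates the flow of CBFD-style feature extraction in Python: pixel difference vectors, a learned projection followed by sign binarization, and a code histogram. PCA is used as a crude stand-in for the paper's learned hashing (it maximizes variance but omits the quantization-loss and bit-balance terms), and the neighbourhood radius and bit count are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def pixel_difference_vectors(img, radius=1):
    """PDVs: difference between each pixel and its (2r+1)^2 - 1 neighbours."""
    img = img.astype(np.float32)
    pdvs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            pdvs.append((shifted - img).ravel())
    return np.stack(pdvs, axis=1)          # (h*w, 8) for radius=1

def learn_binary_mapping(pdvs, n_bits=8):
    """Variance-maximizing projection as a stand-in for CBFD's learned hashing."""
    return PCA(n_components=n_bits).fit(pdvs)

def cbfd_like_histogram(img, pca, n_bits=8):
    """Binarize projected PDVs by sign and pool a histogram of integer codes."""
    codes = pca.transform(pixel_difference_vectors(img)) > 0     # (h*w, n_bits)
    ints = codes.dot(2 ** np.arange(n_bits))                     # integer code per pixel
    return np.bincount(ints.astype(int), minlength=2 ** n_bits)
```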

Proceedings ArticleDOI
09 Nov 2015
TL;DR: This work focuses its presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.
Abstract: Deep learning based approaches to facial analysis and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition and activity recognition. In the case of video, information often must be aggregated across a variable length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.
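
A minimal PyTorch sketch of the CNN-RNN idea follows: per-frame features from a small convolutional backbone are aggregated over the sequence by a GRU whose final hidden state feeds the expression classifier. The backbone, layer sizes, and seven-class output are placeholder assumptions, not the challenge entry's architecture.

```python
import torch
import torch.nn as nn

class CnnRnnEmotion(nn.Module):
    """Per-frame CNN features fed to an RNN that aggregates over the sequence."""
    def __init__(self, feat_dim=256, hidden=128, n_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(                      # tiny stand-in backbone
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, feat_dim), nn.ReLU(),
        )
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, clips):                          # clips: (B, T, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, last_hidden = self.rnn(feats)               # (1, B, hidden)
        return self.classifier(last_hidden.squeeze(0))

# logits = CnnRnnEmotion()(torch.randn(2, 16, 1, 48, 48))  # -> shape (2, 7)
```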

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The first benchmark for long-term facial landmark tracking, containing currently over 110 annotated videos, is presented, and the results of the competition on facial landmark localisation in static imagery are summarized.
Abstract: Detection and tracking of faces in image sequences is among the most well studied problems in the intersection of statistical machine learning and computer vision. Often, tracking and detection methodologies use a rigid representation to describe the facial region, hence they can neither capture nor exploit the non-rigid facial deformations, which are crucial for countless applications (e.g., facial expression analysis, facial motion capture, high-performance face recognition etc.). Usually, the non-rigid deformations are captured by locating and tracking the position of a set of fiducial facial landmarks (e.g., eyes, nose, mouth etc.). Recently, we witnessed a burst of research in automatic facial landmark localisation in static imagery. This is partly attributed to the availability of a large amount of annotated data, much of which has been provided by the first facial landmark localisation challenge (also known as the 300-W challenge). Even though well established benchmarks now exist for facial landmark localisation in static imagery, to the best of our knowledge, there is no established benchmark for assessing the performance of facial landmark tracking methodologies, containing an adequate number of annotated face videos. In conjunction with ICCV'2015 we run the first competition/challenge on facial landmark tracking in long-term videos. In this paper, we present the first benchmark for long-term facial landmark tracking, containing currently over 110 annotated videos, and we summarise the results of the competition.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a deep learning structure composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE).
Abstract: Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by SAE. All of the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, 98.43% verification rate is achieved on the LFW database. Benefitting from the complementary information contained in multimodal data, our small ensemble system achieves higher than 99.0% recognition rate on LFW using publicly available training set.
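
The compression step of this framework, a stacked auto-encoder over the concatenated multi-CNN features, can be sketched in PyTorch as below. The layer dimensions and activations are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class StackedAutoEncoder(nn.Module):
    """Three-layer SAE that compresses a concatenated multi-CNN feature vector."""
    def __init__(self, in_dim=8 * 512, dims=(2048, 1024, 512)):
        super().__init__()
        enc, prev = [], in_dim
        for d in dims:                                   # encoder: in_dim -> ... -> dims[-1]
            enc += [nn.Linear(prev, d), nn.Sigmoid()]
            prev = d
        self.encoder = nn.Sequential(*enc)
        dec, prev = [], dims[-1]
        for d in list(dims[-2::-1]) + [in_dim]:          # mirror-image decoder
            dec += [nn.Linear(prev, d), nn.Sigmoid()]
            prev = d
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.encoder(x)              # compact face representation
        return self.decoder(z), z

# Concatenate features from the CNN ensemble, then compress:
# feats = torch.cat([cnn_a_out, cnn_b_out], dim=1); _, code = StackedAutoEncoder(feats.shape[1])(feats)
```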

Journal ArticleDOI
TL;DR: A comprehensive deep learning framework to jointly learn face representation using multimodal information is proposed and a small ensemble system achieves higher than 99.0% recognition rate on LFW using publicly available training set.
Abstract: Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by SAE. All the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, 98.43% verification rate is achieved on the LFW database. Benefiting from the complementary information contained in multimodal data, our small ensemble system achieves higher than 99.0% recognition rate on LFW using publicly available training set.

Proceedings ArticleDOI
09 Nov 2015
TL;DR: This work proposes novel transformations of image intensities to 3D spaces, designed to be invariant to monotonic photometric transformations, which are applied to CASIA Webface images and used to train an ensemble of multiple architecture CNNs on multiple representations.
Abstract: We present a novel method for classifying emotions from static facial images. Our approach leverages the recent success of Convolutional Neural Networks (CNN) on face recognition problems. Unlike the settings often assumed there, far less labeled data is typically available for training emotion classification systems. Our method is therefore designed with the goal of simplifying the problem domain by removing confounding factors from the input images, with an emphasis on image illumination variations. This is done in an effort to reduce the amount of data required to effectively train deep CNN models. To this end, we propose novel transformations of image intensities to 3D spaces, designed to be invariant to monotonic photometric transformations. These are applied to CASIA Webface images which are then used to train an ensemble of multiple architecture CNNs on multiple representations. Each model is then fine-tuned with limited emotion labeled training data to obtain final classification models. Our method was tested on the Emotion Recognition in the Wild Challenge (EmotiW 2015), Static Facial Expression Recognition sub-challenge (SFEW) and shown to provide a substantial, 15.36% improvement over baseline results (40% gain in performance).

Proceedings ArticleDOI
Junho Yim, Heechul Jung, ByungIn Yoo, Changkyu Choi, Du-sik Park, Junmo Kim
07 Jun 2015
TL;DR: A new deep architecture based on a novel type of multitask learning, which can achieve superior performance in rotating to a target-pose face image from an arbitrary pose and illumination image while preserving identity is proposed.
Abstract: Face recognition under viewpoint and illumination changes is a difficult problem, so many researchers have tried to solve this problem by producing the pose- and illumination- invariant feature. Zhu et al. [26] changed all arbitrary pose and illumination images to the frontal view image to use for the invariant feature. In this scheme, preserving identity while rotating pose image is a crucial issue. This paper proposes a new deep architecture based on a novel type of multitask learning, which can achieve superior performance in rotating to a target-pose face image from an arbitrary pose and illumination image while preserving identity. The target pose can be controlled by the user's intention. This novel type of multi-task model significantly improves identity preservation over the single task model. By using all the synthesized controlled pose images, called Controlled Pose Image (CPI), for the pose-illumination-invariant feature and voting among the multiple face recognition results, we clearly outperform the state-of-the-art algorithms by more than 4∼6% on the MultiPIE dataset.

Journal ArticleDOI
TL;DR: Experimental results show that although state-of-the-art methods can achieve competitive performance compared to average human performance, majority votes of several humans can achieve much higher performance on this task and the gap between machine and human would imply possible directions for further improvement of cross-age face recognition in the future.
Abstract: This paper introduces a method for face recognition across age and also a dataset containing variations of age in the wild. We use a data-driven method to address the cross-age face recognition problem, called cross-age reference coding (CARC). By leveraging a large-scale image dataset freely available on the Internet as a reference set, CARC can encode the low-level feature of a face image with an age-invariant reference space. In the retrieval phase, our method only requires a linear projection to encode the feature and thus it is highly scalable. To evaluate our method, we introduce a large-scale dataset called cross-age celebrity dataset (CACD). The dataset contains more than 160,000 images of 2,000 celebrities with age ranging from 16 to 62. Experimental results show that our method can achieve state-of-the-art performance on both CACD and the other widely used dataset for face recognition across age. To understand the difficulties of face recognition across age, we further construct a verification subset from the CACD called CACD-VS and conduct human evaluation using Amazon Mechanical Turk. CACD-VS contains 2,000 positive pairs and 2,000 negative pairs and is carefully annotated by checking both the associated image and web contents. Our experiments show that although state-of-the-art methods can achieve competitive performance compared to average human performance, majority votes of several humans can achieve much higher performance on this task. The gap between machine and human would imply possible directions for further improvement of cross-age face recognition in the future.

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A link between the representation norm and the ability to discriminate in a target domain is found, which sheds light on how deep convolutional networks represent faces.
Abstract: Scaling machine learning methods to very large datasets has attracted considerable attention in recent years, thanks to easy access to ubiquitous sensing and data from the web. We study face recognition and show that three distinct properties have surprising effects on the transferability of deep convolutional networks (CNN): (1) The bottleneck of the network serves as an important transfer learning regularizer, and (2) in contrast to the common wisdom, performance saturation may exist in CNN's (as the number of training samples grows); we propose a solution for alleviating this by replacing the naive random subsampling of the training set with a bootstrapping process. Moreover, (3) we find a link between the representation norm and the ability to discriminate in a target domain, which sheds light on how such networks represent faces. Based on these discoveries, we are able to improve face recognition accuracy on the widely used LFW benchmark, both in the verification (1∶1) and identification (1∶N) protocols, and directly compare, for the first time, with the state of the art Commercially-Off-The-Shelf system and show a sizable leap in performance.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: An extensive evaluation of CNN-based face recognition systems (CNN-FRS) and proposes three CNN architectures which are the first reported architectures trained using LFW data to make the work easily reproducible.
Abstract: Deep learning, in particular Convolutional Neural Network (CNN), has achieved promising results in face recognition recently. However, it remains an open question: why CNNs work well and how to design a 'good' architecture. The existing works tend to focus on reporting CNN architectures that work well for face recognition rather than investigate the reason. In this work, we conduct an extensive evaluation of CNN-based face recognition systems (CNN-FRS) on a common ground to make our work easily reproducible. Specifically, we use public database LFW (Labeled Faces in the Wild) to train CNNs, unlike most existing CNNs trained on private databases. We propose three CNN architectures which are the first reported architectures trained using LFW data. This paper quantitatively compares the architectures of CNNs and evaluates the effect of different implementation choices. We identify several useful properties of CNN-FRS. For instance, the dimensionality of the learned features can be significantly reduced without adverse effect on face recognition accuracy. In addition, a traditional metric learning method exploiting CNN-learned features is evaluated. Experiments show two crucial factors to good CNN-FRS performance are the fusion of multiple CNNs and metric learning. To make our work reproducible, source code and models will be made publicly available.

Posted Content
TL;DR: A comprehensive review of pose-invariant face recognition methods can be found in this paper, where pose-robust feature extraction approaches, multi-view subspace learning approaches, face synthesis approaches, and hybrid approaches are compared.
Abstract: The capacity to recognize faces under varied poses is a fundamental human ability that presents a unique challenge for computer vision systems. Compared to frontal face recognition, which has been intensively studied and has gradually matured in the past few decades, pose-invariant face recognition (PIFR) remains a largely unsolved problem. However, PIFR is crucial to realizing the full potential of face recognition for real-world applications, since face recognition is intrinsically a passive biometric technology for recognizing uncooperative subjects. In this paper, we discuss the inherent difficulties in PIFR and present a comprehensive review of established techniques. Existing PIFR methods can be grouped into four categories, i.e., pose-robust feature extraction approaches, multi-view subspace learning approaches, face synthesis approaches, and hybrid approaches. The motivations, strategies, pros/cons, and performance of representative approaches are described and compared. Moreover, promising directions for future research are discussed.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This work trains a zero-bias CNN on facial expression data and achieves, to the knowledge, state-of-the-art performance on two expression recognition benchmarks: the extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD).
Abstract: Despite being the appearance-based classifier of choice in recent years, relatively few works have examined how much convolutional neural networks (CNNs) can improve performance on accepted expression recognition benchmarks and, more importantly, examine what it is they actually learn. In this work, not only do we show that CNNs can achieve strong performance, but we also introduce an approach to decipher which portions of the face influence the CNN's predictions. First, we train a zero-bias CNN on facial expression data and achieve, to our knowledge, state-of-the-art performance on two expression recognition benchmarks: the extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD). We then qualitatively analyze the network by visualizing the spatial patterns that maximally excite different neurons in the convolutional layers and show how they resemble Facial Action Units (FAUs). Finally, we use the FAU labels provided in the CK+ dataset to verify that the FAUs observed in our filter visualizations indeed align with the subject's facial movements.

Proceedings ArticleDOI
04 May 2015
TL;DR: This work outlines the evaluation protocol, the data used, and the results of a baseline method for the fully automatic AU intensity estimation in automatic recognition of facial expressions.
Abstract: Despite efforts towards evaluation standards in facial expression analysis (e.g. FERA 2011), there is a need for up-to-date standardised evaluation procedures, focusing in particular on current challenges in the field. One of the challenges that is actively being addressed is the automatic estimation of expression intensities. To continue to provide a standardisation platform and to help the field progress beyond its current limitations, the FG 2015 Facial Expression Recognition and Analysis challenge (FERA 2015) will challenge participants to estimate FACS Action Unit (AU) intensity as well as AU occurrence on a common benchmark dataset with reliable manual annotations. Evaluation will be done using a clear and well-defined protocol. In this paper we present the second such challenge in automatic recognition of facial expressions, to be held in conjunction with the 11th IEEE conference on Face and Gesture Recognition, May 2015, in Ljubljana, Slovenia. Three sub-challenges are defined: the detection of AU occurrence, the estimation of AU intensity for pre-segmented data, and fully automatic AU intensity estimation. In this work we outline the evaluation protocol, the data used, and the results of a baseline method for the three sub-challenges.

Journal ArticleDOI
TL;DR: A novel face identification framework capable of handling the full range of pose variations within ±90° of yaw is proposed and consistently outperforms single-task-based baselines as well as state-of-the-art methods for the pose problem.
Abstract: Face images captured in unconstrained environments usually contain significant pose variation, which dramatically degrades the performance of algorithms designed to recognize frontal faces. This paper proposes a novel face identification framework capable of handling the full range of pose variations within ±90° of yaw. The proposed framework first transforms the original pose-invariant face recognition problem into a partial frontal face recognition problem. A robust patch-based face representation scheme is then developed to represent the synthesized partial frontal faces. For each patch, a transformation dictionary is learnt under the proposed multi-task learning scheme. The transformation dictionary transforms the features of different poses into a discriminative subspace. Finally, face matching is performed at patch level rather than at the holistic level. Extensive and systematic experimentation on FERET, CMU-PIE, and Multi-PIE databases shows that the proposed method consistently outperforms single-task-based baselines as well as state-of-the-art methods for the pose problem. We further extend the proposed algorithm for the unconstrained face verification problem and achieve top-level performance on the challenging LFW data set.

Posted Content
TL;DR: A two-stage approach that combines a multi-patch deep CNN and deep metric learning, which extracts low dimensional but very discriminative features for face verification and recognition is proposed, showing a clear path to practical high-performance face recognition systems in real world.
Abstract: Face Recognition has been studied for many decades. As opposed to traditional hand-crafted features such as LBP and HOG, much more sophisticated features can be learned automatically by deep learning methods in a data-driven way. In this paper, we propose a two-stage approach that combines a multi-patch deep CNN and deep metric learning, which extracts low dimensional but very discriminative features for face verification and recognition. Experiments show that this method outperforms other state-of-the-art methods on LFW dataset, achieving 99.77% pair-wise verification accuracy and significantly better accuracy under other two more practical protocols. This paper also discusses the importance of data size and the number of patches, showing a clear path to practical high-performance face recognition systems in real world.

Journal ArticleDOI
TL;DR: In this paper, the authors define a jet image using calorimeter towers as the elements of the image and establish jet-image preprocessing methods, and develop a discriminant for classifying the jet-images derived using Fisher discriminant analysis.
Abstract: We introduce a novel approach to jet tagging and classification through the use of techniques inspired by computer vision. Drawing parallels to the problem of facial recognition in images, we define a jet-image using calorimeter towers as the elements of the image and establish jet-image preprocessing methods. For the jet-image processing step, we develop a discriminant for classifying the jet-images derived using Fisher discriminant analysis. The effectiveness of the technique is shown within the context of identifying boosted hadronic W boson decays with respect to a background of quark- and gluon-initiated jets. Using Monte Carlo simulation, we demonstrate that the performance of this technique introduces additional discriminating power over other substructure approaches, and gives significant insight into the internal structure of jets.
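
Conceptually, the jet-image discriminant is a Fisher linear discriminant trained on flattened, preprocessed calorimeter-tower images. The scikit-learn sketch below captures that step; the preprocessing (translation/rotation alignment of the images) is assumed to be handled elsewhere, and the working-point threshold is a placeholder rather than the paper's choice.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_jet_discriminant(signal_imgs, background_imgs):
    """Fisher discriminant on flattened jet images (calorimeter-tower grids):
    signal = boosted W jets, background = quark/gluon jets."""
    X = np.vstack([signal_imgs.reshape(len(signal_imgs), -1),
                   background_imgs.reshape(len(background_imgs), -1)])
    y = np.hstack([np.ones(len(signal_imgs)), np.zeros(len(background_imgs))])
    return LinearDiscriminantAnalysis().fit(X, y)

def tag_jet(lda, jet_img, threshold=0.5):
    """Score a new jet image; the threshold is a working-point placeholder."""
    return lda.predict_proba(jet_img.reshape(1, -1))[0, 1] > threshold
```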

DOI
01 Jan 2015
TL;DR: A personal identification system using iris recognition is proposed, built from six major steps, i.e. image acquisition, localization, isolation, normalization, feature extraction and matching, each of which in turn consists of a number of minor steps.
Abstract: Iris recognition is an efficient method for identification of persons. This paper provides a technique for iris recognition. A biometric system provides automatic identification of individuals. The iris is a unique feature which is applicable for identification. Unlike face recognition or fingerprints, iris recognition draws on randomly distributed features. It is a most reliable and accurate method of identification. This paper proposes a personal identification system using iris recognition with the help of six major steps, i.e. image acquisition, localization, isolation, normalization, feature extraction and matching; each of these six steps in turn consists of a number of minor steps. The boundaries of the iris, i.e. the pupillary and limbic boundaries, are detected by using the Canny edge detector. A masking technique is used to isolate the iris image from the given eye image, and this isolated iris image is transformed from Cartesian to polar coordinates. Finally, we extract the unique features (feature vector) of the iris after enhancing the iris image and then perform the matching process on the iris code using the Hamming distance for acceptance and rejection. Keywords: Canny edge detection, Hamming distance, Hough transform, Masking, Histogram equalization, Haar wavelet.
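
The final matching step described above, comparing iris codes with the Hamming distance, can be sketched in Python as follows. The occlusion-mask handling and the 0.32 acceptance threshold are common conventions used here as placeholders, not values taken from this paper.

```python
import numpy as np

def hamming_distance(code_a, code_b, mask_a=None, mask_b=None):
    """Fractional Hamming distance between two binary iris codes, counting only
    bits that are valid (unoccluded) in both masks when masks are supplied."""
    code_a, code_b = np.asarray(code_a, bool), np.asarray(code_b, bool)
    if mask_a is None or mask_b is None:
        valid = np.ones_like(code_a)
    else:
        valid = np.asarray(mask_a, bool) & np.asarray(mask_b, bool)
    disagree = (code_a ^ code_b) & valid
    return disagree.sum() / max(valid.sum(), 1)

def match(code_query, enrolled_codes, threshold=0.32):
    """Accept if the best-matching enrolled code falls below the threshold."""
    dists = [hamming_distance(code_query, c) for c in enrolled_codes]
    best = int(np.argmin(dists))
    return (dists[best] < threshold), best
```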