
Showing papers on "Facial recognition system published in 2019"


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.

4,312 citations
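
The additive angular margin described above can be illustrated with a short sketch. Below is a minimal PyTorch-style head that adds a fixed angle to the target-class logit before softmax; the module name `ArcMarginHead` and the scale/margin values are illustrative choices, not the paper's released code.

```python
# Minimal sketch of an additive angular margin (ArcFace-style) classification head.
# Hypothetical module name; s (scale) and m (margin) follow common choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.50):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine similarity between L2-normalised features and class centres.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class angle.
        target_logit = torch.cos(theta + self.m)
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (one_hot * target_logit + (1 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)

# Usage: loss = ArcMarginHead(512, 10000)(embeddings, labels)
```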


Journal ArticleDOI
TL;DR: HyperFace as discussed by the authors combines face detection, landmarks localization, pose estimation and gender recognition using deep convolutional neural networks (CNNs) and achieves significant improvement in performance by fusing intermediate layers of a deep CNN using a separate CNN followed by a multi-task learning algorithm that operates on the fused features.
Abstract: We present an algorithm for simultaneous face detection, landmarks localization, pose estimation and gender recognition using deep convolutional neural networks (CNN). The proposed method called, HyperFace, fuses the intermediate layers of a deep CNN using a separate CNN followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks which boosts up their individual performances. Additionally, we propose two variants of HyperFace: (1) HyperFace-ResNet that builds on the ResNet-101 model and achieves significant improvement in performance, and (2) Fast-HyperFace that uses a high recall fast face detector for generating region proposals to improve the speed of the algorithm. Extensive experiments show that the proposed models are able to capture both global and local information in faces and performs significantly better than many competitive algorithms for each of these four tasks.

1,218 citations
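
The fusion idea in HyperFace can be sketched as follows: pool several intermediate feature maps to a common spatial size, fuse them with a small CNN, and attach one head per task. This is a simplified illustration under assumed layer sizes, not the paper's AlexNet/ResNet-101 configuration.

```python
# Simplified sketch of HyperFace-style multi-task fusion of intermediate CNN layers.
import torch
import torch.nn as nn

class MultiTaskFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(8)            # bring all maps to 8x8
        self.fuse = nn.Sequential(nn.Conv2d(32 + 64 + 128, 192, 1), nn.ReLU())
        feat = 192 * 8 * 8
        self.detect = nn.Linear(feat, 1)               # face / non-face score
        self.landmarks = nn.Linear(feat, 21 * 2)       # landmark coordinates
        self.pose = nn.Linear(feat, 3)                 # roll, pitch, yaw
        self.gender = nn.Linear(feat, 2)

    def forward(self, x):
        f1 = self.block1(x); f2 = self.block2(f1); f3 = self.block3(f2)
        fused = self.fuse(torch.cat([self.pool(f1), self.pool(f2), self.pool(f3)], dim=1))
        flat = fused.flatten(1)
        return self.detect(flat), self.landmarks(flat), self.pose(flat), self.gender(flat)
```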


Journal ArticleDOI
TL;DR: Visualization results demonstrate that, compared with the CNN without Gate Unit, ACNNs are capable of shifting the attention from the occluded patches to other related but unobstructed ones, and outperform other state-of-the-art methods on several widely used in-the-lab facial expression datasets under the cross-dataset evaluation protocol.
Abstract: Facial expression recognition in the wild is challenging due to various unconstrained conditions. Although existing facial expression classifiers have been almost perfect on analyzing constrained frontal faces, they fail to perform well on partially occluded faces that are common in the wild. In this paper, we propose a convolutional neural network (CNN) with attention mechanism (ACNN) that can perceive the occlusion regions of the face and focus on the most discriminative un-occluded regions. ACNN is an end-to-end learning framework. It combines the multiple representations from facial regions of interest (ROIs). Each representation is weighted via a proposed gate unit that computes an adaptive weight from the region itself according to the unobstructedness and importance. Considering different ROIs, we introduce two versions of ACNN: patch-based ACNN (pACNN) and global–local-based ACNN (gACNN). pACNN only pays attention to local facial patches. gACNN integrates local representations at patch-level with global representation at image-level. The proposed ACNNs are evaluated on both real and synthetic occlusions, including a self-collected facial expression dataset with real-world occlusions, the two largest in-the-wild facial expression datasets (RAF-DB and AffectNet) and their modifications with synthesized facial occlusions. Experimental results show that ACNNs improve the recognition accuracy on both the non-occluded faces and occluded faces. Visualization results demonstrate that, compared with the CNN without Gate Unit, ACNNs are capable of shifting the attention from the occluded patches to other related but unobstructed ones. ACNNs also outperform other state-of-the-art methods on several widely used in-the-lab facial expression datasets under the cross-dataset evaluation protocol.

536 citations
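
The gate unit described above can be sketched as a small scoring network that re-weights each patch representation before aggregation. Shapes and layer sizes below are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a patch gate unit: each patch feature is scored and re-weighted.
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                   nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, patch_feats):                 # (B, num_patches, dim)
        weights = self.score(patch_feats)           # (B, num_patches, 1)
        weighted = patch_feats * weights            # down-weight occluded patches
        return weighted.flatten(1), weights         # concatenated local representation

# gACNN would, conceptually, concatenate this local representation with a
# holistic image-level feature before classification.
```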


Journal ArticleDOI
TL;DR: A new deep locality-preserving convolutional neural network (DLP-CNN) method that aims to enhance the discriminative power of deep features by preserving the locality closeness while maximizing the inter-class scatter is proposed.
Abstract: Facial expression is central to human experience, but most previous databases and studies are limited to posed facial behavior under controlled conditions. In this paper, we present a novel facial expression database, Real-world Affective Face Database (RAF-DB), which contains approximately 30,000 facial images with uncontrolled poses and illumination from thousands of individuals of diverse ages and races. During the crowdsourcing annotation, each image is independently labeled by approximately 40 annotators. An expectation–maximization algorithm is developed to reliably estimate the emotion labels, which reveals that real-world faces often express compound or even mixture emotions. A cross-database study between RAF-DB and the CK+ database further indicates that the action units of real-world emotions are much more diverse than, or even deviate from, those of laboratory-controlled emotions. To address the recognition of multi-modal expressions in the wild, we propose a new deep locality-preserving convolutional neural network (DLP-CNN) method that aims to enhance the discriminative power of deep features by preserving the locality closeness while maximizing the inter-class scatter. Benchmark experiments on 7-class basic expressions and 11-class compound expressions, as well as additional experiments on the CK+, MMI, and SFEW 2.0 databases, show that the proposed DLP-CNN outperforms the state-of-the-art handcrafted features and deep learning-based methods for expression recognition in the wild. To promote further study, we have made the RAF database, benchmarks, and descriptor encodings publicly available to the research community.

429 citations


Journal ArticleDOI
TL;DR: The proposed two-layer RNN model provides an effective way to make use of both spatial and temporal dependencies of the input signals for emotion recognition, and experimental results demonstrate that the proposed STRNN method is more competitive than state-of-the-art methods.
Abstract: In this paper, we propose a novel deep learning framework, called spatial–temporal recurrent neural network (STRNN), to integrate the feature learning from both spatial and temporal information of signal sources into a unified spatial–temporal dependency model. In STRNN, to capture those spatially co-occurrent variations of human emotions, a multidirectional recurrent neural network (RNN) layer is employed to capture long-range contextual cues by traversing the spatial regions of each temporal slice along different directions. Then a bi-directional temporal RNN layer is further used to learn the discriminative features characterizing the temporal dependencies of the sequences, where sequences are produced from the spatial RNN layer. To further select those salient regions with more discriminative ability for emotion recognition, we impose sparse projection onto those hidden states of spatial and temporal domains to improve the model discriminant ability. Consequently, the proposed two-layer RNN model provides an effective way to make use of both spatial and temporal dependencies of the input signals for emotion recognition. Experimental results on the public emotion datasets of electroencephalogram and facial expression demonstrate the proposed STRNN method is more competitive over those state-of-the-art methods.

335 citations
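
A heavily simplified sketch of the two-layer idea follows: a spatial RNN scans the regions of each temporal slice, then a bidirectional temporal RNN models dependencies across slices. The single scan direction, GRU cells and sizes here are assumptions; the paper uses multidirectional spatial traversal and sparse projections, which are omitted.

```python
# Simplified spatial-temporal RNN sketch (single spatial direction, no sparse projection).
import torch
import torch.nn as nn

class SimpleSTRNN(nn.Module):
    def __init__(self, region_dim, hidden=128, num_classes=7):
        super().__init__()
        self.spatial = nn.GRU(region_dim, hidden, batch_first=True)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                             # x: (B, T, num_regions, region_dim)
        B, T, R, D = x.shape
        _, h = self.spatial(x.reshape(B * T, R, D))   # scan regions of each temporal slice
        slice_feats = h[-1].reshape(B, T, -1)         # one vector per slice
        out, _ = self.temporal(slice_feats)           # bidirectional temporal modelling
        return self.classifier(out.mean(dim=1))
```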


Journal ArticleDOI
TL;DR: Major deep learning concepts pertinent to face image analysis and face recognition are reviewed, and a concise overview of studies on specific face recognition problems is provided, such as handling variations in pose, age, illumination, expression, and heterogeneous face matching.

312 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: A center-based feature transfer framework is proposed to augment the feature space of under-represented subjects from regular subjects that have sufficiently diverse samples; the learned transfer presents smooth visual interpolation, disentangling the identity of a class from non-identity variations such as pose and lighting while augmenting its feature space.
Abstract: Despite the large volume of face recognition datasets, there is a significant portion of subjects whose samples are insufficient and thus under-represented. Ignoring such a significant portion results in insufficient training data. Training with under-represented data leads to biased classifiers in conventionally-trained deep networks. In this paper, we propose a center-based feature transfer framework to augment the feature space of under-represented subjects from the regular subjects that have sufficiently diverse samples. A Gaussian prior on the variance is assumed across all subjects, and the variance from regular subjects is transferred to the under-represented ones. This encourages the under-represented distribution to be closer to the regular distribution. Further, an alternating training regimen is proposed to simultaneously achieve less biased classifiers and a more discriminative feature representation. We conduct an ablation study that mimics under-represented datasets by varying the portion of under-represented classes on the MS-Celeb-1M dataset. Advantageous results on LFW, IJB-A and MS-Celeb-1M demonstrate the effectiveness of our feature transfer and training strategy, compared to both general baselines and state-of-the-art methods. Moreover, our feature transfer successfully presents smooth visual interpolation, disentangling the identity of a class from non-identity variations such as pose and lighting while augmenting its feature space.

263 citations
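
The core transfer step can be illustrated with a few lines of NumPy: intra-class variation measured around a regular class centre is added to the centre of an under-represented class to create virtual features. This shows only the idea; the paper additionally uses an encoder-decoder and alternating training, which are omitted here.

```python
# Minimal sketch of center-based feature transfer between classes.
import numpy as np

def transfer_features(regular_feats, ur_feats):
    """regular_feats: (N, D) features of a regular class,
       ur_feats: (M, D) features of an under-represented class."""
    c_regular = regular_feats.mean(axis=0)
    c_ur = ur_feats.mean(axis=0)
    variation = regular_feats - c_regular          # pose/lighting-like offsets
    return c_ur + variation                        # virtual samples for the UR class

rng = np.random.default_rng(0)
augmented = transfer_features(rng.normal(size=(50, 128)), rng.normal(size=(3, 128)))
print(augmented.shape)   # (50, 128)
```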


Journal ArticleDOI
TL;DR: This paper systematically reviews all components of such systems: pre-processing, feature extraction and machine coding of facial actions; the existing FACS-coded facial expression databases are also summarised.
Abstract: As one of the most comprehensive and objective ways to describe facial expressions, the Facial Action Coding System (FACS) has recently received significant attention. Over the past 30 years, extensive research has been conducted by psychologists and neuroscientists on various aspects of facial expression analysis using FACS. Automating FACS coding would make this research faster and more widely applicable, opening up new avenues to understanding how we communicate through facial expressions. Such an automated process can also potentially increase the reliability, precision and temporal resolution of coding. This paper provides a comprehensive survey of research into machine analysis of facial actions. We systematically review all components of such systems: pre-processing, feature extraction and machine coding of facial actions. In addition, the existing FACS-coded facial expression databases are summarised. Finally, challenges that have to be addressed to make automatic facial action analysis applicable in real-life situations are extensively discussed. There are two underlying motivations for us to write this survey paper: the first is to provide an up-to-date review of the existing literature, and the second is to offer some insights into the future of machine recognition of facial actions: what are the challenges and opportunities that researchers in the field face.

257 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper proposes Probabilistic Face Embeddings (PFEs), which represent each face image as a Gaussian distribution in the latent space, and derives probabilistic solutions for matching and fusing PFEs using the uncertainty information.
Abstract: Embedding methods have achieved success in face recognition by comparing facial features in a latent semantic space. However, in a fully unconstrained face setting, the facial features learned by the embedding model could be ambiguous or may not even be present in the input face, leading to noisy representations. We propose Probabilistic Face Embeddings (PFEs), which represent each face image as a Gaussian distribution in the latent space. The mean of the distribution estimates the most likely feature values while the variance shows the uncertainty in the feature values. Probabilistic solutions can then be naturally derived for matching and fusing PFEs using the uncertainty information. Empirical evaluation on different baseline models, training datasets and benchmarks show that the proposed method can improve the face recognition performance of deterministic embeddings by converting them into PFEs. The uncertainties estimated by PFEs also serve as good indicators of the potential matching accuracy, which are important for a risk-controlled recognition system.

246 citations
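
For two probabilistic embeddings with means and per-dimension variances, matching reduces to a mutual likelihood score; a sketch written from the paper's description is below (constants may differ from the reference implementation, and higher scores indicate a more likely genuine pair).

```python
# Sketch of a mutual likelihood score between two Gaussian face embeddings.
import numpy as np

def mutual_likelihood_score(mu1, sigma2_1, mu2, sigma2_2):
    s2 = sigma2_1 + sigma2_2
    return -0.5 * np.sum((mu1 - mu2) ** 2 / s2 + np.log(s2))

mu_a, var_a = np.zeros(128), np.full(128, 0.1)
mu_b, var_b = np.zeros(128) + 0.01, np.full(128, 0.2)
print(mutual_likelihood_score(mu_a, var_a, mu_b, var_b))
```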


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work proposes to learn a generalized feature space via a novel multi-adversarial discriminative deep domain generalization framework under a dual-force triplet-mining constraint, which ensures that the learned feature space is discriminative and shared by multiple source domains, and thus more generalized to new face presentation attacks.
Abstract: Face presentation attacks have become an increasingly critical issue in the face recognition community. Many face anti-spoofing methods have been proposed, but they cannot generalize well on "unseen" attacks. This work focuses on improving the generalization ability of face anti-spoofing methods from the perspective of the domain generalization. We propose to learn a generalized feature space via a novel multi-adversarial discriminative deep domain generalization framework. In this framework, a multi-adversarial deep domain generalization is performed under a dual-force triplet-mining constraint. This ensures that the learned feature space is discriminative and shared by multiple source domains, and thus is more generalized to new face presentation attacks. An auxiliary face depth supervision is incorporated to further enhance the generalization ability. Extensive experiments on four public datasets validate the effectiveness of the proposed method.

245 citations


Journal ArticleDOI
TL;DR: Wasserstein convolutional neural network (WCNN) as discussed by the authors was proposed to learn invariant features between near-infrared (NIR) and visual (VIS) face images, and the Wasserstein distance was introduced into the NIR-VIS shared layer to measure the dissimilarity between heterogeneous feature distributions.
Abstract: Heterogeneous face recognition (HFR) aims at matching facial images acquired from different sensing modalities with mission-critical applications in forensics, security and commercial sectors. However, HFR presents more challenging issues than traditional face recognition because of the large intra-class variation among heterogeneous face images and the limited availability of training samples of cross-modality face image pairs. This paper proposes the novel Wasserstein convolutional neural network (WCNN) approach for learning invariant features between near-infrared (NIR) and visual (VIS) face images (i.e., NIR-VIS face recognition). The low-level layers of the WCNN are trained with widely available face images in the VIS spectrum, and the high-level layer is divided into three parts: the NIR layer, the VIS layer and the NIR-VIS shared layer. The first two layers aim at learning modality-specific features, and the NIR-VIS shared layer is designed to learn a modality-invariant feature subspace. The Wasserstein distance is introduced into the NIR-VIS shared layer to measure the dissimilarity between heterogeneous feature distributions. W-CNN learning is performed to minimize the Wasserstein distance between the NIR distribution and the VIS distribution for invariant deep feature representations of heterogeneous face images. To avoid the over-fitting problem on small-scale heterogeneous face data, a correlation prior is introduced on the fully-connected WCNN layers to reduce the size of the parameter space. This prior is implemented by a low-rank constraint in an end-to-end network. The joint formulation leads to an alternating minimization for deep feature representation at the training stage and an efficient computation for heterogeneous data at the testing stage. Extensive experiments using three challenging NIR-VIS face recognition databases demonstrate the superiority of the WCNN method over state-of-the-art methods.
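
As an illustration of the quantity the shared layer is trained to minimize, the squared 2-Wasserstein distance between two diagonal Gaussian feature distributions has a simple closed form; the paper's actual approximation of the Wasserstein distance may differ, so the sketch below is indicative only.

```python
# Illustrative closed-form 2-Wasserstein distance between diagonal Gaussian
# approximations of NIR and VIS feature distributions.
import numpy as np

def wasserstein2_diag_gaussians(mu1, std1, mu2, std2):
    return np.sum((mu1 - mu2) ** 2) + np.sum((std1 - std2) ** 2)

nir_feats = np.random.randn(200, 256) * 0.8 + 0.1
vis_feats = np.random.randn(200, 256)
d = wasserstein2_diag_gaussians(nir_feats.mean(0), nir_feats.std(0),
                                vis_feats.mean(0), vis_feats.std(0))
print(d)
```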

Journal ArticleDOI
TL;DR: A comprehensive survey of various techniques explored for face detection in digital images is presented in this paper, where the practical aspects towards the development of a robust face detection system and several promising directions for future research are discussed.
Abstract: With the enormous growth of video and image databases, there is a pressing need for automatic understanding and analysis of this information by intelligent systems, as manual processing is becoming impractical. The face plays a major role in social interaction, conveying a person's identity and emotions. Human beings do not have the same capacity as machines to identify large numbers of different faces, so automatic face detection plays an important role in face recognition, facial expression recognition, head-pose estimation, human–computer interaction and related applications. Face detection is a computer technology that determines the location and size of a human face in a digital image, and it has been one of the most studied topics in the computer vision literature. This paper presents a comprehensive survey of various techniques explored for face detection in digital images. Different challenges and applications of face detection are also presented. At the end, standard databases for face detection are described along with their features. Furthermore, we organize special discussions on the practical aspects of developing a robust face detection system and conclude this paper with several promising directions for future research.

Journal ArticleDOI
TL;DR: In this article, an overview of methods to recognize emotions and to compare their applicability based on existing studies is given, which should enable practitioners, researchers and engineers to find a system most suitable for certain applications.

Journal ArticleDOI
16 May 2019
TL;DR: This paper proposes an approach that combines automatic features learned by convolutional neural networks (CNN) and handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve state-of-the-art results in facial expression recognition (FER).
Abstract: We present an approach that combines automatic features learned by convolutional neural networks (CNN) and handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve the state-of-the-art results in facial expression recognition (FER). To obtain automatic features, we experiment with multiple CNN architectures, pre-trained models, and training procedures, e.g., Dense–Sparse–Dense. After fusing the two types of features, we employ a local learning framework to predict the class label for each test image. The local learning framework is based on three steps. First, a k-nearest neighbors model is applied in order to select the nearest training samples for an input test image. Second, a one-versus-all support vector machines (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used to predict the class label only for the test image it was trained for. Although we have used local learning in combination with handcrafted features in our previous work, to the best of our knowledge, local learning has never been employed in combination with deep features. The experiments on the 2013 FER Challenge data set, the FER+ data set, and the AffectNet data set demonstrate that our approach achieves the state-of-the-art results. With a top accuracy of 75.42% on the FER 2013, 87.76% on the FER+, 59.58% on the AffectNet eight-way classification, and 63.31% on the AffectNet seven-way classification, we surpass the state-of-the-art methods by more than 1% on all data sets.
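
The local learning step described above can be sketched as follows, assuming fused feature vectors are already computed: for each test sample, select its k nearest training samples and train a one-versus-all SVM only on those neighbours. The value of k and the use of LinearSVC are assumptions for illustration.

```python
# Sketch of local learning: per-test-sample kNN selection followed by a local SVM.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def local_learning_predict(train_X, train_y, test_X, k=200):
    nn = NearestNeighbors(n_neighbors=k).fit(train_X)
    preds = []
    for x in test_X:
        _, idx = nn.kneighbors(x.reshape(1, -1))
        neigh_X, neigh_y = train_X[idx[0]], train_y[idx[0]]
        if len(np.unique(neigh_y)) == 1:          # all neighbours share one label
            preds.append(neigh_y[0])
            continue
        clf = LinearSVC().fit(neigh_X, neigh_y)   # one-vs-rest by default
        preds.append(clf.predict(x.reshape(1, -1))[0])
    return np.array(preds)
```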

Journal ArticleDOI
TL;DR: A conceptual categorization and metrics for an evaluation of such methods are presented, followed by a comprehensive survey of relevant publications, and technical considerations and tradeoffs of the surveyed methods are discussed.
Abstract: Recently, researchers found that the intended generalizability of (deep) face recognition systems increases their vulnerability against attacks. In particular, the attacks based on morphed face images pose a severe security risk to face recognition systems. In the last few years, the topic of (face) image morphing and automated morphing attack detection has sparked the interest of several research laboratories working in the field of biometrics and many different approaches have been published. In this paper, a conceptual categorization and metrics for an evaluation of such methods are presented, followed by a comprehensive survey of relevant publications. In addition, technical considerations and tradeoffs of the surveyed methods are discussed along with open issues and challenges in the field.

Journal ArticleDOI
TL;DR: A new spatio-temporal feature representation learning method for FER that is robust to expression intensity variations is proposed; it achieves higher recognition rates than state-of-the-art methods on both evaluated datasets.
Abstract: Facial expression recognition (FER) is increasingly gaining importance in various emerging affective computing applications. In practice, achieving accurate FER is challenging due to the large amount of inter-personal variations such as expression intensity variations. In this paper, we propose a new spatio-temporal feature representation learning for FER that is robust to expression intensity variations. The proposed method utilizes representative expression-states (e.g., onset, apex and offset of expressions) which can be specified in facial sequences regardless of the expression intensity. The characteristics of facial expressions are encoded in two parts in this paper. As the first part, spatial image characteristics of the representative expression-state frames are learned via a convolutional neural network. Five objective terms are proposed to improve the expression class separability of the spatial feature representation. In the second part, temporal characteristics of the spatial feature representation in the first part are learned with a long short-term memory of the facial expression. Comprehensive experiments have been conducted on a deliberate expression dataset (MMI) and a spontaneous micro-expression dataset (CASME II). Experimental results showed that the proposed method achieved higher recognition rates in both datasets compared to the state-of-the-art methods.
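
The two-part structure (spatial CNN per representative frame, temporal LSTM over the resulting features) can be sketched as below; architecture sizes and the use of grayscale input are assumptions, and the paper's five objective terms are omitted.

```python
# Simplified CNN + LSTM sketch of spatio-temporal FER feature learning.
import torch
import torch.nn as nn

class SpatioTemporalFER(nn.Module):
    def __init__(self, num_classes=7, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(64 * 16, feat_dim))
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, frames):                    # (B, T, 1, H, W)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).reshape(B, T, -1)
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])        # classify from the last time step
```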

Proceedings ArticleDOI
08 Apr 2019
TL;DR: In this paper, a lightweight "clip-sampling" model is proposed to efficiently identify the most salient temporal clips within a long video, so that action recognition on untrimmed videos can be invoked only on those clips.
Abstract: While many action recognition datasets consist of collections of brief, trimmed videos each containing a relevant action, videos in the real-world (e.g., on YouTube) exhibit very different properties: they are often several minutes long, where brief relevant clips are often interleaved with segments of extended duration containing little change. Applying densely an action recognition system to every temporal clip within such videos is prohibitively expensive. Furthermore, as we show in our experiments, this results in suboptimal recognition accuracy as informative predictions from relevant clips are outnumbered by meaningless classification outputs over long uninformative sections of the video. In this paper we introduce a lightweight ``clip-sampling'' model that can efficiently identify the most salient temporal clips within a long video. We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips. Furthermore, we show that this yields significant gains in recognition accuracy compared to analysis of all clips or randomly selected clips. On Sports1M, our clip sampling scheme elevates the accuracy of an already state-of-the-art action classifier by 7% and reduces by more than 15 times its computational cost.
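
The selection scheme can be illustrated in a few lines: a cheap scorer rates every clip and the expensive classifier runs only on the top-k clips. Both models below are placeholders; k and the averaging of logits are assumptions.

```python
# Minimal sketch of salient-clip selection before expensive classification.
import numpy as np

def classify_with_clip_sampling(clips, cheap_scorer, expensive_classifier, k=10):
    scores = np.array([cheap_scorer(c) for c in clips])        # saliency per clip
    top_idx = np.argsort(scores)[-k:]                          # keep k most salient
    logits = np.mean([expensive_classifier(clips[i]) for i in top_idx], axis=0)
    return np.argmax(logits)
```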

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper evaluates the robustness of state-of-the-art face recognition models in the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but can only acquire hard-label predictions by sending queries to the target model.
Abstract: Face recognition has obtained remarkable progress in recent years due to the great improvement of deep convolutional neural networks (CNNs). However, deep CNNs are vulnerable to adversarial examples, which can cause fateful consequences in real-world face recognition applications with security-sensitive purposes. Adversarial attacks are widely studied as they can identify the vulnerability of the models before they are deployed. In this paper, we evaluate the robustness of state-of-the-art face recognition models in the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but can only acquire hard-label predictions by sending queries to the target model. This attack setting is more practical in real-world face recognition systems. To improve the efficiency of previous methods, we propose an evolutionary attack algorithm, which can model the local geometry of the search directions and reduce the dimension of the search space. Extensive experiments demonstrate the effectiveness of the proposed method that induces a minimum perturbation to an input face image with fewer queries. We also apply the proposed method to attack a real-world face recognition system successfully.
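
A greatly simplified sketch of a decision-based attack loop in this spirit is shown below: the attacker only queries hard-label decisions and keeps candidate perturbations that remain adversarial while shrinking their distance to the original image. The paper's covariance adaptation and search-space dimensionality reduction are omitted, and the step sizes are arbitrary.

```python
# Simplified decision-based (hard-label) attack loop; not the paper's full algorithm.
import numpy as np

def decision_based_attack(x, is_adversarial, x_adv_init, steps=1000, sigma=0.01):
    x_adv = x_adv_init.copy()                            # any image already misclassified
    for _ in range(steps):
        candidate = x_adv + sigma * np.random.randn(*x.shape)
        candidate = candidate + 0.05 * (x - candidate)   # drift towards the original image
        if is_adversarial(candidate):                    # one hard-label query
            if np.linalg.norm(candidate - x) < np.linalg.norm(x_adv - x):
                x_adv = candidate
    return x_adv
```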

Proceedings ArticleDOI
04 Jun 2019
TL;DR: A Convolutional Neural Network (CNN) based framework for presentation attack detection, with deep pixel-wise supervision, suitable for deployment in smart devices with minimal computational and time overhead is introduced.
Abstract: Face recognition has evolved as a prominent biometric authentication modality. However, vulnerability to presentation attacks curtails its reliable deployment. Automatic detection of presentation attacks is essential for secure use of face recognition technology in unattended scenarios. In this work, we introduce a Convolutional Neural Network (CNN) based framework for presentation attack detection, with deep pixel-wise supervision. The framework uses only frame level information making it suitable for deployment in smart devices with minimal computational and time overhead. We demonstrate the effectiveness of the proposed approach in public datasets for both intra as well as cross-dataset experiments. The proposed approach achieves an HTER of 0% in Replay Mobile dataset and an ACER of 0.42% in Protocol-1 of OULU dataset outperforming state of the art methods.
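
The pixel-wise supervision idea can be sketched as a network that predicts a liveness score map (each location supervised as live = 1, attack = 0) plus a global score, with binary cross-entropy on both. The backbone and layer sizes below are placeholders, not the paper's architecture.

```python
# Sketch of deep pixel-wise binary supervision for presentation attack detection.
import torch
import torch.nn as nn

class PixelWisePAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.map_head = nn.Conv2d(64, 1, 1)          # pixel-wise liveness map
        self.fc = nn.Linear(1, 1)                    # global score from pooled map

    def forward(self, x):
        score_map = torch.sigmoid(self.map_head(self.features(x)))
        global_score = torch.sigmoid(self.fc(score_map.mean(dim=(2, 3))))
        return score_map, global_score

def pad_loss(score_map, global_score, label):       # label: (B,) 1=live, 0=attack
    bce = nn.functional.binary_cross_entropy
    map_target = label.view(-1, 1, 1, 1).expand_as(score_map).float()
    return bce(score_map, map_target) + bce(global_score.squeeze(1), label.float())
```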

Proceedings ArticleDOI
03 May 2019
TL;DR: This paper presents an implementation of Principal Component Analysis (PCA) for masked and non-masked face recognition, together with a comparative study of the two settings.
Abstract: This paper presents an implementation of Principal Component Analysis (PCA) for masked and non-masked face recognition. Security is an essential concern in daily life. Among biometric technologies, face recognition is widely used to secure systems because it is more convenient than traditional techniques such as PINs, passwords and fingerprints, and can reliably identify or verify a person. In recent years, face recognition has become a very challenging task because of occlusions or masks such as sunglasses, scarves, hats and different types of make-up or disguise. The accuracy of face recognition is influenced by these types of masks. Many algorithms have been developed recently for non-masked face recognition, are widely used and give good performance; in the field of masked face recognition, however, few contributions have been made. Therefore, in this work a statistical procedure commonly applied to non-masked face recognition is also applied to masked face recognition. PCA is an effective, successful and widely used statistical technique, and for this reason it is chosen in this work. Finally, a comparative study is presented for better understanding.
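
A minimal PCA-based recognizer can be sketched as below: training faces are projected onto principal components (eigenfaces) and a probe is classified by its nearest projected training face. Data loading and face alignment are left out, and inputs are assumed to be flattened grayscale face images.

```python
# Minimal sketch of PCA (eigenfaces) face recognition with a nearest-neighbour classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def train_pca_recognizer(train_faces, train_labels, n_components=50):
    pca = PCA(n_components=n_components, whiten=True).fit(train_faces)
    knn = KNeighborsClassifier(n_neighbors=1).fit(pca.transform(train_faces), train_labels)
    return pca, knn

def recognize(pca, knn, probe_faces):
    return knn.predict(pca.transform(probe_faces))
```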

Proceedings ArticleDOI
01 Jun 2019
TL;DR: In this article, the authors introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers and train a neural network on their dataset that factors identity from facial motion.
Abstract: Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input—even speech in languages other than English—and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.

Journal ArticleDOI
TL;DR: A novel method, named deep comprehensive multipatches aggregation convolutional neural networks (CNNs), is proposed to solve the FER problem; the deep framework mainly consists of two CNN branches, one extracting local features from image patches and the other extracting holistic features from the whole expression image.
Abstract: Facial expression recognition (FER) has long been a challenging task in computer vision. In this paper, we propose a novel method, named deep comprehensive multipatches aggregation convolutional neural networks (CNNs), to solve the FER problem. The proposed method is a deep-based framework, which mainly consists of two branches of the CNN. One branch extracts local features from image patches while the other extracts holistic features from the whole expressional image. In the model, local features depict expressional details and holistic features characterize the high-level semantic information of an expression. We aggregate both local and holistic features before making classification. These two types of hierarchical features represent expressions in different scales. Compared with most current methods with a single type of feature, the model can represent expressions more comprehensively. Additionally, in the training stage, a novel pooling strategy named expressional transformation-invariant pooling is proposed for handling nuisance variations, such as rotations, noises, etc. Extensive experiments are conducted on the well-known Extended Cohn-Kanade (CK+) dataset and the Japanese Female Facial Expression (JAFFE) database, where competitive recognition results are obtained.
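
The two-branch aggregation can be sketched as follows: one branch encodes local patches, the other encodes the whole face, and the two feature types are concatenated before classification. Layer sizes, patch handling and the mean aggregation over patches are assumptions; the paper's transformation-invariant pooling is omitted.

```python
# Sketch of a two-branch (local patches + holistic image) FER network.
import torch
import torch.nn as nn

class TwoBranchFER(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, 128))
        self.local_branch = branch(1)
        self.holistic_branch = branch(1)
        self.classifier = nn.Linear(128 * 2, num_classes)

    def forward(self, patches, whole):            # patches: (B, P, 1, h, w), whole: (B, 1, H, W)
        B, P = patches.shape[:2]
        local = self.local_branch(patches.flatten(0, 1)).reshape(B, P, -1).mean(dim=1)
        holistic = self.holistic_branch(whole)
        return self.classifier(torch.cat([local, holistic], dim=1))
```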

Journal ArticleDOI
Rejeesh M R
TL;DR: The performance of the proposed ANFIS-ABC technique is evaluated using the ORL database with 400 images of 40 individuals, the YALE-B database with 165 images of 15 individuals, and real-time video; the detection rate and false alarm rate are compared between the proposed and existing methods to demonstrate the system's efficiency.
Abstract: In this paper, an efficient face recognition method using AGA and ANFIS-ABC is proposed. In the first stage, the face images gathered from the database are preprocessed. In the second stage, interest points are determined, which are used to improve the detection rate; the parameters used in the interest point determination are optimized using the Adaptive Genetic Algorithm (AGA). Finally, face images are classified with ANFIS using the extracted features. During the training process, the parameters of ANFIS are optimized using the Artificial Bee Colony (ABC) algorithm in order to improve accuracy. The performance of the proposed ANFIS-ABC technique is evaluated using the ORL database with 400 images of 40 individuals, the YALE-B database with 165 images of 15 individuals, and real-time video; the detection rate and false alarm rate are compared between the proposed and existing methods to demonstrate the system's efficiency.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper proposes the Adaptive Margin Softmax to adjust the margins for different classes adaptively, and makes the sampling process adaptive in two ways: first, Hard Prototype Mining adaptively selects a small number of hard classes to participate in classification; second, Adaptive Data Sampling finds valuable samples for training adaptively.
Abstract: Training large-scale unbalanced data is the central topic in face recognition. In the past two years, face recognition has achieved remarkable improvements due to the introduction of margin based Softmax loss. However, these methods have an implicit assumption that all the classes possess sufficient samples to describe its distribution, so that a manually set margin is enough to equally squeeze each intra-class variations. However, real face datasets are highly unbalanced, which means the classes have tremendously different numbers of samples. In this paper, we argue that the margin should be adapted to different classes. We propose the Adaptive Margin Softmax to adjust the margins for different classes adaptively. In addition to the unbalance challenge, face data always consists of large-scale classes and samples. Smartly selecting valuable classes and samples to participate in the training makes the training more effective and efficient. To this end, we also make the sampling process adaptive in two folds: Firstly, we propose the Hard Prototype Mining to adaptively select a small number of hard classes to participate in classification. Secondly, for data sampling, we introduce the Adaptive Data Sampling to find valuable samples for training adaptively. We combine these three parts together as AdaptiveFace. Extensive analysis and experiments on LFW, LFW BLUFR and MegaFace show that our method performs better than state-of-the-art methods using the same network architecture and training dataset. Code is available at https://github.com/haoliu1994/AdaptiveFace.
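
The adaptive margin idea can be sketched as a cosine-margin loss whose per-class margin is a learnable parameter, with a regularizer that rewards larger margins; details of the released implementation may differ, and the scale, initial margin and regularization weight below are assumptions.

```python
# Sketch of a cosine-margin head with learnable per-class margins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMarginHead(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, lam=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margins = nn.Parameter(torch.full((num_classes,), 0.3))
        self.s, self.lam = s, lam

    def forward(self, features, labels):
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        # Subtract each sample's class-specific margin from its target logit only.
        logits = self.s * (cosine - one_hot * self.margins[labels].unsqueeze(1))
        # Encourage the learned margins to stay large on average.
        return F.cross_entropy(logits, labels) - self.lam * self.margins.mean()
```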

Journal ArticleDOI
TL;DR: This paper proposes a neighborly de-convolutional neural network (NbNet) to reconstruct face images from their deep templates and demonstrates the need to secure deep templates in face recognition systems.
Abstract: State-of-the-art face recognition systems are based on deep (convolutional) neural networks. Therefore, it is imperative to determine to what extent face templates derived from deep networks can be inverted to obtain the original face image. In this paper, we study the vulnerabilities of a state-of-the-art face recognition system based on template reconstruction attack. We propose a neighborly de-convolutional neural network ( NbNet ) to reconstruct face images from their deep templates. In our experiments, we assumed that no knowledge about the target subject and the deep network are available. To train the NbNet reconstruction models, we augmented two benchmark face datasets (VGG-Face and Multi-PIE) with a large collection of images synthesized using a face generator. The proposed reconstruction was evaluated using type-I (comparing the reconstructed images against the original face images used to generate the deep template) and type-II (comparing the reconstructed images against a different face image of the same subject) attacks. Given the images reconstructed from NbNets , we show that for verification, we achieve TAR of 95.20 percent (58.05 percent) on LFW under type-I (type-II) attacks @ FAR of 0.1 percent. Besides, 96.58 percent (92.84 percent) of the images reconstructed from templates of partition fa ( fb ) can be identified from partition fa in color FERET. Our study demonstrates the need to secure deep templates in face recognition systems.
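
How a type-I template-reconstruction attack is scored can be sketched as follows: the reconstructed image is re-embedded and compared against the template of the original image, and the attack counts as a success if the similarity exceeds the verification threshold. The embedder and reconstructor below are placeholders standing in for the face matcher and the reconstruction network.

```python
# Sketch of evaluating a type-I template-reconstruction attack.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def type1_attack_success_rate(images, embed, reconstruct, threshold):
    hits = 0
    for img in images:
        template = embed(img)                      # deep template of the target image
        recon = reconstruct(template)              # image rebuilt from the template
        hits += cosine(embed(recon), template) >= threshold
    return hits / len(images)
```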

Journal ArticleDOI
TL;DR: A Disentangled Representation learning-Generative Adversarial Network (DR-GAN) with three distinct novelties that demonstrate the superiority of DR-GAN over the state of the art in both learning representations and rotating large-pose face images.
Abstract: The large pose discrepancy between two face images is one of the fundamental challenges in automatic face recognition. Conventional approaches to pose-invariant face recognition either perform face frontalization on, or learn a pose-invariant representation from, a non-frontal face image. We argue that it is more desirable to perform both tasks jointly to allow them to leverage each other. To this end, this paper proposes a Disentangled Representation learning-Generative Adversarial Network (DR-GAN) with three distinct novelties. First, the encoder-decoder structure of the generator enables DR-GAN to learn a representation that is both generative and discriminative, which can be used for face image synthesis and pose-invariant face recognition. Second, this representation is explicitly disentangled from other face variations such as pose, through the pose code provided to the decoder and pose estimation in the discriminator. Third, DR-GAN can take one or multiple images as the input, and generate one unified identity representation along with an arbitrary number of synthetic face images. Extensive quantitative and qualitative evaluation on a number of controlled and in-the-wild databases demonstrate the superiority of DR-GAN over the state of the art in both learning representations and rotating large-pose face images.

Posted Content
TL;DR: Diversity in Faces (DiF) provides a data set of one million annotated human face images for advancing the study of facial diversity; the authors believe that making the extracted coding schemes available on a large set of faces can accelerate research and development towards creating more fair and accurate facial recognition systems.
Abstract: Face recognition is a long standing challenge in the field of Artificial Intelligence (AI). The goal is to create systems that accurately detect, recognize, verify, and understand human faces. There are significant technical hurdles in making these systems accurate, particularly in unconstrained settings due to confounding factors related to pose, resolution, illumination, occlusion, and viewpoint. However, with recent advances in neural networks, face recognition has achieved unprecedented accuracy, largely built on data-driven deep learning methods. While this is encouraging, a critical aspect that is limiting facial recognition accuracy and fairness is inherent facial diversity. Every face is different. Every face reflects something unique about us. Aspects of our heritage - including race, ethnicity, culture, geography - and our individual identity - age, gender, and other visible manifestations of self-expression - are reflected in our faces. We expect face recognition to work equally accurately for every face. Face recognition needs to be fair. As we rely on data-driven methods to create face recognition technology, we need to ensure necessary balance and coverage in training data. However, there are still scientific questions about how to represent and extract pertinent facial features and quantitatively measure facial diversity. Towards this goal, Diversity in Faces (DiF) provides a data set of one million annotated human face images for advancing the study of facial diversity. The annotations are generated using ten well-established facial coding schemes from the scientific literature. The facial coding schemes provide human-interpretable quantitative measures of facial features. We believe that by making the extracted coding schemes available on a large set of faces, we can accelerate research and development towards creating more fair and accurate facial recognition systems.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: CASIA-SURF as mentioned in this paper is a large-scale multi-modal dataset for face anti-spoofing, which consists of 1,000 subjects with 21,000 videos and each sample has three modalities (i.e., RGB, depth and IR).
Abstract: Face anti-spoofing is essential to prevent face recognition systems from a security breach. Much of the progresses have been made by the availability of face anti-spoofing benchmark datasets in recent years. However, existing face anti-spoofing benchmarks have limited number of subjects (≤170) and modalities (≤2), which hinder the further development of the academic community. To facilitate face anti-spoofing research, we introduce a large-scale multi-modal dataset, namely CASIA-SURF, which is the largest publicly available dataset for face anti-spoofing in terms of both subjects and visual modalities. Specifically, it consists of 1,000 subjects with 21,000 videos and each sample has 3 modalities (i.e., RGB, Depth and IR). We also provide a measurement set, evaluation protocol and training/validation/testing subsets, developing a new benchmark for face anti-spoofing. Moreover, we present a new multi-modal fusion method as baseline, which performs feature re-weighting to select the more informative channel features while suppressing the less useful ones for each modal. Extensive experiments have been conducted on the proposed dataset to verify its significance and generalization capability. The dataset is available at https://sites.google.com/qq.com/chalearnfacespoofingattackdete/.
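
The baseline fusion with feature re-weighting can be sketched as a squeeze-and-excitation-style gate over the concatenated per-modality features, so that the more informative channels dominate. Backbones, channel counts and the reduction ratio below are assumptions rather than the paper's exact baseline.

```python
# Sketch of channel re-weighting fusion over RGB, depth and IR feature maps.
import torch
import torch.nn as nn

class ChannelReweightFusion(nn.Module):
    def __init__(self, channels_per_modality=64, num_modalities=3, reduction=8):
        super().__init__()
        total = channels_per_modality * num_modalities
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(total, total // reduction), nn.ReLU(),
            nn.Linear(total // reduction, total), nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat, ir_feat):     # each: (B, C, H, W)
        fused = torch.cat([rgb_feat, depth_feat, ir_feat], dim=1)
        weights = self.gate(fused).unsqueeze(-1).unsqueeze(-1)
        return fused * weights                             # re-weighted fused features
```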

Journal ArticleDOI
TL;DR: A novel image recognition and navigation system is proposed that provides precise and quick audio messages to visually challenged people so that they can navigate easily.
Abstract: Many recent advancements are achieved by interconnecting physical devices with computers; this is what is known as the Internet of Things (IoT). The major problems faced by blind people are navigating through indoor and outdoor environments containing various obstacles and recognizing the person in front of them. Identifying objects or people with only tactile and audio information is difficult, so an intelligent, portable, inexpensive, self-contained navigation and face recognition system is in high demand for blind people. Such a system helps blind people navigate with the help of a smartphone, a global positioning system (GPS) and a device equipped with ultrasonic sensors. Face recognition can be performed using neural learning techniques with feature extraction and training modules. Images of friends and relatives are stored in a database on the user's smartphone, and whenever a person appears in front of the blind user, the application uses the neural network to give voice feedback to the user. Thus, this system can replace the imprecise use of guide dogs and white canes in the navigation and face recognition process for people with impaired vision. In this paper, we propose a novel image recognition and navigation system which provides precise and quick messages in the form of audio to visually challenged people so that they can navigate easily. The performance of the proposed method is comparatively analyzed with the help of ROC analysis.