
Showing papers on "Facial recognition system published in 2021"


Journal ArticleDOI
TL;DR: This paper provides a comprehensive survey of existing open set recognition techniques covering various aspects ranging from related definitions, representations of models, datasets, evaluation criteria, and algorithm comparisons to highlight the limitations of existing approaches and point out some promising subsequent research directions.
Abstract: In real-world recognition/classification tasks, limited by various objective factors, it is usually difficult to collect training samples to exhaust all classes when training a recognizer or classifier. A more realistic scenario is open set recognition (OSR), where incomplete knowledge of the world exists at training time, and unknown classes can be submitted to an algorithm during testing, requiring the classifiers to not only accurately classify the seen classes, but also effectively deal with unseen ones. This paper provides a comprehensive survey of existing open set recognition techniques covering various aspects ranging from related definitions, representations of models, datasets, evaluation criteria, and algorithm comparisons. Furthermore, we briefly analyze the relationships between OSR and its related tasks including zero-shot, one-shot (few-shot) recognition/learning techniques, classification with reject option, and so forth. Additionally, we also review the open world recognition which can be seen as a natural extension of OSR. Importantly, we highlight the limitations of existing approaches and point out some promising subsequent research directions in this field.
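A common baseline that the surveyed OSR methods build on is thresholding a closed-set classifier's confidence: a test sample whose maximum class probability falls below a calibrated threshold is rejected as "unknown." The sketch below illustrates that idea with plain NumPy; the threshold value and the toy logits are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over class logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def open_set_predict(logits, threshold=0.7, unknown_label=-1):
    """Closed-set prediction with rejection: samples whose maximum
    softmax probability is below `threshold` are labelled unknown."""
    probs = softmax(logits)
    preds = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, preds, unknown_label)

# Toy example: 3 known classes, 2 test samples.
logits = np.array([[4.0, 0.5, 0.2],    # confidently class 0
                   [1.1, 1.0, 0.9]])   # ambiguous -> rejected as unknown
print(open_set_predict(logits))        # [ 0 -1]
```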

492 citations


Proceedings ArticleDOI
11 Mar 2021
TL;DR: MagFace as discussed by the authors introduces an adaptive mechanism to learn well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away, which prevents models from overfitting on noisy low-quality samples and improves face recognition in the wild.
Abstract: The performance of a face recognition system degrades when the variability of the acquired faces increases. Prior work alleviates this issue by either monitoring the face quality in pre-processing or predicting the data uncertainty along with the face feature. This paper proposes MagFace, a category of losses that learn a universal feature embedding whose magnitude can measure the quality of the given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away. This prevents models from overfitting on noisy low-quality samples and improves face recognition in the wild. Extensive experiments conducted on face recognition, quality assessment, as well as clustering demonstrate its superiority over the state of the art. The code is available at https://github.com/IrvingMeng/MagFace.
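Because MagFace ties the L2 norm of the un-normalised embedding to recognisability, face quality can be read off a trained model simply by measuring that magnitude. The PyTorch sketch below shows the mechanism only; the backbone here is an untrained placeholder, so the scores are illustrative and do not reflect real MagFace quality values.

```python
import torch
import torch.nn as nn

# Placeholder embedding network standing in for a trained MagFace backbone.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 512))

@torch.no_grad()
def magface_quality(images):
    """Return per-image quality scores as the L2 magnitude of the
    un-normalised embedding (larger magnitude = easier to recognise
    under the MagFace training objective)."""
    feats = backbone(images)          # (N, 512) raw embeddings
    return feats.norm(p=2, dim=1)     # (N,) magnitudes as quality scores

faces = torch.randn(4, 3, 112, 112)   # batch of aligned face crops
print(magface_quality(faces))
```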

268 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a comprehensive survey of body gesture recognition methods and discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition, and define a complete framework for automatic emotional body gesture recognition.
Abstract: Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and comment on static and dynamic body pose estimation methods, both in RGB and 3D. We then review the recent literature related to representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g., human detection and pose estimation) are nowadays mature technologies fully developed for robust large-scale analysis, we show that for emotion recognition the quantity of labelled data is scarce. There is no agreement on clearly defined output spaces, and the representations are shallow and largely based on naive geometrical representations.

256 citations


Proceedings ArticleDOI
18 Aug 2021
TL;DR: TediGAN as discussed by the authors uses a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization to produce diverse and high-quality images with an unprecedented resolution of 1024×1024.
Abstract: In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can produce diverse and high-quality images with an unprecedented resolution of 1024×1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.

212 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Patch-NetVLAD as discussed by the authors combines the advantages of both local and global descriptor methods by deriving patch-level features from NetVLAD residuals, which enables aggregation and matching of deep-learned local features defined over the feature-space grid.
Abstract: Visual Place Recognition is a challenging task for robotics and autonomous systems, which must deal with the twin problems of appearance and viewpoint change in an always changing world. This paper introduces Patch-NetVLAD, which provides a novel formulation for combining the advantages of both local and global descriptor methods by deriving patch-level features from NetVLAD residuals. Unlike the fixed spatial neighborhood regime of existing local keypoint features, our method enables aggregation and matching of deep-learned local features defined over the feature-space grid. We further introduce a multi-scale fusion of patch features that have complementary scales (i.e. patch sizes) via an integral feature space and show that the fused features are highly invariant to both condition (season, structure, and illumination) and viewpoint (translation and rotation) changes. Patch-NetVLAD achieves state-of-the-art visual place recognition results in computationally limited scenarios, validated on a range of challenging real-world datasets, including winning the Facebook Mapillary Visual Place Recognition Challenge at ECCV2020. It is also adaptable to user requirements, with a speed-optimised version operating over an order of magnitude faster than the state-of-the-art. By combining superior performance with improved computational efficiency in a configurable framework, Patch-NetVLAD is well suited to enhance both stand-alone place recognition capabilities and the overall performance of SLAM systems.

199 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed an approach using deep learning, TensorFlow, Keras, and OpenCV to detect face masks using Single Shot Multibox Detector as a face detector and MobilenetV2 architecture as a framework for the classifier.

193 citations


Proceedings ArticleDOI
01 Jan 2021
TL;DR: In this article, the authors proposed a Deep Attentive Center Loss (DACL) method to adaptively select a subset of significant feature elements for enhanced discrimination, which integrates an attention mechanism to estimate attention weights correlated with feature importance.
Abstract: Learning discriminative features for Facial Expression Recognition (FER) in the wild using Convolutional Neural Networks (CNNs) is a non-trivial task due to the significant intra-class variations and inter-class similarities. Deep Metric Learning (DML) approaches such as center loss and its variants jointly optimized with softmax loss have been adopted in many FER methods to enhance the discriminative power of learned features in the embedding space. However, equally supervising all features with the metric learning method might include irrelevant features and ultimately degrade the generalization ability of the learning algorithm. We propose a Deep Attentive Center Loss (DACL) method to adaptively select a subset of significant feature elements for enhanced discrimination. The proposed DACL integrates an attention mechanism to estimate attention weights correlated with feature importance using the intermediate spatial feature maps in CNN as context. The estimated weights accommodate the sparse formulation of center loss to selectively achieve intra-class compactness and inter-class separation for the relevant information in the embedding space. An extensive study on two widely used wild FER datasets demonstrates the superiority of the proposed DACL method compared to state-of-the-art methods.
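The core of DACL is a center loss in which learned attention weights modulate how strongly each embedding dimension is pulled to its class center. Below is a minimal PyTorch sketch of that loss term under stated assumptions: the attention weights are taken as given (in the paper they come from an attention network over intermediate CNN feature maps), so only the loss side of the method is shown.

```python
import torch
import torch.nn as nn

class AttentiveCenterLoss(nn.Module):
    """Center loss in which a per-dimension attention weight modulates how
    strongly that dimension is pulled to its class center
    (a simplified sketch of the DACL idea)."""

    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels, attention):
        # features, attention: (N, D); labels: (N,)
        centers = self.centers[labels]            # (N, D) class centers
        diff = (features - centers) ** 2          # per-dimension squared distance
        return 0.5 * (attention * diff).sum(dim=1).mean()

# Toy usage: 8 classes, 64-d embeddings, attention e.g. from a sigmoid head.
loss_fn = AttentiveCenterLoss(num_classes=8, feat_dim=64)
feats = torch.randn(16, 64)
labels = torch.randint(0, 8, (16,))
attn = torch.sigmoid(torch.randn(16, 64))
print(loss_fn(feats, labels, attn))
```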

137 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose an additive angular margin loss (ArcFace), which not only has a clear geometric interpretation, but also significantly enhances the discriminative power.
Abstract: Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability. In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power. Since ArcFace is susceptible to massive label noise, we further propose sub-center ArcFace, in which each class contains K sub-centers and training samples only need to be close to any of the K positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Based on this self-propelled isolation, we boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, we also explore the inverse problem, mapping feature vectors to face images. Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data, only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis.
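The additive angular margin described here has a compact implementation: normalise embeddings and class weights, add the margin m to the angle of the target class, and rescale by s before the usual softmax cross-entropy. A minimal PyTorch sketch (without the sub-center extension) follows; s=64 and m=0.5 are the values commonly used with ArcFace, and the layer is a simplified illustration rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    """Additive angular margin logits: cos(theta + m) for the target class,
    cos(theta) for the others, scaled by s (simplified ArcFace head)."""

    def __init__(self, in_features, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return self.s * logits

# Usage: the margin logits feed a standard cross-entropy loss.
head = ArcMarginProduct(in_features=512, num_classes=1000)
emb = torch.randn(8, 512)
labels = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(head(emb, labels), labels)
print(loss)
```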

136 citations


Proceedings ArticleDOI
16 Mar 2021
TL;DR: In this paper, the authors propose a frequency-aware discriminative feature learning framework with a single-center loss that only compresses intra-class variations of natural faces while boosting inter-class differences in the embedding space.
Abstract: Face forgery detection is raising ever-increasing interest in computer vision since facial manipulation technologies cause serious worries. Though recent works have reached sound achievements, there are still unignorable problems: a) learned features supervised by softmax loss are separable but not discriminative enough, since softmax loss does not explicitly encourage intra-class compactness and inter-class separability; and b) fixed filter banks and hand-crafted features are insufficient to capture forgery patterns of frequency from diverse inputs. To compensate for such limitations, a novel frequency-aware discriminative feature learning framework is proposed in this paper. Specifically, we design a novel single-center loss (SCL) that only compresses intra-class variations of natural faces while boosting inter-class differences in the embedding space. In such a case, the network can learn more discriminative features with less optimization difficulty. Besides, an adaptive frequency feature generation module is developed to mine frequency clues in a completely data-driven fashion. With the above two modules, the whole framework can learn more discriminative features in an end-to-end manner. Extensive experiments demonstrate the effectiveness and superiority of our framework on three versions of the FF++ dataset.

111 citations


Journal ArticleDOI
08 May 2021-Sensors
TL;DR: In this article, an improved CSPDarkNet53 is introduced into the backbone feature extraction network, which reduces the computing cost of the network and improves the learning ability of the model.
Abstract: To solve the problems of low accuracy, poor real-time performance, and poor robustness caused by complex environments, this paper proposes a face mask recognition and standard-wear detection algorithm based on an improved YOLO-v4. Firstly, an improved CSPDarkNet53 is introduced into the backbone feature extraction network, which reduces the computing cost of the network and improves the learning ability of the model. Secondly, an adaptive image scaling algorithm reduces computation and redundancy effectively. Thirdly, an improved PANet structure is introduced so that the network has more semantic information in the feature layer. Finally, a face mask detection dataset is constructed according to the standard wearing of masks. Based on the deep-learning object detection algorithm, a variety of evaluation indexes are compared to evaluate the effectiveness of the model. The comparison results show that the mAP of face mask recognition reaches 98.3% and the frame rate reaches 54.57 FPS, which is more accurate compared with existing algorithms.
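In the YOLO family, "adaptive image scaling" typically means a letterbox resize that preserves aspect ratio and adds only as much padding as needed to reach a stride-aligned size, which cuts wasted computation on padding. The NumPy/OpenCV sketch below illustrates that operation under this assumption; it is not the authors' exact implementation.

```python
import cv2
import numpy as np

def adaptive_letterbox(img, target=416, stride=32, pad_value=114):
    """Resize keeping aspect ratio, then pad minimally so both sides are
    multiples of `stride` (less redundant padding than a fixed square canvas)."""
    h, w = img.shape[:2]
    scale = min(target / h, target / w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    pad_h = (stride - new_h % stride) % stride
    pad_w = (stride - new_w % stride) % stride
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # dummy camera frame
print(adaptive_letterbox(frame).shape)              # (256, 416, 3)
```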

111 citations


Journal ArticleDOI
TL;DR: The authors present a feature-based method for 2D face images, which uses speeded-up robust features (SURF) and scale-invariant feature transform (SIFT) for feature extraction and achieves a maximum recognition accuracy of 99.7%.
Abstract: Face recognition is the process of identifying people through facial images. It has become vital for security and surveillance applications and is required everywhere, including institutions, organizations, offices, and social places. There are a number of challenges faced in face recognition, which include face pose, age, gender, illumination, and other variable conditions. Another challenge is that the database size for these applications is usually small, so training and recognition become difficult. Face recognition methods can be divided into two major categories: appearance-based methods and feature-based methods. In this paper, the authors present a feature-based method for 2D face images. Speeded-up robust features (SURF) and scale-invariant feature transform (SIFT) are used for feature extraction. Five public datasets, namely Yale2B, Face 94, M2VTS, ORL, and FERET, are used for the experimental work. Various combinations of SIFT and SURF features with two classification techniques, namely decision tree and random forest, have been experimented with in this work. A maximum recognition accuracy of 99.7% has been reported by the authors with a combination of SIFT (64 components) and SURF (32 components).
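A minimal sketch of this feature-based pipeline is shown below using OpenCV's SIFT detector and scikit-learn's random forest. To keep the example self-contained, each face is summarised by the average of its SIFT descriptors (a simplification of the paper's 64/32-component SIFT+SURF combinations), and SURF is omitted because it lives in OpenCV's non-free contrib module.

```python
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

sift = cv2.SIFT_create()

def face_descriptor(gray_face, dim=128):
    """Average the 128-d SIFT descriptors of a grayscale face image into a
    single fixed-length vector (zeros if no keypoints are found)."""
    _, desc = sift.detectAndCompute(gray_face, None)
    if desc is None:
        return np.zeros(dim, dtype=np.float32)
    return desc.mean(axis=0)

def train_face_classifier(face_images, labels):
    """Fit a random forest on per-face SIFT descriptors."""
    X = np.stack([face_descriptor(img) for img in face_images])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, labels)
    return clf

# Toy usage with random "faces"; real use would load ORL/Yale2B/FERET crops.
rng = np.random.default_rng(0)
faces = [rng.integers(0, 256, (112, 92), dtype=np.uint8) for _ in range(20)]
ids = [i % 5 for i in range(20)]
clf = train_face_classifier(faces, ids)
print(clf.predict([face_descriptor(faces[0])]))
```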

Proceedings ArticleDOI
06 Mar 2021
TL;DR: In this paper, the authors contribute a new million-scale face benchmark containing noisy 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) as training data, as well as an elaborately designed time-constrained evaluation protocol.
Abstract: In this paper, we contribute a new million-scale face benchmark containing noisy 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) as training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect a 4M name list and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set, and we expect it to close the data gap between academia and industry. Referring to practical scenarios, a Face Recognition Under Inference Time conStraint (FRUITS) protocol and a test set are constructed to comprehensively evaluate face matchers. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without compromising performance. Empowered by WebFace42M, we reduce the relative failure rate by 40% on the challenging IJB-C set, and rank 3rd among 430 entries on NIST-FRVT. Even 10% of the data (WebFace4M) shows superior performance compared with public training sets. Furthermore, comprehensive baselines are established on our rich-attribute test set under the FRUITS-100ms/500ms/1000ms protocols, including the MobileNet, EfficientNet, AttentionNet, ResNet, SENet, ResNeXt, and RegNet families. The benchmark website is https://www.face-benchmark.org.

Posted ContentDOI
TL;DR: This paper proposes a reliable method based on discarding the masked region and deep-learning-based features in order to address the masked face recognition problem; experimental results show high recognition performance.
Abstract: The coronavirus disease (COVID-19) is an unparalleled crisis leading to a huge number of casualties and security problems. In order to reduce the spread of coronavirus, people often wear masks to protect themselves. This makes face recognition a very difficult task, since certain parts of the face are hidden. A primary focus of researchers during the ongoing coronavirus pandemic is to come up with suggestions to handle this problem through rapid and efficient solutions. In this paper, we propose a reliable method based on occlusion removal and deep-learning-based features in order to address the masked face recognition problem. The first step is to remove the masked face region. Next, we apply three pre-trained deep Convolutional Neural Networks (CNNs), namely VGG-16, AlexNet, and ResNet-50, and use them to extract deep features from the obtained regions (mostly the eye and forehead regions). The bag-of-features paradigm is then applied to the feature maps of the last convolutional layer in order to quantize them and to obtain a more compact representation compared to the fully connected layer of a classical CNN. Finally, a Multilayer Perceptron (MLP) is applied for the classification process. Experimental results on the Real-World Masked Face Dataset show high recognition performance compared to other state-of-the-art methods.
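The pipeline described here is: discard the masked lower face, run a pretrained CNN over the remaining eye/forehead region, and classify the resulting deep features. The sketch below shows the feature-extraction step with torchvision's pretrained ResNet-50; the fixed "keep the top 45% of the image" crop stands in for the paper's mask-removal step and is an illustrative assumption, as is the omission of the bag-of-features quantisation.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained ResNet-50 with its classification layer removed -> 2048-d features.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ConvertImageDtype(torch.float32),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def unmasked_region_features(face_uint8, keep_ratio=0.45):
    """Keep only the top part of the face (eyes/forehead), then extract
    deep features with the pretrained backbone."""
    h = face_uint8.shape[1]
    upper = face_uint8[:, : int(h * keep_ratio), :]   # (C, h', W) crop
    batch = preprocess(upper).unsqueeze(0)             # (1, 3, 224, 224)
    return resnet(batch).squeeze(0)                    # (2048,) feature vector

face = torch.randint(0, 256, (3, 160, 160), dtype=torch.uint8)
print(unmasked_region_features(face).shape)
```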

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, a large in-the-wild high-resolution audio-visual dataset is built and a novel flow-guided talking face generation framework is proposed to synthesize high-definition videos.
Abstract: One-shot talking face generation should synthesize high visual quality facial videos with reasonable animations of expression and head pose, using only arbitrary driving audio and an arbitrary single face image as the source. Current works fail to generate realistic-looking videos over 256×256 resolution due to the lack of an appropriate high-resolution audio-visual dataset, and the limitation of sparse facial landmarks, which provide poor expression details. To synthesize high-definition videos, we build a large in-the-wild high-resolution audio-visual dataset and propose a novel flow-guided talking face generation framework. The new dataset is collected from YouTube and consists of about 16 hours of 720P or 1080P videos. We leverage the facial 3D morphable model (3DMM) to split the framework into two cascaded modules instead of learning a direct mapping from audio to video. In the first module, we propose a novel animation generator to produce the movements of the mouth, eyebrows, and head pose simultaneously. In the second module, we transform the animation into dense flow to provide more expression details and carefully design a novel flow-guided video generator to synthesize videos. Our method is able to produce high-definition videos and outperforms state-of-the-art works in objective and subjective comparisons.

Proceedings ArticleDOI
Jiahui She, Yibo Hu, Hailin Shi, Jun Wang, Qiu Shen, Tao Mei
01 Apr 2021
TL;DR: DMUE as mentioned in this paper proposes an auxiliary multi-branch learning framework to better mine and describe the latent distribution in the label space, and the pairwise relationships of semantic features between instances are fully exploited to estimate the ambiguity extent in the instance space.
Abstract: Due to the subjective annotation and the inherent inter-class similarity of facial expressions, one of the key challenges in Facial Expression Recognition (FER) is annotation ambiguity. In this paper, we propose a solution, named DMUE, to address the problem of annotation ambiguity from two perspectives: latent Distribution Mining and pairwise Uncertainty Estimation. For the former, an auxiliary multi-branch learning framework is introduced to better mine and describe the latent distribution in the label space. For the latter, the pairwise relationships of semantic features between instances are fully exploited to estimate the ambiguity extent in the instance space. The proposed method is independent of the backbone architecture and brings no extra burden for inference. The experiments are conducted on popular real-world benchmarks and synthetic noisy datasets. In both settings, the proposed DMUE stably achieves leading performance.

Journal ArticleDOI
TL;DR: In this paper, the authors present a survey of micro-expression analysis in a cascaded structure, including neuropsychological basis, datasets, features, detection/spotting algorithms, recognition algorithms, applications, and evaluation of the state of the art.
Abstract: Different from conventional facial expressions, a micro-expression is an involuntary and transient facial expression, which can reveal a genuine emotion that people attempt to hide. The detection and recognition of micro-expressions are difficult and rely heavily on expert experience, since micro-expressions are transient and of low intensity. Due to its intrinsic particularity and complexity, micro-expression analysis is attractive but challenging, and has recently become an active area of research. Although there are many developments in this area, a comprehensive survey that can help researchers systematically review them is still lacking. In this survey paper, we highlight the key differences between macro- and micro-expressions, and use these differences to guide the research survey of micro-expression analysis in a cascaded structure, including neuropsychological basis, datasets, features, detection/spotting algorithms, recognition algorithms, applications, and evaluation of the state of the art. In each aspect, basic techniques, advanced developments, and major challenges are addressed and discussed. Furthermore, considering the limitations of existing micro-expression datasets, we present and release a new dataset called MMEW that has more video samples and more labeled emotion types, and perform a unified comparison of representative recognition methods on MMEW. Finally, some potential research directions are explored and outlined.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a new method for masked face recognition by integrating a cropping-based approach with the Convolutional Block Attention Module (CBAM), where the optimal cropping is explored for each case, while the CBAM module is adopted to focus on the regions around the eyes.
Abstract: The global epidemic of COVID-19 makes people realize that wearing a mask is one of the most effective ways to protect ourselves from virus infections, which poses serious challenges for existing face recognition systems. To tackle these difficulties, a new method for masked face recognition is proposed by integrating a cropping-based approach with the Convolutional Block Attention Module (CBAM). The optimal cropping is explored for each case, while the CBAM module is adopted to focus on the regions around the eyes. Two special application scenarios, using faces without masks for training to recognize masked faces, and using masked faces for training to recognize faces without masks, have also been studied. Comprehensive experiments on the SMFRD, CASIA-WebFace, AR, and Extended Yale B datasets show that the proposed approach can significantly improve the performance of masked face recognition compared with other state-of-the-art approaches.
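CBAM itself is a small, well-documented module: channel attention computed from pooled descriptors, followed by spatial attention computed from channel-pooled maps, each applied multiplicatively. A compact PyTorch sketch of the module (independent of the paper's cropping strategy) is given below.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention, each applied multiplicatively to the feature map."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        n, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(2, 64, 28, 28)
print(CBAM(64)(feat).shape)    # torch.Size([2, 64, 28, 28])
```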

Proceedings ArticleDOI
Yuhao Zhu, Qi Li, Jian Wang, Chengzhong Xu, Zhenan Sun
11 May 2021
TL;DR: Zhu et al. as mentioned in this paper proposed the first megapixel-level method for one-shot face swapping (MegaFS), which organizes face representation hierarchically with the proposed Hierarchical Representation Face Encoder (HieRFE) in an extended latent space to maintain more facial details, rather than the compressed representation used in previous face swapping methods.
Abstract: Face swapping has both positive applications such as entertainment, human-computer interaction, etc., and negative applications such as DeepFake threats to politics, economics, etc. Nevertheless, it is necessary to understand the scheme of advanced methods for high-quality face swapping and generate enough and representative face swapping images to train DeepFake detection algorithms. This paper proposes the first Megapixel level method for one shot Face Swapping (or MegaFS for short). Firstly, MegaFS organizes face representation hierarchically by the proposed Hierarchical Representation Face Encoder (HieRFE) in an extended latent space to maintain more facial details, rather than compressed representation in previous face swapping methods. Secondly, a carefully designed Face Transfer Module (FTM) is proposed to transfer the identity from a source image to the target by a non-linear trajectory without explicit feature disentanglement. Finally, the swapped faces can be synthesized by StyleGAN2 with the benefits of its training stability and powerful generative capability. Each part of MegaFS can be trained separately so the requirement of our model for GPU memory can be satisfied for megapixel face swapping. In summary, complete face representation, stable training, and limited memory usage are the three novel contributions to the success of our method. Extensive experiments demonstrate the superiority of MegaFS and the first megapixel level face swapping database is released for research on DeepFake detection and face image editing in the public domain.

Proceedings ArticleDOI
Xintao Wang, Yu Li, Honglun Zhang, Ying Shan
20 Jun 2021
TL;DR: GFP-GAN as discussed by the authors incorporates a generative facial prior into the face restoration process via spatial feature transform layers, which allows the method to achieve a good balance of realness and fidelity.
Abstract: Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.
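The spatial feature transform (SFT) layers mentioned here modulate decoder features with a per-pixel scale and shift predicted from condition features. A minimal PyTorch sketch of such a layer follows; the channel sizes and the two-conv condition branch are illustrative assumptions rather than GFP-GAN's exact configuration.

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: predict per-pixel (scale, shift) from a
    condition feature map and apply them to the input features."""

    def __init__(self, feat_channels, cond_channels):
        super().__init__()
        self.scale = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1))
        self.shift = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1))

    def forward(self, feat, cond):
        # Affine modulation of the restoration features by the prior features.
        return feat * (1 + self.scale(cond)) + self.shift(cond)

# Toy usage: modulate 64-channel decoder features with 32-channel prior features.
sft = SFTLayer(feat_channels=64, cond_channels=32)
feat = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 32, 32, 32)
print(sft(feat, cond).shape)    # torch.Size([1, 64, 32, 32])
```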

Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, the authors propose a Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition, which consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN).
Abstract: In this paper, we propose a novel Feature Decomposition and Reconstruction Learning (FDRL) method for effective facial expression recognition. We view the expression information as the combination of the shared information (expression similarities) across different expressions and the unique information (expression-specific variations) for each expression. More specifically, FDRL mainly consists of two crucial networks: a Feature Decomposition Network (FDN) and a Feature Reconstruction Network (FRN). In particular, FDN first decomposes the basic features extracted from a backbone network into a set of facial action-aware latent features to model expression similarities. Then, FRN captures the intra-feature and inter-feature relationships for latent features to characterize expression-specific variations, and reconstructs the expression feature. To this end, two modules including an intra-feature relation modeling module and an inter-feature relation modeling module are developed in FRN. Experimental results on both the in-the-lab databases (including CK+, MMI, and Oulu-CASIA) and the in-the-wild databases (including RAF-DB and SFEW) show that the proposed FDRL method consistently achieves higher recognition accuracy than several state-of-the-art methods. This clearly highlights the benefit of feature decomposition and reconstruction for classifying expressions.

Journal ArticleDOI
TL;DR: This work proposes an unsupervised domain adaptation with disentangled representation (DR-UDA) approach to improve the generalization capability of PAD to new scenarios, and shows promising generalization capability on several public-domain face PAD databases.
Abstract: Face presentation attack detection (PAD) is essential for securing the widely used face recognition systems. Most of the existing PAD methods do not generalize well to unseen scenarios because labeled training data of the new domain is usually not available. In light of this, we propose an unsupervised domain adaptation with disentangled representation (DR-UDA) approach to improve the generalization capability of PAD into new scenarios. DR-UDA consists of three modules, i.e., ML-Net, UDA-Net and DR-Net. ML-Net aims to learn a discriminative feature representation using the labeled source domain face images via metric learning. UDA-Net performs unsupervised adversarial domain adaptation in order to optimize the source domain and target domain encoders jointly, and obtain a common feature space shared by both domains. As a result, the source domain PAD model can be effectively transferred to the unlabeled target domain for PAD. DR-Net further disentangles the features irrelevant to specific domains by reconstructing the source and target domain face images from the common feature space. Therefore, DR-UDA can learn a disentangled representation space which is generative for face images in both domains and discriminative for live vs. spoof classification. The proposed approach shows promising generalization capability in several public-domain face PAD databases.

Journal ArticleDOI
TL;DR: In this article, a deep learning-based scheme is proposed for identifying the facial expression of a person, which consists of two parts: the former one finds out local features from face images using a local gravitational force descriptor, while, in the latter part, the descriptor is fed into a novel deep convolution neural network (DCNN) model.
Abstract: An image is worth a thousand words; hence, a face image illustrates extensive details about the specification, gender, age, and emotional states of mind. Facial expressions play an important role in community-based interactions and are often used in the behavioral analysis of emotions. Recognition of automatic facial expressions from a facial image is a challenging task in the computer vision community and admits a large set of applications, such as driver safety, human–computer interactions, health care, behavioral science, video conferencing, cognitive science, and others. In this work, a deep-learning-based scheme is proposed for identifying the facial expression of a person. The proposed method consists of two parts. The former one finds out local features from face images using a local gravitational force descriptor, while, in the latter part, the descriptor is fed into a novel deep convolution neural network (DCNN) model. The proposed DCNN has two branches. The first branch explores geometric features, such as edges, curves, and lines, whereas holistic features are extracted by the second branch. Finally, the score-level fusion technique is adopted to compute the final classification score. The proposed method along with 25 state-of-the-art methods is implemented on five benchmark available databases, namely, Facial Expression Recognition 2013, Japanese Female Facial Expressions, Extended Cohn-Kanade, Karolinska Directed Emotional Faces, and Real-world Affective Faces. The databases consist of seven basic emotions: neutral, happiness, anger, sadness, fear, disgust, and surprise. The proposed method is compared with existing approaches using four evaluation metrics, namely, accuracy, precision, recall, and f1-score. The obtained results demonstrate that the proposed method outperforms all state-of-the-art methods on all the databases.

Journal ArticleDOI
14 Apr 2021
TL;DR: The proposed MIPGAN is derived from StyleGAN with a newly formulated loss function exploiting perceptual quality and an identity factor to generate high-quality, high-resolution morphed facial images with minimal artefacts.
Abstract: Face morphing attacks aim to circumvent Face Recognition Systems (FRS) by employing face images derived from multiple data subjects (e.g., accomplices and malicious actors). Morphed images can be verified against contributing data subjects with a reasonable success rate, given that they have a high degree of facial resemblance. The success of morphing attacks is directly dependent on the quality of the generated morph images. We present a new approach for generating strong attacks, extending our earlier framework for generating face morphs, using an Identity Prior Driven Generative Adversarial Network, which we refer to as MIPGAN (Morphing through Identity Prior driven GAN). The proposed MIPGAN is derived from StyleGAN with a newly formulated loss function exploiting perceptual quality and an identity factor to generate high-quality morphed facial images with minimal artefacts and high resolution. We demonstrate the proposed approach's applicability for generating strong morphing attacks by evaluating the vulnerability of both commercial and deep-learning-based Face Recognition Systems (FRS) and report the success rate of the attacks. Extensive experiments are carried out to assess the FRS's vulnerability against the proposed morph generation technique on three types of data: digital images, re-digitized (printed and scanned) images, and compressed images after re-digitization, from the newly generated MIPGAN Face Morph Dataset. The obtained results demonstrate that the proposed approach of morph generation poses a high threat to FRS.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a global multi-scale and local attention network (MA-Net) for facial expression recognition in the wild, which consists of three main components: a feature pre-extractor, a multi-scale module, and a local attention module.
Abstract: Facial expression recognition (FER) in the wild has received broad attention, and occlusion and pose variation are two of its key issues. This paper proposes a global multi-scale and local attention network (MA-Net) for FER in the wild. Specifically, the proposed network consists of three main components: a feature pre-extractor, a multi-scale module, and a local attention module. The feature pre-extractor is utilized to pre-extract middle-level features, the multi-scale module fuses features with different receptive fields, which reduces the susceptibility of deeper convolutions to occlusion and pose variation, while the local attention module guides the network to focus on local salient features, which relieves the interference of occlusion and non-frontal poses on FER in the wild. Extensive experiments demonstrate that the proposed MA-Net achieves state-of-the-art results on several in-the-wild FER benchmarks: CAER-S, AffectNet-7, AffectNet-8, RAF-DB, and SFEW, with accuracies of 88.42%, 64.53%, 60.29%, 88.40%, and 59.40%, respectively. The codes and training logs are publicly available at https://github.com/zengqunzhao/MA-Net .

Journal ArticleDOI
TL;DR: SPARNet as mentioned in this paper introduces a spatial attention mechanism to the vanilla residual blocks to adaptively bootstrap features related to the key face structures and pay less attention to those less feature-rich regions.
Abstract: General image super-resolution techniques have difficulties in recovering detailed face structures when applied to low-resolution face images. Recent deep-learning-based methods tailored for face images have achieved improved performance by being jointly trained with additional tasks such as face parsing and landmark prediction. However, multi-task learning requires extra manually labeled data. Besides, most of the existing works can only generate relatively low-resolution face images (e.g., 128×128), and their applications are therefore limited. In this paper, we introduce a novel SPatial Attention Residual Network (SPARNet) built on our newly proposed Face Attention Units (FAUs) for face super-resolution. Specifically, we introduce a spatial attention mechanism to the vanilla residual blocks. This enables the convolutional layers to adaptively bootstrap features related to the key face structures and pay less attention to those less feature-rich regions. This makes the training more effective and efficient, as the key face structures only account for a very small portion of the face image. Visualization of the attention maps shows that our spatial attention network can capture the key face structures well even for very low resolution faces (e.g., 16×16). Quantitative comparisons on various kinds of metrics (including PSNR, SSIM, identity similarity, and landmark detection) demonstrate the superiority of our method over current state-of-the-art approaches. We further extend SPARNet with multi-scale discriminators, named SPARNetHD, to produce high-resolution results (i.e., 512×512). We show that SPARNetHD trained with synthetic data can not only produce high-quality and high-resolution outputs for synthetically degraded face images, but also shows good generalization ability to real-world low-quality face images. Codes are available at https://github.com/chaofengc/Face-SPARNet .
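The Face Attention Unit described here augments a residual block with a spatial attention branch so that feature-rich face structures dominate the update. Below is a simplified PyTorch sketch of that pattern; it follows the stated idea (residual block plus per-pixel attention on the residual) rather than the exact FAU design in the released code.

```python
import torch
import torch.nn as nn

class SpatialAttentionResBlock(nn.Module):
    """Residual block whose residual branch is reweighted by a per-pixel
    attention map (a simplified Face Attention Unit)."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.attention = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 3, padding=1),
            nn.Sigmoid())

    def forward(self, x):
        res = self.body(x)
        attn = self.attention(res)     # (N, 1, H, W) map over face structures
        return x + res * attn          # attended residual connection

block = SpatialAttentionResBlock(64)
print(block(torch.randn(1, 64, 16, 16)).shape)   # torch.Size([1, 64, 16, 16])
```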

Journal ArticleDOI
01 Jan 2021
TL;DR: In this paper, the authors present the possible underlying factors (data-driven and scenario modeling) and methodological considerations for assessing race bias in face recognition algorithms, conclude that race bias needs to be measured for individual applications, and provide a checklist for measuring this bias.
Abstract: Previous generations of face recognition algorithms differ in accuracy for images of different races (race bias). Here, we present the possible underlying factors (data-driven and scenario modeling) and methodological considerations for assessing race bias in algorithms. We discuss data-driven factors (e.g., image quality, image population statistics, and algorithm architecture), and scenario modeling factors that consider the role of the “user” of the algorithm (e.g., threshold decisions and demographic constraints). To illustrate how these issues apply, we present data from four face recognition algorithms (a previous-generation algorithm and three deep convolutional neural networks, DCNNs) for East Asian and Caucasian faces. First, dataset difficulty affected both overall recognition accuracy and race bias, such that race bias increased with item difficulty. Second, for all four algorithms, the degree of bias varied depending on the identification decision threshold. To achieve equal false accept rates (FARs), East Asian faces required higher identification thresholds than Caucasian faces, for all algorithms. Third, demographic constraints on the formulation of the distributions used in the test impacted estimates of algorithm accuracy. We conclude that race bias needs to be measured for individual applications, and we provide a checklist for measuring this bias in face recognition algorithms.
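The threshold analysis described here compares, for each demographic group, the verification threshold needed to reach a fixed false accept rate on that group's impostor (different-identity) score distribution. The NumPy sketch below shows that computation; the impostor scores and group means are synthetic placeholders, used only to illustrate the per-group threshold calculation.

```python
import numpy as np

def threshold_for_far(impostor_scores, target_far=1e-3):
    """Smallest similarity threshold at which the false accept rate on the
    impostor distribution does not exceed `target_far`."""
    scores = np.sort(impostor_scores)
    k = int(np.ceil(len(scores) * (1.0 - target_far)))
    return scores[min(k, len(scores) - 1)]

rng = np.random.default_rng(0)
# Synthetic impostor similarity scores for two demographic groups.
group_a = rng.normal(0.10, 0.05, 100_000)
group_b = rng.normal(0.15, 0.05, 100_000)   # higher impostor similarity

for name, scores in [("group A", group_a), ("group B", group_b)]:
    t = threshold_for_far(scores, target_far=1e-3)
    far = np.mean(scores >= t)
    print(f"{name}: threshold for FAR<=0.1% is {t:.3f} (achieved FAR {far:.4f})")
```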

Journal ArticleDOI
TL;DR: An identity and emotion joint learning approach with deep convolutional neural networks (CNNs) is proposed to enhance the performance of facial expression recognition (FER) tasks; it outperforms the residual network baseline as well as many other state-of-the-art methods.
Abstract: Different subjects may express a specific expression in different ways due to inter-subject variabilities. In this work, besides training deep-learned facial expression feature (emotional feature), we also consider the influence of latent face identity feature such as the shape or appearance of face. We propose an identity and emotion joint learning approach with deep convolutional neural networks (CNNs) to enhance the performance of facial expression recognition (FER) tasks. First, we learn the emotion and identity features separately using two different CNNs with their corresponding training data. Second, we concatenate these two features together as a deep-learned Tandem Facial Expression (TFE) Feature and feed it to the subsequent fully connected layers to form a new model. Finally, we perform joint learning on the newly merged network using only the facial expression training data. Experimental results show that our proposed approach achieves 99.31 and 84.29 percent accuracy on the CK+ and the FER+ database, respectively, which outperforms the residual network baseline as well as many other state-of-the-art methods.
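The Tandem Facial Expression (TFE) feature described here is the concatenation of the identity network's embedding and the emotion network's embedding, followed by new fully connected layers fine-tuned on expression data only. A minimal PyTorch sketch of that fusion head follows; the embedding dimensions and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TandemExpressionHead(nn.Module):
    """Concatenate identity and emotion embeddings (the TFE feature) and
    classify expressions with fully connected layers."""

    def __init__(self, id_dim=512, emo_dim=512, num_expressions=7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(id_dim + emo_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_expressions))

    def forward(self, id_feat, emo_feat):
        tfe = torch.cat([id_feat, emo_feat], dim=1)   # tandem feature
        return self.classifier(tfe)

head = TandemExpressionHead()
logits = head(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)    # torch.Size([4, 7])
```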

Journal ArticleDOI
TL;DR: In this paper, the authors propose a loss function named sigmoid-constrained hypersphere loss (SFace), which can make a better balance between decreasing the intra-class distances for clean examples and preventing overfitting to label noise.
Abstract: Deep face recognition has achieved great success due to large-scale training databases and rapidly developing loss functions. Existing algorithms are devoted to realizing an ideal objective: minimizing the intra-class distance and maximizing the inter-class distance. However, they may neglect that there are also low-quality training images which should not be optimized in this strict way. Considering the imperfection of training databases, we propose that intra-class and inter-class objectives can be optimized in a moderate way to mitigate the overfitting problem, and further propose a novel loss function, named sigmoid-constrained hypersphere loss (SFace). Specifically, SFace imposes intra-class and inter-class constraints on a hypersphere manifold, which are controlled by two sigmoid gradient re-scale functions, respectively. The sigmoid curves precisely re-scale the intra-class and inter-class gradients so that training samples can be optimized to an appropriate degree. Therefore, SFace can make a better balance between decreasing the intra-class distances for clean examples and preventing overfitting to label noise, and contributes to more robust deep face recognition models. Extensive experiments with models trained on the CASIA-WebFace, VGGFace2, and MS-Celeb-1M databases, and evaluated on several face recognition benchmarks, such as the LFW, MegaFace, and IJB-C databases, have demonstrated the superiority of SFace.

Proceedings ArticleDOI
16 Sep 2021
TL;DR: In this article, the multi-task learning of lightweight convolutional neural networks is studied for face identification and classification of facial attributes (age, gender, ethnicity) trained on cropped faces without margins.
Abstract: In this paper, the multi-task learning of lightweight convolutional neural networks is studied for face identification and classification of facial attributes (age, gender, ethnicity) trained on cropped faces without margins. The necessity to fine-tune these networks to predict facial expressions is highlighted. Several models are presented based on lightweight architectures, such as MobileNet, EfficientNet and RexNet. It was experimentally demonstrated that they lead to near state-of-the-art results in age, gender and race recognition on the UTKFace dataset and emotion classification on the AffectNet dataset. Moreover, it is shown that the usage of the trained models as feature extractors of facial regions in video frames leads to 4.5% higher accuracy than the previously known state-of-the-art single models for the AFEW and the VGAF datasets from the EmotiW challenges.

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, the authors combine the ubiquitous Deep Residual Network and a Unet-like architecture to produce a Residual Masking Network, which holds state-of-the-art accuracy on the well-known FER2013 and private VEMO datasets.
Abstract: Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improving FER tasks, this paper focuses on deep architectures with an attention mechanism. We propose a novel Masking Idea to boost the performance of CNNs on the facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and a Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets.
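The masking idea described here can be sketched compactly: a segmentation-style branch predicts a soft mask from a residual block's features, and the mask gates those features before they continue through the network. The PyTorch sketch below illustrates that mechanism; the single-conv "segmentation" branch is a deliberate simplification of the paper's Unet-like masking blocks.

```python
import torch
import torch.nn as nn

class MaskingBlock(nn.Module):
    """Refine residual features with a predicted soft mask so the network
    attends to expression-relevant regions (simplified masking idea)."""

    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))
        # Stand-in for the Unet-like segmentation branch of the paper.
        self.mask_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid())

    def forward(self, x):
        feat = self.residual(x)
        mask = self.mask_branch(feat)      # soft mask with values in (0, 1)
        return x + feat * mask             # masked residual connection

block = MaskingBlock(32)
print(block(torch.randn(1, 32, 48, 48)).shape)   # torch.Size([1, 32, 48, 48])
```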