Proceedings ArticleDOI

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

15 Jun 2019, pp. 4690-4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
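The additive angular margin described above can be sketched in a few lines of NumPy: features and class-centre weights are L2-normalised so their dot products are cosines, the margin m is added to the angle of each sample's ground-truth class, and the logits are rescaled by s before the usual softmax cross-entropy. This is a minimal illustrative sketch, not the authors' implementation; the default s=64 and m=0.5 are assumed hyper-parameter choices.

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits: cos(theta_j) for all classes,
    cos(theta_y + m) for each sample's ground-truth class, scaled by s."""
    # L2-normalise features and class-centre weights so logits are cosines
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = x @ W                                  # shape (batch, classes)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))   # angle to each class centre
    target = cos.copy()
    rows = np.arange(len(labels))
    # Add the angular margin only on the ground-truth class
    target[rows, labels] = np.cos(theta[rows, labels] + m)
    return s * target

def softmax_cross_entropy(logits, labels):
    """Standard softmax cross-entropy over the (margin-adjusted) logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()
```

Because cos(theta + m) < cos(theta) for angles in (0, pi - m), the margin lowers the target logit, so the loss for a given sample is at least as large as with plain softmax; this is what pushes same-class features closer together on the hypersphere.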


Citations
Proceedings ArticleDOI
Wenxuan Wang1, Yanwei Fu1, Xuelin Qian1, Yu-Gang Jiang1, Qi Tian2, Xiangyang Xue1 
14 Jun 2020
TL;DR: A unified Face Morphological Multi-branch Network (FMMu-Net) is proposed for makeup-invariant face verification, which can simultaneously synthesize many diverse makeup faces through a face morphology network (FM-Net) and effectively learn cosmetics-robust face representations using an attention-based multi-branch learning network (AttM-Net).
Abstract: It is challenging in learning a makeup-invariant face verification model, due to (1) insufficient makeup/non-makeup face training pairs, (2) the lack of diverse makeup faces, and (3) the significant appearance changes caused by cosmetics. To address these challenges, we propose a unified Face Morphological Multi-branch Network (FMMu-Net) for makeup-invariant face verification, which can simultaneously synthesize many diverse makeup faces through face morphology network (FM-Net) and effectively learn cosmetics-robust face representations using attention-based multi-branch learning network (AttM-Net). For challenges (1) and (2), FM-Net (two stacked auto-encoders) can synthesize realistic makeup face images by transferring specific regions of cosmetics via cycle consistent loss. For challenge (3), AttM-Net, consisting of one global and three local (task-driven on two eyes and mouth) branches, can effectively capture the complementary holistic and detailed information. Unlike DeepID2 which uses simple concatenation fusion, we introduce a heuristic method AttM-FM, attached to AttM-Net, to adaptively weight the features of different branches guided by the holistic information. We conduct extensive experiments on makeup face verification benchmarks (M-501, M-203, and FAM) and general face recognition datasets (LFW and IJB-A). Our framework FMMu-Net achieves state-of-the-art performances.

13 citations

Book ChapterDOI
23 Aug 2020
TL;DR: This paper proposes a semi-supervised learning approach in which a deep network provides both an embedding descriptor and a prediction, enforces a large margin between clusters, and self-labels unlabeled data by its distance to class centroids in the embedding space; it matches state-of-the-art results on standard semi-supervised benchmarks and noticeably improves face expression recognition accuracy.
Abstract: In this paper, as we aim to construct a semi-supervised learning algorithm, we exploit the characteristics of the Deep Convolutional Networks to provide, for an input image, both an embedding descriptor and a prediction. The unlabeled data is combined with the labeled one in order to provide synthetic data, which describes better the input space. The network is asked to provide a large margin between clusters, while new data is self-labeled by the distance to class centroids, in the embedding space. The method is tested on standard benchmarks for semi-supervised learning, where it matches state-of-the-art performance, and on the problem of face expression recognition, where it increases the accuracy by a noticeable margin.
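The self-labelling step described in the abstract (assigning each unlabeled sample the label of its nearest class centroid in the embedding space) can be sketched as follows; function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def self_label(embeddings, centroids):
    """Assign each unlabeled embedding the label of its nearest class
    centroid, measured by Euclidean distance in the embedding space."""
    # Squared Euclidean distance from every sample to every centroid
    d = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)   # index of the closest centroid per sample
```

With Euclidean distance this is exactly the assignment step of K-means, which is why the citing excerpt below notes that swapping in a non-Euclidean distance or a margin-based loss is a natural extension.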

13 citations


Cites methods from "ArcFace: Additive Angular Margin Lo..."

  • ...While the solution presented in the results section concentrates solely on the Euclidean distance, thus retrieving the classical K-means algorithm, one may extend the algorithm based on non-Euclidean distances [10] and other margin based losses [24,44,9] can be used....


Journal ArticleDOI
TL;DR: A Cycle Age-Adversarial Model (CAAM) is proposed for CAFR, which uses only the age labels for training, without relying on an independence hypothesis, and introduces a cycle optimization strategy to merge the advantages of two branch networks, which is a novel strategy to fuse multi-task features.
Abstract: Age variations bring a large challenge for face recognition tasks. Existing Cross-Age Face Recognition (CAFR) methods have two limitations. Firstly, many CAFR approaches require both age labels and identity labels for training. However, it is difficult to collect images under a large age span from each individual. Secondly, many works are based on the assumption that age and identity information are independent of each other, which may not hold under various conditions. In this paper, a Cycle Age-Adversarial Model (CAAM) is proposed for CAFR, which uses only the age labels for training, without relying on the independence hypothesis. CAAM includes two different branch networks. Firstly, the branch of the Age-robust Feature Extracting Model (AFEM) is designed to adaptively learn age-invariant features by an adversarial learning scheme, which includes an age discriminator network and a feature generator network. The age discriminator network is trained to discriminate the age information, and the generator extracts age-invariant features through adversarial learning with the discriminator. Secondly, a branch of the Identity Preserving Network (IPN) is proposed to keep identity information, which introduces an Unsupervised Identity Loss (UIL) to enlarge the inter-class distance and decrease the loss of identity information in the learning process. Finally, the features of the two branches are cyclically optimized through minimizing a Feature Consistency Loss (FCL), which integrates age invariance learning and identity discrimination learning into the final feature representation. Different from existing CAFR networks, our adversarial learning strategy for age-robust feature learning can be generalized to other attributes including pose and expression. Moreover, we introduce a cycle optimization strategy to merge the advantages of the two branch networks, which is a novel strategy to fuse multi-task features.
Extensive CAFR experiments performed on the benchmark MORPH Album2, CACD-VS and Cross Age LFW databases demonstrate the effectiveness and superiority of CAAM.

13 citations

Journal ArticleDOI
TL;DR: In this article, a coupled convolutional network architecture was proposed to leverage visible face data when training a model for thermal-only face landmark detection, which achieved a 65% - 95% improvement on the DEVCOM ARL Multi-modal Thermal Face Dataset and a 4% improvement over the baseline model.
Abstract: There has been increasing interest in face recognition in the thermal infrared spectrum. A critical step in this process is face landmark detection. However, landmark detection in the thermal spectrum presents a unique set of challenges compared to the visible spectrum: inherently lower spatial resolution due to longer wavelength, differences in phenomenology, and limited availability of labeled thermal face imagery for algorithm development and training. Thermal infrared imaging does have the advantage of being able to passively acquire facial heat signatures without the need for active or ambient illumination in low light and nighttime environments. In such scenarios, thermal imaging must operate by itself without corresponding/paired visible imagery. Mindful of this constraint, we propose visible-to-thermal parameter transfer learning using a coupled convolutional network architecture as a means to leverage visible face data when training a model for thermal-only face landmark detection. This differentiates our approach from models trained either solely on thermal images or models which require a fusion of visible and thermal images at test time. In this work, we implement and analyze four types of parameter transfer learning methods in the context of thermal face landmark detection: Siamese (shared) layers, Linear Layer Regularization (LLR), Linear Kernel Regularization (LKR), and Residual Parameter Transformations (RPT). These transfer learning approaches are compared against a baseline version of the network and an Active Appearance Model (AAM), both of which are trained only on thermal data. We achieve a 65% - 95% improvement on the DEVCOM ARL Multi-modal Thermal Face Dataset and a 4% improvement on the RWTH Aachen University Thermal Face Dataset over the baseline model. We show that LLR, LKR, and RPT all result in improved thermal face landmark detection performance compared to the baseline and AAM, demonstrating that transfer learning leveraging visible spectrum data improves thermal face landmarking.

13 citations

Proceedings ArticleDOI
29 Nov 2020
TL;DR: This study proposes a variant of BCE that enforces a margin in angular space and incorporates it in training the DeepPixBis model, and presents a method to incorporate such a loss for attentive pixel-wise supervision applicable in a fully convolutional setting.
Abstract: Face Anti Spoofing (FAS) systems are used to identify malicious spoofing attempts targeting face recognition systems using mediums such as video replay or printed papers. With increasing adoption of face recognition technology as a biometric authentication method, FAS techniques are gaining in importance. From a learning perspective, such systems pose a binary classification task. When implemented with Neural Network based solutions, it is common to use the binary cross entropy (BCE) function as the loss to optimize. In this study, we propose a variant of BCE that enforces a margin in angular space and incorporate it in training the DeepPixBis model [1]. In addition, we also present a method to incorporate such a loss for attentive pixel-wise supervision applicable in a fully convolutional setting. Our proposed approach achieves competitive scores in both intra- and inter-dataset testing on multiple benchmark datasets, consistently outperforming vanilla DeepPixBis. Interestingly, in the case of Protocol 4 of OULU-NPU, considered to be the hardest protocol, our proposed method achieves 5.22% ACER, which is only 0.22% higher than the current State of the Art without requiring any expensive Neural Architecture Search.
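One plausible reading of a "margin in angular space" added to BCE is sketched below: treat the binary output as a cosine similarity, shift the angle of bona-fide (positive) samples by m before re-cosining, then apply the usual sigmoid cross-entropy on the scaled logit. This is an assumption-laden sketch, not the paper's exact A-BCE formulation; the scale s and margin m values are illustrative.

```python
import numpy as np

def angular_margin_bce(cos_sim, labels, m=0.2, s=8.0):
    """Binary cross-entropy on scaled cosine logits, with an additive
    angular margin m applied only to the positive (label 1) class.
    NOTE: illustrative reconstruction, not the paper's exact loss."""
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    # Shifting the positive angle by m lowers its cosine, making the
    # positive class harder to satisfy (the margin)
    logits = s * np.where(labels == 1, np.cos(theta + m), np.cos(theta))
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -(labels * np.log(p + eps)
             + (1 - labels) * np.log(1 - p + eps)).mean()
```

As with ArcFace, the margin only makes the loss larger for positives, so minimising it forces bona-fide samples further from the decision boundary in angular terms.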

13 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...Following the work of [34] the value of m was set to 0....


  • ...Inspired by the reported better discriminatory ability and the success of angular margin based loss functions in facial recognition tasks [34], [32], [33], we add binary angular margin loss to both parts of equation 15 using equation 13....


  • ...However, as mentioned in [32], [33], [34], this loss function does not optimize properly to enforce higher intra-class similarity and inter-class dissimilarity....


  • ...As per authors of [34], if we introduce a hyper-parameter m then 6 becomes:...


  • ...From the literature [33], [32], [34] it’s very obvious that the discriminatory power of angular margin loss over its vanilla counterpart gives A-BCE an edge....


References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
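The reformulation in the abstract, learning a residual F(x) so the block computes F(x) + x, can be sketched with a toy two-layer block (a minimal sketch assuming matching dimensions, so the shortcut is a plain identity):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x, {W1, W2}) + x: the layers learn the residual F rather
    than an unreferenced mapping; x is added back via an identity shortcut."""
    h = np.maximum(0.0, x @ W1)   # first layer with ReLU activation
    return x + h @ W2             # residual output plus the shortcut
```

A useful consequence: if the weights are driven to zero, the block degenerates to the identity mapping, which is part of why very deep residual stacks remain easy to optimize.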

123,388 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
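The "thinned network" averaging trick described above is usually implemented as inverted dropout: units are zeroed at random during training and the survivors are rescaled, so that test time needs no change at all. A minimal sketch (names illustrative):

```python
import numpy as np

def dropout(x, p_drop=0.5, train=True, rng=None):
    """Inverted dropout: randomly zero units during training and rescale
    the survivors by 1/keep, so the expected activation matches test time,
    where the function is a no-op (the single unthinned network)."""
    if not train:
        return x
    rng = rng or np.random.default_rng()
    keep = 1.0 - p_drop
    mask = rng.random(x.shape) < keep   # Bernoulli(keep) mask per unit
    return x * mask / keep
```

Rescaling by 1/keep during training is what lets the unthinned test-time network approximate the average over the exponentially many thinned networks without any weight adjustment.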

33,597 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
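The per-mini-batch normalisation described above amounts to standardising each feature over the batch and then restoring representational power with a learned scale and shift. A minimal inference-style sketch (training would additionally track running statistics; gamma and beta are the learned parameters):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch (axis 0), then apply a
    learned scale (gamma) and shift (beta). eps guards against division
    by zero for near-constant features."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Because each layer now sees inputs with a stable per-batch distribution, much higher learning rates become usable, which is where the "14 times fewer training steps" figure in the abstract comes from.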

30,843 citations

28 Oct 2017
TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
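The idea of differentiating "purely imperative programs" can be illustrated with a toy tape-based reverse-mode autodiff over scalars. This is a pedagogical sketch in plain Python, emphatically not PyTorch's implementation: each operation records its parents and local derivatives, and `backward` propagates gradients back through the recorded graph.

```python
class Var:
    """Toy reverse-mode autodiff scalar: records parents and local
    gradients as operations execute imperatively, then backpropagates."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        """Accumulate the chain-rule contribution along every path."""
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)
```

For z = x*y + x with x=3 and y=4, the gradients accumulate to dz/dx = y + 1 = 5 and dz/dy = x = 3, exactly as the chain rule dictates, without any symbolic graph being declared up front.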

13,268 citations

Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

10,447 citations