ArcFace: Additive Angular Margin Loss for Deep Face Recognition
15 Jun 2019 · CVPR 2019, pp. 4690–4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
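The additive angular margin described in the abstract is compact enough to sketch: L2-normalise the features and the class-centre weights, add the margin m to the angle between each feature and its ground-truth class centre, then rescale by s before the usual softmax cross-entropy. The NumPy sketch below is an illustrative reconstruction from the abstract, not the authors' released code; the defaults s=64 and m=0.5 are the values reported in the paper.

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin sketch: add m to the target-class angle,
    leave all other class logits untouched, then rescale by s."""
    # L2-normalise features (rows) and class-centre weights (columns)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(f @ w, -1.0, 1.0)      # (batch, classes)
    theta = np.arccos(cos_theta)
    # additive angular margin on the ground-truth class only
    margin = np.zeros_like(cos_theta)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)

def softmax_xent(logits, labels):
    """Standard softmax cross-entropy, averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()
```

Because the margin shrinks only the target-class logit, training must push features closer (in angle) to their class centre to recover the same loss, which is what produces the intra-class compactness and inter-class separation the paper reports.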
Citations
TL;DR: In this article, a Semi-supervised Performance Evaluation for Face Recognition (SPE-FR) method is proposed, based on parametric Bayesian modeling of the face embedding similarity scores.
Abstract: We introduce Semi-supervised Performance Evaluation for Face Recognition (SPE-FR). SPE-FR is a statistical method for evaluating the performance and algorithmic bias of face verification systems when identity labels are unavailable or incomplete. The method is based on parametric Bayesian modeling of the face embedding similarity scores. SPE-FR produces point estimates, performance curves, and confidence bands that reflect uncertainty in the estimation procedure. Focusing on the unsupervised setting wherein no identity labels are available, we validate our method through experiments on a wide range of face embedding models and two publicly available evaluation datasets. Experiments show that SPE-FR can accurately assess performance on data with no identity labels, and confidently reveal demographic biases in system performance.
Keywords: Algorithmic bias, Semi-supervised evaluation, Face verification, Bayesian inference
4 citations
29 May 2019
TL;DR: This study asks 26 participants to provide subjective quality scores that represent the ease of recognizing the face on the images from a smartphone-based face image dataset and observes that the subjective scores outperform the implemented objective scores while having a low correlation with them.
Abstract: The performance of any face recognition system gets affected by the quality of the probe and the reference images. Rejecting or recapturing images with low-quality can improve the overall performance of the biometric system. There are many statistical as well as learning-based methods that provide quality scores given an image for the task of face recognition. In this study, we take a different approach by asking 26 participants to provide subjective quality scores that represent the ease of recognizing the face on the images from a smartphone based face image dataset. These scores are then compared to measures implemented from ISO/IEC TR 29794-5. We observe that the subjective scores outperform the implemented objective scores while having a low correlation with them. Furthermore, we analyze the effect of pose, illumination, and distance on face recognition similarity scores as well as the generated mean opinion scores.
4 citations
Cites methods from "ArcFace: Additive Angular Margin Lo..."
"...To calculate the biometric scores, the ArcFace [4] deep learning method is used to extract embeddings representing each face image..."
TL;DR: The purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions.
Abstract: Recognizing Families In the Wild (RFIW): an annual large-scale, multi-track automatic kinship recognition evaluation that supports various visual kin-based problems on scales much higher than ever before. Organized in conjunction with the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG) as a Challenge, RFIW provides a platform for publishing original work and the gathering of experts for a discussion of the next steps. This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the evaluation protocols, which include the practical motivation, technical background, data splits, metrics, and benchmark results. Furthermore, top submissions (i.e., leader-board stats) are listed and reviewed as a high-level analysis on the state of the problem. In the end, the purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions.
4 citations
Cites methods from "ArcFace: Additive Angular Margin Lo..."
"...with FaceNet (Inception-ResNet-v1) and with VGG-Face (ResNet-50) as the pre-trained models. FaceNet uses Triplet Loss as the main loss function in the training phase. The authors also implement ArcFace [2] - a family of loss functions based on the geodesic distance between feature vectors which aim to discriminate the latent representation of deep NNs..."
TL;DR: In this article, a hierarchical pseudo-3D convolutional neural network based on a kernel attention mechanism with a new global context block is proposed to accurately predict Alzheimer's disease even when the input features are complex.
4 citations
TL;DR: This article used representational similarity analysis to investigate how well the representations learned by high-performing deep convolutional neural networks (DCNNs) match human brain representations across the entire distributed face processing system and found that the correspondence between intermediate layers and neural representations of naturalistic human face processing is weak at best, and diverges even further in the later fully connected layers.
Abstract: Deep convolutional neural networks (DCNNs) trained for face identification can rival and even exceed human-level performance. The relationships between internal representations learned by DCNNs and those of the primate face processing system are not well understood, especially in naturalistic settings. We developed the largest naturalistic dynamic face stimulus set in human neuroimaging research (700+ naturalistic video clips of unfamiliar faces) and used representational similarity analysis to investigate how well the representations learned by high-performing DCNNs match human brain representations across the entire distributed face processing system. DCNN representational geometries were strikingly consistent across diverse architectures and captured meaningful variance among faces. Similarly, representational geometries throughout the human face network were highly consistent across subjects. Nonetheless, correlations between DCNN and neural representations were very weak overall--DCNNs captured 3% of variance in the neural representational geometries at best. Intermediate DCNN layers better matched visual and face-selective cortices than the final fully-connected layers. Behavioral ratings of face similarity were highly correlated with intermediate layers of DCNNs, but also failed to capture representational geometry in the human brain. Our results suggest that the correspondence between intermediate DCNN layers and neural representations of naturalistic human face processing is weak at best, and diverges even further in the later fully-connected layers. This poor correspondence can be attributed, at least in part, to the dynamic and cognitive information that plays an essential role in human face processing but is not modeled by DCNNs. These mismatches indicate that current DCNNs have limited validity as in silico models of dynamic, naturalistic face processing in humans.
4 citations
References
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won 1st place in the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
123,388 citations
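The reformulation at the heart of this reference is small: a block learns a residual function F(x) and adds the input back through an identity shortcut, so the output is y = F(x) + x. The NumPy sketch below is an illustrative two-layer block, not the paper's full architecture (no convolutions, batch normalization, or projection shortcuts):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity-shortcut residual block sketch: the weight layers learn
    the residual F(x); the input is added back, so y = relu(F(x) + x)."""
    out = relu(x @ w1)   # first weight layer + nonlinearity
    out = out @ w2       # second weight layer: the residual branch F(x)
    return relu(out + x) # identity shortcut adds the input back
```

The shortcut is what makes very deep networks trainable: if the optimal mapping for a block is close to the identity, the layers only have to drive F(x) toward zero, which is easier than fitting an identity map from scratch.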
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
33,597 citations
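The mechanism this reference describes fits in a few lines: during training, each unit is zeroed with probability p; at test time the full network is used with appropriately scaled activations. The sketch below uses the "inverted" variant, which rescales surviving units by 1/(1-p) at training time instead of shrinking weights at test time as in the original paper; the two are equivalent in expectation.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout sketch: zero each unit with probability p during
    training and rescale survivors by 1/(1-p), so the expected activation
    matches test time, when the layer is simply the identity."""
    if not train or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # True = unit survives
    return x * mask / (1.0 - p)
```

Each training step thus samples one "thinned" network; using the unmodified layer at test time approximates averaging the predictions of all of them.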
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
30,843 citations
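The training-time transform this reference introduces can be sketched directly: standardise each feature over the mini-batch, then apply a learned scale (gamma) and shift (beta). The NumPy sketch below is illustrative only and omits the running mean/variance statistics that the real layer maintains for inference.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalisation sketch for a (batch, features)
    array: zero-mean/unit-variance per feature over the mini-batch,
    then a learned affine transform."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # standardised activations
    return gamma * x_hat + beta             # learned scale and shift
```

Because each layer's inputs are re-centred and re-scaled every mini-batch, gradients are less sensitive to the parameter scale of earlier layers, which is what permits the much higher learning rates the abstract describes.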
28 Oct 2017
TL;DR: This paper describes the automatic differentiation module of PyTorch, a library designed to enable rapid research on machine learning models; it differentiates purely imperative programs, with an emphasis on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
13,268 citations
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November 2015 and are available at www.tensorflow.org.
10,447 citations