Proceedings ArticleDOI

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

15 Jun 2019, pp. 4690-4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research has been to incorporate margins into well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, which include a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
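The additive angular margin described in the abstract can be sketched in a few lines of plain Python (an illustrative toy, not the authors' released code; the scale s = 64 and margin m = 0.5 are the defaults reported in the paper):

```python
import math

def arcface_logits(cosines, target, s=64.0, m=0.50):
    """Apply the additive angular margin to the target-class logit.

    cosines: cos(theta_j) between the embedding and each class centre
    target:  index of the ground-truth class
    s, m:    feature scale and angular margin (defaults from the paper)
    """
    out = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))  # angle to the true centre
            out.append(s * math.cos(theta + m))        # margin added in angle space
        else:
            out.append(s * c)                          # other classes unchanged
    return out

logits = arcface_logits([0.8, 0.3, -0.1], target=0)
```

Because the margin is added to the angle itself, the penalty corresponds exactly to an extra geodesic distance on the hypersphere, which is the geometric interpretation the abstract highlights.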


Citations
Proceedings ArticleDOI
31 Jan 2020
TL;DR: It is shown that the accuracy difference persists even if a state-of-the-art deep learning method is trained from scratch using training data explicitly balanced between male and female images and subjects.
Abstract: We present a comprehensive analysis of how and why face recognition accuracy differs between men and women. We show that accuracy is lower for women due to the combination of(1) the impostor distribution for women having a skew toward higher similarity scores, and (2) the genuine distribution for women having a skew toward lower similarity scores. We show that this phenomenon of the impostor and genuine distributions for women shifting closer towards each other is general across datasets of African-American, Caucasian, and Asian faces. We show that the distribution of facial expressions may differ between male/female, but that the accuracy difference persists for image subsets rated confidently as neutral expression. The accuracy difference also persists for image subsets rated as close to zero pitch angle. Even when removing images with forehead partially occluded by hair/hat, the same impostor/genuine accuracy difference persists. We show that the female genuine distribution improves when only female images without facial cosmetics are used, but that the female impostor distribution also degrades at the same time. Lastly, we show that the accuracy difference persists even if a state-of-the-art deep learning method is trained from scratch using training data explicitly balanced between male and female images and subjects.
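The paper's central quantity is the separation between the genuine and impostor similarity-score distributions; a standard way to summarize that separation is d-prime, sketched here on synthetic scores (illustrative numbers, not the paper's data):

```python
import statistics

def dprime(genuine, impostor):
    """Separation between genuine and impostor similarity-score distributions.

    Higher d-prime means less overlap and fewer recognition errors. The paper's
    observation is that for women both distributions shift toward each other,
    lowering this separation.
    """
    mu_g, mu_i = statistics.mean(genuine), statistics.mean(impostor)
    var_g, var_i = statistics.variance(genuine), statistics.variance(impostor)
    return (mu_g - mu_i) / ((var_g + var_i) / 2) ** 0.5

well_separated = dprime([0.80, 0.90, 0.85, 0.95], [0.10, 0.20, 0.15, 0.25])
shifted_closer = dprime([0.60, 0.70, 0.65, 0.75], [0.30, 0.40, 0.35, 0.45])
# shifted_closer < well_separated: the distributions moving toward each
# other is exactly the effect the paper reports for female face pairs.
```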

77 citations


Cites methods from "ArcFace: Additive Angular Margin Lo..."

  • ...The ArcFace matcher was trained using the MS1MV2 dataset, which is only around 27% female....


  • ...For this reason, and to save space, we present only ArcFace results....


  • ...The ArcFace model [14] used in this paper was trained with ResNet100 using the MS1MV2 dataset, which is a curated version of the MS1M dataset [15]....


  • ...We report results for one of the best state-of-the-art open-source deep learning face matchers, ArcFace [9]....


  • ...rate ResNet-50 [16] networks with combined margin loss (which combines CosFace [31], SphereFace [20] and ArcFace [9] margins) using subsets of the VGGFace2 dataset [6] and the MS1MV2 dataset, which we balanced to have exactly the same number of male and female images and subjects....


Journal ArticleDOI
TL;DR: The winning models far outperformed the previous effort at multi-label classification of protein localization patterns by ~20% and can be used as classifiers to annotate new images, feature extractors to measure pattern similarity or pretrained networks for a wide range of biological applications.
Abstract: Pinpointing subcellular protein localizations from microscopy images is easy to the trained eye, but challenging to automate. Based on the Human Protein Atlas image collection, we held a competition to identify deep learning solutions to solve this task. Challenges included training on highly imbalanced classes and predicting multiple labels per image. Over 3 months, 2,172 teams participated. Despite convergence on popular networks and training techniques, there was considerable variety among the solutions. Participants applied strategies for modifying neural networks and loss functions, augmenting data and using pretrained networks. The winning models far outperformed our previous effort at multi-label classification of protein localization patterns by ~20%. These models can be used as classifiers to annotate new images, feature extractors to measure pattern similarity or pretrained networks for a wide range of biological applications. The 2018 Human Protein Atlas Image Classification competition sought to improve automated classification of protein subcellular localizations from fluorescence images. The winning strategies involved innovative deep learning approaches for multi-label classification.
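The two core challenges the abstract names, multiple labels per image and highly imbalanced classes, are commonly handled with independent per-label sigmoid outputs and a class-weighted binary cross-entropy. A minimal plain-Python sketch (an illustrative assumption, not any winning team's actual loss):

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Per-label binary cross-entropy with positive-class weights.

    Each subcellular location gets its own sigmoid probability, so an image
    can carry several labels at once; rare classes (the 'highly imbalanced'
    challenge) receive a larger pos_weight so missing them costs more.
    Weights here are illustrative, not tuned values.
    """
    loss = 0.0
    for p, y, w in zip(probs, labels, pos_weight):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp for numerical safety
        loss += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(probs)
```

Upweighting a rare class makes a missed positive dominate the loss, which is one simple way to counter class imbalance during training.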

77 citations

Book ChapterDOI
08 Sep 2018
TL;DR: It is shown that unlabeled face data can be as effective as the labeled ones, and Consensus-Driven Propagation (CDP) is proposed to tackle this challenging problem with two modules, the “committee” and the ”mediator”, which select positive face pairs robustly by carefully aggregating multi-view information.
Abstract: Face recognition has witnessed great progress in recent years, mainly attributed to the high-capacity models designed and the abundant labeled data collected. However, it becomes more and more prohibitive to scale up the current million-level identity annotations. In this work, we show that unlabeled face data can be as effective as the labeled ones. Here, we consider a setting closely mimicking the real-world scenario, where the unlabeled data are collected from unconstrained environments and their identities are exclusive from the labeled ones. Our main insight is that although the class information is not available, we can still faithfully approximate these semantic relationships by constructing a relational graph in a bottom-up manner. We propose Consensus-Driven Propagation (CDP) to tackle this challenging problem with two modules, the “committee” and the “mediator”, which select positive face pairs robustly by carefully aggregating multi-view information. Extensive experiments validate the effectiveness of both modules to discard outliers and mine hard positives. With CDP, we achieve a compelling accuracy of 78.18% on the MegaFace identification challenge by using only 9% of the labels, compared to 61.78% when no unlabeled data are used and 78.52% when all labels are employed.
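The committee step of CDP can be caricatured in a few lines: several models score the same face pair, and only consensual pairs are proposed as positives. This is a drastically simplified sketch; the threshold and vote rule are assumptions, and the paper's "mediator" module, which aggregates richer multi-view statistics, is omitted:

```python
def committee_vote(pair_sims, threshold=0.5, min_agree=2):
    """Consensus-Driven Propagation, committee step (simplified sketch).

    pair_sims: similarity scores for one face pair from several 'committee'
    models. A pair is proposed as positive only when enough members agree,
    which is how disagreement discards outliers; the paper's mediator then
    makes the final decision from richer pairwise statistics.
    """
    votes = sum(1 for s in pair_sims if s >= threshold)
    return votes >= min_agree

# Three committee members: two agree the pair matches, one dissents.
accepted = committee_vote([0.9, 0.8, 0.2])
```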

77 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...Apart from softmax, we also equip CDP with an advanced loss function, ArcFace [7], the current top entry on MegaFace benchmark....


  • ...For parameters related to ArcFace, we set the margin m = 0.5 and adopt the output setting “E”, that is “BN-Dropout-FC-BN”....


  • ...Table 6: Comparisons of the gain brought by CDP with 2-folds unlabeled data between the previous baseline (Softmax) and the new baseline (ArcFace [7] with a cleaner training set)....


Posted Content
TL;DR: This paper provides a comprehensive and up-to-date literature review of popular face recognition methods including both traditional (geometry-based, holistic, feature-based and hybrid methods) and deep learning methods.
Abstract: Starting in the seventies, face recognition has become one of the most researched topics in computer vision and biometrics. Traditional methods based on hand-crafted features and traditional machine learning techniques have recently been superseded by deep neural networks trained with very large datasets. In this paper we provide a comprehensive and up-to-date literature review of popular face recognition methods including both traditional (geometry-based, holistic, feature-based and hybrid methods) and deep learning methods.

74 citations


Cites background from "ArcFace: Additive Angular Margin Lo..."

  • ...An alternative additive margin was proposed in [121] which keeps the advantages of [119], [120] but has a better geometric interpretation since the margin is added to the angle and not to the cosine....


  • ...A thorough study of different CNN architectures was carried out in [121]....


  • ...Multiplicative angular margin [116]: ‖x‖(cos mθ1 − cos θ2) = 0
    Additive cosine margin [119], [120]: s(cos θ1 − m − cos θ2) = 0
    Additive angular margin [121]: s(cos(θ1 + m) − cos θ2) = 0...


  • ...(ResNets) [114] have become the preferred choice for many object recognition tasks, including face recognition [115], [116], [117], [118], [119], [120], [121]....

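Plugging numbers into the three decision boundaries quoted above makes the comparison concrete. The helper below evaluates each boundary's left-hand side for a given pair of angles; a positive value means the sample is still classified correctly despite the margin (margin values are common defaults, not taken from the survey):

```python
import math

def boundary_gap(theta1, theta2, m_mult=4, m_cos=0.35, m_arc=0.5, s=64.0):
    """Left-hand side of the three margin-based decision boundaries.

    theta1/theta2: angles (radians) to the true and nearest wrong class
    centre. For comparability the feature norm is taken equal to the scale s
    in the multiplicative case; m_mult, m_cos, m_arc are typical defaults.
    """
    return {
        "multiplicative_angular": s * (math.cos(m_mult * theta1) - math.cos(theta2)),
        "additive_cosine": s * (math.cos(theta1) - m_cos - math.cos(theta2)),
        "additive_angular": s * (math.cos(theta1 + m_arc) - math.cos(theta2)),
    }

gaps = boundary_gap(0.2, 1.2)
# Every margin shrinks the gap relative to plain softmax, so the sample
# must sit closer to its class centre to stay correctly classified.
```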

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work proposes a novel training paradigm that employs the idea of weighting samples based on their probability of being clean and, without any prior knowledge of noise, can train high-performance CNN models with large-scale FR datasets.
Abstract: Benefiting from large-scale training datasets, deep Convolutional Neural Networks (CNNs) have achieved impressive results in face recognition (FR). However, the tremendous scale of these datasets inevitably leads to noisy data, which reduces the performance of the trained CNN models. Kicking out wrong labels from large-scale FR datasets is still very expensive, although some cleaning approaches have been proposed. According to an analysis of the whole process of training CNN models supervised by angular margin based loss (AM-Loss) functions, we find that the distribution of training samples implicitly reflects their probability of being clean. Thus, we propose a novel training paradigm that employs the idea of weighting samples based on this probability. Without any prior knowledge of noise, we can train high-performance CNN models with large-scale FR datasets. Experiments demonstrate the effectiveness of our training paradigm. The code is available at https://github.com/huangyangyu/NoiseFace.
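The paradigm the abstract describes, turning the training-time behaviour of each sample into a soft cleanliness weight, can be sketched as follows. Under an angular-margin loss, clean samples drift toward high cos θ with their label's class centre while mislabeled ones do not; the sigmoid re-weighting below is a simplified stand-in for the paper's distribution-based scheme, with pivot and sharpness as assumed values:

```python
import math

def clean_weights(target_cosines, pivot=0.4, sharpness=10.0):
    """Soft per-sample weights from target-class cosine similarity.

    target_cosines: cos(theta) between each sample's feature and its labeled
    class centre during training. Samples far below the pivot (likely noisy
    labels) get weights near 0; samples well above it get weights near 1.
    pivot and sharpness are illustrative, not the paper's parameters.
    """
    return [1.0 / (1.0 + math.exp(-sharpness * (c - pivot)))
            for c in target_cosines]

weights = clean_weights([0.9, 0.4, 0.1])  # confident, ambiguous, likely noisy
```

Multiplying each sample's loss by such a weight down-weights probable label noise without ever cleaning the dataset explicitly, which is the point of the paper's noise-tolerant paradigm.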

74 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...These observations can be explained in theory, and more experiments are further performed to confirm them: (1) we increase the noise rate from 40% to 60%; (2) ArcFace [5] is employed to supervise the CNNs; (3) we replace the ResNet-20 with a deeper ResNet-64 [24]; (4) another clean dataset, IMDB-Face [42]2, is chosen to replace WebFaceClean....


  • ...The scaling parameter s is better to be a properly large value as the hypersphere radius, according to the discussion in [45, 5]....


  • ...Very recently, some angular margin based loss (AM-Loss for short in the paper) functions [24, 5, 45, 43] are proposed and achieve the state-of-the-art performance....


  • ...Cleaning large-scale FR datasets with automatic or semi-automatic approaches [11, 49, 5] cannot really solve this problem....


  • ...In this paper, we propose a noise-tolerant paradigm to learn face features on a large-scale noisy dataset directly, different from other related approaches [49, 5] which aim to clean the noisy dataset firstly....


References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
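The residual reformulation is tiny when written out: each block outputs x + F(x), so learning the identity mapping only requires driving F toward zero (a toy sketch, with `transform` standing in for the conv-BN-ReLU stack):

```python
def residual_block(x, transform):
    """Identity-shortcut residual unit: the block learns F(x) = H(x) - x.

    If 'transform' outputs zeros, the block reduces exactly to the identity,
    which is why very deep stacks of these units remain easy to optimize.
    """
    return [xi + fi for xi, fi in zip(x, transform(x))]

# A zero residual function makes the block an exact identity mapping:
out = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))  # -> [1.0, 2.0]
```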

123,388 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
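The train-time sampling the abstract describes is often implemented as "inverted" dropout, which folds the test-time averaging into a single rescaling during training (a common variant, sketched below; the original paper instead shrinks the weights at test time):

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout over a list of activations.

    Units are zeroed with probability p; survivors are scaled by 1/(1-p) so
    the expected activation is unchanged, and the unthinned network can be
    used at test time with no further rescaling.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Each training step thus samples one "thinned" network, and inference approximates the average of this exponential ensemble with a single forward pass.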

33,597 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
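The per-activation transform is short enough to write out: normalize over the mini-batch, then apply the learnable scale and shift (training-mode statistics only; the running averages used at inference are omitted):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch Normalization of one feature across a mini-batch.

    Normalizes to zero mean and unit variance using batch statistics, then
    applies the learnable gamma (scale) and beta (shift) that restore the
    layer's representational power.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```

Keeping each layer's input distribution stable in this way is what lets training tolerate much higher learning rates and less careful initialization.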

30,843 citations

28 Oct 2017
TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
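Reverse-mode differentiation of an imperative program can be illustrated with a toy tape: each operation records its inputs and local derivatives as it executes, and backward() replays the recording in reverse (an illustrative toy in plain Python, not PyTorch's actual API):

```python
class Var:
    """Minimal tape-based reverse-mode autodiff variable (toy sketch)."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        # Record each input with its local derivative d(out)/d(input).
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # Accumulate the incoming gradient, then push it to the parents.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x = Var(3.0)
y = x * x + x   # y = x^2 + x, built by running ordinary imperative code
y.backward()    # dy/dx = 2x + 1 = 7.0 at x = 3
```

Because the tape is built while the program runs, arbitrary Python control flow differentiates naturally, which is the "purely imperative" design point the abstract contrasts with symbolic frameworks.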

13,268 citations

Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
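The interface/implementation split the abstract describes, expressing a computation once and executing it elsewhere, can be caricatured by representing the graph as plain closures evaluated against a feed dict (a toy sketch of the dataflow idea, not the TensorFlow API):

```python
# Graph construction returns callables; nothing is computed until the
# "graph" is run against a feed dict, mirroring the separation between
# expressing an algorithm and executing it on some device.
def constant(v):
    return lambda feed: v

def placeholder(name):
    return lambda feed: feed[name]

def add(a, b):
    return lambda feed: a(feed) + b(feed)

graph = add(placeholder("x"), constant(2.0))  # build once...
result = graph({"x": 3.0})                    # ...run with concrete inputs
```

Because the graph is just data until it is run, the same description could in principle be handed to different executors, which is the portability property the abstract emphasizes.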

10,447 citations