Proceedings ArticleDOI

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

15 Jun 2019, pp. 4690–4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, which include a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
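The additive angular margin described in the abstract can be illustrated with a minimal NumPy sketch (not the authors' implementation; the scale `s=64` and margin `m=0.5` follow the paper's common settings, and the helper names are ours): embeddings and class-centre weights are L2-normalised so their dot product is cos(θ), the margin m is added to the target-class angle, and the result is rescaled before the softmax cross-entropy.

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits (minimal sketch).

    features: (N, d) embeddings; weights: (C, d) class centres;
    labels: (N,) ground-truth class indices.
    """
    # L2-normalise so the dot product equals cos(theta).
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos_theta = np.clip(x @ W.T, -1.0, 1.0)   # (N, C)
    theta = np.arccos(cos_theta)
    # Add the angular margin m only at the target-class position,
    # then rescale by s before the softmax.
    margin = np.zeros_like(theta)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)

def softmax_cross_entropy(logits, labels):
    """Numerically stable mean softmax cross-entropy."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the margin shrinks the target-class logit, the loss with m > 0 upper-bounds the plain normalised-softmax loss on the same batch, which is what forces extra angular separation between classes.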


Citations
Proceedings ArticleDOI
01 Jun 2022
TL;DR: The TikTok Facial Expression Recognition (FER) dataset as discussed by the authors comprises 6428 videos scraped from TikTok, covering 9392 distinct individuals with labels for 15 emotion-related prompts.
Abstract: Facial expression recognition (FER) is a critical computer vision task for a variety of applications. Despite the widespread use of FER, there is a dearth of racially diverse facial emotion datasets which are enriched for children, teens, and adults. To bridge this gap, we have built a diverse expression recognition database using publicly available videos from TikTok, a video-focused social networking service. We describe the construction of the TikTok Facial expression recognition (FER) database. The dataset is extracted from 6428 videos scraped from TikTok. The videos cover 9392 distinct individuals and carry labels for 15 emotion-related prompts. We were able to achieve an F1 score of 0.78 for Ekman emotions on expression classification using transfer learning. We hope that the scale and diversity of the TikTokFER dataset will be of use to affective computing practitioners.

2 citations

Posted Content
TL;DR: In this article, a modular architecture for face template protection, called IronMask, is presented, which can be combined with any face recognition system using an angular distance metric. It circumvents template binarization, the main cause of performance degradation in most existing face template protection schemes, via a real-valued error-correcting code.
Abstract: Convolutional neural networks have made remarkable progress in the face recognition field. The more the technology of face recognition advances, the more discriminative the features packed into a face template become. However, this increases the threat to user privacy in case the template is exposed. In this paper, we present a modular architecture for face template protection, called IronMask, that can be combined with any face recognition system using an angular distance metric. We circumvent the need for binarization, which is the main cause of performance degradation in most existing face template protections, by proposing a new real-valued error-correcting code that is compatible with real-valued templates and can therefore minimize performance degradation. We evaluate the efficacy of IronMask by extensive experiments on two face recognition systems, ArcFace and CosFace, with three datasets: CMU-Multi-PIE, FEI, and Color-FERET. According to our experimental results, IronMask achieves a true accept rate (TAR) of 99.79% at a false accept rate (FAR) of 0.0005% when combined with ArcFace, and 95.78% TAR at 0% FAR with CosFace, while providing at least 115-bit security against known attacks.

2 citations

Posted Content
TL;DR: This work investigates the problem by formulating face images as points in a shape-appearance parameter space, and demonstrates that the encoding and decoding of the neuron responses (representations) to face images in CNNs could be achieved under a linear model in the parameter space.
Abstract: Recently, Convolutional Neural Networks (CNNs) have achieved tremendous performance in face recognition, and one popular perspective regarding CNNs' success is that CNNs could learn discriminative face representations from face images with complex image feature encoding. However, it is still unclear what the intrinsic mechanism of face representation in CNNs is. In this work, we investigate this problem by formulating face images as points in a shape-appearance parameter space, and our results demonstrate that: (i) The encoding and decoding of the neuron responses (representations) to face images in CNNs could be achieved under a linear model in the parameter space, in agreement with the recent discovery in primate IT face neurons, but different from the aforementioned perspective on CNNs' face representation with complex image feature encoding; (ii) The linear model for face encoding and decoding in the parameter space could achieve comparable or even better performance on face recognition and verification than state-of-the-art CNNs, which might shed new light on design strategies for face recognition systems; (iii) The neuron responses to face images in CNNs could not be adequately modelled by the axis model, a model recently proposed on face modelling in primate IT cortex. All these results might shed some light on the often-criticized black-box nature behind CNNs' tremendous performance in face recognition.
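The linear encoding/decoding claim in point (i) amounts to fitting one least-squares map from parameter space to neuron responses and one back; a minimal sketch on synthetic data (the paper uses real shape-appearance parameters and CNN activations, both stood in for here by random matrices) looks like:

```python
import numpy as np

# Synthetic stand-ins: P plays the role of shape-appearance parameters,
# R the role of CNN neuron responses generated by a hidden linear map.
rng = np.random.default_rng(1)
P = rng.normal(size=(200, 10))                      # parameters (N x d_p)
A_true = rng.normal(size=(10, 50))                  # hidden ground-truth map
R = P @ A_true + 0.01 * rng.normal(size=(200, 50))  # "neuron responses"

# Encoding: least-squares fit of R ~= P @ A (parameters -> responses).
A, *_ = np.linalg.lstsq(P, R, rcond=None)
# Decoding: least-squares fit of P ~= R @ B (responses -> parameters).
B, *_ = np.linalg.lstsq(R, P, rcond=None)
```

If a single linear map explains the responses well, as the paper reports for real CNN activations, the predicted responses P @ A correlate strongly with R and the decoded parameters R @ B recover P with small error.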

2 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...Following the standard evaluation protocol defined for the unrestricted setting [22, 24], we test the model CNNs on 6000 face pairs (3000 matched pairs and 3000 mismatched pairs)....

  • ...[24] proposed a geometrically interpretable loss function, called ArcFace, which is integrated with different CNN models (e....

  • ...As seen from Figure 5, both the computed Pearson and Spearman coefficients for the six CNNs on the two datasets are larger than 0.4 in most cases, and particularly, those coefficients for the four more recent CNNs (ArcFace, Light-CNN, SphereFace, and ResNet-Face) are even close to or larger than 0.6 with relatively smaller standard deviations, in agreement with the previous results for VGG-Face1/VGG-Face2 on MultiPIE, which further suggests that the predicted representations by the linear model are strongly correlated with those outputted from CNNs....

  • ...Addressing the above questions, we investigated the face representation problem at higher CNN layers by formulating face images as points in a parameter space in this work, with six typical multi-layered CNNs for face recognition: VGG-Face [17], DeepID [14], ResNet-Face (derived from ResNet [27]), SphereFace [22], Light-CNN [23], and ArcFace [24], and three commonly used face datasets: Multi-PIE [28], LFW [29], and MegaFace [30]....

  • ...Deng et al. [24] proposed a geometrically interpretable loss function, called ArcFace, which is integrated with different CNN models (e.g. ResNet [27]) for face recognition and verification....

Journal Article
TL;DR: This work presents a method for finding paths in a deep generative model’s latent space that can maximally vary one set of image features while holding others constant, and suggests that a wealth of opportunities lies in the local analysis of the geometry and semantics of latent spaces.
Abstract: We present a method for finding paths in a deep generative model’s latent space that can maximally vary one set of image features while holding others constant. Crucially, unlike past traversal approaches, ours can manipulate multidimensional features of an image such as facial identity and pixels within a specified region. Our method is principled and conceptually simple: optimal traversal directions are chosen by maximizing differential changes to one feature set such that changes to another set are negligible. We show that this problem is nearly equivalent to one of Rayleigh quotient maximization, and provide a closed-form solution to it based on solving a generalized eigenvalue equation. We use repeated computations of the corresponding optimal directions, which we call Rayleigh EigenDirections (REDs), to generate appropriately curved paths in latent space. We empirically evaluate our method using StyleGAN2 on two image domains: faces and living rooms. We show that our method is capable of controlling various multidimensional features out of the scope of previous latent space traversal methods: face identity, spatial frequency bands, pixels within a region, and the appearance and position of an object. Our work suggests that a wealth of opportunities lies in the local analysis of the geometry and semantics of latent spaces.
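The core computation the abstract describes — maximize change to one feature set while holding another fixed — is a Rayleigh quotient xᵀAx / xᵀBx, maximized by the top generalized eigenvector of A x = λ B x. A minimal sketch (not the authors' code; A and B stand for the two feature-change matrices, and B is assumed symmetric positive definite) solves it by Cholesky whitening:

```python
import numpy as np

def top_rayleigh_direction(A, B):
    """Direction x maximizing x^T A x / x^T B x (minimal sketch).

    Solves the generalized eigenproblem A x = lambda B x by whitening
    with a Cholesky factor of B, then taking the top eigenvector of the
    resulting standard symmetric problem.
    """
    L = np.linalg.cholesky(B)        # B = L L^T (B must be SPD)
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T          # standard symmetric eigenproblem
    vals, vecs = np.linalg.eigh(M)   # ascending eigenvalues
    x = L_inv.T @ vecs[:, -1]        # map whitened eigenvector back
    return x / np.linalg.norm(x)
```

Repeating this computation along a path, as the paper does, yields curved latent-space traversals rather than a single straight direction.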

2 citations

Journal ArticleDOI
TL;DR: This work proposes a new framework to learn point cloud representations by bidirectional reasoning between local structures at different abstraction hierarchies and the global shape, together with a hierarchical reasoning scheme among local parts, structural proxies, and the overall point cloud to learn powerful 3D representations in an unsupervised manner.
Abstract: This work explores the use of global and local structures of 3D point clouds as a free and powerful supervision signal for representation learning. Local and global patterns of a 3D object are closely related. Although each part of an object is incomplete, the underlying attributes about the object are shared among all parts, which makes reasoning about the whole object from a single part possible. We hypothesize that a powerful representation of a 3D object should model the attributes that are shared between parts and the whole object, and distinguishable from other objects. Based on this hypothesis, we propose a new framework to learn point cloud representations by bidirectional reasoning between the local structures at different abstraction hierarchies and the global shape. Moreover, we extend the unsupervised structural representation learning method to more complex 3D scenes. By introducing structural proxies as the intermediate-level representations between local and global ones, we propose a hierarchical reasoning scheme among local parts, structural proxies, and the overall point cloud to learn powerful 3D representations in an unsupervised manner. Extensive experimental results demonstrate that the unsupervised representations can be very competitive alternatives of supervised representations in discriminative power, and exhibit better performance in generalization ability and robustness. Our method establishes the new state-of-the-art of unsupervised/few-shot 3D object classification and part segmentation. We also show our method can serve as a simple yet effective regime for model pre-training on 3D scene segmentation and detection tasks. We expect our observations to offer a new perspective on learning better representations from data structures instead of human annotations for point cloud understanding.

2 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; an ensemble of their residual nets won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
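The reformulation the abstract describes — learning a residual F(x) with reference to the layer input rather than an unreferenced mapping — reduces to adding an identity shortcut around a small stack of layers. A simplified forward-pass sketch (fully connected instead of convolutional, and without batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Forward pass of a basic residual block (simplified sketch).

    The block computes the residual F(x) = W2 @ relu(W1 @ x) and
    returns relu(x + F(x)). The identity shortcut means the block can
    represent the identity by driving F(x) to zero, which is what makes
    very deep stacks of such blocks easier to optimize.
    """
    return relu(x + W2 @ relu(W1 @ x))
```

With all weights at zero the block reduces exactly to relu(x), i.e. depth can be added without degrading an already-good shallower solution.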

123,388 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
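The "single unthinned network with smaller weights" trick the abstract describes is usually implemented as inverted dropout, where the rescaling is moved into training so that test time needs no change. A minimal sketch:

```python
import numpy as np

def dropout(x, p, train, rng):
    """Inverted dropout (minimal sketch of the technique).

    At training time each unit is kept with probability 1 - p and the
    survivors are scaled by 1 / (1 - p); at test time the activations
    pass through unchanged. The scaling keeps the expected activation
    equal in both modes, so the single test-time network approximates
    averaging the exponentially many thinned networks.
    """
    if not train:
        return x
    mask = rng.random(x.shape) >= p   # True = keep the unit
    return x * mask / (1.0 - p)
```

Because E[mask / (1 - p)] = 1, the expected output at training time matches the deterministic test-time output.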

33,597 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
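The per-mini-batch normalization the abstract describes is, for a fully connected layer, a per-feature standardization followed by a learnable scale and shift. A forward-pass sketch (training mode only; the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass for one mini-batch (sketch).

    x: (N, D) activations. Each feature column is normalized with the
    batch mean and variance, then scaled by gamma and shifted by beta,
    both learnable. eps guards against division by zero.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

After the transform each feature has mean beta and standard deviation gamma (up to eps), regardless of how the previous layer's parameters shifted its inputs — which is what allows the higher learning rates the abstract reports.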

30,843 citations

28 Oct 2017
TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
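The imperative ("define-by-run") differentiation the abstract contrasts with symbolic frameworks can be illustrated with a toy scalar reverse-mode autodiff class (ours, not PyTorch's API): the computation graph is recorded simply by executing ordinary Python expressions.

```python
class Var:
    """Toy scalar reverse-mode autodiff (illustrative sketch only)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        # Accumulate the chain-rule product along every path to each leaf.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)
```

Running `z = x * y + x * x` builds the tape as a side effect of evaluation; `z.backward()` then accumulates gradients into `x.grad` and `y.grad`, with no separate graph-construction phase.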

13,268 citations

Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
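The separation the abstract draws between *expressing* a computation and *executing* it can be shown with a toy graph runtime in plain Python (our sketch, not the TensorFlow API): nodes describe operations, and a separate evaluator runs them, which is what lets one description target many device types.

```python
class Node:
    """One operation in a tiny deferred-execution graph (toy sketch)."""

    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

def constant(v):
    return Node("const", value=v)

def add(a, b):
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))

def run(node):
    """Evaluate a graph node recursively; a real runtime would instead
    schedule the ops across heterogeneous devices."""
    if node.op == "const":
        return node.value
    a, b = (run(i) for i in node.inputs)
    return a + b if node.op == "add" else a * b
```

Building `add(mul(constant(2), constant(3)), constant(4))` performs no arithmetic; only `run` does, mirroring the graph-then-execute model the abstract describes.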

10,447 citations