scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

15 Jun 2019-pp 4690-4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI
01 Nov 2020
TL;DR: The first large-scale multimodal language dataset for Spanish, Portuguese, German and French, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is introduced, which is the largest of its kind with 40, 000 total labelled sentences.
Abstract: Modeling multimodal language is a core research area in natural language processing. While languages such as English have relatively large multimodal language resources, other widely spoken languages across the globe have few or no large-scale datasets in this area. This disproportionately affects native speakers of languages other than English. As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is the largest of its kind with 40,000 total labelled sentences. It covers a diverse set topics and speakers, and carries supervision of 20 labels including sentiment (and subjectivity), emotions, and attributes. Our evaluations on a state-of-the-art multimodal model demonstrates that CMU-MOSEAS enables further research for multilingual studies in multimodal language.

30 citations


Cites methods from "ArcFace: Additive Angular Margin Lo..."

  • ...The bounding box of the face is extracted using the RetinaFace (Deng et al., 2019b)....

    [...]

  • ...Identities are extracted using ArcFace (Deng et al., 2019a)....

    [...]

Journal ArticleDOI
TL;DR: In this paper, a detailed review of recent and state-of-the-art research advances of deep reinforcement learning in computer vision is provided, including landmark localization, object detection, object tracking, registration on both 2D image and 3D image volumetric data, image segmentation, and videos analysis.
Abstract: Deep reinforcement learning augments the reinforcement learning framework and utilizes the powerful representation of deep neural networks. Recent works have demonstrated the remarkable successes of deep reinforcement learning in various domains including finance, medicine, healthcare, video games, robotics, and computer vision. In this work, we provide a detailed review of recent and state-of-the-art research advances of deep reinforcement learning in computer vision. We start with comprehending the theories of deep learning, reinforcement learning, and deep reinforcement learning. We then propose a categorization of deep reinforcement learning methodologies and discuss their advantages and limitations. In particular, we divide deep reinforcement learning into seven main categories according to their applications in computer vision, i.e. (i) landmark localization (ii) object detection; (iii) object tracking; (iv) registration on both 2D image and 3D image volumetric data (v) image segmentation; (vi) videos analysis; and (vii) other applications. Each of these categories is further analyzed with reinforcement learning techniques, network design, and performance. Moreover, we provide a comprehensive analysis of the existing publicly available datasets and examine source code availability. Finally, we present some open issues and discuss future research directions on deep reinforcement learning in computer vision.

30 citations

Posted Content
TL;DR: This paper introduces iQIYI-VID, the largest video dataset for multi-modal person identification, and proposed a Multi- modal Attention module to fuse multi-Modal features that can improve person identification considerably.
Abstract: Person identification in the wild is very challenging due to great variation in poses, face quality, clothes, makeup and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modal of information, which is inadequate to handle all the situations in practice. Multi-modal person identification is a more promising way that we can jointly utilize face, head, body, audio features, and so on. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, TV series, to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2\%. We evaluated the state-of-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from being perfect for the task of person identification in the wild. We proposed a Multi-modal Attention module to fuse multi-modal features that can improve person identification considerably. We have released the dataset online to promote multi-modal person identification research.

30 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...In the area of face recognition, ArcFace [11] reached a precision of 99....

    [...]

  • ...It should be mentioned that, ArcFace [11] achieved a precision of 99....

    [...]

  • ...The faces are detected by the SSH model [39], and then recognized by the ArcFace model [11]....

    [...]

  • ...We train a head classifier based on the ArcFace model [11]....

    [...]

  • ...The state-of-art method, ArcFace [11] achieved a face verification accuracy of 99....

    [...]

Proceedings Article
15 Aug 2020
TL;DR: This work proposes a novel method called BroadFace, which is a learning process to consider a massive set of identities, comprehensively, and achieves the state-of-the-art results with significant improvements on nine datasets in 1:1 face verification and 1:N face identification tasks, and is also effective in image retrieval.
Abstract: The datasets of face recognition contain an enormous number of identities and instances. However, conventional methods have difficulty in reflecting the entire distribution of the datasets because a mini-batch of small size contains only a small portion of all identities. To overcome this difficulty, we propose a novel method called BroadFace, which is a learning process to consider a massive set of identities, comprehensively. In BroadFace, a linear classifier learns optimal decision boundaries among identities from a large number of embedding vectors accumulated over past iterations. By referring more instances at once, the optimality of the classifier is naturally increased on the entire datasets. Thus, the encoder is also globally optimized by referring the weight matrix of the classifier. Moreover, we propose a novel compensation method to increase the number of referenced instances in the training stage. BroadFace can be easily applied on many existing methods to accelerate a learning process and obtain a significant improvement in accuracy without extra computational burden at inference stage. We perform extensive ablation studies and experiments on various datasets to show the effectiveness of BroadFace, and also empirically prove the validity of our compensation method. BroadFace achieves the state-of-the-art results with significant improvements on nine datasets in 1:1 face verification and 1:N face identification tasks, and is also effective in image retrieval.

30 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...As pre-processing, we normalize a face image to 112× 112 by warping a face-region using five facial points from two eyes, nose and two corners of mouth [6,22,40]....

    [...]

  • ...The recent adoption [6,7,22,32,37,39,40,41] of Convolutional Neural Networks (CNNs) has dramatically increased recognition accuracy....

    [...]

  • ...A backbone network is ResNet-100 [11] that is used in the recent works [6,15]....

    [...]

  • ...1 ArcFace [6] 99....

    [...]

  • ...The computed embedding vectors and the weight vectors of the linear classifier are L2-normalized and trained by the ArcFace [6]....

    [...]

Book ChapterDOI
23 Aug 2020
TL;DR: A novel position-aware exclusivity is proposed to encourage large diversity among different filters of the same layer to alleviate the low-capability of student network and investigates the effect of several prevailing knowledge for face recognition distillation to conclude that the knowledge of feature consistency is more flexible and preserves much more information than others.
Abstract: Knowledge distillation is an effective tool to compress large pre-trained Convolutional Neural Networks (CNNs) or their ensembles into models applicable to mobile and embedded devices. The success of which mainly comes from two aspects: the designed student network and the exploited knowledge. However, current methods usually suffer from the low-capability of mobile-level student network and the unsatisfactory knowledge for distillation. In this paper, we propose a novel position-aware exclusivity to encourage large diversity among different filters of the same layer to alleviate the low-capability of student network. Moreover, we investigate the effect of several prevailing knowledge for face recognition distillation and conclude that the knowledge of feature consistency is more flexible and preserves much more information than others. Experiments on a variety of face recognition benchmarks have revealed the superiority of our method over the state-of-the-arts.

30 citations


Cites background or methods from "ArcFace: Additive Angular Margin Lo..."

  • ...In face recognition, it is very important to perform open-set evaluation [28, 42, 7], i....

    [...]

  • ...To achieve better performance, large CNNs like ResNet [7] or AttentionNet [46] are usually employed, which makes them hard to deploy on mobile and embedded devices....

    [...]

  • ...There are many kinds of network architectures [28, 3, 41] and several loss functions [7, 46] for face recognition....

    [...]

  • ...Specifically, rather than the traditional softmax loss, face recognition is usually supervised by margin-based softmax losses [28, 24, 42, 47, 7, 46], metric learning losses [37] or both [39]....

    [...]

  • ...Without loss of generality, we use SEResNet50-IR [7] as the teacher model, which was trained by SV-AMSoftmax loss [46]....

    [...]

References
More filters
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

33,597 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

30,843 citations

28 Oct 2017
TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.

13,268 citations

Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

10,447 citations