Proceedings ArticleDOI

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

15 Jun 2019, pp. 4690-4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, which include a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
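The additive angular margin described above can be sketched in a few lines: the target-class logit cos(θ_y) is replaced by cos(θ_y + m) before the scaled softmax cross-entropy. This is a toy single-sample illustration of the formulation, not the authors' released implementation; the function names are ours.

```python
import math

def arcface_logits(cosines, target, margin=0.5, scale=64.0):
    """Apply the additive angular margin to the target-class logit.

    cosines: cos(theta_j) between the normalized embedding and each
    normalized class centre. The target logit cos(theta_y) becomes
    cos(theta_y + margin); all logits are then multiplied by the scale s.
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for safety
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * c)
    return logits

def softmax_cross_entropy(logits, target):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[target] / sum(exps))
```

Because the margin is added to the angle itself, the penalty corresponds directly to geodesic distance on the hypersphere, and the loss for a correctly classified sample is strictly larger than without the margin, which forces tighter intra-class clustering.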


Citations
Journal ArticleDOI
TL;DR: LOTR as discussed by the authors is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map, which can be trained end-to-end without requiring any post-processing steps.
Abstract: This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches. On the WFLW dataset, the proposed LOTR framework demonstrates promising results compared with several state-of-the-art methods. Additionally, we report the improvement in state-of-the-art face recognition performance when using our proposed LOTRs for face alignment.
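The gradient discontinuity that smooth-Wing addresses can be seen in the original Wing loss (Feng et al.), sketched below; the piecewise constant C makes the function continuous at |x| = w, but its derivative still jumps from w/(w+ε) to 1 there. This is our sketch of the baseline Wing loss for context, not the authors' smooth-Wing variant.

```python
import math

def wing_loss(x, w=10.0, eps=2.0):
    """Wing loss for a single landmark-coordinate error x.

    Behaves like a scaled logarithm for small errors (|x| < w) and like L1
    for large ones; C is chosen so the two pieces meet at |x| = w, but the
    first derivative is still discontinuous there (w/(w+eps) vs. 1), which
    is the defect the smooth-Wing loss corrects.
    """
    C = w - w * math.log(1.0 + w / eps)
    ax = abs(x)
    if ax < w:
        return w * math.log(1.0 + ax / eps)
    return ax - C
```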

3 citations

Proceedings ArticleDOI
21 Aug 2022
TL;DR: Huber et al. as mentioned in this paper proposed a specialized, likelihood-based fusion method to enable deep learning-based face recognition on historic portrait paintings, and also proposed a method to accurately determine the confidence of the decision made, to assist art historians in their research.
Abstract: Verifying the identity of a person (sitter) portrayed in a historical painting is often a challenging but critical task in art historian research. In many cases, this information has been lost due to time or other circumstances, and today there are only speculations of art historians about which person it could be. Art historians often use subjective factors for this purpose and then infer information about the depicted person's life, status, and era from that identity. On the other hand, automated face recognition has achieved a high level of accuracy, especially on photographs, and considers objective factors to determine the identity or verify a suspected identity. The limited amount of data, as well as the domain-specific challenges, makes the use of automated face recognition methods in the domain of historic paintings difficult. We propose a specialized, likelihood-based fusion method to enable deep learning-based face recognition on historic portrait paintings. We additionally propose a method to accurately determine the confidence of the decision made, to assist art historians in their research. For this purpose, we used a model trained on common photographs and adapted it to the domain of historical paintings through transfer learning. By using an underlying challenge dataset, we compute the likelihood for the assumed identity against reference images of the identity and fuse them to utilize as much information as possible. From the results of the likelihood fusion, we then derive decision confidence to determine the certainty of the model's decision. The experiments were carried out in a leave-one-out evaluation scenario on the database we created, the largest authentic database of historic portrait paintings to date, consisting of over 760 portrait paintings of 210 different sitters by over 250 different artists.
The experiments demonstrated that a) the proposed approach outperforms pure face recognition solutions, b) the fusion approach effectively combines the sitter information towards a higher verification accuracy, and c) the proposed confidence estimation approach is highly successful in capturing the estimated accuracy of the decision. The meta-information of the used historic face images can be found at https://github.com/marcohuber/HistoricalFaces.
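One plausible reading of the likelihood fusion and confidence steps is sketched below: per-reference log-likelihood ratios are summed under a conditional-independence assumption, and the fused evidence is mapped to a confidence via the logistic function. The abstract does not give the authors' exact formulas, so both helpers are our assumptions for illustration.

```python
import math

def fuse_log_likelihood_ratios(llrs):
    """Fuse per-reference log-likelihood ratios by summation.

    Assuming the comparisons of the probe against each reference image are
    conditionally independent, summing their LLRs combines the evidence.
    This is one common fusion scheme, not necessarily the authors' exact one.
    """
    return sum(llrs)

def confidence(fused_llr):
    """Map the fused LLR to a (0, 1) confidence via the logistic function."""
    return 1.0 / (1.0 + math.exp(-fused_llr))
```

A fused LLR of zero (evidence equally balanced) maps to a confidence of 0.5, and strong positive evidence from several references pushes the confidence towards 1.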

3 citations

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, the authors propose an asymmetric structure for automatic speaker verification, which takes the large-scale ECAPA-TDNN model for enrollment and the small-scale ECAPA-TDNNLite model for verification.
Abstract: With the development of deep learning, automatic speaker verification has made considerable progress over the past few years. However, to design a lightweight and robust system with limited computational resources is still a challenging problem. Traditionally, a speaker verification system is symmetrical, indicating that the same embedding extraction model is applied for both enrollment and verification in inference. In this paper, we come up with an innovative asymmetric structure, which takes the large-scale ECAPA-TDNN model for enrollment and the small-scale ECAPA-TDNNLite model for verification. As a symmetrical system, our proposed ECAPA-TDNNLite model achieves an EER of 3.07% on the Voxceleb1 original test set with only 11.6M FLOPS. Moreover, the asymmetric structure further reduces the EER to 2.31%, without increasing any computational costs during verification.
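The EER figures quoted above (3.07% and 2.31%) are the operating point where the false acceptance and false rejection rates are equal. A simple threshold-sweep approximation of the metric, written as a hypothetical helper for illustration, looks like this:

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over all observed scores.

    FAR = fraction of impostor scores accepted (score >= threshold);
    FRR = fraction of genuine scores rejected (score < threshold).
    Returns (FAR + FRR) / 2 at the threshold where |FAR - FRR| is smallest.
    """
    best = None
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]
```

Perfectly separated score distributions give an EER of 0; overlapping distributions raise it towards 0.5 (chance).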

3 citations

Journal ArticleDOI
TL;DR: Li et al. as mentioned in this paper presented a privacy-aware architecture, MobiDIV, which is a client-only scheme, where all sensitive data are processed locally on the driver's smartphone.
Abstract: As car hire and sharing services are popular in the transportation market, secure driver identity verification is attracting more attention. However, the current verification mechanism focuses on performing authentication operations in the cloud server before drivers get access to the car, which results in potential privacy and security issues. In this article, we present a privacy-aware architecture, MobiDIV, which is a client-only scheme, where all sensitive data are processed locally on the driver's smartphone. To achieve real-time and robust driver identification during the driving life cycle, an efficient face feature extractor is proposed in MobiDIV. Specifically, two three-stream neural networks using the proposed efficient SqueezeNet structure are trained on our synthesized data set for different in-car uncertainties (pose, motion blur, non-alignment and low illumination). During authentication, only an adaptable embedding model is selected and run on the phone for continuous feature extraction. The anomaly operation monitoring algorithm is then applied to the optical signal generated by the phone flash for secure identity re-identification and verification failure message transmission. This allows us to further ensure the privacy of driver facial images without compromising real-time identity verification. We perform extensive experiments on various data sets. Compared to most SOTA deep neural networks on real-world open data sets, we achieve similar verification accuracy with fewer parameters and floating-point calculations. On the challenging synthetic test data sets, we even achieve a higher average verification accuracy. To assess MobiDIV in depth, the proposed model is integrated into the car-sharing platform ICICV-E100, and the obtained results show the feasibility of our system.

3 citations

Posted Content
Jiawen Kang1, Ruiqi Liu, Lantian Li1, Yunqi Cai1, Dong Wang1, Thomas Fang Zheng1 
TL;DR: In this paper, a domain-invariant projection is proposed to improve the generalizability of speaker vectors by using the Model-Agnostic Meta-Learning (MAML) principle.
Abstract: Domain generalization remains a critical problem for speaker recognition, even with state-of-the-art architectures based on deep neural nets. For example, a model trained on read speech may largely fail when applied to scenarios such as singing or movies. In this paper, we propose a domain-invariant projection to improve the generalizability of speaker vectors. This projection is a simple neural net and is trained following the Model-Agnostic Meta-Learning (MAML) principle, for which the objective is that the projection, after being updated with speech data from one domain, should classify speakers well in another domain. We tested the proposed method on CNCeleb, a new dataset consisting of single-speaker multi-condition (SSMC) data. The results demonstrated that the MAML-based domain-invariant projection can produce more generalizable speaker vectors and effectively improve the performance in unseen domains.
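The MAML principle described above (adapt on one domain, evaluate the meta-objective on another) can be sketched on a toy scalar model f(x) = w·x with a first-order meta-update. This is a minimal illustration of the training principle, not the authors' projection network; all names are ours.

```python
def grad_mse(w, data):
    """Gradient of the mean squared error of f(x) = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def maml_step(w, domain_a, domain_b, inner_lr=0.05, meta_lr=0.05):
    """One first-order MAML step on the scalar model f(x) = w * x.

    The inner step adapts w on domain A; the meta-update then follows the
    gradient of the *adapted* parameter evaluated on domain B, so w is
    pushed towards values that transfer across domains.
    """
    w_adapted = w - inner_lr * grad_mse(w, domain_a)
    return w - meta_lr * grad_mse(w_adapted, domain_b)
```

Iterating this step with two domains that share the underlying mapping (here y = 2x) drives w to the shared solution, which is the cross-domain generalization the abstract aims at.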

3 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
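The residual reformulation above is simply y = x + F(x): the block learns the residual function F with reference to its input, so an identity mapping is recovered by driving F towards zero rather than by fitting an identity through stacked nonlinear layers. A minimal sketch (our illustration, not the paper's convolutional blocks):

```python
def residual_block(x, residual_fn):
    """Residual reformulation: output x + F(x) element-wise.

    If the learned residual F is zero, the block is exactly the identity,
    which is what makes very deep stacks of such blocks easy to optimize.
    """
    return [xi + fi for xi, fi in zip(x, residual_fn(x))]

def zero_residual(x):
    # A residual function at zero: the whole block reduces to identity.
    return [0.0 for _ in x]
```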

123,388 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
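The "thinned networks" idea above is commonly implemented as inverted dropout, sketched below: units are dropped with probability p during training and the survivors are scaled by 1/(1-p), so the single unthinned network can be used at test time with no weight rescaling. (The original paper instead scales the weights at test time; the two are equivalent in expectation.)

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout over a list of unit activations.

    During training, each unit is kept with probability 1 - p and scaled by
    1 / (1 - p), keeping the expected activation unchanged; at test time the
    activations pass through untouched.
    """
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```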

33,597 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
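The per-mini-batch normalization described above reduces, for a single activation, to standardizing with the batch mean and variance and then applying a learned scale γ and shift β (so the layer can still represent the identity). A toy one-feature sketch, ignoring the running statistics used at inference:

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of one activation to zero mean / unit variance,
    then apply the learned scale (gamma) and shift (beta). eps guards
    against division by zero for near-constant batches."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```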

30,843 citations

28 Oct 2017
TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
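The "differentiation of purely imperative programs" described above means gradients are recorded as operations execute, rather than compiled from a symbolic graph. A toy tape-based reverse-mode value (our illustration of the idea, not PyTorch's actual API):

```python
class Var:
    """A minimal reverse-mode autodiff value: each op records its parents
    and the local gradients, and backward() applies the chain rule."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        # Accumulate the incoming gradient, then propagate it to parents.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)
```

For z = x * y + x with x = 3 and y = 4, backward accumulates dz/dx = y + 1 = 5 and dz/dy = x = 3, exactly as if the expression had been written imperatively and differentiated afterwards.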

13,268 citations

Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

10,447 citations