Proceedings ArticleDOI

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

15 Jun 2019-pp 4690-4699
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.
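The core of the method is simple: for the ground-truth class, the angle between the normalized feature and the class weight is increased by an additive margin before re-projecting to cosine space. The following stand-alone function is an illustrative sketch of that idea (the margin m = 0.5 and scale s = 64 are the commonly reported defaults; the function itself is not the authors' released code):

```python
import math

def arcface_logit(cos_theta, margin=0.5, scale=64.0, is_target=True):
    """Apply the additive angular margin to a single cosine logit.

    For the ground-truth class, the angle theta = arccos(cos_theta)
    is increased by `margin` before mapping back to cosine space,
    which shrinks the target logit and enforces a geodesic gap on the
    hypersphere. Non-target classes keep their plain scaled cosine.
    """
    if is_target:
        # Clamp for numerical safety before taking arccos.
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))
        return scale * math.cos(theta + margin)
    return scale * cos_theta
```

These per-class logits would then feed a standard softmax cross-entropy loss; because only an angle offset and a rescale are added, the computational overhead over plain softmax is negligible, as the abstract claims.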


Citations
Journal ArticleDOI
TL;DR: A new loss function, noise-tolerant deep neighborhood embedding, is proposed that can accurately encode the semantic relationships among remote sensing (RS) scenes.
Abstract: Recently, many deep learning-based methods have been developed for solving remote sensing (RS) scene classification or retrieval tasks. Most of the adopted loss functions for training these models require accurate annotations. However, the presence of noise in such annotations (also known as label noise) cannot be avoided in large-scale RS benchmark archives, resulting from geo-location/registration errors, land-cover changes, and diverse knowledge background of annotators. To overcome the influence of noisy labels on the learning process of deep models, we propose a new loss function called noise-tolerant deep neighborhood embedding which can accurately encode the semantic relationships among RS scenes. Specifically, we target at maximizing the leave-one-out $K$ -NN score for uncovering the inherent neighborhood structure among the images in feature space. Moreover, we down-weight the contribution of potential noisy images by learning their localized structure and pruning the images with low leave-one-out $K$ -NN scores. Based on our newly proposed loss function, classwise features can be more robustly discriminated. Our experiments, conducted on two benchmark RS datasets, validate the effectiveness of the proposed approach on three different RS scene interpretation tasks, including classification, clustering, and retrieval. The codes of this article will be publicly available from https://github.com/jiankang1991 .
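The leave-one-out K-NN score that this loss maximizes can be made concrete with a small sketch (our own illustration, not the authors' code; Euclidean distance and the function name are assumptions):

```python
def leave_one_out_knn_score(features, labels, index, k=3):
    """Fraction of the k nearest neighbours of features[index]
    (itself excluded) that share its label.

    A low score suggests the image does not sit in a neighbourhood of
    its own class, which is the signal used to down-weight or prune
    potentially noisy labels.
    """
    target = features[index]
    dists = []
    for i, f in enumerate(features):
        if i == index:  # leave-one-out: skip the query itself
            continue
        d = sum((a - b) ** 2 for a, b in zip(target, f))
        dists.append((d, labels[i]))
    dists.sort()
    neighbours = dists[:k]
    return sum(1 for _, lab in neighbours if lab == labels[index]) / k
```

In the paper's setting the score is computed in the learned feature space and folded into the training objective; the sketch above only shows the scoring step.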

6 citations

Journal ArticleDOI
TL;DR: TomoTwin, an open-source general picking model for cryogenic-electron tomograms based on deep metric learning, is presented; it embeds tomograms in an information-rich, high-dimensional space that separates macromolecules according to their three-dimensional structure.
Abstract: Cryogenic-electron tomography enables the visualization of cellular environments in extreme detail; however, tools to analyze the full amount of information contained within these densely packed volumes are still needed. Detailed analysis of macromolecules through subtomogram averaging requires particles to first be localized within the tomogram volume, a task complicated by several factors including a low signal-to-noise ratio and crowding of the cellular space. Available methods for this task suffer either from being error prone or requiring manual annotation of training data. To assist in this crucial particle picking step, we present TomoTwin: an open source general picking model for cryogenic-electron tomograms based on deep metric learning. By embedding tomograms in an information-rich, high-dimensional space that separates macromolecules according to their three-dimensional structure, TomoTwin allows users to identify proteins in tomograms de novo without manually creating training data or retraining the network to locate new proteins.

6 citations

Proceedings ArticleDOI
TL;DR: In this paper, the authors proposed Contrastive Uncertainty Learning (CUL) by integrating the merits of uncertainty learning and contrastive self-supervised learning to improve the performance of iris recognition with insufficient labeled data.
Abstract: Cross-database recognition is still an unavoidable challenge when deploying an iris recognition system to a new environment. In the paper, we present a compromise problem that resembles the real-world scenario, named iris recognition with insufficient labeled samples. This new problem aims to improve the recognition performance by utilizing partially-or un-labeled data. To address the problem, we propose Contrastive Uncertainty Learning (CUL) by integrating the merits of uncertainty learning and contrastive self-supervised learning. CUL makes two efforts to learn a discriminative and robust feature representation. On the one hand, CUL explores the uncertain acquisition factors and adopts a probabilistic embedding to represent the iris image. In the probabilistic representation, the identity information and acquisition factors are disentangled into the mean and variance, avoiding the impact of uncertain acquisition factors on the identity information. On the other hand, CUL utilizes probabilistic embeddings to generate virtual positive and negative pairs. Then CUL builds its contrastive loss to group the similar samples closely and push the dissimilar samples apart. The experimental results demonstrate the effectiveness of the proposed CUL for iris recognition with insufficient labeled samples.
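The probabilistic embedding at the heart of CUL represents an iris image as a Gaussian over feature space, from which virtual positives can be drawn for the contrastive loss. A minimal sketch of that sampling step (our own illustration; the function name and fixed seed are assumptions, and CUL learns the mean and variance rather than taking them as inputs):

```python
import random

def sample_virtual_embeddings(mean, variance, n=4, seed=0):
    """Draw n virtual embeddings from N(mean, diag(variance)).

    The mean carries the identity information and the variance models
    uncertain acquisition factors; samples from the same distribution
    act as extra positive pairs in a contrastive loss.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        samples.append([m + rng.gauss(0.0, 1.0) * v ** 0.5
                        for m, v in zip(mean, variance)])
    return samples
```

Note that with zero variance every sample collapses onto the mean, i.e. the representation degenerates to an ordinary deterministic embedding.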

6 citations

Book ChapterDOI
23 Aug 2020
TL;DR: This work proposes a training framework to handle extreme classification tasks based on Random Projection and demonstrates that the proposed framework is able to train deep learning models with millions of classes and achieve above 10× speedup compared to existing approaches.
Abstract: Deep learning has achieved remarkable success in many classification tasks because of its great power of representation learning for complex data. However, it remains challenging when extending to classification tasks with millions of classes. Previous studies are focused on solving this problem in a distributed fashion or using a sampling-based approach to reduce the computational cost caused by the softmax layer. However, these approaches still need high GPU memory in order to work with large models and it is non-trivial to extend them to parallel settings. To address these issues, we propose an efficient training framework to handle extreme classification tasks based on Random Projection. The key idea is that we first train a slimmed model with a random projected softmax classifier and then we recover it to the original classifier. We also show a theoretical guarantee that this recovered classifier can approximate the original classifier with a small error. Later, we extend our framework to parallel settings by adopting a communication reduction technique. In our experiments, we demonstrate that the proposed framework is able to train deep learning models with millions of classes and achieve above \(10{\times }\) speedup compared to existing approaches.
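The recovery guarantee rests on a standard property of random projections: for a Gaussian matrix R with suitably scaled entries, RᵀR approximates the identity, so a classifier trained in the projected space can be mapped back with small error. A sketch of such a projection matrix (our own illustration of the general Johnson–Lindenstrauss-style construction, not the paper's exact recipe):

```python
import random

def random_projection(dim_out, dim_in, seed=0):
    """Gaussian random projection matrix R (dim_out x dim_in).

    Entries are drawn i.i.d. from N(0, 1/dim_out), so that
    E[R^T R] = I: columns are approximately orthonormal when
    dim_out is large, which is what makes approximate recovery of
    the original classifier possible.
    """
    rng = random.Random(seed)
    scale = (1.0 / dim_out) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(dim_in)]
            for _ in range(dim_out)]
```

Training the slimmed softmax then amounts to using the projected classifier, and recovery applies Rᵀ (or a least-squares solve) to return to the full-dimensional weights.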

6 citations


Cites background from "ArcFace: Additive Angular Margin Lo..."

  • ...In addition, it is also possible to distribute the computation process of softmax layer based on matrix partition to multiple parallel workers in order to overcome the difficulty of training large models on a large-scale dataset [7]....


  • ...1, dividing it by 10 at {30, 60, 90} and {10, 16} epoch for ImageNet and Facial datasets [7]....


Journal ArticleDOI
TL;DR: A low-cost, accurate method of masked face synthesis, i.e. mask transfer, is proposed for data augmentation, and a mask-aware similarity matching strategy (MS) is adopted to improve performance in the inference stage.
Abstract: Face masks bring a new challenge to face recognition systems, especially against the background of the COVID-19 pandemic. In this paper, a method mitigating the negative effects of mask defects on face recognition is proposed. Firstly, a low-cost, accurate method of masked face synthesis, i.e. mask transfer, is proposed for data augmentation. Secondly, an attention-aware masked face recognition (AMaskNet) is proposed to improve the performance of masked face recognition, which includes two modules: a feature extractor and a contribution estimator. Therein, the contribution estimator is employed to learn the contribution of the feature elements, thus achieving refined feature representation by simple matrix multiplications. Meanwhile, an end-to-end training strategy is utilized to optimize the entire model. Finally, a mask-aware similarity matching strategy (MS) is adopted to improve performance in the inference stage. The experiments show that the proposed method achieves consistently superior performance on three masked face recognition datasets: RMFRD [1], COX [2] and Public-IvS [3]. Meanwhile, qualitative analysis experiments using CAM [4] indicate that the contribution learned by AMaskNet is more conducive to masked face recognition.
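The "refined feature representation by simple matrix multiplications" reduces, in the element-wise case, to re-weighting each feature component by its learned contribution score. A toy sketch (our own illustration; AMaskNet learns the scores end-to-end, whereas here they are supplied by hand):

```python
def refine_features(features, contributions):
    """Re-weight each embedding component by a contribution score.

    Components dominated by the masked region would receive scores
    near zero, so the refined embedding emphasizes the visible,
    identity-bearing parts of the face.
    """
    return [f * c for f, c in zip(features, contributions)]
```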

6 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
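The reformulation is compact: instead of asking a stack of layers to learn a mapping H(x) directly, it learns the residual F(x) = H(x) − x and outputs F(x) + x, so the identity mapping is the easy default (F = 0). A minimal sketch of the skip connection (our own illustration, omitting convolutions and batch normalization):

```python
def residual_block(x, transform):
    """Residual unit: output F(x) + x for a learned transform F.

    If `transform` collapses to zero, the block reduces to the
    identity, which is what makes very deep stacks trainable.
    """
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]
```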

123,388 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
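The train/test asymmetry described above is usually implemented as "inverted" dropout: survivors are scaled up by 1/(1−p) during training so the unthinned network can be used unchanged at test time. A minimal sketch (our own illustration, not the paper's code):

```python
import random

def dropout(x, p=0.5, training=True, seed=None):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability p and the
    survivors are scaled by 1/(1-p), so the expected activation is
    preserved and no rescaling is needed at test time.
    """
    if not training or p == 0.0:
        return list(x)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in x]
```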

33,597 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
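The per-mini-batch normalization is easy to state for a single scalar activation: standardize by the batch mean and variance, then apply a learnable scale and shift. A minimal sketch (our own illustration, showing the batch statistics but not the moving averages used at inference):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar activations.

    Subtracts the batch mean, divides by the batch standard
    deviation (with eps for numerical stability), then applies the
    learnable scale gamma and shift beta.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta
            for v in batch]
```

Because gamma and beta are learned, the layer can recover the identity transform if normalization turns out to be harmful for a given unit.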

30,843 citations

28 Oct 2017
TL;DR: An automatic differentiation module of PyTorch is described — a library designed to enable rapid research on machine learning models that focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead.
Abstract: In this article, we describe an automatic differentiation module of PyTorch — a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd [4], and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.
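The imperative, define-by-run differentiation described here can be illustrated with a toy reverse-mode autodiff that records the computation as it executes (our own sketch; this is not PyTorch's API and handles only addition and multiplication of scalars):

```python
class Var:
    """A scalar with reverse-mode automatic differentiation.

    Each operation eagerly computes its value and records its parents
    together with local derivatives; backward() then accumulates
    gradients by the chain rule, mirroring the tape-based style the
    paper describes.
    """
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)
```

For example, with x = 3 and y = 4, z = x·y + x gives dz/dx = y + 1 = 5 and dz/dy = x = 3.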

13,268 citations

Posted Content
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

10,447 citations