Open Access · Posted Content

Subclass Distillation

TLDR
In this article, the teacher is forced to divide each class into subclasses that it invents during supervised training, and the student is trained to match the teacher's subclass probabilities, transferring most of the teacher's generalization ability to the student.
Abstract
After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.
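A minimal sketch of the idea described in the abstract, assuming a standard temperature-scaled distillation loss applied to the subclass logits; the constants (NUM_CLASSES, SUBCLASSES_PER_CLASS, TEMPERATURE) and the way subclass logits are aggregated into class logits are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative settings (assumptions, not the paper's exact values).
NUM_CLASSES = 2            # e.g. a binary click-through task
SUBCLASSES_PER_CLASS = 5   # subclasses the teacher is forced to invent
TEMPERATURE = 4.0          # softening temperature for distillation

def class_logits_from_subclass_logits(subclass_logits):
    """Aggregate subclass logits into class logits by log-sum-exp over
    each class's block of subclasses."""
    blocks = subclass_logits.view(-1, NUM_CLASSES, SUBCLASSES_PER_CLASS)
    return torch.logsumexp(blocks, dim=-1)

def subclass_distillation_loss(student_subclass_logits,
                               teacher_subclass_logits,
                               labels,
                               alpha=0.5):
    """Student matches the teacher's softened subclass probabilities,
    plus an ordinary cross-entropy term on the true class labels."""
    t = TEMPERATURE
    soft_targets = F.softmax(teacher_subclass_logits / t, dim=-1)
    log_student = F.log_softmax(student_subclass_logits / t, dim=-1)
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * t * t

    hard = F.cross_entropy(
        class_logits_from_subclass_logits(student_subclass_logits), labels)
    return alpha * distill + (1.0 - alpha) * hard
```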


Citations
Posted Content

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

TL;DR: This paper proposes an additive angular margin loss (ArcFace) that obtains highly discriminative features for face recognition and has a clear geometric interpretation due to its exact correspondence to geodesic distance on the hypersphere.
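A minimal sketch of the additive angular margin logit described above, assuming L2-normalised features and class weights; the scale s and margin m values are illustrative defaults, not prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weight, labels, s=64.0, m=0.5):
    """Additive angular margin: add m to the angle between each sample's
    feature and its ground-truth class centre, then rescale by s."""
    # Cosine similarity between normalised features and class centres.
    cos = F.normalize(features) @ F.normalize(weight).t()        # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    # Only the ground-truth class gets the additive angular margin m.
    cos_margined = torch.where(target, torch.cos(theta + m), cos)
    return s * cos_margined
```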
Proceedings Article

Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces.

TL;DR: This paper relaxes the intra-class constraint of ArcFace to improve robustness to label noise: it designs K sub-centers for each class, and a training sample only needs to be close to any one of the K positive sub-centers rather than to a single positive center.
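A minimal sketch of the sub-center pooling described above: each class owns K centres and a sample's similarity to a class is its best similarity over those centres. The layout of the sub-center matrix (class-major grouping) is an illustrative assumption; the margin from the previous sketch can then be applied to the pooled cosines.

```python
import torch
import torch.nn.functional as F

def subcenter_cosine(features, subcenters, k):
    """Sub-center pooling: similarity to a class is the maximum cosine
    similarity over that class's k sub-centers."""
    # subcenters: (num_classes * k, dim), grouped class by class.
    cos = F.normalize(features) @ F.normalize(subcenters).t()    # (B, C*k)
    cos = cos.view(features.size(0), -1, k)                      # (B, C, k)
    return cos.max(dim=-1).values                                # (B, C)
```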
Posted Content

No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems

TL;DR: This work proposes GEORGE, a method to both measure and mitigate hidden stratification even when subclass labels are unknown, and theoretically characterizes its performance in terms of the worst-case generalization error across any subclass.
Posted Content

Asking without Telling: Exploring Latent Ontologies in Contextual Representations

TL;DR: This work introduces latent subclass learning (LSL), a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs, extracting emergent structure from input representations in an interpretable and quantifiable form.
Journal Article

Anti-Distillation: Improving Reproducibility of Deep Networks

TL;DR: A novel approach, Anti-Distillation, is proposed to address irreproducibility in deep networks where ensemble models are used to generate predictions; it enhances the benefit of ensembles and makes the final predictions more reproducible.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
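A minimal sketch of the residual idea named above (y = F(x) + x, an identity shortcut added to a small stack of convolutions); the specific layer sizes and use of batch normalisation are illustrative, not a full reproduction of the paper's architectures.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Learns a residual F(x) on top of an identity shortcut, which eases
    optimisation of very deep networks."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)
```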
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: As discussed by the authors, state-of-the-art ImageNet performance was achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
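A compressed sketch of the layer layout just described (five convolutional layers, some followed by max-pooling, then three fully-connected layers ending in 1000-way logits); the exact channel counts and kernel sizes follow the commonly used AlexNet configuration and are assumptions here.

```python
import torch.nn as nn

# For 3x224x224 inputs; logits feed a 1000-way softmax.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)
```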
Book

Elements of information theory

TL;DR: The authors examine the role of entropy, inequality, and randomness in the design and construction of codes.
Dissertation

Learning Multiple Layers of Features from Tiny Images

TL;DR: In this paper, the author describes how to train a multi-layer generative model of natural images, using a dataset of millions of tiny colour images.
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
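A minimal sketch of the MobileNet building block with one of the two global hyper-parameters mentioned above (the width multiplier, which uniformly thins every layer); the second hyper-parameter, the resolution multiplier, simply rescales the input image. Layer details here are an illustrative assumption.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, width_mult=1.0, stride=1):
    """Depthwise separable convolution: a per-channel 3x3 depthwise
    convolution followed by a 1x1 pointwise convolution that mixes
    channels, with the width multiplier thinning both channel counts."""
    in_ch = max(1, int(in_ch * width_mult))
    out_ch = max(1, int(out_ch * width_mult))
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```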