Open Access · Posted Content

Subclass Distillation

TLDR
In this article, the teacher is forced to divide each class into subclasses that it invents during supervised training, and the student is trained to match the teacher's subclass probabilities, transferring most of the teacher's generalization ability to the student.
Abstract
After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.
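A minimal sketch of the idea described in the abstract, assuming a standard temperature-scaled distillation loss applied to the subclass logits; the constants (NUM_CLASSES, SUBCLASSES_PER_CLASS, TEMPERATURE) and the way subclass logits are aggregated into class logits are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative settings (assumptions, not the paper's exact values).
NUM_CLASSES = 2            # e.g. a binary click-through task
SUBCLASSES_PER_CLASS = 5   # subclasses the teacher is forced to invent
TEMPERATURE = 4.0          # softening temperature for distillation

def class_logits_from_subclass_logits(subclass_logits):
    """Aggregate subclass logits into class logits by log-sum-exp over
    each class's block of subclasses."""
    blocks = subclass_logits.view(-1, NUM_CLASSES, SUBCLASSES_PER_CLASS)
    return torch.logsumexp(blocks, dim=-1)

def subclass_distillation_loss(student_subclass_logits,
                               teacher_subclass_logits,
                               labels,
                               alpha=0.5):
    """Student matches the teacher's softened subclass probabilities,
    plus an ordinary cross-entropy term on the true class labels."""
    t = TEMPERATURE
    soft_targets = F.softmax(teacher_subclass_logits / t, dim=-1)
    log_student = F.log_softmax(student_subclass_logits / t, dim=-1)
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * t * t

    hard = F.cross_entropy(
        class_logits_from_subclass_logits(student_subclass_logits), labels)
    return alpha * distill + (1.0 - alpha) * hard
```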


Citations
Posted Content

ArcFace: Additive Angular Margin Loss for Deep Face Recognition

TL;DR: This paper proposes an additive angular margin loss (ArcFace) that obtains highly discriminative features for face recognition and has a clear geometric interpretation due to its exact correspondence to geodesic distance on the hypersphere.
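A minimal sketch of the additive angular margin logit described above, assuming L2-normalised features and class weights; the scale s and margin m values are illustrative defaults, not prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weight, labels, s=64.0, m=0.5):
    """Additive angular margin: add m to the angle between each sample's
    feature and its ground-truth class centre, then rescale by s."""
    # Cosine similarity between normalised features and class centres.
    cos = F.normalize(features) @ F.normalize(weight).t()        # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    # Only the ground-truth class gets the additive angular margin m.
    cos_margined = torch.where(target, torch.cos(theta + m), cos)
    return s * cos_margined
```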
Proceedings Article

Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces.

TL;DR: This paper relaxes the intra-class constraint of ArcFace to improve robustness to label noise: it designs K sub-centers for each class, and a training sample only needs to be close to any one of the K positive sub-centers rather than to a single positive center.
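A minimal sketch of the sub-center pooling described above: each class owns K centres and a sample's similarity to a class is its best similarity over those centres. The layout of the sub-center matrix (class-major grouping) is an illustrative assumption; the margin from the previous sketch can then be applied to the pooled cosines.

```python
import torch
import torch.nn.functional as F

def subcenter_cosine(features, subcenters, k):
    """Sub-center pooling: similarity to a class is the maximum cosine
    similarity over that class's k sub-centers."""
    # subcenters: (num_classes * k, dim), grouped class by class.
    cos = F.normalize(features) @ F.normalize(subcenters).t()    # (B, C*k)
    cos = cos.view(features.size(0), -1, k)                      # (B, C, k)
    return cos.max(dim=-1).values                                # (B, C)
```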
Posted Content

No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems

TL;DR: This work proposes GEORGE, a method to both measure and mitigate hidden stratification even when subclass labels are unknown, and theoretically characterizes its performance in terms of the worst-case generalization error across any subclass.
Posted Content

Asking without Telling: Exploring Latent Ontologies in Contextual Representations

TL;DR: This work introduces latent subclass learning (LSL), a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs, extracting emergent structure from input representations in an interpretable and quantifiable form.
Journal Article

Anti-Distillation: Improving Reproducibility of Deep Networks

TL;DR: A novel approach, Anti-Distillation, is proposed to address irreproducibility in deep networks where ensemble models are used to generate predictions; it enhances the benefit of ensembles and makes the final predictions more reproducible.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
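A minimal sketch of the residual idea named above (y = F(x) + x, an identity shortcut added to a small stack of convolutions); the specific layer sizes and use of batch normalisation are illustrative, not a full reproduction of the paper's architectures.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Learns a residual F(x) on top of an identity shortcut, which eases
    optimisation of very deep networks."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)
```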
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: As discussed by the authors, state-of-the-art ImageNet performance was achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
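A compressed sketch of the layer layout just described (five convolutional layers, some followed by max-pooling, then three fully-connected layers ending in 1000-way logits); the exact channel counts and kernel sizes follow the commonly used AlexNet configuration and are assumptions here.

```python
import torch.nn as nn

# For 3x224x224 inputs; logits feed a 1000-way softmax.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)
```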
Book

Elements of information theory

TL;DR: The authors examine the role of entropy, inequality, and randomness in the design and construction of codes.
Dissertation

Learning Multiple Layers of Features from Tiny Images

TL;DR: In this paper, the author describes how to train a multi-layer generative model of natural images, using a dataset of millions of tiny colour images.
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
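A minimal sketch of the MobileNet building block with one of the two global hyper-parameters mentioned above (the width multiplier, which uniformly thins every layer); the second hyper-parameter, the resolution multiplier, simply rescales the input image. Layer details here are an illustrative assumption.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, width_mult=1.0, stride=1):
    """Depthwise separable convolution: a per-channel 3x3 depthwise
    convolution followed by a 1x1 pointwise convolution that mixes
    channels, with the width multiplier thinning both channel counts."""
    in_ch = max(1, int(in_ch * width_mult))
    out_ch = max(1, int(out_ch * width_mult))
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```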