Distilling the Knowledge in a Neural Network

Open Access, Posted Content

Geoffrey E. Hinton, +2 more

- 09 Mar 2015 -

arXiv: Machine Learning

TLDR

This work shows that the acoustic model of a heavily used commercial system can be significantly improved by distilling the knowledge in an ensemble of models into a single model. It also introduces a new type of ensemble, composed of one or more full models and many specialist models that learn to distinguish fine-grained classes that the full models confuse.

Abstract:

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
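
As a rough illustration of the two ideas in the abstract (averaging the predictions of an ensemble, then training a single model to match them), the following minimal Python sketch may help. It assumes the temperature-softened softmax formulation of distillation; the function names, the logits, and the temperature value are hypothetical and chosen only for illustration, and no training loop is shown.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature=1.0):
    """Average the softened class probabilities of several teacher models."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def distillation_loss(student_logits, soft_targets, temperature=1.0):
    """Cross-entropy between the soft targets and the softened student output."""
    q = softmax(student_logits, temperature)
    return -sum(p * math.log(qk) for p, qk in zip(soft_targets, q))

# Hypothetical logits from three teachers for one example with three classes.
teachers = [[5.0, 2.0, 0.1], [4.0, 3.0, 0.2], [6.0, 1.0, 0.3]]
targets = ensemble_soft_targets(teachers, temperature=4.0)

# Hypothetical student logits; a real student would be trained to reduce this loss.
loss = distillation_loss([3.0, 1.0, 0.0], targets, temperature=4.0)
```

A higher temperature flattens the teacher distribution, so the small probabilities assigned to incorrect classes contribute more to the loss that the single model is trained on.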

Citations

Book

Deep Learning

Ian Goodfellow, +2 more

TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.

Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew Howard, +7 more

- 17 Apr 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

Proceedings Article

Xception: Deep Learning with Depthwise Separable Convolutions

François Chollet

TL;DR: This work proposes a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depthwise separable convolutions, and shows that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet dataset, and significantly outperforms it on a larger image classification dataset.

Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Book Chapter

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Liang-Chieh Chen, +4 more

TL;DR: This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

References

Journal Article

Adaptive mixtures of local experts

Robert A. Jacobs, +3 more

- 01 Mar 1991 -

Neural Computation

TL;DR: A new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases; the procedure is shown to divide a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network.

Proceedings Article

Large Scale Distributed Deep Networks

Jeffrey Dean, +11 more

TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Proceedings Article

Model compression

Cristian Buciluǎ, +2 more

TL;DR: This work presents a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in performance.

Proceedings Article

Learning small-size DNN with output-distribution-based criteria.

Jinyu Li, +3 more

TL;DR: This study uses the DNN output distribution to train small-size DNNs, and clusters the senones in the large set into a small one by directly relating the clustering process to DNN parameters, as opposed to decoupling senone generation from DNN training as in standard training.

Related Papers (5)

Deep Residual Learning for Image Recognition

ImageNet: A large-scale hierarchical image database

ImageNet Classification with Deep Convolutional Neural Networks

Very Deep Convolutional Networks for Large-Scale Image Recognition

Adam: A Method for Stochastic Optimization