Open Access Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy
Vol. 1, pp. 448–456
TL;DR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
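
As a rough illustration of the mechanism the abstract describes — normalizing each layer input with the mean and variance of the current training mini-batch, then applying a learned scale and shift — here is a minimal NumPy sketch of the training-time forward pass. The function name batch_norm_forward, the parameters gamma, beta, and eps, and the 2-D activation layout are illustrative assumptions, not the authors' reference implementation; at inference time the paper replaces mini-batch statistics with population estimates.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x     : (batch_size, num_features) activations feeding one layer
    gamma : (num_features,) learned scale
    beta  : (num_features,) learned shift
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learned scale/shift restores expressiveness

# Usage: a mini-batch of 32 examples with 4 features, far from zero mean / unit variance.
x = 3.0 * np.random.randn(32, 4) + 7.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))  # close to 0 and 1 per feature

Because every step is differentiable, the normalization can sit inside the network and be trained jointly with gamma and beta, which is what makes it "a part of the model architecture" as described above.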



Citations
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task (a minimal sketch of the residual idea appears after this citation list).
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
Book

Deep Learning

TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Proceedings Article

Densely Connected Convolutional Networks

TL;DR: DenseNet as mentioned in this paper proposes to connect each layer to every other layer in a feed-forward fashion, which can alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
Proceedings Article

Rethinking the Inception Architecture for Computer Vision

TL;DR: In this article, the authors explore ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
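
For the "Deep Residual Learning for Image Recognition" entries above, the following is a minimal NumPy sketch of a residual block: the block learns a residual function F(x) and adds it back to its input through an identity shortcut. The two-layer form of F, the layer widths, and the ReLU placement are illustrative assumptions, not the cited paper's exact architecture.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)), where F is two linear layers with a ReLU in between."""
    f = relu(x @ w1) @ w2   # the learned residual F(x)
    return relu(x + f)      # identity shortcut: add the input back

# Usage: one block applied to a mini-batch of 8 vectors of width 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w1 = 0.1 * rng.standard_normal((16, 16))
w2 = 0.1 * rng.standard_normal((16, 16))
print(residual_block(x, w1, w2).shape)  # (8, 16)

The identity shortcut means each block only has to learn a correction to its input, which is the property the cited work relies on to train substantially deeper networks.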
References
Posted Content

Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging

TL;DR: In this paper, the authors describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines.
Posted Content

Knowledge Matters: Importance of Prior Information for Optimization

TL;DR: In this paper, the authors explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all of the state-of-the-art machine learning algorithms tested failed to learn.
Proceedings Article

Mean-normalized stochastic gradient for large-scale deep learning

TL;DR: This work proposes a novel second-order stochastic optimization algorithm based on analytic results showing that a non-zero mean of features is harmful to the optimization, and proves convergence of the algorithm in a convex setting.
Posted Content

Natural Neural Networks

TL;DR: Natural Neural Networks as discussed by the authors adapts the internal representation of neural networks during training to improve the conditioning of the Fisher matrix by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network.