Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Open Access · Posted Content

Sergey Ioffe, +1 more

- 11 Feb 2015 -

arXiv: Learning

TLDR

This paper introduces Batch Normalization, which normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, and reports state-of-the-art performance on ImageNet.

Abstract:

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
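
The abstract describes normalizing layer inputs for each training mini-batch. A minimal sketch of what that could look like for one layer in training mode follows. It is written in plain Python and is not the authors' implementation. The abstract does not spell out the details, so the following are assumptions made for illustration: normalization is done per feature using the mini-batch mean and biased variance, a small constant eps guards against division by zero, and a scale (gamma) and shift (beta) are applied afterwards. The function name batch_norm_forward is hypothetical.

```python
import math


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: per-feature scale and shift (assumed parameters).
    eps: small constant for numerical stability (assumed value).
    """
    n = len(batch)
    num_features = len(batch[0])

    # Per-feature mean over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]

    # Per-feature biased variance over the mini-batch.
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]

    # Normalize, then apply the scale and shift.
    out = [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(num_features)
        ]
        for row in batch
    ]
    return out, means, variances


# Two features on very different scales end up on a comparable scale.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized, means, variances = batch_norm_forward(
    batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]
)
```

With gamma set to 1 and beta set to 0, each column of normalized has approximately zero mean and unit variance over the mini-batch, regardless of the original scale of that feature. The sketch covers training mode only; how statistics are chosen at inference time is outside what the abstract states.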

Citations

Book Chapter

SSD: Single Shot MultiBox Detector

Wei Liu, +6 more

TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.

Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew Howard, +7 more

- 17 Apr 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

Journal Article

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Vijay Badrinarayanan, +2 more

- 01 Dec 2017 -

IEEE Transactions on Pattern Analysis an...

TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures, including FCN and DeconvNet.

Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Book Chapter

SSD: Single Shot MultiBox Detector

Wei Liu, +6 more

- 08 Dec 2015 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, and combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

References

Proceedings Article

On the importance of initialization and momentum in deep learning

Ilya Sutskever, +3 more

TL;DR: It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

Proceedings Article

Large Scale Distributed Deep Networks

Jeffrey Dean, +11 more

TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Posted Content

Going Deeper with Convolutions

Christian Szegedy, +8 more

- 17 Sep 2014 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A deep convolutional neural network architecture codenamed Inception is proposed that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Journal Article

Improving predictive inference under covariate shift by weighting the log-likelihood function

Hidetoshi Shimodaira

- 01 Oct 2000 -

Journal of Statistical Planning and Infe...

TL;DR: A class of predictive densities is derived by weighting the observed samples in maximizing the log-likelihood function, effective in cases such as sample surveys or design of experiments, where the observed covariate follows a different distribution than that in the whole population.

Posted Content

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Andrew M. Saxe, +2 more

- 20 Dec 2013 -

arXiv: Neural and Evolutionary Computing

TL;DR: In this paper, the authors show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Related Papers (5)

Deep Residual Learning for Image Recognition

Very Deep Convolutional Networks for Large-Scale Image Recognition

ImageNet Classification with Deep Convolutional Neural Networks

Adam: A Method for Stochastic Optimization

Going deeper with convolutions