Open Access · Posted Content

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TLDR
Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, and achieves state-of-the-art performance on ImageNet.
Abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
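To make the transform the abstract describes concrete, here is a minimal NumPy sketch of the batch-normalization forward pass for a fully connected layer. The learnable scale gamma and shift beta follow the paper's notation; the function name, shapes, and epsilon value are illustrative, not the authors' reference implementation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x:     (batch_size, num_features) activations of one layer
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # restore representational power

# Example: a mini-batch of 32 examples with 4 features.
x = np.random.randn(32, 4) * 3.0 + 7.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))       # ~0 and ~1 per feature
```

At inference time the paper replaces these mini-batch statistics with population estimates (e.g., moving averages accumulated during training), so the output depends deterministically on the input.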


Citations
Posted Content

3D human pose estimation in video with temporal convolutions and semi-supervised training

TL;DR: In this paper, a fully convolutional model based on dilated temporal convolutions over 2D keypoints is proposed to estimate 3D pose in video, and a simple and effective semi-supervised training method that leverages unlabeled video data is introduced.
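As a rough illustration of the building block this model stacks (not the authors' code; the shapes, names, and dilation values below are assumptions):

```python
import numpy as np

def dilated_temporal_conv(x, w, dilation=1):
    """1D convolution over time with a dilated receptive field.

    x: (T, C_in) per-frame features (e.g., flattened 2D keypoints)
    w: (K, C_in, C_out) filter taps
    Returns: (T - (K-1)*dilation, C_out)
    """
    K = w.shape[0]
    T_out = x.shape[0] - (K - 1) * dilation
    out = np.zeros((T_out, w.shape[2]))
    for k in range(K):
        # Tap k looks k*dilation frames further into the sequence.
        out += x[k * dilation : k * dilation + T_out] @ w[k]
    return out

# Stacking layers with growing dilations (1, 3, 9, ...) expands the
# temporal receptive field exponentially with depth.
x = np.random.randn(243, 34)             # 243 frames, 17 joints x 2 coords
w = np.random.randn(3, 34, 64) * 0.1
h = dilated_temporal_conv(x, w, dilation=3)
print(h.shape)                           # (237, 64)
```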
Proceedings Article

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

TL;DR: FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.
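To see why a reduction tree scales better than a central parameter server, here is a toy sketch of tree-structured gradient summation across workers (the worker count and gradient shapes are made up; a real system would overlap communication with computation):

```python
import numpy as np

def tree_reduce(grads):
    """Sum per-worker gradients pairwise in ~log2(n) rounds.

    A parameter server receives all n gradients at one node (O(n)
    traffic there); a reduction tree spreads the work, halving the
    number of active senders each round.
    """
    grads = list(grads)
    while len(grads) > 1:
        paired = []
        for i in range(0, len(grads) - 1, 2):
            paired.append(grads[i] + grads[i + 1])  # one pairwise sum
        if len(grads) % 2:                          # odd worker out
            paired.append(grads[-1])
        grads = paired
    return grads[0]

# 8 workers, each holding a local gradient for the same parameters.
workers = [np.full(4, w, dtype=float) for w in range(8)]
print(tree_reduce(workers))  # elementwise sum over workers, in 3 rounds
```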
Posted Content

Learning Physical Intuition of Block Towers by Example

TL;DR: This paper creates small towers of wooden blocks whose stability is randomized and renders them collapsing (or remaining upright) to train large convolutional network models that can accurately predict the outcome, as well as estimate the block trajectories.
Proceedings Article

Spatially Adaptive Computation Time for Residual Networks

TL;DR: Experimental results show that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets, and that its computation time maps on the visual saliency dataset cat2000 correlate surprisingly well with human eye fixation positions.
Journal Article

Brain-inspired replay for continual learning with artificial neural networks

TL;DR: A replay-based algorithm for deep learning that does not require storing data is proposed, in which internal or hidden representations generated by the network's own, context-modulated feedback connections are replayed; it also provides a novel model for replay in the brain.
References
Journal Article

Gradient-based learning applied to document recognition

TL;DR: In this article, a graph transformer network (GTN) is proposed for document recognition; trained globally with gradient-based learning, it can synthesize a complex decision surface that classifies high-dimensional patterns such as handwritten characters.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
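For concreteness, a minimal sketch of the technique in its common "inverted" form, which rescales at training time (the original paper instead scales weights at test time; the function name, rate, and seed here are illustrative):

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=None):
    """Randomly zero activations with probability p during training.

    Inverted dropout rescales survivors by 1/(1-p), so the layer is
    simply the identity at test time.
    """
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p        # keep each unit w.p. 1-p
    return x * mask / (1.0 - p)

x = np.ones((2, 6))
print(dropout(x, p=0.5, seed=0))  # roughly half the entries zeroed, rest = 2.0
```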
Proceedings Article

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

TL;DR: In this paper, a Parametric Rectified Linear Unit (PReLU) was proposed to improve model fitting with nearly zero extra computational cost and little overfitting risk, achieving a 4.94% top-5 test error on the ImageNet 2012 classification dataset.
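The activation itself is simple to state; a sketch with a fixed slope (in the paper, the slope a is a learned parameter, typically one per channel, which we elide here):

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU: identity for x > 0, slope a for x <= 0.

    With a = 0 this reduces to ReLU; with a small fixed constant it is
    Leaky ReLU. PReLU learns a jointly with the other weights.
    """
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(prelu(x))  # [-0.5  -0.125  0.  0.5  2. ]
```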
Journal Article

Independent component analysis: algorithms and applications

TL;DR: The basic theory and applications of ICA are presented; its goal is to find a linear representation of non-Gaussian data such that the components are statistically independent, or as independent as possible.
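To make that goal concrete, a sketch that unmixes two synthetic sources with scikit-learn's FastICA (the sources, mixing matrix, and seeds below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, non-Gaussian sources observed only as linear mixtures.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),      # square wave
          rng.laplace(size=t.size)]    # heavy-tailed noise
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])             # unknown mixing matrix
X = S @ A.T                            # observed signals

# ICA seeks an unmixing so the recovered components are as
# statistically independent as possible.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)           # estimated sources
print(S_hat.shape)                     # (2000, 2), up to scale and order
```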
Journal Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that can be chosen in hindsight.
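The diagonal form of the resulting AdaGrad update is compact; a minimal sketch (the learning rate, epsilon, and toy objective are our choices):

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=1.0, eps=1e-8):
    """One AdaGrad update with per-coordinate learning rates.

    accum holds the running sum of squared gradients; coordinates with
    a history of large gradients get smaller effective steps.
    """
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([5.0, -3.0])
accum = np.zeros_like(w)
for _ in range(100):
    w, accum = adagrad_step(w, w.copy(), accum)
print(w)  # very close to the minimizer [0, 0]
```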