Open Access Proceedings Article
Towards Understanding Regularization in Batch Normalization
TL;DR: In this paper, a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function, is used to understand the impacts of batch normalization in training neural networks.

Abstract: Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work explains these phenomena theoretically. We analyze BN using a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impacts of BN in three aspects. First, viewing BN as an implicit regularizer, we decompose it into population normalization (PN) and gamma decay, an explicit regularizer. Second, the learning dynamics of BN and this regularization show that training converges with larger maximum and effective learning rates. Third, the generalization of BN is explored using statistical mechanics. Experiments demonstrate that BN in convolutional neural networks shares the same regularization traits as the above analyses predict.
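The basic block analyzed in the abstract (kernel layer, then BN, then a nonlinearity) can be sketched as a single forward pass. This is a minimal illustration, not the paper's code; the function and variable names are chosen here for clarity, and ReLU stands in for the unspecified nonlinear activation.

```python
import numpy as np

def basic_block(x, W, gamma, beta, eps=1e-5):
    """Illustrative forward pass of the basic block from the paper:
    kernel (linear) layer -> batch normalization -> ReLU activation.
    Shapes: x (batch, d_in), W (d_in, d_out), gamma/beta (d_out,)."""
    h = x @ W                                      # kernel layer
    mu = h.mean(axis=0)                            # per-feature batch mean
    var = h.var(axis=0)                            # per-feature batch variance
    h_hat = (h - mu) / np.sqrt(var + eps)          # normalize with batch statistics
    return np.maximum(0.0, gamma * h_hat + beta)   # scale, shift, then ReLU
```

In the paper's decomposition, replacing the batch statistics (`mu`, `var`) with their population counterparts yields population normalization, and the remaining stochastic effect of minibatch estimation acts as gamma decay, an explicit regularizer on the scale parameter.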
Citations
Posted Content
Quantifying Generalization in Reinforcement Learning
TL;DR: This paper investigated the problem of overfitting in deep reinforcement learning by using procedurally generated environments to construct distinct training and test sets, and found that agents overfit to surprisingly large training sets.
Proceedings Article
Differentiable Learning-to-Normalize via Switchable Normalization
TL;DR: Switchable Normalization (SN) is proposed, which learns to select different normalizers for different normalization layers of a deep neural network; it is expected to ease the use and understanding of normalization techniques in deep learning.
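The idea behind Switchable Normalization can be sketched as mixing the statistics of instance, layer, and batch normalization with learned softmax weights. This is a simplified illustration under assumed shapes and names (`w_mean`, `w_var` as length-3 logits); the published method learns these weights end-to-end and differs in detail.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def switchable_norm(x, w_mean, w_var, gamma, beta, eps=1e-5):
    """Illustrative SN for a feature map x of shape (N, C, H, W):
    mix IN, LN, and BN statistics with softmax-normalized weights."""
    mu_in = x.mean(axis=(2, 3), keepdims=True)     # instance norm: per sample, per channel
    var_in = x.var(axis=(2, 3), keepdims=True)
    mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)  # layer norm: per sample
    var_ln = x.var(axis=(1, 2, 3), keepdims=True)
    mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)  # batch norm: per channel, over batch
    var_bn = x.var(axis=(0, 2, 3), keepdims=True)
    pm, pv = softmax(w_mean), softmax(w_var)       # learned mixing weights
    mu = pm[0] * mu_in + pm[1] * mu_ln + pm[2] * mu_bn
    var = pv[0] * var_in + pv[1] * var_ln + pv[2] * var_bn
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

With all logits at zero the three normalizers are weighted equally; training then shifts the weights per layer toward whichever statistics suit that layer best.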
Journal ArticleDOI
A systematic review on overfitting control in shallow and deep neural networks
TL;DR: A systematic review of overfitting control methods is presented, categorizing them into passive, active, and semi-active subsets; it covers the theoretical and experimental backgrounds of these methods, their strengths and weaknesses, and emerging techniques for overfitting detection.
Posted Content
Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
TL;DR: The results highlight the under-appreciated role of the affine parameters in BatchNorm, but - in a broader sense - they characterize the expressive power of neural networks constructed simply by shifting and rescaling random features.
Posted Content
Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks
Soham De, Samuel L. Smith, et al.
TL;DR: This work develops a simple initialization scheme that can train deep residual networks without normalization, and provides a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.