Open Access · Posted Content

Rethinking the limiting dynamics of SGD: modified loss, phase space oscillations, and anomalous diffusion

TLDR
In this article, the authors explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD) and find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent.
Abstract
In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
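The power-law displacement described above can be probed numerically. The following is a minimal sketch, not the authors' code: it runs SGD on a toy linear-regression problem (sizes and hyperparameters are arbitrary), tracks how far the parameters drift after the loss has plateaued, and fits the exponent c in distance ∝ (number of updates)^c; c ≈ 0.5 would indicate ordinary diffusion, while other values indicate anomalous diffusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: y = X @ w_true + noise (illustrative sizes).
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = rng.standard_normal(d)
lr, batch = 0.05, 16

def sgd_step(w):
    idx = rng.choice(n, batch, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    return w - lr * grad

# Burn-in until the loss has roughly converged.
for _ in range(20_000):
    w = sgd_step(w)

# Track displacement from the burn-in point over further training.
w_ref = w.copy()
steps, dists = [], []
for t in range(1, 100_001):
    w = sgd_step(w)
    if t % 1000 == 0:
        steps.append(t)
        dists.append(np.linalg.norm(w - w_ref))

# Fit distance ~ t^c on a log-log scale.
c, _ = np.polyfit(np.log(steps), np.log(dists), 1)
print(f"estimated diffusion exponent c ≈ {c:.2f}")
```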


Citations
Posted Content

Stochastic Training is Not Necessary for Generalization

TL;DR: The authors showed that full-batch gradient descent can achieve strong generalization performance on CIFAR-10 that is on par with SGD, using modern architectures in settings with and without data augmentation.
Posted Content

Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics.

TL;DR: In this article, the authors characterize the noise of stochastic gradients and analyze the noise-induced dynamics during the training of deep neural networks with gradient-based optimizers, showing that the gradient noise is asymptotically Gaussian.
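As a rough illustration of this kind of noise analysis, the sketch below (not the paper's setup) samples minibatch gradients on a toy regression problem and reports the excess kurtosis of the noise in one gradient coordinate; values near zero are consistent with Gaussian noise, large values with heavy tails. It assumes scipy is available.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)

# Toy regression problem standing in for a deep network (assumption).
n, d = 5000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)
w = np.zeros(d)

full_grad = X.T @ (X @ w - y) / n

# Sample many minibatch gradients and look at the noise in one coordinate.
batch, noise = 32, []
for _ in range(5000):
    idx = rng.choice(n, batch, replace=False)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    noise.append(g[0] - full_grad[0])

# Excess kurtosis near 0 suggests Gaussian noise; large values suggest heavy tails.
print("excess kurtosis of gradient noise:", kurtosis(noise))
```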
Posted Content

On Convergence of Training Loss Without Reaching Stationary Points.

TL;DR: In this article, a new perspective based on the ergodic theory of dynamical systems was proposed to explain the convergence of the distribution of weight values to an approximate invariant measure.
Posted Content

Noether's Learning Dynamics: Role of Symmetry Breaking in Neural Networks.

TL;DR: In this article, the authors develop a theoretical framework to study the geometry of learning dynamics in neural networks, and reveal a key mechanism of explicit symmetry breaking behind the efficiency and stability of modern neural networks.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework that eases the training of networks substantially deeper than those used previously and that won 1st place on the ILSVRC 2015 classification task.
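For reference, here is a minimal PyTorch sketch of the basic residual-block idea (an identity shortcut added to a small stack of convolutions); the layer sizes are illustrative, not the exact ILSVRC configuration.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), where F is a small stack of conv layers."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```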

Automatic differentiation in PyTorch

TL;DR: This paper describes the automatic differentiation module of PyTorch, a library designed to enable rapid research on machine learning models; the module performs differentiation of purely imperative programs with a focus on extensibility and low overhead.
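A small example of the imperative, define-by-run differentiation style the module provides (standard PyTorch usage, not code from the paper):

```python
import torch

# Gradients are recorded as ordinary imperative Python code executes.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x1^2 + x2^2 + x3^2
y.backward()         # reverse-mode automatic differentiation
print(x.grad)        # tensor([2., 4., 6.]) == dy/dx = 2x
```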
Posted Content

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

TL;DR: This paper empirically shows that large minibatches cause optimization difficulties on the ImageNet dataset, but that when these difficulties are addressed the trained networks exhibit good generalization, enabling visual recognition models to be trained on internet-scale data with high efficiency.
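The paper's recipe for addressing those difficulties is a linear learning-rate scaling rule combined with a gradual warmup. The sketch below implements such a schedule with illustrative constants (base batch size 256, 5 warmup epochs) and is not the authors' training code.

```python
def large_batch_lr(epoch: int, batch_size: int,
                   base_lr: float = 0.1, base_batch: int = 256,
                   warmup_epochs: int = 5) -> float:
    """Linear scaling rule with gradual warmup (illustrative constants)."""
    target_lr = base_lr * batch_size / base_batch  # lr scales linearly with batch size
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to the scaled target over the warmup epochs.
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

for epoch in range(8):
    print(epoch, round(large_batch_lr(epoch, batch_size=8192), 3))
```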
Journal ArticleDOI

On the momentum term in gradient descent learning algorithms

TL;DR: Bounds on the learning-rate and momentum parameters required for convergence are derived, and it is demonstrated that the momentum term can increase the range of learning rates over which the system converges.
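The update rule analyzed (heavy-ball momentum) can be written in a few lines; this is a generic sketch of the rule on a quadratic loss, not the paper's analysis code.

```python
import numpy as np

def momentum_gd(grad, w0, lr=0.01, mu=0.9, steps=1000):
    """Heavy-ball momentum: v <- mu * v - lr * grad(w); w <- w + v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)
        w = w + v
    return w

# Quadratic bowl f(w) = 0.5 * w^T A w with gradient A w.
A = np.diag([1.0, 10.0])
w_star = momentum_gd(lambda w: A @ w, w0=[5.0, 5.0])
print(w_star)  # close to the minimizer [0, 0]
```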
Proceedings Article

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

TL;DR: This paper introduces the Neural Tangent Kernel formalism and gives a number of results that provide insight into the dynamics of neural networks during training and into their generalization properties.
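As a concrete illustration of the object involved, the sketch below computes one entry of the empirical Neural Tangent Kernel, the inner product of parameter gradients of the network output at two inputs, for a tiny hypothetical network using standard PyTorch autograd; it is not the paper's infinite-width analysis.

```python
import torch
import torch.nn as nn

# Hypothetical tiny scalar-output network (illustrative sizes).
net = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))

def param_grad(x):
    """Flattened gradient of the scalar output f(x) with respect to all parameters."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(3), torch.randn(3)
g1, g2 = param_grad(x1), param_grad(x2)
# Empirical NTK entry Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>.
print("empirical NTK entry:", torch.dot(g1, g2).item())
```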