Open Access · Posted Content

Stochastic Gradient Descent with Large Learning Rate.

TLDR
The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum.
Abstract
As a simple and efficient optimization method in deep learning, stochastic gradient descent (SGD) has attracted tremendous attention. In the vanishing learning rate regime, SGD is now relatively well understood, and the majority of theoretical approaches to SGD set their assumptions in the continuous-time limit. However, the continuous-time predictions are unlikely to reflect the experimental observations well, because practice often runs in the large learning rate regime, where training is faster and the generalization of models is often better. In this paper, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and relating them to experimental observations. The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of mini-batch noise, the escape rate from a sharp minimum, and the stationary distribution of a few second-order methods.
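To make the discrete-time, large-learning-rate setting concrete, the following is a minimal numerical sketch, assuming a one-dimensional quadratic loss with additive Gaussian gradient noise (an illustrative simplification, not necessarily the paper's exact noise model). Under this assumption the SGD iterates form an AR(1) process whose stationary variance can be compared with the continuous-time prediction.

```python
import numpy as np

# Hedged sketch: discrete-time SGD on a 1-D quadratic loss L(theta) = 0.5 * k * theta^2
# with additive Gaussian gradient noise (an illustrative noise model, not necessarily
# the paper's exact minibatch-noise setting). The iterate is an AR(1) process,
#   theta_{t+1} = (1 - lr * k) * theta_t - lr * eps_t,   eps_t ~ N(0, sigma^2),
# whose stationary variance is lr**2 * sigma**2 / (1 - (1 - lr * k)**2)
#                            = lr * sigma**2 / (k * (2 - lr * k)),
# reducing to the familiar continuous-time value lr * sigma**2 / (2 * k) only when lr * k << 1.

rng = np.random.default_rng(0)
k, sigma, lr = 1.0, 1.0, 1.5          # large learning rate: lr * k is O(1), but |1 - lr * k| < 1
theta, samples = 5.0, []
for t in range(200_000):
    grad = k * theta + sigma * rng.standard_normal()
    theta = theta - lr * grad
    if t > 10_000:                     # discard burn-in before measuring the stationary variance
        samples.append(theta)

print("empirical variance   :", np.var(samples))
print("discrete-time theory :", lr * sigma**2 / (k * (2 - lr * k)))
print("continuous-time limit:", lr * sigma**2 / (2 * k))
```

With lr * k = 1.5 in this toy setting, the discrete-time stationary variance (3.0) is four times the continuous-time value (0.75), which illustrates why vanishing-learning-rate predictions can miss large-learning-rate behavior.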


Citations
Journal Article (DOI)

Stochastic Processes in Physics and Chemistry

D Sherrington
- 01 Apr 1983
TL;DR: Van Kampen provides an extensive graduate-level introduction that is clear, cautious, interesting and readable, and can be expected to become an essential part of the library of every physical scientist concerned with problems involving fluctuations and stochastic processes.
Posted Content

Meta-LR-Schedule-Net: Learned LR Schedules that Scale and Generalize.

TL;DR: This work designs a meta-learner with an explicit mapping formulation to parameterize LR schedules, which can adjust the LR adaptively to comply with the current training dynamics by leveraging information from past training histories.
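As a rough, hypothetical sketch of the general idea only (a small network mapping recent training statistics to a positive learning-rate multiplier), one could write something like the following; the input features and architecture are assumptions for illustration, not the actual Meta-LR-Schedule-Net.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the general idea only: a tiny meta-network that maps a short
# window of recent training losses to a positive learning-rate multiplier. The input
# features and architecture are illustrative assumptions, not the actual
# Meta-LR-Schedule-Net from the cited paper.
class ToyLRScheduleNet(nn.Module):
    def __init__(self, history_len: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_len, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Softplus(),   # Softplus keeps the multiplier positive
        )

    def forward(self, loss_history: torch.Tensor) -> torch.Tensor:
        return self.net(loss_history)

# Usage: scale a base learning rate by the predicted multiplier each step.
meta = ToyLRScheduleNet()
recent_losses = torch.rand(10)            # stand-in for the last 10 training losses
lr = 0.1 * meta(recent_losses).item()
```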
Posted Content

On the Distributional Properties of Adaptive Gradients

Zhang Zhiyi, +1 more
- 15 May 2021
TL;DR: In this article, it was shown that the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge, contrary to what is believed in the current literature.
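A hedged way to probe a claim of this kind empirically is to run many independent Adam trajectories on a toy noisy objective and track the variance of the update magnitude across runs at each step; the toy objective and hyperparameters below are assumptions chosen only for illustration.

```python
import numpy as np

# Hedged sketch: empirically track how the variance (across independent runs) of the
# Adam update magnitude |delta_t| evolves with the step t, on a noisy 1-D quadratic.
rng = np.random.default_rng(0)
runs, steps = 2000, 200
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

theta = np.zeros(runs)
m = np.zeros(runs)
v = np.zeros(runs)
update_mag = np.zeros((steps, runs))
for t in range(1, steps + 1):
    grad = theta + rng.standard_normal(runs)          # noisy gradient of 0.5 * theta^2
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    delta = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta -= delta
    update_mag[t - 1] = np.abs(delta)

print(np.var(update_mag[[0, 9, 49, 199]], axis=1))    # variance at steps 1, 10, 50, 200
```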
Posted Content

Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis.

TL;DR: In this paper, a continuous-time model is proposed for stochastic gradient descent with noise that follows the machine learning scaling, in which the optimization algorithm prefers flat minima of the objective function in a sense that differs from the flat-minima selection of continuous-time SGD with homogeneous noise.
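As a hedged illustration of such a state-dependent noise model, the following Euler-Maruyama sketch simulates an SGD-like SDE whose noise scale is tied to the loss value and therefore vanishes at the global minimum; the specific scaling used here is an assumption, not necessarily the model analyzed in the cited paper.

```python
import numpy as np

# Hedged Euler-Maruyama sketch of an SGD-like SDE with state-dependent ("machine
# learning type") noise whose scale vanishes at the global minimum. The scaling
# sigma(theta) = sqrt(2 * lr * L(theta)) is an illustrative assumption.
rng = np.random.default_rng(0)

def loss(theta, k=1.0):
    return 0.5 * k * theta**2

def grad(theta, k=1.0):
    return k * theta

lr, dt, steps = 0.1, 1e-2, 50_000
theta = 2.0
for _ in range(steps):
    noise_scale = np.sqrt(2.0 * lr * loss(theta))
    theta += -grad(theta) * dt + noise_scale * np.sqrt(dt) * rng.standard_normal()

print("final theta:", theta, "| final noise scale:", np.sqrt(2.0 * lr * loss(theta)))
```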
Posted Content

Strength of Minibatch Noise in SGD

TL;DR: This paper showed that some degree of mismatch between model and data complexity is needed for SGD to "stir" noise; such a mismatch may be due to label or input noise, regularization, or underparametrization.
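A hedged toy illustration of this point: at the full-batch minimizer of a linear regression, minibatch gradients carry essentially no noise when the model fits the data exactly, but become noisy once label noise introduces a model/data mismatch. The setup below is chosen for clarity and is not the cited paper's experimental setting.

```python
import numpy as np

# Toy check: minibatch gradient noise at the full-batch minimizer of linear regression,
# with and without label noise. Without label noise the per-sample gradients at the
# minimizer are all zero, so minibatch gradients carry no noise.
rng = np.random.default_rng(0)
n, d, batch = 1000, 5, 32
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)

def minibatch_grad_std(y):
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)      # full-batch minimizer
    grads = []
    for _ in range(500):
        idx = rng.choice(n, batch, replace=False)
        residual = X[idx] @ w_star - y[idx]
        grads.append(X[idx].T @ residual / batch)        # minibatch gradient at w_star
    return np.linalg.norm(np.std(grads, axis=0))

print("no label noise :", minibatch_grad_std(X @ w_true))
print("label noise    :", minibatch_grad_std(X @ w_true + 0.5 * rng.standard_normal(n)))
```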
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting models won first place in the ILSVRC 2015 classification task.
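A minimal sketch of the residual-learning idea (a block learns a residual function and adds it back to its input through a skip connection); channel counts and layer choices here are illustrative, not the exact ResNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of residual learning: the block learns a residual F(x) and adds it
# back to its input via a skip connection, so very deep stacks remain easy to optimize.
class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                 # skip connection: output = F(x) + x

x = torch.randn(1, 64, 32, 32)
print(BasicResidualBlock(64)(x).shape)         # torch.Size([1, 64, 32, 32])
```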
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
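A minimal sketch of the Adam update rule summarized above, using bias-corrected first- and second-moment estimates and the paper's default hyperparameters; the toy objective is arbitrary and only for illustration.

```python
import numpy as np

# Minimal sketch of the Adam update: exponential moving averages of the gradient and
# its square, bias-corrected, then a per-coordinate rescaled step.
def adam(grad_fn, theta, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * g**2           # second-moment (uncentered variance) estimate
        m_hat = m / (1 - b1**t)                # bias correction
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize a simple quadratic, 0.5 * ||theta||^2.
print(adam(lambda th: th, theta=np.array([1.0, -2.0]), steps=5000))
```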
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: The authors achieved state-of-the-art ImageNet classification performance with a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
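A hedged sketch of the high-level layout described above (five convolutional layers, some followed by max-pooling, then three fully-connected layers ending in a 1000-way classifier); the channel counts and kernel sizes are simplified placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hedged sketch of the described layout: five conv layers (some followed by max-pooling),
# then three fully-connected layers producing 1000 class scores (softmax applied in the loss).
toy_alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

print(toy_alexnet_like(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```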
Book

Deep Learning

TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and to understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Journal Article (DOI)

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Trending Questions (1)
How can I use the Stochastic Gradient Descent algorithm with a learning rate in deep learning?

The paper discusses the properties of Stochastic Gradient Descent (SGD) with a non-vanishing learning rate in deep learning, but does not provide specific instructions on how to use the algorithm with a learning rate.
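For readers looking for a practical starting point, the following is a generic, minimal example (not taken from the paper) of running SGD with an explicit learning rate on a toy model in PyTorch; the model, data, and hyperparameters are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

# Generic example (not from the paper): minibatch SGD with an explicit learning rate
# and momentum on a toy regression model.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()

X = torch.randn(256, 10)
y = torch.randn(256, 1)

for epoch in range(20):
    for i in range(0, len(X), 32):                 # minibatches of size 32
        xb, yb = X[i:i + 32], y[i:i + 32]
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                            # stochastic gradient from the minibatch
        optimizer.step()                           # theta <- theta - lr * (momentum-adjusted) grad

print("final loss:", loss.item())
```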