Open Access · Posted Content

Stochastic Gradient Descent with Large Learning Rate.

TLDR
The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum.
Abstract
As a simple and efficient optimization method in deep learning, stochastic gradient descent (SGD) has attracted tremendous attention. In the vanishing learning rate regime, SGD is now relatively well understood, and the majority of theoretical approaches to SGD set their assumptions in the continuous-time limit. However, the continuous-time predictions are unlikely to reflect the experimental observations well, because practice often runs in the large learning rate regime, where training is faster and the generalization of models is often better. In this paper, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and relating them to experimental observations. The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of mini-batch noise, the escape rate from a sharp minimum, and the stationary distribution of a few second-order methods.
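To make the discrete-time, large-learning-rate setting concrete, the following is a minimal numerical sketch, assuming a one-dimensional quadratic loss with additive Gaussian gradient noise (an illustrative simplification, not necessarily the paper's exact noise model). Under this assumption the SGD iterates form an AR(1) process whose stationary variance can be compared with the continuous-time prediction.

```python
import numpy as np

# Hedged sketch: discrete-time SGD on a 1-D quadratic loss L(theta) = 0.5 * k * theta^2
# with additive Gaussian gradient noise (an illustrative noise model, not necessarily
# the paper's exact minibatch-noise setting). The iterate is an AR(1) process,
#   theta_{t+1} = (1 - lr * k) * theta_t - lr * eps_t,   eps_t ~ N(0, sigma^2),
# whose stationary variance is lr**2 * sigma**2 / (1 - (1 - lr * k)**2)
#                            = lr * sigma**2 / (k * (2 - lr * k)),
# reducing to the familiar continuous-time value lr * sigma**2 / (2 * k) only when lr * k << 1.

rng = np.random.default_rng(0)
k, sigma, lr = 1.0, 1.0, 1.5          # large learning rate: lr * k is O(1), but |1 - lr * k| < 1
theta, samples = 5.0, []
for t in range(200_000):
    grad = k * theta + sigma * rng.standard_normal()
    theta = theta - lr * grad
    if t > 10_000:                     # discard burn-in before measuring the stationary variance
        samples.append(theta)

print("empirical variance   :", np.var(samples))
print("discrete-time theory :", lr * sigma**2 / (k * (2 - lr * k)))
print("continuous-time limit:", lr * sigma**2 / (2 * k))
```

With lr * k = 1.5 in this toy setting, the discrete-time stationary variance (3.0) is four times the continuous-time value (0.75), which illustrates why vanishing-learning-rate predictions can miss large-learning-rate behavior.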


Citations
Journal Article (DOI)

Stochastic Processes in Physics and Chemistry

D Sherrington
- 01 Apr 1983
TL;DR: Van Kampen provides an extensive graduate-level introduction that is clear, cautious, interesting and readable, and can be expected to become an essential part of the library of every physical scientist concerned with problems involving fluctuations and stochastic processes.
Posted Content

Meta-LR-Schedule-Net: Learned LR Schedules that Scale and Generalize.

TL;DR: This work designs a meta-learner with an explicit mapping formulation to parameterize LR schedules, which can adjust the LR adaptively to comply with the current training dynamics by leveraging information from past training histories.
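As a rough, hypothetical sketch of the general idea only (a small network mapping recent training statistics to a positive learning-rate multiplier), one could write something like the following; the input features and architecture are assumptions for illustration, not the actual Meta-LR-Schedule-Net.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the general idea only: a tiny meta-network that maps a short
# window of recent training losses to a positive learning-rate multiplier. The input
# features and architecture are illustrative assumptions, not the actual
# Meta-LR-Schedule-Net from the cited paper.
class ToyLRScheduleNet(nn.Module):
    def __init__(self, history_len: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_len, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Softplus(),   # Softplus keeps the multiplier positive
        )

    def forward(self, loss_history: torch.Tensor) -> torch.Tensor:
        return self.net(loss_history)

# Usage: scale a base learning rate by the predicted multiplier each step.
meta = ToyLRScheduleNet()
recent_losses = torch.rand(10)            # stand-in for the last 10 training losses
lr = 0.1 * meta(recent_losses).item()
```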
Posted Content

On the Distributional Properties of Adaptive Gradients

Zhang Zhiyi, +1 more
- 15 May 2021
TL;DR: In this article, it was shown that the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge, contrary to what is believed in the current literature.
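A hedged way to probe a claim of this kind empirically is to run many independent Adam trajectories on a toy noisy objective and track the variance of the update magnitude across runs at each step; the toy objective and hyperparameters below are assumptions chosen only for illustration.

```python
import numpy as np

# Hedged sketch: empirically track how the variance (across independent runs) of the
# Adam update magnitude |delta_t| evolves with the step t, on a noisy 1-D quadratic.
rng = np.random.default_rng(0)
runs, steps = 2000, 200
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

theta = np.zeros(runs)
m = np.zeros(runs)
v = np.zeros(runs)
update_mag = np.zeros((steps, runs))
for t in range(1, steps + 1):
    grad = theta + rng.standard_normal(runs)          # noisy gradient of 0.5 * theta^2
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    delta = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta -= delta
    update_mag[t - 1] = np.abs(delta)

print(np.var(update_mag[[0, 9, 49, 199]], axis=1))    # variance at steps 1, 10, 50, 200
```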
Posted Content

Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis.

TL;DR: In this paper, a continuous-time model is proposed for stochastic gradient descent with noise that follows the machine learning scaling, in which the optimization algorithm prefers flat minima of the objective function in a sense that differs from the flat-minima selection of continuous-time SGD with homogeneous noise.
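As a hedged illustration of such a state-dependent noise model, the following Euler-Maruyama sketch simulates an SGD-like SDE whose noise scale is tied to the loss value and therefore vanishes at the global minimum; the specific scaling used here is an assumption, not necessarily the model analyzed in the cited paper.

```python
import numpy as np

# Hedged Euler-Maruyama sketch of an SGD-like SDE with state-dependent ("machine
# learning type") noise whose scale vanishes at the global minimum. The scaling
# sigma(theta) = sqrt(2 * lr * L(theta)) is an illustrative assumption.
rng = np.random.default_rng(0)

def loss(theta, k=1.0):
    return 0.5 * k * theta**2

def grad(theta, k=1.0):
    return k * theta

lr, dt, steps = 0.1, 1e-2, 50_000
theta = 2.0
for _ in range(steps):
    noise_scale = np.sqrt(2.0 * lr * loss(theta))
    theta += -grad(theta) * dt + noise_scale * np.sqrt(dt) * rng.standard_normal()

print("final theta:", theta, "| final noise scale:", np.sqrt(2.0 * lr * loss(theta)))
```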
Posted Content

Strength of Minibatch Noise in SGD

TL;DR: This paper showed that some degree of mismatch between model and data complexity is needed for SGD to "stir" noise; such a mismatch may be due to label or input noise, regularization, or underparametrization.
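A hedged toy illustration of this point: at the full-batch minimizer of a linear regression, minibatch gradients carry essentially no noise when the model fits the data exactly, but become noisy once label noise introduces a model/data mismatch. The setup below is chosen for clarity and is not the cited paper's experimental setting.

```python
import numpy as np

# Toy check: minibatch gradient noise at the full-batch minimizer of linear regression,
# with and without label noise. Without label noise the per-sample gradients at the
# minimizer are all zero, so minibatch gradients carry no noise.
rng = np.random.default_rng(0)
n, d, batch = 1000, 5, 32
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)

def minibatch_grad_std(y):
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)      # full-batch minimizer
    grads = []
    for _ in range(500):
        idx = rng.choice(n, batch, replace=False)
        residual = X[idx] @ w_star - y[idx]
        grads.append(X[idx].T @ residual / batch)        # minibatch gradient at w_star
    return np.linalg.norm(np.std(grads, axis=0))

print("no label noise :", minibatch_grad_std(X @ w_true))
print("label noise    :", minibatch_grad_std(X @ w_true + 0.5 * rng.standard_normal(n)))
```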
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting models won first place in the ILSVRC 2015 classification task.
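A minimal sketch of the residual-learning idea (a block learns a residual function and adds it back to its input through a skip connection); channel counts and layer choices here are illustrative, not the exact ResNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of residual learning: the block learns a residual F(x) and adds it
# back to its input via a skip connection, so very deep stacks remain easy to optimize.
class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                 # skip connection: output = F(x) + x

x = torch.randn(1, 64, 32, 32)
print(BasicResidualBlock(64)(x).shape)         # torch.Size([1, 64, 32, 32])
```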
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
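A minimal sketch of the Adam update rule summarized above, using bias-corrected first- and second-moment estimates and the paper's default hyperparameters; the toy objective is arbitrary and only for illustration.

```python
import numpy as np

# Minimal sketch of the Adam update: exponential moving averages of the gradient and
# its square, bias-corrected, then a per-coordinate rescaled step.
def adam(grad_fn, theta, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * g**2           # second-moment (uncentered variance) estimate
        m_hat = m / (1 - b1**t)                # bias correction
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize a simple quadratic, 0.5 * ||theta||^2.
print(adam(lambda th: th, theta=np.array([1.0, -2.0]), steps=5000))
```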
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: The authors achieved state-of-the-art ImageNet classification performance with a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
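A hedged sketch of the high-level layout described above (five convolutional layers, some followed by max-pooling, then three fully-connected layers ending in a 1000-way classifier); the channel counts and kernel sizes are simplified placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hedged sketch of the described layout: five conv layers (some followed by max-pooling),
# then three fully-connected layers producing 1000 class scores (softmax applied in the loss).
toy_alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

print(toy_alexnet_like(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```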
Book

Deep Learning

TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and to understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Journal Article (DOI)

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Trending Questions (1)
How can I use the Stochastic Gradient Descent algorithm with a learning rate in deep learning?

The paper discusses the properties of Stochastic Gradient Descent (SGD) with a non-vanishing learning rate in deep learning, but does not provide specific instructions on how to use the algorithm with a learning rate.
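For readers looking for a practical starting point, the following is a generic, minimal example (not taken from the paper) of running SGD with an explicit learning rate on a toy model in PyTorch; the model, data, and hyperparameters are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

# Generic example (not from the paper): minibatch SGD with an explicit learning rate
# and momentum on a toy regression model.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()

X = torch.randn(256, 10)
y = torch.randn(256, 1)

for epoch in range(20):
    for i in range(0, len(X), 32):                 # minibatches of size 32
        xb, yb = X[i:i + 32], y[i:i + 32]
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                            # stochastic gradient from the minibatch
        optimizer.step()                           # theta <- theta - lr * (momentum-adjusted) grad

print("final loss:", loss.item())
```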