Open Access Posted Content

Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis.

TL;DR
This paper proposes a continuous-time model for stochastic gradient descent with noise that follows the 'machine learning scaling' and shows that, in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat-minimum selection of continuous-time SGD with homogeneous noise.
Abstract
The representation of functions by artificial neural networks depends on a large number of parameters in a non-linear fashion. Suitable parameters are found by minimizing a 'loss functional', typically by stochastic gradient descent (SGD) or an advanced SGD-based algorithm. In a continuous time model for SGD with noise that follows the 'machine learning scaling', we show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.
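To make the contrast concrete, here is a minimal sketch of the two continuous-time models being compared. The exact form of the loss-dependent diffusion coefficient is an illustrative assumption on our part (noise intensity vanishing on the set of global minimizers, as in overparametrized interpolation problems), not a formula quoted from the paper.

```latex
% Gradient flow perturbed by homogeneous noise (classical model) versus
% noise of "machine learning type", whose intensity is tied to the
% objective value f(theta_t) and therefore vanishes where f = 0.
% The scaling sqrt(eta * f) is an illustrative assumption.
\[
  d\theta_t = -\nabla f(\theta_t)\,dt + \sqrt{\eta}\,\sigma\,dW_t
  \qquad \text{vs.} \qquad
  d\theta_t = -\nabla f(\theta_t)\,dt + \sqrt{\eta\,f(\theta_t)}\,\sigma(\theta_t)\,dW_t .
\]
```

Roughly, because the noise in the second model degenerates on the zero set of f, the behaviour near a minimum depends on how the noise vanishes there, which leads to a notion of 'flatness' different from the one selected under homogeneous noise.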


Citations
Posted Content

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis.

TL;DR: In particular, the authors show that the learning rate in SGD with noise of machine learning type can be chosen to be small, but uniformly positive for all times, if the energy landscape resembles that of overparametrized deep learning problems.
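A minimal numerical sketch of the mechanism referred to here (our own toy example, not code from the paper): in an overparametrized least-squares problem that can be interpolated exactly, the single-sample gradient noise shrinks with the loss, so a small but fixed learning rate drives the loss to zero without any step-size decay.

```python
# Hedged illustration (not code from the cited paper): in an overparametrized
# interpolation problem, the stochastic-gradient noise scales with the loss
# itself, so it vanishes at a global minimizer and a small *constant*
# learning rate can be used for all times.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                      # fewer samples than parameters (overparametrized)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)         # an interpolating solution X @ theta = y exists

theta = np.zeros(d)
eta = 0.01                         # small but uniformly positive step size

for step in range(20000):
    i = rng.integers(n)            # single-sample stochastic gradient
    grad = (X[i] @ theta - y[i]) * X[i]
    theta -= eta * grad

loss = 0.5 * np.mean((X @ theta - y) ** 2)
print(f"final mean-squared loss: {loss:.2e}")   # ~0: the noise dies out at the interpolating minimum
```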
Posted Content

On minimal representations of shallow ReLU networks.

TL;DR: In this article, the authors study minimal representations of shallow ReLU networks, which realize continuous and piecewise affine functions whose affine regions are delimited by a set of hyperplanes, one per neuron.
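For context, a hedged sketch of the object being discussed (notation ours, not the authors'): a shallow ReLU network realizes a continuous, piecewise affine function, and each hidden neuron contributes one hyperplane across which the affine behaviour can change.

```latex
% A shallow (one-hidden-layer) ReLU network with n hidden neurons realizes
% the continuous, piecewise affine function
\[
  f(x) \;=\; \sum_{i=1}^{n} a_i \,\max\{\,w_i^\top x + b_i,\;0\,\} \;+\; c^\top x + d ,
\]
% whose affine pieces are separated by the hyperplanes
% H_i = { x in R^d : w_i^T x + b_i = 0 }, one per neuron.
```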
Posted Content

SGD May Never Escape Saddle Points

TL;DR: The authors show that SGD may escape a saddle point arbitrarily slowly, that SGD may prefer sharp minima over flat ones, and that AMSGrad may converge to a local maximum, suggesting that the noise structure of SGD may be more important than the loss landscape in neural network training.
References
Book

Partial Differential Equations

TL;DR: This book presents the theory of linear PDEs (Sobolev spaces, second-order elliptic equations, linear evolution equations) together with nonlinear topics such as Hamilton-Jacobi equations and systems of conservation laws.
Book

Elliptic Partial Differential Equations of Second Order

TL;DR: This book treats second-order elliptic partial differential equations, covering the Dirichlet problem for Poisson's equation, maximum and comparison principles, Harnack inequalities, the Leray-Schauder fixed point theorem, and equations in divergence form.
Journal Article

A Stochastic Approximation Method

TL;DR: In this article, the authors present a method for making successive experiments at levels x1, x2, ··· in such a way that xn tends in probability to θ, the root of the regression equation M(x) = α.
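For reference, a hedged sketch of the recursion in question (standard textbook form; notation ours): to locate the root θ of the regression equation M(x) = α from noisy responses y_n observed at levels x_n, one iterates as follows.

```latex
% Robbins-Monro recursion: y_n is the noisy response observed at level x_n,
% and the step sizes a_n are square-summable but not summable.
\[
  x_{n+1} \;=\; x_n \;+\; a_n\,(\alpha - y_n),
  \qquad
  \sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n^{2} < \infty ,
\]
% under which x_n tends to theta in probability.
```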
Book

Brownian Motion and Stochastic Calculus

TL;DR: In this book, the authors develop stochastic calculus for Brownian motion and continuous local martingales, including the strong Markov property and a generalized version of the Itô rule.
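Since the TL;DR mentions a generalized Itô rule, here is the standard one-dimensional statement for a continuous semimartingale X (our paraphrase of the classical formula, not a quotation from the book).

```latex
% Ito's rule: for f of class C^2 and a continuous semimartingale X with
% quadratic variation <X>,
\[
  f(X_t) \;=\; f(X_0) \;+\; \int_0^t f'(X_s)\,dX_s \;+\; \tfrac{1}{2}\int_0^t f''(X_s)\,d\langle X \rangle_s .
\]
```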
Journal Article

Optimization Methods for Large-Scale Machine Learning

TL;DR: The authors provide a review of and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications, discussing how optimization problems arise in machine learning and what makes them challenging.