Open Access · Posted Content

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

TLDR
This article shows that gradient descent converges at a global linear rate to the global optimum for two-layer fully connected ReLU activated neural networks, where over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations.
Abstract
One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show that, as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first-order methods.
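The setting in the abstract can be illustrated with a small numerical sketch (an illustration only, not the paper's experiments or proof): a width-$m$ two-layer ReLU network with a fixed random-sign output layer, trained on the quadratic loss by full-batch gradient descent. All sizes, the step size, and the synthetic data below are arbitrary choices; the point is simply to watch the two quantities the analysis tracks, the training loss and the distance of each hidden weight from its initialization.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: n unit-norm inputs in d dimensions (no two parallel, w.h.p.).
    n, d, m = 10, 10, 2000            # m >> n: over-parameterized width (arbitrary choice)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.normal(size=n)

    # Two-layer ReLU net: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x).
    W0 = rng.normal(size=(m, d))            # random initialization of hidden weights
    a = rng.choice([-1.0, 1.0], size=m)     # fixed output layer of random signs
    W = W0.copy()

    def forward(W):
        H = np.maximum(X @ W.T, 0.0)        # (n, m) hidden activations
        return H @ a / np.sqrt(m)           # network outputs on the training set

    eta = 0.3   # step size (arbitrary choice for this toy problem)
    for t in range(400):
        pred = forward(W)
        err = pred - y                      # residuals of the quadratic loss
        # Gradient of 0.5 * ||pred - y||^2 with respect to each hidden weight w_r.
        act = (X @ W.T > 0.0).astype(float)                            # ReLU derivative, (n, m)
        grad = ((err[:, None] * act) * a[None, :]).T @ X / np.sqrt(m)  # (m, d)
        W -= eta * grad
        if t % 100 == 0:
            loss = 0.5 * np.sum(err ** 2)
            drift = np.max(np.linalg.norm(W - W0, axis=1))
            print(f"iter {t:3d}  loss {loss:.3e}  max ||w_r - w_r(0)|| {drift:.3e}")

In typical runs of this sketch the loss decays geometrically while the maximum weight drift stays small, which is the qualitative behavior the paper's analysis predicts for sufficiently large m.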


Citations
Posted Content

Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

TL;DR: Suggests an approach for selecting problem-specific Fourier features that greatly improves the performance of MLPs on low-dimensional regression tasks relevant to the computer vision and graphics communities.
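The mapping this summary refers to is a random Fourier feature embedding applied to low-dimensional coordinates before they enter an MLP. A minimal sketch of such a mapping is below; the Gaussian sampling of the frequency matrix and the particular bandwidth and feature count are illustrative choices, not the paper's tuned settings.

    import numpy as np

    rng = np.random.default_rng(0)

    def fourier_features(x, B):
        """Map low-dimensional coordinates x (n, d) to [cos(2*pi*x B^T), sin(2*pi*x B^T)]."""
        proj = 2.0 * np.pi * x @ B.T
        return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

    d, num_feats, scale = 2, 64, 10.0                  # input dim, feature count, bandwidth (arbitrary)
    B = rng.normal(scale=scale, size=(num_feats, d))   # frequencies sampled from a Gaussian

    x = rng.uniform(size=(5, d))            # e.g. 2-D pixel coordinates in [0, 1]^2
    z = fourier_features(x, B)              # (5, 2 * num_feats) features fed to an MLP regressor
    print(z.shape)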
Posted Content

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

TL;DR: This paper recovers, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
Journal Article (DOI)

Benign overfitting in linear regression

TL;DR: A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
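The two preceding entries both concern the minimum-ℓ2-norm ("ridgeless") least squares interpolator in the overparameterized regime. Below is a minimal sketch of that estimator on a hypothetical random design; the dimensions and noise level are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 50, 200                      # overparameterized: more features than samples
    X = rng.normal(size=(n, p))
    beta_true = rng.normal(size=p) / np.sqrt(p)
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Minimum-norm ("ridgeless") interpolator: beta = X^+ y = X^T (X X^T)^{-1} y.
    beta_hat = np.linalg.pinv(X) @ y

    print("training residual:", np.linalg.norm(X @ beta_hat - y))   # ~0: it interpolates
    print("parameter error  :", np.linalg.norm(beta_hat - beta_true))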
Posted Content

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

TL;DR: This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
Posted Content

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

TL;DR: This article showed that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a mild assumption on the training data.
References
Posted Content

Understanding deep learning requires rethinking generalization

TL;DR: The authors showed that deep neural networks can fit a random labeling of the training data, and that this phenomenon is qualitatively unaffected by explicit regularization, and occurs even if the true images are replaced by completely unstructured random noise.
Proceedings Article

Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition

TL;DR: In this article, the authors show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations for orthogonal tensor decomposition.
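A toy illustration of the escaping-saddle phenomenon (the paper's actual setting is orthogonal tensor decomposition, not this two-dimensional function): plain gradient descent started exactly at a strict saddle point never moves, while adding small Gaussian noise to each step, in the spirit of the paper's noisy stochastic gradient method, lets the iterate escape toward a local minimum. The step size and noise scale below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy nonconvex objective f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2:
    # strict saddle at the origin, local minima at (+1, 0) and (-1, 0).
    def grad(v):
        x, y = v
        return np.array([x * (x**2 - 1.0), y])

    eta, sigma, steps = 0.1, 1e-2, 300    # step size and noise scale (arbitrary choices)

    v_gd = np.zeros(2)                    # plain gradient descent, started at the saddle
    v_noisy = np.zeros(2)                 # gradient descent with small Gaussian noise
    for _ in range(steps):
        v_gd -= eta * grad(v_gd)
        v_noisy -= eta * (grad(v_noisy) + sigma * rng.normal(size=2))

    print("plain GD stays at the saddle:", v_gd)
    print("noisy GD escapes toward a minimum:", v_noisy)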
Proceedings Article

Deep Learning without Poor Local Minima

TL;DR: This paper proves a conjecture published in 1989, partially addresses an open problem announced at the Conference on Learning Theory (COLT) 2015, and presents an instance for which the following question can be answered: how difficult is it, in theory, to directly train a deep model?
Journal Article (DOI)

Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks

TL;DR: In this paper, the problem of learning a shallow neural network that best fits a training data set was studied in the over-parameterized regime, where the number of observations is smaller than the number of parameters in the model.
Posted Content

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

TL;DR: In this article, the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization is studied.