Open Access · Posted Content

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

TLDR
This article shows that gradient descent converges at a global linear rate to the global optimum for two-layer fully connected ReLU activated neural networks, where over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations.
Abstract
One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show that, as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first-order methods.
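The setting in the abstract can be illustrated with a small numerical sketch (an illustration only, not the paper's experiments or proof): a width-$m$ two-layer ReLU network with a fixed random-sign output layer, trained on the quadratic loss by full-batch gradient descent. All sizes, the step size, and the synthetic data below are arbitrary choices; the point is simply to watch the two quantities the analysis tracks, the training loss and the distance of each hidden weight from its initialization.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: n unit-norm inputs in d dimensions (no two parallel, w.h.p.).
    n, d, m = 10, 10, 2000            # m >> n: over-parameterized width (arbitrary choice)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.normal(size=n)

    # Two-layer ReLU net: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x).
    W0 = rng.normal(size=(m, d))            # random initialization of hidden weights
    a = rng.choice([-1.0, 1.0], size=m)     # fixed output layer of random signs
    W = W0.copy()

    def forward(W):
        H = np.maximum(X @ W.T, 0.0)        # (n, m) hidden activations
        return H @ a / np.sqrt(m)           # network outputs on the training set

    eta = 0.3   # step size (arbitrary choice for this toy problem)
    for t in range(400):
        pred = forward(W)
        err = pred - y                      # residuals of the quadratic loss
        # Gradient of 0.5 * ||pred - y||^2 with respect to each hidden weight w_r.
        act = (X @ W.T > 0.0).astype(float)                            # ReLU derivative, (n, m)
        grad = ((err[:, None] * act) * a[None, :]).T @ X / np.sqrt(m)  # (m, d)
        W -= eta * grad
        if t % 100 == 0:
            loss = 0.5 * np.sum(err ** 2)
            drift = np.max(np.linalg.norm(W - W0, axis=1))
            print(f"iter {t:3d}  loss {loss:.3e}  max ||w_r - w_r(0)|| {drift:.3e}")

In typical runs of this sketch the loss decays geometrically while the maximum weight drift stays small, which is the qualitative behavior the paper's analysis predicts for sufficiently large m.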


Citations
Posted Content

Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

TL;DR: Suggests an approach for selecting problem-specific Fourier features that greatly improves the performance of MLPs on low-dimensional regression tasks relevant to the computer vision and graphics communities.
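The mapping this summary refers to is a random Fourier feature embedding applied to low-dimensional coordinates before they enter an MLP. A minimal sketch of such a mapping is below; the Gaussian sampling of the frequency matrix and the particular bandwidth and feature count are illustrative choices, not the paper's tuned settings.

    import numpy as np

    rng = np.random.default_rng(0)

    def fourier_features(x, B):
        """Map low-dimensional coordinates x (n, d) to [cos(2*pi*x B^T), sin(2*pi*x B^T)]."""
        proj = 2.0 * np.pi * x @ B.T
        return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

    d, num_feats, scale = 2, 64, 10.0                  # input dim, feature count, bandwidth (arbitrary)
    B = rng.normal(scale=scale, size=(num_feats, d))   # frequencies sampled from a Gaussian

    x = rng.uniform(size=(5, d))            # e.g. 2-D pixel coordinates in [0, 1]^2
    z = fourier_features(x, B)              # (5, 2 * num_feats) features fed to an MLP regressor
    print(z.shape)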
Posted Content

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

TL;DR: This paper recovers, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
Journal Article (DOI)

Benign overfitting in linear regression

TL;DR: A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
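The two preceding entries both concern the minimum-ℓ2-norm ("ridgeless") least squares interpolator in the overparameterized regime. Below is a minimal sketch of that estimator on a hypothetical random design; the dimensions and noise level are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 50, 200                      # overparameterized: more features than samples
    X = rng.normal(size=(n, p))
    beta_true = rng.normal(size=p) / np.sqrt(p)
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Minimum-norm ("ridgeless") interpolator: beta = X^+ y = X^T (X X^T)^{-1} y.
    beta_hat = np.linalg.pinv(X) @ y

    print("training residual:", np.linalg.norm(X @ beta_hat - y))   # ~0: it interpolates
    print("parameter error  :", np.linalg.norm(beta_hat - beta_true))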
Posted Content

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

TL;DR: This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
Posted Content

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

TL;DR: This article showed that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a mild assumption on the training data.
References
Posted Content

Understanding deep learning requires rethinking generalization

TL;DR: The authors showed that deep neural networks can fit a random labeling of the training data, and that this phenomenon is qualitatively unaffected by explicit regularization, and occurs even if the true images are replaced by completely unstructured random noise.
Proceedings Article

Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition

TL;DR: In this article, the authors show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations for orthogonal tensor decomposition.
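A toy illustration of the escaping-saddle phenomenon (the paper's actual setting is orthogonal tensor decomposition, not this two-dimensional function): plain gradient descent started exactly at a strict saddle point never moves, while adding small Gaussian noise to each step, in the spirit of the paper's noisy stochastic gradient method, lets the iterate escape toward a local minimum. The step size and noise scale below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy nonconvex objective f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2:
    # strict saddle at the origin, local minima at (+1, 0) and (-1, 0).
    def grad(v):
        x, y = v
        return np.array([x * (x**2 - 1.0), y])

    eta, sigma, steps = 0.1, 1e-2, 300    # step size and noise scale (arbitrary choices)

    v_gd = np.zeros(2)                    # plain gradient descent, started at the saddle
    v_noisy = np.zeros(2)                 # gradient descent with small Gaussian noise
    for _ in range(steps):
        v_gd -= eta * grad(v_gd)
        v_noisy -= eta * (grad(v_noisy) + sigma * rng.normal(size=2))

    print("plain GD stays at the saddle:", v_gd)
    print("noisy GD escapes toward a minimum:", v_noisy)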
Proceedings Article

Deep Learning without Poor Local Minima

TL;DR: This paper proves a conjecture published in 1989, partially addresses an open problem announced at the Conference on Learning Theory (COLT) 2015, and presents an instance for which the following question can be answered: how difficult is it, in theory, to directly train a deep model?
Journal Article (DOI)

Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks

TL;DR: In this paper, the problem of learning a shallow neural network that best fits a training data set was studied in the over-parameterized regime, where the number of observations is smaller than the number of parameters in the model.
Posted Content

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

TL;DR: In this article, the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization is studied.