Open Access Posted Content

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

TL;DR
In this article, a simple variant of Nesterov's accelerated gradient descent (AGD) is shown to achieve a faster convergence rate than GD in the nonconvex setting.
Abstract
Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases at each step even for nonconvex functions, and (2) a novel framework called improve-or-localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
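To make the Hamiltonian idea concrete, here is a minimal sketch, not the paper's full algorithm: the step size eta, momentum parameter theta, and the test function are placeholder choices, and the perturbation and negative-curvature steps of the actual method are omitted. It runs a Nesterov-style iteration and records the Hamiltonian $f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ with velocity $v_t = x_t - x_{t-1}$, the quantity the abstract says AGD decreases at each step.

```python
import numpy as np

def agd_hamiltonian_sketch(f, grad_f, x0, eta=1e-3, theta=0.9, iters=1000):
    """Nesterov-style iteration that records the Hamiltonian
    E_t = f(x_t) + ||x_t - x_{t-1}||^2 / (2*eta)  (illustrative constants)."""
    x, x_prev = x0.copy(), x0.copy()
    energies = []
    for _ in range(iters):
        y = x + theta * (x - x_prev)          # momentum (look-ahead) point
        x_prev, x = x, y - eta * grad_f(y)    # gradient step taken at y
        v = x - x_prev                        # velocity after the step
        energies.append(f(x) + v @ v / (2 * eta))
    return x, energies

# Placeholder nonconvex test function with a saddle point at the origin
f = lambda x: x[0]**2 - x[1]**2 + 0.25 * x[1]**4
grad_f = lambda x: np.array([2 * x[0], -2 * x[1] + x[1]**3])
x_final, E = agd_hamiltonian_sketch(f, grad_f, np.array([1.0, 0.1]))
```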


Citations
Journal Article

A high-bias, low-variance introduction to Machine Learning for physicists

TL;DR: The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, generalization, and gradient descent before moving on to more advanced topics in both supervised and unsupervised learning.
Posted Content

SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

TL;DR: This paper proposes a new technique named SPIDER, which can be used to track many deterministic quantities of interest with significantly reduced computational cost, and proves that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting.
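A rough sketch of the path-integrated estimator follows; the refresh period q, step size, and the helper functions grad_full and grad_diff are illustrative placeholders rather than the paper's SPIDER-SFO parameters. The gradient estimate is refreshed with a full or large-batch gradient every q steps and otherwise updated recursively along the optimization path.

```python
def spider_sfo_sketch(grad_full, grad_diff, x0, eta=1e-2, q=50, iters=500):
    """Sketch of a SPIDER-style estimator (illustrative parameters).
    grad_diff(x_new, x_old) should return a minibatch estimate of
    grad f(x_new) - grad f(x_old) computed on the SAME sampled batch."""
    x = x0.copy()
    v = grad_full(x)                     # refresh with a full/large-batch gradient
    for k in range(1, iters):
        x_new = x - eta * v              # move using the current estimate
        if k % q == 0:
            v = grad_full(x_new)         # periodic refresh
        else:
            v = grad_diff(x_new, x) + v  # path-integrated recursive update
        x = x_new
    return x
```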
Posted Content

On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

TL;DR: Perturbed versions of GD and SGD are analyzed, and it is shown that they are truly efficient: their dimension dependence is only polylogarithmic.
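For context, here is a minimal sketch of perturbed gradient descent; the gradient threshold, perturbation radius, and waiting period below are placeholders rather than the constants derived in that analysis. When the gradient is small, the iterate is perturbed uniformly inside a small ball so that it can escape strict saddle points.

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=1e-2, g_thresh=1e-3, radius=1e-2,
                 wait=50, iters=5000, seed=0):
    """Gradient descent with occasional ball-shaped perturbations
    (illustrative constants; the analysis derives these from problem parameters)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    last_perturb = -wait
    for t in range(iters):
        if np.linalg.norm(grad_f(x)) <= g_thresh and t - last_perturb >= wait:
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi                   # uniform perturbation inside a ball
            last_perturb = t
        x = x - eta * grad_f(x)          # ordinary gradient step
    return x
```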
Posted Content

On Symplectic Optimization

TL;DR: This paper provides a systematic methodology for converting continuous-time dynamics into discrete-time algorithms while retaining oracle rates, based on ideas from Hamiltonian dynamical systems and symplectic integration.
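As a minimal sketch of the discretization idea, assuming the damped Hamiltonian system x' = p, p' = -gamma*p - grad f(x) and a conformal symplectic-Euler-style update; the step size and damping are illustrative, and this is not the paper's exact construction.

```python
import numpy as np

def damped_symplectic_euler(grad_f, x0, h=0.05, gamma=1.0, iters=2000):
    """Symplectic-Euler-style discretization of the damped Hamiltonian system
    x' = p,  p' = -gamma * p - grad f(x)   (illustrative step size and damping)."""
    x = x0.copy()
    p = np.zeros_like(x)
    for _ in range(iters):
        p = np.exp(-gamma * h) * p - h * grad_f(x)  # momentum update with friction
        x = x + h * p                               # position update uses new momentum
    return x
```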
Posted Content

Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration

TL;DR: This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.
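For reference, a self-contained sketch of the standard Adam update that such analyses target; the default hyperparameters shown are the usual ones, not the specific variant or constants studied in that work.

```python
import numpy as np

def adam_sketch(grad_f, x0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    """Plain Adam update with bias correction (standard form)."""
    x = x0.copy()
    m = np.zeros_like(x)   # first-moment estimate
    v = np.zeros_like(x)   # second-moment estimate
    for t in range(1, iters + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x
```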
References
Journal Article

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

TL;DR: This paper presents a new fast iterative shrinkage-thresholding algorithm (FISTA) that preserves the computational simplicity of ISTA but has a global rate of convergence that is proven to be significantly better, both theoretically and practically.
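A minimal sketch of FISTA applied to the l1-regularized least-squares problem min_x 0.5*||Ax - b||^2 + lam*||x||_1, a standard instantiation chosen here for illustration; the iteration count is a placeholder.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(A, b, lam, iters=500):
    """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1, with step size 1/L,
    where L is the largest eigenvalue of A^T A."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = x_prev = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - grad / L, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x + ((t - 1) / t_next) * (x - x_prev)   # momentum extrapolation
        x_prev, t = x, t_next
    return x
```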
Proceedings Article

Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition

TL;DR: In this article, the authors show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations for orthogonal tensor decomposition.
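A minimal sketch of the noisy stochastic gradient idea, with illustrative constants: unlike the perturbed-GD sketch above, isotropic noise is injected at every iteration.

```python
import numpy as np

def noisy_sgd(stoch_grad, x0, eta=1e-2, noise=1e-2, iters=10000, seed=0):
    """SGD with spherical noise added at every step (illustrative constants)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        xi = rng.normal(size=x.shape)
        xi *= noise / np.linalg.norm(xi)      # noise on a sphere of radius `noise`
        x = x - eta * (stoch_grad(x) + xi)
    return x
```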
Journal Article

Cubic regularization of Newton method and its global performance

TL;DR: This paper provides a theoretical analysis of a cubic regularization of the Newton method applied to the unconstrained minimization problem and proves general local convergence results for this scheme.
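As a rough sketch of the scheme: each step approximately minimizes the cubic model m_k(h) = <g_k, h> + 0.5*h^T H_k h + (M/6)*||h||^3. The inner solver below is a simple gradient loop on that model and the regularization constant M is a placeholder; the paper characterizes and solves the subproblem exactly.

```python
import numpy as np

def cubic_newton_step(g, H, M, inner_iters=200, lr=None):
    """Approximately minimize m(h) = g.h + 0.5*h.H.h + (M/6)*||h||^3
    by gradient descent on m (heuristic inner solver and step size)."""
    h = np.zeros_like(g)
    if lr is None:
        lr = 1.0 / (np.linalg.norm(H, 2) + M * np.linalg.norm(g) + 1e-12)
    for _ in range(inner_iters):
        grad_m = g + H @ h + 0.5 * M * np.linalg.norm(h) * h
        h = h - lr * grad_m
    return h

def cubic_regularized_newton(grad_f, hess_f, x0, M=10.0, iters=50):
    """Outer loop: x_{k+1} = x_k + argmin_h m_k(h) (illustrative constants)."""
    x = x0.copy()
    for _ in range(iters):
        x = x + cubic_newton_step(grad_f(x), hess_f(x), M)
    return x
```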
Journal Article

A differential equation for modeling Nesterov's accelerated gradient method: theory and insights

TL;DR: A second-order ordinary differential equation is derived as the limit of Nesterov's accelerated gradient method, and it is shown that the continuous-time ODE allows for a better understanding of Nesterov's scheme.
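For reference, the limiting second-order ODE derived in that line of work takes the form below (stated here from the continuous-time literature, with the initial conditions as usually given).

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0,
\qquad X(0) = x_0, \quad \dot{X}(0) = 0.
```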
Journal Article

Accelerated gradient methods for nonconvex nonlinear and stochastic programming

TL;DR: The AG method is generalized to solve nonconvex and possibly stochastic optimization problems, and it is demonstrated that, by properly specifying the stepsize policy, the AG method exhibits the best known rate of convergence for solving general nonconvex smooth optimization problems using only first-order information, similar to the gradient descent method.