Open Access Posted Content

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

TL;DR
In this article, a simple variant of Nesterov's accelerated gradient descent (AGD) is shown to achieve a faster convergence rate than GD in the nonconvex setting.
Abstract
Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases at each step even for nonconvex functions, and (2) a novel framework called improve-or-localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
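To make the Hamiltonian idea concrete, here is a minimal sketch, not the paper's full algorithm: the step size eta, momentum parameter theta, and the test function are placeholder choices, and the perturbation and negative-curvature steps of the actual method are omitted. It runs a Nesterov-style iteration and records the Hamiltonian $f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ with velocity $v_t = x_t - x_{t-1}$, the quantity the abstract says AGD decreases at each step.

```python
import numpy as np

def agd_hamiltonian_sketch(f, grad_f, x0, eta=1e-3, theta=0.9, iters=1000):
    """Nesterov-style iteration that records the Hamiltonian
    E_t = f(x_t) + ||x_t - x_{t-1}||^2 / (2*eta)  (illustrative constants)."""
    x, x_prev = x0.copy(), x0.copy()
    energies = []
    for _ in range(iters):
        y = x + theta * (x - x_prev)          # momentum (look-ahead) point
        x_prev, x = x, y - eta * grad_f(y)    # gradient step taken at y
        v = x - x_prev                        # velocity after the step
        energies.append(f(x) + v @ v / (2 * eta))
    return x, energies

# Placeholder nonconvex test function with a saddle point at the origin
f = lambda x: x[0]**2 - x[1]**2 + 0.25 * x[1]**4
grad_f = lambda x: np.array([2 * x[0], -2 * x[1] + x[1]**3])
x_final, E = agd_hamiltonian_sketch(f, grad_f, np.array([1.0, 0.1]))
```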


Citations
Journal Article

A high-bias, low-variance introduction to Machine Learning for physicists

TL;DR: The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, generalization, and gradient descent before moving on to more advanced topics in both supervised and unsupervised learning.
Posted Content

SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

TL;DR: This paper proposes a new technique named SPIDER, which can be used to track many deterministic quantities of interest with significantly reduced computational cost, and proves that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting.
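A rough sketch of the path-integrated estimator follows; the refresh period q, step size, and the helper functions grad_full and grad_diff are illustrative placeholders rather than the paper's SPIDER-SFO parameters. The gradient estimate is refreshed with a full or large-batch gradient every q steps and otherwise updated recursively along the optimization path.

```python
def spider_sfo_sketch(grad_full, grad_diff, x0, eta=1e-2, q=50, iters=500):
    """Sketch of a SPIDER-style estimator (illustrative parameters).
    grad_diff(x_new, x_old) should return a minibatch estimate of
    grad f(x_new) - grad f(x_old) computed on the SAME sampled batch."""
    x = x0.copy()
    v = grad_full(x)                     # refresh with a full/large-batch gradient
    for k in range(1, iters):
        x_new = x - eta * v              # move using the current estimate
        if k % q == 0:
            v = grad_full(x_new)         # periodic refresh
        else:
            v = grad_diff(x_new, x) + v  # path-integrated recursive update
        x = x_new
    return x
```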
Posted Content

On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

TL;DR: Perturbed versions of GD and SGD are analyzed, and it is shown that they are truly efficient: their dimension dependence is only polylogarithmic.
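For context, here is a minimal sketch of perturbed gradient descent; the gradient threshold, perturbation radius, and waiting period below are placeholders rather than the constants derived in that analysis. When the gradient is small, the iterate is perturbed uniformly inside a small ball so that it can escape strict saddle points.

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=1e-2, g_thresh=1e-3, radius=1e-2,
                 wait=50, iters=5000, seed=0):
    """Gradient descent with occasional ball-shaped perturbations
    (illustrative constants; the analysis derives these from problem parameters)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    last_perturb = -wait
    for t in range(iters):
        if np.linalg.norm(grad_f(x)) <= g_thresh and t - last_perturb >= wait:
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi                   # uniform perturbation inside a ball
            last_perturb = t
        x = x - eta * grad_f(x)          # ordinary gradient step
    return x
```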
Posted Content

On Symplectic Optimization

TL;DR: This paper provides a systematic methodology for converting continuous-time dynamics into discrete-time algorithms while retaining oracle rates, based on ideas from Hamiltonian dynamical systems and symplectic integration.
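As a minimal sketch of the discretization idea, assuming the damped Hamiltonian system x' = p, p' = -gamma*p - grad f(x) and a conformal symplectic-Euler-style update; the step size and damping are illustrative, and this is not the paper's exact construction.

```python
import numpy as np

def damped_symplectic_euler(grad_f, x0, h=0.05, gamma=1.0, iters=2000):
    """Symplectic-Euler-style discretization of the damped Hamiltonian system
    x' = p,  p' = -gamma * p - grad f(x)   (illustrative step size and damping)."""
    x = x0.copy()
    p = np.zeros_like(x)
    for _ in range(iters):
        p = np.exp(-gamma * h) * p - h * grad_f(x)  # momentum update with friction
        x = x + h * p                               # position update uses new momentum
    return x
```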
Posted Content

Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration

TL;DR: This work provides proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and gives bounds on the running time of these algorithms.
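For reference, a self-contained sketch of the standard Adam update that such analyses target; the default hyperparameters shown are the usual ones, not the specific variant or constants studied in that work.

```python
import numpy as np

def adam_sketch(grad_f, x0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    """Plain Adam update with bias correction (standard form)."""
    x = x0.copy()
    m = np.zeros_like(x)   # first-moment estimate
    v = np.zeros_like(x)   # second-moment estimate
    for t in range(1, iters + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x
```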
References
Journal Article

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

TL;DR: This paper presents a new fast iterative shrinkage-thresholding algorithm (FISTA) that preserves the computational simplicity of ISTA but has a global rate of convergence that is proven to be significantly better, both theoretically and practically.
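A minimal sketch of FISTA applied to the l1-regularized least-squares problem min_x 0.5*||Ax - b||^2 + lam*||x||_1, a standard instantiation chosen here for illustration; the iteration count is a placeholder.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(A, b, lam, iters=500):
    """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1, with step size 1/L,
    where L is the largest eigenvalue of A^T A."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = x_prev = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - grad / L, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x + ((t - 1) / t_next) * (x - x_prev)   # momentum extrapolation
        x_prev, t = x, t_next
    return x
```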
Proceedings Article

Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition

TL;DR: In this article, the authors show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations for orthogonal tensor decomposition.
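A minimal sketch of the noisy stochastic gradient idea, with illustrative constants: unlike the perturbed-GD sketch above, isotropic noise is injected at every iteration.

```python
import numpy as np

def noisy_sgd(stoch_grad, x0, eta=1e-2, noise=1e-2, iters=10000, seed=0):
    """SGD with spherical noise added at every step (illustrative constants)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        xi = rng.normal(size=x.shape)
        xi *= noise / np.linalg.norm(xi)      # noise on a sphere of radius `noise`
        x = x - eta * (stoch_grad(x) + xi)
    return x
```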
Journal Article

Cubic regularization of Newton method and its global performance

TL;DR: This paper provides a theoretical analysis of a cubic regularization of the Newton method applied to the unconstrained minimization problem and proves general local convergence results for this scheme.
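As a rough sketch of the scheme: each step approximately minimizes the cubic model m_k(h) = <g_k, h> + 0.5*h^T H_k h + (M/6)*||h||^3. The inner solver below is a simple gradient loop on that model and the regularization constant M is a placeholder; the paper characterizes and solves the subproblem exactly.

```python
import numpy as np

def cubic_newton_step(g, H, M, inner_iters=200, lr=None):
    """Approximately minimize m(h) = g.h + 0.5*h.H.h + (M/6)*||h||^3
    by gradient descent on m (heuristic inner solver and step size)."""
    h = np.zeros_like(g)
    if lr is None:
        lr = 1.0 / (np.linalg.norm(H, 2) + M * np.linalg.norm(g) + 1e-12)
    for _ in range(inner_iters):
        grad_m = g + H @ h + 0.5 * M * np.linalg.norm(h) * h
        h = h - lr * grad_m
    return h

def cubic_regularized_newton(grad_f, hess_f, x0, M=10.0, iters=50):
    """Outer loop: x_{k+1} = x_k + argmin_h m_k(h) (illustrative constants)."""
    x = x0.copy()
    for _ in range(iters):
        x = x + cubic_newton_step(grad_f(x), hess_f(x), M)
    return x
```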
Journal Article

A differential equation for modeling Nesterov's accelerated gradient method: theory and insights

TL;DR: A second-order ordinary differential equation is derived as the limit of Nesterov's accelerated gradient method, and it is shown that the continuous-time ODE allows for a better understanding of Nesterov's scheme.
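For reference, the limiting second-order ODE derived in that line of work takes the form below (stated here from the continuous-time literature, with the initial conditions as usually given).

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0,
\qquad X(0) = x_0, \quad \dot{X}(0) = 0.
```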
Journal Article

Accelerated gradient methods for nonconvex nonlinear and stochastic programming

TL;DR: The AG method is generalized to solve nonconvex and possibly stochastic optimization problems, and it is demonstrated that, by properly specifying the stepsize policy, the AG method exhibits the best known rate of convergence for solving general nonconvex smooth optimization problems using only first-order information, similar to the gradient descent method.