Open Access Proceedings Article

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

TLDR
In this article, a simple variant of Nesterov's accelerated gradient descent (AGD) is shown to achieve a faster convergence rate than GD in the nonconvex setting.
Abstract
Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve-or-localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
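To make the first key idea concrete, below is a minimal Python sketch (not the paper's exact algorithm) of a Nesterov-style momentum iteration that tracks a Hamiltonian of the form $E_t = f(x_t) + \frac{1}{2\eta}\|v_t\|^2$, combining the function value with a kinetic-energy term in the momentum $v_t = x_t - x_{t-1}$. The handles `f` and `grad_f`, the step size `eta`, and the momentum parameter `theta` are illustrative placeholders, and the paper's additional safeguard steps for escaping saddle points are omitted.

```python
import numpy as np

def agd_with_hamiltonian(f, grad_f, x0, eta=1e-3, theta=0.1, n_iters=1000):
    """Nesterov-style momentum iteration that records the Hamiltonian
    E_t = f(x_t) + ||v_t||^2 / (2 * eta).

    Illustrative sketch only: f, grad_f, eta, and theta are placeholders,
    and the full algorithm's safeguard steps are omitted.
    """
    x_prev = np.array(x0, dtype=float)
    x = np.array(x0, dtype=float)
    energies = []
    for _ in range(n_iters):
        v = x - x_prev                          # momentum ("velocity") v_t = x_t - x_{t-1}
        # Hamiltonian at step t: potential f(x_t) plus kinetic ||v_t||^2 / (2*eta)
        energies.append(f(x) + np.dot(v, v) / (2.0 * eta))
        y = x + (1.0 - theta) * v               # extrapolation point
        x_prev, x = x, y - eta * grad_f(y)      # gradient step taken at y
    return x, energies

# Example on a smooth nonconvex test function (placeholder choice):
f = lambda x: np.sum(np.cos(x))
grad_f = lambda x: -np.sin(x)
x_final, energies = agd_with_hamiltonian(f, grad_f, np.array([1.0, 2.0]))
```

Under suitable step sizes, inspecting `energies` is one way to observe the monotone-decrease property the abstract describes.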



Citations
Journal Article

Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview

TL;DR: This tutorial-style overview highlights the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees and reviews two contrasting approaches: two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and global landscape analysis and initialization-free algorithms.
Posted Content

Why gradient clipping accelerates training: A theoretical justification for adaptivity

TL;DR: It is shown that gradient smoothness, a concept central to the analysis of first-order optimization algorithms and often assumed to be constant, varies significantly along the training trajectory of deep neural networks and, contrary to standard assumptions in the literature, correlates positively with the gradient norm.
Proceedings Article

Global convergence of Langevin dynamics based algorithms for nonconvex optimization

TL;DR: In this article, the authors present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for non-convex finite-sum optimization with n component functions.
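For context, a Langevin dynamics based update perturbs a gradient step with Gaussian noise. The hedged sketch below shows one such iteration; `grad_f`, `eta`, and `beta` are placeholder names, not the cited paper's notation.

```python
import numpy as np

def langevin_step(x, grad_f, eta=1e-3, beta=10.0, rng=None):
    """One gradient Langevin dynamics update: a gradient step plus Gaussian
    noise whose scale is set by the inverse temperature beta.
    Illustrative sketch only; names and defaults are placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(np.shape(x))
    return x - eta * grad_f(x) + np.sqrt(2.0 * eta / beta) * noise
```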
Proceedings Article

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

TL;DR: In this paper, the authors provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks and introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption.
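As a rough illustration of the mechanism these two gradient-clipping entries discuss, the sketch below runs gradient descent with norm clipping; the hyperparameter names are placeholders, and the relaxed smoothness condition analyzed in the cited work is only alluded to in comments, not enforced.

```python
import numpy as np

def clipped_gradient_descent(grad_f, x0, eta=0.1, threshold=1.0, n_iters=1000):
    """Gradient descent with gradient-norm clipping. Illustrative sketch only;
    eta and threshold are placeholder hyperparameters.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        norm = np.linalg.norm(g)
        # Shrink the step when the gradient is large, i.e. in regions where
        # local smoothness tends to grow with the gradient norm.
        step = eta * min(1.0, threshold / (norm + 1e-12))
        x = x - step * g
    return x
```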
Journal Article

A Framework for One-Bit and Constant-Envelope Precoding Over Multiuser Massive MISO Channels

TL;DR: In this paper, a framework for designing multiuser precoding under one-bit and constant-envelope (CE) massive MIMO scenarios is established, where high-resolution digital-to-analog converters (DACs) are replaced by one-bit DACs and phase shifters, respectively, to reduce hardware cost and energy consumption.