Open Access · Proceedings Article · DOI

Katyusha: the first direct acceleration of stochastic gradient methods

Zeyuan Allen-Zhu
pp. 1200–1205
TLDR
Katyusha, as discussed by the authors, is a direct, primal-only stochastic gradient method built on a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction based algorithm to speed it up.
Abstract
Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex. We introduce Katyusha, a direct, primal-only stochastic gradient method to fix this issue. It has a provably accelerated convergence rate in convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction based algorithm and speed it up. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case, one could potentially give Katyusha a hug.
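For intuition, here is a minimal Python sketch of how a Katyusha-style "negative momentum" can sit on top of an SVRG-style variance-reduced gradient. The parameter choices shown (tau2 = 1/2, the formulas for tau1 and alpha, the epoch length m = 2n) follow common statements of the method, and the snapshot is updated with a plain average rather than a weighted one; treat it as a sketch under those assumptions, not the paper's exact algorithm.

```python
import numpy as np

def katyusha_style(grad_i, grad_full, x0, n, L, sigma, epochs=20, m=None):
    """Sketch of negative-momentum (Katyusha-style) acceleration on top of an
    SVRG-style variance-reduced gradient.

    grad_i(x, i): stochastic gradient of the i-th component at x
    grad_full(x): full gradient at x
    L, sigma:     smoothness and strong-convexity estimates (assumed known here)
    """
    m = m or 2 * n                       # inner-loop length (a common choice)
    tau2 = 0.5                           # weight of the "negative momentum" term
    tau1 = min(np.sqrt(m * sigma / (3 * L)), 0.5)
    alpha = 1.0 / (3 * tau1 * L)
    x_tilde = x0.copy()                  # snapshot point
    z = x0.copy()                        # mirror-descent sequence
    y = x0.copy()                        # gradient-descent sequence
    for _ in range(epochs):
        mu = grad_full(x_tilde)          # full gradient at the snapshot
        ys = []
        for _ in range(m):
            # Coupling step: Nesterov momentum (via z) plus negative momentum
            # (via x_tilde), which pulls the iterate back toward the snapshot.
            x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = np.random.randint(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # variance-reduced gradient
            z = z - alpha * g                            # mirror-descent-style step
            y = x - g / (3 * L)                          # gradient-descent-style step
            ys.append(y)
        x_tilde = np.mean(ys, axis=0)    # plain average; a weighted average is also used
    return x_tilde
```

The tau2 * x_tilde term is the "negative momentum": it acts as a magnet toward the snapshot, keeping the variance-reduced gradient estimates accurate while the Nesterov-style extrapolation accelerates progress.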



Citations
Posted Content

Lookahead Optimizer: k steps forward, 1 step back

TL;DR: Lookahead improves learning stability and lowers the variance of its inner optimizer at negligible computation and memory cost, and can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings.
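A minimal sketch of the "k steps forward, 1 step back" pattern, with plain SGD standing in for the inner optimizer; the step size, k, and the interpolation coefficient alpha here are illustrative defaults, not necessarily the published hyperparameters.

```python
import numpy as np

def lookahead(grad, x0, inner_step=0.1, k=5, alpha=0.5, outer_iters=100):
    """Sketch of Lookahead: run k fast (inner) steps, then move the slow
    weights a fraction alpha toward where the fast weights ended up."""
    slow = x0.copy()
    for _ in range(outer_iters):
        fast = slow.copy()
        for _ in range(k):                      # k steps forward with the inner optimizer
            fast -= inner_step * grad(fast)     # plain SGD stands in for SGD/Adam here
        slow += alpha * (fast - slow)           # 1 step back: interpolate the slow weights
    return slow

# Tiny usage example on a quadratic: minimize ||x||^2
x = lookahead(lambda x: 2 * x, np.ones(3))
```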
Journal Article

Stochastic primal-dual coordinate method for regularized empirical risk minimization

TL;DR: This work proposes a stochastic primal-dual coordinate method, which alternates between maximizing over one (or more) randomly chosen dual variables and minimizing over the primal variable, and develops an extension to non-smooth and non-strongly convex loss functions.
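For reference, a sketch of the standard saddle-point reformulation that stochastic primal-dual coordinate methods operate on; the notation here (data vectors a_i, losses phi_i with convex conjugates phi_i*, regularizer g) is assumed for illustration, and the paper's exact step sizes and extrapolation are omitted.

```latex
\min_{x\in\mathbb{R}^d}\;\frac{1}{n}\sum_{i=1}^{n}\phi_i\bigl(a_i^\top x\bigr)+g(x)
\;\;\Longleftrightarrow\;\;
\min_{x\in\mathbb{R}^d}\;\max_{y\in\mathbb{R}^n}\;
\frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i\,\langle a_i,x\rangle-\phi_i^{*}(y_i)\Bigr)+g(x).
```

Each iteration then performs a proximal ascent step on one randomly chosen dual coordinate y_i and a proximal descent step on the primal variable x.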
Journal Article

Federated Learning of a Mixture of Global and Local Models

TL;DR: This work proposes a new optimization formulation for training federated learning models that seeks an explicit trade-off between the traditional global model and purely local models, which each device can learn from its own private data without any communication.
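A sketch of the kind of penalized formulation such a trade-off refers to; the penalty form and the symbol lambda are assumptions of this sketch, not quoted from the paper. Each device i keeps its own model x_i, and a penalty with weight lambda pulls the local models toward their average:

```latex
\min_{x_1,\dots,x_n\in\mathbb{R}^d}\;
\frac{1}{n}\sum_{i=1}^{n} f_i(x_i)
\;+\;\frac{\lambda}{2n}\sum_{i=1}^{n}\bigl\|x_i-\bar{x}\bigr\|^{2},
\qquad \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i .
```

Taking lambda = 0 leaves purely local models, while letting lambda grow forces the x_i to agree and recovers a single global model.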
Posted Content

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

TL;DR: It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
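A minimal sketch of constant step-size SGD with a Nesterov-style momentum step, the general form of scheme that such results analyze; the specific parameter sequences and interpolation (over-parameterization) assumptions of the paper are not reproduced here.

```python
import numpy as np

def nesterov_sgd(grad_i, x0, n, step=0.1, momentum=0.9, iters=1000, seed=0):
    """Constant step-size SGD with a Nesterov-style extrapolation step."""
    rng = np.random.default_rng(seed)
    x_prev = x = x0.copy()
    for _ in range(iters):
        y = x + momentum * (x - x_prev)         # extrapolate to a look-ahead point
        i = rng.integers(n)
        x_prev, x = x, y - step * grad_i(y, i)  # stochastic gradient step taken at y
    return x
```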
Journal Article · DOI

Accelerated Distributed Nesterov Gradient Descent

TL;DR: In this article, an accelerated distributed Nesterov gradient descent method was proposed for distributed optimization over a network, where the objective is to optimize a global function formed by a sum of local functions, using only local computation and communication.
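A simplified sketch of the general ingredients such methods combine: consensus averaging through a mixing matrix, local Nesterov-style momentum steps, and gradient tracking of the network-wide average gradient. This is a sketch of that pattern, not the exact Acc-DNGD recursion or its step-size schedule.

```python
import numpy as np

def distributed_nesterov_sketch(grads, W, x0, step=0.05, momentum=0.8, iters=500):
    """Each agent mixes with its neighbors (mixing matrix W), takes a
    Nesterov-style momentum step, and tracks the average gradient.

    grads: list of per-agent gradient functions
    W:     doubly stochastic mixing matrix (n x n)
    x0:    (n, d) array of initial local iterates
    """
    n = len(grads)
    x = x0.copy()
    x_prev = x0.copy()
    # Gradient tracker: s[i] estimates the network-wide average gradient.
    s = np.stack([grads[i](x[i]) for i in range(n)])
    g_old = s.copy()
    for _ in range(iters):
        y = x + momentum * (x - x_prev)          # local Nesterov extrapolation
        x_prev = x
        x = W @ y - step * s                     # consensus mixing + tracked-gradient step
        g_new = np.stack([grads[i](x[i]) for i in range(n)])
        s = W @ s + g_new - g_old                # gradient-tracking update
        g_old = g_new
    return x.mean(axis=0)
```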
References
Book

Introductory Lectures on Convex Optimization: A Basic Course

TL;DR: A polynomial-time interior-point method for linear optimization is discussed, whose importance lay not only in its complexity bound but also in the fact that the theoretical prediction of its high efficiency was supported by excellent computational results.
Journal Article · DOI

Smooth minimization of non-smooth functions

TL;DR: A new approach for constructing efficient schemes for non-smooth convex optimization is proposed, based on a special smoothing technique, which can be applied to functions with explicit max-structure, and can be considered as an alternative to black-box minimization.
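The smoothing idea in one display, assuming the usual explicit max-structure; the symbols follow the standard statement of the technique rather than quoting the paper.

```latex
f(x)=\max_{u\in Q}\bigl\{\langle Ax,u\rangle-\hat{\phi}(u)\bigr\}
\quad\leadsto\quad
f_{\mu}(x)=\max_{u\in Q}\bigl\{\langle Ax,u\rangle-\hat{\phi}(u)-\mu\,d(u)\bigr\},
```

where d is a prox-function that is strongly convex on Q with modulus sigma. The smoothed f_mu has a Lipschitz-continuous gradient with constant on the order of \|A\|^2/(\mu\sigma), so running a fast gradient method on f_mu with a suitably chosen mu yields O(1/epsilon) complexity for the original non-smooth problem instead of the O(1/epsilon^2) black-box rate.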
Proceedings Article

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

TL;DR: It is proved that this method enjoys the same fast convergence rate as stochastic dual coordinate ascent (SDCA) and stochastic average gradient (SAG), but its analysis is significantly simpler and more intuitive.
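A minimal Python sketch of the predictive variance-reduction (SVRG-style) loop, assuming access to component and full gradients; the epoch length and step size are illustrative.

```python
import numpy as np

def svrg(grad_i, grad_full, x0, n, step=0.05, m=None, epochs=20, seed=0):
    """SVRG-style loop: keep a snapshot x_tilde and correct each stochastic
    gradient with the snapshot's gradients so the estimator's variance shrinks."""
    rng = np.random.default_rng(seed)
    m = m or 2 * n
    x_tilde = x0.copy()
    for _ in range(epochs):
        mu = grad_full(x_tilde)                  # full gradient at the snapshot
        x = x_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # variance-reduced gradient
            x -= step * g
        x_tilde = x                              # take the last inner iterate as snapshot
    return x_tilde
```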
Proceedings Article

SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

TL;DR: SAGA as discussed by the authors improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser.
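A minimal sketch of a SAGA-style update, including a proximal step to show the support for composite objectives; the soft-thresholding prox for an l1 regularizer is just one illustrative choice, and the step size is not the theoretically prescribed one.

```python
import numpy as np

def saga(grad_i, x0, n, step=0.05, l1=0.0, iters=5000, seed=0):
    """SAGA-style loop: keep a table of the last gradient seen for each
    component and use it to build an unbiased, low-variance update direction."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    table = np.stack([grad_i(x, i) for i in range(n)])  # stored component gradients
    avg = table.mean(axis=0)
    for _ in range(iters):
        j = rng.integers(n)
        g_new = grad_i(x, j)
        v = g_new - table[j] + avg                      # unbiased variance-reduced direction
        z = x - step * v
        x = np.sign(z) * np.maximum(np.abs(z) - step * l1, 0.0)
        # ^ proximal step (soft-thresholding) handles an l1 regularizer, if any
        avg += (g_new - table[j]) / n                   # keep the running average in sync
        table[j] = g_new
    return x
```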
Journal Article · DOI

Efficiency of coordinate descent methods on huge-scale optimization problems

TL;DR: Surprisingly enough, for certain classes of objective functions, the complexity bounds obtained for the proposed huge-scale coordinate descent methods are better than the standard worst-case bounds for deterministic algorithms.
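A minimal sketch of randomized coordinate descent with coordinate-wise step sizes 1/L_i, the basic scheme such complexity results are about; uniform sampling is used here for simplicity, whereas non-uniform sampling proportional to the L_i is also commonly analyzed.

```python
import numpy as np

def random_coordinate_descent(grad_coord, L, x0, iters=10000, seed=0):
    """Randomized coordinate descent: pick one coordinate at a time and take a
    step of 1/L_i along its partial derivative.

    grad_coord(x, i): partial derivative of the objective w.r.t. coordinate i
    L:                array of coordinate-wise Lipschitz (smoothness) constants
    """
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    d = x.shape[0]
    for _ in range(iters):
        i = rng.integers(d)                     # uniform sampling over coordinates
        x[i] -= grad_coord(x, i) / L[i]         # cheap update: only one coordinate changes
    return x
```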