Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Open Access Proceedings Article

Lin Xiao

- Vol. 11, Iss: 88, pp 2116-2124

TLDR

A new online algorithm, the regularized dual averaging (RDA) method, is developed. It explicitly exploits the regularization structure in an online setting and can be very effective for sparse online learning with l1-regularization.

Abstract:

We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the l1-norm for promoting sparsity. We develop extensions of Nesterov's dual averaging method that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of l1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
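
As an illustration of the update described in the abstract, here is a minimal Python sketch of the l1-regularized case. It assumes details the abstract does not state: a squared Euclidean auxiliary term (1/2)||w||^2 weighted by gamma / sqrt(t), under which the per-iteration minimization has a closed-form, coordinate-wise solution. The names l1_rda_step, l1_rda, lam, and gamma are hypothetical and chosen for this example only.

```python
import math


def l1_rda_step(g_bar, t, lam, gamma):
    """One assumed l1-RDA update.

    Solves, coordinate by coordinate,
        min_w  g_bar * w + lam * |w| + (gamma / sqrt(t)) * 0.5 * w**2
    where g_bar is the running average of past subgradients.
    """
    scale = math.sqrt(t) / gamma
    w = []
    for g in g_bar:
        if abs(g) <= lam:
            # Average subgradient is within the threshold: exact zero.
            w.append(0.0)
        else:
            # Shrink the average toward zero by lam, then step against it.
            w.append(-scale * (g - lam * math.copysign(1.0, g)))
    return w


def l1_rda(subgradient, dim, steps, lam, gamma):
    """Run the sketch for `steps` iterations.

    `subgradient(w, t)` is a caller-supplied function (an assumption of
    this example) returning a subgradient of the loss at w for round t.
    """
    w = [0.0] * dim
    g_bar = [0.0] * dim
    for t in range(1, steps + 1):
        g = subgradient(w, t)
        # Update the running average of all subgradients seen so far.
        g_bar = [(t - 1) / t * gb + gi / t for gb, gi in zip(g_bar, g)]
        w = l1_rda_step(g_bar, t, lam, gamma)
    return w


if __name__ == "__main__":
    # Toy loss 0.5 * ||w - target||^2, whose gradient is w - target.
    target = [1.0, 0.01]
    grad = lambda w, t: [wi - ti for wi, ti in zip(w, target)]
    print(l1_rda(grad, dim=2, steps=50, lam=0.1, gamma=1.0))
```

Because lam is compared against the running average of subgradients rather than a single subgradient, any coordinate whose average stays at or below lam in magnitude is set exactly to zero. In this sketch, that is how using the whole regularization term, not just its subgradient, yields sparse iterates.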

Citations

Proceedings Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

John C. Duchi, +2 more

TL;DR: Adaptive subgradient methods dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning, which makes it possible to find needles in haystacks in the form of very predictive but rarely seen features.

Journal Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John C. Duchi, +2 more

- 01 Feb 2011 -

Journal of Machine Learning Research

TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.

Journal Article

Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling

John C. Duchi, +2 more

- 01 Mar 2012 -

IEEE Transactions on Automatic Control

TL;DR: This work develops and analyzes distributed algorithms based on dual subgradient averaging, provides sharp bounds on their convergence rates as a function of the network size and topology, and shows that the number of iterations required by the algorithm scales inversely in the spectral gap of the network.

Book

Convex Optimization: Algorithms and Complexity

Sébastien Bubeck

TL;DR: This monograph presents the main complexity theorems in convex optimization and their corresponding algorithms and provides a gentle introduction to structural optimization with FISTA, saddle-point mirror prox, Nemirovski's alternative to Nesterov's smoothing, and a concise description of interior point methods.

Proceedings Article

Ad click prediction: a view from the trenches

H. Brendan McMahan, +15 more

TL;DR: The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.

References

Journal Article

Gradient-based learning applied to document recognition

Yann LeCun, +6 more

TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.

Journal Article

Regression Shrinkage and Selection via the Lasso

Robert Tibshirani

- 01 Jan 1996 -

Journal of the Royal Statistical Society...

TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.

Book

Convex Optimization

Stephen Boyd, +1 more

TL;DR: This book gives a comprehensive introduction to the subject, with a focus on recognizing convex optimization problems and then finding the most appropriate technique for solving them.

Journal Article

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

Amir Beck, +1 more

- 01 Jan 2009 -

SIAM Journal on Imaging Sciences

TL;DR: A new fast iterative shrinkage-thresholding algorithm (FISTA) is proposed that preserves the computational simplicity of ISTA but has a global rate of convergence that is proven to be significantly better, both theoretically and practically.

Journal Article

Atomic Decomposition by Basis Pursuit

Scott Chen, +2 more

- 11 Dec 1998 -

SIAM Journal on Scientific Computing

TL;DR: Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions.
