Open Access Proceedings Article

Deep Frank-Wolfe For Neural Network Optimization

TL;DR
This work presents an optimization method based on a composite proximal framework that exploits the compositional structure of deep neural networks and can leverage powerful convex optimization algorithms by design; it employs the Frank-Wolfe algorithm for SVM, which computes an optimal step-size in closed form at each time-step.
Abstract
Learning a deep neural network requires solving a challenging optimization problem: it is a high-dimensional, non-convex and non-smooth minimization problem with a large number of terms. The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm or its adaptive variants. However, SGD requires a hand-designed schedule for the learning rate. In addition, its adaptive variants tend to produce solutions that generalize less well on unseen data than SGD with a hand-designed schedule. We present an optimization method that offers empirically the best of both worlds: our algorithm yields good generalization performance while requiring only one hyper-parameter. Our approach is based on a composite proximal framework, which exploits the compositional nature of deep neural networks and can leverage powerful convex optimization algorithms by design. Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes an optimal step-size in closed-form at each time-step. We further show that the descent direction is given by a simple backward pass in the network, yielding the same computational cost per iteration as SGD. We present experiments on the CIFAR and SNLI data sets, where we demonstrate the significant superiority of our method over Adam, Adagrad, as well as the recently proposed BPGrad and AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed learning rate schedule, and show that it provides similar generalization while converging faster. The code is publicly available at this https URL.
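The abstract describes an SGD-like update in which a single backward pass supplies the descent direction and the step-size is computed in closed form at every iteration rather than from a hand-designed schedule. The sketch below (PyTorch) illustrates that pattern only; the step-size formula is a simplified placeholder for illustration, not the exact dual line search derived in the paper.

```python
# Minimal sketch of the update pattern described in the abstract:
# one backward pass -> direction; a step-size in [0, 1] is computed in
# closed form per iteration (placeholder formula, not the paper's dual
# line search), so the per-iteration cost matches SGD.
import torch

def dfw_style_step(model, loss_fn, x, y, eta=0.1, eps=1e-8):
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()  # single backward pass gives the descent direction

    grad_sq = sum((p.grad ** 2).sum()
                  for p in model.parameters() if p.grad is not None)
    # closed-form step-size, clipped to [0, 1] (illustrative placeholder)
    gamma = torch.clamp(loss.detach() / (eta * grad_sq + eps), max=1.0)

    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= eta * gamma * p.grad  # same cost per iteration as SGD
    return loss.item(), gamma.item()
```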



Citations
Journal Article

Semantic-aware scene recognition

TL;DR: Describes a novel approach to scene recognition based on an end-to-end multi-modal CNN that combines image and context information through an attention module; it outperforms state-of-the-art methods while significantly reducing the number of network parameters.
Posted Content

Training Neural Networks for and by Interpolation

TL;DR: Adaptive Learning-rates for Interpolation with Gradients (ALI-G) exploits the interpolation property to compute an adaptive learning rate in closed form, and is applicable to non-convex problems.
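A hedged sketch of the closed-form learning rate this TL;DR describes: under interpolation the optimal loss is (near) zero, so the step size can be set from the current loss and gradient norm, capped by a maximal value. The names `eta` and `delta` and the exact cap are assumptions for illustration, not the paper's exact rule.

```python
# Sketch of an interpolation-based closed-form learning rate
# (eta = maximal learning rate, delta = small damping constant; both assumed).
import numpy as np

def alig_style_step_size(loss_value, grad, eta=0.1, delta=1e-5):
    grad = np.asarray(grad, dtype=float)
    return min(eta, loss_value / (np.dot(grad, grad) + delta))

# large gradient -> small step; small gradient -> step capped at eta
print(alig_style_step_size(0.5, [3.0, 4.0]))   # ~0.02
print(alig_style_step_size(0.5, [0.1, 0.1]))   # capped at 0.1
```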
Proceedings Article

Accelerating Frank-Wolfe with Weighted Average Gradients

TL;DR: Proposes a generalization of the Frank-Wolfe (FW) algorithm that replaces the gradient in each subproblem with a weighted average of gradients, which speeds up convergence by alleviating the zigzag behavior of FW.
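A generic sketch of the idea in this TL;DR: a standard Frank-Wolfe loop over the probability simplex in which the linear minimization oracle is fed a running average of gradients instead of the current gradient. The exponential averaging weight `rho` is an illustrative choice, not the weighting scheme analyzed in the paper.

```python
# Frank-Wolfe with an averaged gradient, minimizing 0.5 x^T A x - b^T x
# over the probability simplex (generic sketch; rho is assumed).
import numpy as np

def fw_averaged(A, b, n_iters=200, rho=0.5):
    n = b.shape[0]
    x = np.ones(n) / n                          # simplex barycenter
    g_avg = np.zeros(n)
    for t in range(1, n_iters + 1):
        grad = A @ x - b                        # current gradient
        g_avg = (1 - rho) * g_avg + rho * grad  # weighted average of gradients
        s = np.zeros(n)
        s[np.argmin(g_avg)] = 1.0               # LMO on the simplex: best vertex
        gamma = 2.0 / (t + 2.0)                 # standard FW step size
        x = x + gamma * (s - x)
    return x

A = np.diag([1.0, 2.0, 3.0])
b = np.array([1.0, 0.5, 0.25])
print(fw_averaged(A, b))
```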
Posted Content

Avoiding bad steps in Frank Wolfe variants

TL;DR: Defines the Short Step Chain (SSC) procedure, which skips gradient computations in consecutive short steps until proper stopping conditions are satisfied; this allows a unified analysis and convergence rates in the general smooth non-convex setting, as well as a linear convergence rate under a Kurdyka-Lojasiewicz (KL) condition.