Open Access Proceedings Article

Cyclical Learning Rates for Training Neural Networks

TL;DR
A new method for setting the learning rate, named cyclical learning rates, is described, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates.
Abstract
It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" – linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
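As a concrete illustration, here is a minimal Python sketch of the two ideas in the abstract: the triangular cyclical schedule and the linear increase used to estimate its bounds. The default values for base_lr, max_lr, and step_size are placeholders for illustration, not values taken from the paper.

```python
import numpy as np

def triangular_clr(iteration, base_lr=1e-3, max_lr=6e-3, step_size=2000):
    """Triangular policy: the learning rate climbs linearly from base_lr to
    max_lr over step_size iterations, descends back over the next step_size
    iterations, and the cycle then repeats."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

def lr_range_test(iteration, total_iterations, min_lr=1e-5, max_lr=1.0):
    """Range test: grow the learning rate linearly for a few epochs; the rate
    at which accuracy first improves and the rate at which it starts to degrade
    are reasonable choices for the cycle's lower and upper bounds."""
    return min_lr + (max_lr - min_lr) * iteration / total_iterations
```

The schedule reaches max_lr at iteration step_size and returns to base_lr at every multiple of 2 * step_size.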


Citations
Proceedings Article

Universal Language Model Fine-tuning for Text Classification

TL;DR: Universal Language Model Fine-tuning (ULMFiT) is an effective transfer learning method that can be applied to any task in NLP; the work introduces techniques that are key for fine-tuning a language model.
Posted Content

Fixing Weight Decay Regularization in Adam

TL;DR: This work decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets.
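A minimal sketch of the decoupling idea, illustrated here with SGD plus momentum (the same separation applied to Adam is what became known as AdamW); the hyper-parameter defaults are illustrative.

```python
def sgd_momentum_l2_step(w, velocity, grad, lr=0.1, momentum=0.9, wd=1e-4):
    """Coupled L2 regularization: the decay term wd * w is folded into the
    gradient, so it is also accumulated by the momentum buffer and rescaled
    along with everything else the optimizer does to gradients."""
    velocity = momentum * velocity + grad + wd * w
    return w - lr * velocity, velocity

def sgd_momentum_decoupled_step(w, velocity, grad, lr=0.1, momentum=0.9, wd=1e-4):
    """Decoupled weight decay: the optimizer sees only the loss gradient, and
    the decay shrinks the weights in a separate step, so its effective strength
    no longer depends on how the optimizer transforms the gradient."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w - lr * wd * w, velocity
```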
Proceedings Article

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

TL;DR: Super-convergence is a phenomenon in which residual networks can be trained in an order of magnitude fewer iterations than with standard training methods, and it is relevant to understanding why deep networks generalize well.
Posted Content

A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

Leslie N. Smith
26 Mar 2018
TL;DR: This report shows how to examine the training and validation/test loss for subtle clues of underfitting and overfitting, suggests guidelines for moving toward the optimal balance point, and discusses how to increase or decrease the learning rate and momentum to speed up training.
Proceedings Article

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

TL;DR: The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; this approach won 1st place on the ILSVRC 2015 classification task.
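The central mechanism is simple enough to sketch; below is a hedged PyTorch version of a basic residual block, with illustrative channel counts and without the downsampling/projection variants used in the full architectures.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added back to the input via an
    identity shortcut, so the stacked layers only learn a residual F(x)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # shortcut: y = F(x) + x
```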
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
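For reference, a compact NumPy sketch of a single Adam update as summarized above: exponential moving averages of the gradient and its square, bias-corrected, scale each parameter's step. The defaults shown are the commonly used settings.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (1-indexed) for parameters w with gradient grad."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```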
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art classification performance on ImageNet.
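A hedged PyTorch sketch of the layout described in the summary: five convolutional layers interleaved with max-pooling, followed by three fully-connected layers ending in a 1000-way classifier. Channel sizes follow the common single-stream variant; local response normalization and the original two-GPU split are omitted.

```python
import torch.nn as nn

# Assumes 3x224x224 inputs; the softmax is applied inside the loss function.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),  # final 1000-way classifier
)
```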
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3×3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
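A short PyTorch sketch of that design: depth built from stacked 3x3 convolutions grouped into blocks separated by 2x2 max-pooling. The block widths shown follow the common 16-layer configuration and are illustrative only.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """n_convs stacked 3x3 convolutions (stride 1, padding 1) followed by 2x2 max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

# 13 convolutional layers plus three fully-connected layers gives the 16-layer variant.
features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
```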
Journal Article

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually since 2010 and attracts participation from more than fifty institutions.