Journal ArticleDOI

Flat minima

Sepp Hochreiter, Jürgen Schmidhuber
- Neural Computation, Vol. 9, Iss: 1, pp 1-42
TLDR
A new algorithm for finding low-complexity neural networks with high generalization capability that outperforms conventional backprop, weight decay, and optimal brain surgeon/optimal brain damage; although it requires the computation of second-order derivatives, it has backpropagation's order of complexity.
Abstract
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a flat minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to simple networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a good weight prior. Instead we have a prior over input/output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation's order of complexity. It automatically and effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and optimal brain surgeon/optimal brain damage.
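
To make the central notion concrete: flatness can be probed, after training, by measuring how much the error rises under small random weight perturbations. The sketch below is an illustration of the concept only, not the paper's algorithm; all function and variable names are ours.

```python
import numpy as np

def sharpness(loss_fn, w, eps=0.1, n_samples=100, seed=0):
    """Average loss increase over random perturbations of radius eps.

    A small value indicates a flat minimum: a large connected region
    in weight space where the error stays approximately constant.
    """
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    increases = []
    for _ in range(n_samples):
        d = rng.normal(size=w.shape)
        d *= eps / np.linalg.norm(d)      # perturbation on the eps-sphere
        increases.append(loss_fn(w + d) - base)
    return float(np.mean(increases))

# Two minima with identical training error but different curvature:
def sharp_loss(w):
    return 10.0 * np.sum(w ** 2)          # narrow valley

def flat_loss(w):
    return 0.1 * np.sum(w ** 2)           # wide valley

w_star = np.zeros(5)
print(sharpness(sharp_loss, w_star))      # large: error rises quickly
print(sharpness(flat_loss, w_star))       # small: error stays nearly flat
```

The paper's flat minimum search goes further: a second-order flatness term is built into the training objective itself, which is why the algorithm needs second-order derivatives yet retains backpropagation's order of complexity.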


Citations
Journal ArticleDOI

Deep learning in neural networks

TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium; it reviews deep supervised learning, unsupervised learning, reinforcement learning, and evolutionary computation, as well as indirect search for short programs encoding deep and large networks.
Book

Pattern recognition and neural networks

TL;DR: In this self-contained account, Professor Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks.
Posted Content

GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium

TL;DR: In this article, a two time-scale update rule (TTUR) was proposed for training GANs with stochastic gradient descent on arbitrary GAN loss functions; it assigns an individual learning rate to the discriminator and to the generator.
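
A minimal sketch of the rule itself: one optimizer per network, each with its own learning rate. The stand-in networks, the non-saturating softplus loss, and the specific learning-rate values are illustrative assumptions, not the paper's experimental setup.

```python
import torch
import torch.nn.functional as F

G = torch.nn.Linear(16, 32)   # stand-in generator
D = torch.nn.Linear(32, 1)    # stand-in discriminator

# Two time-scale update rule: one optimizer per network, with an
# individual learning rate for each (values here are illustrative).
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.5, 0.999))

for step in range(100):
    z = torch.randn(8, 16)
    x_real = torch.randn(8, 32)           # stand-in for real data

    # Discriminator step on the faster time scale.
    loss_D = (F.softplus(-D(x_real)).mean()
              + F.softplus(D(G(z).detach())).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step on the slower time scale.
    loss_G = F.softplus(-D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```
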
Proceedings ArticleDOI

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

TL;DR: Super-convergence, as discussed by the authors, is a phenomenon in which residual networks can be trained using an order of magnitude fewer iterations than standard training methods require; it is relevant to understanding why deep networks generalize well.
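
Super-convergence hinges on a schedule that ramps the learning rate up to an unusually large peak and back down. Below is a minimal triangular schedule in that spirit; the paper's exact policy differs, and the numbers here are placeholders.

```python
def cyclical_lr(step, total_steps, base_lr=0.1, max_lr=3.0):
    """Triangular schedule: ramp linearly from base_lr to a very large
    max_lr at the midpoint, then back down (illustrative values only)."""
    half = total_steps / 2.0
    if step <= half:
        return base_lr + (max_lr - base_lr) * (step / half)
    return max_lr - (max_lr - base_lr) * ((step - half) / half)

# Applied once per step, e.g. with a torch optimizer:
# for group in optimizer.param_groups:
#     group["lr"] = cyclical_lr(step, total_steps)
```
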
Book ChapterDOI

Learning to Learn Using Gradient Descent

TL;DR: This paper makes meta-learning in large systems feasible by using recurrent neural networks with attendant learning routines as meta-learning systems, and demonstrates the gradient-descent approach on nonstationary time series prediction.
References
Journal ArticleDOI

A mathematical theory of communication

TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
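
Two formulas from the continuous case the TL;DR refers to, for reference: the differential entropy of a source with density p(x), and the capacity of a channel of bandwidth W with signal power P and gaussian noise power N.

```latex
h(X) = -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx,
\qquad
C = W \log_2\!\left(1 + \frac{P}{N}\right)
```
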
Journal ArticleDOI

Cross-Validatory Choice and Assessment of Statistical Predictions

TL;DR: In this article, a generalized form of the cross-validation criterion is applied to the choice and assessment of prediction using the data-analytic concept of a prescription, and examples used to illustrate the application are drawn from the problem areas of univariate estimation, linear regression and analysis of variance.
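
A minimal sketch of cross-validatory assessment: hold out each of k folds in turn, fit on the remainder, and average the held-out prediction error. The fit/predict interface and the squared-error score are illustrative assumptions, not Stone's notation.

```python
import numpy as np

def cv_score(fit, predict, X, y, k=5, seed=0):
    """Hold out each of k folds in turn, fit on the remaining data,
    and average the held-out squared prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return float(np.mean(errors))

# e.g. choosing among candidate models by cross-validated error:
# best = min(candidates, key=lambda m: cv_score(m.fit, m.predict, X, y))
```
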
Journal ArticleDOI

Modeling by shortest data description

Jorma Rissanen
- 01 Sep 1978 - 
TL;DR: The number of digits it takes to write down an observed sequence x1,...,xN of a time series depends on the model with its parameters that one assumes to have generated the observed data.
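
The shortest-description idea can be made concrete with a toy two-part code: total description length = parameter cost + cost of encoding the residuals. The gaussian residual code and the (k/2) ln N parameter cost below are simplifying assumptions in the spirit of the paper, not its exact scheme.

```python
import numpy as np

def description_length(x, y, degree):
    """Two-part code length (in nats) for a polynomial model:
    L(model) ~ (k/2) ln N for k parameters, plus L(data | model)
    as the negative log-likelihood of gaussian residuals."""
    N, k = len(x), degree + 1
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(np.mean(resid ** 2), 1e-12)
    data_cost = 0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    model_cost = 0.5 * k * np.log(N)
    return data_cost + model_cost

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.normal(size=50)
best = min(range(1, 8), key=lambda d: description_length(x, y, d))
print(best)  # typically recovers degree 2: the shortest description wins
```
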
Journal ArticleDOI

Bayesian interpolation

TL;DR: The Bayesian approach to regularization and model comparison is demonstrated on the problem of interpolating noisy data, by examining the posterior probability distribution of regularizing constants and noise levels.
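
A minimal sketch of the evidence framework on a toy linear-gaussian model: the regularizing constant is chosen by maximizing the log marginal likelihood over a grid. The setup and values are illustrative; MacKay's framework covers general interpolation models and also infers the noise level.

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log marginal likelihood of a linear model y = Xw + noise with
    weight prior w ~ N(0, I/alpha) and noise ~ N(0, I/beta), i.e.
    y ~ N(0, X X^T / alpha + I / beta)."""
    N = len(y)
    C = X @ X.T / alpha + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

# Evidence-based choice of the regularizing constant (grid search):
alphas = np.logspace(-3, 3, 13)
best_alpha = max(alphas, key=lambda a: log_evidence(X, y, a, beta=100.0))
print(best_alpha)
```
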
Proceedings Article

Optimal Brain Damage

TL;DR: A class of practical and nearly optimal schemes for adapting the size of a neural network by using second-derivative information to make a tradeoff between network complexity and training set error is derived.
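
The tradeoff in the TL;DR reduces to the OBD saliency s_i = h_ii w_i^2 / 2 under the paper's diagonal-Hessian approximation. A toy sketch, with the Hessian diagonal supplied by the caller (an illustrative interface; in practice it comes from a backprop-style pass):

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """Optimal Brain Damage saliency: the estimated increase in
    training error from deleting weight i, under the diagonal
    approximation of the Hessian:  s_i = h_ii * w_i^2 / 2."""
    return 0.5 * hessian_diag * weights ** 2

def prune_least_salient(weights, hessian_diag, frac=0.2):
    """Zero out the fraction of weights with the lowest saliency."""
    s = obd_saliencies(weights, hessian_diag)
    k = int(len(weights) * frac)
    pruned = weights.copy()
    pruned[np.argsort(s)[:k]] = 0.0
    return pruned

w = np.array([0.8, -0.05, 1.2, 0.01, -0.4])
h = np.array([2.0, 1.5, 0.1, 3.0, 0.5])   # illustrative Hessian diagonal
print(prune_least_salient(w, h, frac=0.4))  # drops the two least salient
```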