Journal ArticleDOI

Flat minima

Sepp Hochreiter, Jürgen Schmidhuber
- Neural Computation, Vol. 9, Iss: 1, pp 1-42
TLDR
A new algorithm for finding low-complexity neural networks with high generalization capability that outperforms conventional backprop, weight decay, and optimal brain surgeon/optimal brain damage; although it requires the computation of second-order derivatives, it has backpropagation's order of complexity.
Abstract
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a flat minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to simple networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a good weight prior. Instead we have a prior over input/output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation's order of complexity. It automatically and effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and optimal brain surgeon/optimal brain damage.
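
To make the central notion concrete: flatness can be probed, after training, by measuring how much the error rises under small random weight perturbations. The sketch below is an illustration of the concept only, not the paper's algorithm; all function and variable names are ours.

```python
import numpy as np

def sharpness(loss_fn, w, eps=0.1, n_samples=100, seed=0):
    """Average loss increase over random perturbations of radius eps.

    A small value indicates a flat minimum: a large connected region
    in weight space where the error stays approximately constant.
    """
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    increases = []
    for _ in range(n_samples):
        d = rng.normal(size=w.shape)
        d *= eps / np.linalg.norm(d)      # perturbation on the eps-sphere
        increases.append(loss_fn(w + d) - base)
    return float(np.mean(increases))

# Two minima with identical training error but different curvature:
def sharp_loss(w):
    return 10.0 * np.sum(w ** 2)          # narrow valley

def flat_loss(w):
    return 0.1 * np.sum(w ** 2)           # wide valley

w_star = np.zeros(5)
print(sharpness(sharp_loss, w_star))      # large: error rises quickly
print(sharpness(flat_loss, w_star))       # small: error stays nearly flat
```

The paper's flat minimum search goes further: a second-order flatness term is built into the training objective itself, which is why the algorithm needs second-order derivatives yet retains backpropagation's order of complexity.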


Citations
Journal ArticleDOI

Deep learning in neural networks

TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium; it reviews deep supervised learning, unsupervised learning, reinforcement learning, and evolutionary computation, as well as indirect search for short programs encoding deep and large networks.
Book

Pattern recognition and neural networks

TL;DR: In this self-contained account, Professor Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks.
Posted Content

GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium

TL;DR: In this article, a two time-scale update rule (TTUR) was proposed for training GANs with stochastic gradient descent on arbitrary GAN loss functions; it assigns an individual learning rate to the discriminator and to the generator.
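
A minimal sketch of the rule itself: one optimizer per network, each with its own learning rate. The stand-in networks, the non-saturating softplus loss, and the specific learning-rate values are illustrative assumptions, not the paper's experimental setup.

```python
import torch
import torch.nn.functional as F

G = torch.nn.Linear(16, 32)   # stand-in generator
D = torch.nn.Linear(32, 1)    # stand-in discriminator

# Two time-scale update rule: one optimizer per network, with an
# individual learning rate for each (values here are illustrative).
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.5, 0.999))

for step in range(100):
    z = torch.randn(8, 16)
    x_real = torch.randn(8, 32)           # stand-in for real data

    # Discriminator step on the faster time scale.
    loss_D = (F.softplus(-D(x_real)).mean()
              + F.softplus(D(G(z).detach())).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step on the slower time scale.
    loss_G = F.softplus(-D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```
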
Proceedings ArticleDOI

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

TL;DR: Super-convergence, as discussed by the authors, is a phenomenon in which residual networks can be trained using an order of magnitude fewer iterations than standard training methods require; it is relevant to understanding why deep networks generalize well.
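
Super-convergence hinges on a schedule that ramps the learning rate up to an unusually large peak and back down. Below is a minimal triangular schedule in that spirit; the paper's exact policy differs, and the numbers here are placeholders.

```python
def cyclical_lr(step, total_steps, base_lr=0.1, max_lr=3.0):
    """Triangular schedule: ramp linearly from base_lr to a very large
    max_lr at the midpoint, then back down (illustrative values only)."""
    half = total_steps / 2.0
    if step <= half:
        return base_lr + (max_lr - base_lr) * (step / half)
    return max_lr - (max_lr - base_lr) * ((step - half) / half)

# Applied once per step, e.g. with a torch optimizer:
# for group in optimizer.param_groups:
#     group["lr"] = cyclical_lr(step, total_steps)
```
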
Book ChapterDOI

Learning to Learn Using Gradient Descent

TL;DR: This paper makes meta-learning in large systems feasible by using recurrent neural networks with attendant learning routines as meta-learning systems, and demonstrates the gradient-descent approach on nonstationary time series prediction.
References
Journal ArticleDOI

A mathematical theory of communication

TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
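
Two formulas from the continuous case the TL;DR refers to, for reference: the differential entropy of a source with density p(x), and the capacity of a channel of bandwidth W with signal power P and gaussian noise power N.

```latex
h(X) = -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx,
\qquad
C = W \log_2\!\left(1 + \frac{P}{N}\right)
```
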
Journal ArticleDOI

Cross-Validatory Choice and Assessment of Statistical Predictions

TL;DR: In this article, a generalized form of the cross-validation criterion is applied to the choice and assessment of prediction using the data-analytic concept of a prescription, and examples used to illustrate the application are drawn from the problem areas of univariate estimation, linear regression and analysis of variance.
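
A minimal sketch of cross-validatory assessment: hold out each of k folds in turn, fit on the remainder, and average the held-out prediction error. The fit/predict interface and the squared-error score are illustrative assumptions, not Stone's notation.

```python
import numpy as np

def cv_score(fit, predict, X, y, k=5, seed=0):
    """Hold out each of k folds in turn, fit on the remaining data,
    and average the held-out squared prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return float(np.mean(errors))

# e.g. choosing among candidate models by cross-validated error:
# best = min(candidates, key=lambda m: cv_score(m.fit, m.predict, X, y))
```
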
Journal ArticleDOI

Modeling by shortest data description

Jorma Rissanen
- 01 Sep 1978 - 
TL;DR: The number of digits it takes to write down an observed sequence x1,...,xN of a time series depends on the model with its parameters that one assumes to have generated the observed data.
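
The shortest-description idea can be made concrete with a toy two-part code: total description length = parameter cost + cost of encoding the residuals. The gaussian residual code and the (k/2) ln N parameter cost below are simplifying assumptions in the spirit of the paper, not its exact scheme.

```python
import numpy as np

def description_length(x, y, degree):
    """Two-part code length (in nats) for a polynomial model:
    L(model) ~ (k/2) ln N for k parameters, plus L(data | model)
    as the negative log-likelihood of gaussian residuals."""
    N, k = len(x), degree + 1
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(np.mean(resid ** 2), 1e-12)
    data_cost = 0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    model_cost = 0.5 * k * np.log(N)
    return data_cost + model_cost

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.normal(size=50)
best = min(range(1, 8), key=lambda d: description_length(x, y, d))
print(best)  # typically recovers degree 2: the shortest description wins
```
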
Journal ArticleDOI

Bayesian interpolation

TL;DR: The Bayesian approach to regularization and model comparison is demonstrated on the problem of interpolating noisy data, by examining the posterior probability distribution of regularizing constants and noise levels.
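
A minimal sketch of the evidence framework on a toy linear-gaussian model: the regularizing constant is chosen by maximizing the log marginal likelihood over a grid. The setup and values are illustrative; MacKay's framework covers general interpolation models and also infers the noise level.

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log marginal likelihood of a linear model y = Xw + noise with
    weight prior w ~ N(0, I/alpha) and noise ~ N(0, I/beta), i.e.
    y ~ N(0, X X^T / alpha + I / beta)."""
    N = len(y)
    C = X @ X.T / alpha + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

# Evidence-based choice of the regularizing constant (grid search):
alphas = np.logspace(-3, 3, 13)
best_alpha = max(alphas, key=lambda a: log_evidence(X, y, a, beta=100.0))
print(best_alpha)
```
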
Proceedings Article

Optimal Brain Damage

TL;DR: A class of practical and nearly optimal schemes for adapting the size of a neural network by using second-derivative information to make a tradeoff between network complexity and training set error is derived.
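
The tradeoff in the TL;DR reduces to the OBD saliency s_i = h_ii w_i^2 / 2 under the paper's diagonal-Hessian approximation. A toy sketch, with the Hessian diagonal supplied by the caller (an illustrative interface; in practice it comes from a backprop-style pass):

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """Optimal Brain Damage saliency: the estimated increase in
    training error from deleting weight i, under the diagonal
    approximation of the Hessian:  s_i = h_ii * w_i^2 / 2."""
    return 0.5 * hessian_diag * weights ** 2

def prune_least_salient(weights, hessian_diag, frac=0.2):
    """Zero out the fraction of weights with the lowest saliency."""
    s = obd_saliencies(weights, hessian_diag)
    k = int(len(weights) * frac)
    pruned = weights.copy()
    pruned[np.argsort(s)[:k]] = 0.0
    return pruned

w = np.array([0.8, -0.05, 1.2, 0.01, -0.4])
h = np.array([2.0, 1.5, 0.1, 3.0, 0.5])   # illustrative Hessian diagonal
print(prune_least_salient(w, h, frac=0.4))  # drops the two least salient
```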