Open Access Proceedings ArticleDOI

Generalization Error Bounds for Noisy, Iterative Algorithms

TLDR
This paper derives generalization error bounds for a broad class of iterative algorithms characterized by bounded, noisy updates with Markovian structure, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm.
Abstract
In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.
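For reference, the Xu and Raginsky (2017) bound that this abstract builds on can be sketched as follows (the notation $L_\mu$, $L_S$ for population and empirical risk is introduced here and is not quoted from the paper): if the loss $\ell(w, Z)$ is $\sigma$-sub-Gaussian for every $w$ and the training set $S$ contains $n$ i.i.d. samples, then

$$\left|\mathbb{E}\bigl[L_\mu(W) - L_S(W)\bigr]\right| \le \sqrt{\frac{2\sigma^2}{n}\, I(S;W)}.$$

The paper leverages this to bound the generalization error of algorithms whose iterates are produced by bounded, noisy, Markovian updates, such as SGLD and SGHMC variants.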


Citations
Proceedings Article

On the Power of Over-parametrization in Neural Networks with Quadratic Activation

TL;DR: The authors show that over-parametrization enables local search algorithms to find a globally optimal solution for general smooth and convex loss functions, and that the learned solution also generalizes well when the data are sampled from a regular distribution such as a Gaussian.
Proceedings Article

Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence

TL;DR: A PAC-Bayes generalization bound is proved for neural networks trained by SGD; the bound correlates positively with the ratio of batch size to learning rate, providing a theoretical foundation for the training strategy of controlling that ratio.
Proceedings ArticleDOI

Tightening Mutual Information Based Bounds on Generalization Error

TL;DR: Applications to noisy, iterative algorithms such as stochastic gradient Langevin dynamics (SGLD) are also studied, where the constructed bound provides a tighter characterization of the generalization error than existing results.
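The tighter characterization mentioned in this TL;DR is commonly stated as an individual-sample refinement of the mutual-information bound; the following form is a sketch under the same sub-Gaussian assumption as above (with $Z_i$ denoting the $i$-th training sample), not a quotation from the cited paper:

$$\left|\mathbb{E}\bigl[L_\mu(W) - L_S(W)\bigr]\right| \le \frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^2\, I(W; Z_i)},$$

which, for i.i.d. samples, is never larger than the bound based on the full mutual information $I(S;W)$.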
Posted Content

Chaining Mutual Information and Tightening Generalization Bounds.

TL;DR: This paper introduces a technique for combining the chaining and mutual information methods, obtaining a generalization bound that is both algorithm-dependent and able to exploit the dependencies between hypotheses.
Posted Content

Where is the Information in a Deep Neural Network

TL;DR: A novel notion of effective information in the activations of a deep network is established, which is used to show that models with low (information) complexity not only generalize better, but are bound to learn invariant representations of future inputs.
References

Statistical learning theory

TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.
Book

Understanding Machine Learning: From Theory To Algorithms

TL;DR: The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way in an advanced undergraduate or beginning graduate course.
Book

Concentration Inequalities: A Nonasymptotic Theory of Independence

TL;DR: Deep connections with isoperimetric problems are revealed, with special attention paid to applications involving suprema of empirical processes.
Proceedings Article

Bayesian Learning via Stochastic Gradient Langevin Dynamics

TL;DR: This paper proposes a new framework for learning from large-scale datasets based on iterative learning from small mini-batches: by adding the right amount of noise to a standard stochastic gradient optimization algorithm, the iterates converge to samples from the true posterior distribution as the stepsize is annealed.
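As a sketch of the update described in this TL;DR (notation assumed here rather than quoted): with stepsize $\epsilon_t$, prior $p(\theta)$, and a mini-batch of $n$ points $x_{t_1},\dots,x_{t_n}$ drawn from a dataset of size $N$, the SGLD update takes the form

$$\Delta\theta_t = \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{t_i}\mid\theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\, \epsilon_t I),$$

so the injected Gaussian noise is matched to the stepsize; annealing $\epsilon_t$ toward zero moves the iterates from stochastic optimization toward sampling from the posterior.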
Journal ArticleDOI

Stability and generalization

TL;DR: Several notions of stability for learning algorithms are defined, and it is shown how to use them to derive generalization error bounds based on the empirical error and the leave-one-out error.
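A minimal statement of the kind of stability result referred to here (a sketch, with the exact constants depending on the stability definition used): if replacing any single training example changes the loss of the learned hypothesis by at most $\beta$ at every test point (uniform stability), then the expected generalization gap satisfies

$$\left|\mathbb{E}_S\bigl[L_\mu(A(S)) - L_S(A(S))\bigr]\right| \le \beta,$$

and high-probability analogues follow from concentration inequalities such as McDiarmid's.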