Posted Content

Towards Understanding Generalization via Analytical Learning Theory

TL;DR: A novel measure-theoretic theory for machine learning that does not require statistical assumptions is introduced and a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN.
Abstract: This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.
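The references below suggest that the theory builds on classical notions of variation and discrepancy from numerical integration. The Koksma-Hlawka inequality illustrates the flavor of such assumption-free bounds, where the error is split into a term depending only on the function and a term depending only on the sample points; the paper's own theorem is stated differently, so this is only an orienting sketch:

```latex
% Classical Koksma-Hlawka inequality (quasi-Monte Carlo theory):
% no statistical assumptions are made on how the points x_1,...,x_m arise.
\left| \int_{[0,1]^d} f(x)\,dx \;-\; \frac{1}{m}\sum_{i=1}^{m} f(x_i) \right|
\;\le\; V_{\mathrm{HK}}[f] \; D^{*}(x_1,\dots,x_m)
```

Here \(V_{\mathrm{HK}}[f]\) is the variation of \(f\) in the sense of Hardy and Krause, and \(D^{*}\) is the star discrepancy of the point set.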
Citations
Posted Content
Boyi Li, Felix Wu, Ser-Nam Lim, Serge Belongie, Kilian Q. Weinberger
TL;DR: This paper proposes Moment Exchange, an implicit data augmentation method that encourages recognition models to also utilize the moment information of latent features: the moments of the learned features of one training image are replaced by those of another, and the target labels are interpolated so that the model extracts training signal from the moments.
Abstract: Modern neural network training relies heavily on data augmentation for improved generalization. After the initial success of label-preserving augmentations, there has been a recent surge of interest in label-perturbing approaches, which combine features and labels across training samples to smooth the learned decision surface. In this paper, we propose a new augmentation method that leverages the first and second moments extracted and re-injected by feature normalization. We replace the moments of the learned features of one training image by those of another, and also interpolate the target labels. As our approach is fast, operates entirely in feature space, and mixes different signals than prior methods, one can effectively combine it with existing augmentation methods. We demonstrate its efficacy across benchmark data sets in computer vision, speech, and natural language processing, where it consistently improves the generalization performance of highly competitive baseline networks.

71 citations


Additional excerpts

  • ...Many subsequent papers proposed alternative flavors of this augmentation approach based on similar insights [4, 5, 7, 27, 28, 36, 51, 63, 70]....

    [...]

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors propose Moment Exchange, an implicit data augmentation method that encourages recognition models to also utilize the moment information of latent features, by replacing the moments of the learned features of one training image with those of another and interpolating the target labels.
Abstract: The moments (a.k.a., mean and standard deviation) of latent features are often removed as noise when training image recognition models, to increase stability and reduce training time. However, in the field of image generation, the moments play a much more central role. Studies have shown that the moments extracted from instance normalization and positional normalization can roughly capture style and shape information of an image. Instead of being discarded, these moments are instrumental to the generation process. In this paper we propose Moment Exchange, an implicit data augmentation method that encourages the model to utilize the moment information also for recognition models. Specifically, we replace the moments of the learned features of one training image by those of another, and also interpolate the target labels—forcing the model to extract training signal from the moments in addition to the normalized features. As our approach is fast, operates entirely in feature space, and mixes different signals than prior methods, one can effectively combine it with existing augmentation approaches. We demonstrate its efficacy across several recognition benchmark data sets where it improves the generalization capability of highly competitive baseline networks with remarkable consistency.
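A minimal sketch of the feature-space operation described above, assuming a PyTorch setup with feature maps of shape (B, C, H, W); the function names and the fixed mixing weight are illustrative choices, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def moment_exchange(features, labels, eps=1e-5):
    """Swap each sample's per-channel feature moments (mean, std) with those
    of a randomly paired sample, and return both label sets so the training
    loss can be interpolated between them."""
    b = features.size(0)
    perm = torch.randperm(b, device=features.device)

    # Per-sample, per-channel moments over the spatial dimensions.
    mean = features.mean(dim=(2, 3), keepdim=True)
    std = features.std(dim=(2, 3), keepdim=True) + eps

    # Normalize, then re-inject the moments of the paired samples.
    normalized = (features - mean) / std
    mixed = normalized * std[perm] + mean[perm]
    return mixed, labels, labels[perm]

def moex_loss(logits, labels_a, labels_b, lam=0.9):
    # Interpolate the two targets, mirroring the label mixing in the abstract.
    return lam * F.cross_entropy(logits, labels_a) + \
           (1 - lam) * F.cross_entropy(logits, labels_b)
```

The mixed features simply continue through the remaining layers, so the method stays entirely in feature space and composes with standard input-space augmentations.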

32 citations

Book ChapterDOI
13 Oct 2019
TL;DR: A stacked semi-supervised learning (SSL) model is proposed to improve self-ensembling by exploiting the stochasticity of a disentangled latent space.
Abstract: The success of deep learning in medical imaging is mostly achieved at the cost of a large labeled data set. Semi-supervised learning (SSL) provides a promising solution by leveraging the structure of unlabeled data to improve learning from a small set of labeled data. Self-ensembling is a simple approach used in SSL to encourage consensus among ensemble predictions of unknown labels, improving generalization of the model by making it more insensitive to the latent space. Currently, such an ensemble is obtained by randomization such as dropout regularization and random data augmentation. In this work, we hypothesize – from the generalization perspective – that self-ensembling can be improved by exploiting the stochasticity of a disentangled latent space. To this end, we present a stacked SSL model that utilizes unsupervised disentangled representation learning as the stochastic embedding for self-ensembling. We evaluate the presented model for multi-label classification using chest X-ray images, demonstrating its improved performance over related SSL models as well as the interpretability of its disentangled representations.
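A minimal sketch of the consistency term this idea implies, assuming a hypothetical encoder that outputs the mean and log-variance of a latent Gaussian and a classifier that maps latent samples to logits; the actual stacked model with disentangled representations is more involved:

```python
import torch
import torch.nn.functional as F

def self_ensembling_consistency(encoder, classifier, x_unlabeled):
    """Penalize disagreement between predictions made from two stochastic
    samples of the same latent distribution (reparameterization trick), so
    the classifier becomes insensitive to latent-space perturbations."""
    mu, logvar = encoder(x_unlabeled)
    std = torch.exp(0.5 * logvar)

    z1 = mu + std * torch.randn_like(std)   # first ensemble member
    z2 = mu + std * torch.randn_like(std)   # second ensemble member

    p1 = F.softmax(classifier(z1), dim=1)
    p2 = F.softmax(classifier(z2), dim=1)
    return F.mse_loss(p1, p2)
```

This term would be added, typically with a ramp-up weight, to the supervised loss computed on the labeled subset.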

17 citations

Book ChapterDOI
02 Jun 2019
TL;DR: A sequence image reconstruction network is presented, optimized by a variational approximation of the information bottleneck principle with a stochastic latent space; it is shown that a latent representation minimally informative of the input data helps a network generalize to unseen input variations that are irrelevant to the output reconstruction.
Abstract: Deep learning networks have shown state-of-the-art performance in many image reconstruction problems. However, it is not well understood what properties of representation and learning may improve the generalization ability of the network. In this paper, we propose that the generalization ability of an encoder-decoder network for inverse reconstruction can be improved in two means. First, drawing from analytical learning theory, we theoretically show that a stochastic latent space will improve the ability of a network to generalize to test data outside the training distribution. Second, following the information bottleneck principle, we show that a latent representation minimally informative of the input data will help a network generalize to unseen input variations that are irrelevant to the output reconstruction. Therefore, we present a sequence image reconstruction network optimized by a variational approximation of the information bottleneck principle with stochastic latent space. In the application setting of reconstructing the sequence of cardiac transmembrane potential from body-surface potential, we assess the two types of generalization abilities of the presented network against its deterministic counterpart. The results demonstrate that the generalization ability of an inverse reconstruction network can be improved by stochasticity as well as the information bottleneck.
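A minimal sketch of a variational information-bottleneck objective with a stochastic latent space, using a hypothetical encoder/decoder API; it illustrates the two ingredients named in the abstract (stochasticity and minimal informativeness of the latent code), not the authors' exact sequence network:

```python
import torch
import torch.nn.functional as F

def vib_reconstruction_loss(encoder, decoder, x, y, beta=1e-3):
    """Reconstruction term plus a KL term that discourages the latent code
    from retaining more information about the input x than is needed to
    reconstruct the target y; beta sets the trade-off."""
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)      # stochastic latent sample

    recon = F.mse_loss(decoder(z), y)         # fit the reconstruction target

    # KL(q(z|x) || N(0, I)), averaged over batch and latent dimensions;
    # acts as a variational bound on the information the code keeps about x.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```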

12 citations

Posted Content
Sandesh Ghimire, Satyananda Kashyap, Joy T. Wu, Alexandros Karargyris, Mehdi Moradi
TL;DR: This work addresses the challenge of generalization to a new source by forcing the network to learn a source-invariant representation by employing an adversarial training strategy.
Abstract: Chest radiography is the most common medical image examination for screening and diagnosis in hospitals. Automatic interpretation of chest X-rays at the level of an entry-level radiologist can greatly benefit work prioritization and assist in analyzing a larger population. Subsequently, several datasets and deep learning-based solutions have been proposed to identify diseases based on chest X-ray images. However, these methods are shown to be vulnerable to shift in the source of data: a deep learning model that performs well when tested on the same dataset as the training data starts to perform poorly when tested on a dataset from a different source. In this work, we address this challenge of generalization to a new source by forcing the network to learn a source-invariant representation. By employing an adversarial training strategy, we show that a network can be forced to learn a source-invariant representation. Through pneumonia-classification experiments on multi-source chest X-ray datasets, we show that this algorithm helps in improving classification accuracy on a new source of X-ray data.
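One common way to realize such adversarial training is a domain-adversarial setup with a gradient-reversal layer, sketched below under assumed module names; the authors' exact architecture and training schedule may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def source_invariant_loss(features, labels, source_ids,
                          task_head: nn.Module, source_head: nn.Module, lam=1.0):
    """The task head is trained normally, while the source head sees features
    through gradient reversal, pushing the feature extractor to discard
    information about which dataset each X-ray came from."""
    task_loss = F.cross_entropy(task_head(features), labels)
    reversed_feats = GradReverse.apply(features, lam)
    source_loss = F.cross_entropy(source_head(reversed_feats), source_ids)
    return task_loss + source_loss
```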

4 citations


Cites background from "Towards Understanding Generalizatio..."

  • ...Modern works on generalization, however, find statistical learning theory insufficient [27] and propose other theories from an analytical perspective [13]....

    [...]

References
Book
01 Jan 2009
TL;DR: The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

7,767 citations

Proceedings ArticleDOI
14 Jun 2009
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
Abstract: Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
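A minimal sketch of one simple instantiation of a curriculum, assuming each example comes with a difficulty score; the scoring function and pacing schedule here are illustrative, not the paper's specific experimental setups:

```python
import random

def curriculum_batches(examples, difficulty, num_steps, batch_size=32, start_frac=0.2):
    """Sort examples from easy to hard and linearly grow the pool that is
    sampled from, so early training sees only the easiest fraction and later
    training sees the full data set."""
    ordered = [x for _, x in sorted(zip(difficulty, examples), key=lambda p: p[0])]
    n = len(ordered)
    for step in range(num_steps):
        frac = start_frac + (1.0 - start_frac) * step / max(1, num_steps - 1)
        pool = ordered[:max(batch_size, int(frac * n))]
        yield random.sample(pool, min(batch_size, len(pool)))
```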

4,588 citations


"Towards Understanding Generalizatio..." refers background or methods in this paper

  • ...Furthermore, by being strongly instance-dependent on the learned model ŷA(Sm), Theorem 1 supports the concept of curriculum learning (Bengio et al., 2009a)....

    [...]

  • ...…how interpolating between the representations of two images (in representation space) corresponds (when projected in image space) to other images that are plausible (are on or near the manifold of natural images), rather than to the simple addition of two natural images (Bengio et al., 2009b)....

    [...]

  • ...Example 1 partially supports the concept of the disentanglement in deep learning (Bengio et al., 2009b) and proposes a new concrete method to measure the degree of disentanglement as follows....

    [...]

Journal ArticleDOI
TL;DR: Notions of stability for learning algorithms are defined, and it is shown how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error.
Abstract: We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms such as regularization based algorithms. In particular we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.
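A representative bound of this type: for a β-uniformly stable algorithm \(A\) trained on \(m\) samples \(S\), with loss bounded by \(M\), the following holds with probability at least \(1-\delta\) (approximate form; the cited paper gives the exact constants and variants based on the leave-one-out error):

```latex
R(A_S) \;\le\; \widehat{R}_{\mathrm{emp}}(A_S) \;+\; 2\beta
\;+\; \bigl(4 m \beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2m}}
```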

1,690 citations

Proceedings Article
21 Feb 2015
TL;DR: In this paper, the authors study the connection between the loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of variable independence, redundancy in network parametrization, and uniformity.
Abstract: We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have nonzero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.
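For orientation, the H-spin spherical spin-glass Hamiltonian that the network's loss is related to has, up to notation, the following form (illustrative; the paper's scaling conventions may differ):

```latex
\mathcal{L}_{\Lambda,H}(w) \;=\; \frac{1}{\Lambda^{(H-1)/2}}
\sum_{i_1,\dots,i_H=1}^{\Lambda} X_{i_1,\dots,i_H}\, w_{i_1} w_{i_2}\cdots w_{i_H},
\qquad \text{subject to}\;\; \frac{1}{\Lambda}\sum_{i=1}^{\Lambda} w_i^{2} = 1
```

with i.i.d. standard Gaussian couplings \(X_{i_1,\dots,i_H}\).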

970 citations

Journal ArticleDOI

820 citations


"Towards Understanding Generalizatio..." refers methods in this paper

  • ...These definitions have been used in harmonic analysis, number theory, and numerical analysis (Krause, 1903; Hardy, 1906; Hlawka, 1961; Niederreiter, 1978; Aistleitner et al., 2017)....

    [...]