Showing papers on "Generalization published in 2018"


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This paper presents a novel framework based on adversarial autoencoders to learn a generalized latent feature representation across domains for domain generalization, and proposes an algorithm to jointly train the different components of the framework.
Abstract: In this paper, we tackle the problem of domain generalization: how to learn a generalized feature representation for an "unseen" target domain by taking advantage of multiple seen source-domain data. We present a novel framework based on adversarial autoencoders to learn a generalized latent feature representation across domains for domain generalization. To be specific, we extend adversarial autoencoders by imposing the Maximum Mean Discrepancy (MMD) measure to align the distributions among different domains, and matching the aligned distribution to an arbitrary prior distribution via adversarial feature learning. In this way, the learned feature representation is supposed to be universal to the seen source domains because of the MMD regularization, and is expected to generalize well on the target domain because of the introduction of the prior distribution. We propose an algorithm to jointly train the different components of our framework. Extensive experiments on various vision tasks demonstrate that our proposed framework can learn better generalized features for the unseen target domain compared with state-of-the-art domain generalization methods.
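
For readers unfamiliar with the MMD regularizer used above, the sketch below computes a biased RBF-kernel MMD² estimate between two batches of latent codes; the bandwidth and the toy domains are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased MMD^2 estimate between samples X (n,d) and Y (m,d) with an RBF kernel."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Toy usage: latent codes from two "domains" drawn from shifted Gaussians.
rng = np.random.default_rng(0)
z_dom1 = rng.normal(0.0, 1.0, size=(128, 16))
z_dom2 = rng.normal(0.5, 1.0, size=(128, 16))
print(rbf_mmd2(z_dom1, z_dom2))  # larger than for identically distributed codes
```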

908 citations


Proceedings Article
03 Dec 2018
TL;DR: In this paper, a theoretically grounded set of noise-robust loss functions that can be seen as a generalization of mean absolute error (MAE) and categorical cross entropy (CCE) loss is proposed.
Abstract: Deep neural networks (DNNs) have achieved tremendous success in a variety of applications across many disciplines. Yet, their superior performance comes with the expensive cost of requiring correctly annotated large-scale datasets. Moreover, due to DNNs' rich capacity, errors in training labels can hamper performance. To combat this problem, mean absolute error (MAE) has recently been proposed as a noise-robust alternative to the commonly-used categorical cross entropy (CCE) loss. However, as we show in this paper, MAE can perform poorly with DNNs and challenging datasets. Here, we present a theoretically grounded set of noise-robust loss functions that can be seen as a generalization of MAE and CCE. Proposed loss functions can be readily applied with any existing DNN architecture and algorithm, while yielding good performance in a wide range of noisy label scenarios. We report results from experiments conducted with CIFAR-10, CIFAR-100 and FASHION-MNIST datasets and synthetically generated noisy labels.
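
The proposed family is the negative Box-Cox loss L_q(p) = (1 - p_y^q)/q, which recovers CCE as q → 0 and is proportional to MAE at q = 1. A minimal PyTorch sketch under those definitions (the function name and the q default are illustrative):

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """L_q loss: (1 - p_y^q) / q. q -> 0 recovers CCE; q = 1 is (up to scale) MAE."""
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-7) ** q) / q).mean()

# Toy usage
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = generalized_cross_entropy(logits, targets)
loss.backward()
```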

703 citations


Proceedings ArticleDOI
04 Jun 2018
TL;DR: The proposed method, normalized direction-preserving Adam (ND-Adam), enables more precise control of the direction and step size for updating weight vectors, leading to significantly improved generalization performance.
Abstract: Adaptive optimization algorithms, such as Adam and RMSprop, have achieved better optimization performance than stochastic gradient descent (SGD) in some scenarios. However, recent studies show that they often lead to worse generalization performance than SGD, especially for training deep neural networks (DNNs). In this work, we identify the reasons that Adam generalizes worse than SGD, and develop a variant of Adam to eliminate the generalization gap. The proposed method, normalized direction-preserving Adam (ND-Adam), enables more precise control of the direction and step size for updating weight vectors, leading to significantly improved generalization performance. Following a similar rationale, we further improve the generalization performance in classification tasks by regularizing the softmax logits. By bridging the gap between SGD and Adam, we also hope to shed light on why certain optimization algorithms generalize better than others.
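
As the abstract describes it, ND-Adam applies Adam to the component of each weight vector's gradient orthogonal to the vector itself and keeps the vector at unit norm, so direction and step size are controlled separately. A simplified single-vector sketch of that idea (not the paper's full algorithm; hyper-parameters are placeholders):

```python
import numpy as np

def nd_adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One simplified ND-Adam update for a unit-norm weight vector w."""
    g_t = g - np.dot(g, w) * w           # remove the radial component of the gradient
    m = b1 * m + (1 - b1) * g_t          # Adam first moment
    v = b2 * v + (1 - b2) * g_t**2       # Adam second moment
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w / np.linalg.norm(w)            # restore unit norm: only the direction changes
    return w, m, v

# Toy usage on a random unit vector with random gradients
rng = np.random.default_rng(0)
w = rng.normal(size=8); w /= np.linalg.norm(w)
m = np.zeros(8); v = np.zeros(8)
for t in range(1, 4):
    w, m, v = nd_adam_step(w, rng.normal(size=8), m, v, t)
```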

591 citations


Proceedings Article
03 Jul 2018
TL;DR: This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.
Abstract: Humans can understand and produce new utterances effortlessly, thanks to their compositional skills. Once a person learns the meaning of a new verb "dax," he or she can immediately understand the meaning of "dax twice" or "sing and dax." In this paper, we introduce the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences. We then test the zero-shot generalization capabilities of a variety of recurrent neural networks (RNNs) trained on SCAN with sequence-to-sequence methods. We find that RNNs can make successful zero-shot generalizations when the differences between training and test commands are small, so that they can apply "mix-and-match" strategies to solve the task. However, when generalization requires systematic compositional skills (as in the "dax" example above), RNNs fail spectacularly. We conclude with a proof-of-concept experiment in neural machine translation, suggesting that lack of systematicity might be partially responsible for neural networks' notorious training data thirst.
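
To make the compositional setup concrete, here is a toy interpreter in the spirit of SCAN, mapping command strings to action sequences; the grammar fragment is a simplification of the actual SCAN grammar.

```python
PRIMITIVES = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}

def interpret(command):
    """Map a toy SCAN-style command to its action sequence."""
    if " and " in command:
        left, right = command.split(" and ", 1)
        return interpret(left) + interpret(right)
    words = command.split()
    if words[-1] == "twice":
        return interpret(" ".join(words[:-1])) * 2
    if words[-1] == "thrice":
        return interpret(" ".join(words[:-1])) * 3
    return [PRIMITIVES[words[0]]]

print(interpret("jump twice and run"))  # ['JUMP', 'JUMP', 'RUN']
```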

583 citations


Proceedings Article
14 Feb 2018
TL;DR: The authors derive compression-based generalization bounds for deep nets that are orders of magnitude tighter than naive parameter counting, provide theoretical justification for the widespread empirical success of compressing deep nets, and extend these results to convolutional networks.
Abstract: Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net: a compression that is explicit and efficient. These yield generalization bounds via a simple compression-based framework introduced here. Our results also provide some theoretical justification for widespread empirical success in compressing deep nets. Analysis of correctness of our compression relies upon some newly identified "noise stability" properties of trained deep nets, which are also experimentally verified. The study of these properties and resulting generalization bounds are also extended to convolutional nets, which had eluded earlier attempts on proving generalization.
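
The authors' compression exploits the noise-stability properties mentioned above; as a generic illustration of compression-as-reparametrization (not the paper's scheme), a low-rank factorization of one weight matrix already shows how parameter counts shrink:

```python
import numpy as np

def compress_layer(W, rank):
    """Low-rank reparametrization of a weight matrix: W ~ U_r @ V_r."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
U_r, V_r = compress_layer(W, rank=32)
print(W.size, U_r.size + V_r.size)        # 65536 parameters -> 16384
print(np.linalg.norm(W - U_r @ V_r))      # reconstruction error of the compression
```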

554 citations


Proceedings ArticleDOI
15 Feb 2018
TL;DR: This paper explores the structure of neural loss functions and the effect of loss landscapes on generalization using a range of visualization methods, examining how network architecture affects the loss landscape and how training parameters affect the shape of minimizers.
Abstract: Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that are easier to train, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, are not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
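
A minimal sketch of the one-dimensional version of filter normalization follows; the per-row treatment of non-convolutional weights and the helper names are my assumptions.

```python
import torch

def filter_normalized_direction(model):
    """Random direction d with each filter rescaled to match the model's filter norms."""
    d = [torch.randn_like(p) for p in model.parameters()]
    for di, p in zip(d, model.parameters()):
        if p.dim() > 1:  # conv/linear weights: normalize per output filter/row
            for j in range(p.shape[0]):
                di[j] *= p[j].norm() / (di[j].norm() + 1e-10)
        else:            # biases etc.: copy the parameter's norm
            di.mul_(p.norm() / (di.norm() + 1e-10))
    return d

def loss_along_direction(model, d, loss_fn, alphas):
    """Evaluate loss at theta + alpha * d for each alpha, then restore the model."""
    base = [p.detach().clone() for p in model.parameters()]
    vals = []
    with torch.no_grad():
        for a in alphas:
            for p, b, di in zip(model.parameters(), base, d):
                p.copy_(b + a * di)
            vals.append(loss_fn(model).item())
        for p, b in zip(model.parameters(), base):
            p.copy_(b)
    return vals
```

Here loss_fn is assumed to close over a fixed data batch; plotting vals against alphas gives the 1-D landscape slice.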

554 citations


Proceedings Article
29 Apr 2018
TL;DR: This article proposes a meta-learning method for domain generalization, which simulates train/test domain shift during training by synthesizing virtual testing domains within each mini-batch, achieving state-of-the-art results on a recent cross-domain image classification benchmark and demonstrating its potential on two classic reinforcement learning tasks.
Abstract: Domain shift refers to the well-known problem that a model trained in one source domain performs poorly when applied to a target domain with different statistics. Domain Generalization (DG) techniques attempt to alleviate this issue by producing models which by design generalize well to novel testing domains. We propose a novel meta-learning method for domain generalization. Rather than designing a specific model that is robust to domain shift as in most previous DG work, we propose a model-agnostic training procedure for DG. Our algorithm simulates train/test domain shift during training by synthesizing virtual testing domains within each mini-batch. The meta-optimization objective requires that steps to improve training domain performance should also improve testing domain performance. This meta-learning procedure trains models with good generalization ability to novel domains. We evaluate our method and achieve state-of-the-art results on a recent cross-domain image classification benchmark, as well as demonstrating its potential on two classic reinforcement learning tasks.
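
The meta-optimization objective can be written as L_train(θ) + β·L_test(θ − α∇θ L_train(θ)). A condensed PyTorch sketch using torch.func (construction of the virtual source/target batches is omitted, and α, β are placeholders):

```python
import torch
from torch import nn
from torch.func import functional_call

def mldg_loss(model, params, train_batch, test_batch, alpha=0.01, beta=1.0):
    """MLDG-style objective: L_train(theta) + beta * L_test(theta - alpha * grad L_train)."""
    def loss_on(p, batch):
        x, y = batch
        return nn.functional.cross_entropy(functional_call(model, p, (x,)), y)

    l_train = loss_on(params, train_batch)
    grads = torch.autograd.grad(l_train, list(params.values()), create_graph=True)
    adapted = {k: v - alpha * g for (k, v), g in zip(params.items(), grads)}
    return l_train + beta * loss_on(adapted, test_batch)

# Toy usage with random "virtual train/test domain" batches
model = nn.Linear(10, 3)
params = dict(model.named_parameters())
train_b = (torch.randn(16, 10), torch.randint(0, 3, (16,)))
test_b = (torch.randn(16, 10), torch.randint(0, 3, (16,)))
mldg_loss(model, params, train_b, test_b).backward()
```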

524 citations


Proceedings Article
01 Apr 2018
TL;DR: In this paper, the authors study adversarially robust learning from the viewpoint of generalization and show that the sample complexity of robust learning can be significantly larger than that of "standard" learning.
Abstract: Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.

488 citations


Proceedings Article
06 Dec 2018
TL;DR: It is shown that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.
Abstract: In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.

402 citations


Journal ArticleDOI
01 Jan 2018
TL;DR: A comparative study on various reported data splitting methods found that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set, suggesting that it is necessary to have a good balance between the sizes of the training set and validation set to have a reliable estimation of model performance.
Abstract: Model validation is the most important part of building a supervised model. For building a model with good generalization performance one must have a sensible data splitting strategy, and this is crucial for model validation. In this study, we conducted a comparative study on various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of mis-classification and variable sample sizes. Then partial least squares for discriminant analysis and support vector machines for classification were applied to these datasets. Data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and sample set partitioning based on joint X–Y distances (SPXY). These methods were employed to split the data into training and validation sets. The estimated generalization performances from the validation sets were then compared with the ones obtained from the blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found that there was a significant gap between the performance estimated from the validation set and the one from the test set for all the data splitting methods employed on small datasets. Such disparity decreased when more samples were available for training/validation, and this is because the models were then moving towards approximations of the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that it is necessary to have a good balance between the sizes of the training set and validation set to have a reliable estimation of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of the model performance, most likely because they are designed to take the most representative samples first and thus leave a rather poorly representative sample set for model performance estimation.
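
For reference, the Kennard-Stone algorithm greedily selects maximally spread-out samples, which helps explain why the left-over validation set can be poorly representative; a compact NumPy sketch:

```python
import numpy as np

def kennard_stone(X, n_train):
    """Select n_train representative rows of X by the Kennard-Stone algorithm."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    i, j = np.unravel_index(np.argmax(d), d.shape)              # start with the two farthest points
    selected = [i, j]
    remaining = set(range(len(X))) - set(selected)
    while len(selected) < n_train:
        # pick the point farthest from its nearest already-selected neighbour
        rem = list(remaining)
        nearest = d[np.ix_(rem, selected)].min(axis=1)
        pick = rem[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
train_idx = kennard_stone(X, 20)  # the remaining rows form the validation set
```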

380 citations


Proceedings ArticleDOI
06 May 2018
TL;DR: A new NLI test set is created that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge, demonstrating that these systems are limited in their generalization ability.
Abstract: We create a new NLI test set that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge. The new examples are simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set is substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences.

Journal ArticleDOI
TL;DR: In this article, an ensemble of pyramidal one-dimensional convolutional neural network (P-1D-CNN) models is proposed to classify the ternary epilepsy case, e.g. ictal vs. normal vs. inter-ictal.
Abstract: Epilepsy is a life-threatening and challenging neurological disorder affecting a large number of people all over the world. For its detection, electroencephalography (EEG) is a commonly used clinical approach, but manual inspection of EEG brain signals is a time-consuming and laborious process, which puts a heavy burden on neurologists and affects their performance. Several automatic systems have been proposed using traditional approaches to assist neurologists, which perform well in detecting binary epilepsy scenarios, e.g. normal vs. ictal, but their performance degrades in classifying the ternary case, e.g. ictal vs. normal vs. inter-ictal. To overcome this problem, we propose a system that is an ensemble of pyramidal one-dimensional convolutional neural network (P-1D-CNN) models. Though a CNN model learns the internal structure of data and outperforms hand-engineered techniques, the main issue is the large number of learnable parameters, whose learning requires a huge volume of data. To overcome this issue, P-1D-CNN works on the concept of a refinement approach, and it involves 61% fewer parameters compared to standard CNN models and as such has better generalization. Further, to overcome the limitations of the small amount of data, we propose two augmentation schemes. We tested the system on the University of Bonn dataset, a benchmark dataset; in almost all the cases concerning epilepsy detection, it gives an accuracy of 99.1 ± 0.9% and outperforms the state-of-the-art systems. In addition, while enjoying the strength of a CNN model, the P-1D-CNN model requires 61% less memory space and its detection time is very short.
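
As a shape-level illustration of one-dimensional convolution over EEG windows (this toy skeleton is not the paper's P-1D-CNN and omits the pyramidal and refinement design):

```python
import torch
from torch import nn

class Tiny1DCNN(nn.Module):
    """Illustrative 1-D CNN for EEG classification (not the paper's P-1D-CNN)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):           # x: (batch, 1, n_samples)
        return self.head(self.features(x).squeeze(-1))

model = Tiny1DCNN()
eeg = torch.randn(4, 1, 4096)       # four single-channel EEG windows
print(model(eeg).shape)             # torch.Size([4, 3])
```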

Proceedings Article
27 Apr 2018
TL;DR: This paper proposes a pose grammar to tackle the problem of 3D human pose estimation, which takes 2D pose as input and learns a generalized 2D-3D mapping function and enforces high-level constraints over human poses.
Abstract: In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network which efficiently captures pose-aligned features and a hierarchy of Bi-directional RNNs (BRNNs) on top to explicitly incorporate a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves our model's generalizability. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol working in a cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty under such a setting while our method can handle such challenges well.

Posted Content
TL;DR: A novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks and a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks are presented.
Abstract: Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.

Posted Content
TL;DR: It is desirable for detection and classification algorithms to generalize to unfamiliar environments, but suitable benchmarks for quantitatively studying this phenomenon are not yet available, so a dataset designed to measure recognition generalization to novel environments is presented.
Abstract: It is desirable for detection and classification algorithms to generalize to unfamiliar environments, but suitable benchmarks for quantitatively studying this phenomenon are not yet available. We present a dataset designed to measure recognition generalization to novel environments. The images in our dataset are harvested from twenty camera traps deployed to monitor animal populations. Camera traps are fixed at one location, hence the background changes little across images; capture is triggered automatically, hence there is no human bias. The challenge is learning recognition in a handful of locations, and generalizing animal detection and classification to new locations where no training data is available. In our experiments state-of-the-art algorithms show excellent performance when tested at the same location where they were trained. However, we find that generalization to new locations is poor, especially for classification systems.

Proceedings Article
01 Jun 2018
TL;DR: In this article, a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training is presented.
Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Journal ArticleDOI
TL;DR: An essential singularity is found in the entanglement fidelity of GKP codes in the limit of vanishing loss rate, and a generalization of spin-coherent states is introduced, extending the characterization to qudit binomial codes and yielding a multiqudit code.
Abstract: The early Gottesman, Kitaev, and Preskill (GKP) proposal for encoding a qubit in an oscillator has recently been followed by cat- and binomial-code proposals. Numerically optimized codes have also been proposed, and we introduce codes of this type here. These codes have yet to be compared using the same error model; we provide such a comparison by determining the entanglement fidelity of all codes with respect to the bosonic pure-loss channel (i.e., photon loss) after the optimal recovery operation. We then compare achievable communication rates of the combined encoding-error-recovery channel by calculating the channel's hashing bound for each code. Cat and binomial codes perform similarly, with binomial codes outperforming cat codes at small loss rates. Despite not being designed to protect against the pure-loss channel, GKP codes significantly outperform all other codes for most values of the loss rate. We show that the performance of GKP and some binomial codes increases monotonically with increasing average photon number of the codes. In order to corroborate our numerical evidence of the cat-binomial-GKP order of performance occurring at small loss rates, we analytically evaluate the quantum error-correction conditions of those codes. For GKP codes, we find an essential singularity in the entanglement fidelity in the limit of vanishing loss rate. In addition to comparing the codes, we draw parallels between binomial codes and discrete-variable systems. First, we characterize one- and two-mode binomial as well as multiqubit permutation-invariant codes in terms of spin-coherent states. Such a characterization allows us to introduce check operators and error-correction procedures for binomial codes. Second, we introduce a generalization of spin-coherent states, extending our characterization to qudit binomial codes and yielding a multiqudit code.

Proceedings Article
15 Feb 2018
TL;DR: In this article, the authors investigate the tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations, and demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
Abstract: In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization, such as full-batch training or using random labels, correspond to lower robustness, while factors associated with good generalization, such as data augmentation and ReLU non-linearities, give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
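
The per-point sensitivity metric is straightforward to compute; a minimal sketch on a toy MLP (the paper evaluates image classifiers):

```python
import torch
from torch import nn

def jacobian_frobenius_norm(model, x):
    """Frobenius norm of the input-output Jacobian at a single input x of shape (d,)."""
    J = torch.autograd.functional.jacobian(model, x)  # shape: (out_dim, d)
    return J.norm()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
x = torch.randn(10)
print(jacobian_frobenius_norm(model, x))  # smaller norm ~ flatter, more robust function
```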

Proceedings Article
01 Jan 2018
TL;DR: NASBOT is developed, a Gaussian process based BO framework for neural architecture search, which outperforms other alternatives for architecture search in several cross-validation based model selection tasks on multi-layer perceptrons and convolutional neural networks.
Abstract: Bayesian Optimisation (BO) refers to a class of methods for global optimisation of a function f which is only accessible via point evaluations. It is typically used in settings where f is expensive to evaluate. A common use case for BO in machine learning is model selection, where it is not possible to analytically model the generalisation performance of a statistical model, and we resort to noisy and expensive training and validation procedures to choose the best model. Conventional BO methods have focused on Euclidean and categorical domains, which, in the context of model selection, only permits tuning scalar hyper-parameters of machine learning algorithms. However, with the surge of interest in deep learning, there is an increasing demand to tune neural network architectures. In this work, we develop NASBOT, a Gaussian process based BO framework for neural architecture search. To accomplish this, we develop a distance metric in the space of neural network architectures which can be computed efficiently via an optimal transport program. This distance might be of independent interest to the deep learning community as it may find applications outside of BO. We demonstrate that NASBOT outperforms other alternatives for architecture search in several cross validation based model selection tasks on multi-layer perceptrons and convolutional neural networks.

Proceedings Article
03 Jul 2018
TL;DR: The authors show that strong performance of overfitted classifiers is not unique to deep learning: kernel machines trained to zero training error also generalize well, suggesting that generalization is tied to properties of the kernel function rather than to the optimization process.
Abstract: Generalization performance of classifiers in deep learning has recently become a subject of intense study. Deep models, typically over-parametrized, tend to fit the training data exactly. Despite this "overfitting", they perform well on test data, a phenomenon not yet fully understood. The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using six real-world and two synthetic datasets, we establish experimentally that kernel machines trained to have zero classification or near zero regression error perform very well on test data, even when the labels are corrupted with a high level of noise. We proceed to give a lower bound on the norm of zero loss solutions for smooth kernels, showing that they increase nearly exponentially with data size. We point out that this is difficult to reconcile with the existing generalization bounds. Moreover, none of the bounds produce non-trivial results for interpolating solutions. Second, we show experimentally that (non-smooth) Laplacian kernels easily fit random labels, a finding that parallels results for ReLU neural networks. In contrast, fitting noisy data requires many more epochs for smooth Gaussian kernels. Similar performance of overfitted Laplacian and Gaussian classifiers on test, suggests that generalization is tied to the properties of the kernel function rather than the optimization process. Certain key phenomena of deep learning are manifested similarly in kernel methods in the modern "overfitted" regime. The combination of the experimental and theoretical results presented in this paper indicates a need for new theoretical ideas for understanding properties of classical kernel methods. We argue that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood.
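
The random-label experiment is easy to reproduce with (near-)interpolating kernel regression; a sketch with an assumed bandwidth:

```python
import numpy as np
from sklearn.metrics.pairwise import laplacian_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.choice([-1.0, 1.0], size=200)               # random labels: no signal to learn

K = laplacian_kernel(X, X, gamma=1.0)
alpha = np.linalg.solve(K + 1e-8 * np.eye(200), y)  # (near-)interpolating solution

train_pred = K @ alpha
print(np.mean(np.sign(train_pred) == y))            # ~1.0: the kernel machine fits the noise
```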

Journal ArticleDOI
TL;DR: It is found that using a simple empirical parametrization scheme is sufficient in order to obtain HDNNPs with high accuracy for the wACSFs employed here, and the intrinsic parameters of the descriptors can in principle be optimized with a genetic algorithm in a highly automated manner.
Abstract: We introduce weighted atom-centered symmetry functions (wACSFs) as descriptors of a chemical system’s geometry for use in the prediction of chemical properties such as enthalpies or potential energies via machine learning. The wACSFs are based on conventional atom-centered symmetry functions (ACSFs) but overcome the undesirable scaling of the latter with an increasing number of different elements in a chemical system. The performance of these two descriptors is compared using them as inputs in high-dimensional neural network potentials (HDNNPs), employing the molecular structures and associated enthalpies of the 133 855 molecules containing up to five different elements reported in the QM9 database as reference data. A substantially smaller number of wACSFs than ACSFs is needed to obtain a comparable spatial resolution of the molecular structures. At the same time, this smaller set of wACSFs leads to a significantly better generalization performance in the machine learning potential than the large set of conventional ACSFs. Furthermore, we show that the intrinsic parameters of the descriptors can in principle be optimized with a genetic algorithm in a highly automated manner. For the wACSFs employed here, we find however that using a simple empirical parametrization scheme is sufficient in order to obtain HDNNPs with high accuracy.
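
A radial wACSF has the form G_i = Σ_j g(Z_j) exp(−η(r_ij − r_s)²) f_c(r_ij), with the smooth cutoff f_c(r) = ½(cos(πr/r_c) + 1) for r < r_c. The sketch below assumes the atomic number as the element weight g(Z_j) and illustrative hyper-parameters:

```python
import numpy as np

def wacsf_radial(Z, R, i, eta=0.5, r_s=0.0, r_c=6.0):
    """Weighted radial symmetry function for atom i: sum_j Z_j * exp(-eta (r_ij - r_s)^2) * f_c(r_ij)."""
    G = 0.0
    for j in range(len(Z)):
        if j == i:
            continue
        r = np.linalg.norm(R[j] - R[i])
        if r < r_c:
            fc = 0.5 * (np.cos(np.pi * r / r_c) + 1.0)  # smooth cutoff
            G += Z[j] * np.exp(-eta * (r - r_s) ** 2) * fc
    return G

# Toy water molecule: O at the origin, two H atoms (coordinates in angstroms)
Z = np.array([8, 1, 1])
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
print(wacsf_radial(Z, R, i=0))
```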

Posted Content
TL;DR: The key finding is that "vanilla" deep RL algorithms generalize better than specialized schemes that were proposed specifically to tackle generalization.
Abstract: Deep reinforcement learning (RL) has achieved breakthrough results on many tasks, but agents often fail to generalize beyond the environment they were trained in. As a result, deep RL algorithms that promote generalization are receiving increasing attention. However, works in this area use a wide variety of tasks and experimental setups for evaluation. The literature lacks a controlled assessment of the merits of different generalization schemes. Our aim is to catalyze community-wide progress on generalization in deep RL. To this end, we present a benchmark and experimental protocol, and conduct a systematic empirical study. Our framework contains a diverse set of environments, our methodology covers both in-distribution and out-of-distribution generalization, and our evaluation includes deep RL algorithms that specifically tackle generalization. Our key finding is that "vanilla" deep RL algorithms generalize better than specialized schemes that were proposed specifically to tackle generalization.

Journal ArticleDOI
TL;DR: This work steps through two examples to illustrate concepts and address common concerns in using the Super Learner, an ensemble method that allows researchers to combine several different prediction algorithms into one.
Abstract: Stacked generalization is an ensemble method that allows researchers to combine several different prediction algorithms into one. Since its introduction in the early 1990s, the method has evolved several times into a host of methods among which is the "Super Learner". Super Learner uses V-fold cross-validation to build the optimal weighted combination of predictions from a library of candidate algorithms. Optimality is defined by a user-specified objective function, such as minimizing mean squared error or maximizing the area under the receiver operating characteristic curve. Although relatively simple in nature, use of Super Learner by epidemiologists has been hampered by limitations in understanding conceptual and technical details. We work step-by-step through two examples to illustrate concepts and address common concerns.
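
A compact sketch of the recipe for regression, with a three-algorithm library and non-negative least squares as the meta-learner (the library, objective, and normalization are illustrative choices):

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
learners = [LinearRegression(), KNeighborsRegressor(), DecisionTreeRegressor(random_state=0)]

# V-fold cross-validated predictions for each candidate algorithm
Z = np.zeros((len(y), len(learners)))
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for j, est in enumerate(learners):
        Z[test, j] = est.fit(X[train], y[train]).predict(X[test])

# Non-negative weights minimizing squared error of the combined prediction
w, _ = nnls(Z, y)
w = w / w.sum()                       # normalize to a convex combination
print(dict(zip(["lm", "knn", "tree"], np.round(w, 3))))

# Refit the library on all data; the super learner is the weighted ensemble
preds = sum(wj * est.fit(X, y).predict(X) for wj, est in zip(w, learners))
```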

Journal ArticleDOI
TL;DR: Modelling how humans search for rewards under limited search horizons finds evidence that Gaussian process function learning—combined with an optimistic upper confidence bound sampling strategy—provides a robust account of how people use generalization to guide search.
Abstract: From foraging for food to learning complex games, many aspects of human behaviour can be framed as a search problem with a vast space of possible actions. Under finite search horizons, optimal solutions are generally unobtainable. Yet, how do humans navigate vast problem spaces, which require intelligent exploration of unobserved actions? Using various bandit tasks with up to 121 arms, we study how humans search for rewards under limited search horizons, in which the spatial correlation of rewards (in both generated and natural environments) provides traction for generalization. Across various different probabilistic and heuristic models, we find evidence that Gaussian process function learning—combined with an optimistic upper confidence bound sampling strategy—provides a robust account of how people use generalization to guide search. Our modelling results and parameter estimates are recoverable and can be used to simulate human-like performance, providing insights about human behaviour in complex environments.
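
The GP-UCB account is easy to simulate; the sketch below assumes an RBF kernel, a fixed exploration weight, and a sinusoidal reward landscape:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
arms = np.linspace(0, 10, 121)[:, None]            # a 121-armed bandit on a line
true_reward = np.sin(arms).ravel()                 # spatially correlated rewards

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
X_obs, y_obs = [], []
for t in range(15):                                # limited search horizon
    if X_obs:
        gp.fit(np.array(X_obs), np.array(y_obs))
        mu, sd = gp.predict(arms, return_std=True)
    else:
        mu, sd = np.zeros(len(arms)), np.ones(len(arms))
    pick = int(np.argmax(mu + 2.0 * sd))           # optimistic UCB choice (beta = 2)
    X_obs.append(arms[pick])
    y_obs.append(true_reward[pick] + rng.normal(0, 0.1))
```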

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, a dynamic-structured semantic propagation network (DSSPN) is proposed to propagate the learned knowledge of all ancestor neurons into each fine-grained child neuron for feature evolving.
Abstract: Semantic concept hierarchy is still under-explored for semantic segmentation due to the inefficiency and complicated optimization of incorporating structural inference into dense prediction. This lack of modeling semantic correlations also forces prior works to tune highly specified models for each task due to the label discrepancy across datasets, which severely limits the generalization capability of segmentation models for an open-set concept vocabulary and annotation utilization. In this paper, we propose a Dynamic-Structured Semantic Propagation Network (DSSPN) that builds a semantic neuron graph by explicitly incorporating the semantic concept hierarchy into network construction. Each neuron represents the instantiated module for recognizing a specific type of entity such as a super-class (e.g. food) or a specific concept (e.g. pizza). During training, DSSPN performs dynamic-structured neuron computation by only activating a sub-graph of neurons for each image in a principled way. A dense semantic-enhanced neural block is proposed to propagate the learned knowledge of all ancestor neurons into each fine-grained child neuron for feature evolving. Another merit of such a semantically explainable structure is the ability to learn a unified model concurrently on diverse datasets by selectively activating different neuron sub-graphs for each annotation at each step. Extensive experiments on four public semantic segmentation datasets (i.e. ADE20K, COCO-Stuff, Cityscapes and Mapillary) demonstrate the superiority of our DSSPN over state-of-the-art segmentation models. Moreover, we demonstrate that a universal segmentation model jointly trained on diverse datasets can surpass the performance of the common fine-tuning scheme for exploiting multiple domain knowledge.

Journal ArticleDOI
TL;DR: This paper develops a framework that enables a robot to learn both movement and stiffness features from a human tutor, so that both movement generalization and stiffness generalization can be realized efficiently.
Abstract: One promising approach for robots efficiently learning skills is to learn manipulation skills from human tutors by demonstration and then generalize these learned skills to complete new tasks. Traditional learning and generalization methods, however, have not well considered human impedance features, which makes the skills less humanlike and restricted in physical human–robot interaction scenarios. In particular, stiffness generalization has not been well considered. This paper develops a framework that enables the robot to learn both movement and stiffness features from the human tutor. To this end, the upper limb muscle activities of the human tutor are monitored to extract variable stiffness in real time, and the estimated human arm endpoint stiffness is properly mapped into the robot impedance controller. Then, a dynamic movement primitives model is extended and employed to simultaneously encode the movement trajectories and the stiffness profiles. In this way, both position trajectory and stiffness profile are considered for robot motion control in this paper to realize a more complete skill transfer process. More importantly, stiffness generalization and movement generalization can be efficiently realized by the proposed framework. Experimental tests have been performed on a dual-arm Baxter robot to verify the effectiveness of the proposed method.
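
For context, a standard one-dimensional dynamic movement primitive integrates a spring-damper system driven by a learned forcing term; in the paper's extension, the stiffness profile would be encoded by a second primitive of the same form. A minimal rollout sketch with conventional gains (all names and defaults here are illustrative, not the authors' implementation):

```python
import numpy as np

def dmp_rollout(w, y0, g, tau=1.0, dt=0.01, a_z=25.0, b_z=6.25, a_x=1.0):
    """Integrate a 1-D dynamic movement primitive with RBF-weighted forcing term w."""
    centers = np.exp(-a_x * np.linspace(0, 1, len(w)))     # basis centers along the phase
    widths = 1.0 / (np.diff(centers, append=centers[-1] * 0.5) ** 2 + 1e-8)
    x, y, z, traj = 1.0, y0, 0.0, []
    for _ in range(int(1.0 / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        f = x * (g - y0) * psi @ w / (psi.sum() + 1e-10)   # forcing term
        z += dt / tau * (a_z * (b_z * (g - y) - z) + f)    # transformation system
        y += dt / tau * z
        x += dt / tau * (-a_x * x)                          # canonical (phase) system
        traj.append(y)
    return np.array(traj)

traj = dmp_rollout(w=np.zeros(10), y0=0.0, g=1.0)  # zero forcing: smooth reach to the goal
print(traj[-1])                                     # ~1.0
```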
