
Showing papers on "Generalization" published in 2020


Posted Content
TL;DR: A new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) is proposed that improves the BERT and RoBERTa models using two novel techniques that significantly improve the efficiency of model pre-training and performance of downstream tasks.
Abstract: Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
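
To make the disentangled attention mechanism concrete, here is a minimal single-head sketch in numpy. It combines the content-to-content, content-to-position, and position-to-content terms the abstract describes, using a clipped relative-distance embedding table; all names, shapes, and the single-head simplification are illustrative assumptions, not DeBERTa's actual implementation.

```python
import numpy as np

def disentangled_attention_scores(H, rel_emb, Wq, Wk, Wqr, Wkr, k):
    """Single-head sketch of DeBERTa-style disentangled attention.

    H       : (n, d) content vectors, one per token
    rel_emb : (2k, d) relative-position embeddings for clipped distances [-k, k)
    W*      : (d, d) projection matrices (illustrative shapes)
    Returns an (n, n) matrix of unnormalized attention scores.
    """
    n, d = H.shape
    Qc, Kc = H @ Wq, H @ Wk                  # content projections
    Qr, Kr = rel_emb @ Wqr, rel_emb @ Wkr    # relative-position projections

    # delta(i, j): relative distance, clipped and shifted to index rel_emb
    idx = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :], -k, k - 1) + k

    c2c = Qc @ Kc.T                                      # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, idx, axis=1)     # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, idx, axis=1).T   # position-to-content
    return (c2c + c2p + p2c) / np.sqrt(3 * d)            # scaled sum of terms
```

A softmax over each row would then give the attention weights; the real model adds multiple heads and the enhanced mask decoder for absolute positions.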

921 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed model demonstrates better generalization ability than the existing image fusion models for fusing various types of images, such as multi-focus, infrared-visual, multi-modal medical and multi-exposure images.

524 citations


Posted Content
TL;DR: This paper implements DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria, and finds that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets.
Abstract: The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions -- datasets, architectures, and model selection criteria -- render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DomainBed and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DomainBed, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization.
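
The headline finding, that plain empirical risk minimization is a top performer, is simple to state in code. The sketch below pools all source domains and minimizes the average loss, with a logistic-regression stand-in for the network; the `domains` structure, model, and hyperparameters are illustrative assumptions, not DomainBed's API.

```python
import numpy as np

def erm_pooled(domains, epochs=100, lr=0.1):
    """ERM baseline: pool every source domain and minimize average loss.

    domains: list of (X, y) arrays, one pair per source domain, y in {0, 1}.
    A linear logistic model stands in for the real network.
    """
    X = np.vstack([Xd for Xd, _ in domains])
    y = np.concatenate([yd for _, yd in domains])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the logistic loss
    return w
```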

492 citations


Posted Content
TL;DR: A comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority.
Abstract: Deep learning has achieved remarkable success in numerous domains with help from large amounts of big data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) is becoming an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 46 state-of-the-art robust training methods, all of which are categorized into seven groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guideline for future studies.

474 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions that provides higher surface reconstruction accuracy than the state-of-the-art (OccNet), while requiring fewer than 1% of the network parameters.
Abstract: The goal of this project is to learn a 3D shape representation that enables accurate surface reconstruction, compact storage, efficient computation, consistency for similar shapes, generalization across diverse shape categories, and inference from depth camera observations. Towards this end, we introduce Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions. We provide networks that infer the space decomposition and local deep implicit functions from a 3D mesh or posed depth image. During experiments, we find that it provides 10.3 points higher surface reconstruction accuracy (F-Score) than the state-of-the-art (OccNet), while requiring fewer than 1% of the network parameters. Experiments on posed depth image completion and generalization to unseen classes show 15.8 and 17.8 point improvements over the state-of-the-art, while producing a structured 3D representation for each input with consistency across diverse shape collections.
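
The core decomposition is easy to sketch: the global implicit function is a weighted sum of local implicit functions, each living in a Gaussian region of support. Everything below (the element tuple, `local_fn`, isotropic Gaussians) is a simplified assumption; the paper uses scaled, oriented elements and learned deep decoders.

```python
import numpy as np

def ldif_value(x, elements):
    """Evaluate a toy LDIF at a 3D point x (shape (3,)).

    elements: list of (center, scale, local_fn), where local_fn maps a
    local coordinate to a scalar implicit value. The shape surface is a
    level set of the summed, Gaussian-weighted local functions.
    """
    total = 0.0
    for center, scale, local_fn in elements:
        w = np.exp(-np.sum((x - center) ** 2) / (2.0 * scale ** 2))
        total += w * local_fn(x - center)
    return total
```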

378 citations


Posted Content
TL;DR: In this article, the authors show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead.
Abstract: The key distinguishing property of a Bayesian approach is marginalization, rather than using a single setting of weights. Bayesian marginalization can particularly improve the accuracy and calibration of modern deep neural networks, which are typically underspecified by the data, and can represent many compelling but different solutions. We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead. We also investigate the prior over functions implied by a vague distribution over neural network weights, explaining the generalization properties of such models from a probabilistic perspective. From this perspective, we explain results that have been presented as mysterious and distinct to neural network generalization, such as the ability to fit images with random labels, and show that these results can be reproduced with Gaussian processes. We also show that Bayesian model averaging alleviates double descent, resulting in monotonic performance improvements with increased flexibility. Finally, we provide a Bayesian perspective on tempering for calibrating predictive distributions.
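The paper's central claim can be read as a statement about the Bayesian model average p(y|x, D) = ∫ p(y|x, w) p(w|D) dw: a deep ensemble is a crude but effective Monte Carlo approximation of that integral. A minimal sketch, with `member_prob_fns` as a hypothetical list of trained networks:

```python
import numpy as np

def bayesian_model_average(member_prob_fns, x):
    """Approximate p(y | x, D) by averaging the predictive distributions of
    ensemble members, each treated as one sample from p(w | D)."""
    probs = np.stack([f(x) for f in member_prob_fns])  # (M, num_classes)
    return probs.mean(axis=0)
```

The approach proposed in the paper additionally marginalizes within each basin of attraction, rather than using a single weight setting per member.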

328 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work first notices CNNs' ability to capture the high-frequency components of images, which are almost imperceptible to a human; this observation leads to multiple hypotheses about the generalization behavior of CNNs, including a potential explanation for adversarial examples.
Abstract: We investigate the relationship between the frequency spectrum of image data and the generalization behavior of convolutional neural networks (CNN). We first notice CNN's ability in capturing the high-frequency components of images. These high-frequency components are almost imperceptible to a human. Thus the observation leads to multiple hypotheses that are related to the generalization behaviors of CNN, including a potential explanation for adversarial examples, a discussion of CNN's trade-off between robustness and accuracy, and some evidence in understanding training heuristics.
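The experiments hinge on separating an image into low- and high-frequency components. A minimal sketch using a hard radial mask in Fourier space (the mask shape and radius are arbitrary assumptions, not the paper's exact protocol):

```python
import numpy as np

def split_frequencies(img, radius):
    """Split a grayscale image into low- and high-frequency components
    with a centered FFT and a circular low-pass mask."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = img - low   # residual holds the barely perceptible high frequencies
    return low, high
```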

321 citations


Journal ArticleDOI
Rene Ranftl1, Katrin Lasinger2, David Hafner1, Konrad Schindler2, Vladlen Koltun1 
TL;DR: The authors propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks.
Abstract: The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with six diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach we use zero-shot cross-dataset transfer, i.e. we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
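The key to mixing incompatible datasets is an objective invariant to each source's depth range and scale. A sketch of the idea: align the prediction to the ground truth with a closed-form least-squares scale and shift before measuring error. The actual paper works in disparity space and uses robust, trimmed variants, so treat this as an illustration of the invariance only.

```python
import numpy as np

def scale_shift_invariant_error(pred, target):
    """Solve min_{s,t} ||s * pred + t - target||^2 in closed form,
    then report the residual mean squared error."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return np.mean((s * pred + t - target) ** 2)
```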

300 citations


Posted Content
TL;DR: In this article, representation self-challenging (RSC) is proposed to improve cross-domain generalization of CNNs by iteratively disabling the dominant features on the training data and forcing the network to activate remaining features that correlate with labels.
Abstract: Convolutional Neural Networks (CNN) conduct image classification by activating dominant features that correlate with labels. When the training and testing data are under similar distributions, their dominant features are similar, which usually facilitates decent performance on the testing data. Performance nonetheless degrades when the model is tested on samples from different distributions, leading to the challenges in cross-domain image classification. We introduce a simple training heuristic, Representation Self-Challenging (RSC), that significantly improves the generalization of CNN to out-of-domain data. RSC iteratively challenges (discards) the dominant features activated on the training data and forces the network to activate the remaining features that correlate with labels. This process appears to activate feature representations applicable to out-of-domain data without prior knowledge of the new domain and without learning extra network parameters. We present theoretical properties and conditions of RSC for improving cross-domain generalization. The experiments endorse the simple, effective, and architecture-agnostic nature of our RSC method.
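
A minimal sketch of the self-challenging step, assuming we already have the feature activations and the gradient of the true-class score with respect to them; the percentile and element-wise muting are simplifications of the paper's spatial and channel-wise variants.

```python
import numpy as np

def self_challenge(features, grads, drop_pct=33.0):
    """Mute the most dominant features: those whose gradient (sensitivity of
    the true-class score) is largest. The network must then re-route its
    prediction through the remaining, less dominant features."""
    thresh = np.percentile(grads, 100.0 - drop_pct)
    mask = (grads < thresh).astype(features.dtype)
    return features * mask
```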

272 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This paper proposes a novel DG approach, Deep Domain-Adversarial Image Generation (DDAIG), which augments the source training data with generated unseen-domain data to make the label classifier more robust to unknown domain changes.
Abstract: Machine learning models typically suffer from the domain shift problem when trained on a source dataset and evaluated on a target dataset of different distribution. To overcome this problem, domain generalisation (DG) methods aim to leverage data from multiple source domains so that a trained model can generalise to unseen domains. In this paper, we propose a novel DG approach based on Deep Domain-Adversarial Image Generation (DDAIG). Specifically, DDAIG consists of three components, namely a label classifier, a domain classifier and a domain transformation network (DoTNet). The goal for DoTNet is to map the source training data to unseen domains. This is achieved by having a learning objective formulated to ensure that the generated data can be correctly classified by the label classifier while fooling the domain classifier. By augmenting the source training data with the generated unseen domain data, we can make the label classifier more robust to unknown domain changes. Extensive experiments on four DG datasets demonstrate the effectiveness of our approach.
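The learning objective for DoTNet described above fits in one line: keep the label classifier right while making the domain classifier wrong. The callables and trade-off weight below are hypothetical interfaces, not the paper's code.

```python
def dotnet_objective(x_generated, y_label, y_domain, label_loss, domain_loss, lam=0.5):
    """DoTNet is trained to minimize this: low label-classification loss on the
    generated image, high domain-classification loss (hence the minus sign).
    label_loss / domain_loss are assumed to return per-sample cross-entropy."""
    return label_loss(x_generated, y_label) - lam * domain_loss(x_generated, y_domain)
```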

224 citations


Journal ArticleDOI
TL;DR: It has been confirmed by experimental results that DEL produces dynamic NN ensembles of appropriate architecture and diversity that demonstrate good generalization ability.
Abstract: This paper presents a novel dynamic ensemble learning (DEL) algorithm for designing ensemble of neural networks (NNs). DEL algorithm determines the size of ensemble, the number of individual NNs employing a constructive strategy, the number of hidden nodes of individual NNs employing a constructive–pruning strategy, and different training samples for individual NN’s learning. For diversity, negative correlation learning has been introduced and also variation of training samples has been made for individual NNs that provide better learning from the whole training samples. The major benefits of the proposed DEL compared to existing ensemble algorithms are (1) automatic design of ensemble; (2) maintaining accuracy and diversity of NNs at the same time; and (3) minimum number of parameters to be defined by user. DEL algorithm is applied to a set of real-world classification problems such as the cancer, diabetes, heart disease, thyroid, credit card, glass, gene, horse, letter recognition, mushroom, and soybean datasets. It has been confirmed by experimental results that DEL produces dynamic NN ensembles of appropriate architecture and diversity that demonstrate good generalization ability.
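The diversity mechanism named in the abstract, negative correlation learning, adds a penalty that discourages each member from agreeing with the ensemble mean. A sketch of the per-member loss; the λ weight is a conventional choice, not necessarily DEL's exact setting.

```python
import numpy as np

def ncl_member_losses(preds, y, lam=0.5):
    """Negative correlation learning: each member pays its own squared error
    plus a penalty that is lower the further it sits from the ensemble mean.

    preds: (M, N) predictions of M members on N samples; y: (N,) targets.
    """
    fbar = preds.mean(axis=0)                        # ensemble mean prediction
    mse = ((preds - y) ** 2).mean(axis=1)            # per-member accuracy term
    diversity = -((preds - fbar) ** 2).mean(axis=1)  # NCL penalty (negative)
    return mse + lam * diversity
```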

Journal ArticleDOI
03 Apr 2020
TL;DR: In this article, the authors propose a method that iteratively divides samples into latent domains via clustering, and then trains the domain-invariant feature extractor shared among the divided domains via adversarial learning.
Abstract: When domains, which represent underlying data distributions, vary during training and testing processes, deep neural networks suffer a drop in their performance. Domain generalization allows improvements in the generalization performance for unseen target domains by using multiple source domains. Conventional methods assume that the domain to which each sample belongs is known in training. However, many datasets, such as those collected via web crawling, contain a mixture of multiple latent domains, in which the domain of each sample is unknown. This paper introduces domain generalization using a mixture of multiple latent domains as a novel and more realistic scenario, where we try to train a domain-generalized model without using domain labels. To address this scenario, we propose a method that iteratively divides samples into latent domains via clustering, and which trains the domain-invariant feature extractor shared among the divided latent domains via adversarial learning. We assume that the latent domain of images is reflected in their style, and thus, utilize style features for clustering. By using these features, our proposed method successfully discovers latent domains and achieves domain generalization even if the domain labels are not given. Experiments show that our proposed method can train a domain-generalized model without using domain labels. Moreover, it outperforms conventional domain generalization methods, including those that utilize domain labels.
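The two-step recipe, cluster style features and then train adversarially against the pseudo-domain labels, can be sketched as follows for the clustering half. Style here is summarized by per-channel mean and standard deviation of convolutional feature maps, matching the abstract's assumption that latent domain is reflected in style; the feature choice and cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_domain_labels(conv_feats, n_domains=3):
    """Assign each sample a latent-domain label by clustering style statistics.

    conv_feats: (N, C, H, W) feature maps from an early convolutional layer.
    """
    mu = conv_feats.mean(axis=(2, 3))            # (N, C) channel means
    sd = conv_feats.std(axis=(2, 3))             # (N, C) channel stds
    style = np.concatenate([mu, sd], axis=1)     # style descriptor per sample
    return KMeans(n_clusters=n_domains, n_init=10).fit_predict(style)
```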

Proceedings Article
30 Apr 2020
TL;DR: This work presents a unifying view and proposes an open-set method to relax current generalization assumptions, and extends the applicability of transformation-based methods to non-image data using random affine transformations.
Abstract: Anomaly detection, finding patterns that substantially deviate from those seen previously, is one of the fundamental problems of artificial intelligence. Recently, classification-based methods were shown to achieve superior results on this task. In this work, we present a unifying view and propose an open-set method to relax current generalization assumptions. Furthermore, we extend the applicability of transformation-based methods to non-image data using random affine transformations. Our method is shown to obtain state-of-the-art accuracy and is applicable to broad data types. The strong performance of our method is extensively validated on multiple datasets from different domains.

Proceedings ArticleDOI
23 Jun 2020
TL;DR: Empirical evidence shows that the proposed causal speech enhancement model, based on an encoder-decoder architecture with skip-connections, is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb.
Abstract: We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non causal methods while working directly on the raw waveform.
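A toy version of the encoder-decoder-with-skip-connections family the abstract describes, operating directly on raw waveforms; the layer sizes and depth are arbitrary, and the real model is substantially deeper and optimized with the time- and frequency-domain losses mentioned above.

```python
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    """Minimal waveform encoder-decoder with a skip connection."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(1, ch, 8, stride=4, padding=2), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(ch, ch * 2, 8, stride=4, padding=2), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose1d(ch * 2, ch, 8, stride=4, padding=2), nn.ReLU())
        self.dec1 = nn.ConvTranspose1d(ch, 1, 8, stride=4, padding=2)

    def forward(self, x):        # x: (batch, 1, time), time divisible by 16
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2) + e1  # skip connection from encoder to decoder
        return self.dec1(d2)     # estimated clean waveform
```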

Posted Content
TL;DR: This paper proposes GraphLIME, a local interpretable model explanation for graphs using the Hilbert-Schmidt Independence Criterion (HSIC) Lasso, which is a nonlinear feature selection method.
Abstract: Graph structured data has wide applicability in various domains such as physics, chemistry, biology, computer vision, and social networks, to name a few. Recently, graph neural networks (GNN) were shown to be successful in effectively representing graph structured data because of their good performance and generalization ability. GNN is a deep learning based method that learns a node representation by combining specific nodes and the structural/topological information of a graph. However, like other deep models, explaining the effectiveness of GNN models is a challenging task because of the complex nonlinear transformations made over the iterations. In this paper, we propose GraphLIME, a local interpretable model explanation for graphs using the Hilbert-Schmidt Independence Criterion (HSIC) Lasso, which is a nonlinear feature selection method. GraphLIME is a generic GNN-model explanation framework that learns a nonlinear interpretable model locally in the subgraph of the node being explained. More specifically, to explain a node, we generate a nonlinear interpretable model from its N-hop neighborhood and then compute the K most representative features as the explanations of its prediction using HSIC Lasso. Through experiments on two real-world datasets, the explanations of GraphLIME are found to be substantially more descriptive than those of existing explanation methods.
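
The pipeline is: take the node's N-hop neighborhood, fit a sparse interpretable model of the GNN's outputs there, and report the top-K features. The sketch below substitutes plain Lasso for HSIC Lasso (the paper's nonlinear selector), so it is an illustrative simplification only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def explain_node(neigh_feats, neigh_preds, K=5, alpha=0.01):
    """Return the K most influential feature indices for one node.

    neigh_feats: (n_neighbors, n_features) features of the N-hop neighborhood.
    neigh_preds: (n_neighbors,) the GNN's predicted probability for one class.
    """
    model = Lasso(alpha=alpha).fit(neigh_feats, neigh_preds)
    return np.argsort(-np.abs(model.coef_))[:K]
```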

Journal ArticleDOI
13 May 2020
TL;DR: In this paper, a sharp double inequality involving the ratio of generalized complete elliptic integrals of the first kind was established, which is the improvement and generalization of some previously known results.
Abstract: In the article, we establish a sharp double inequality involving the ratio of generalized complete elliptic integrals of the first kind, which is the improvement and generalization of some previously known results.

Proceedings Article
30 Apr 2020
TL;DR: A novel method to systematically construct compositional generalization benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets is introduced, and it is demonstrated how this method can be used to create new compositionality benchmarks on top of the existing SCAN dataset.
Abstract: State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.
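Both atom and compound divergence in this method are instances of one formula: 1 minus the Chernoff coefficient between the normalized frequency distributions in the train and test sets. To the best of my reading, the paper uses α = 0.5 for atoms and α = 0.1 for compounds, so treat those constants as assumptions.

```python
import numpy as np

def chernoff_divergence(p, q, alpha):
    """1 - Chernoff coefficient between two (unnormalized) frequency vectors.

    p, q: atom or compound frequencies in the train and test sets.
    alpha: assumed 0.5 for atom divergence, 0.1 for compound divergence.
    """
    p = np.asarray(p, float); q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - np.sum(p ** alpha * q ** (1 - alpha))
```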

Posted Content
TL;DR: This work theoretically shows that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations, and provides practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed.
Abstract: Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.

Proceedings Article
Vitaly Feldman1, Chiyuan Zhang1
09 Aug 2020
TL;DR: The experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks and provide quantitative and visually compelling evidence for the theory put forth in Feldman (2019), which proposes a theoretical explanation for this phenomenon.
Abstract: Deep learning algorithms are well-known to have a propensity for fitting the training data very well and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation or even an approach for obtaining such evidence were given. In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in (Feldman, 2019).
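The efficient estimator mentioned in the abstract can be sketched as follows: train many models on random subsets, and score each training example by the gap between its accuracy when included versus held out. `train_fn` is a hypothetical interface returning a prediction function, and the subset fraction and model count are illustrative.

```python
import numpy as np

def memorization_estimates(train_fn, X, y, n_models=40, frac=0.7, seed=0):
    """Subsampled memorization scores: accuracy(in training set) minus
    accuracy(held out), averaged over models trained on random subsets."""
    rng = np.random.default_rng(seed)
    n = len(X)
    acc_in = np.zeros(n); cnt_in = np.zeros(n)
    acc_out = np.zeros(n); cnt_out = np.zeros(n)
    for _ in range(n_models):
        idx = rng.random(n) < frac              # random training subset
        predict = train_fn(X[idx], y[idx])
        correct = (predict(X) == y).astype(float)
        acc_in += np.where(idx, correct, 0.0);  cnt_in += idx
        acc_out += np.where(~idx, correct, 0.0); cnt_out += ~idx
    return acc_in / np.maximum(cnt_in, 1) - acc_out / np.maximum(cnt_out, 1)
```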

Posted Content
TL;DR: This work considers a larger list of inductive biases that humans and animals exploit, focusing on those which concern mostly higher-level and sequential conscious processing, and suggests they could potentially help build AI systems benefiting from humans' abilities in terms of flexible out-of-distribution and systematic generalization.
Abstract: A fascinating hypothesis is that human and animal intelligence could be explained by a few principles (rather than an encyclopedic list of heuristics). If that hypothesis was correct, we could more easily both understand our own intelligence and build intelligent machines. Just like in physics, the principles themselves would not be sufficient to predict the behavior of complex systems like brains, and substantial computation might be needed to simulate human-like intelligence. This hypothesis would suggest that studying the kind of inductive biases that humans and animals exploit could help both clarify these principles and provide inspiration for AI research and neuroscience theories. Deep learning already exploits several key inductive biases, and this work considers a larger list, focusing on those which concern mostly higher-level and sequential conscious processing. The objective of clarifying these particular principles is that they could potentially help us build AI systems benefiting from humans' abilities in terms of flexible out-of-distribution and systematic generalization, which is currently an area where a large gap exists between state-of-the-art machine learning and human intelligence.

Journal ArticleDOI
TL;DR: A general panorama of the state of the art of Choquet integral generalizations is offered, showing the relations and intersections among the five classes of generalizations considered.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A novel system that explicitly disentangles scale from the network estimation, which achieves state-of-the-art results among self-supervised learning-based methods on KITTI Odometry and NYUv2 dataset and presents some interesting findings on the limitation of PoseNet-based relative pose estimation methods in terms of generalization ability.
Abstract: In this work, we tackle the essential problem of scale inconsistency for self supervised joint depth-pose learning. Most existing methods assume that a consistent scale of depth and pose can be learned across all input samples, which makes the learning problem harder, resulting in degraded performance and limited generalization in indoor environments and long-sequence visual odometry applications. To address this issue, we propose a novel system that explicitly disentangles scale from the network estimation. Instead of relying on PoseNet architecture, our method recovers relative pose by directly solving fundamental matrix from dense optical flow correspondence and makes use of a two-view triangulation module to recover an up-to-scale 3D structure. Then, we align the scale of the depth prediction with the triangulated point cloud and use the transformed depth map for depth error computation and dense reprojection check. Our whole system can be jointly trained end-to-end. Extensive experiments show that our system not only reaches state-of-the-art performance on KITTI depth and flow estimation, but also significantly improves the generalization ability of existing self-supervised depth-pose learning methods under a variety of challenging scenarios, and achieves state-of-the-art results among self-supervised learning-based methods on the KITTI Odometry and NYUv2 datasets. Furthermore, we present some interesting findings on the limitation of PoseNet-based relative pose estimation methods in terms of generalization ability. Code is available at https://github.com/B1ueber2y/TrianFlow.
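
The scale-alignment step, matching the predicted depth to the triangulated structure before computing losses, is simple to sketch. The median ratio below is one robust choice and may differ from the paper's exact alignment.

```python
import numpy as np

def align_depth_scale(pred_depth, tri_depth, valid):
    """Rescale the network's up-to-scale depth to match the (also up-to-scale)
    triangulated depth, so depth error and reprojection checks share a scale.

    pred_depth, tri_depth: (H, W) arrays; valid: (H, W) boolean mask of
    pixels with reliable triangulated depth.
    """
    s = np.median(tri_depth[valid] / pred_depth[valid])
    return s * pred_depth
```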

Proceedings Article
Yiding Jiang1, Behnam Neyshabur1, Dilip Krishnan1, Hossein Mobahi1, Samy Bengio1 
30 Apr 2020
TL;DR: In this article, a large scale study of generalization bounds and measures in deep networks is presented, where the authors train over two thousand CIFAR-10 networks with systematic changes in important hyper-parameters.
Abstract: Generalization of deep networks has been intensely researched in recent years, resulting in a number of theoretical bounds and empirically motivated measures. However, most papers proposing such measures only study a small set of models, leaving open the question of whether these measures are truly useful in practice. We present the first large scale study of generalization bounds and measures in deep networks. We train over two thousand CIFAR-10 networks with systematic changes in important hyper-parameters. We attempt to uncover potential causal relationships between each measure and generalization, by using rank correlation coefficient and its modified forms. We analyze the results and show that some of the studied measures are very promising for further research.
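The study's core statistic is a rank correlation between each candidate complexity measure and the observed generalization across the trained models; a minimal version using Kendall's tau (the paper also uses modified forms of it):

```python
from scipy.stats import kendalltau

def measure_vs_generalization(measure_vals, gen_gaps):
    """Kendall rank correlation between a complexity measure and the
    generalization gap, computed across a population of trained models."""
    tau, _ = kendalltau(measure_vals, gen_gaps)
    return tau
```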

Journal ArticleDOI
01 Feb 2020
TL;DR: The objective of this paper is to develop some correlation coefficients for T-spherical fuzzy sets due to the non-applicability of correlations of intuitionistic fuzzy sets and picture fuzzy sets in some certain circumstances.
Abstract: The framework of T-spherical fuzzy set is a generalization of fuzzy set, intuitionistic fuzzy set and picture fuzzy set having a great potential of dealing with uncertain events with no limitation. A T-spherical fuzzy framework can deal with phenomena of more than yes or no type; for example, consider the scenario of voting where one's voting interest is not limited to "in favor" or "against"; rather, there could be some sort of abstinence or refusal degree as well. The objective of this paper is to develop some correlation coefficients for T-spherical fuzzy sets due to the non-applicability of correlations of intuitionistic fuzzy sets and picture fuzzy sets in some certain circumstances. The fitness of new correlation coefficients has been discussed, and their generalization is studied with the help of some results. Clustering and multi-attribute decision-making algorithms have been proposed in the environment of T-spherical fuzzy sets. To demonstrate the viability of proposed algorithms and correlation coefficients, two real-life problems including a clustering problem and a multi-attribute decision-making problem have been solved. A comparative study of the newly presented and pre-existing literature is carried out, showing the superiority of the proposed work over the existing theory. Some advantages of new correlation coefficients and drawbacks of the pre-existing work are demonstrated with the help of numerical examples.
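
For readers unfamiliar with the setting: a T-spherical fuzzy element assigns membership m, abstinence i, and non-membership n with m^t + i^t + n^t <= 1. One standard way to build a correlation coefficient, shown below, normalizes an inner product of the degree vectors raised to the power t; the paper's exact coefficients may differ, so this is an illustrative construction only.

```python
import numpy as np

def tsf_correlation(A, B, t=2):
    """Illustrative correlation coefficient for two T-spherical fuzzy sets.

    A, B: arrays of shape (n, 3), each row holding (m, i, n) for one element.
    Returns a value in [0, 1], with 1 meaning perfectly correlated sets.
    """
    def energy(X, Y):
        return np.sum((X ** t) * (Y ** t))
    return energy(A, B) / np.sqrt(energy(A, A) * energy(B, B))
```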

Journal ArticleDOI
TL;DR: In this article, the authors show that the initialization causes finite-size random fluctuations of the neural net output function $f_N$ around its expectation, which affect the generalization error for classification.
Abstract: Supervised deep learning involves the training of neural networks with a large number N of parameters. For large enough N, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as N grows past a certain threshold $N^*$. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with N. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function $f_N$ around its expectation $\bar{f}_N$. These affect the generalization error $\epsilon_N$ for classification: under natural assumptions, it decays to a plateau value $\epsilon_\infty$ in a power-law fashion $\sim N^{-1/2}$. This description breaks down at a so-called jamming transition $N = N^*$. At this threshold, we argue that $\|f_N\|$ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at $N^*$. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond $N^*$, and averaging their outputs.

Book ChapterDOI
23 Aug 2020
TL;DR: A new domain generalization framework that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains is presented.
Abstract: The generalization capability of neural networks across domains is crucial for real-world applications. We argue that a generalized object recognition system should well understand the relationships among different images and also the images themselves at the same time. To this end, we present a new domain generalization framework (called EISNet) that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains. To be specific, we formulate our framework with feature embedding using a multi-task learning paradigm. Besides conducting the common supervised recognition task, we seamlessly integrate a momentum metric learning task and a self-supervised auxiliary task to collectively integrate the extrinsic and intrinsic supervisions. Also, we develop an effective momentum metric learning scheme with the K-hard negative mining to boost the network generalization ability. We demonstrate the effectiveness of our approach on two standard object recognition benchmarks VLCS and PACS, and show that our EISNet achieves state-of-the-art performance.

Journal ArticleDOI
TL;DR: In this paper, a fairly comprehensive analysis is presented of the gradient descent dynamics for training two-layer neural network models when the parameters in both layers are updated, and sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space.
Abstract: A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to those of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error are established for target functions in the appropriate reproducing kernel Hilbert space.

Journal ArticleDOI
TL;DR: In this paper, a Prior Guided Feature Enrichment Network (PFENet) is proposed to solve the problem of generalization ability reduction on unseen classes due to inappropriate use of high-level semantic information of training classes and spatial inconsistency between query and support targets.
Abstract: State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results and hardly work on unseen classes without fine-tuning. Few-shot segmentation is thus proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. These frameworks still face the challenge of generalization ability reduction on unseen classes due to inappropriate use of high-level semantic information of training classes and spatial inconsistency between query and support targets. To alleviate these issues, we propose the Prior Guided Feature Enrichment Network (PFENet). It consists of novel designs of (1) a training-free prior mask generation method that not only retains generalization power but also improves model performance and (2) Feature Enrichment Module (FEM) that overcomes spatial inconsistency by adaptively enriching query features with support features and prior masks. Extensive experiments on PASCAL-5i and COCO prove that the proposed prior generation method and FEM both improve the baseline method significantly. Our PFENet also outperforms state-of-the-art methods by a large margin without efficiency loss. It is surprising that our model even generalizes to cases without labeled support samples.
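
The training-free prior mask in (1) can be sketched directly: for every query location, the prior is the best cosine similarity to any foreground support location, min-max normalized. The shapes and masking scheme below are simplifications of the paper's exact procedure.

```python
import numpy as np

def prior_mask(query_feat, support_feat, support_mask):
    """Training-free prior: per-query-pixel max cosine similarity to the
    masked support features, min-max normalized to [0, 1].

    query_feat, support_feat: (C, H, W) high-level features;
    support_mask: (H, W) binary foreground mask for the support image.
    """
    C, H, W = query_feat.shape
    q = query_feat.reshape(C, -1)                       # (C, HW)
    s = (support_feat * support_mask).reshape(C, -1)    # masked support
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
    sim = (q.T @ s).max(axis=1)                         # best support match
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sim.reshape(H, W)
```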

Book ChapterDOI
23 Aug 2020
TL;DR: Domain-specific masks for generalization (DMG) as discussed by the authors learns a balance of domain-invariant and domain-specific features to improve both in-domain and out-of-domain generalization performance.
Abstract: We introduce Domain-specific Masks for Generalization, a model for improving both in-domain and out-of-domain generalization performance. For domain generalization, the goal is to learn from a set of source domains to produce a single model that will best generalize to an unseen target domain. As such, many prior approaches focus on learning representations which persist across all source domains with the assumption that these domain agnostic representations will generalize well. However, often individual domains contain characteristics which are unique and when leveraged can significantly aid in-domain recognition performance. To produce a model which best generalizes to both seen and unseen domains, we propose learning domain specific masks. The masks are encouraged to learn a balance of domain-invariant and domain-specific features, thus enabling a model which can benefit from the predictive power of specialized features while retaining the universal applicability of domain-invariant features. We demonstrate competitive performance compared to naive baselines and state-of-the-art methods on both PACS and DomainNet (Our code is available at https://github.com/prithv1/DMG).
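The central mechanism, one learnable mask per source domain gating shared features, reduces to a few lines; the hard domain assignment and sigmoid squashing here are illustrative choices rather than DMG's exact formulation.

```python
import numpy as np

def apply_domain_mask(features, mask_logits, domain_id):
    """Gate shared feature channels with the mask owned by one domain.

    features: (N, C) shared representations; mask_logits: (num_domains, C)
    learnable per-domain logits, squashed to (0, 1) gates.
    """
    gate = 1.0 / (1.0 + np.exp(-mask_logits[domain_id]))  # sigmoid gate
    return features * gate
```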

Journal ArticleDOI
TL;DR: An actual MADM application is given to validate this new model, and comparisons between this novel MABAC model and two q-ROFN aggregation operators are provided to further demonstrate the merits of the q-rung orthopair fuzzy MABAC model.