Posted Content

Invariant Risk Minimization

TL;DR: This work introduces Invariant Risk Minimization, a learning paradigm to estimate invariant correlations across multiple training distributions and shows how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
Abstract: We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
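In the linear least-squares case, the practical IRMv1 penalty (the squared gradient of each environment's risk with respect to a fixed scalar classifier w = 1.0 placed on top of the representation) has a closed form. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the function name and toy setup are my own.

```python
import numpy as np

def irmv1_objective(phi, envs, lam=1.0):
    """IRMv1-style objective for a linear predictor (illustrative sketch).

    phi  : weight vector of the linear representation/predictor, shape (d,)
    envs : list of (X, y) pairs, one per training environment
    lam  : penalty strength

    The penalty is the squared gradient of each environment's squared-error
    risk with respect to a scalar classifier w, evaluated at w = 1.0.
    """
    total_risk, total_penalty = 0.0, 0.0
    for X, y in envs:
        pred = X @ phi                       # predictions w * (X @ phi) at w = 1
        risk = np.mean((pred - y) ** 2)      # per-environment risk
        # d/dw mean((w * pred - y)^2) evaluated at w = 1:
        grad_w = 2.0 * np.mean(pred * (pred - y))
        total_risk += risk
        total_penalty += grad_w ** 2
    return total_risk + lam * total_penalty
```

For a representation that supports the same optimal classifier in every environment, each per-environment gradient term vanishes and only the risk remains.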
Citations
Posted Content
TL;DR: This paper implements DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria, and finds that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets.
Abstract: The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions -- datasets, architectures, and model selection criteria -- render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DomainBed and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DomainBed, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization.
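The empirical risk minimization baseline that DomainBed finds so strong simply ignores domain labels and trains on the union of all training domains. A minimal sketch of this pooling, assuming a linear least-squares predictor (the helper name is hypothetical):

```python
import numpy as np

def erm_pooled(envs):
    """ERM baseline (sketch): pool examples from all training domains and
    fit a single least-squares predictor, ignoring domain labels.

    envs : list of (X, y) pairs, one per training domain
    """
    X = np.vstack([Xe for Xe, _ in envs])        # stack features from all domains
    y = np.concatenate([ye for _, ye in envs])   # stack targets from all domains
    # Ordinary least squares on the pooled data
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```

The point of DomainBed is that, under a fair model selection protocol, this domain-agnostic baseline is hard to beat.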

492 citations


Cites background or methods from "Invariant Risk Minimization"

  • ...Akuzawa et al. (2019) extend DANN by considering cases where there exists a statistical dependence between the domain and the class label variables. Albuquerque et al. (2019) extend DANN by considering one-versus-all adversaries that try to predict which training domain each example belongs to. Li et al. (2018b) employ GANs and the maximum mean discrepancy criterion (Gretton et al., 2012) to align feature distributions across domains. Matsuura and Harada (2019) leverage clustering techniques to learn domain-invariant features even when the separation between training domains is not given. Li et al. (2018c;d) learn a feature transformation φ such that the conditional distributions P(φ(X^d) | Y^d = y) match for all training domains d and label values y. Shankar et al. (2018) use a domain classifier to construct adversarial examples for a label classifier, and use a label classifier to construct adversarial examples for the domain classifier. This results in a label classifier with better domain generalization. Li et al. (2019a) train a robust feature extractor and classifier. The robustness comes from (i) asking the feature extractor to produce features such that a classifier trained on domain d can classify instances for domain d′ ≠ d, and (ii) asking the classifier to predict labels on domain d using features produced by a feature extractor trained on domain d′ ≠ d. Li et al. (2020) adopt a lifelong learning strategy to attack the problem of domain generalization. Motiian et al. (2017) learn a feature representation such that (i) examples from different domains but the same class are close, (ii) examples from different domains and classes are far, and (iii) training examples can be correctly classified. Ilse et al. (2019) train a variational autoencoder (Kingma and Welling, 2014) where the bottleneck representation factorizes knowledge about domain, class label, and residual variations in the input space. Fang et al. (2013) learn a structural SVM metric such that the neighborhood of each example contains examples from the same category and all training domains....

    [...]


Posted Content
TL;DR: This work introduces the principle of Risk Extrapolation (REx), and shows conceptually how this principle enables extrapolation, and demonstrates the effectiveness and scalability of instantiations of REx on various OoD generalization tasks.
Abstract: Distributional shift is one of the major obstacles when transferring machine learning prediction systems from the lab to the real world. To tackle this problem, we assume that variation across training domains is representative of the variation we might encounter at test time, but also that shifts at test time may be more extreme in magnitude. In particular, we show that reducing differences in risk across training domains can reduce a model's sensitivity to a wide range of extreme distributional shifts, including the challenging setting where the input contains both causal and anti-causal elements. We motivate this approach, Risk Extrapolation (REx), as a form of robust optimization over a perturbation set of extrapolated domains (MM-REx), and propose a penalty on the variance of training risks (V-REx) as a simpler variant. We prove that variants of REx can recover the causal mechanisms of the targets, while also providing some robustness to changes in the input distribution ("covariate shift"). By appropriately trading-off robustness to causally induced distributional shifts and covariate shift, REx is able to outperform alternative methods such as Invariant Risk Minimization in situations where these types of shift co-occur.
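The V-REx variant described in the abstract reduces to a one-line objective: the mean of the per-environment training risks plus a penalty on their variance, which pushes risks to be equal across environments. A minimal sketch (the function name and default β are my own):

```python
import numpy as np

def vrex_objective(risks, beta=10.0):
    """V-REx objective (sketch): average the per-environment risks and
    penalize their variance, encouraging equal risk across training
    environments.

    risks : sequence of scalar risks, one per training environment
    beta  : penalty strength
    """
    risks = np.asarray(risks, dtype=float)
    return risks.mean() + beta * risks.var()
```

When all environments already achieve the same risk, the penalty vanishes and the objective coincides with ordinary ERM on the averaged risks.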

400 citations


Cites background or methods from "Invariant Risk Minimization"

  • ...Arjovsky et al. (2019) propose an extension of that work, called Invariant Risk Minimization (IRM), with the goal of learning a data representation that does not rely on spurious correlations....

    [...]

  • ...Arjovsky et al. (2019) construct a binary classification problem (with 0-4 and 5-9 each collapsed into a single class) based on the MNIST dataset, using color as a spurious feature....

    [...]

  • ...…(Engstrom et al., 2019; Jacobsen et al., 2018) and non-adversarial (Hendrycks & Dietterich, 2019; Yin et al., 2019) robustness, causality (Arjovsky et al., 2019), and other works aimed at distinguishing statistical features from semantic features (Gowal et al., 2019; Geirhos et al.,…...

    [...]

  • ...In Section C, we provide results on the synthetic structural equation models from Arjovsky et al. (2019)....

    [...]

Posted Content
TL;DR: This work identifies underspecification as a key reason for poor real-world model behavior, shows that it appears in a wide variety of practical ML pipelines, and argues for explicitly accounting for it in modeling pipelines intended for real-world deployment in any domain.
Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

374 citations


Cites background from "Invariant Risk Minimization"

  • ...In particular, concerns regarding “spurious correlations” and “shortcut learning” in trained models are now widespread (e.g., Geirhos et al., 2020; Arjovsky et al., 2019)....

    [...]

  • ...In this context, Peters et al. (2016); Heinze-Deml et al. (2018); Arjovsky et al. (2019); Magliacane et al. (2018) propose approaches to overcome this structural bias, often by using data collected in multiple environments to identify causal invariances....

    [...]

  • ...We call these structural failure modes, because they are often diagnosed as a misalignment between the predictor learned by empirical risk minimization and the causal structure of the desired predictor (Schölkopf, 2019; Arjovsky et al., 2019)....

    [...]

  • ...In such cases, the iid-optimal predictors must necessarily incorporate spurious associations (Caruana et al., 2015; Arjovsky et al., 2019; Ilyas et al., 2019)....

    [...]

Journal ArticleDOI
TL;DR: In this paper, a set of recommendations for model interpretation and benchmarking is presented, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.
Abstract: Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distil how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.

311 citations

Journal ArticleDOI
TL;DR: In this paper, the authors highlight the importance of establishing the causal relationship between images and their annotations, and offer step-by-step recommendations for future studies, while providing a detailed categorisation of potential biases and mitigation techniques.
Abstract: Causal reasoning can shed new light on the major challenges in machine learning for medical imaging: scarcity of high-quality annotated data and mismatch between the development dataset and the target environment. A causal perspective on these issues allows decisions about data collection, annotation, preprocessing, and learning strategies to be made and scrutinized more transparently, while providing a detailed categorisation of potential biases and mitigation techniques. Along with worked clinical examples, we highlight the importance of establishing the causal relationship between images and their annotations, and offer step-by-step recommendations for future studies.

233 citations

References
More filters
Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

29,480 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Also, there are problems where we predict parts of the input from other parts of the input, like in self-supervised learning [14]....

    [...]

01 Jan 1998
TL;DR: The author presents a method for determining the necessary and sufficient conditions for consistency of the learning process, covering function estimation from small data pools and the application of these estimates to real-life problems.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Because most machine learning algorithms depend on the assumption that training and testing data are sampled independently from the same distribution [51], it is common practice to shuffle at random the training and testing examples....

    [...]

MonographDOI
TL;DR: This monograph develops a theory of inferred causation, covering causal diagrams and the identification of causal effects, structural and counterfactual models, confounding, and the probability of causation.
Abstract: 1. Introduction to probabilities, graphs, and causal models 2. A theory of inferred causation 3. Causal diagrams and the identification of causal effects 4. Actions, plans, and direct effects 5. Causality and structural models in the social sciences 6. Simpson's paradox, confounding, and collapsibility 7. Structural and counterfactual models 8. Imperfect experiments: bounds and counterfactuals 9. Probability of causation: interpretation and identification Epilogue: the art and science of cause and effect.

12,606 citations


"Invariant Risk Minimization" refers background or methods in this paper

  • ...A Structural Equation Model (SEM) C := (S, N) governing the random vector X = (X_1, . . . , X_d) is a set of structural equations: S_i : X_i ← f_i(Pa(X_i), N_i), where Pa(X_i) ⊆ {X_1, . . . , X_d} \ {X_i} are called the parents of X_i, and the N_i are independent noise random variables....

    [...]

  • ...Third, in some cases the features X will not be directly observed, but only a scrambled version X · S. Figure 3 summarizes the SEM generating the data (X^e, Y^e) for all environments e in these experiments....

    [...]

  • ...An intervention e on C consists of replacing one or several of its structural equations to obtain an intervened SEM C^e = (S^e, N^e), with structural equations: S^e_i : X^e_i ← f^e_i(Pa^e(X^e_i), N^e_i). The variable X^e_i is intervened if S_i ≠ S^e_i or N_i ≠ N^e_i....

    [...]

  • ...We begin by assuming that the data from all the environments share the same underlying Structural Equation Model, or SEM [55, 39]:...

    [...]

  • ...Consider a SEM C = (S, N)....

    [...]
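The intervened-SEM setup in these excerpts can be made concrete with a toy simulation: the environment intervenes on the noise of a causal feature X1, so the regression of Y on X1 stays invariant across environments while the regression on the anti-causal X2 does not. The structural equations and coefficients below are illustrative, not the paper's exact experiment:

```python
import numpy as np

def simulate(n, e_scale=1.0, rng=None):
    """Toy intervened SEM (illustrative):
        X1 <- N1          (environment intervenes on the scale of N1)
        Y  <- X1 + N_y    (X1 -> Y is the causal mechanism)
        X2 <- Y + N2      (X2 is anti-causal: an effect of Y)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x1 = rng.normal(0.0, e_scale, n)   # intervened variable
    y = x1 + rng.normal(0.0, 1.0, n)   # invariant mechanism
    x2 = y + rng.normal(0.0, 1.0, n)   # effect of Y, varies with environment
    return x1, x2, y
```

Regressing Y on X1 yields a coefficient near 1 in every environment, while the coefficient of Y on X2 shifts with the intervention scale; this is exactly the asymmetry that invariance-based methods exploit.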

Journal ArticleDOI
TL;DR: A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented in this paper, where the objective is to specify the benefits of randomization in estimating causal effects of treatments.
Abstract: A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented. The objective is to specify the benefits of randomization in estimating causal effects of treatments. The basic conclusion is that randomization should be employed whenever possible but that the use of carefully controlled nonrandomized data to estimate causal effects is a reasonable and necessary procedure in many cases. Recent psychological and educational literature has included extensive criticism of the use of nonrandomized studies to estimate causal effects of treatments (e.g., Campbell & Erlebacher, 1970). The implication in much of this literature is that only properly randomized experiments can lead to useful estimates of causal effects. If taken as applying to all fields of study, this position is untenable. Since the extensive use of randomized experiments is limited to the last half century, and in fact is not used in much scientific investigation today, one is led to the conclusion that most scientific "truths" have been established without using randomized experiments. In addition, most of us successfully determine the causal effects of many of our everyday actions, even interpersonal behaviors, without the benefit of randomization. Even if the position that causal effects of treatments can only be well established from randomized experiments is taken as applying only to the social sciences in which

8,377 citations


Additional excerpts

  • ...Rubin’s ignorability [44] plays the same role....

    [...]

Book
23 Sep 2002
TL;DR: This textbook develops the theory of smooth manifolds, covering smooth maps, submanifolds, Lie groups, vector fields, tensors, differential forms, integration on manifolds, and de Rham cohomology, with appendix reviews of topology, linear algebra, calculus, and differential equations.
Abstract: Preface.- 1 Smooth Manifolds.- 2 Smooth Maps.- 3 Tangent Vectors.- 4 Submersions, Immersions, and Embeddings.- 5 Submanifolds.- 6 Sard's Theorem.- 7 Lie Groups.- 8 Vector Fields.- 9 Integral Curves and Flows.- 10 Vector Bundles.- 11 The Cotangent Bundle.- 12 Tensors.- 13 Riemannian Metrics.- 14 Differential Forms.- 15 Orientations.- 16 Integration on Manifolds.- 17 De Rham Cohomology.- 18 The de Rham Theorem.- 19 Distributions and Foliations.- 20 The Exponential Map.- 21 Quotient Manifolds.- 22 Symplectic Manifolds.- Appendix A: Review of Topology.- Appendix B: Review of Linear Algebra.- Appendix C: Review of Calculus.- Appendix D: Review of Differential Equations.- References.- Notation Index.- Subject Index

3,051 citations