
Posted Content

Invariant Risk Minimization


TL;DR: This work introduces Invariant Risk Minimization, a learning paradigm to estimate invariant correlations across multiple training distributions and shows how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
Abstract: We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
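
The paper's practical instantiation, IRMv1, implements this matching condition as a gradient penalty: the optimal classifier is the same in every environment when each per-environment risk is stationary with respect to a fixed scalar classifier w = 1.0. Below is a minimal PyTorch sketch of that penalty; the model and the one-batch-per-environment loop are illustrative placeholders, not the paper's released code.

```python
# Minimal sketch of the IRMv1 penalty, in PyTorch.
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    """Squared gradient of the risk w.r.t. a fixed scalar classifier w = 1.0.

    If the representation admits a classifier that is simultaneously optimal
    in every environment, this gradient vanishes.
    """
    w = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * w, y)
    grad = torch.autograd.grad(loss, [w], create_graph=True)[0]
    return grad.pow(2).sum()

def irm_loss(model, envs, penalty_weight=1e4):
    """Average risk across environments plus the IRMv1 invariance penalty."""
    risk, penalty = 0.0, 0.0
    for x, y in envs:  # one (x, y) batch per training environment
        logits = model(x).squeeze(-1)
        risk = risk + F.binary_cross_entropy_with_logits(logits, y)
        penalty = penalty + irm_penalty(logits, y)
    return (risk + penalty_weight * penalty) / len(envs)
```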
Citations

Posted Content
04 May 2020 · arXiv: Learning
TL;DR: This tutorial article aims to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection.
Abstract: In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.
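
As a concrete illustration of the setting (not a method from the tutorial), the sketch below runs one pass of Q-learning purely from logged transitions, assuming PyTorch; the dataset layout and the frozen target network are hypothetical placeholders.

```python
# Hypothetical sketch of the offline RL setting: improve a Q-function using
# only previously collected transitions, with no new environment interaction.
import torch
import torch.nn.functional as F

def offline_q_epoch(qnet, target_qnet, logged_batches, optimizer, gamma=0.99):
    """One pass over a fixed dataset of (s, a, r, s', done) batches."""
    for s, a, r, s_next, done in logged_batches:
        with torch.no_grad():
            # Bootstrap targets come from the logged data alone.
            target = r + gamma * (1 - done) * target_qnet(s_next).max(dim=1).values
        q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```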

324 citations


Cites methods from "Invariant Risk Minimization"

  • ...…techniques from causal inference (Schölkopf, 2019), uncertainty estimation (Gal and Ghahramani, 2016; Kendall and Gal, 2017), density estimation and generative modeling (Kingma et al., 2014), distributional robustness (Sinha et al., 2017; Sagawa et al., 2019) and invariance (Arjovsky et al., 2019)....



Journal ArticleDOI
TL;DR: A set of recommendations for model interpretation and benchmarking is developed, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.
Abstract: Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today’s machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this Perspective we seek to distil how many of deep learning’s failures can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in comparative psychology, education and linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications. Deep learning has resulted in impressive achievements, but under what circumstances does it fail, and why? The authors propose that its failures are a consequence of shortcut learning, a common characteristic across biological and artificial systems in which strategies that appear to have solved a problem fail unexpectedly under different circumstances.

291 citations


Posted Content
20 Nov 2019 · arXiv: Learning
TL;DR: The results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization, and introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
Abstract: Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---a stronger-than-typical L2 penalty or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
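
A minimal sketch of the worst-group objective described above, assuming PyTorch. The paper's actual algorithm is an online reweighting scheme with convergence guarantees, coupled with strong L2 regularization or early stopping; the hard max below is the simplest version of the idea.

```python
# Sketch of group DRO's worst-group training loss over pre-defined groups.
import torch
import torch.nn.functional as F

def worst_group_loss(logits, y, group_ids, n_groups):
    """Return the largest average loss over the pre-defined groups."""
    losses = F.cross_entropy(logits, y, reduction="none")
    group_losses = []
    for g in range(n_groups):
        mask = group_ids == g
        if mask.any():  # skip groups absent from this batch
            group_losses.append(losses[mask].mean())
    return torch.stack(group_losses).max()

# The paper couples this loss with stronger-than-typical L2 regularization,
# e.g. (illustrative value):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1.0)
```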

162 citations


Posted Content
Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, +36 more · Institutions (5)
06 Nov 2020 · arXiv: Learning
TL;DR: This work identifies underspecification as a key reason for unexpectedly poor behavior of deployed ML models, shows that it appears in a wide variety of practical ML pipelines, and argues that modeling pipelines intended for real-world deployment must explicitly account for it.
Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
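
One way to probe for the phenomenon the abstract describes is to vary only the random seed and compare the score spread in the training domain against the spread in a shifted deployment domain. A hedged sketch, where make_model, train, and evaluate are hypothetical placeholders:

```python
# Underspecification probe: seed-ensembles that look equivalent in-domain
# can disagree sharply under shift.
import statistics

def underspecification_probe(make_model, train, evaluate, seeds,
                             train_data, heldout_data, shifted_data):
    heldout_scores, shifted_scores = [], []
    for seed in seeds:
        model = train(make_model(seed), train_data)
        heldout_scores.append(evaluate(model, heldout_data))
        shifted_scores.append(evaluate(model, shifted_data))
    # Underspecification shows up as a small in-domain spread paired with a
    # large deployment-domain spread across otherwise "equivalent" models.
    return statistics.pstdev(heldout_scores), statistics.pstdev(shifted_scores)
```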

142 citations


Cites background from "Invariant Risk Minimization"

  • ...In particular, concerns regarding “spurious correlations” and “shortcut learning” in trained models are now widespread (e.g., Geirhos et al., 2020; Arjovsky et al., 2019)....


  • ...In this context, Peters et al. (2016); Heinze-Deml et al. (2018); Arjovsky et al. (2019); Magliacane et al. (2018) propose approaches to overcome this structural bias, often by using data collected in multiple environments to identify causal invariances....


  • ...We call these structural failure modes, because they are often diagnosed as a misalignment between the predictor learned by empirical risk minimization and the causal structure of the desired predictor (Schölkopf, 2019; Arjovsky et al., 2019)....


  • ...In such cases, the iid-optimal predictors must necessarily incorporate spurious associations (Caruana et al., 2015; Arjovsky et al., 2019; Ilyas et al., 2019)....



Posted Content
02 Mar 2020 · arXiv: Learning
TL;DR: This work introduces the principle of Risk Extrapolation (REx), shows conceptually how this principle enables extrapolation, and demonstrates the effectiveness and scalability of REx instantiations on various OoD generalization tasks.
Abstract: Distributional shift is one of the major obstacles when transferring machine learning prediction systems from the lab to the real world. To tackle this problem, we assume that variation across training domains is representative of the variation we might encounter at test time, but also that shifts at test time may be more extreme in magnitude. In particular, we show that reducing differences in risk across training domains can reduce a model's sensitivity to a wide range of extreme distributional shifts, including the challenging setting where the input contains both causal and anti-causal elements. We motivate this approach, Risk Extrapolation (REx), as a form of robust optimization over a perturbation set of extrapolated domains (MM-REx), and propose a penalty on the variance of training risks (V-REx) as a simpler variant. We prove that variants of REx can recover the causal mechanisms of the targets, while also providing some robustness to changes in the input distribution ("covariate shift"). By appropriately trading-off robustness to causally induced distributional shifts and covariate shift, REx is able to outperform alternative methods such as Invariant Risk Minimization in situations where these types of shift co-occur.
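
The V-REx variant mentioned above reduces to a one-line objective: the mean of the per-environment risks plus a penalty on their variance. A minimal PyTorch sketch, with the model and per-environment batches as illustrative placeholders:

```python
# Sketch of the V-REx objective: penalize variance of training risks.
import torch
import torch.nn.functional as F

def vrex_loss(model, envs, beta=10.0):
    """Mean risk across environments + beta * variance of the risks."""
    risks = torch.stack([
        F.binary_cross_entropy_with_logits(model(x).squeeze(-1), y)
        for x, y in envs  # one (x, y) batch per training environment
    ])
    return risks.mean() + beta * risks.var()
```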

112 citations


Cites background or methods from "Invariant Risk Minimization"

  • ...Arjovsky et al. (2019) propose an extension of that work, called Invariant Risk Minimization (IRM), with the goal of learning a data representation that does not rely on spurious correlations....


  • ...Arjovsky et al. (2019) construct a binary classification problem (with 0-4 and 5-9 each collapsed into a single class) based on the MNIST dataset, using color as a spurious feature....


  • ...…(Engstrom et al., 2019; Jacobsen et al., 2018) and non-adversarial (Hendrycks & Dietterich, 2019; Yin et al., 2019) robustness, causality (Arjovsky et al., 2019), and other works aimed at distinguishing statistical features from semantic features (Gowal et al., 2019; Geirhos et al.,…...


  • ...Arjovsky et al. (2019) propose an extension of this work, called Invariant Risk Minimization (IRM), in order to learn a data representation that does not rely on spurious correlations....


  • ...In Section C, we provide results on the synthetic structural equation models from Arjovsky et al. (2019)....



References

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.

26,121 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Because most machine learning algorithms depend on the assumption that training and testing data are sampled independently from the same distribution [51], it is common practice to shuffle at random the training and testing examples....



MonographDOI
Judea Pearl · Institutions (1)
Abstract: 1. Introduction to probabilities, graphs, and causal models; 2. A theory of inferred causation; 3. Causal diagrams and the identification of causal effects; 4. Actions, plans, and direct effects; 5. Causality and structural models in the social sciences; 6. Simpson's paradox, confounding, and collapsibility; 7. Structural and counterfactual models; 8. Imperfect experiments: bounds and counterfactuals; 9. Probability of causation: interpretation and identification; Epilogue: the art and science of cause and effect.

11,220 citations


"Invariant Risk Minimization" refers background or methods in this paper

  • ...A Structural Equation Model (SEM) C := (S, N) governing the random vector X = (X_1, . . . , X_d) is a set of structural equations: S_i : X_i ← f_i(Pa(X_i), N_i), where Pa(X_i) ⊆ {X_1, . . . , X_d} \ {X_i} are called the parents of X_i, and the N_i are independent noise random variables....


  • ...Third, in some cases the features X will not be directly observed, but only a scrambled version X · S. Figure 3 summarizes the SEM generating the data (X^e, Y^e) for all environments e in these experiments....


  • ...An intervention e on C consists of replacing one or several of its structural equations to obtain an intervened SEM C^e = (S^e, N^e), with structural equations: S^e_i : X^e_i ← f^e_i(Pa^e(X^e_i), N^e_i). The variable X^e_i is intervened if S_i ≠ S^e_i or N_i ≠ N^e_i....


  • ...We begin by assuming that the data from all the environments share the same underlying Structural Equation Model, or SEM [55, 39]:...


  • ...Consider a SEM C = (S, N)....

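To make the quoted SEM and intervention definitions concrete, here is a toy example in NumPy: a three-variable SEM in which an intervention on X2 produces a new environment, while the X1 → Y mechanism stays invariant. The model is illustrative, not the paper's exact experimental SEM.

```python
# Toy SEM in the notation of the excerpts above, plus an intervention.
import numpy as np

rng = np.random.default_rng(0)

def sem(n, intervene_x2=None):
    """X1 -> Y -> X2; each variable gets independent noise N_i."""
    x1 = rng.normal(size=n)              # X1 <- N1       (a cause of Y)
    y = x1 + rng.normal(size=n)          # Y  <- X1 + N_Y
    if intervene_x2 is None:
        x2 = y + rng.normal(size=n)      # X2 <- Y + N2   (an effect of Y)
    else:
        x2 = intervene_x2(n)             # intervened: equation S2 replaced
    return x1, x2, y

# Two environments: observational, and one where X2's equation is replaced.
obs = sem(1000)
interv = sem(1000, intervene_x2=lambda n: rng.normal(scale=5.0, size=n))
# The X1 -> Y mechanism is invariant across environments; X2 -> Y is not.
```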


Journal ArticleDOI
Donald B. Rubin1Institutions (1)
Abstract: A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented. The objective is to specify the benefits of randomization in estimating causal effects of treatments. The basic conclusion is that randomization should be employed whenever possible but that the use of carefully controlled nonrandomized data to estimate causal effects is a reasonable and necessary procedure in many cases. Recent psychological and educational literature has included extensive criticism of the use of nonrandomized studies to estimate causal effects of treatments (e.g., Campbell & Erlebacher, 1970). The implication in much of this literature is that only properly randomized experiments can lead to useful estimates of causal effects. If taken as applying to all fields of study, this position is untenable. Since the extensive use of randomized experiments is limited to the last half century, and in fact is not used in much scientific investigation today, one is led to the conclusion that most scientific "truths" have been established without using randomized experiments. In addition, most of us successfully determine the causal effects of many of our everyday actions, even interpersonal behaviors, without the benefit of randomization. Even if the position that causal effects of treatments can only be well established from randomized experiments is taken as applying only to the social sciences in which […]

7,232 citations


Additional excerpts

  • ...Rubin’s ignorability [44] plays the same role....



Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
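
The "one additional output layer" recipe is easy to see in code. A minimal sketch assuming the Hugging Face transformers library (an assumption; it is not part of the cited paper, which ships its own implementation):

```python
# Fine-tuning pre-trained BERT with a single added classification head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Loads pre-trained BERT and adds a freshly initialized output layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(["a premise and hypothesis pair"], return_tensors="pt")
outputs = model(**batch)  # fine-tune end-to-end on the downstream task
```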

4,099 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Also, there are problems where we predict parts of the input from other parts of the input, like in self-supervised learning [14]....



Book
John M. Lee · Institutions (1)
23 Sep 2002
Abstract: Preface; 1. Smooth Manifolds; 2. Smooth Maps; 3. Tangent Vectors; 4. Submersions, Immersions, and Embeddings; 5. Submanifolds; 6. Sard's Theorem; 7. Lie Groups; 8. Vector Fields; 9. Integral Curves and Flows; 10. Vector Bundles; 11. The Cotangent Bundle; 12. Tensors; 13. Riemannian Metrics; 14. Differential Forms; 15. Orientations; 16. Integration on Manifolds; 17. De Rham Cohomology; 18. The de Rham Theorem; 19. Distributions and Foliations; 20. The Exponential Map; 21. Quotient Manifolds; 22. Symplectic Manifolds; Appendix A: Review of Topology; Appendix B: Review of Linear Algebra; Appendix C: Review of Calculus; Appendix D: Review of Differential Equations; References; Notation Index; Subject Index.

2,654 citations


Network Information
Related Papers (5)

  • 27 Jun 2016 · Kaiming He, Xiangyu Zhang, +2 more
  • 01 Jan 2016, Journal of Machine Learning Research · Yaroslav Ganin, Evgeniya Ustinova, +6 more
  • 16 Jun 2013 · Krikamol Muandet, David Balduzzi, +1 more
  • 25 Dec 2017 · Da Li, Yongxin Yang, +2 more
  • 01 Jun 2018 · Haoliang Li, Sinno Jialin Pan, +2 more

Performance Metrics

Citations received by the paper in previous years:

  Year    Citations
  2021    211
  2020    160
  2019    22
  2018    1