Posted Content

Invariant Risk Minimization

TL;DR: This work introduces Invariant Risk Minimization, a learning paradigm to estimate invariant correlations across multiple training distributions and shows how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
Abstract: We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
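In the linear least-squares case, the practical IRMv1 penalty (the squared gradient of each environment's risk with respect to a fixed scalar classifier w = 1.0 placed on top of the representation) has a closed form. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the function name and toy setup are my own.

```python
import numpy as np

def irmv1_objective(phi, envs, lam=1.0):
    """IRMv1-style objective for a linear predictor (illustrative sketch).

    phi  : weight vector of the linear representation/predictor, shape (d,)
    envs : list of (X, y) pairs, one per training environment
    lam  : penalty strength

    The penalty is the squared gradient of each environment's squared-error
    risk with respect to a scalar classifier w, evaluated at w = 1.0.
    """
    total_risk, total_penalty = 0.0, 0.0
    for X, y in envs:
        pred = X @ phi                       # predictions w * (X @ phi) at w = 1
        risk = np.mean((pred - y) ** 2)      # per-environment risk
        # d/dw mean((w * pred - y)^2) evaluated at w = 1:
        grad_w = 2.0 * np.mean(pred * (pred - y))
        total_risk += risk
        total_penalty += grad_w ** 2
    return total_risk + lam * total_penalty
```

For a representation that supports the same optimal classifier in every environment, each per-environment gradient term vanishes and only the risk remains.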
Citations
Posted Content
TL;DR: This paper implements DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria, and finds that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets.
Abstract: The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions -- datasets, architectures, and model selection criteria -- render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DomainBed and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DomainBed, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization.
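The empirical risk minimization baseline that DomainBed finds so strong simply ignores domain labels and trains on the union of all training domains. A minimal sketch of this pooling, assuming a linear least-squares predictor (the helper name is hypothetical):

```python
import numpy as np

def erm_pooled(envs):
    """ERM baseline (sketch): pool examples from all training domains and
    fit a single least-squares predictor, ignoring domain labels.

    envs : list of (X, y) pairs, one per training domain
    """
    X = np.vstack([Xe for Xe, _ in envs])        # stack features from all domains
    y = np.concatenate([ye for _, ye in envs])   # stack targets from all domains
    # Ordinary least squares on the pooled data
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```

The point of DomainBed is that, under a fair model selection protocol, this domain-agnostic baseline is hard to beat.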

492 citations


Cites background or methods from "Invariant Risk Minimization"

  • ...Akuzawa et al. (2019) extend DANN by considering cases where there exists a statistical dependence between the domain and the class label variables. Albuquerque et al. (2019) extend DANN by considering one-versus-all adversaries that try to predict which training domain each example belongs to. Li et al. (2018b) employ GANs and the maximum mean discrepancy criterion (Gretton et al., 2012) to align feature distributions across domains. Matsuura and Harada (2019) leverage clustering techniques to learn domain-invariant features even when the separation between training domains is not given. Li et al. (2018c;d) learn a feature transformation φ such that the conditional distributions P(φ(X^d) | Y^d = y) match for all training domains d and label values y. Shankar et al. (2018) use a domain classifier to construct adversarial examples for a label classifier, and use a label classifier to construct adversarial examples for the domain classifier. This results in a label classifier with better domain generalization. Li et al. (2019a) train a robust feature extractor and classifier. The robustness comes from (i) asking the feature extractor to produce features such that a classifier trained on domain d can classify instances for domain d′ ≠ d, and (ii) asking the classifier to predict labels on domain d using features produced by a feature extractor trained on domain d′ ≠ d. Li et al. (2020) adopt a lifelong learning strategy to attack the problem of domain generalization. Motiian et al. (2017) learn a feature representation such that (i) examples from different domains but the same class are close, (ii) examples from different domains and classes are far, and (iii) training examples can be correctly classified. Ilse et al. (2019) train a variational autoencoder (Kingma and Welling, 2014) where the bottleneck representation factorizes knowledge about domain, class label, and residual variations in the input space. Fang et al. (2013) learn a structural SVM metric such that the neighborhood of each example contains examples from the same category and all training domains....

    [...]


Posted Content
TL;DR: This work introduces the principle of Risk Extrapolation (REx), and shows conceptually how this principle enables extrapolation, and demonstrates the effectiveness and scalability of instantiations of REx on various OoD generalization tasks.
Abstract: Distributional shift is one of the major obstacles when transferring machine learning prediction systems from the lab to the real world. To tackle this problem, we assume that variation across training domains is representative of the variation we might encounter at test time, but also that shifts at test time may be more extreme in magnitude. In particular, we show that reducing differences in risk across training domains can reduce a model's sensitivity to a wide range of extreme distributional shifts, including the challenging setting where the input contains both causal and anti-causal elements. We motivate this approach, Risk Extrapolation (REx), as a form of robust optimization over a perturbation set of extrapolated domains (MM-REx), and propose a penalty on the variance of training risks (V-REx) as a simpler variant. We prove that variants of REx can recover the causal mechanisms of the targets, while also providing some robustness to changes in the input distribution ("covariate shift"). By appropriately trading-off robustness to causally induced distributional shifts and covariate shift, REx is able to outperform alternative methods such as Invariant Risk Minimization in situations where these types of shift co-occur.
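The V-REx variant described in the abstract reduces to a one-line objective: the mean of the per-environment training risks plus a penalty on their variance, which pushes risks to be equal across environments. A minimal sketch (the function name and default β are my own):

```python
import numpy as np

def vrex_objective(risks, beta=10.0):
    """V-REx objective (sketch): average the per-environment risks and
    penalize their variance, encouraging equal risk across training
    environments.

    risks : sequence of scalar risks, one per training environment
    beta  : penalty strength
    """
    risks = np.asarray(risks, dtype=float)
    return risks.mean() + beta * risks.var()
```

When all environments already achieve the same risk, the penalty vanishes and the objective coincides with ordinary ERM on the averaged risks.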

400 citations


Cites background or methods from "Invariant Risk Minimization"

  • ...Arjovsky et al. (2019) propose an extension of that work, called Invariant Risk Minimization (IRM), with the goal of learning a data representation that does not rely on spurious correlations....

    [...]

  • ...Arjovsky et al. (2019) construct a binary classification problem (with 0-4 and 5-9 each collapsed into a single class) based on the MNIST dataset, using color as a spurious feature....

    [...]

  • ...…(Engstrom et al., 2019; Jacobsen et al., 2018) and non-adversarial (Hendrycks & Dietterich, 2019; Yin et al., 2019) robustness, causality (Arjovsky et al., 2019), and other works aimed at distinguishing statistical features from semantic features (Gowal et al., 2019; Geirhos et al.,…...

    [...]

  • ...In Section C, we provide results on the synthetic structural equation models from Arjovsky et al. (2019)....

    [...]

Posted Content
TL;DR: This work identifies underspecification as a key reason for poor real-world model behavior, shows that it appears in a wide variety of practical ML pipelines, and argues for explicitly accounting for it in modeling pipelines intended for real-world deployment in any domain.
Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

374 citations


Cites background from "Invariant Risk Minimization"

  • ...In particular, concerns regarding “spurious correlations” and “shortcut learning” in trained models are now widespread (e.g., Geirhos et al., 2020; Arjovsky et al., 2019)....

    [...]

  • ...In this context, Peters et al. (2016); Heinze-Deml et al. (2018); Arjovsky et al. (2019); Magliacane et al. (2018) propose approaches to overcome this structural bias, often by using data collected in multiple environments to identify causal invariances....

    [...]

  • ...We call these structural failure modes, because they are often diagnosed as a misalignment between the predictor learned by empirical risk minimization and the causal structure of the desired predictor (Schölkopf, 2019; Arjovsky et al., 2019)....

    [...]

  • ...In such cases, the iid-optimal predictors must necessarily incorporate spurious associations (Caruana et al., 2015; Arjovsky et al., 2019; Ilyas et al., 2019)....

    [...]

Journal ArticleDOI
TL;DR: In this paper, a set of recommendations for model interpretation and benchmarking is presented, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.
Abstract: Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distil how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.

311 citations

Journal ArticleDOI
TL;DR: In this paper, the authors highlight the importance of establishing the causal relationship between images and their annotations, and offer step-by-step recommendations for future studies, while providing a detailed categorisation of potential biases and mitigation techniques.
Abstract: Causal reasoning can shed new light on the major challenges in machine learning for medical imaging: scarcity of high-quality annotated data and mismatch between the development dataset and the target environment. A causal perspective on these issues allows decisions about data collection, annotation, preprocessing, and learning strategies to be made and scrutinized more transparently, while providing a detailed categorisation of potential biases and mitigation techniques. Along with worked clinical examples, we highlight the importance of establishing the causal relationship between images and their annotations, and offer step-by-step recommendations for future studies.

233 citations

References
More filters
Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

29,480 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Also, there are problems where we predict parts of the input from other parts of the input, like in self-supervised learning [14]....

    [...]

01 Jan 1998
TL;DR: The author presents a method for determining the necessary and sufficient conditions for consistency of the learning process, covering function estimation from small data pools and the application of these estimates to real-life problems.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Because most machine learning algorithms depend on the assumption that training and testing data are sampled independently from the same distribution [51], it is common practice to shuffle at random the training and testing examples....

    [...]

MonographDOI
TL;DR: This monograph develops a theory of inferred causation, covering causal diagrams and the identification of causal effects, structural and counterfactual models, confounding, and the probability of causation.
Abstract: 1. Introduction to probabilities, graphs, and causal models 2. A theory of inferred causation 3. Causal diagrams and the identification of causal effects 4. Actions, plans, and direct effects 5. Causality and structural models in the social sciences 6. Simpson's paradox, confounding, and collapsibility 7. Structural and counterfactual models 8. Imperfect experiments: bounds and counterfactuals 9. Probability of causation: interpretation and identification Epilogue: the art and science of cause and effect.

12,606 citations


"Invariant Risk Minimization" refers background or methods in this paper

  • ...A Structural Equation Model (SEM) C := (S, N) governing the random vector X = (X_1, . . . , X_d) is a set of structural equations: S_i : X_i ← f_i(Pa(X_i), N_i), where Pa(X_i) ⊆ {X_1, . . . , X_d} \ {X_i} are called the parents of X_i, and the N_i are independent noise random variables....

    [...]

  • ...Third, in some cases the features X will not be directly observed, but only a scrambled version X · S. Figure 3 summarizes the SEM generating the data (X^e, Y^e) for all environments e in these experiments....

    [...]

  • ...An intervention e on C consists of replacing one or several of its structural equations to obtain an intervened SEM C^e = (S^e, N^e), with structural equations: S^e_i : X^e_i ← f^e_i(Pa^e(X^e_i), N^e_i). The variable X^e_i is intervened if S_i ≠ S^e_i or N_i ≠ N^e_i....

    [...]

  • ...We begin by assuming that the data from all the environments share the same underlying Structural Equation Model, or SEM [55, 39]:...

    [...]

  • ...Consider a SEM C = (S, N)....

    [...]
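The intervened-SEM setup in these excerpts can be made concrete with a toy simulation: the environment intervenes on the noise of a causal feature X1, so the regression of Y on X1 stays invariant across environments while the regression on the anti-causal X2 does not. The structural equations and coefficients below are illustrative, not the paper's exact experiment:

```python
import numpy as np

def simulate(n, e_scale=1.0, rng=None):
    """Toy intervened SEM (illustrative):
        X1 <- N1          (environment intervenes on the scale of N1)
        Y  <- X1 + N_y    (X1 -> Y is the causal mechanism)
        X2 <- Y + N2      (X2 is anti-causal: an effect of Y)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x1 = rng.normal(0.0, e_scale, n)   # intervened variable
    y = x1 + rng.normal(0.0, 1.0, n)   # invariant mechanism
    x2 = y + rng.normal(0.0, 1.0, n)   # effect of Y, varies with environment
    return x1, x2, y
```

Regressing Y on X1 yields a coefficient near 1 in every environment, while the coefficient of Y on X2 shifts with the intervention scale; this is exactly the asymmetry that invariance-based methods exploit.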

Journal ArticleDOI
TL;DR: A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented in this paper, where the objective is to specify the benefits of randomization in estimating causal effects of treatments.
Abstract: A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented. The objective is to specify the benefits of randomization in estimating causal effects of treatments. The basic conclusion is that randomization should be employed whenever possible but that the use of carefully controlled nonrandomized data to estimate causal effects is a reasonable and necessary procedure in many cases. Recent psychological and educational literature has included extensive criticism of the use of nonrandomized studies to estimate causal effects of treatments (e.g., Campbell & Erlebacher, 1970). The implication in much of this literature is that only properly randomized experiments can lead to useful estimates of causal effects. If taken as applying to all fields of study, this position is untenable. Since the extensive use of randomized experiments is limited to the last half century, and in fact is not used in much scientific investigation today, one is led to the conclusion that most scientific "truths" have been established without using randomized experiments. In addition, most of us successfully determine the causal effects of many of our everyday actions, even interpersonal behaviors, without the benefit of randomization. Even if the position that causal effects of treatments can only be well established from randomized experiments is taken as applying only to the social sciences in which

8,377 citations


Additional excerpts

  • ...Rubin’s ignorability [44] plays the same role....

    [...]

Book
23 Sep 2002
TL;DR: This textbook develops the theory of smooth manifolds, covering smooth maps, submanifolds, Lie groups, vector fields, tensors, differential forms, integration on manifolds, and de Rham cohomology, with appendix reviews of topology, linear algebra, calculus, and differential equations.
Abstract: Preface.- 1 Smooth Manifolds.- 2 Smooth Maps.- 3 Tangent Vectors.- 4 Submersions, Immersions, and Embeddings.- 5 Submanifolds.- 6 Sard's Theorem.- 7 Lie Groups.- 8 Vector Fields.- 9 Integral Curves and Flows.- 10 Vector Bundles.- 11 The Cotangent Bundle.- 12 Tensors.- 13 Riemannian Metrics.- 14 Differential Forms.- 15 Orientations.- 16 Integration on Manifolds.- 17 De Rham Cohomology.- 18 The de Rham Theorem.- 19 Distributions and Foliations.- 20 The Exponential Map.- 21 Quotient Manifolds.- 22 Symplectic Manifolds.- Appendix A: Review of Topology.- Appendix B: Review of Linear Algebra.- Appendix C: Review of Calculus.- Appendix D: Review of Differential Equations.- References.- Notation Index.- Subject Index

3,051 citations