Posted Content

Invariant Risk Minimization

TL;DR: This work introduces Invariant Risk Minimization, a learning paradigm to estimate invariant correlations across multiple training distributions and shows how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
Abstract: We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
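For concreteness, here is a minimal PyTorch sketch of the kind of penalty the paper proposes (IRMv1): each environment's risk is augmented with the squared gradient of that risk with respect to a fixed scalar "dummy" classifier. The binary cross-entropy loss, the `model`/`env_batches` names, and the value of `lam` are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    # Gradient of the per-environment risk with respect to a fixed scalar
    # "dummy" classifier w = 1.0 placed on top of the representation; its
    # squared norm measures how far that classifier is from being optimal
    # for this environment. y: float targets in {0., 1.}.
    w = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * w, y)
    grad, = torch.autograd.grad(loss, [w], create_graph=True)
    return grad.pow(2).sum()

def irm_objective(model, env_batches, lam=100.0):
    # Sum of per-environment risks plus the invariance penalty (illustrative lam).
    total = 0.0
    for x, y in env_batches:  # one (inputs, labels) batch per training environment
        logits = model(x).squeeze(-1)
        risk = F.binary_cross_entropy_with_logits(logits, y)
        total = total + risk + lam * irm_penalty(logits, y)
    return total
```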
Citations
Posted Content
TL;DR: This tutorial article aims to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection.
Abstract: In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.
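As a concrete illustration of the setting (not a method advocated by the tutorial), the sketch below performs a single Q-learning update using only transitions drawn from a fixed, previously collected dataset; the `q_net`/`target_net` networks and the batch layout are assumptions made for the example, and the tutorial discusses why such naive off-policy updates often fail offline.

```python
import torch
import torch.nn.functional as F

def offline_q_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One Q-learning step on a batch sampled from a fixed, previously
    collected dataset -- no new environment interaction is involved."""
    s, a, r, s_next, done = batch  # tensors from the logged dataset; a is Long, done is float
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```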

950 citations


Cites methods from "Invariant Risk Minimization"

  • ...techniques from causal inference (Schölkopf, 2019), uncertainty estimation (Gal and Ghahramani, 2016; Kendall and Gal, 2017), density estimation and generative modeling (Kingma et al., 2014), distributional robustness (Sinha et al., 2017; Sagawa et al., 2019) and invariance (Arjovsky et al., 2019)....


Journal ArticleDOI
TL;DR: A set of recommendations for model interpretation and benchmarking is developed, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.
Abstract: Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today’s machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this Perspective we seek to distil how many of deep learning’s failures can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in comparative psychology, education and linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications. Deep learning has resulted in impressive achievements, but under what circumstances does it fail, and why? The authors propose that its failures are a consequence of shortcut learning, a common characteristic across biological and artificial systems in which strategies that appear to have solved a problem fail unexpectedly under different circumstances.
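A toy illustration of a shortcut, not code from the Perspective itself: a linear classifier trained where a spurious feature tracks the label scores well on data drawn the same way, then degrades once that correlation is removed. All features, noise levels and the model choice below are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
core = y + 0.8 * rng.normal(size=n)            # weakly predictive "true" feature
shortcut_train = y + 0.1 * rng.normal(size=n)  # spurious feature, nearly perfect in training
shortcut_test = rng.normal(size=n)             # correlation removed under the shift

X_train = np.column_stack([core, shortcut_train])
X_test = np.column_stack([core, shortcut_test])

clf = LogisticRegression().fit(X_train, y)
print("in-distribution accuracy:", clf.score(X_train, y))  # high: the shortcut works here
print("shifted accuracy:", clf.score(X_test, y))            # drops: the shortcut no longer holds
```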

924 citations

Journal ArticleDOI
26 Feb 2021
TL;DR: The authors reviewed fundamental concepts of causal inference and related them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research.
Abstract: The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

601 citations

Posted Content
TL;DR: WILDS, a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, is presented in the hope of encouraging the development of general-purpose methods that are anchored to real-world distribution shifts and that work well across different applications and problem settings.
Abstract: Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity, these real-world distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts which naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training results in substantially lower out-of-distribution than in-distribution performance, and that this gap remains even with models trained by existing methods for handling distribution shifts. This underscores the need for new training methods that produce models which are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at this https URL.
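A sketch of how the accompanying open-source package is typically used to load a dataset and its out-of-distribution split; the package, module and function names follow the project's documented usage pattern as recalled here and should be verified against the current release.

```python
# pip install wilds   (assumed install name)
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader, get_eval_loader

to_tensor = transforms.Compose([transforms.Resize((96, 96)), transforms.ToTensor()])

# Download one benchmark dataset and use the splits prescribed by the benchmark.
dataset = get_dataset(dataset="camelyon17", download=True)
train_data = dataset.get_subset("train", transform=to_tensor)
test_data = dataset.get_subset("test", transform=to_tensor)  # out-of-distribution split

train_loader = get_train_loader("standard", train_data, batch_size=32)
eval_loader = get_eval_loader("standard", test_data, batch_size=32)
```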

579 citations


Cites methods from "Invariant Risk Minimization"

  • ...We adapted the implementations of CORAL from Gulrajani & Lopez-Paz (2020); IRM from Arjovsky et al. (2019); and Group DRO from Sagawa et al. (2020a)....


Posted Content
TL;DR: The results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization; the authors also introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
Abstract: Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---a stronger-than-typical L2 penalty or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
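A minimal sketch of the group DRO idea in PyTorch, using an exponentiated-gradient style reweighting of per-group losses; `group_weights` is assumed to be a persistent tensor initialized uniformly (e.g. torch.ones(n_groups) / n_groups) and reused across batches, and the names and step size eta are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def group_dro_loss(logits, y, group_ids, group_weights, eta=0.01):
    # Average loss within each group present in the batch.
    n_groups = group_weights.numel()
    group_losses = torch.zeros(n_groups)
    for g in range(n_groups):
        mask = group_ids == g
        if mask.any():
            group_losses[g] = F.cross_entropy(logits[mask], y[mask])
    # Exponentiated-gradient update: up-weight the groups currently doing worst
    # (group_weights carries no gradient and always sums to one).
    with torch.no_grad():
        group_weights *= torch.exp(eta * group_losses)
        group_weights /= group_weights.sum()
    # Weighted loss: approaches the worst-group loss as the weights concentrate.
    return (group_weights * group_losses).sum()
```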

579 citations

References
Posted Content
TL;DR: The problem of function estimation in the case where an underlying causal model can be inferred is considered, and a hypothesis for when semi-supervised learning can help is formulated and corroborated with empirical results.
Abstract: We consider the problem of function estimation in the case where an underlying causal model can be inferred. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results.

427 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Irma: It sounds reasonable! What about the case where P(Y^e | X^e) changes? Does this happen in normal supervised learning? I remember attending a lecture by Professor Schölkopf [45, 25] where he mentioned that P(Y^e | X^e) is often invariant across environments when X is a cause of Y, and that it often varies when X is an effect of Y....


  • ...I remember attending a lecture by Professor Schölkopf [45, 25] where he mentioned that P(Y^e | X^e) is often invariant across environments when X^e is a cause of Y^e, and that it often varies when X^e is an effect of Y^e....


  • ...Contrary to Professor Schölkopf, I believe that most supervised learning problems, such as image classification, are causal....


  • ...Some works in machine learning [45, 18, 21, 26, 36, 43, 34, 7] pursue similar questions....


Posted Content
TL;DR: A high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain is introduced; it behaves similarly to state-of-the-art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts.
Abstract: Deep Neural Networks (DNNs) excel on many complex perceptual tasks, but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 33 x 33 px features and AlexNet performance for 17 x 17 px features). The constraint on local features makes it straightforward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similarly to state-of-the-art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years are mostly achieved by better fine-tuning rather than by qualitatively different decision strategies.
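A sketch of the bag-of-local-features idea, not the paper's ResNet-50-based architecture: class scores are computed from small patches via a network with a deliberately small receptive field and then averaged over spatial positions, discarding their ordering. Layer sizes below are illustrative.

```python
import torch
import torch.nn as nn

class TinyBagNet(nn.Module):
    """Patch-wise class logits with a small receptive field, averaged over space."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Three 3x3 convolutions keep the receptive field at roughly 7x7 pixels,
        # so each spatial position only "sees" a small local patch.
        self.local = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        patch_logits = self.classifier(self.local(x))  # [B, C, H', W']
        return patch_logits.mean(dim=(2, 3))           # average over spatial positions
```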

373 citations


"Invariant Risk Minimization" refers background in this paper

  • ...Unfortunately, spurious correlations and biases are often simpler to detect than the true phenomenon of interest [17, 9, 10, 11]....


Journal ArticleDOI
TL;DR: In this article, the authors exploit the invariance of a prediction under a causal model for causal inference: given different experimental settings (e.g. various interventions) they collect all models that do show invariance in their predictive accuracy across settings and interventions.
Abstract: What is the difference between a prediction that is made with a causal model and that with a non-causal model? Suppose that we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (e.g. various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments.
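A rough NumPy stand-in for the invariance search described above: for each candidate subset of predictors, fit a least-squares regression per environment and keep the subset only if the fitted coefficients roughly agree across environments. The tolerance check replaces the paper's formal hypothesis tests, and all names are made up for the example.

```python
from itertools import combinations
import numpy as np

def invariant_subsets(X_by_env, y_by_env, tol=0.1):
    """Return predictor subsets whose least-squares coefficients are
    (approximately) the same in every environment -- a crude stand-in for
    the formal invariance tests used in invariant causal prediction."""
    p = X_by_env[0].shape[1]
    accepted = []
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            coefs = []
            for X, y in zip(X_by_env, y_by_env):
                beta, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
                coefs.append(beta)
            # Largest across-environment spread of any coefficient in the subset.
            spread = np.max(np.ptp(np.stack(coefs), axis=0))
            if spread < tol:
                accepted.append(subset)
    return accepted
```

The intersection of the accepted subsets then plays the role of the identified causal predictors, which is the intuition the paper makes precise with confidence guarantees.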

338 citations

Book ChapterDOI
08 Oct 2016
TL;DR: The authors propose a simple alternative model based on binary classification, which receives the answer as input and predicts whether or not an image-question-answer triplet is correct; it achieves state-of-the-art performance on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task.
Abstract: Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8% accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.
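A sketch of the multiple-choice scoring scheme described above: a binary model scores each image-question-answer triplet and the highest-scoring candidate answer is selected. The `score_triplet` model and the feature arguments are placeholders, not the paper's implementation.

```python
import torch

def predict_answer(score_triplet, image_feat, question_feat, answer_feats):
    """`score_triplet` is any model mapping an (image, question, answer)
    feature triple to a single correctness logit; the candidate answer
    with the highest score is returned."""
    scores = torch.stack([
        score_triplet(image_feat, question_feat, a) for a in answer_feats
    ])
    return int(torch.argmax(scores))
```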

287 citations

Book ChapterDOI
08 Sep 2018
TL;DR: The CaltechCameraTraps dataset as mentioned in this paper is designed to measure recognition generalization to novel environments, where cameras are fixed at one location, hence the background changes little across images; capture is triggered automatically, hence there is no human bias.
Abstract: It is desirable for detection and classification algorithms to generalize to unfamiliar environments, but suitable benchmarks for quantitatively studying this phenomenon are not yet available. We present a dataset designed to measure recognition generalization to novel environments. The images in our dataset are harvested from twenty camera traps deployed to monitor animal populations. Camera traps are fixed at one location, hence the background changes little across images; capture is triggered automatically, hence there is no human bias. The challenge is learning recognition in a handful of locations, and generalizing animal detection and classification to new locations where no training data is available. In our experiments state-of-the-art algorithms show excellent performance when tested at the same location where they were trained. However, we find that generalization to new locations is poor, especially for classification systems. (The dataset is available at https://beerys.github.io/CaltechCameraTraps/)

259 citations