Showing papers by "Yoshua Bengio" published in 2021

Journal Article•DOI•

Bernhard Schölkopf¹, Francesco Locatello¹, Stefan Bauer¹, Nan Rosemary Ke, Nal Kalchbrenner², Anirudh Goyal, Yoshua Bengio•Institutions (2)

Max Planck Society¹, Google²

26 Feb 2021

TL;DR: The authors reviewed fundamental concepts of causal inference and related them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research.

Abstract: The two fields of machine learning and graphical causality arose and are developed separately. However, there is, now, cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

601 citations

Journal Article•DOI•

Machine learning for combinatorial optimization: A methodological tour d’horizon

Yoshua Bengio¹, Andrea Lodi², Antoine Prouvost²•Institutions (2)

Université de Montréal¹, École Polytechnique de Montréal²

16 Apr 2021-European Journal of Operational Research

TL;DR: This paper surveys machine learning for combinatorial optimization; its main point is to treat generic optimization problems as data points and to ask which distribution of problems is relevant for learning on a given task.

464 citations

Journal Article•DOI•

Deep learning for AI

Yoshua Bengio¹, Yann LeCun², Geoffrey E. Hinton³•Institutions (3)

Université de Montréal¹, New York University², University of Toronto³

21 Jun 2021-Communications of The ACM

TL;DR: The authors examine how neural networks can learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language.

Abstract: How can neural networks learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language?

294 citations

Journal Article•DOI•

Inherent privacy limitations of decentralized contact tracing apps.

Yoshua Bengio¹, Daphne Ippolito², Richard Janda³, Max Jarvie, Benjamin Prud'homme, Jean-François Rousseau, Abhinav Sharma⁴, Yun William Yu⁵•Institutions (5)

Université de Montréal¹, University of Pennsylvania², McGill University³, McGill University Health Centre⁴, University of Toronto⁵

15 Jan 2021-Journal of the American Medical Informatics Association

TL;DR: This brief communication discusses a few inherent privacy limitations of any decentralized automatic contact tracing system.

39 citations

Journal Article•DOI•

How does hemispheric specialization contribute to human-defining cognition?

Gesa Hartwigsen¹, Yoshua Bengio², Danilo Bzdok³,⁴•Institutions (4)

Max Planck Society¹, Université de Montréal², McGill University³, Montreal Neurological Institute and Hospital⁴

07 Jul 2021-Neuron

TL;DR: The authors draw on dual-processing theories of cognition to explain human cognitive abilities such as semantic understanding of world structure, logical reasoning, and communication via language, and integrate them with global workspace theory to explain the dynamic relay of information products between the two systems.

33 citations

Posted Content•

Towards Causal Representation Learning

Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio

22 Feb 2021-arXiv: Learning

Abstract: The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

28 citations

Posted Content•

SpeechBrain: A General-Purpose Speech Toolkit

08 Jun 2021-arXiv: Audio and Speech Processing

TL;DR: SpeechBrain is an open-source, all-in-one speech toolkit designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented.

Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

27 citations

Journal Article•DOI•

Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing Its Gradient Estimator Bias.

Axel Laborieux¹, Maxence Ernoult¹,², Benjamin Scellier², Yoshua Bengio²,³, Julie Grollier¹, Damien Querlioz¹•Institutions (3)

Université Paris-Saclay¹, Université de Montréal², Canadian Institute for Advanced Research³

01 Jan 2021-Frontiers in Neuroscience

TL;DR: The authors show that a bias in the gradient estimate of equilibrium propagation, inherent in the use of finite nudging, is responsible for its failure to scale to visual tasks harder than MNIST, and that cancelling this bias allows training deep convolutional neural networks.

Abstract: Equilibrium Propagation is a biologically-inspired algorithm that trains convergent recurrent neural networks with a local learning rule. This approach constitutes a major lead to allow learning-capable neuromorphic systems and comes with strong theoretical guarantees. Equilibrium propagation operates in two phases, during which the network is left to evolve freely and then "nudged" towards a target; the weights of the network are then updated based solely on the states of the neurons that they connect. The weight updates of Equilibrium Propagation have been shown mathematically to approach those provided by Backpropagation Through Time (BPTT), the mainstream approach to train recurrent neural networks, when nudging is performed with infinitely small strength. In practice, however, the standard implementation of Equilibrium Propagation does not scale to visual tasks harder than MNIST. In this work, we show that a bias in the gradient estimate of equilibrium propagation, inherent in the use of finite nudging, is responsible for this phenomenon and that cancelling it allows training deep convolutional neural networks. We show that this bias can be greatly reduced by using symmetric nudging (a positive nudging and a negative one). We also generalize Equilibrium Propagation to the case of cross-entropy loss (as opposed to squared error). As a result of these advances, we are able to achieve a test error of 11.7% on CIFAR-10, which approaches the one achieved by BPTT and provides a major improvement with respect to the standard Equilibrium Propagation that gives 86% test error. We also apply these techniques to train an architecture with unidirectional forward and backward connections, yielding a 13.2% test error. These results highlight equilibrium propagation as a compelling biologically-plausible approach to compute error gradients in deep neuromorphic systems.
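
The bias argument in this abstract can be illustrated with ordinary finite differences. The sketch below is not the authors' implementation: `g` is a hypothetical smooth function standing in for how a measured quantity depends on the nudging strength `beta`, and the target is its derivative at `beta = 0`. A single positive nudge gives an estimate whose error shrinks linearly with `beta`, while a pair of positive and negative nudges cancels the leading error term.

```python
import math

def one_sided(g, beta):
    """Estimate g'(0) from a free phase (beta = 0) and one nudged phase."""
    return (g(beta) - g(0.0)) / beta

def symmetric(g, beta):
    """Estimate g'(0) from a positive and a negative nudge of equal strength."""
    return (g(beta) - g(-beta)) / (2.0 * beta)

# Hypothetical stand-in for the dependence on nudging strength.
# For g = exp, the true derivative at 0 is exactly 1.0.
g = math.exp
true_grad = 1.0
beta = 0.5  # deliberately large, finite nudging strength

bias_one_sided = abs(one_sided(g, beta) - true_grad)
bias_symmetric = abs(symmetric(g, beta) - true_grad)

print(f"one-sided bias: {bias_one_sided:.3f}")
print(f"symmetric bias: {bias_symmetric:.3f}")
```

With these toy values the one-sided bias is roughly 0.30 and the symmetric bias roughly 0.04, which mirrors the qualitative claim that symmetric nudging greatly reduces the bias of a finite-nudging estimate.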

22 citations

Posted Content•DOI•

Learning from unexpected events in the neocortical microcircuit

Colleen J Gillon¹, Jason E. Pina², Jérôme Lecoq³, Ruweida Ahmed³, Yazan N. Billeh³, Shiella Caldejon³, Peter A. Groblewski³, Timothy M. Henley², India Kato³, Eric Lee³, Jennifer Luviano³, Kyla Mace³, Chelsea Nayan³, Thuyanh V. Nguyen³, Kat North³, Jed Perkins³, Sam Seid³, Matthew T. Valley³, Ali Williford³, Yoshua Bengio⁴,⁵, Timothy P. Lillicrap⁶, Blake A. Richards, Joel Zylberberg•Institutions (6)

University of Toronto¹, York University², Allen Institute for Brain Science³, Université de Montréal⁴, Canadian Institute for Advanced Research⁵, University College London⁶

16 Jan 2021-bioRxiv

TL;DR: In this article, the authors show that unexpected event signals predict subsequent changes in responses to expected and unexpected stimuli in individual neurons and distal apical dendrites that are tracked over a period of days.

Abstract: Scientists have long conjectured that the neocortex learns the structure of the environment in a predictive, hierarchical manner. To do so, expected, predictable features are differentiated from unexpected ones by comparing bottom-up and top-down streams of data. It is theorized that the neocortex then changes the representation of incoming stimuli, guided by differences in the responses to expected and unexpected events. Such differences in cortical responses have been observed; however, it remains unknown whether these unexpected event signals govern subsequent changes in the brain’s stimulus representations, and, thus, govern learning. Here, we show that unexpected event signals predict subsequent changes in responses to expected and unexpected stimuli in individual neurons and distal apical dendrites that are tracked over a period of days. These findings were obtained by observing layer 2/3 and layer 5 pyramidal neurons in primary visual cortex of awake, behaving mice using two-photon calcium imaging. We found that many neurons in both layers 2/3 and 5 showed large differences between their responses to expected and unexpected events. These unexpected event signals also determined how the responses evolved over subsequent days, in a manner that was different between the somata and distal apical dendrites. This difference between the somata and distal apical dendrites may be important for hierarchical computation, given that these two compartments tend to receive bottom-up and top-down information, respectively. Together, our results provide novel evidence that the neocortex indeed instantiates a predictive hierarchical model in which unexpected events drive learning.

21 citations

Proceedings Article•

Recurrent Independent Mechanisms

Anirudh Goyal¹, Alex Lamb¹, Jordan Hoffmann, Shagun Sodhani², Sergey Levine³, Yoshua Bengio¹, Bernhard Schölkopf⁴•Institutions (4)

Université de Montréal¹, Facebook², University of Washington³, Max Planck Society⁴

03 May 2021

TL;DR: The authors propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and compete with each other so they are updated only at time steps where they are most relevant.

Abstract: We explore the hypothesis that learning modular structures which reflect the dynamics of the environment can lead to better generalization and robustness to changes that only affect a few of the underlying causes. We propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and compete with each other so they are updated only at time steps where they are most relevant. We show that this leads to specialization amongst the RIMs, which in turn allows for remarkably improved generalization on tasks where some factors of variation differ systematically between training and evaluation.
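
The competition described in this abstract can be pictured as a selection step in which only the most relevant modules are updated. The sketch below is a simplified illustration under stated assumptions, not the RIMs architecture itself: the relevance `scores` are hand-set numbers (in the paper they would come from attention), top-`k` selection is assumed as one simple way to realize the competition, and `step`, `select_active`, and the `update` callback are hypothetical names.

```python
def select_active(scores, k):
    """Return the indices of the k modules with the highest relevance scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def step(states, scores, k, update):
    """Update only the k winning modules; all others keep their previous state."""
    active = set(select_active(scores, k))
    return [update(s) if i in active else s for i, s in enumerate(states)]

# Four toy modules with scalar states and hand-set relevance scores (assumptions).
states = [0.0, 0.0, 0.0, 0.0]
scores = [0.1, 0.7, 0.05, 0.6]

active = select_active(scores, k=2)
new_states = step(states, scores, k=2, update=lambda s: s + 1.0)

print(active)      # indices of the modules that won the competition
print(new_states)  # only those modules changed state
```

Here modules 1 and 3 have the highest scores, so only their states change; the remaining modules carry their state forward untouched.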

20 citations

Journal Article•DOI•

Problems in the deployment of machine-learned models in health care.

Joseph Paul Cohen, Tianshi Cao, Joseph D. Viviano, Chin-Wei Huang, Michael Fralick, Marzyeh Ghassemi, Muhammad Mamdani, Russell Greiner, Yoshua Bengio

07 Sep 2021-Canadian Medical Association Journal

TL;DR: This article accompanies a companion article in which Verma et al. discuss how machine-learned solutions can be developed and implemented to support medical decision-making; it addresses problems in deploying such models in health care.

Abstract: See related articles at www.cmaj.ca/lookup/doi/10.1503/cmaj.202434 and www.cmaj.ca/lookup/doi/10.1503/cmaj.210036. KEY POINTS: In a companion article, Verma and colleagues discuss how machine-learned solutions can be developed and implemented to support medical decision-making. Both

Journal Article•DOI•

Using Artificial Intelligence to Visualize the Impacts of Climate Change

Alexandra Luccioni¹, Victor Schmidt¹, Vahe Vardanyan¹, Yoshua Bengio¹, Theresa-Marie Rhyne•Institutions (1)

Université de Montréal¹

01 Jan 2021-IEEE Computer Graphics and Applications

TL;DR: The AI climate impact visualizer uses cutting-edge artificial intelligence (AI) approaches to provide an interactive, personalized visualization tool: a user enters an address and receives an AI-imagined possible visualization of that location in 2050 following the detrimental effects of climate change.

Abstract: Public awareness and concern about climate change often do not match the magnitude of its threat to humans and our environment. One reason for this disagreement is that it is difficult to mentally simulate the effects of a process as complex as climate change and to have a concrete representation of the impact that our individual actions will have on our own future, especially if the consequences are long term and abstract. To overcome these challenges, we propose to use cutting-edge artificial intelligence (AI) approaches to develop an interactive personalized visualization tool, the AI climate impact visualizer. It will allow a user to enter an address (be it their house, their school, or their workplace), and it will provide them with an AI-imagined possible visualization of the future of this location in 2050 following the detrimental effects of climate change such as floods, storms, and wildfires. This image will be accompanied by accessible information regarding the science behind climate change, i.e., why extreme weather events are becoming more frequent and what kinds of changes are happening on a local and global scale.

Posted Content•

Coordination Among Neural Modules Through a Shared Global Workspace.

Anirudh Goyal, Aniket Didolkar, Alex Lamb, Kartikeya Badola, Nan Rosemary Ke, Nasim Rahaman, Jonathan Binas, Charles Blundell, Michael C. Mozer, Yoshua Bengio

01 Mar 2021-arXiv: Learning

TL;DR: The authors explore a shared, bandwidth-limited communication channel (a global workspace) in deep learning for modeling the structure of complex environments, and show that capacity limitations have a rational basis: they encourage specialization and compositionality and facilitate the synchronization of otherwise independent specialists.

Abstract: Deep learning has seen a movement away from representing examples with a monolithic hidden state towards a richly structured state. For example, Transformers segment by position, and object-centric architectures decompose images into entities. In all these architectures, interactions between different elements are modeled via pairwise interactions: Transformers make use of self-attention to incorporate information from other positions; object-centric architectures make use of graph neural networks to model interactions among entities. However, pairwise interactions may not achieve global coordination or a coherent, integrated representation that can be used for downstream tasks. In cognitive science, a global workspace architecture has been proposed in which functionally specialized components share information through a common, bandwidth-limited communication channel. We explore the use of such a communication channel in the context of deep learning for modeling the structure of complex environments. The proposed method includes a shared workspace through which communication among different specialist modules takes place but due to limits on the communication bandwidth, specialist modules must compete for access. We show that capacity limitations have a rational basis in that (1) they encourage specialization and compositionality and (2) they facilitate the synchronization of otherwise independent specialists.

Proceedings Article•

Learning Neural Generative Dynamics for Molecular Conformation Generation

Minkai Xu¹, Shitong Luo², Yoshua Bengio¹, Jian Peng³, Jian Tang⁴•Institutions (4)

Université de Montréal¹, Peking University², University of Illinois at Urbana–Champaign³, HEC Montréal⁴

03 May 2021

TL;DR: This paper proposes a probabilistic framework that generates valid and diverse conformations from a molecular graph by combining the advantages of flow-based and energy-based models, giving it high capacity to estimate the multimodal conformation distribution.

Abstract: We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling.

Proceedings Article•

Meta-Learning Framework with Applications to Zero-Shot Time-Series Forecasting

Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio

18 May 2021

TL;DR: In this article, a broad meta-learning framework is proposed to discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets.

Abstract: Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence to this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.

Proceedings Article•

Saliency is a Possible Red Herring When Diagnosing Poor Generalization

Joseph D. Viviano¹, Becks Simpson², Francis Dutil³, Yoshua Bengio¹, Joseph Paul Cohen⁴•Institutions (4)

Université de Montréal¹, University of Queensland², Université de Sherbrooke³, Stanford University⁴

03 May 2021

TL;DR: The authors study methods that use expert-drawn masks during training to make networks ignore distracting features. These methods improve generalization under covariate shift, but the improvement does not correspond strongly to corrected attribution, which raises questions about the utility of masks as "attribution priors" and of saliency maps for explainable predictions.

Abstract: Poor generalization is one symptom of models that learn to predict target variables using spuriously-correlated image features present only in the training distribution instead of the true image features that denote a class. It is often thought that this can be diagnosed visually using attribution (aka saliency) maps. We study if this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing relevant information to make the prediction. We study multiple methods that take advantage of such auxiliary labels, by training networks to ignore distracting features which may be found outside of the region of interest. This mask information is only used during training and has an impact on generalization accuracy depending on the severity of the shift between the training and test distributions. Surprisingly, while these methods improve generalization performance in the presence of a covariate shift, there is no strong correspondence between the correction of attribution towards the features a human expert has labelled as important and generalization performance. These results suggest that the root cause of poor generalization may not always be spatially defined, and raise questions about the utility of masks as 'attribution priors' as well as saliency maps for explainable predictions.

Proceedings Article•

Systematic generalisation with group invariant predictions

Faruk Ahmed¹, Yoshua Bengio¹, Harm van Seijen², Aaron Courville¹•Institutions (2)

Université de Montréal¹, Microsoft²

03 May 2021

TL;DR: The authors consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-trained neural network to be less reliant on more persistently-correlating complex features.

Abstract: We consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-trained neural network to be less reliant on more persistently-correlating complex features. When the non-persistent, simpler correlations correspond to non-semantic background factors, a neural network trained on this data can exhibit dramatic failure upon encountering systematic distributional shift, where the correlating background features are recombined with different objects. We perform an empirical study showing that group invariance methods across inferred partitionings of the training set can lead to significant improvements at such test-time situations. We suggest a new invariance penalty, showing with experiments on three synthetic datasets that it can perform better than alternatives. We find that even without assuming access to any systematic-shift validation sets, one can still find improvements over an ERM-trained reference model.

Proceedings Article•

Predicting Infectiousness for Proactive Contact Tracing

Yoshua Bengio¹, Prateek Gupta², Tegan Maharaj³, Nasim Rahaman⁴, Martin Weiss¹, Tristan Deleu¹, Eilif Muller, Meng Qu¹, Victor Schmidt¹, Pierre-Luc St-Charles³, Hannah Alsdurf⁵, Olexa Bilaniuk¹, David L. Buckeridge⁶, Gaétan Marceau Caron, Pierre Luc Carrier¹, Joumana Ghosn, Satya Ortiz-Gagné, Chris Pal³, Irina Rish¹, Bernhard Schölkopf⁴, Abhinav Sharma⁶, Jian Tang⁷, Andrew Williams¹•Institutions (7)

Université de Montréal¹, University of Oxford², École Polytechnique de Montréal³, Max Planck Society⁴, University of Ottawa⁵, McGill University⁶, HEC Montréal⁷

03 May 2021

TL;DR: The authors develop methods that can be deployed to a smartphone to locally and proactively predict an individual's infectiousness from their contact history and other information, an approach they call proactive contact tracing (PCT); the results suggest PCT could help in safe re-opening and second-wave prevention.

Abstract: The COVID-19 pandemic has spread rapidly worldwide, overwhelming manual contact tracing in many countries and resulting in widespread lockdowns for emergency containment. Large-scale digital contact tracing (DCT) has emerged as a potential solution to resume economic and social activity while minimizing spread of the virus. Various DCT methods have been proposed, each making trade-offs between privacy, mobility restrictions, and public health. The most common approach, binary contact tracing (BCT), models infection as a binary event, informed only by an individual’s test results, with corresponding binary recommendations that either all or none of the individual’s contacts quarantine. BCT ignores the inherent uncertainty in contacts and the infection process, which could be used to tailor messaging to high-risk individuals, and prompt proactive testing or earlier warnings. It also does not make use of observations such as symptoms or pre-existing medical conditions, which could be used to make more accurate infectiousness predictions. In this paper, we use a recently-proposed COVID-19 epidemiological simulator to develop and test methods that can be deployed to a smartphone to locally and proactively predict an individual’s infectiousness (risk of infecting others) based on their contact history and other information, while respecting strong privacy constraints. Predictions are used to provide personalized recommendations to the individual via an app, as well as to send anonymized messages to the individual’s contacts, who use this information to better predict their own infectiousness, an approach we call proactive contact tracing (PCT). Similarly to other works, we find that compared to no tracing, all DCT methods tested are able to reduce spread of the disease and thus save lives, even at low adoption rates, strongly supporting a role for DCT methods in managing the pandemic. Further, we find a deep-learning based PCT method which improves over BCT for equivalent average mobility, suggesting PCT could help in safe re-opening and second-wave prevention.

Posted Content•

Neural Production Systems

Anirudh Goyal¹, Aniket Didolkar², Nan Rosemary Ke³, Charles Blundell⁴, Philippe Beaudoin⁵, Nicolas Heess⁴, Michael C. Mozer⁴, Yoshua Bengio¹•Institutions (5)

Université de Montréal¹, Manipal Institute of Technology², École Polytechnique de Montréal³, Google⁴, University of British Columbia⁵

02 Mar 2021-arXiv: Artificial Intelligence

TL;DR: This paper revisits production systems, in which a set of rule templates is applied by binding placeholder variables in the rules to specific entities and the best-fitting rules update entity properties; the resulting architecture achieves a flexible, dynamic flow of control and factorizes entity-specific and rule-based information.

Abstract: Visual environments are structured, consisting of distinct objects or entities. These entities have properties -- both visible and latent -- that determine the manner in which they interact with one another. To partition images into entities, deep-learning researchers have proposed structural inductive biases such as slot-based architectures. To model interactions among entities, equivariant graph neural nets (GNNs) are used, but these are not particularly well suited to the task for two reasons. First, GNNs do not predispose interactions to be sparse, as relationships among independent entities are likely to be. Second, GNNs do not factorize knowledge about interactions in an entity-conditional manner. As an alternative, we take inspiration from cognitive science and resurrect a classic approach, production systems, which consist of a set of rule templates that are applied by binding placeholder variables in the rules to specific entities. Rules are scored on their match to entities, and the best fitting rules are applied to update entity properties. In a series of experiments, we demonstrate that this architecture achieves a flexible, dynamic flow of control and serves to factorize entity-specific and rule-based information. This disentangling of knowledge achieves robust future-state prediction in rich visual environments, outperforming state-of-the-art methods using GNNs, and allows for the extrapolation from simple (few object) environments to more complex environments.

Proceedings Article•

RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs

Meng Qu¹, Junkun Chen², Louis-Pascal Xhonneux¹, Yoshua Bengio¹, Jian Tang³•Institutions (3)

Université de Montréal¹, Tsinghua University², HEC Montréal³

03 May 2021

TL;DR: In this article, a probabilistic model called RNNLogic is proposed to learn logic rules for reasoning on knowledge graphs, which treats logic rules as a latent variable, and simultaneously trains a rule generator as well as a reasoning predictor with logic rules.

Abstract: This paper studies learning logic rules for reasoning on knowledge graphs. Logic rules provide interpretable explanations when used for prediction as well as being able to generalize to other tasks, and hence are critical to learn. Existing methods either suffer from the problem of searching in a large search space (e.g., neural logic programming) or ineffective optimization due to sparse rewards (e.g., techniques based on reinforcement learning). To address these limitations, this paper proposes a probabilistic model called RNNLogic. RNNLogic treats logic rules as a latent variable, and simultaneously trains a rule generator as well as a reasoning predictor with logic rules. We develop an EM-based algorithm for optimization. In each iteration, the reasoning predictor is updated to explore some generated logic rules for reasoning. Then in the E-step, we select a set of high-quality rules from all generated rules with both the rule generator and reasoning predictor via posterior inference; and in the M-step, the rule generator is updated with the rules selected in the E-step. Experiments on four datasets prove the effectiveness of RNNLogic.

Posted Content•

DEUP: Direct Epistemic Uncertainty Prediction

Moksh Jain, Salem Lahlou, Hadi Nekoei, Victor Butoi, Paul Bertin, Jarrid Rector-Brooks, Maksym Korablyov, Yoshua Bengio

16 Feb 2021-arXiv: Learning

TL;DR: Direct Epistemic Uncertainty Prediction (DEUP) as discussed by the authors is a principled approach for directly estimating epistemic uncertainty by learning to predict generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability.

Abstract: Epistemic uncertainty is the part of out-of-sample prediction error due to the lack of knowledge of the learner. Whereas previous work was focusing on model variance, we propose a principled approach for directly estimating epistemic uncertainty by learning to predict generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. This estimator of epistemic uncertainty includes the effect of model bias and can be applied in non-stationary learning environments arising in active learning or reinforcement learning. In addition to demonstrating these properties of Direct Epistemic Uncertainty Prediction (DEUP), we illustrate its advantage against existing methods for uncertainty estimation on downstream tasks including sequential model optimization and reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic classification of images and for estimating uncertainty about synergistic drug combinations.
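
The subtraction at the heart of this estimator can be sketched in a few lines. This is an illustration under stated assumptions, not the DEUP implementation: the 1-nearest-neighbour error predictor, the clipping at zero, the known noise variance, and the function names are all hypothetical choices made to keep the example small.

```python
def fit_error_predictor(xs, observed_errors):
    """Return a 1-nearest-neighbour predictor of out-of-sample error.

    This is an assumed stand-in for the learned error predictor; any
    regressor trained on observed errors could take its place.
    """
    points = list(zip(xs, observed_errors))

    def predict(x_new):
        return min(points, key=lambda p: abs(p[0] - x_new))[1]

    return predict


def epistemic_uncertainty(predicted_total_error, aleatoric_estimate):
    """Predicted generalization error minus aleatoric estimate.

    Clipping at zero is an assumption made here so the result stays
    non-negative when the two estimates disagree.
    """
    return max(predicted_total_error - aleatoric_estimate, 0.0)


# Toy setting: a biased main model whose squared error grows as x**2,
# plus an assumed known noise variance of 0.1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
noise_variance = 0.1
observed_errors = [x * x + noise_variance for x in xs]

predict_error = fit_error_predictor(xs, observed_errors)
estimates = [
    epistemic_uncertainty(predict_error(x), noise_variance)
    for x in (0.1, 1.2, 5.0)
]
```

Because the error predictor is fit on observed errors rather than on model variance, the estimate also picks up the bias of the main model, which is the property the abstract emphasizes.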

Proceedings Article•

Deep Verifier Networks: Verification of Deep Discriminative Models with Deep Generative Models.

Tong Che, Xiaofeng Liu¹, Site Li², Yubin Ge³, Ruixiang Zhang, Caiming Xiong⁴, Yoshua Bengio•Institutions (4)

Harvard University¹, Carnegie Mellon University², University of Illinois at Urbana–Champaign³, Salesforce.com⁴

18 May 2021

TL;DR: Deep verifier networks (DVN), proposed in this paper, detect unreliable inputs or predictions of deep discriminative models using conditional variational auto-encoders with disentanglement constraints that separate the label information from the latent representation.

Abstract: AI Safety is a major concern in many deep learning applications such as autonomous driving. Given a trained deep learning model, an important natural problem is how to reliably verify the model's prediction. In this paper, we propose a novel framework, deep verifier networks (DVN), to detect unreliable inputs or predictions of deep discriminative models, using separately trained deep generative models. Our proposed model is based on conditional variational auto-encoders with disentanglement constraints to separate the label information from the latent representation. We give both intuitive and theoretical justifications for the model. Our verifier network is trained independently of the prediction model, which eliminates the need to retrain the verifier network for a new model. We test the verifier network on both out-of-distribution detection and adversarial example detection problems, as well as anomaly detection problems in structured prediction tasks such as image caption generation. We achieve state-of-the-art results in all of these problems.

Journal Article•

Transformers with Competitive Ensembles of Independent Mechanisms

Alex Lamb¹, Di He², Anirudh Goyal¹, Guolin Ke², Chien-Feng Liao³, Mirco Ravanelli¹, Yoshua Bengio¹•Institutions (3)

Université de Montréal¹, Microsoft², Academia Sinica³

04 May 2021-arXiv: Learning

TL;DR: Transformers with Independent Mechanisms (TIM), as proposed by the authors, is a new Transformer layer that divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.

Abstract: An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
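
The idea of dividing the hidden representation and parameters into separate mechanisms can be sketched as follows. This is a simplified illustration, not the TIM layer itself: it covers only the per-mechanism parameter split, omits the attention-based exchange and the competition mechanism, and the function names and the plain linear map per mechanism are assumptions.

```python
def split_into_mechanisms(hidden, n_mechanisms):
    """Split one position's hidden vector into equal-sized chunks."""
    size, remainder = divmod(len(hidden), n_mechanisms)
    if remainder:
        raise ValueError("hidden size must be divisible by n_mechanisms")
    return [hidden[i * size:(i + 1) * size] for i in range(n_mechanisms)]


def mechanism_layer(hidden, weights):
    """Apply a separate square weight matrix to each mechanism's chunk.

    weights[m] belongs to mechanism m only, so no parameters are shared
    across mechanisms and no chunk reads another chunk's values.
    """
    chunks = split_into_mechanisms(hidden, len(weights))
    output = []
    for chunk, matrix in zip(chunks, weights):
        for row in matrix:
            output.append(sum(w * x for w, x in zip(row, chunk)))
    return output


hidden = [1.0, 2.0, 3.0, 4.0]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # mechanism 0: identity
    [[0.0, 1.0], [1.0, 0.0]],  # mechanism 1: swaps its two units
]
output = mechanism_layer(hidden, weights)
```

Contrast this with the monolithic case the abstract criticizes, where a single weight matrix spans the whole hidden vector and every unit can mix with every other.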

Journal Article•DOI•

Predicting Tactical Solutions to Operational Planning Problems Under Imperfect Information

Eric Larsen¹, Sébastien Lachapelle¹, Yoshua Bengio¹, Emma Frejinger¹, Simon Lacoste-Julien¹, Andrea Lodi²•Institutions (2)

Université de Montréal¹, École Polytechnique de Montréal²

21 Sep 2021-Informs Journal on Computing

TL;DR: In this paper, the authors propose a methodology at the intersection of machine learning and operations research to quickly predict expected tactical descriptions.

Abstract: This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions o...

Journal Article•DOI•

A Two-Stream Continual Learning System With Variational Domain-Agnostic Feature Replay.

Qicheng Lao¹, Xiang Jiang, Mohammad Havaei, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

03 Mar 2021-IEEE Transactions on Neural Networks

TL;DR: In this paper, a variational domain-agnostic feature replay-based approach is proposed to deal with both task drift and domain drift in two-stream continual learning (CL) systems.

Abstract: Learning in nonstationary environments is one of the biggest challenges in machine learning. Nonstationarity can be caused by either task drift, i.e., the drift in the conditional distribution of labels given the input data, or domain drift, i.e., the drift in the marginal distribution of the input data. This article aims to tackle this challenge with a modularized two-stream continual learning (CL) system, where the model is required to learn new tasks from a support stream and adapt to new domains in the query stream while maintaining previously learned knowledge. To deal with both drifts within and across the two streams, we propose a variational domain-agnostic feature replay-based approach that decouples the system into three modules: an inference module that filters the input data from the two streams into domain-agnostic representations, a generative module that facilitates the high-level knowledge transfer, and a solver module that applies the filtered and transferable knowledge to solve the queries. We demonstrate the effectiveness of our proposed approach in addressing the two fundamental scenarios and complex scenarios in two-stream CL.
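
The two kinds of drift named in this abstract can be written out explicitly. The notation below is an assumption introduced for illustration (a time index $t$ on the data distribution); the abstract itself defines the drifts only in words.

```latex
% Joint distribution at time t, factored into its two components
p_t(x, y) = p_t(y \mid x)\, p_t(x)

% Task drift: the conditional distribution of labels given inputs changes
p_t(y \mid x) \neq p_{t+1}(y \mid x)

% Domain drift: the marginal distribution of inputs changes
p_t(x) \neq p_{t+1}(x)
```

Either factor can change while the other stays fixed, which is why the two scenarios are treated as separate cases before being combined.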

Posted Content•

Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization

Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Yoshua Bengio, Ioannis Mitliagkas, Irina Rish

11 Jun 2021-arXiv: Learning

TL;DR: The authors showed that the invariance principle alone is insufficient for OOD generalization and proposed combining a form of information bottleneck constraint with invariance to address the OOD generalization problem.

Abstract: The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.

Proceedings Article•

CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning

Ossama Ahmed¹, Frederik Träuble², Anirudh Goyal³, Alexander Neitz², Manuel Wüthrich², Yoshua Bengio³, Bernhard Schölkopf², Stefan Bauer¹•Institutions (3)

École Polytechnique Fédérale de Lausanne¹, Max Planck Society², Université de Montréal³

03 May 2021

TL;DR: CausalWorld as discussed by the authors is a benchmark for causal structure and transfer learning in a robotic manipulation environment, where the user can intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are.

Abstract: Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks, inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning and precise low-level motor control at the same time. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.
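
Defining a curriculum by interpolating between an initial and a target task can be sketched as below. This does not use the CausalWorld API: the task is represented as a plain dictionary of numeric factors, linear interpolation is an assumed choice, and the names `interpolate_task`, `make_curriculum`, `block_mass`, and `block_size` are hypothetical.

```python
def interpolate_task(initial, target, alpha):
    """Linearly blend two task parametrizations; alpha=0 gives the initial task."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if initial.keys() != target.keys():
        raise ValueError("tasks must share the same parametrization")
    return {
        name: (1.0 - alpha) * initial[name] + alpha * target[name]
        for name in initial
    }


def make_curriculum(initial, target, n_stages):
    """Return n_stages tasks moving evenly from the initial to the target task."""
    if n_stages < 2:
        raise ValueError("a curriculum needs at least two stages")
    return [
        interpolate_task(initial, target, stage / (n_stages - 1))
        for stage in range(n_stages)
    ]


# Hypothetical causal variables; real factor names and ranges may differ.
initial_task = {"block_mass": 0.1, "block_size": 0.05}
target_task = {"block_mass": 0.5, "block_size": 0.10}
curriculum = make_curriculum(initial_task, target_task, 5)
```

The sketch only works because both tasks share one parametrization, which is the property of the benchmark that the abstract highlights.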

Proceedings Article•DOI•

CMIM: Cross-Modal Information Maximization For Medical Imaging

Tristan Sylvain¹, Francis Dutil, Tess Berthier, Lisa Di Jorio, Margaux Luck¹, Devon Hjelm², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Microsoft²

06 Jun 2021

TL;DR: In this paper, the authors propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test time, using recent advances in mutual information maximization.

Abstract: In hospitals, data are siloed to specific information systems that make the same information available under different modalities such as the different medical imaging exams the patient undergoes (CT scans, MRI, PET, Ultrasound, etc.) and their associated radiology reports. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. In this paper, we propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time, using recent advances in mutual information maximization. By maximizing cross-modal information at train time, we are able to outperform several state-of-the-art baselines in two different settings, medical image classification, and segmentation. In particular, our method is shown to have a strong impact on the inference-time performance of weaker modalities.

Posted Content•DOI•

Training neural networks to recognize speech increased their correspondence to the human auditory pathway but did not yield a shared hierarchy of acoustic features

Thompson Jaf., Yoshua Bengio¹, Elia Formisano, M Schönwiesner²•Institutions (2)

Université de Montréal¹, Leipzig University²

28 Jan 2021-bioRxiv

TL;DR: In this article, the authors compared the representations of CNNs trained to recognize speech (triphone recognition) to 7-Tesla fMRI activity collected throughout the human auditory pathway, including subcortical and cortical regions, while participants listened to speech.

Abstract: The correspondence between the activity of artificial neurons in convolutional neural networks (CNNs) trained to recognize objects in images and neural activity collected throughout the primate visual system has been well documented. Shallower layers of CNNs are typically more similar to early visual areas and deeper layers tend to be more similar to later visual areas, providing evidence for a shared representational hierarchy. This phenomenon has not been thoroughly studied in the auditory domain. Here, we compared the representations of CNNs trained to recognize speech (triphone recognition) to 7-Tesla fMRI activity collected throughout the human auditory pathway, including subcortical and cortical regions, while participants listened to speech. We found no evidence for a shared representational hierarchy of acoustic speech features. Instead, all auditory regions of interest were most similar to a single layer of the CNNs: the first fully-connected layer. This layer sits at the boundary between the relatively task-general intermediate layers and the highly task-specific final layers. This suggests that alternative architectural designs and/or training objectives may be needed to achieve fine-grained layer-wise correspondence with the human auditory pathway. Highlights: Trained CNNs were more similar to auditory fMRI activity than untrained CNNs. There was no evidence of a shared representational hierarchy for acoustic features. All ROIs were most similar to the first fully-connected layer. CNN performance on the speech recognition task was positively associated with fMRI similarity.

Posted Content•

Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning

Nan Rosemary Ke, Aniket Didolkar, Sarthak Mittal, Anirudh Goyal, Guillaume Lajoie, Stefan Bauer, Danilo Jimenez Rezende, Yoshua Bengio, Michael C. Mozer, Chris Pal

02 Jul 2021-arXiv: Machine Learning

TL;DR: In this paper, a suite of benchmarking RL environments is designed to evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning.

Abstract: Inducing causal relationships from observations is a classic problem in machine learning. Most work in causality starts from the premise that the causal variables themselves are observed. However, for AI agents such as robots trying to make sense of their environment, the only observables are low-level variables like pixels in images. To generalize well, an agent must induce high-level variables, particularly those which are causal or are affected by causal variables. A central goal for AI and causality is thus the joint discovery of abstract representations and causal structure. However, we note that existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs which are impossible to manipulate parametrically (e.g., number of nodes, sparsity, causal chain length, etc.). In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them. In order to systematically probe the ability of methods to identify these variables and structures, we design a suite of benchmarking RL environments. We evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning.
