Showing papers by "Yoshua Bengio" published in 2021


Journal ArticleDOI
26 Feb 2021
TL;DR: The authors reviewed fundamental concepts of causal inference and related them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research.
Abstract: The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

601 citations


Journal ArticleDOI
TL;DR: A survey of machine learning and combinatorial optimization problems can be found in this paper, where the main point is to see generic optimization problems as data points and inquire what is the relevant distribution of problems to use for learning on a given task.

464 citations


Journal ArticleDOI
TL;DR: This article asks how neural networks can learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language.
Abstract: How can neural networks learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language?

294 citations


Journal ArticleDOI
TL;DR: In this brief communication, a few of these inherent privacy limitations of any decentralized automatic contact tracing system are discussed.

39 citations


Journal ArticleDOI
07 Jul 2021-Neuron
TL;DR: In this paper, dual-processing theories of cognition have been used to explain human cognitive abilities, such as semantic understanding of world structure, logical reasoning, and communication via language, and they have been integrated with the global workspace theory to explain dynamic relay of information products between two systems.

33 citations


Posted Content
TL;DR: The authors reviewed fundamental concepts of causal inference and related them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research.
Abstract: The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

28 citations


Posted Content
TL;DR: SpeechBrain is an open-source and all-in-one speech toolkit designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented.
Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

27 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that a bias in the gradient estimate of equilibrium propagation, inherent in the use of finite nudging, is responsible for this phenomenon and that cancelling it allows training deep convolutional neural networks.
Abstract: Equilibrium Propagation is a biologically-inspired algorithm that trains convergent recurrent neural networks with a local learning rule. This approach constitutes a major lead to allow learning-capable neuromorphic systems and comes with strong theoretical guarantees. Equilibrium Propagation operates in two phases, during which the network is first allowed to evolve freely and then "nudged" towards a target; the weights of the network are then updated based solely on the states of the neurons that they connect. The weight updates of Equilibrium Propagation have been shown mathematically to approach those provided by Backpropagation Through Time (BPTT), the mainstream approach to train recurrent neural networks, when nudging is performed with infinitely small strength. In practice, however, the standard implementation of Equilibrium Propagation does not scale to visual tasks harder than MNIST. In this work, we show that a bias in the gradient estimate of Equilibrium Propagation, inherent in the use of finite nudging, is responsible for this phenomenon and that cancelling it allows training deep convolutional neural networks. We show that this bias can be greatly reduced by using symmetric nudging (a positive nudging and a negative one). We also generalize Equilibrium Propagation to the case of cross-entropy loss (as opposed to squared error). As a result of these advances, we are able to achieve a test error of 11.7% on CIFAR-10, which approaches the one achieved by BPTT and provides a major improvement with respect to the standard Equilibrium Propagation, which gives 86% test error. We also apply these techniques to train an architecture with unidirectional forward and backward connections, yielding a 13.2% test error. These results highlight Equilibrium Propagation as a compelling biologically-plausible approach to compute error gradients in deep neuromorphic systems.

22 citations
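
The effect of symmetric nudging described in the Equilibrium Propagation abstract above can be illustrated on a toy problem. The sketch below is not the paper's model: it assumes a hypothetical one-neuron system with energy E(s) = 0.5*s^2 - theta*x*s and cost C(s) = 0.5*(s - y)^2, chosen only because its equilibrium has a closed form. All function names are illustrative. It compares the one-sided estimate (free phase versus positive nudge) with the symmetric estimate (positive versus negative nudge) against the exact gradient.

```python
# Toy illustration only: a one-neuron system with a quadratic energy,
# chosen so that the equilibrium state has a closed form.
# This is an assumption made for the sketch, not the paper's architecture.

def equilibrium(theta, x, y, beta):
    """State s minimizing F(s) = E(s) + beta * C(s), where
    E(s) = 0.5*s**2 - theta*x*s and C(s) = 0.5*(s - y)**2.
    Requires beta > -1 so that F stays convex."""
    return (theta * x + beta * y) / (1.0 + beta)

def dE_dtheta(s, x):
    """Partial derivative of the energy E with respect to theta at state s."""
    return -x * s

def one_sided_estimate(theta, x, y, beta):
    """Standard estimate: nudged phase (+beta) minus free phase (beta = 0)."""
    s_free = equilibrium(theta, x, y, 0.0)
    s_nudged = equilibrium(theta, x, y, beta)
    return (dE_dtheta(s_nudged, x) - dE_dtheta(s_free, x)) / beta

def symmetric_estimate(theta, x, y, beta):
    """Symmetric estimate: positive nudge (+beta) minus negative nudge (-beta)."""
    s_pos = equilibrium(theta, x, y, beta)
    s_neg = equilibrium(theta, x, y, -beta)
    return (dE_dtheta(s_pos, x) - dE_dtheta(s_neg, x)) / (2.0 * beta)

def true_gradient(theta, x, y):
    """Exact gradient of the loss 0.5*(theta*x - y)**2, evaluated at the
    free equilibrium s = theta*x."""
    return x * (theta * x - y)

theta, x, y, beta = 0.5, 2.0, 3.0, 0.1
g_true = true_gradient(theta, x, y)
g_one = one_sided_estimate(theta, x, y, beta)
g_sym = symmetric_estimate(theta, x, y, beta)
print("true:", g_true, "one-sided:", g_one, "symmetric:", g_sym)
```

In this toy case the one-sided estimate equals the true gradient scaled by 1/(1 + beta), while the symmetric estimate is scaled by 1/(1 - beta^2), so the error shrinks from first order to second order in the nudging strength.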


Posted ContentDOI
16 Jan 2021-bioRxiv
TL;DR: In this article, the authors show that unexpected event signals predict subsequent changes in responses to expected and unexpected stimuli in individual neurons and distal apical dendrites that are tracked over a period of days.
Abstract: Scientists have long conjectured that the neocortex learns the structure of the environment in a predictive, hierarchical manner. To do so, expected, predictable features are differentiated from unexpected ones by comparing bottom-up and top-down streams of data. It is theorized that the neocortex then changes the representation of incoming stimuli, guided by differences in the responses to expected and unexpected events. Such differences in cortical responses have been observed; however, it remains unknown whether these unexpected event signals govern subsequent changes in the brain’s stimulus representations, and, thus, govern learning. Here, we show that unexpected event signals predict subsequent changes in responses to expected and unexpected stimuli in individual neurons and distal apical dendrites that are tracked over a period of days. These findings were obtained by observing layer 2/3 and layer 5 pyramidal neurons in primary visual cortex of awake, behaving mice using two-photon calcium imaging. We found that many neurons in both layers 2/3 and 5 showed large differences between their responses to expected and unexpected events. These unexpected event signals also determined how the responses evolved over subsequent days, in a manner that was different between the somata and distal apical dendrites. This difference between the somata and distal apical dendrites may be important for hierarchical computation, given that these two compartments tend to receive bottom-up and top-down information, respectively. Together, our results provide novel evidence that the neocortex indeed instantiates a predictive hierarchical model in which unexpected events drive learning.

21 citations


Proceedings Article
03 May 2021
TL;DR: The authors propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and compete with each other so they are updated only at time steps where they are most relevant.
Abstract: We explore the hypothesis that learning modular structures which reflect the dynamics of the environment can lead to better generalization and robustness to changes that only affect a few of the underlying causes. We propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and compete with each other so they are updated only at time steps where they are most relevant. We show that this leads to specialization amongst the RIMs, which in turn allows for remarkably improved generalization on tasks where some factors of variation differ systematically between training and evaluation.

20 citations
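
The competition described in the RIMs abstract above, where modules are updated only at time steps where they are most relevant, can be sketched as a top-k selection over attention scores. This is a simplified illustration under stated assumptions, not the authors' implementation: the function `rim_step`, the weight shapes, and the plain tanh update are hypothetical, and the sparse communication between modules is omitted.

```python
import numpy as np

def rim_step(states, x, query_w, key_w, update_w, k):
    """One hypothetical step of top-k module competition.

    states:   (num_modules, d_state) current state of each module
    x:        (d_in,) input at this time step
    query_w:  (num_modules, d_state, d_key) per-module query projections
    key_w:    (d_key, d_in) shared key projection of the input
    update_w: (num_modules, d_state, d_state + d_in) per-module update weights
    k:        number of modules allowed to update
    """
    # Each module forms its own query from its state; the input provides the key.
    queries = np.einsum("md,mdk->mk", states, query_w)
    key = key_w @ x
    scores = queries @ key / np.sqrt(key.shape[0])

    # The k modules with the highest scores win the competition.
    active = np.argsort(scores)[-k:]

    # Only the winners are updated; every other module keeps its old state.
    new_states = states.copy()
    for m in active:
        new_states[m] = np.tanh(update_w[m] @ np.concatenate([states[m], x]))
    return new_states, active

rng = np.random.default_rng(0)
num_modules, d_state, d_in, d_key, k = 4, 3, 2, 5, 2
states = rng.normal(size=(num_modules, d_state))
x = rng.normal(size=d_in)
query_w = rng.normal(size=(num_modules, d_state, d_key))
key_w = rng.normal(size=(d_key, d_in))
update_w = rng.normal(size=(num_modules, d_state, d_state + d_in))

new_states, active = rim_step(states, x, query_w, key_w, update_w, k)
print("updated modules:", sorted(active.tolist()))
```

Modules that lose the competition carry their state forward unchanged, which is one simple way to give each module nearly independent dynamics.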


Journal ArticleDOI
TL;DR: In a companion article, Verma et al. discuss how machine-learned solutions can be developed and implemented to support medical decision-making, and discuss the benefits of machine learning for medical decision making.
Abstract: See related articles at www.cmaj.ca/lookup/doi/10.1503/cmaj.202434 and www.cmaj.ca/lookup/doi/10.1503/cmaj.210036. KEY POINTS: In a companion article, Verma and colleagues discuss how machine-learned solutions can be developed and implemented to support medical decision-making. Both

Journal ArticleDOI
TL;DR: The AI climate impact visualizer uses cutting-edge artificial intelligence (AI) approaches to provide an interactive personalized visualization tool, which allows a user to enter an address and shows them an AI-imagined possible visualization of the future of this location in 2050 following the detrimental effects of climate change.
Abstract: Public awareness and concern about climate change often do not match the magnitude of its threat to humans and our environment. One reason for this disagreement is that it is difficult to mentally simulate the effects of a process as complex as climate change and to have a concrete representation of the impact that our individual actions will have on our own future, especially if the consequences are long term and abstract. To overcome these challenges, we propose to use cutting-edge artificial intelligence (AI) approaches to develop an interactive personalized visualization tool, the AI climate impact visualizer. It will allow a user to enter an address (be it their house, their school, or their workplace) and it will provide them with an AI-imagined possible visualization of the future of this location in 2050 following the detrimental effects of climate change such as floods, storms, and wildfires. This image will be accompanied by accessible information regarding the science behind climate change, i.e., why extreme weather events are becoming more frequent and what kinds of changes are happening on a local and global scale.

Posted Content
TL;DR: In this paper, the authors explore the use of such a communication channel in the context of deep learning for modeling the structure of complex environments and show that capacity limitations have a rational basis in that they encourage specialization and compositionality and facilitate the synchronization of otherwise independent specialists.
Abstract: Deep learning has seen a movement away from representing examples with a monolithic hidden state towards a richly structured state. For example, Transformers segment by position, and object-centric architectures decompose images into entities. In all these architectures, interactions between different elements are modeled via pairwise interactions: Transformers make use of self-attention to incorporate information from other positions; object-centric architectures make use of graph neural networks to model interactions among entities. However, pairwise interactions may not achieve global coordination or a coherent, integrated representation that can be used for downstream tasks. In cognitive science, a global workspace architecture has been proposed in which functionally specialized components share information through a common, bandwidth-limited communication channel. We explore the use of such a communication channel in the context of deep learning for modeling the structure of complex environments. The proposed method includes a shared workspace through which communication among different specialist modules takes place but due to limits on the communication bandwidth, specialist modules must compete for access. We show that capacity limitations have a rational basis in that (1) they encourage specialization and compositionality and (2) they facilitate the synchronization of otherwise independent specialists.

Proceedings Article
03 May 2021
TL;DR: In this paper, a probabilistic framework was proposed to generate valid and diverse conformations given a molecular graph, combining the advantages of both flow-based and energy-based models, enjoying a high model capacity to estimate the multimodal conformation distribution.
Abstract: We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling.

Proceedings Article
18 May 2021
TL;DR: In this article, a broad meta-learning framework is proposed to discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets.
Abstract: Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence to this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.

Proceedings Article
03 May 2021
TL;DR: The authors show that mask information used only during training affects generalization accuracy depending on the severity of the shift between the training and test distributions, which raises questions about the utility of masks as "attribution priors" and of saliency maps for explainable predictions.
Abstract: Poor generalization is one symptom of models that learn to predict target variables using spuriously-correlated image features present only in the training distribution instead of the true image features that denote a class. It is often thought that this can be diagnosed visually using attribution (aka saliency) maps. We study whether this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing relevant information to make the prediction. We study multiple methods that take advantage of such auxiliary labels, by training networks to ignore distracting features which may be found outside of the region of interest. This mask information is only used during training and has an impact on generalization accuracy depending on the severity of the shift between the training and test distributions. Surprisingly, while these methods improve generalization performance in the presence of a covariate shift, there is no strong correspondence between the correction of attribution towards the features a human expert has labelled as important and generalization performance. These results suggest that the root cause of poor generalization may not always be spatially defined, and raise questions about the utility of masks as 'attribution priors' as well as saliency maps for explainable predictions.

Proceedings Article
03 May 2021
TL;DR: The authors consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-trained neural network to be less reliant on more persistently-correlating complex features.
Abstract: We consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-trained neural network to be less reliant on more persistently-correlating complex features. When the non-persistent, simpler correlations correspond to non-semantic background factors, a neural network trained on this data can exhibit dramatic failure upon encountering systematic distributional shift, where the correlating background features are recombined with different objects. We perform an empirical study showing that group invariance methods across inferred partitionings of the training set can lead to significant improvements at such test-time situations. We suggest a new invariance penalty, showing with experiments on three synthetic datasets that it can perform better than alternatives. We find that even without assuming access to any systematic-shift validation sets, one can still find improvements over an ERM-trained reference model.

Proceedings Article
03 May 2021
TL;DR: Methods that can be deployed to a smartphone to locally and proactively predict an individual's infectiousness based on their contact history and other information are developed, suggesting PCT could help in safe re-opening and second-wave prevention.
Abstract: The COVID-19 pandemic has spread rapidly worldwide, overwhelming manual contact tracing in many countries and resulting in widespread lockdowns for emergency containment. Large-scale digital contact tracing (DCT) has emerged as a potential solution to resume economic and social activity while minimizing spread of the virus. Various DCT methods have been proposed, each making trade-offs between privacy, mobility restrictions, and public health. The most common approach, binary contact tracing (BCT), models infection as a binary event, informed only by an individual’s test results, with corresponding binary recommendations that either all or none of the individual’s contacts quarantine. BCT ignores the inherent uncertainty in contacts and the infection process, which could be used to tailor messaging to high-risk individuals, and prompt proactive testing or earlier warnings. It also does not make use of observations such as symptoms or pre-existing medical conditions, which could be used to make more accurate infectiousness predictions. In this paper, we use a recently-proposed COVID-19 epidemiological simulator to develop and test methods that can be deployed to a smartphone to locally and proactively predict an individual’s infectiousness (risk of infecting others) based on their contact history and other information, while respecting strong privacy constraints. Predictions are used to provide personalized recommendations to the individual via an app, as well as to send anonymized messages to the individual’s contacts, who use this information to better predict their own infectiousness, an approach we call proactive contact tracing (PCT). Similarly to other works, we find that compared to no tracing, all DCT methods tested are able to reduce spread of the disease and thus save lives, even at low adoption rates, strongly supporting a role for DCT methods in managing the pandemic. Further, we find a deep-learning based PCT method which improves over BCT for equivalent average mobility, suggesting PCT could help in safe re-opening and second-wave prevention.

Posted Content
TL;DR: In this paper, a set of rule templates are applied by binding placeholder variables in the rules to specific entities, and the best fitting rules are applied to update entity properties, which achieves a flexible, dynamic flow of control and serves to factorize entity specific and rule-based information.
Abstract: Visual environments are structured, consisting of distinct objects or entities. These entities have properties -- both visible and latent -- that determine the manner in which they interact with one another. To partition images into entities, deep-learning researchers have proposed structural inductive biases such as slot-based architectures. To model interactions among entities, equivariant graph neural nets (GNNs) are used, but these are not particularly well suited to the task for two reasons. First, GNNs do not predispose interactions to be sparse, as relationships among independent entities are likely to be. Second, GNNs do not factorize knowledge about interactions in an entity-conditional manner. As an alternative, we take inspiration from cognitive science and resurrect a classic approach, production systems, which consist of a set of rule templates that are applied by binding placeholder variables in the rules to specific entities. Rules are scored on their match to entities, and the best fitting rules are applied to update entity properties. In a series of experiments, we demonstrate that this architecture achieves a flexible, dynamic flow of control and serves to factorize entity-specific and rule-based information. This disentangling of knowledge achieves robust future-state prediction in rich visual environments, outperforming state-of-the-art methods using GNNs, and allows for the extrapolation from simple (few object) environments to more complex environments.

Proceedings Article
03 May 2021
TL;DR: In this article, a probabilistic model called RNNLogic is proposed to learn logic rules for reasoning on knowledge graphs, which treats logic rules as a latent variable, and simultaneously trains a rule generator as well as a reasoning predictor with logic rules.
Abstract: This paper studies learning logic rules for reasoning on knowledge graphs. Logic rules provide interpretable explanations when used for prediction as well as being able to generalize to other tasks, and hence are critical to learn. Existing methods either suffer from the problem of searching in a large search space (e.g., neural logic programming) or ineffective optimization due to sparse rewards (e.g., techniques based on reinforcement learning). To address these limitations, this paper proposes a probabilistic model called RNNLogic. RNNLogic treats logic rules as a latent variable, and simultaneously trains a rule generator as well as a reasoning predictor with logic rules. We develop an EM-based algorithm for optimization. In each iteration, the reasoning predictor is updated to explore some generated logic rules for reasoning. Then in the E-step, we select a set of high-quality rules from all generated rules with both the rule generator and reasoning predictor via posterior inference; and in the M-step, the rule generator is updated with the rules selected in the E-step. Experiments on four datasets prove the effectiveness of RNNLogic.

Posted Content
TL;DR: Direct Epistemic Uncertainty Prediction (DEUP) is a principled approach for directly estimating epistemic uncertainty by learning to predict generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability.
Abstract: Epistemic uncertainty is the part of out-of-sample prediction error due to the lack of knowledge of the learner. Whereas previous work was focusing on model variance, we propose a principled approach for directly estimating epistemic uncertainty by learning to predict generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. This estimator of epistemic uncertainty includes the effect of model bias and can be applied in non-stationary learning environments arising in active learning or reinforcement learning. In addition to demonstrating these properties of Direct Epistemic Uncertainty Prediction (DEUP), we illustrate its advantage against existing methods for uncertainty estimation on downstream tasks including sequential model optimization and reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic classification of images and for estimating uncertainty about synergistic drug combinations.
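
The subtraction at the core of the DEUP abstract above can be written out in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the function name `epistemic_uncertainty` and the example numbers are hypothetical, both inputs are assumed to come from separately trained estimators, and clipping the result at zero is a choice made here for the illustration rather than something stated in the abstract.

```python
import numpy as np

def epistemic_uncertainty(predicted_total_error, aleatoric_estimate):
    """Epistemic uncertainty as predicted generalization error minus an
    estimate of aleatoric uncertainty (intrinsic unpredictability).

    Both inputs are per-example estimates. Clipping at zero is an
    assumption of this sketch, used to avoid negative uncertainties
    when the two estimators disagree.
    """
    total = np.asarray(predicted_total_error, dtype=float)
    aleatoric = np.asarray(aleatoric_estimate, dtype=float)
    return np.clip(total - aleatoric, 0.0, None)

# Hypothetical per-example outputs of an error predictor and a noise estimator.
predicted_total_error = [0.9, 0.3, 0.1]
aleatoric_estimate = [0.2, 0.25, 0.15]

epistemic = epistemic_uncertainty(predicted_total_error, aleatoric_estimate)
print(epistemic)
```

Because the error predictor targets the total out-of-sample error, the resulting estimate includes the effect of model bias, unlike estimates based only on model variance.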

Proceedings Article
18 May 2021
TL;DR: Deep verifier networks (DVN) use conditional variational auto-encoders with disentanglement constraints to separate the label information from the latent representation.
Abstract: AI Safety is a major concern in many deep learning applications such as autonomous driving. Given a trained deep learning model, an important natural problem is how to reliably verify the model's prediction. In this paper, we propose a novel framework, deep verifier networks (DVN), to detect unreliable inputs or predictions of deep discriminative models, using separately trained deep generative models. Our proposed model is based on conditional variational auto-encoders with disentanglement constraints to separate the label information from the latent representation. We give both intuitive and theoretical justifications for the model. Our verifier network is trained independently of the prediction model, which eliminates the need to retrain the verifier network for a new model. We test the verifier network on both out-of-distribution detection and adversarial example detection problems, as well as anomaly detection problems in structured prediction tasks such as image caption generation. We achieve state-of-the-art results in all of these problems.

Journal Article
TL;DR: Transformers with Independent Mechanisms (TIM) is a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
Abstract: An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a methodology, at the intersection of machine learning and operations research, to quickly predict expected tactical descriptions.
Abstract: This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions o...

Journal ArticleDOI
TL;DR: In this paper, a variational domain-agnostic feature replay-based approach is proposed to deal with both task drift and domain drift in two-stream continual learning (CL) systems.
Abstract: Learning in nonstationary environments is one of the biggest challenges in machine learning. Nonstationarity can be caused by either task drift, i.e., the drift in the conditional distribution of labels given the input data, or domain drift, i.e., the drift in the marginal distribution of the input data. This article aims to tackle this challenge with a modularized two-stream continual learning (CL) system, where the model is required to learn new tasks from a support stream and adapt to new domains in the query stream while maintaining previously learned knowledge. To deal with both drifts within and across the two streams, we propose a variational domain-agnostic feature replay-based approach that decouples the system into three modules: an inference module that filters the input data from the two streams into domain-agnostic representations, a generative module that facilitates the high-level knowledge transfer, and a solver module that applies the filtered and transferable knowledge to solve the queries. We demonstrate the effectiveness of our proposed approach in addressing the two fundamental scenarios and complex scenarios in two-stream CL.

Posted Content
TL;DR: The authors showed that the invariance principle alone is insufficient to generalize OOD and proposed a form of information bottleneck constraint along with invariance to solve the OOD generalization problem.
Abstract: The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.
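
As a rough illustration of combining an invariance penalty with a bottleneck term, the sketch below adds two penalties to the average risk over training environments. Several things here are assumptions and not the paper's stated method: the function name `invariance_plus_bottleneck`, the use of squared loss, the IRMv1-style penalty (squared gradient of the risk with respect to a dummy scale `w` at `w = 1`), and the use of feature variance as a simple stand-in for an information bottleneck constraint.

```python
import numpy as np

def invariance_plus_bottleneck(preds_per_env, labels_per_env,
                               lam_inv=1.0, lam_ib=0.1):
    """Average over environments of: risk + invariance penalty + bottleneck proxy.

    preds_per_env / labels_per_env: lists of 1-D arrays, one pair per environment.
    """
    total = 0.0
    for f, y in zip(preds_per_env, labels_per_env):
        resid = f - y
        risk = np.mean(resid ** 2)
        # Gradient of mean((w * f - y)^2) with respect to w, evaluated at w = 1.
        grad_w = np.mean(2.0 * resid * f)
        inv_penalty = grad_w ** 2
        # Assumption: variance of the predictor output as a crude bottleneck term.
        ib_penalty = np.var(f)
        total += risk + lam_inv * inv_penalty + lam_ib * ib_penalty
    return total / len(preds_per_env)

# Two toy environments with a predictor that matches the labels exactly:
# the risk and the invariance penalty vanish, leaving only the bottleneck term.
y1 = np.array([0.0, 1.0, 0.0, 1.0])
y2 = np.array([1.0, 1.0, 0.0, 0.0])
print(invariance_plus_bottleneck([y1, y2], [y1, y2]))
```

The weights `lam_inv` and `lam_ib` are hypothetical hyperparameters that trade off fitting the data, invariance across environments, and compression of the representation.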

Proceedings Article
03 May 2021
TL;DR: CausalWorld as discussed by the authors is a benchmark for causal structure and transfer learning in a robotic manipulation environment, where the user can intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are.
Abstract: Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning and precise low-level motor control at the same time. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.
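
The idea of defining a curriculum by interpolating between an initial and a target task can be sketched in plain Python. This sketch does not use the CausalWorld API. The helper names (`interpolate_task`, `curriculum`) and the variable names in the example dictionaries (`block_mass`, `block_size`) are hypothetical, and linear interpolation is only one possible choice.

```python
def interpolate_task(initial, target, alpha):
    """Linearly blend two task parametrizations that share the same variables.

    alpha = 0 gives the initial task, alpha = 1 gives the target task.
    """
    if initial.keys() != target.keys():
        raise ValueError("tasks must be defined over the same variables")
    return {k: (1.0 - alpha) * initial[k] + alpha * target[k] for k in initial}

def curriculum(initial, target, n_stages):
    """Return n_stages task parametrizations going from initial to target."""
    if n_stages < 2:
        raise ValueError("need at least two stages")
    return [
        interpolate_task(initial, target, i / (n_stages - 1))
        for i in range(n_stages)
    ]

# Hypothetical causal variables; each stage intervenes on them jointly.
easy = {"block_mass": 0.1, "block_size": 0.10}
hard = {"block_mass": 0.5, "block_size": 0.05}
for stage in curriculum(easy, hard, n_stages=3):
    print(stage)
```

Because all tasks share one parametrization, each intermediate stage is itself a valid task, which is what makes this kind of interpolation possible.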

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, the authors propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test time, using recent advances in mutual information maximization.
Abstract: In hospitals, data are siloed to specific information systems that make the same information available under different modalities such as the different medical imaging exams the patient undergoes (CT scans, MRI, PET, Ultrasound, etc.) and their associated radiology reports. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. In this paper, we propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time, using recent advances in mutual information maximization. By maximizing cross-modal information at train-time, we are able to outperform several state-of-the-art baselines in two different settings: medical image classification and segmentation. In particular, our method is shown to have a strong impact on the inference-time performance of weaker modalities.
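
One common way to maximize information shared between two modalities is a contrastive (InfoNCE-style) loss, which lower-bounds mutual information. The abstract does not say which estimator the authors use, so the sketch below is an assumption chosen for illustration; the function name `info_nce` and the `temperature` value are hypothetical.

```python
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """Contrastive loss between paired embeddings of two modalities.

    za, zb: arrays of shape (N, d); row i of za and row i of zb come from
    the same patient or example, and all other rows act as negatives.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                  # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Minimizing this loss pushes matched pairs above mismatched ones.
    return -np.mean(np.diag(log_prob))

# Toy check: aligned pairs give a lower loss than mismatched pairs.
z = np.eye(4)
print(info_nce(z, z), info_nce(z, np.roll(z, 1, axis=0)))
```

A representation trained this way encodes what the modalities have in common, which is one plausible route to robustness when a modality is missing at test time.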

Posted ContentDOI
28 Jan 2021-bioRxiv
TL;DR: In this article, the authors compared the representations of CNNs trained to recognize speech (triphone recognition) to 7-Tesla fMRI activity collected throughout the human auditory pathway, including subcortical and cortical regions, while participants listened to speech.
Abstract: The correspondence between the activity of artificial neurons in convolutional neural networks (CNNs) trained to recognize objects in images and neural activity collected throughout the primate visual system has been well documented. Shallower layers of CNNs are typically more similar to early visual areas and deeper layers tend to be more similar to later visual areas, providing evidence for a shared representational hierarchy. This phenomenon has not been thoroughly studied in the auditory domain. Here, we compared the representations of CNNs trained to recognize speech (triphone recognition) to 7-Tesla fMRI activity collected throughout the human auditory pathway, including subcortical and cortical regions, while participants listened to speech. We found no evidence for a shared representational hierarchy of acoustic speech features. Instead, all auditory regions of interest were most similar to a single layer of the CNNs: the first fully-connected layer. This layer sits at the boundary between the relatively task-general intermediate layers and the highly task-specific final layers. This suggests that alternative architectural designs and/or training objectives may be needed to achieve fine-grained layer-wise correspondence with the human auditory pathway.
Highlights: Trained CNNs more similar to auditory fMRI activity than untrained; No evidence of a shared representational hierarchy for acoustic features; All ROIs were most similar to the first fully-connected layer; CNN performance on speech recognition task positively associated with fMRI similarity.

Posted Content
TL;DR: In this paper, the authors design a suite of benchmarking RL environments, evaluate various representation learning algorithms from the literature, and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning.
Abstract: Inducing causal relationships from observations is a classic problem in machine learning. Most work in causality starts from the premise that the causal variables themselves are observed. However, for AI agents such as robots trying to make sense of their environment, the only observables are low-level variables like pixels in images. To generalize well, an agent must induce high-level variables, particularly those which are causal or are affected by causal variables. A central goal for AI and causality is thus the joint discovery of abstract representations and causal structure. However, we note that existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs which are impossible to manipulate parametrically (e.g., number of nodes, sparsity, causal chain length, etc.). In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them. In order to systematically probe the ability of methods to identify these variables and structures, we design a suite of benchmarking RL environments. We evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning.