Top 111 papers published by Yoshua Bengio from Université de Montréal in 2019

Journal Article•DOI•

A deep learning framework for neuroscience

Blake A. Richards, Timothy P. Lillicrap¹, Philippe Beaudoin, Yoshua Bengio²,³, Rafal Bogacz⁴, Amelia J. Christensen⁵, Claudia Clopath⁶, Rui Ponte Costa⁷,⁸, Archy O. de Berker, Surya Ganguli⁹,⁵, Colleen J Gillon¹⁰, Danijar Hafner¹⁰,⁹, Adam Kepecs¹¹, Nikolaus Kriegeskorte¹², Peter E. Latham¹, Grace W. Lindsay¹², Kenneth D. Miller¹², Richard Naud¹³, Christopher C. Pack¹⁴, Panayiota Poirazi¹⁵, Pieter R. Roelfsema¹⁶, João Sacramento¹⁷, Andrew M. Saxe⁴, Benjamin Scellier², Anna C. Schapiro¹⁸, Walter Senn⁸, Greg Wayne, Daniel L. K. Yamins⁵, Friedemann Zenke¹⁹,⁴, Joel Zylberberg²⁰,³, Denis Therien, Konrad P. Kording³,¹⁸•Institutions (20)

28 Oct 2019-Nature Neuroscience

TL;DR: It is argued that a deep network is best understood in terms of components used to design it—objective functions, architecture and learning rules—rather than unit-by-unit computation.

Abstract: Systems neuroscience seeks explanations for how the brain implements a wide variety of perceptual, cognitive and motor tasks. Conversely, artificial intelligence attempts to design computational systems based on the tasks they will have to solve. In artificial neural networks, the three components specified by design are the objective functions, the learning rules and the architectures. With the growing success of deep learning, which utilizes brain-inspired architectures, these three designed components have increasingly become central to how we model, engineer and optimize complex artificial learning systems. Here we argue that a greater focus on these components would also benefit systems neuroscience. We give examples of how this optimization-based framework can drive theoretical and experimental progress in neuroscience. We contend that this principled perspective on systems neuroscience will help to generate more rapid progress.

633 citations

Proceedings Article•

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Kundan Kumar¹, Rithesh Kumar², Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo², Alexandre de Brebisson³, Yoshua Bengio², Aaron Courville²•Institutions (3)

Indian Institute of Technology Kanpur¹, Université de Montréal², Imperial College London³

06 Sep 2019

TL;DR: MelGAN is a non-autoregressive, fully convolutional GAN with significantly fewer parameters than competing models that generalizes to unseen speakers for mel-spectrogram inversion. The paper also suggests guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

Abstract: Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. A subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion. Our PyTorch implementation runs more than 100x faster than real time on a GTX 1080Ti GPU and more than 2x faster than real time on CPU, without any hardware-specific optimization tricks.

559 citations

Posted Content•

Tackling Climate Change with Machine Learning

10 Jun 2019-arXiv: Computers and Society

TL;DR: This paper identifies high-impact problems, from smart grids to disaster management, where machine learning can fill existing gaps in collaboration with other fields, and calls on the machine learning community to join the global effort against climate change.

Abstract: Climate change is one of the greatest challenges facing humanity, and we, as machine learning experts, may wonder how we can help. Here we describe how machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we identify high impact problems where existing gaps can be filled by machine learning, in collaboration with other fields. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the machine learning community to join the global effort against climate change.

441 citations

Proceedings Article•

Manifold Mixup: Better Representations by Interpolating Hidden States

Vikas Verma¹, Alex Lamb², Christopher Beckham³, Amir Najafi⁴, Ioannis Mitliagkas², David Lopez-Paz⁵, Yoshua Bengio²•Institutions (5)

Helsinki University of Technology¹, Université de Montréal², École Polytechnique de Montréal³, Sharif University of Technology⁴, Facebook⁵

24 May 2019

TL;DR: Manifold Mixup leverages semantic interpolations as an additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, networks trained with Manifold Mixup learn class representations with fewer directions of variance.

Abstract: Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it on practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.
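
The interpolation of hidden representations described in this abstract can be illustrated with a minimal sketch. It assumes a mixing coefficient drawn from a Beta(alpha, alpha) distribution and one-hot labels mixed with the same coefficient; the function names, the value of `alpha`, and the toy vectors are illustrative assumptions, not details taken from this listing or from the authors' code.

```python
import random


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def manifold_mixup_pair(h_a, y_a, h_b, y_b, alpha=2.0, rng=random):
    """Mix two hidden representations and their labels with one shared coefficient.

    h_a, h_b: hidden-layer activations of two examples (lists of floats).
    y_a, y_b: one-hot label vectors for the same two examples.
    alpha:    Beta(alpha, alpha) parameter (an assumed value for illustration).
    """
    lam = rng.betavariate(alpha, alpha)
    return mix(h_a, h_b, lam), mix(y_a, y_b, lam), lam


# Toy usage with hypothetical activations and labels.
h_mixed, y_mixed, lam = manifold_mixup_pair(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [0.0, 4.0, 2.0], [0.0, 1.0],
    rng=random.Random(0),
)
```

In a real network the mixed activations would be passed through the remaining layers and the loss computed against the mixed (soft) label, which is what pushes predictions on interpolated points to be less confident.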

388 citations

Posted Content•

Gradient based sample selection for online continual learning

Rahaf Aljundi¹, Min Lin², Baptiste Goujaud, Yoshua Bengio³•Institutions (3)

Katholieke Universiteit Leuven¹, National University of Singapore², Canadian Institute for Advanced Research³

20 Mar 2019-arXiv: Learning

TL;DR: This work formulates sample selection as a constraint reduction problem based on the constrained optimization view of continual learning and shows that it is equivalent to maximizing the diversity of samples in the replay buffer, with the parameter gradient as the feature.

Abstract: A continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such a learning process is to overcome the catastrophic forgetting of previously seen data, which is a well-known problem of neural networks. To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximates the feasible region defined by the original constraints. We show that it is equivalent to maximizing the diversity of samples in the replay buffer with the parameter gradient as the feature. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparing to other alternatives under the continual learning setting. Further comparisons are made against state-of-the-art methods that rely on task boundaries, which show comparable or even better results for our method.
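
To make the idea of "diversity of samples with the parameter gradient as the feature" concrete, here is a small sketch. It is not the paper's algorithm: it swaps in a simple farthest-point-style greedy heuristic that keeps the samples whose gradient vectors are least aligned (by cosine similarity) with those already kept. The function names and the toy gradients are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_diverse(gradients, buffer_size):
    """Greedily pick indices whose gradients are least aligned with those already picked.

    gradients:   one per-sample gradient vector per candidate (assumed non-zero).
    buffer_size: number of samples to keep in the replay buffer.
    """
    chosen = [0]  # arbitrary starting point for the illustration
    while len(chosen) < min(buffer_size, len(gradients)):
        best_idx, best_score = None, None
        for i, g in enumerate(gradients):
            if i in chosen:
                continue
            # Worst-case alignment with anything already in the buffer.
            score = max(cosine(g, gradients[j]) for j in chosen)
            if best_score is None or score < best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen


# Toy gradients: samples 0 and 1 point in nearly the same direction.
grads = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = select_diverse(grads, 3)
```

With these toy vectors the near-duplicate gradient (index 1) is the one left out, which is the behaviour a diversity-based buffer is meant to have.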

329 citations

Proceedings Article•DOI•

Interpolation Consistency Training for Semi-supervised Learning

Vikas Verma¹, Alex Lamb, Juho Kannala¹, Yoshua Bengio, David Lopez-Paz²•Institutions (2)

Aalto University¹, Facebook²

01 Aug 2019

258 citations

Posted Content•

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms

Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Nan Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, Chris Pal

30 Jan 2019-arXiv: Learning

TL;DR: This work proposes to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional changes, e.g. due to interventions, actions of agents and other sources of non-stationarities, and shows that causal structures can be parameterized via continuous variables and learned end-to-end.

Abstract: We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional changes, e.g. due to interventions, actions of agents and other sources of non-stationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a meta-learning objective. We demonstrate how this can be used to determine the cause-effect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned end-to-end. We then explore how these ideas could be used to also learn an encoder that would map low-level observed variables to unobserved causal variables leading to faster adaptation out-of-distribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and non-stationarities.

228 citations

Posted Content•

Recurrent Independent Mechanisms

Anirudh Goyal¹, Alex Lamb¹, Jordan Hoffmann, Shagun Sodhani², Sergey Levine³, Yoshua Bengio¹, Bernhard Schölkopf⁴•Institutions (4)

Université de Montréal¹, Facebook², University of Washington³, Max Planck Society⁴

24 Sep 2019-arXiv: Learning

TL;DR: This work proposes Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and are only updated at time steps where they are most relevant.

Abstract: Learning modular structures which reflect the dynamics of the environment can lead to better generalization and robustness to changes which only affect a few of the underlying causes. We propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and are only updated at time steps where they are most relevant. We show that this leads to specialization amongst the RIMs, which in turn allows for dramatically improved generalization on tasks where some factors of variation differ systematically between training and evaluation.

214 citations

Proceedings Article•DOI•

Learning Problem-Agnostic Speech Representations from Multiple Self-Supervised Tasks

Santiago Pascual¹, Mirco Ravanelli², Joan Serrà³, Antonio Bonafonte⁴, Yoshua Bengio²•Institutions (4)

Polytechnic University of Catalonia¹, Université de Montréal², Telefónica³, Amazon.com⁴

06 Apr 2019

TL;DR: This article proposes an improved self-supervised method in which a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across tasks imposes meaningful constraints on the encoder, helping it discover general representations and minimizing the risk of learning superficial ones.

Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints on the encoder, helping it to discover general representations and to minimize the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.

214 citations

Proceedings Article•DOI•

The PyTorch-Kaldi Speech Recognition Toolkit

Mirco Ravanelli¹, Titouan Parcollet², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, University of Avignon²

12 May 2019

TL;DR: The PyTorch-Kaldi project aims to bridge the gap between the Kaldi and PyTorch toolkits, inheriting the efficiency of Kaldi and the flexibility of PyTorch.

Abstract: The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits, but it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly locally or on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.

198 citations

Proceedings Article•

Unsupervised State Representation Learning in Atari

Ankesh Anand¹, Evan Racah², Sherjil Ozair¹, Yoshua Bengio¹, Marc-Alexandre Côté³, R Devon Hjelm⁴•Institutions (4)

Université de Montréal¹, Lawrence Berkeley National Laboratory², Université de Sherbrooke³, Microsoft⁴

01 Dec 2019

TL;DR: This work introduces a method that learns state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations and introduces a new benchmark based on Atari 2600 games to evaluate representations based on how well they capture the ground truth state variables.

Abstract: State representation learning, or the ability to capture latent generative factors of an environment is crucial for building intelligent agents that can perform a wide variety of tasks. Learning such representations in an unsupervised manner without supervision from rewards is an open problem. We introduce a method that tries to learn better state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations. We also introduce a new benchmark based on Atari 2600 games where we evaluate representations based on how well they capture the ground truth state. We believe this new framework for evaluating representation learning models will be crucial for future representation learning research. Finally, we compare our technique with other state-of-the-art generative and contrastive representation learning methods.

Proceedings Article•

On the Spectral Bias of Neural Networks

Nasim Rahaman¹, Aristide Baratin², Devansh Arpit³, Felix Draxler¹, Min Lin⁴, Fred A. Hamprecht¹, Yoshua Bengio², Aaron Courville²•Institutions (4)

Heidelberg University¹, Université de Montréal², Salesforce.com³, National University of Singapore⁴

24 May 2019

TL;DR: This article showed that deep ReLU networks are biased towards low frequency functions, meaning that they cannot have local fluctuations without affecting their global behavior, which is in line with the observation that over-parameterized networks find simple patterns that generalize across data samples.

Abstract: Neural networks are known to be a class of highly expressive functions able to fit even random input-output mappings with 100% accuracy. In this work, we present properties of neural networks that complement this aspect of expressivity. By using tools from Fourier analysis, we show that deep ReLU networks are biased towards low frequency functions, meaning that they cannot have local fluctuations without affecting their global behavior. Intuitively, this property is in line with the observation that over-parameterized networks find simple patterns that generalize across data samples. We also investigate how the shape of the data manifold affects expressivity by showing evidence that learning high frequencies gets easier with increasing manifold complexity, and present a theoretical understanding of this behavior. Finally, we study the robustness of the frequency components with respect to parameter perturbation, to develop the intuition that the parameters must be finely tuned to express high frequency functions.

Proceedings Article•DOI•

Speech Model Pre-training for End-to-End Spoken Language Understanding

Loren Lugosch¹, Mirco Ravanelli², Patrick Ignoto, Vikrant Singh Tomar¹, Yoshua Bengio²•Institutions (2)

McGill University¹, Université de Montréal²

15 Sep 2019

TL;DR: The authors proposed a method to reduce the data requirements of end-to-end spoken language understanding (SLU) in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU.

Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.

Posted Content•

GMNN: Graph Markov Neural Networks

Meng Qu¹, Yoshua Bengio², Jian Tang³•Institutions (3)

University of Illinois at Urbana–Champaign¹, Université de Montréal², University of Chicago³

15 May 2019-arXiv: Learning

TL;DR: In this paper, a graph Markov neural network (GMNN) was proposed to combine the advantages of both statistical relational learning and graph neural networks for semi-supervised object classification.

Abstract: This paper studies semi-supervised object classification in relational data, which is a fundamental problem in relational data modeling. The problem has been extensively studied in the literature of both statistical relational learning (e.g. relational Markov networks) and graph neural networks (e.g. graph convolutional networks). Statistical relational learning methods can effectively model the dependency of object labels through conditional random fields for collective classification, whereas graph neural networks learn effective object representations for classification through end-to-end training. In this paper, we propose the Graph Markov Neural Network (GMNN) that combines the advantages of both worlds. A GMNN models the joint distribution of object labels with a conditional random field, which can be effectively trained with the variational EM algorithm. In the E-step, one graph neural network learns effective object representations for approximating the posterior distributions of object labels. In the M-step, another graph neural network is used to model the local label dependency. Experiments on object classification, link classification, and unsupervised node representation learning show that GMNN achieves state-of-the-art results.

Posted Content•

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Kundan Kumar¹, Rithesh Kumar², Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo², Alexandre de Brebisson³, Yoshua Bengio², Aaron Courville²•Institutions (3)

Indian Institute of Technology Kanpur¹, Université de Montréal², Imperial College London³

08 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: This article proposes a non-autoregressive, fully convolutional GAN for mel-spectrogram inversion and shows qualitative results in speech synthesis, music domain translation and unconditional music synthesis.

Abstract: Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. A subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion. Our PyTorch implementation runs more than 100x faster than real time on a GTX 1080Ti GPU and more than 2x faster than real time on CPU, without any hardware-specific optimization tricks.

Proceedings Article•

Gradient based sample selection for online continual learning

Rahaf Aljundi¹, Min Lin², Baptiste Goujaud, Yoshua Bengio³•Institutions (3)

Katholieke Universiteit Leuven¹, National University of Singapore², Canadian Institute for Advanced Research³

01 Jan 2019

TL;DR: In this article, the authors formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning and show that it is equivalent to maximizing the diversity of samples in the replay buffer with parameter gradient as the feature.

Abstract: A continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such a learning process is to overcome the catastrophic forgetting of previously seen data, which is a well-known problem of neural networks. To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximates the feasible region defined by the original constraints. We show that it is equivalent to maximizing the diversity of samples in the replay buffer with the parameter gradient as the feature. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparing to other alternatives under the continual learning setting. Further comparisons are made against state-of-the-art methods that rely on task boundaries, which show comparable or even better results for our method.

Posted Content•

Speech Model Pre-training for End-to-End Spoken Language Understanding

Loren Lugosch¹, Mirco Ravanelli², Patrick Ignoto, Vikrant Singh Tomar¹, Yoshua Bengio²•Institutions (2)

McGill University¹, Université de Montréal²

07 Apr 2019-arXiv: Audio and Speech Processing

TL;DR: This work proposes a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. The method improves performance both when the full dataset is used for training and when only a small subset is used.

Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.

Posted Content•

Interpolation Consistency Training for Semi-Supervised Learning

Vikas Verma, Kenji Kawaguchi, Alex Lamb, Juho Kannala, Yoshua Bengio, David Lopez-Paz

09 Mar 2019-arXiv: Machine Learning

TL;DR: Interpolation Consistency Training (ICT), a simple and computation efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm, achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets.

Abstract: We introduce Interpolation Consistency Training (ICT), a simple and computation efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets. Our theoretical analysis shows that ICT corresponds to a certain type of data-adaptive regularization with unlabeled points which reduces overfitting to labeled points under high confidence values.
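
The consistency condition stated in this abstract (the prediction at an interpolation of two unlabeled points should match the interpolation of the predictions at those points) can be written down as a short sketch. It assumes a mean-squared-error penalty and uses the same model for both sides of the comparison; both choices, along with the function names and toy models, are assumptions made for illustration rather than details given in this listing.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def ict_consistency_loss(predict, u1, u2, lam):
    """Mean squared gap between predict(mix(u1, u2)) and mix(predict(u1), predict(u2))."""
    pred_at_mix = predict(mix(u1, u2, lam))
    mix_of_preds = mix(predict(u1), predict(u2), lam)
    return sum((p - q) ** 2 for p, q in zip(pred_at_mix, mix_of_preds)) / len(pred_at_mix)


def linear_model(x):
    """Hypothetical linear predictor: interpolation-consistent by construction."""
    return [2.0 * x[0] + x[1], x[0] - x[1]]


def toy_classifier(x):
    """Hypothetical nonlinear classifier: softmax over the raw inputs."""
    return softmax([x[0], x[1]])


u1, u2 = [1.0, 2.0], [3.0, -1.0]
loss_linear = ict_consistency_loss(linear_model, u1, u2, 0.3)
loss_nonlinear = ict_consistency_loss(toy_classifier, u1, u2, 0.3)
```

A linear predictor satisfies the condition exactly, so its penalty is zero up to rounding, while the nonlinear classifier incurs a positive penalty that training would then reduce on unlabeled pairs.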

Proceedings Article•DOI•

Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study

Chinnadhurai Sankar¹, Sandeep Subramanian¹, Chris Pal², Sarath Chandar¹, Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, École Polytechnique de Montréal²

01 Jul 2019

TL;DR: This paper takes an empirical approach to understanding how neural generative models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time.

Abstract: Neural generative models have become increasingly popular when building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures like recurrent and transformer-based seq2seq models are rarely sensitive to most perturbations such as missing or reordering utterances, shuffling words, etc. Also, by open-sourcing our code, we believe that it will serve as a useful diagnostic tool for evaluating dialog systems in the future.

Posted Content•

Learning Neural Causal Models from Unknown Interventions

Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Chris Pal, Yoshua Bengio

25 Sep 2019-arXiv: Machine Learning

TL;DR: This paper provides a general framework based on continuous optimization and neural networks to create models for the combination of observational and interventional data and establishes strong benchmark results on several structure learning tasks.

Abstract: Promising results have driven a recent surge of interest in continuous optimization methods for Bayesian network structure learning from observational data. However, there are theoretical limitations on the identifiability of underlying structures obtained from observational data alone. Interventional data provides much richer information about the underlying data-generating process. However, the extension and application of methods designed for observational data to include interventions is not straightforward and remains an open problem. In this paper we provide a general framework based on continuous optimization and neural networks to create models for the combination of observational and interventional data. The proposed method is even applicable in the challenging and realistic case that the identity of the intervened upon variable is unknown. We examine the proposed method in the setting of graph recovery both de novo and from a partially-known edge set. We establish strong benchmark results on several structure learning tasks, including structure recovery of both synthetic graphs as well as standard graphs from the Bayesian Network Repository.

Proceedings Article•

GMNN: Graph Markov Neural Networks

Meng Qu¹, Yoshua Bengio², Jian Tang³•Institutions (3)

University of Illinois at Urbana–Champaign¹, Université de Montréal², University of Chicago³

24 May 2019

TL;DR: This paper proposes the Graph Markov Neural Network (GMNN), which combines the advantages of statistical relational learning and graph neural networks, and demonstrates that GMNN achieves state-of-the-art results on object classification, link classification, and unsupervised node representation learning.

Abstract: This paper studies semi-supervised object classification in relational data, which is a fundamental problem in relational data modeling. The problem has been extensively studied in the literature of both statistical relational learning (e.g. relational Markov networks) and graph neural networks (e.g. graph convolutional networks). Statistical relational learning methods can effectively model the dependency of object labels through conditional random fields for collective classification, whereas graph neural networks learn effective object representations for classification through end-to-end training. In this paper, we propose the Graph Markov Neural Network (GMNN) that combines the advantages of both worlds. A GMNN models the joint distribution of object labels with a conditional random field, which can be effectively trained with the variational EM algorithm. In the E-step, one graph neural network learns effective object representations for approximating the posterior distributions of object labels. In the M-step, another graph neural network is used to model the local label dependency. Experiments on object classification, link classification, and unsupervised node representation learning show that GMNN achieves state-of-the-art results.

Journal Article•DOI•

Gated Orthogonal Recurrent Units: On Learning to Forget

Li Jing¹, Caglar Gulcehre², John Peurifoy¹, Yichen Shen¹, Max Tegmark¹, Marin Soljacic¹, Yoshua Bengio²•Institutions (2)

Massachusetts Institute of Technology¹, Université de Montréal²

15 Mar 2019-Neural Computation

TL;DR: This paper proposes a recurrent neural network (RNN)-based model that combines the remembering ability of unitary evolution RNNs with the ability of gated RNNs to effectively forget redundant or irrelevant information in its memory.

Abstract: We present a novel recurrent neural network (RNN)-based model that combines the remembering ability of unitary evolution RNNs with the ability of gated RNNs to effectively forget redundant or irrelevant information in its memory. We achieve this by extending restricted orthogonal evolution RNNs with a gating mechanism similar to gated recurrent unit RNNs with a reset gate and an update gate. Our model is able to outperform long short-term memory, gated recurrent units, and vanilla unitary or orthogonal RNNs on several long-term-dependency benchmark tasks. We empirically show that both orthogonal and unitary RNNs lack the ability to forget. This ability plays an important role in RNNs. We provide competitive results along with an analysis of our model on many natural sequential tasks, including question answering, speech spectrum prediction, character-level language modeling, and synthetic tasks that involve long-term dependencies such as algorithmic, denoising, and copying tasks.

Posted Content•

Maximum Entropy Generators for Energy-Based Models.

Rithesh Kumar¹, Anirudh Goyal, Aaron Courville, Yoshua Bengio•Institutions (1)

Université de Montréal¹

24 Jan 2019-arXiv: Learning

TL;DR: This work proposes learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient.

Abstract: Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In this work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score matching literature. The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.
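
The intractability mentioned in the abstract can be made concrete with the standard log-likelihood gradient of an energy-based model. The notation below ($E_\theta$ for the energy, $Z_\theta$ for the partition function, $G$ for the generator, $p(z)$ for its latent prior) is assumed for illustration and is not taken from the paper; the second line is a sketch of the amortized approximation the abstract describes, not the paper's exact objective.

```latex
\begin{align}
p_\theta(x) &= \frac{\exp\!\big(-E_\theta(x)\big)}{Z_\theta},
\qquad
\nabla_\theta \log p_\theta(x)
  = -\nabla_\theta E_\theta(x)
    + \mathbb{E}_{x' \sim p_\theta}\!\big[\nabla_\theta E_\theta(x')\big], \\
\nabla_\theta \log p_\theta(x)
  &\approx -\nabla_\theta E_\theta(x)
    + \mathbb{E}_{z \sim p(z)}\!\big[\nabla_\theta E_\theta\big(G(z)\big)\big].
\end{align}
```

The expectation over $p_\theta$ in the first line requires samples from the model itself, which is the expensive part; the second line replaces those samples with generator outputs, and the approximation is only as good as the match between the generator's distribution and $p_\theta$.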

Proceedings Article•DOI•

Learning Fixed Points in Generative Adversarial Networks: From Image-to-Image Translation to Disease Detection and Localization

Mahfuzur Rahman Siddiquee¹, Zongwei Zhou¹, Nima Tajbakhsh¹, Ruibin Feng¹, Michael B. Gotway², Yoshua Bengio, Jianming Liang¹•Institutions (2)

Arizona State University¹, Mayo Clinic²

01 Oct 2019

TL;DR: This paper proposes Fixed-Point GAN, which identifies a minimal subset of target pixels for domain translation, an ability that no GAN is equipped with yet. It is trained by supervising same-domain translation through a conditional identity loss and by regularizing cross-domain translation through revised adversarial, domain classification, and cycle consistency losses.

Abstract: Generative adversarial networks (GANs) have ushered in a revolution in image-to-image translation. The development and proliferation of GANs raises an interesting question: can we train a GAN to remove an object, if present, from an image while otherwise preserving the image? Specifically, can a GAN "virtually heal" anyone by turning his medical image, with an unknown health status (diseased or healthy), into a healthy one, so that diseased regions could be revealed by subtracting those two images? Such a task requires a GAN to identify a minimal subset of target pixels for domain translation, an ability that we call fixed-point translation, which no GAN is equipped with yet. Therefore, we propose a new GAN, called Fixed-Point GAN, trained by (1) supervising same-domain translation through a conditional identity loss, and (2) regularizing cross-domain translation through revised adversarial, domain classification, and cycle consistency loss. Based on fixed-point translation, we further derive a novel framework for disease detection and localization using only image-level annotation. Qualitative and quantitative evaluations demonstrate that the proposed method outperforms the state of the art in multi-domain image-to-image translation and that it surpasses predominant weakly-supervised localization methods in both disease detection and localization. Implementation is available at https://github.com/jlianglab/Fixed-Point-GAN.

Posted Content•

Compositional generalization in a deep seq2seq model by separating syntax and semantics.

Jake Russin¹, Jason Jo, Randall C. O'Reilly, Yoshua Bengio•Institutions (1)

University of Colorado Boulder¹

22 Apr 2019-arXiv: Learning

TL;DR: This work suggests that separating syntactic from semantic learning may be a useful heuristic for capturing compositional structure, and implements a modification to standard approaches in neural machine translation, imposing an analogous separation.

Abstract: Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. Inspired by work in neuroscience suggesting separate brain systems for syntactic and semantic processing, we implement a modification to standard approaches in neural machine translation, imposing an analogous separation. The novel model, which we call Syntactic Attention, substantially outperforms standard methods in deep learning on the SCAN dataset, a compositional generalization task, without any hand-engineered features or additional supervision. Our work suggests that separating syntactic from semantic learning may be a useful heuristic for capturing compositional structure.

Posted Content•

GraphMix: Improved Training of GNNs for Semi-Supervised Learning

Vikas Verma¹, Meng Qu, Kenji Kawaguchi², Alex Lamb, Yoshua Bengio, Juho Kannala, Jian Tang•Institutions (2)

Helsinki University of Technology¹, Massachusetts Institute of Technology²

25 Sep 2019-arXiv: Learning

TL;DR: GraphMix is presented, a regularization method for Graph Neural Network based semi-supervised object classification, whereby it is proposed to train a fully-connected network jointly with the graph neural network via parameter sharing and interpolation-based regularization.

Abstract: We present GraphMix, a regularization method for Graph Neural Network based semi-supervised object classification, whereby we propose to train a fully-connected network jointly with the graph neural network via parameter sharing and interpolation-based regularization. Further, we provide a theoretical analysis of how GraphMix improves the generalization bounds of the underlying graph neural network, without making any assumptions about the "aggregation" layer or the depth of the graph neural networks. We experimentally validate this analysis by applying GraphMix to various architectures such as Graph Convolutional Networks, Graph Attention Networks and Graph-U-Net. Despite its simplicity, we demonstrate that GraphMix can consistently improve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Co-author-CS and Co-author-Physics.
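
The abstract does not spell out the form of its interpolation-based regularization. As a rough illustration only, the Python sketch below assumes a mixup-style convex combination of two examples and their labels; the function names, the Beta(alpha, alpha) sampling of the mixing coefficient, and the toy data are assumptions, not details confirmed by the text above.

```python
import random


def interpolate(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b, applied element-wise."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mixup_pair(feat_i, label_i, feat_j, label_j, alpha=1.0, rng=None):
    """Mix two labeled examples with a coefficient drawn from Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed_feat = interpolate(feat_i, feat_j, lam)
    mixed_label = interpolate(label_i, label_j, lam)
    return mixed_feat, mixed_label, lam


# Two toy nodes with 3-d features and one-hot labels over 2 classes.
mixed_feat, mixed_label, lam = mixup_pair(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [0.0, 4.0, 2.0], [0.0, 1.0],
    rng=random.Random(0),  # fixed seed, chosen only for reproducibility
)
```

Training on such mixed pairs asks the network to behave linearly between examples, which is the usual motivation for interpolation-based regularizers.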

Posted Content•

Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Santiago Pascual¹, Mirco Ravanelli², Joan Serrà³, Antonio Bonafonte⁴, Yoshua Bengio²•Institutions (4)

Polytechnic University of Catalonia¹, Université de Montréal², Telefónica³, Amazon.com⁴

06 Apr 2019-arXiv: Learning

TL;DR: Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.

Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints on the encoder, helping it discover general representations and minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.

Journal Article•DOI•

Combined Reinforcement Learning via Abstract Representations

Vincent François-Lavet¹, Yoshua Bengio², Doina Precup¹, Joelle Pineau¹•Institutions (2)

McGill University¹, Université de Montréal²

17 Jul 2019

TL;DR: It is shown that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space, which opens up new strategies for interpretable AI, exploration and transfer learning.

Abstract: In the quest for efficient and robust reinforcement learning methods, both model-free and model-based approaches offer advantages. In this paper we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration and transfer learning.

Proceedings Article•DOI•

Learning Speaker Representations with Mutual Information

Mirco Ravanelli¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

15 Sep 2019

TL;DR: In this article, an encoder-discriminator architecture was proposed to learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence.

Abstract: Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms raw speech waveform into a compact feature vector. The discriminator is fed by either positive samples (of the joint distribution of encoded chunks) or negative samples (from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.
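
The sampling scheme in the abstract (positive pairs from the same sentence, negative pairs from the product of the marginals) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, sentences are stand-in lists of samples rather than raw waveforms, and drawing the negative chunk from a different sentence is used here as a simple stand-in for sampling from the product of the marginals.

```python
import random


def sample_chunk(signal, chunk_len, rng):
    """Cut a random contiguous chunk of `chunk_len` samples from one sentence."""
    start = rng.randrange(len(signal) - chunk_len + 1)
    return signal[start:start + chunk_len]


def sample_triplet(sentences, chunk_len, rng):
    """Return (anchor, positive, negative) chunks.

    anchor and positive come from the same sentence (joint distribution);
    negative comes from a different sentence (assumed stand-in for the
    product of the marginals).
    """
    i, j = rng.sample(range(len(sentences)), 2)  # two distinct sentence indices
    anchor = sample_chunk(sentences[i], chunk_len, rng)
    positive = sample_chunk(sentences[i], chunk_len, rng)
    negative = sample_chunk(sentences[j], chunk_len, rng)
    return anchor, positive, negative


# Toy "sentences": each is a constant signal so its origin is easy to check.
sentences = [[0] * 100, [1] * 100, [2] * 100]
anchor, positive, negative = sample_triplet(sentences, 10, random.Random(0))
```

In the setup the abstract describes, an encoder would map each chunk to a feature vector and a discriminator would be trained to separate (anchor, positive) pairs from (anchor, negative) pairs.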

Posted Content•

Hyperbolic Discounting and Learning over Multiple Horizons

William Fedus¹, Carles Gelada¹, Yoshua Bengio, Marc G. Bellemare¹, Hugo Larochelle¹•Institutions (1)

Google¹

19 Feb 2019-arXiv: Machine Learning

TL;DR: It is demonstrated that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL, and that simultaneously learning value functions over multiple time horizons is an effective auxiliary task that often improves over a strong value-based RL agent, Rainbow.

Abstract: Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
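
One way to see why hyperbolic discounting can be approximated with exponential discounts is the identity 1/(1 + kt) = ∫₀¹ γ^(kt) dγ, which expresses the hyperbolic discount as an average of exponential discounts over γ in [0, 1]. The Python sketch below checks this numerically with a midpoint Riemann sum. The function names, the value k = 0.1, and the number of γ values are assumptions chosen for illustration; the abstract does not state which approximation the authors use.

```python
def hyperbolic_discount(t, k=0.1):
    """Exact hyperbolic discount 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def approx_hyperbolic(t, k=0.1, n_gammas=1000):
    """Midpoint Riemann sum of gamma ** (k * t) over gamma in [0, 1]."""
    width = 1.0 / n_gammas
    total = 0.0
    for i in range(n_gammas):
        gamma = (i + 0.5) * width  # midpoint of the i-th sub-interval
        total += (gamma ** (k * t)) * width
    return total


# Largest gap between the exact and approximated discount on a few horizons.
max_error = max(
    abs(hyperbolic_discount(t) - approx_hyperbolic(t)) for t in (0, 1, 10, 50)
)
```

Because each term in the sum is an ordinary exponential discount, a value function for each γ could be learned with standard temporal-difference updates and the results combined with the same weights.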
