Showing papers by "Yoshua Bengio" published in 2018

Proceedings Article•DOI•

Graph Attention Networks

Petar Veličković¹, Guillem Cucurull², Arantxa Casanova³, Adriana Romero⁴, Pietro Liò¹, Yoshua Bengio⁵•Institutions (5)

University of Cambridge¹, Autonomous University of Barcelona², Polytechnic University of Catalonia³, HEC Montréal⁴, Université de Montréal⁵

15 Feb 2018

TL;DR: Graph attention networks (GATs) leverage masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.

Abstract: We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
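The masked attention step described in this abstract can be illustrated with a minimal sketch. It assumes the raw pairwise attention scores are already computed (how they are produced is not specified here), and it only shows the masking and per-neighborhood normalization. The function name `masked_attention` and the dense adjacency format are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from typing import List


def masked_attention(scores: List[List[float]],
                     adjacency: List[List[int]]) -> List[List[float]]:
    """Softmax-normalize scores[i][j] over the neighbors j of each node i.

    Entries where adjacency[i][j] is 0 are masked out and get weight 0.
    A node with no neighbors gets an all-zero row.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]]
        if not neighbors:
            weights.append([0.0] * n)
            continue
        # Subtract the row maximum for numerical stability.
        top = max(scores[i][j] for j in neighbors)
        exps = {j: math.exp(scores[i][j] - top) for j in neighbors}
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0
                        for j in range(n)])
    return weights


# Three nodes on a path graph with self-loops; all raw scores equal.
adj = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
attn = masked_attention([[0.0] * 3 for _ in range(3)], adj)
```

Because each row is normalized only over that node's neighbors, the weights depend on local structure alone, which is what lets the same layer be applied to graphs unseen during training.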

7,904 citations

Proceedings Article•

Learning deep representations by mutual information estimation and maximization

R Devon Hjelm¹, Alex Fedorov², Samuel Lavoie-Marchildon³, Karan Grewal, Philip Bachman¹, Adam Trischler¹, Yoshua Bengio³•Institutions (3)

Microsoft¹, University of New Mexico², Université de Montréal³

20 Aug 2018

TL;DR: Deep InfoMax (DIM) learns unsupervised representations by maximizing mutual information between an input and the output of a deep neural network encoder, and controls characteristics of the representation by matching it to a prior distribution adversarially.

Abstract: This work investigates unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation’s suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks with some standard architectures. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation learning objectives for specific end-goals.

1,218 citations

Posted Content•

Learning deep representations by mutual information estimation and maximization

R Devon Hjelm¹, Alex Fedorov², Samuel Lavoie-Marchildon³, Karan Grewal, Philip Bachman¹, Adam Trischler¹, Yoshua Bengio³•Institutions (3)

Microsoft¹, University of New Mexico², Université de Montréal³

20 Aug 2018-arXiv: Machine Learning

TL;DR: This work shows that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation’s suitability for downstream tasks. Deep InfoMax (DIM) is presented as an important step towards flexible formulations of representation learning objectives for specific end-goals.

Abstract: In this work, we perform unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality of the input to the objective can greatly influence a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and competes with fully-supervised learning on several classification tasks. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation-learning objectives for specific end-goals.

871 citations

Proceedings Article•DOI•

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang¹, Peng Qi², Saizheng Zhang³, Yoshua Bengio³, William W. Cohen⁴, Ruslan Salakhutdinov¹, Christopher D. Manning²•Institutions (4)

Carnegie Mellon University¹, Stanford University², Université de Montréal³, Google⁴

25 Sep 2018

TL;DR: HotpotQA is a dataset of 113k Wikipedia-based question-answer pairs with four key features: questions that require finding and reasoning over multiple supporting documents; diverse questions not constrained to any pre-existing knowledge bases or knowledge schemas; sentence-level supporting facts required for reasoning; and a new type of factoid comparison question that tests QA systems' ability to extract relevant facts and perform the necessary comparison.

Abstract: Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

850 citations

Posted Content•

Deep Graph Infomax

Petar Veličković¹, William Fedus², William L. Hamilton³, Pietro Liò¹, Yoshua Bengio⁴, R Devon Hjelm⁵•Institutions (5)

University of Cambridge¹, Google², Stanford University³, Université de Montréal⁴, Microsoft⁵

27 Sep 2018-arXiv: Machine Learning

TL;DR: Deep Graph Infomax (DGI) is presented, a general approach for learning node representations within graph-structured data in an unsupervised manner that is readily applicable to both transductive and inductive learning setups.

Abstract: We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs, both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.

834 citations

Proceedings Article•

Mutual Information Neural Estimation

Mohamed Ishmael Belghazi¹, Aristide Baratin², Sai Rajeshwar, Sherjil Ozair³, Yoshua Bengio², Aaron Courville², Devon Hjelm⁴•Institutions (4)

Facebook¹, Université de Montréal², Baidu³, Microsoft⁴

03 Jul 2018

TL;DR: A Mutual Information Neural Estimator (MINE) is presented that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent, and applied to improve adversarially trained generative models.

Abstract: We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent. We present a handful of applications on which MINE can be used to minimize or maximize mutual information. We apply MINE to improve adversarially trained generative models. We also use MINE to implement the Information Bottleneck, applying it to supervised classification; our results demonstrate substantial improvement in flexibility and performance in these settings.
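As a hedged illustration of how mutual information can be estimated by gradient descent over a neural network, one standard route is the Donsker-Varadhan lower bound, written below for a statistics network $T_\theta$ with parameters $\theta$. The symbols $T_\theta$, $P_{XZ}$, $P_X$ and $P_Z$ are notation introduced here and do not appear in the abstract.

```latex
I(X;Z) \;\ge\; \sup_{\theta}\;
  \mathbb{E}_{P_{XZ}}\!\left[ T_\theta(x,z) \right]
  \;-\; \log \mathbb{E}_{P_X \otimes P_Z}\!\left[ e^{T_\theta(x,z)} \right]
```

Both expectations can be approximated from samples (joint pairs for the first term, shuffled pairs for the second), so the right-hand side can be maximized over $\theta$ with back-propagation.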

820 citations

Proceedings Article•DOI•

Speaker Recognition from Raw Waveform with SincNet

Mirco Ravanelli¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

29 Jul 2018

TL;DR: This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.

Abstract: Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, that learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
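The idea of a band-pass filter defined only by a low and a high cutoff frequency can be sketched with the textbook difference-of-sincs construction. This is a minimal illustration, not the authors' code: the function names are hypothetical, frequencies are assumed to be normalized to cycles per sample, and windowing and the learning of the cutoffs are omitted.

```python
import math
from typing import List


def sinc(x: float) -> float:
    """Unnormalized sinc: sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f_low: float, f_high: float, length: int) -> List[float]:
    """Return `length` taps of an ideal band-pass filter (truncated).

    The filter is the difference of two low-pass sinc filters, so its
    entire shape is determined by the two cutoffs f_low < f_high,
    given in cycles per sample (0 to 0.5). `length` must be odd so
    that the taps are centered on n = 0.
    """
    if not 0.0 <= f_low < f_high <= 0.5:
        raise ValueError("need 0 <= f_low < f_high <= 0.5")
    if length % 2 == 0:
        raise ValueError("length must be odd")
    half = length // 2
    return [
        2.0 * f_high * sinc(2.0 * math.pi * f_high * n)
        - 2.0 * f_low * sinc(2.0 * math.pi * f_low * n)
        for n in range(-half, half + 1)
    ]


taps = sinc_bandpass(0.1, 0.2, 11)
```

Each filter here has two free parameters regardless of its length, whereas a standard convolutional filter has one parameter per tap, which is the compactness the abstract refers to.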

605 citations

Posted Content•

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang¹, Peng Qi², Saizheng Zhang³, Yoshua Bengio³, William W. Cohen⁴, Ruslan Salakhutdinov¹, Christopher D. Manning²•Institutions (4)

Carnegie Mellon University¹, Stanford University², Université de Montréal³, Google⁴

25 Sep 2018-arXiv: Computation and Language

TL;DR: It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Abstract: Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

574 citations

Posted Content•

Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon

Yoshua Bengio¹, Andrea Lodi², Antoine Prouvost²•Institutions (2)

Université de Montréal¹, École Polytechnique de Montréal²

15 Nov 2018-arXiv: Learning

TL;DR: A main point of the paper is seeing generic optimization problems as data points and inquiring what is the relevant distribution of problems to use for learning on a given task.

Abstract: This paper surveys the recent attempts, both from the machine learning and operations research communities, at leveraging machine learning to solve combinatorial optimization problems. Given the hard nature of these problems, state-of-the-art algorithms rely on handcrafted heuristics for making decisions that are otherwise too expensive to compute or mathematically not well defined. Thus, machine learning looks like a natural candidate to make such decisions in a more principled and optimized way. We advocate for pushing further the integration of machine learning and combinatorial optimization and detail a methodology to do so. A main point of the paper is seeing generic optimization problems as data points and inquiring what is the relevant distribution of problems to use for learning on a given task.

557 citations

Proceedings Article•DOI•

Deep Graph Infomax

Petar Veličković¹, William Fedus², William L. Hamilton³, Pietro Liò¹, Yoshua Bengio⁴, R Devon Hjelm⁵•Institutions (5)

University of Cambridge¹, Google², Stanford University³, Université de Montréal⁴, Microsoft⁵

27 Sep 2018

TL;DR: Deep Graph Infomax (DGI) is a general approach for learning node representations within graph-structured data in an unsupervised manner; it relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs.

Abstract: We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs—both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.

503 citations

Posted Content•

On the Spectral Bias of Neural Networks

Nasim Rahaman¹, Aristide Baratin², Devansh Arpit³, Felix Draxler¹, Min Lin⁴, Fred A. Hamprecht¹, Yoshua Bengio², Aaron Courville²•Institutions (4)

Heidelberg University¹, Université de Montréal², Salesforce.com³, National University of Singapore⁴

22 Jun 2018-arXiv: Machine Learning

TL;DR: This work shows that deep ReLU networks are biased towards low frequency functions, and studies the robustness of the frequency components with respect to parameter perturbation, to develop the intuition that the parameters must be finely tuned to express high frequency functions.

Abstract: Neural networks are known to be a class of highly expressive functions able to fit even random input-output mappings with 100% accuracy. In this work, we present properties of neural networks that complement this aspect of expressivity. By using tools from Fourier analysis, we show that deep ReLU networks are biased towards low frequency functions, meaning that they cannot have local fluctuations without affecting their global behavior. Intuitively, this property is in line with the observation that over-parameterized networks find simple patterns that generalize across data samples. We also investigate how the shape of the data manifold affects expressivity by showing evidence that learning high frequencies gets easier with increasing manifold complexity, and present a theoretical understanding of this behavior. Finally, we study the robustness of the frequency components with respect to parameter perturbation, to develop the intuition that the parameters must be finely tuned to express high frequency functions.

Posted Content•

Manifold Mixup: Better Representations by Interpolating Hidden States

Vikas Verma¹, Alex Lamb², Christopher Beckham, Amir Najafi³, Ioannis Mitliagkas, Aaron Courville², David Lopez-Paz⁴, Yoshua Bengio•Institutions (4)

Helsinki University of Technology¹, Université de Montréal², Sharif University of Technology³, Facebook⁴

13 Jun 2018-arXiv: Machine Learning

TL;DR: Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations, improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.

Abstract: Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it on practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.
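The interpolation of hidden representations described in this abstract can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, the mixing coefficient is assumed to be drawn from a Beta(alpha, alpha) distribution as is common in mixup-style training, the targets are assumed to be one-hot vectors mixed with the same coefficient, and the choice of which layer to mix at is left out.

```python
import random
from typing import List, Optional, Tuple


def interpolate(a: List[float], b: List[float], lam: float) -> List[float]:
    """Element-wise convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def manifold_mixup_pair(
    h_a: List[float], y_a: List[float],
    h_b: List[float], y_b: List[float],
    alpha: float = 2.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Mix two hidden states and their one-hot targets with one coefficient.

    h_a and h_b are hidden representations taken at the same layer for two
    training examples; y_a and y_b are their one-hot labels.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # assumed sampling scheme
    return interpolate(h_a, h_b, lam), interpolate(y_a, y_b, lam), lam


h_mix, y_mix, lam = manifold_mixup_pair(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [0.0, 4.0, 2.0], [0.0, 1.0],
    rng=random.Random(0),
)
```

The forward pass would then continue from `h_mix` and the loss would be computed against the soft target `y_mix`, which is what discourages confident predictions between classes.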

Proceedings Article•

MetaGAN: an adversarial approach to few-shot learning

Ruixiang Zhang¹, Tong Che¹, Zoubin Ghahramani², Yoshua Bengio¹, Yangqiu Song³•Institutions (3)

Université de Montréal¹, University of Cambridge², Hong Kong University of Science and Technology³

03 Dec 2018

TL;DR: This paper proposes a conceptually simple and general framework called MetaGAN for few-shot learning problems, and shows that MetaGAN can extend supervised few-shot learning models to naturally cope with unlabeled data.

Abstract: In this paper, we propose a conceptually simple and general framework called MetaGAN for few-shot learning problems. Most state-of-the-art few-shot classification models can be integrated with MetaGAN in a principled and straightforward way. By introducing an adversarial generator conditioned on tasks, we augment vanilla few-shot classification models with the ability to discriminate between real and fake data. We argue that this GAN-based approach can help few-shot classifiers to learn a sharper decision boundary, which could generalize better. We show that with our MetaGAN framework, we can extend supervised few-shot learning models to naturally cope with unlabeled data. Different from previous work in semi-supervised few-shot learning, our algorithms can deal with semi-supervision at both sample-level and task-level. We give theoretical justifications of the strength of MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot image classification benchmarks.

Proceedings Article•

Bayesian Model-Agnostic Meta-Learning

Jaesik Yoon¹, Taesup Kim², Ousmane Amadou Dia, Sungwoong Kim, Yoshua Bengio², Yoshua Bengio³, Sungjin Ahn•Institutions (3)

KAIST¹, Université de Montréal², Canadian Institute for Advanced Research³

01 Jan 2018

TL;DR: The proposed method combines scalable gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework and is capable of learning complex uncertainty structure beyond a point estimate or a simple Gaussian approximation during fast adaptation.

Abstract: Due to the inherent model uncertainty, learning to infer Bayesian posterior from a few-shot dataset is an important step towards robust meta-learning. In this paper, we propose a novel Bayesian model-agnostic meta-learning method. The proposed method combines efficient gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework. Unlike previous methods, during fast adaptation, the method is capable of learning complex uncertainty structure beyond a simple Gaussian approximation, and during meta-update, a novel Bayesian mechanism prevents meta-level overfitting. Remaining a gradient-based method, it is also the first Bayesian model-agnostic meta-learning method applicable to various tasks including reinforcement learning. Experiment results show the accuracy and robustness of the proposed method in sinusoidal regression, image classification, active learning, and reinforcement learning.

Proceedings Article•

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Sandeep Subramanian¹, Adam Trischler², Yoshua Bengio³, Chris Pal⁴•Institutions (4)

Carnegie Mellon University¹, Microsoft², Université de Montréal³, École Polytechnique de Montréal⁴

15 Feb 2018

TL;DR: The authors proposed a multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model, and trained this model on several data sources with multiple training objectives on over 100 million sentences.

Abstract: A lot of the recent success in natural language processing (NLP) has been driven by distributed vector representations of words trained on large amounts of text in an unsupervised manner. These representations are typically used as general purpose features for words across a range of NLP problems. However, extending this success to learning representations of sequences of words, such as sentences, remains an open problem. Recent work has explored unsupervised as well as supervised learning techniques with different training objectives to learn general purpose fixed-length sentence representations. In this work, we present a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model. We train this model on several data sources with multiple training objectives on over 100 million sentences. Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods. We present substantial improvements in the context of transfer learning and low-resource settings using our learned general-purpose representations.

Journal Article•DOI•

Drawing and Recognizing Chinese Characters with Recurrent Neural Network

Xu-Yao Zhang¹, Fei Yin¹, Yan-Ming Zhang¹, Cheng-Lin Liu¹, Yoshua Bengio²•Institutions (2)

Chinese Academy of Sciences¹, Université de Montréal²

01 Apr 2018-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper proposes a framework that uses the recurrent neural network (RNN) as both a discriminative model for recognizing Chinese characters and a generative model for drawing (generating) Chinese characters.

Abstract: Recent deep learning based approaches have achieved great success on handwriting recognition. Chinese characters are among the most widely adopted writing systems in the world. Previous research has mainly focused on recognizing handwritten Chinese characters. However, recognition is only one aspect of understanding a language; another challenging and interesting task is to teach a machine to automatically write (pictographic) Chinese characters. In this paper, we propose a framework by using the recurrent neural network (RNN) as both a discriminative model for recognizing Chinese characters and a generative model for drawing (generating) Chinese characters. To recognize Chinese characters, previous methods usually adopt the convolutional neural network (CNN) models which require transforming the online handwriting trajectory into image-like representations. Instead, our RNN based approach is an end-to-end system which directly deals with the sequential structure and does not require any domain-specific knowledge. With the RNN system (combining an LSTM and GRU), state-of-the-art performance can be achieved on the ICDAR-2013 competition database. Furthermore, under the RNN framework, a conditional generative model with character embedding is proposed for automatically drawing recognizable Chinese characters. The generated characters (in vector format) are human-readable and also can be recognized by the discriminative RNN model with high accuracy. Experimental results verify the effectiveness of using RNNs as both generative and discriminative models for the tasks of drawing and recognizing Chinese characters.

Posted Content•

An Empirical Study of Example Forgetting during Deep Neural Network Learning

Mariya Toneva¹, Alessandro Sordoni², Remi Tachet des Combes², Adam Trischler², Yoshua Bengio³, Geoffrey J. Gordon¹•Institutions (3)

Carnegie Mellon University¹, Microsoft², Université de Montréal³

12 Dec 2018-arXiv: Learning

TL;DR: In this article, the authors investigate the learning dynamics of neural networks as they train on single classification tasks, and find that certain examples are forgotten with high frequency, and some not at all.

Abstract: Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a 'forgetting event' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
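The definition of a forgetting event given in this abstract translates directly into a counting routine. The sketch below assumes that, for each training example, a chronological record of whether it was classified correctly at each check is available; the function names and the dictionary layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def count_forgetting_events(correct_history: Sequence[bool]) -> int:
    """Count transitions from correct (True) to incorrect (False).

    `correct_history` holds, in chronological order, whether one training
    example was classified correctly each time it was evaluated.
    """
    events = 0
    for previous, current in zip(correct_history, correct_history[1:]):
        if previous and not current:
            events += 1
    return events


def forgetting_counts(histories: Dict[str, Sequence[bool]]) -> Dict[str, int]:
    """Map each example identifier to its number of forgetting events."""
    return {example_id: count_forgetting_events(history)
            for example_id, history in histories.items()}


counts = forgetting_counts({
    "often_forgotten": [False, True, False, True, True, False],
    "never_forgotten": [False, True, True, True, True, True],
})
```

Sorting examples by these counts gives one simple way to separate frequently forgotten examples from those that are never forgotten.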

Journal Article•DOI•

Light Gated Recurrent Units for Speech Recognition

Mirco Ravanelli, Philemon Brakel¹, Maurizio Omologo, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

23 Mar 2018

TL;DR: This paper revises one of the most popular RNN models, gated recurrent units (GRUs), and proposes a simplified single-gate architecture that proves very effective for ASR: the reset gate is removed and hyperbolic tangent is replaced with rectified linear unit activations.

Abstract: A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human–machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on recurrent neural networks (RNNs) that are naturally able to exploit large time contexts and long-term speech modulations. It is thus of great interest to continue the study of proper techniques for improving the effectiveness of RNNs in processing speech signals. In this paper, we revise one of the most popular RNN models, namely, gated recurrent units (GRUs), and propose a simplified architecture that turned out to be very effective for ASR. The contribution of this work is twofold: First, we analyze the role played by the reset gate, showing that a significant redundancy with the update gate occurs. As a result, we propose to remove the former from the GRU design, leading to a more efficient and compact single-gate model. Second, we propose to replace hyperbolic tangent with rectified linear unit activations. This variation couples well with batch normalization and could help the model learn long-term dependencies without numerical issues. Results show that the proposed architecture, called light GRU, not only reduces the per-epoch training time by more than 30% over a standard GRU, but also consistently improves the recognition accuracy across different tasks, input features, noisy conditions, as well as across different ASR paradigms, ranging from standard DNN-HMM speech recognizers to end-to-end connectionist temporal classification models.
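The two changes described in this abstract (no reset gate, ReLU in place of hyperbolic tangent) can be sketched as a single recurrent step. This is a minimal illustration under stated assumptions: the update-gate interpolation follows the usual GRU form, biases and batch normalization are omitted, and the function and parameter names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def light_gru_step(x: List[float], h_prev: List[float],
                   w_z: Matrix, u_z: Matrix,
                   w_h: Matrix, u_h: Matrix) -> List[float]:
    """One step of a single-gate recurrent unit with a ReLU candidate.

    z      = sigmoid(w_z x + u_z h_prev)   (update gate, the only gate)
    h_cand = relu(w_h x + u_h h_prev)      (no reset gate applied)
    h      = z * h_prev + (1 - z) * h_cand
    """
    z_in = [a + b for a, b in zip(matvec(w_z, x), matvec(u_z, h_prev))]
    h_in = [a + b for a, b in zip(matvec(w_h, x), matvec(u_h, h_prev))]
    z = [1.0 / (1.0 + math.exp(-a)) for a in z_in]
    h_cand = [max(0.0, a) for a in h_in]
    return [zi * hp + (1.0 - zi) * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]


# With all-zero weights the gate is 0.5 and the candidate is 0,
# so the new state is half of the previous state.
zero = [[0.0]]
h_next = light_gru_step([1.0], [2.0], zero, zero, zero, zero)
```

Compared with a standard GRU step, this drops one gate's worth of weight matrices, which is where the reduction in per-step computation comes from.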

Proceedings Article•

Dendritic cortical microcircuits approximate the backpropagation algorithm

João Sacramento¹, Rui Ponte Costa², Yoshua Bengio³, Walter Senn¹•Institutions (3)

University of Bern¹, University of Bristol², Université de Montréal³

01 Jan 2018

TL;DR: A novel view of learning on dendritic cortical circuits and on how the brain may solve the long-standing synaptic credit assignment problem is introduced, in which error-driven synaptic plasticity adapts the network towards a global desired output.

Abstract: Deep learning has seen remarkable developments over the last years, many of them inspired by neuroscience. However, the main learning mechanism behind these advances – error backpropagation – appears to be at odds with neurobiology. Here, we introduce a multilayer neuronal network model with simplified dendritic compartments in which error-driven synaptic plasticity adapts the network towards a global desired output. In contrast to previous work our model does not require separate phases and synaptic learning is driven by local dendritic prediction errors continuously in time. Such errors originate at apical dendrites and occur due to a mismatch between predictive input from lateral interneurons and activity from actual top-down feedback. Through the use of simple dendritic compartments and different cell-types our model can represent both error and normal activity within a pyramidal neuron. We demonstrate the learning capabilities of the model in regression and classification tasks, and show analytically that it approximates the error backpropagation algorithm. Moreover, our framework is consistent with recent observations of learning between brain areas and the architecture of cortical microcircuits. Overall, we introduce a novel view of learning on dendritic cortical circuits and on how the brain may solve the long-standing synaptic credit assignment problem.

Proceedings Article•

Image-to-image translation for cross-domain disentanglement

Abel Gonzalez-Garcia¹, Joost van de Weijer², Yoshua Bengio³•Institutions (3)

University of Edinburgh¹, Autonomous University of Barcelona², Facebook³

24 May 2018

TL;DR: This paper achieves better results for translation on challenging datasets as well as for cross-domain retrieval on realistic datasets and compares the model to the state-of-the-art in multi-modal image translation.

Abstract: Deep image translation methods have recently shown excellent results, outputting high-quality images covering multiple modes of the data distribution. There has also been increased interest in disentangling the internal representations learned by deep methods to further improve their performance and achieve a finer control. In this paper, we bridge these two objectives and introduce the concept of cross-domain disentanglement. We aim to separate the internal representation into three parts. The shared part contains information for both domains. The exclusive parts, on the other hand, contain only factors of variation that are particular to each domain. We achieve this through bidirectional image translation based on Generative Adversarial Networks and cross-domain autoencoders, a novel network component. Our model offers multiple advantages. We can output diverse samples covering multiple modes of the distributions of both domains, perform domain-specific image transfer and interpolation, and cross-domain retrieval without the need of labeled data, only paired images. We compare our model to the state-of-the-art in multi-modal image translation and achieve better results for translation on challenging datasets as well as for cross-domain retrieval on realistic datasets.

Proceedings Article•DOI•

Towards End-to-end Spoken Language Understanding

Dmitriy Serdyuk¹, Yongqiang Wang¹, Christian Fuegen¹, Anuj Kumar¹, Baiyang Liu¹, Yoshua Bengio¹•Institutions (1)

Facebook¹

15 Apr 2018

TL;DR: This study showed that the trained model can achieve reasonably good results and demonstrated that the model can capture the semantic attention directly from the audio features.

Abstract: Spoken language understanding systems are traditionally designed as a pipeline of several components. First, the audio signal is processed by an automatic speech recognizer to produce a transcription or n-best hypotheses. Given the recognition results, a natural language understanding system classifies the text into structured data such as domain, intent and slots for downstream consumers, such as dialog systems and hands-free applications. These components are usually developed and optimized independently. In this paper, we present our study on an end-to-end learning system for spoken language understanding. With this unified approach, we can infer the semantic meaning directly from audio features without the intermediate text representation. This study showed that the trained model can achieve reasonably good results and demonstrated that the model can capture the semantic attention directly from the audio features.

Journal Article•DOI•

Fine-grained attention mechanism for neural machine translation

Heeyoul Choi¹, Kyunghyun Cho², Yoshua Bengio³•Institutions (3)

Handong Global University¹, New York University², Université de Montréal³

05 Apr 2018-Neurocomputing

TL;DR: The authors proposed a fine-grained (or 2D) attention mechanism where each dimension of a context vector will receive a separate attention score, which improves the translation quality in terms of BLEU score.
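The idea of giving each dimension of the context vector its own attention score can be sketched as below. This is an illustrative sketch only: the summary does not say how the scores are computed, so they are taken as an input here, and the per-dimension softmax over source positions is an assumption. The names `fine_grained_context`, `scores` and `annotations` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fine_grained_context(scores, annotations):
    """Build a context vector with one attention distribution per dimension.

    scores[j][d]      : unnormalized score of source position j for dimension d
    annotations[j][d] : dimension d of the encoder annotation at position j
    Conventional attention would use a single score per position j; here each
    dimension d is normalized over positions independently (assumed form).
    """
    dims = len(annotations[0])
    context = []
    for d in range(dims):
        weights = softmax([row[d] for row in scores])
        context.append(sum(w * h[d] for w, h in zip(weights, annotations)))
    return context
```

With identical scores in every dimension this reduces to ordinary attention, which makes the sketch easy to compare against a conventional baseline.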

Proceedings Article•

BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning

Maxime Chevalier-Boisvert¹, Dzmitry Bahdanau¹, Salem Lahlou, Lucas Willems, Chitwan Saharia², Thien Huu Nguyen³, Yoshua Bengio¹•Institutions (3)

Université de Montréal¹, Indian Institute of Technology Bombay², University of Oregon³

27 Sep 2018

Abstract: Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts. Here, we introduce the BabyAI research platform to support investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. The levels gradually lead the agent towards acquiring a combinatorially rich synthetic language which is a proper subset of English. The platform also provides a heuristic expert agent for the purpose of simulating a human teacher. We report baseline results and estimate the amount of human involvement that would be required to train a neural network-based agent on some of the BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.

Journal Article•DOI•

Feature-wise transformations

Vincent Dumoulin¹, Ethan Perez², Nathan Schucher, Florian Strub³, Harm de Vries, Aaron Courville, Yoshua Bengio•Institutions (3)

Google¹, Rice University², University of Lille³

09 Jul 2018

TL;DR: In this paper, the authors present a set of real-world problems that require integrating multiple sources of information, such as vision, language, audio, etc., in order to understand a scene in a movie or answer a question about an image.

Abstract: Many real-world problems require integrating multiple sources of information. Sometimes these problems involve multiple, distinct modalities of information — vision, language, audio, etc. — as is required to understand a scene in a movie or answer a question about an image. Other times, these problems involve multiple sources of the same kind of input, i.e. when summarizing several documents or drawing one image in the style of another.

Proceedings Article•

An Empirical Study of Example Forgetting during Deep Neural Network Learning

Mariya Toneva¹, Alessandro Sordoni², Remi Tachet des Combes², Adam Trischler², Yoshua Bengio³, Geoffrey J. Gordon¹•Institutions (3)

Carnegie Mellon University¹, Microsoft², Université de Montréal³

27 Sep 2018

TL;DR: It is found that certain examples are forgotten with high frequency, and some not at all; a data set’s (un)forgettable examples generalize across neural architectures; and a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.

Abstract: Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a “forgetting event” to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set’s (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
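The definition of a forgetting event given above can be made concrete with a short sketch. It assumes that each training example has a recorded history of per-evaluation correctness flags; that bookkeeping, and the names `count_forgetting_events`, `forgetting_counts` and `unforgettable_examples`, are hypothetical. Treating an example as unforgettable when it is learned at least once and never forgotten afterwards is also an assumption, since the abstract does not spell out that criterion.

```python
def count_forgetting_events(correct_history):
    """Count transitions from classified correctly (True) to incorrectly (False)."""
    events = 0
    for previous, current in zip(correct_history, correct_history[1:]):
        if previous and not current:
            events += 1
    return events


def forgetting_counts(histories):
    """Map each example id to its number of forgetting events."""
    return {
        example_id: count_forgetting_events(history)
        for example_id, history in histories.items()
    }


def unforgettable_examples(histories):
    """Return ids of examples learned at least once and never forgotten (assumed criterion)."""
    return sorted(
        example_id
        for example_id, history in histories.items()
        if any(history) and count_forgetting_events(history) == 0
    )
```

Ranking examples by these counts is one plausible way to choose which ones to drop from the training set, in the spirit of finding (iii).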

Posted Content•

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Sandeep Subramanian¹, Adam Trischler², Yoshua Bengio³, Chris Pal⁴•Institutions (4)

Carnegie Mellon University¹, Microsoft², Université de Montréal³, École Polytechnique de Montréal⁴

30 Mar 2018-arXiv: Computation and Language

TL;DR: This work presents a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model and demonstrates that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods.

Abstract: A lot of the recent success in natural language processing (NLP) has been driven by distributed vector representations of words trained on large amounts of text in an unsupervised manner. These representations are typically used as general purpose features for words across a range of NLP problems. However, extending this success to learning representations of sequences of words, such as sentences, remains an open problem. Recent work has explored unsupervised as well as supervised learning techniques with different training objectives to learn general purpose fixed-length sentence representations. In this work, we present a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model. We train this model on several data sources with multiple training objectives on over 100 million sentences. Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods. We present substantial improvements in the context of transfer learning and low-resource settings using our learned general-purpose representations.

Proceedings Article•

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Samira Ebrahimi Kahou¹, Adam Atkinson¹, Vincent Michalski², Ákos Kádár³, Adam Trischler¹, Yoshua Bengio²•Institutions (3)

Microsoft¹, Université de Montréal², Tilburg University³

12 Feb 2018

TL;DR: FigureQA is envisioned as a first step towards developing models that can intuitively recognize patterns from visual representations of data, and preliminary results indicate that the task poses a significant machine learning challenge.

Abstract: We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.

Proceedings Article•

MINE: mutual information neural estimation

Mohamed Ishmael Belghazi¹, Aristide Baratin², Sai Rajeswar², Sherjil Ozair³, Yoshua Bengio², Aaron Courville², Devon Hjelm⁴•Institutions (4)

Facebook¹, Université de Montréal², Baidu³, Microsoft⁴

07 Jun 2018

Posted Content•

BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop.

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, Yoshua Bengio

18 Oct 2018

TL;DR: The BabyAI research platform is introduced to support investigations towards including humans in the loop for grounded language learning and puts forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.

Journal Article•DOI•

Deep convolutional networks for quality assessment of protein folds.

Georgy Derevyanko¹, Sergei Grudinin², Yoshua Bengio³, Guillaume Lamoureux¹•Institutions (3)

Concordia University¹, University of Grenoble², Université de Montréal³

01 Dec 2018-Bioinformatics

TL;DR: It is shown that deep convolutional networks can be used to predict the ranking of model structures solely on the basis of their raw three-dimensional atomic densities, without any feature tuning.

Abstract: Motivation: The computational prediction of a protein structure from its sequence generally relies on a method to assess the quality of protein models. Most assessment methods rank candidate models using heavily engineered structural features, defined as complex functions of the atomic coordinates. However, very few methods have attempted to learn these features directly from the data. Results: We show that deep convolutional networks can be used to predict the ranking of model structures solely on the basis of their raw three-dimensional atomic densities, without any feature tuning. We develop a deep neural network that performs on par with state-of-the-art algorithms from the literature. The network is trained on decoys from the CASP7 to CASP10 datasets and its performance is tested on the CASP11 dataset. Additional testing on decoys from the CASP12, CAMEO and 3DRobot datasets confirms that the network performs consistently well across a variety of protein structures. While the network learns to assess structural decoys globally and does not rely on any predefined features, it can be analyzed to show that it implicitly identifies regions that deviate from the native structure. Availability and implementation: The code and the datasets are available at https://github.com/lamoureux-lab/3DCNN_MQA. Supplementary information: Supplementary data are available at Bioinformatics online.
