Showing papers by "Yoshua Bengio" published in 2017

Journal Article•DOI•

Brain tumor segmentation with Deep Neural Networks

Mohammad Havaei¹, Axel Davy², David Warde-Farley³, Antoine Biard³, Aaron Courville³, Yoshua Bengio³, Chris Pal⁴, Pierre-Marc Jodoin¹, Hugo Larochelle¹•Institutions (4)

Université de Sherbrooke¹, École Normale Supérieure², Université de Montréal³, École Polytechnique de Montréal⁴

01 Jan 2017-Medical Image Analysis

TL;DR: A fast and accurate fully automatic method for brain tumor segmentation which is competitive both in terms of accuracy and speed compared to the state of the art, and introduces a novel cascaded architecture that allows the system to more accurately model local label dependencies.

2,538 citations

Proceedings Article•DOI•

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

Simon Jégou, Michal Drozdzal¹, David Vazquez, Adriana Romero, Yoshua Bengio•Institutions (1)

École Polytechnique de Montréal¹

21 Jul 2017

TL;DR: In this article, the authors extend DenseNets to semantic segmentation and achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining.

Abstract: State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model and, optionally, (c) a post-processing module (e.g. Conditional Random Fields) to refine the model predictions. Recently, a new CNN architecture, Densely Connected Convolutional Networks (DenseNets), has shown excellent results on image classification tasks. The idea of DenseNets is based on the observation that if each layer is directly connected to every other layer in a feed-forward fashion then the network will be more accurate and easier to train. In this paper, we extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module or pretraining. Moreover, due to smart construction of the model, our approach has far fewer parameters than currently published best entries for these datasets.
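
The dense connectivity described in this abstract can be illustrated with a small bookkeeping sketch. It only counts channels and does no convolution. The function name and the numbers (48 input channels, growth rate 16, 4 layers) are illustrative assumptions, not values taken from the paper.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Return the input width seen by each layer and the block's final width.

    In a dense block every layer receives the concatenation of the block
    input and the outputs of all earlier layers, and contributes
    `growth_rate` new feature maps of its own.
    """
    inputs_per_layer = []
    channels = in_channels
    for _ in range(num_layers):
        inputs_per_layer.append(channels)  # layer sees every earlier feature map
        channels += growth_rate            # its new maps are concatenated on
    return inputs_per_layer, channels


# Hypothetical block: 48 input channels, growth rate 16, 4 layers.
inputs, out_channels = dense_block_channels(in_channels=48, growth_rate=16, num_layers=4)
print(inputs, out_channels)
```

The input width grows linearly with depth while each layer adds only a few feature maps, which is one plausible reading of why such models can stay small in parameter count.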

1,163 citations

Proceedings Article•

A closer look at memorization in deep networks

Devansh Arpit¹, Stanisław Jastrzębski², Nicolas Ballas¹, David Krueger¹, Emmanuel Bengio³, Maxinder S. Kanwal⁴, Tegan Maharaj⁵, Asja Fischer⁶, Aaron Courville¹, Yoshua Bengio¹, Simon Lacoste-Julien¹•Institutions (6)

Université de Montréal¹, Jagiellonian University², McGill University³, University of California, Berkeley⁴, École Polytechnique de Montréal⁵, University of Bonn⁶

06 Aug 2017

TL;DR: The analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.

Abstract: We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.

1,080 citations

Posted Content•

Graph Attention Networks

Petar Veličković¹, Guillem Cucurull², Arantxa Casanova³, Adriana Romero⁴, Pietro Liò¹, Yoshua Bengio⁵•Institutions (5)

University of Cambridge¹, Autonomous University of Barcelona², Polytechnic University of Catalonia³, HEC Montréal⁴, Université de Montréal⁵

30 Oct 2017-arXiv: Machine Learning

TL;DR: Graph attention networks (GATs) leverage masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.

Abstract: We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
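
A minimal single-head sketch of the masked attention idea follows: each node scores only its own neighbors, and a softmax over that neighborhood yields the per-neighbor weights. It assumes the node features are already linearly projected and splits the attention vector into two halves applied to the pair [h_i || h_j]. The toy graph, feature values and parameter values are made up for illustration.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gat_attention(features, neighbors, a_src, a_dst):
    """Attention coefficients masked to each node's neighborhood.

    features:  per-node feature vectors (assumed already projected)
    neighbors: dict node -> list of neighbor indices (self-loop included)
    a_src, a_dst: halves of the attention vector for h_i and h_j
    """
    alpha = {}
    for i, nbrs in neighbors.items():
        scores = [leaky_relu(dot(a_src, features[i]) + dot(a_dst, features[j]))
                  for j in nbrs]
        top = max(scores)                          # stabilise the softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


def aggregate(features, alpha):
    """New node features: attention-weighted sum over the neighborhood."""
    dim = len(features[0])
    return {i: [sum(w * features[j][d] for j, w in weights.items())
                for d in range(dim)]
            for i, weights in alpha.items()}


# Hypothetical path graph 0 - 1 - 2 with self-loops.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
alpha = gat_attention(features, neighbors, a_src=[0.5, -0.5], a_dst=[1.0, 0.25])
updated = aggregate(features, alpha)
```

Because the softmax runs only over listed neighbors, no matrix inversion or global view of the graph is needed, which matches the abstract's claim about inductive use.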

1,016 citations

Journal Article•

Quantized neural networks: training neural networks with low precision weights and activations

Itay Hubara¹, Matthieu Courbariaux², Daniel Soudry³, Ran El-Yaniv¹, Yoshua Bengio²•Institutions (3)

Technion – Israel Institute of Technology¹, Université de Montréal², Columbia University³

01 Jan 2017-Journal of Machine Learning Research

TL;DR: This paper introduces a method to train quantized neural networks (QNNs), i.e. networks that use extremely low-precision (e.g., 1-bit) weights and activations at run-time.

Abstract: We introduce a method to train Quantized Neural Networks (QNNs) -- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
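
To make the claim about bit-wise operations concrete, here is a small sketch of how a dot product of two 1-bit (+1/-1) vectors can be computed with XNOR and a population count. It is a generic illustration of the arithmetic, not the authors' GPU kernel; the helper names and the sample values are assumptions.

```python
def binarize(x):
    """Deterministic sign binarization; zero is mapped to +1."""
    return 1 if x >= 0 else -1


def pack_bits(signs):
    """Pack +1/-1 values into an int: bit i is 1 when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors stored as packed ints."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask     # 1 wherever the two signs agree
    agreements = bin(xnor).count("1")    # population count
    return 2 * agreements - n            # agreements minus disagreements


# Hypothetical real-valued weights and activations.
weights = [0.3, -1.2, 0.0, 0.7, -0.1]
activations = [-0.5, -0.4, 2.0, 0.9, 0.6]
w_signs = [binarize(w) for w in weights]
a_signs = [binarize(a) for a in activations]
fast = binary_dot(pack_bits(w_signs), pack_bits(a_signs), len(weights))
slow = sum(w * a for w, a in zip(w_signs, a_signs))
```

Both `fast` and `slow` compute the same integer; the packed version touches one machine word per group of bits instead of one multiply-accumulate per element.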

919 citations

Posted Content•

A Structured Self-attentive Sentence Embedding

Zhouhan Lin¹, Minwei Feng², Cicero Nogueira dos Santos², Mo Yu², Bing Xiang², Bowen Zhou², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, IBM²

09 Mar 2017-arXiv: Computation and Language

TL;DR: This paper proposed a self-attention mechanism and a special regularization term for the model, which achieved a significant performance gain compared to other sentence embedding methods in all of the three tasks.

Abstract: This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.
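
The matrix embedding and the regularization term can be sketched in a few lines. The sketch assumes the usual formulation A = softmax(W2 tanh(W1 Hᵀ)), M = A H, with a penalty equal to the squared Frobenius norm of A Aᵀ - I that discourages rows of A from attending to the same tokens. The helper names and the tiny matrices are illustrative, and no training is shown.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_embedding(H, W1, W2):
    """H: n x d hidden states, W1: da x d, W2: r x da. Returns (A, M)."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(W1, transpose(H))]
    A = [softmax(row) for row in matmul(W2, hidden)]  # r x n, one row per "hop"
    M = matmul(A, H)                                  # r x d matrix embedding
    return A, M


def redundancy_penalty(A):
    """Squared Frobenius norm of A A^T - I."""
    G = matmul(A, transpose(A))
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(len(G)) for j in range(len(G)))


# Hypothetical sentence of 3 tokens with 2-dimensional hidden states.
H = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]
W1 = [[0.5, -1.0], [1.5, 0.3]]
W2 = [[1.0, -2.0], [-1.0, 0.5]]
A, M = self_attentive_embedding(H, W1, W2)
penalty = redundancy_penalty(A)
```

Each row of `A` is a distribution over tokens, so it can be plotted directly over the sentence, which is the visualization side effect the abstract mentions.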

767 citations

Proceedings Article•

A Structured Self-Attentive Sentence Embedding.

Zhouhan Lin¹, Minwei Feng², Cicero Nogueira dos Santos², Mo Yu², Bing Xiang², Bowen Zhou², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, IBM²

09 Mar 2017

TL;DR: A new model for extracting an interpretable sentence embedding by introducing self-attention is proposed, which uses a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence.

724 citations

Proceedings Article•DOI•

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Anh Nguyen¹, Jeff Clune¹, Yoshua Bengio, Alexey Dosovitskiy², Jason Yosinski³•Institutions (3)

University of Wyoming¹, University of Freiburg², Uber³

21 Jul 2017

TL;DR: This paper introduces an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions than previous generative models, and does so for all 1000 ImageNet categories.

Abstract: Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. [37] showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227 × 227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models Plug and Play Generative Networks. PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable condition network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization [40], which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

689 citations

Proceedings Article•

Char2Wav: End-to-End Speech Synthesis

Jose Sotelo¹, Soroush Mehri, Kundan Kumar², João Felipe Santos³, Kyle Kastner¹, Aaron Courville¹, Yoshua Bengio¹•Institutions (3)

Université de Montréal¹, Indian Institute of Technology Kanpur², Institut national de la recherche scientifique³

17 Feb 2017

TL;DR: Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text; its reader, an encoder-decoder with attention, produces vocoder acoustic features from which a neural vocoder generates raw waveform samples.

Abstract: We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network that accepts text or phonemes as inputs, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. Neural vocoder refers to a conditional extension of SampleRNN which generates raw waveform samples from intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.

412 citations

Posted Content•

Three Factors Influencing Minima in SGD

Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey

13 Nov 2017-arXiv: Learning

TL;DR: Through this analysis, it is found that three factors – learning rate, batch size and the variance of the loss gradients – control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size.

Abstract: We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.
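
The statement that learning rate schedules can be replaced with batch size schedules follows from the ratio being the controlling quantity. The sketch below only does the bookkeeping: it builds two hypothetical schedules and shows that they trace the same sequence of learning-rate-to-batch-size ratios. The function names and the starting values (0.1, 32, factor 2) are assumptions for illustration, and no training is performed.

```python
import math


def lr_decay_schedule(lr0, batch0, factor, steps):
    """Classic schedule: shrink the learning rate, keep the batch size fixed."""
    return [(lr0 / factor ** t, batch0) for t in range(steps)]


def batch_growth_schedule(lr0, batch0, factor, steps):
    """Alternative: keep the learning rate fixed, grow the batch size."""
    return [(lr0, batch0 * factor ** t) for t in range(steps)]


def ratios(schedule):
    """Learning rate divided by batch size at every stage of a schedule."""
    return [lr / batch for lr, batch in schedule]


decay = ratios(lr_decay_schedule(0.1, 32, 2, 4))
growth = ratios(batch_growth_schedule(0.1, 32, 2, 4))
same_ratio = all(math.isclose(a, b) for a, b in zip(decay, growth))
print(decay, growth, same_ratio)
```

Under the abstract's argument, the two runs would be expected to settle in minima of similar width, since the ratio is the same at every stage.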

386 citations

Posted Content•

Deep Complex Networks

Chiheb Trabelsi¹, Olexa Bilaniuk², Ying Zhang³, Dmitriy Serdyuk⁴, Sandeep Subramanian⁵, João Felipe Santos⁶, Soroush Mehri, Negar Rostamzadeh⁷, Yoshua Bengio³, Chris Pal¹•Institutions (7)

École Polytechnique de Montréal¹, University of Ottawa², Université de Montréal³, Facebook⁴, Carnegie Mellon University⁵, Institut national de la recherche scientifique⁶, University of Trento⁷

27 May 2017-arXiv: Neural and Evolutionary Computing

TL;DR: This work relies on complex convolutions and presents algorithms for complex batch-normalization and complex weight initialization for complex-valued neural nets, uses them in experiments with end-to-end training schemes, and demonstrates that such complex-valued models are competitive with their real-valued counterparts.

Abstract: At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks and convolutional LSTMs. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech Spectrum Prediction using the TIMIT dataset. We achieve state-of-the-art performance on these audio-related tasks.
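
A complex convolution can be expressed with four real-valued convolutions, using (A + iB) * (x + iy) = (A*x - B*y) + i(B*x + A*y). The sketch below shows this identity in one dimension and compares it with Python's built-in complex arithmetic. The 1-D setting, the function names and the sample values are assumptions chosen for brevity; the paper works with multi-channel image and audio tensors.

```python
def conv1d_valid(x, w):
    """'Valid' cross-correlation, the operation deep learning libraries call convolution."""
    n, k = len(x), len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]


def complex_conv1d(x_re, x_im, w_re, w_im):
    """Complex convolution built from four real convolutions.

    With kernel W = A + iB and input h = x + iy:
        real part = A*x - B*y
        imag part = B*x + A*y
    """
    ax = conv1d_valid(x_re, w_re)
    by = conv1d_valid(x_im, w_im)
    bx = conv1d_valid(x_re, w_im)
    ay = conv1d_valid(x_im, w_re)
    real = [p - q for p, q in zip(ax, by)]
    imag = [p + q for p, q in zip(bx, ay)]
    return real, imag


# Hypothetical input signal and kernel, split into real and imaginary parts.
x_re, x_im = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]
w_re, w_im = [1.0, -1.0], [0.5, 2.0]
real, imag = complex_conv1d(x_re, x_im, w_re, w_im)

# Reference computed directly with complex numbers.
reference = conv1d_valid([complex(a, b) for a, b in zip(x_re, x_im)],
                         [complex(a, b) for a, b in zip(w_re, w_im)])
```

Splitting the computation this way lets a complex layer reuse ordinary real-valued convolution kernels.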

Posted Content•

Sharp Minima Can Generalize For Deep Nets

Laurent Dinh¹, Razvan Pascanu, Samy Bengio², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Google²

15 Mar 2017-arXiv: Learning

TL;DR: It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; for deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries of these architectures is exploited to build equivalent models corresponding to arbitrarily sharper minima.

Abstract: Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.
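
The symmetry in question can be seen on a two-layer rectifier network: because ReLU is positively homogeneous, multiplying the first layer's weights by a positive α and dividing the second layer's by α leaves the function unchanged while moving the parameters to a very different point in parameter space. The sketch below checks that invariance numerically; the weights, the input and the choice α = 1000 are made-up values, and it does not compute any curvature.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def two_layer_net(W1, W2, x):
    """Bias-free network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))


def rescale(W1, W2, alpha):
    """Return (alpha * W1, W2 / alpha); alpha must be positive."""
    scaled_W1 = [[alpha * w for w in row] for row in W1]
    scaled_W2 = [[w / alpha for w in row] for row in W2]
    return scaled_W1, scaled_W2


# Hypothetical weights and input.
W1 = [[1.0, -2.0], [0.5, 0.3]]
W2 = [[2.0, -1.0]]
x = [0.7, -0.2]

original = two_layer_net(W1, W2, x)
V1, V2 = rescale(W1, W2, alpha=1000.0)
rescaled = two_layer_net(V1, V2, x)
```

The two parameter settings compute the same outputs, so any flatness measure that changes under this rescaling cannot by itself explain generalization, which is the abstract's point.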

Proceedings Article•

Sharp minima can generalize for deep nets

Laurent Dinh¹, Razvan Pascanu, Samy Bengio², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Google²

06 Aug 2017

TL;DR: The authors argue that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization, and exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima.

Posted Content•

A Closer Look at Memorization in Deep Networks

Université de Montréal¹, Jagiellonian University², McGill University³, University of California, Berkeley⁴, École Polytechnique de Montréal⁵, University of Bonn⁶

16 Jun 2017-arXiv: Machine Learning

TL;DR: The authors examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness, showing that deep networks tend to prioritize learning simple patterns first.

Proceedings Article•

Deep Complex Networks

27 May 2017

TL;DR: In this paper, the authors provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks, and demonstrate that such complex-valued models are competitive with their real-valued counterparts.

Abstract: At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech spectrum prediction using TIMIT. We achieve state-of-the-art performance on these audio-related tasks.

Proceedings Article•DOI•

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe¹, Michael Noseworthy², Iulian Vlad Serban³, Nicolas Angelard-Gontier¹, Yoshua Bengio³, Joelle Pineau¹•Institutions (3)

McGill University¹, Massachusetts Institute of Technology², Université de Montréal³

17 Feb 2017

TL;DR: This paper presented an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores, and showed that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both utterance and system-level.

Abstract: Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality (Liu et al., 2016). Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model’s predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.

Posted Content•

Generalization in Deep Learning

Kenji Kawaguchi, Leslie Pack Kaelbling, Yoshua Bengio

16 Oct 2017-arXiv: Machine Learning

TL;DR: Non-vacuous and numerically-tight generalization guarantees for deep learning are provided, as well as theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima.

Abstract: This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.

Posted Content•

Measuring the tendency of CNNs to Learn Surface Statistical Regularities

Jason Jo, Yoshua Bengio

30 Nov 2017-arXiv: Learning

TL;DR: This paper showed that deep CNNs tend to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets.

Abstract: Deep CNNs are known to exhibit the following peculiarity: on the one hand they generalize extremely well to a test set, while on the other hand they are extremely sensitive to so-called adversarial perturbations. The extreme sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are learning high level abstractions in the dataset. We are concerned with the following question: How can a deep CNN that does not learn any high level semantics of the dataset manage to generalize so well? The goal of this article is to measure the tendency of CNNs to learn surface statistical regularities of the dataset. To this end, we use Fourier filtering to construct datasets which share the exact same high level abstractions but exhibit qualitatively different surface statistical regularities. For the SVHN and CIFAR-10 datasets, we present two Fourier filtered variants: a low frequency variant and a randomly filtered variant. Each of the Fourier filtering schemes is tuned to preserve the recognizability of the objects. Our main finding is that CNNs exhibit a tendency to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets. Moreover, we observe that significantly increasing the depth of a network has a very marginal impact on closing the aforementioned generalization gap. Thus we provide quantitative evidence supporting the hypothesis that deep CNNs tend to learn surface statistical regularities in the dataset rather than higher-level abstract concepts.
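
The low-frequency variant can be pictured as a mask applied in the Fourier domain. The sketch below does this for a 1-D signal with a naive discrete Fourier transform: it zeroes every frequency farther than a cutoff from the DC component and transforms back. The 1-D setting, the cutoff value and the function names are simplifying assumptions; the paper filters 2-D images and tunes its filters to keep objects recognizable.

```python
import cmath


def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def low_pass(signal, cutoff):
    """Zero every frequency whose index distance from DC exceeds `cutoff`."""
    n = len(signal)
    spectrum = dft(signal)
    kept = [c if min(k, n - k) <= cutoff else 0.0 for k, c in enumerate(spectrum)]
    return [v.real for v in idft(kept)]


# Hypothetical signal: a constant level plus the highest representable frequency.
constant = [3.0] * 8
alternating = [1.0, -1.0] * 4
mixed = [c + a for c, a in zip(constant, alternating)]
filtered = low_pass(mixed, cutoff=1)
```

After filtering, the coarse content (the constant level) survives while the fine alternating pattern is removed, which is the kind of change in surface statistics the abstract describes.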

Journal Article•DOI•

Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation.

Benjamin Scellier¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

04 May 2017-Frontiers in Computational Neuroscience

TL;DR: It is shown that multi-layer recurrently connected networks with 1, 2, and 3 hidden layers can be trained by Equilibrium Propagation on the permutation-invariant MNIST task, and it makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains.

Abstract: We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point, or stationary distribution) towards a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged towards their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal 'back-propagated' during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not. We also show experimentally that multi-layer recurrently connected networks with 1, 2 and 3 hidden layers can be trained by Equilibrium Propagation on the permutation-invariant MNIST task.
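
The two-phase idea can be checked by hand on a one-unit toy problem. The sketch assumes a made-up scalar energy E = s²/2 - θxs (free fixed point s = θx) and cost C = (s - y)²/2, forms the total energy F = E + βC, and estimates the gradient from the difference of ∂F/∂θ at the nudged and free fixed points divided by β. This toy energy is an assumption chosen so that both fixed points have closed forms; it is not the network energy used in the paper.

```python
def free_fixed_point(theta, x):
    """First phase: minimizer of the toy energy E = 0.5*s**2 - theta*x*s."""
    return theta * x


def nudged_fixed_point(theta, x, y, beta):
    """Second phase: minimizer of F = E + beta*C with C = 0.5*(s - y)**2."""
    return (theta * x + beta * y) / (1.0 + beta)


def dF_dtheta(x, s):
    """Partial derivative of F with respect to theta at a fixed state s."""
    return -x * s


def eqprop_gradient(theta, x, y, beta):
    """Contrast the two phases: (dF/dtheta at nudged - dF/dtheta at free) / beta."""
    s_free = free_fixed_point(theta, x)
    s_nudged = nudged_fixed_point(theta, x, y, beta)
    return (dF_dtheta(x, s_nudged) - dF_dtheta(x, s_free)) / beta


def true_gradient(theta, x, y):
    """Derivative of the objective J = 0.5*(theta*x - y)**2 with respect to theta."""
    return x * (theta * x - y)


# Hypothetical parameter, input and target; small beta means a slight nudge.
estimate = eqprop_gradient(theta=0.5, x=2.0, y=3.0, beta=1e-4)
exact = true_gradient(theta=0.5, x=2.0, y=3.0)
print(estimate, exact)
```

In this toy case the estimate equals the exact gradient divided by (1 + β), so the two agree as β shrinks, and both phases use the same kind of computation (settling to a fixed point).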

Journal Article•DOI•

Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark

Xu-Yao Zhang¹, Yoshua Bengio², Cheng-Lin Liu¹,³•Institutions (3)

Chinese Academy of Sciences¹, Université de Montréal², Center for Excellence in Education³

01 Jan 2017-Pattern Recognition

TL;DR: In this article, a new adaptation layer is proposed to reduce the mismatch between training and test data on a particular source layer, and the adaptation process can be efficiently and effectively implemented in an unsupervised manner.

Posted Content•

Maximum-Likelihood Augmented Discrete Generative Adversarial Networks

Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, Yoshua Bengio

26 Feb 2017-arXiv: Artificial Intelligence

TL;DR: This work derives a novel, low-variance GAN objective using the discriminator's output that corresponds to the log-likelihood; the new objective is proved to be consistent in theory and beneficial in practice.

Abstract: Despite the successes in capturing continuous distributions, the application of generative adversarial networks (GANs) to discrete settings, like natural language tasks, is rather restricted. The fundamental reason is the difficulty of back-propagation through discrete random variables combined with the inherent instability of the GAN training objective. To address these problems, we propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. Instead of directly optimizing the GAN objective, we derive a novel and low-variance objective using the discriminator's output that corresponds to the log-likelihood. Compared with the original, the new objective is proved to be consistent in theory and beneficial in practice. The experimental results on various discrete datasets demonstrate the effectiveness of the proposed approach.

Posted Content•

A Deep Reinforcement Learning Chatbot

Iulian Vlad Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Rajeshwar, Alexandre de Brébisson, Jose Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, Yoshua Bengio

07 Sep 2017-arXiv: Computation and Language

TL;DR: MILA's MILABOT is capable of conversing with humans on popular small talk topics through both speech and text and consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models.

Abstract: We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.

Proceedings Article•

Improving Generative Adversarial Networks with Denoising Feature Matching

David Warde-Farley¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

24 Apr 2017

TL;DR: An augmented training procedure for generative adversarial networks designed to address shortcomings of the original by directing the generator towards probable configurations of abstract discriminator features is proposed.

Abstract: We propose an augmented training procedure for generative adversarial networks designed to address shortcomings of the original by directing the generator towards probable configurations of abstract discriminator features. We estimate and track the distribution of these features, as computed from data, with a denoising auto-encoder, and use it to propose high-level targets for the generator. We combine this new loss with the original and evaluate the hybrid criterion on the task of unsupervised image synthesis from datasets comprising a diverse set of visual categories, noting a qualitative and quantitative improvement in the "objectness" of the resulting samples.
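
The hybrid criterion described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `denoise` callable, and the weight `lam` are hypothetical, and plain Python lists stand in for discriminator feature vectors.

```python
def squared_distance(a, b):
    """Sum of squared element-wise differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def denoising_feature_matching_loss(sample_features, denoise):
    """Distance between the discriminator features of a generated sample and
    the denoiser's reconstruction of those features.

    `denoise` is a hypothetical callable standing in for a denoising
    auto-encoder fitted to features computed from real data; its output is
    used here as a fixed target for the generator.
    """
    target = denoise(sample_features)
    return squared_distance(target, sample_features)


def hybrid_generator_loss(adversarial_loss, sample_features, denoise, lam=1.0):
    """Original adversarial loss plus a weighted feature-matching term.

    `lam` is an assumed weighting hyperparameter, not a value from the paper.
    """
    return adversarial_loss + lam * denoising_feature_matching_loss(sample_features, denoise)
```

In a real training loop the feature vectors would come from an intermediate discriminator layer and the loss would be differentiated with respect to the generator parameters only; the sketch shows just how the two terms are combined.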

Posted Content•

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe¹, Michael Noseworthy², Iulian Vlad Serban³, Nicolas Angelard-Gontier¹, Yoshua Bengio³, Joelle Pineau¹•Institutions (3)

McGill University¹, Massachusetts Institute of Technology², Université de Montréal³

23 Aug 2017-arXiv: Computation and Language

TL;DR: The authors presented an evaluation model that learns to predict human-like scores to input responses, using a new dataset of human response scores, and showed that the model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both utterance and system-level.

Abstract: Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
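
The comparison with human judgements can be made concrete with a correlation coefficient. The sketch below computes Pearson's r between automatic scores and human ratings. The abstract does not say which correlation statistics were used, and the example scores are made-up illustrative values, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical utterance-level ratings (1-5 scale) and automatic scores.
human_scores = [1, 2, 3, 4, 5]
model_scores = [1.2, 1.9, 3.4, 3.8, 4.9]
correlation = pearson_r(human_scores, model_scores)
```

A system-level comparison would apply the same computation to per-system averages instead of per-utterance scores.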

Posted Content•

The Consciousness Prior

Yoshua Bengio

25 Sep 2017-arXiv: Learning

TL;DR: A new prior is proposed for learning representations of high-level concepts of the kind the authors manipulate with language, inspired by cognitive neuroscience theories of consciousness, that makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in a form similar to facts and rules.

Abstract: A new prior is proposed for learning representations of high-level concepts of the kind we manipulate with language. This prior can be combined with other priors in order to help disentangle abstract factors from each other. It is inspired by cognitive neuroscience theories of consciousness, seen as a bottleneck through which just a few elements, after having been selected by attention from a broader pool, are then broadcast and condition further processing, both in perception and decision-making. The set of recently selected elements one becomes aware of is seen as forming a low-dimensional conscious state. This conscious state combines the few concepts constituting a conscious thought, i.e., what one is immediately conscious of at a particular moment. We claim that this architectural and information-processing constraint corresponds to assumptions about the joint distribution between high-level concepts. To the extent that these assumptions are generally true (and the form of natural language seems consistent with them), they can form a useful prior for representation learning. A low-dimensional thought or conscious state is analogous to a sentence: it involves only a few variables and yet can make a statement with very high probability of being true. This is consistent with a joint distribution (over high-level concepts) which has the form of a sparse factor graph, i.e., where the dependencies captured by each factor of the factor graph involve only very few variables while creating a strong dip in the overall energy function. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in a form similar to facts and rules, albeit capturing uncertainty as well as efficient search mechanisms implemented by attention mechanisms.

Posted Content•

Learning Normalized Inputs for Iterative Estimation in Medical Image Segmentation

Michal Drozdzal, Gabriel Chartrand, Eugene Vorontsov, Lisa Di Jorio, An Tang, Adriana Romero, Yoshua Bengio, Chris Pal, Samuel Kadoury

16 Feb 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a simple yet powerful pipeline for medical image segmentation that combines Fully Convolutional Networks (FCNs) with Fully Convolutional Residual Networks (FC-ResNets) is presented.

Abstract: In this paper, we introduce a simple, yet powerful pipeline for medical image segmentation that combines Fully Convolutional Networks (FCNs) with Fully Convolutional Residual Networks (FC-ResNets). We propose and examine a design that takes particular advantage of recent advances in the understanding of both Convolutional Neural Networks as well as ResNets. Our approach focuses upon the importance of a trainable pre-processing when using FC-ResNets and we show that a low-capacity FCN model can serve as a pre-processor to normalize medical input data. In our image segmentation pipeline, we use FCNs to obtain normalized images, which are then iteratively refined by means of an FC-ResNet to generate a segmentation prediction. As in other fully convolutional approaches, our pipeline can be used off-the-shelf on different image modalities. We show that, using this pipeline, we achieve state-of-the-art performance on the challenging Electron Microscopy benchmark, when compared to other 2D methods. We improve segmentation results on CT images of liver lesions, compared with standard FCN methods. Moreover, when applying our 2D pipeline to a 3D MRI prostate segmentation challenge, we reach results that are competitive even when compared to 3D methods. The obtained results illustrate the strong potential and versatility of the pipeline by achieving highly accurate results on multi-modality images from different anatomical regions and organs.

Posted Content•

Boundary-Seeking Generative Adversarial Networks

R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, Yoshua Bengio

27 Feb 2017-arXiv: Machine Learning

TL;DR: This work introduces a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator.

Abstract: Generative adversarial networks (GANs) are a learning framework that relies on training a discriminator to estimate a measure of difference between the target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on CelebA, Large-scale Scene Understanding (LSUN) bedrooms, and ImageNet without conditioning.
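
The importance-weighting step can be sketched as follows. The sketch assumes a sigmoid discriminator, for which the ratio D(x) / (1 - D(x)) equals the exponential of the discriminator's logit, and it assumes the weights are normalized over a batch of generated samples. Both choices are illustrative assumptions rather than details stated in the abstract, and the function names are hypothetical.

```python
import math


def normalized_importance_weights(disc_logits):
    """Turn discriminator logits for generated samples into weights summing to 1.

    For D = sigmoid(logit), D / (1 - D) = exp(logit), so the normalized
    weights are a softmax over the logits. The maximum is subtracted first
    for numerical stability; this does not change the result.
    """
    max_logit = max(disc_logits)
    unnormalized = [math.exp(logit - max_logit) for logit in disc_logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_log_likelihood(sample_log_probs, disc_logits):
    """Importance-weighted log-likelihood of the generated samples.

    With the weights held fixed, the gradient of this quantity with respect
    to the generator parameters is a weighted sum of the gradients of the
    per-sample log-probabilities, which has the form of a policy gradient.
    """
    weights = normalized_importance_weights(disc_logits)
    return sum(w * lp for w, lp in zip(weights, sample_log_probs))
```

Samples that the discriminator rates as more realistic receive larger weights, so the generator is pushed to raise the probability of those samples.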

Journal Article•DOI•

On integrating a language model into neural machine translation

Caglar Gulcehre¹, Orhan Firat², Kelvin Xu¹, Kyunghyun Cho¹, Yoshua Bengio³•Institutions (3)

Université de Montréal¹, Middle East Technical University², Canadian Institute for Advanced Research³

01 Sep 2017-Computer Speech & Language

TL;DR: This work combines scores from a neural language model trained only on target monolingual data with those of a neural machine translation model, and also fuses the hidden states of the two models, obtaining up to a 2 BLEU improvement over hierarchical and phrase-based baselines on a low-resource language pair, Turkish-English.
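
Combining the scores of the two models can be illustrated with a weighted sum of log-probabilities at a single decoding step. This is a minimal sketch: the weight `beta`, the function names, and the toy log-probabilities are assumptions for illustration, not values or code from the paper, and both models are assumed to score the same candidate tokens.

```python
def fused_scores(tm_log_probs, lm_log_probs, beta=0.5):
    """Add the weighted language-model log-probability to the translation-model
    log-probability for every candidate token.

    Both arguments map token -> log-probability over the same candidates.
    """
    return {
        token: tm_lp + beta * lm_log_probs[token]
        for token, tm_lp in tm_log_probs.items()
    }


def best_token(tm_log_probs, lm_log_probs, beta=0.5):
    """Pick the candidate with the highest fused score."""
    scores = fused_scores(tm_log_probs, lm_log_probs, beta)
    return max(scores, key=scores.get)


# Toy log-probabilities: the translation model slightly prefers "home",
# while the language model strongly prefers "house" in this context.
toy_tm = {"house": -0.7985, "home": -0.5978}
toy_lm = {"house": -0.1054, "home": -2.3026}
choice_without_lm = best_token(toy_tm, toy_lm, beta=0.0)
choice_with_lm = best_token(toy_tm, toy_lm, beta=0.5)
```

Setting `beta` to zero recovers the translation model's own choice; larger values let the monolingual language model override it. Fusing hidden states, the second approach mentioned above, requires access to the networks' internals and is not shown here.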

Proceedings Article•

Z-Forcing: Training Stochastic Recurrent Networks

Anirudh Goyal¹, Alessandro Sordoni², Marc-Alexandre Côté³, Nan Rosemary Ke⁴, Yoshua Bengio¹•Institutions (4)

Université de Montréal¹, Microsoft², Université de Sherbrooke³, École Polytechnique de Montréal⁴

15 Nov 2017

TL;DR: This work unifies successful ideas from recently proposed architectures into a stochastic recurrent model that achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST.

Abstract: Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortised variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables.
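
The training objective described in this abstract, a variational lower bound plus an auxiliary cost, can be sketched for a single time step as follows. This is an illustration under stated assumptions: a diagonal Gaussian prior and approximate posterior, a squared-error auxiliary term for reconstructing the backward state, and a hypothetical weight `aux_weight`. The abstract does not specify these forms, and the function names are made up for the example.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians given as lists of means
    and standard deviations."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


def step_loss(neg_log_likelihood, mu_q, sigma_q, mu_p, sigma_p,
              predicted_backward_state, backward_state, aux_weight=1.0):
    """Per-step loss to minimize: negative variational lower bound plus a
    weighted auxiliary cost.

    The negative lower bound is the reconstruction term plus the KL term.
    The auxiliary cost (an assumed squared error) penalizes the gap between
    the backward RNN state and its prediction from the latent variable.
    """
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
    aux = sum((p - b) ** 2 for p, b in zip(predicted_backward_state, backward_state))
    return neg_log_likelihood + kl + aux_weight * aux
```

Summing `step_loss` over the sequence gives the full training loss; setting `aux_weight` to zero recovers the plain negative lower bound.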

Journal Article•DOI•

Use machine learning to find energy materials

Phil De Luna, Jennifer N. Wei, Yoshua Bengio, Alán Aspuru-Guzik, Edward H. Sargent

07 Dec 2017-Nature

TL;DR: Artificial intelligence can speed up research into new photovoltaic, battery and carbon-capture materials, argue Edward Sargent, Alán Aspuru-Guzik and colleagues.

Abstract: Artificial intelligence can speed up research into new photovoltaic, battery and carbon-capture materials, argue Edward Sargent, Alán Aspuru-Guzik and colleagues.
