Showing papers by "Yoshua Bengio published in 2016"

PDF

Open Access

Book•

[...]

Ian Goodfellow¹, Yoshua Bengio², Aaron Courville²•Institutions (2)

18 Nov 2016

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.

...read moreread less

Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

...read moreread less

38,208 citations

Posted Content•

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

[...]

Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio - Show less +1 more

09 Feb 2016-arXiv: Learning

TL;DR: A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

...read moreread less

Abstract: We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.

...read moreread less

2,320 citations

Posted Content•

Theano: A Python framework for fast computation of mathematical expressions

[...]

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul F. Christiano, Tim Cooijmans, Marc-Alexandre Côté, Myriam Côté, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien Demouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, Mélanie Ducoffe, Vincent Dumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain, Xavier Glorot, Ian Goodfellow, Matthew M. Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet, Jean-Philippe Heng, Balázs Hidasi, Sina Honari, Arjun Jain, Sébastien Jean, Kai Jia, Mikhail Korobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, César Laurent, Sean Lee, Simon Lefrancois, Simon Lemieux, Nicholas Léonard, Zhouhan Lin, Jesse A. Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Chris Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, Ying Zhang - Show less +108 more

09 May 2016-arXiv: Symbolic Computation

TL;DR: The performance of Theano is compared against Torch7 and TensorFlow on several machine learning models and recently-introduced functionalities and improvements are discussed.

...read moreread less

Abstract: Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.

...read moreread less

2,194 citations

Proceedings Article•

Building end-to-end dialogue systems using generative hierarchical neural network models

[...]

Iulian Vlad Serban¹, Alessandro Sordoni¹, Yoshua Bengio¹, Aaron Courville¹, Joelle Pineau² - Show less +1 more•Institutions (2)

Université de Montréal¹, McGill University²

12 Feb 2016

TL;DR: The authors extend the hierarchical recurrent encoder-decoder neural network to the dialogue domain, and demonstrate that this model is competitive with state-of-the-art neural language models and backoff n-gram models.

...read moreread less

Abstract: We investigate the task of building open domain, conversational dialogue systems based on large dialogue corpora using generative models. Generative models produce system responses that are autonomously generated word-by-word, opening up the possibility for realistic, flexible interactions. In support of this goal, we extend the recently proposed hierarchical recurrent encoder-decoder neural network to the dialogue domain, and demonstrate that this model is competitive with state-of-the-art neural language models and backoff n-gram models. We investigate the limitations of this and similar approaches, and show how its performance can be improved by bootstrapping the learning from a larger question-answer pair corpus and from pretrained word embeddings.

...read moreread less

1,533 citations

Proceedings Article•

Binarized Neural Networks

[...]

Itay Hubara¹, Matthieu Courbariaux², Daniel Soudry³, Ran El-Yaniv¹, Yoshua Bengio² - Show less +1 more•Institutions (3)

Technion – Israel Institute of Technology¹, Université de Montréal², Columbia University³

08 Feb 2016

...read moreread less

Abstract: We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At train-time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs, we conducted two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. We also report our preliminary results on the challenging ImageNet dataset. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.

...read moreread less

1,425 citations

Posted Content•

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

[...]

Itay Hubara¹, Matthieu Courbariaux², Daniel Soudry³, Ran El-Yaniv¹, Yoshua Bengio² - Show less +1 more•Institutions (3)

Technion – Israel Institute of Technology¹, Université de Montréal², Columbia University³

22 Sep 2016-arXiv: Neural and Evolutionary Computing

TL;DR: A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.

...read moreread less

Abstract: We introduce a method to train Quantized Neural Networks (QNNs) --- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves $51\%$ top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.

...read moreread less

1,232 citations

Proceedings Article•DOI•

End-to-end attention-based large vocabulary speech recognition

[...]

Dzmitry Bahdanau¹, Jan Chorowski², Dmitriy Serdyuk¹, Philemon Brakel¹, Yoshua Bengio¹ - Show less +1 more•Institutions (2)

Université de Montréal¹, University of Wrocław²

20 Mar 2016

TL;DR: This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

...read moreread less

Abstract: Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) Systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters [1,2]. To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels. We show how this setup can be applied to LVCSR by integrating the decoding RNN with an n-gram language model and by speeding up its operation by constraining selections made by the attention mechanism and by reducing the source sequence lengths by pooling information over time. Recognition accuracies similar to other HMM-free RNN-based approaches are reported for the Wall Street Journal corpus.

...read moreread less

1,167 citations

Posted Content•

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

[...]

Simon Jégou, Michal Drozdzal¹, David Vazquez, Adriana Romero, Yoshua Bengio - Show less +1 more•Institutions (1)

École Polytechnique de Montréal¹

28 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: The proposed DenseNets approach achieves state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining, and has much less parameters than currently published best entries for these datasets.

...read moreread less

Abstract: State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model and, optionally, (c) a post-processing module (e.g. Conditional Random Fields) to refine the model predictions. Recently, a new CNN architecture, Densely Connected Convolutional Networks (DenseNets), has shown excellent results on image classification tasks. The idea of DenseNets is based on the observation that if each layer is directly connected to every other layer in a feed-forward fashion then the network will be more accurate and easier to train. In this paper, we extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining. Moreover, due to smart construction of the model, our approach has much less parameters than currently published best entries for these datasets. Code to reproduce the experiments is available here : this https URL

...read moreread less

1,086 citations

Posted Content•

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues

[...]

Iulian Vlad Serban¹, Alessandro Sordoni, Ryan Lowe², Laurent Charlin³, Joelle Pineau², Aaron Courville¹, Yoshua Bengio¹ - Show less +3 more•Institutions (3)

Université de Montréal¹, McGill University², HEC Montréal³

19 May 2016-arXiv: Computation and Language

TL;DR: A neural network-based generative architecture, with latent stochastic variables that span a variable number of time steps, that improves upon recently proposed models and that the latent variables facilitate the generation of long outputs and maintain the context.

...read moreread less

Abstract: Sequential data often possesses a hierarchical structure with complex dependencies between subsequences, such as found between the utterances in a dialogue. In an effort to model this kind of generative process, we propose a neural network-based generative architecture, with latent stochastic variables that span a variable number of time steps. We apply the proposed model to the task of dialogue response generation and compare it with recent neural network architectures. We evaluate the model performance through automatic evaluation metrics and by carrying out a human evaluation. The experiments demonstrate that our model improves upon recently proposed models and that the latent variables facilitate the generation of long outputs and maintain the context.

...read moreread less

853 citations

Proceedings Article•

Unitary evolution recurrent neural networks

[...]

Martin Arjovsky¹, Amar Shah¹, Yoshua Bengio²•Institutions (2)

University of Buenos Aires¹, Université de Montréal²

19 Jun 2016

TL;DR: This work constructs an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned, and demonstrates the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.

...read moreread less

Abstract: Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very longterm dependencies.

...read moreread less

630 citations

Proceedings Article•

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues

[...]

Iulian Vlad Serban¹, Alessandro Sordoni, Ryan Lowe², Laurent Charlin³, Joelle Pineau², Aaron Courville¹, Yoshua Bengio¹ - Show less +3 more•Institutions (3)

Université de Montréal¹, McGill University², HEC Montréal³

19 May 2016

TL;DR: The authors proposed a neural network-based generative architecture with stochastic latent variables that span a variable number of time steps to generate meaningful, long and diverse responses and maintain dialogue state.

...read moreread less

Abstract: Sequential data often possesses hierarchical structures with complex dependencies between sub-sequences, such as found between the utterances in a dialogue. To model these dependencies in a generative framework, we propose a neural network-based generative architecture, with stochastic latent variables that span a variable number of time steps. We apply the proposed model to the task of dialogue response generation and compare it with other recent neural-network architectures. We evaluate the model performance through a human evaluation study. The experiments demonstrate that our model improves upon recently proposed models and that the latent variables facilitate both the generation of meaningful, long and diverse responses and maintaining dialogue state.

...read moreread less

Proceedings Article•DOI•

Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism

[...]

Orhan Firat¹, Kyunghyun Cho², Yoshua Bengio²•Institutions (2)

Middle East Technical University¹, Université de Montréal²

06 Jan 2016

TL;DR: This article proposed a multi-way, multilingual NMT model with a single attention mechanism that is shared across all language pairs and observed that the proposed model significantly improves the translation quality of low-resource language pairs.

...read moreread less

Abstract: We propose multi-way, multilingual neural machine translation. The proposed approach enables a single neural translation model to translate between multiple languages, with a number of parameters that grows only linearly with the number of languages. This is made possible by having a single attention mechanism that is shared across all language pairs. We train the proposed multi-way, multilingual model on ten language pairs from WMT'15 simultaneously and observe clear performance improvements over models trained on only one language pair. In particular, we observe that the proposed model significantly improves the translation quality of low-resource language pairs.

...read moreread less

Posted Content•

Mode Regularized Generative Adversarial Networks

[...]

Tong Che¹, Yanran Li², Athul Paul Jacob¹, Athul Paul Jacob³, Yoshua Bengio¹, Wenjie Li² - Show less +2 more•Institutions (3)

Université de Montréal¹, Hong Kong Polytechnic University², University of Waterloo³

07 Dec 2016-arXiv: Learning

TL;DR: This work introduces several ways of regularizing the objective, which can dramatically stabilize the training of GAN models and shows that these regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.

...read moreread less

Abstract: Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.

...read moreread less

Proceedings Article•DOI•

Pointing the unknown words

[...]

Caglar Gulcehre¹, Sungjin Ahn¹, Ramesh Nallapati², Bowen Zhou², Yoshua Bengio¹ - Show less +1 more•Institutions (2)

Université de Montréal¹, IBM²

26 Mar 2016

TL;DR: This article used two softmax layers in order to predict the next word in conditional language models: one predicts the location of a word in the source sentence, and the other predicts a word from the shortlist vocabulary.

...read moreread less

Abstract: The problem of rare and unknown words is an important issue that can potentially effect the performance of many NLP systems, including traditional count-based and deep learning models. We propose a novel way to deal with the rare and unseen words for the neural network models using attention. Our model uses two softmax layers in order to predict the next word in conditional language models: one predicts the location of a word in the source sentence, and the other predicts a word in the shortlist vocabulary. At each timestep, the decision of which softmax layer to use is adaptively made by an MLP which is conditioned on the context. We motivate this work from a psychological evidence that humans naturally have a tendency to point towards objects in the context or the environment when the name of an object is not known. Using our proposed model, we observe improvements on two tasks, neural machine translation on the Europarl English to French parallel corpora and text summarization on the Gigaword dataset.

...read moreread less

Posted Content•

Hierarchical Multiscale Recurrent Neural Networks

[...]

Junyoung Chung¹, Sungjin Ahn, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

06 Sep 2016-arXiv: Learning

TL;DR: In this paper, a hierarchical multiscale recurrent neural network (HM-RNN) is proposed to capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism.

...read moreread less

Abstract: Learning both hierarchical and temporal representation has been among the long-standing challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence. In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural networks, which can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that our proposed multiscale architecture can discover underlying hierarchical structure in the sequences without using explicit boundary information. We evaluate our proposed model on character-level language modelling and handwriting sequence modelling.

...read moreread less

Journal Article•DOI•

EmoNets: Multimodal deep learning approaches for emotion recognition in video

[...]

Samira Ebrahimi Kahou¹, Xavier Bouthillier¹, Pascal Lamblin¹, Caglar Gulcehre¹, Vincent Michalski², Kishore Konda², Sébastien Jean¹, Pierre Froumenty¹, Yann N. Dauphin¹, Nicolas Boulanger-Lewandowski¹, Raul Chandias Ferrari¹, Mehdi Mirza¹, David Warde-Farley¹, Aaron Courville¹, Pascal Vincent¹, Roland Memisevic¹, Chris Pal¹, Yoshua Bengio¹ - Show less +14 more•Institutions (2)

Université de Montréal¹, Goethe University Frankfurt²

01 Jun 2016-Journal on Multimodal User Interfaces

TL;DR: In this article, the authors presented an approach to learn several specialist models using deep learning techniques, each focusing on one modality, including CNN, deep belief net, K-means based bag-of-mouths, and relational autoencoder.

...read moreread less

Abstract: The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based “bag-of-mouths” model, which extracts visual features around the mouth region and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67 % on the 2014 dataset.

...read moreread less

Posted Content•

Professor Forcing: A New Algorithm for Training Recurrent Networks

[...]

Alex Lamb¹, Anirudh Goyal¹, Ying Zhang¹, Saizheng Zhang¹, Aaron Courville¹, Yoshua Bengio¹ - Show less +2 more•Institutions (1)

Université de Montréal¹

27 Oct 2016-arXiv: Machine Learning

TL;DR: The Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps, is introduced.

...read moreread less

Abstract: The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.

...read moreread less

Proceedings Article•

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

[...]

Soroush Mehri, Kundan Kumar¹, Ishaan Gulrajani², Rithesh Kumar³, Shubham Jain⁴, Jose Sotelo³, Aaron Courville³, Yoshua Bengio³ - Show less +4 more•Institutions (4)

Indian Institute of Technology Kanpur¹, Salesforce.com², Université de Montréal³, Indian Institute of Technology Delhi⁴

04 Nov 2016

TL;DR: It is shown that the model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.

...read moreread less

Abstract: In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature. Human evaluation on the generated samples indicate that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.

...read moreread less

Posted Content•

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

[...]

Soroush Mehri, Kundan Kumar¹, Ishaan Gulrajani², Rithesh Kumar³, Shubham Jain⁴, Jose Sotelo³, Aaron Courville³, Yoshua Bengio³ - Show less +4 more•Institutions (4)

Indian Institute of Technology Kanpur¹, Salesforce.com², Université de Montréal³, Indian Institute of Technology Delhi⁴

22 Dec 2016-arXiv: Sound

TL;DR: In this article, the authors proposed a novel model for unconditional audio generation based on generating one audio sample at a time, which profits from combining memoryless modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure.

...read moreread less

Posted Content•

An Actor-Critic Algorithm for Sequence Prediction

[...]

Dzmitry Bahdanau¹, Philemon Brakel¹, Kelvin Xu¹, Anirudh Goyal¹, Ryan Lowe², Joelle Pineau², Aaron Courville¹, Yoshua Bengio¹ - Show less +4 more•Institutions (2)

Université de Montréal¹, McGill University²

24 Jul 2016-arXiv: Learning

TL;DR: The authors proposed an actor-critic network that is trained to predict the value of an output token, given the policy of an actor network, which leads to improved performance on both a synthetic task and for German-English machine translation.

...read moreread less

Abstract: We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a \textit{critic} network that is trained to predict the value of an output token, given the policy of an \textit{actor} network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task, and for German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling.

...read moreread less

Proceedings Article•

Professor Forcing: A New Algorithm for Training Recurrent Networks

[...]

Alex Lamb¹, Anirudh Goyal¹, Ying Zhang¹, Saizheng Zhang¹, Aaron Courville¹, Yoshua Bengio¹ - Show less +2 more•Institutions (1)

Université de Montréal¹

01 Jan 2016

TL;DR: In this article, the authors introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps.

...read moreread less

Abstract: The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network’s own one-step-ahead predictions to do multi-step sampling We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps This is supported by human evaluation of sample quality Trade-offs between Professor Forcing and Scheduled Sampling are discussed We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar

...read moreread less

Posted Content•

Understanding intermediate layers using linear classifier probes

[...]

Guillaume Alain¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

05 Oct 2016-arXiv: Machine Learning

TL;DR: In this paper, the authors use linear classifiers to monitor the features at every layer of a model and measure how suitable they are for classification, which can be used to develop a better intuition about models and to diagnose potential problems.

...read moreread less

Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increase monotonically along the depth of the model.

...read moreread less

Proceedings Article•

Understanding intermediate layers using linear classifier probes

[...]

Guillaume Alain¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

05 Nov 2016

TL;DR: This work proposes to monitor the features at every layer of a model and measure how suitable they are for classification, using linear classifiers, which are referred to as "probes", trained entirely independently of the model itself.

...read moreread less

Posted Content•

Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation

[...]

Benjamin Scellier¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

16 Feb 2016-arXiv: Learning

TL;DR: In this article, the authors introduce equilibrium propagation, a learning framework for energy-based models that does not need a special computation or circuit for the second phase, where errors are implicitly propagated.

...read moreread less

Abstract: We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point, or stationary distribution) towards a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged towards their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal 'back-propagated' during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not.

...read moreread less

Posted Content•

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

[...]

David Krueger¹, Tegan Maharaj², János Kramár¹, Mohammad Pezeshki¹, Nicolas Ballas¹, Nan Rosemary Ke¹, Anirudh Goyal¹, Yoshua Bengio¹, Aaron Courville¹, Chris Pal¹ - Show less +6 more•Institutions (2)

Université de Montréal¹, École Polytechnique de Montréal²

03 Jun 2016-arXiv: Neural and Evolutionary Computing

TL;DR: This work proposes zoneout, a novel method for regularizing RNNs that uses random noise to train a pseudo-ensemble, improving generalization and performs an empirical investigation of various RNN regularizers, and finds that zoneout gives significant performance improvements across tasks.

...read moreread less

Abstract: We propose zoneout, a novel method for regularizing RNNs At each timestep, zoneout stochastically forces some hidden units to maintain their previous values Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST

...read moreread less

Proceedings Article•DOI•

Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

[...]

Ying Zhang¹, Mohammad Pezeshki², Philemon Brakel², Saizheng Zhang², César Laurent², Yoshua Bengio², Aaron Courville² - Show less +3 more•Institutions (2)

IBM¹, Université de Montréal²

08 Sep 2016

Proceedings Article•DOI•

Batch normalized recurrent neural networks

[...]

César Laurent¹, Gabriel Pereyra², Philemon Brakel¹, Ying Zhang¹, Yoshua Bengio¹ - Show less +1 more•Institutions (2)

Université de Montréal¹, University of Southern California²

20 Mar 2016

TL;DR: This paper investigates how batch normalization can be applied to RNNs and shows that the way it is applied leads to a faster convergence of the training criterion but doesn't seem to improve the generalization performance.

...read moreread less

Abstract: Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feed-forward neural networks [1]. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we investigate how batch normalization can be applied to RNNs. We show for both a speech recognition task and language modeling that the way we apply batch normalization leads to a faster convergence of the training criterion but doesn't seem to improve the generalization performance.

...read moreread less

Proceedings Article•DOI•

Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus

[...]

Iulian Vlad Serban¹, Alberto Garcia-Duran², Caglar Gulcehre¹, Sungjin Ahn, Sarath Chandar¹, Aaron Courville¹, Yoshua Bengio¹ - Show less +3 more•Institutions (2)

Université de Montréal¹, University of Technology of Compiègne²

01 Jan 2016

TL;DR: The 30M Factoid Question Answer Corpus (30MQA) as mentioned in this paper ) is a large-scale question answer corpus, which was created by applying a neural network architecture on the knowledge base Freebase to transduce facts into natural language questions.

...read moreread less

Abstract: Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to this date, there are no large-scale questionanswer corpora available. In this paper we present the 30M Factoid QuestionAnswer Corpus, an enormous questionanswer pair corpus produced by applying a novel neural network architecture on the knowledge base Freebase to transduce facts into natural language questions. The produced question-answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the questiongeneration model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear to be comparable in quality to real human-generated questions. * First authors. ◦ Email: {iulian.vlad.serban,caglar.gulcehre, sungjin.ahn,sarath.chandar.anbil.parthipan, aaron.courville,yoshua.bengio}@umontreal.ca Email: alberto.garcia-duran@utc.fr † CIFAR Senior Fellow

...read moreread less

Posted Content•

Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus

[...]

Iulian Vlad Serban¹, Alberto Garcia-Duran², Caglar Gulcehre¹, Sungjin Ahn, Sarath Chandar¹, Aaron Courville¹, Yoshua Bengio¹ - Show less +3 more•Institutions (2)

Université de Montréal¹, University of Technology of Compiègne²

22 Mar 2016-arXiv: Computation and Language

TL;DR: The 30M Factoid Question-Answer Corpus is presented, an enormous question answer pair corpus produced by applying a novel neural network architecture on the knowledge base Freebase to transduce facts into natural language questions.

...read moreread less

Abstract: Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to this date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question answer pair corpus produced by applying a novel neural network architecture on the knowledge base Freebase to transduce facts into natural language questions. The produced question answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the question-generation model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear comparable in quality to real human-generated questions.

...read moreread less

Posted Content•

Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism

[...]

Orhan Firat¹, Kyunghyun Cho², Yoshua Bengio²•Institutions (2)

Middle East Technical University¹, Université de Montréal²

06 Jan 2016-arXiv: Computation and Language

TL;DR: The proposed multi-way, multilingual neural machine translation approach enables a single neural translation model to translate between multiple languages, with a number of parameters that grows only linearly with the number of languages.

...read moreread less