Showing papers by "Yoshua Bengio" published in 2013

Open Access

Journal Article•DOI•

Representation Learning: A Review and New Perspectives

Yoshua Bengio, Aaron Courville, Pascal Vincent

01 Aug 2013-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.

Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations

Proceedings Article•

On the difficulty of training recurrent neural networks

Razvan Pascanu¹, Tomas Mikolov², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Brno University of Technology²

16 Jun 2013

TL;DR: In this article, a gradient norm clipping strategy is proposed to deal with exploding gradients, and a soft constraint to deal with vanishing gradients, in recurrent neural networks.

Abstract: There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.
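
The gradient norm clipping idea named in this abstract can be illustrated with a minimal sketch. It assumes the common rescaling rule (shrink the gradient so its L2 norm equals a threshold whenever it exceeds that threshold); the function name `clip_gradient_norm` and the `threshold` parameter are hypothetical names chosen for illustration, not taken from the listing.

```python
import math

def clip_gradient_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold (assumed rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    # Gradients already within the threshold are returned unchanged.
    return list(grad)

# A gradient of norm 5 is rescaled to norm 1; its direction is preserved.
clipped = clip_gradient_norm([3.0, 4.0], threshold=1.0)
```

Only the magnitude changes under this rule, so the update direction is kept while very large steps are avoided.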

2,586 citations

Posted Content•

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, Aaron Courville

15 Aug 2013-arXiv: Learning

TL;DR: This work considers a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network.

Abstract: Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochastic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochastic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.
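
The fourth approach in this abstract, the straight-through estimator, can be sketched for a single stochastic binary neuron. This is an illustration under assumptions rather than the authors' code: the neuron is assumed to sample h from a Bernoulli distribution with probability sigmoid(a), and the names `stochastic_binary_forward` and `straight_through_backward` are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def stochastic_binary_forward(a, rng):
    """Sample h in {0, 1} with P(h = 1) = sigmoid(a) (assumed neuron model)."""
    return 1 if rng.random() < sigmoid(a) else 0

def straight_through_backward(grad_wrt_h):
    """Heuristic: copy the gradient w.r.t. the output h as the gradient w.r.t. a."""
    return grad_wrt_h

rng = random.Random(0)
samples = [stochastic_binary_forward(0.0, rng) for _ in range(1000)]
grad_a = straight_through_backward(0.25)
```

The backward function ignores the sampling step entirely, which is what makes the estimator a heuristic rather than an unbiased one.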

2,178 citations

Proceedings Article•

Maxout Networks

Ian Goodfellow¹, David Warde-Farley¹, Mehdi Mirza¹, Aaron Courville¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

16 Jun 2013

TL;DR: A simple new model called maxout is defined, designed both to facilitate optimization by dropout and to improve the accuracy of dropout's fast approximate model averaging technique.

Abstract: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
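
The abstract describes a maxout unit only as "the max of a set of inputs". A minimal sketch follows, assuming those inputs are k affine functions of the unit's input vector; the function name `maxout_unit` and the example weights are hypothetical.

```python
def maxout_unit(x, weights, biases):
    """Return max_j (w_j . x + b_j) over the k affine pieces of one unit."""
    pieces = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(pieces)

# Two pieces with weights +1 and -1 and zero biases give the absolute value.
abs_weights = [[1.0], [-1.0]]
abs_biases = [0.0, 0.0]
outputs = [maxout_unit([x], abs_weights, abs_biases) for x in (-2.0, 0.5)]
```

Under this assumption a unit with k pieces computes a convex piecewise linear function of its input, the absolute value being the simplest two-piece case.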

1,692 citations

Posted Content•

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian Goodfellow¹, Mehdi Mirza¹, Da Xiao², Aaron Courville¹, Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Beijing University of Posts and Telecommunications²

21 Dec 2013-arXiv: Machine Learning

TL;DR: In this article, the authors investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions and find that the dropout algorithm is consistently best at adapting to the new task, remembering the old task and has the best tradeoff curve between these two extremes.

Abstract: Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.

755 citations

Book Chapter•DOI•

Challenges in Representation Learning: A Report on Three Machine Learning Contests

Ian Goodfellow¹, Dumitru Erhan², Pierre Luc Carrier¹, Aaron Courville¹, Mehdi Mirza¹, Ben Hamner¹, William Cukierski¹, Yichuan Tang¹, David Thaler¹, Dong-Hyun Lee¹, Yingbo Zhou¹, Chetan Ramaiah¹, Fangxiang Feng¹, Ruifan Li¹, Xiaojie Wang¹, Dimitris Athanasakis¹, John Shawe-Taylor¹, Maxim Milakov¹, John Park¹, Radu Ionescu¹, Marius Popescu¹, Cristian Grozea¹, James Bergstra¹, Jingjing Xie¹, Lukasz Romaszko¹, Bing Xu¹, Zhang Chuang¹, Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Google²

03 Nov 2013

TL;DR: The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge.

Abstract: The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kind of knowledge can be gained from machine learning competitions.

737 citations

Posted Content•

How to Construct Deep Recurrent Neural Networks

Razvan Pascanu¹, Caglar Gulcehre¹, Kyunghyun Cho², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Aalto University²

20 Dec 2013-arXiv: Neural and Evolutionary Computing

TL;DR: In this article, the authors explore different ways to extend a recurrent neural network (RNN) to a deep RNN by carefully analyzing and understanding the architecture of an RNN.

Abstract: In this paper, we explore different ways to extend a recurrent neural network (RNN) to a deep RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.

690 citations

Posted Content•

Maxout Networks

Ian Goodfellow¹, David Warde-Farley¹, Mehdi Mirza¹, Aaron Courville¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

18 Feb 2013-arXiv: Machine Learning

TL;DR: In this article, a simple new model called maxout, a natural companion to dropout, is proposed both to facilitate optimization by dropout and to improve the accuracy of dropout's fast approximate model averaging technique.

Abstract: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.

672 citations

Posted Content•

Challenges in Representation Learning: A report on three machine learning contests

Université de Montréal¹, Google²

01 Jul 2013-arXiv: Machine Learning

510 citations

Posted Content•

Deep Learning of Representations: Looking Forward

Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

02 May 2013-arXiv: Learning

TL;DR: In this paper, the authors examine some of the challenges of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data.

Abstract: Deep learning research aims at discovering learning algorithms that discover multiple levels of distributed representations, with higher levels representing more abstract concepts. Although the study of deep learning has already led to impressive theoretical results, learning algorithms and breakthrough experiments, several challenges lie ahead. This paper proposes to examine some of these challenges, centering on the questions of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data. It also proposes a few forward-looking research directions aimed at overcoming these challenges.

459 citations

Posted Content•

Generalized Denoising Auto-Encoders as Generative Models

Yoshua Bengio¹, Li Yao¹, Guillaume Alain¹, Pascal Vincent¹•Institutions (1)

Université de Montréal¹

29 May 2013-arXiv: Learning

TL;DR: A different attack on the problem is proposed, which deals with arbitrary (but noisy enough) corruption, arbitrary reconstruction loss, handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise.

Abstract: Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).

Posted Content•

A Semantic Matching Energy Function for Learning with Multi-relational Data

Xavier Glorot¹, Antoine Bordes², Jason Weston³, Yoshua Bengio¹•Institutions (3)

Université de Montréal¹, University of Technology of Compiègne², Google³

15 Jan 2013-arXiv: Learning

TL;DR: A new neural network architecture designed to embed multi-relational graphs into a flexible continuous vector space in which the original data is kept and enhanced, demonstrating that it can scale up to tens of thousands of nodes and thousands of types of relation.

Abstract: Large-scale relational learning becomes crucial for handling the huge amounts of structured data generated daily in many application domains ranging from computational biology or information retrieval, to natural language processing. In this paper, we present a new neural network architecture designed to embed multi-relational graphs into a flexible continuous vector space in which the original data is kept and enhanced. The network is trained to encode the semantics of these graphs in order to assign high probabilities to plausible components. We empirically show that it reaches competitive performance in link prediction on standard datasets from the literature.

Book Chapter•DOI•

Deep learning of representations: looking forward

Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

29 Jul 2013

TL;DR: This paper proposes to examine some of the challenges of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data.

Proceedings Article•DOI•

Combining modality specific deep neural networks for emotion recognition in video

Samira Ebrahimi Kahou¹, Chris Pal¹, Xavier Bouthillier², Pierre Froumenty¹, Caglar Gulcehre², Roland Memisevic², Pascal Vincent², Aaron Courville², Yoshua Bengio², Raul Chandias Ferrari², Mehdi Mirza², Sébastien Jean², Pierre Luc Carrier², Yann N. Dauphin², Nicolas Boulanger-Lewandowski², Abhishek Aggarwal², Jeremie Zumer, Pascal Lamblin², Jean-Philippe Raymond², Guillaume Desjardins², Razvan Pascanu², David Warde-Farley², Atousa Torabi², Arjun Sharma², Emmanuel Bengio², Myriam Côté³, Kishore Konda³, Zhenzhou Wu⁴•Institutions (4)

École Polytechnique de Montréal¹, Université de Montréal², Goethe University Frankfurt³, McGill University⁴

09 Dec 2013

TL;DR: This paper presents the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge, a challenge to classify the emotions expressed by the primary human subject in short video clips extracted from feature length movies.

Abstract: In this paper we present the techniques used for the University of Montreal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature length movies. This involves the analysis of video clips of acted scenes lasting approximately one to two seconds, including the audio track which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of face images harvested from Google image search, followed by a per frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top performing models into a single predictor we were able to produce an accuracy of 41.03% on the challenge test set. These compare favorably to the challenge baseline test set accuracy of 27.56%.

Proceedings Article•DOI•

Advances in optimizing recurrent networks

Yoshua Bengio, Nicolas Boulanger-Lewandowski, Razvan Pascanu

26 May 2013

TL;DR: In this paper, the authors evaluate the use of clipping gradients, spanning longer time ranges with leaky integration, advanced momentum techniques, using more powerful output probability models, and encouraging sparser gradients to help symmetry breaking and credit assignment.

Abstract: After a more than decade-long period of relatively little research activity in the area of recurrent neural networks, several new developments will be reviewed here that have allowed substantial progress both in understanding and in technical solutions towards more efficient training of recurrent networks. These advances have been motivated by and related to the optimization issues surrounding deep learning. Although recurrent networks are extremely powerful in what they can in principle represent in terms of modeling sequences, their training is plagued by two aspects of the same issue regarding the learning of long-term dependencies. Experiments reported here evaluate the use of clipping gradients, spanning longer time ranges with leaky integration, advanced momentum techniques, using more powerful output probability models, and encouraging sparser gradients to help symmetry breaking and credit assignment. The experiments are performed on text and music data and show off the combined effects of these techniques in generally improving both training and test error.

Posted Content•

Deep Generative Stochastic Networks Trainable by Backprop

Yoshua Bengio, Éric Thibodeau-Laufer, Guillaume Alain, Jason Yosinski

05 Jun 2013-arXiv: Learning

TL;DR: Generative stochastic networks (GSNs), a training principle proposed as an alternative to maximum likelihood, learn the transition operator of a Markov chain whose stationary distribution estimates the data distribution.

Abstract: We introduce a novel training principle for probabilistic models that is an alternative to maximum likelihood. The proposed Generative Stochastic Networks (GSN) framework is based on learning the transition operator of a Markov chain whose stationary distribution estimates the data distribution. The transition distribution of the Markov chain is conditional on the previous state, generally involving a small move, so this conditional distribution has fewer dominant modes, being unimodal in the limit of small moves. Thus, it is easier to learn because it is easier to approximate its partition function, more like learning to perform supervised function approximation, with gradients that can be obtained by backprop. We provide theorems that generalize recent work on the probabilistic interpretation of denoising autoencoders and obtain along the way an interesting justification for dependency networks and generalized pseudolikelihood, along with a definition of an appropriate joint distribution and sampling mechanism even when the conditionals are not consistent. GSNs can be used with missing inputs and can be used to sample subsets of variables given the rest. We validate these theoretical results with experiments on two image datasets using an architecture that mimics the Deep Boltzmann Machine Gibbs sampler but allows training to proceed with simple backprop, without the need for layerwise pretraining.

Proceedings Article•

What Regularized Auto-Encoders Learn from the Data Generating Distribution

Guillaume Alain¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

17 Jan 2013

TL;DR: It is shown that the auto-encoder captures the score (derivative of the log-density with respect to the input), which contradicts previous interpretations of reconstruction error as an energy function.

Abstract: What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of data. This paper clarifies some of these previous observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data generating density. We show that the auto-encoder captures the score (derivative of the log-density with respect to the input). It contradicts previous interpretations of reconstruction error as an energy function. Unlike previous results, the theorems provided here are completely generic and do not depend on the parametrization of the auto-encoder: they show what the auto-encoder would tend to if given enough capacity and examples. These results are for a contractive training criterion we show to be similar to the denoising auto-encoder training criterion with small corruption noise, but with contraction applied on the whole reconstruction function rather than just the encoder. Similarly to score matching, one can consider the proposed training criterion as a convenient alternative to maximum likelihood because it does not involve a partition function. Finally, we show how an approximate Metropolis-Hastings MCMC can be set up to recover samples from the estimated distribution, and this is confirmed in sampling experiments.

DOI•

On the Optimization of a Synaptic Learning Rule

Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, Jan Gecsei

17 Jun 2013

TL;DR: A new approach to neural modeling is presented, based on using an automated method to optimize the parameters of a synaptic learning rule, together with a theoretical analysis of the generalization properties of such parametric learning rules.

Abstract: Because the domain of possible learning algorithms is large, we propose to constrain it by using in Eq. (1) only already known, biologically plausible synaptic mechanisms. Hence, we consider only local variables, such as presynaptic activity, postsynaptic potential, synaptic strength, the activity of a facilitatory neuron, and the concentration of a diffusely acting neuromodulator. Figure 14.1 shows the interaction between those elements. Constraining the learning rule to be biologically plausible should not be seen as an artificial constraint but rather as a way to restrain the search space such that it is consistent with solutions that we believe to be used in the brain. This constraint might ease the search for new learning rules (Fig. 14.2).

Journal Article•DOI•

Learning deep physiological models of affect

Hector P. Martinez¹, Yoshua Bengio², Georgios N. Yannakakis³•Institutions (3)

IT University of Copenhagen¹, Université de Montréal², University of Malta³

01 May 2013-IEEE Computational Intelligence Magazine

TL;DR: All three key components of an affective model are touched upon, and the use of deep learning (DL) methodologies for affective modeling from multiple physiological signals is introduced.

Abstract: More than 15 years after the early studies in Affective Computing (AC), [1] the problem of detecting and modeling emotions in the context of human-computer interaction (HCI) remains complex and largely unexplored. The detection and modeling of emotion is, primarily, the study and use of artificial intelligence (AI) techniques for the construction of computational models of emotion. The key challenges one faces when attempting to model emotion [2] are inherent in the vague definitions and fuzzy boundaries of emotion, and in the modeling methodology followed. In this context, open research questions are still present in all key components of the modeling process. These include, first, the appropriateness of the modeling tool employed to map emotional manifestations and responses to annotated affective states; second, the processing of signals that express these manifestations (i.e., model input); and third, the way affective annotation (i.e., model output) is handled. This paper touches upon all three key components of an affective model (i.e., input, model, output) and introduces the use of deep learning (DL) [3], [4], [5] methodologies for affective modeling from multiple physiological signals.

Proceedings Article•

Better Mixing via Deep Representations

Yoshua Bengio¹, Grégoire Mesnil¹,², Yann N. Dauphin¹, Salah Rifai¹•Institutions (2)

Université de Montréal¹, University of Rouen²

16 Jun 2013

TL;DR: In this paper, it is conjectured that higher-level samples fill the space they occupy more uniformly and that high-density manifolds tend to unfold when represented at higher levels.

Abstract: It has been hypothesized, and supported with experimental evidence, that deeper representations, when well trained, tend to do a better job at disentangling the underlying factors of variation. We study the following related conjecture: better representations, in the sense of better disentangling, can be exploited to produce Markov chains that mix faster between modes. Consequently, mixing between modes would be more efficient at higher levels of representation. To better understand this, we propose a secondary conjecture: the higher-level samples fill more uniformly the space they occupy and the high-density manifolds tend to unfold when represented at higher levels. The paper discusses these hypotheses and tests them experimentally through visualization and measurements of mixing between modes and interpolating between samples.

Posted Content•

Pylearn2: a machine learning research library

Ian Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien, Yoshua Bengio

20 Aug 2013-arXiv: Machine Learning

TL;DR: A brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially are given.

Abstract: Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.

Posted Content•

On the number of response regions of deep feed forward networks with piece-wise linear activations

Razvan Pascanu, Guido Montúfar, Yoshua Bengio

20 Dec 2013-arXiv: Learning

TL;DR: In this paper, the complexity of deep feedforward networks with linear pre-synaptic couplings and rectified linear activations is compared with a single layer version of the model.

Abstract: This paper explores the complexity of deep feedforward networks with linear pre-synaptic couplings and rectified linear activations. This is a contribution to the growing body of work contrasting the representational power of deep and shallow network architectures. In particular, we offer a framework for comparing deep and shallow models that belong to the family of piecewise linear functions based on computational geometry. We look at a deep rectifier multi-layer perceptron (MLP) with linear output units and compare it with a single layer version of the model. In the asymptotic regime, when the number of inputs stays constant, if the shallow model has $kn$ hidden units and $n_0$ inputs, then the number of linear regions is $O(k^{n_0}n^{n_0})$. For a $k$ layer model with $n$ hidden units on each layer it is $\Omega(\left\lfloor {n}/{n_0}\right\rfloor^{k-1}n^{n_0})$. The number $\left\lfloor{n}/{n_0}\right\rfloor^{k-1}$ grows faster than $k^{n_0}$ when $n$ tends to infinity or when $k$ tends to infinity and $n \geq 2n_0$. Additionally, even when $k$ is small, if we restrict $n$ to be $2n_0$, we can show that a deep model has considerably more linear regions than a shallow one. We consider this as a first step towards understanding the complexity of these models and specifically towards providing suitable mathematical tools for future analysis.
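
To get a feel for how the two expressions in this abstract grow with depth, the sketch below evaluates the terms inside the $O(\cdot)$ and $\Omega(\cdot)$ bounds for a few values of $k$. It is only an illustration: constants hidden by the asymptotic notation are dropped, so the numbers are not region counts, and values at small $k$ say nothing about the paper's separate small-$k$ result. The function names are hypothetical.

```python
def shallow_region_term(k, n, n0):
    """k**n0 * n**n0, the term inside the O(.) bound for kn hidden units."""
    return (k ** n0) * (n ** n0)

def deep_region_term(k, n, n0):
    """floor(n / n0)**(k - 1) * n**n0, the term inside the Omega(.) bound."""
    return (n // n0) ** (k - 1) * (n ** n0)

n0, n = 2, 4  # illustrative choice with n = 2 * n0
table = [
    (k, shallow_region_term(k, n, n0), deep_region_term(k, n, n0))
    for k in (2, 5, 10)
]
# table == [(2, 64, 32), (5, 400, 256), (10, 1600, 8192)]
```

With these illustrative values the deep term is polynomially behind at small $k$ but overtakes the shallow term as $k$ grows, reflecting the exponential factor $\lfloor n/n_0 \rfloor^{k-1}$ against the polynomial factor $k^{n_0}$.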

Proceedings Article•

Generalized Denoising Auto-Encoders as Generative Models

Yoshua Bengio¹, Li Yao¹, Guillaume Alain¹, Pascal Vincent¹•Institutions (1)

Université de Montréal¹

05 Dec 2013

TL;DR: In this paper, the authors propose a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise.

Proceedings Article•

Audio Chord Recognition with Recurrent Neural Networks.

Nicolas Boulanger-Lewandowski¹, Yoshua Bengio¹, Pascal Vincent¹•Institutions (1)

Université de Montréal¹

01 Jan 2013

TL;DR: An efficient algorithm to search for the global mode of the output distribution while taking long-term dependencies into account is devised and the resulting method is competitive with state-of-the-art approaches on the MIREX dataset in the major/minor prediction task.

Abstract: In this paper, we present an audio chord recognition system based on a recurrent neural network. The audio features are obtained from a deep neural network optimized with a combination of chromagram targets and chord information, and aggregated over different time scales. Contrary to other existing approaches, our system incorporates acoustic and musicological models under a single training objective. We devise an efficient algorithm to search for the global mode of the output distribution while taking long-term dependencies into account. The resulting method is competitive with state-of-the-art approaches on the MIREX dataset in the major/minor prediction task.

Proceedings Article•

Multi-Prediction Deep Boltzmann Machines

Ian Goodfellow¹, Mehdi Mirza¹, Aaron Courville¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

05 Dec 2013

TL;DR: The multi-prediction deep Boltzmann machine does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.

Abstract: We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.

Posted Content•

Knowledge Matters: Importance of Prior Information for Optimization

Caglar Gulcehre¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

17 Jan 2013-arXiv: Learning

TL;DR: In this paper, the authors explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn.

Abstract: We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset of 64x64 binary input images, each containing three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed at different locations in an image using scaling and rotation transformations. The first part of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second part takes the output of the first part as input and predicts the final task's target binary event. The two-tiered MLP architecture, with a few tens of thousands of examples, was able to learn the task perfectly, whereas all other algorithms (including unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the composition of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.

Proceedings Article•DOI•

Modeling term dependencies with quantum language models for IR

Alessandro Sordoni¹, Jian-Yun Nie¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

28 Jul 2013

TL;DR: This work develops a new, generalized Language Modeling approach for IR by adopting the probabilistic framework of QT, which is the first practical application of quantum probability to show significant improvements over a robust bag-of-words baseline and achieves better performance on a stronger non bag-of-words baseline.

Abstract: Traditional information retrieval (IR) models use bag-of-words as the basic representation and assume that some form of independence holds between terms. Representing term dependencies and defining a scoring function capable of integrating such additional evidence is theoretically and practically challenging. Recently, Quantum Theory (QT) has been proposed as a possible, more general framework for IR. However, only a limited number of investigations have been made and the potential of QT has not been fully explored and tested. We develop a new, generalized Language Modeling approach for IR by adopting the probabilistic framework of QT. In particular, quantum probability could account for both single and compound terms at once without having to extend the term space artificially as in previous studies. This naturally allows us to avoid the weight-normalization problem, which arises in the current practice by mixing scores from matching compound terms and from matching single terms. Our model is the first practical application of quantum probability to show significant improvements over a robust bag-of-words baseline and achieves better performance on a stronger non bag-of-words baseline.

Posted Content•

Estimating or Propagating Gradients Through Stochastic Neurons

Yoshua Bengio

14 May 2013-arXiv: Learning

TL;DR: It is demonstrated that a simple biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron firing probability, and an approach is proposed to approximate this unbiased but high-variance estimator by learning to predict it using a biased estimator.

Abstract: Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we “back-propagate” through these stochastic neurons? We examine this question, existing approaches, and present two novel families of solutions, applicable in different settings. In particular, it is demonstrated that a simple biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron firing probability. Unlike other estimators which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. This estimator is also interesting because it can be applied in very general settings which do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it using a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated and it provides an unbiased estimator of the gradient, but can only work with non-linearities unlike the hard threshold, but like the rectifier, that are not flat for all of their range. This is similar to traditional sigmoidal units but has the advantage that for many inputs, a hard decision (e.g., a 0 output) can be produced, which would be convenient for conditional computation and achieving sparse representations and sparse gradients.
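
As a rough illustration of what an unbiased but noisy gradient estimator for a binary stochastic neuron can look like, the sketch below uses the standard likelihood-ratio (score-function) estimator, which may differ in detail from the formula in the paper. It assumes the neuron fires with probability p = sigmoid(a) and estimates the gradient of the expected loss with respect to the pre-activation a; the toy loss and all function names are hypothetical.

```python
import math
import random


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def toy_loss(h: int) -> float:
    # Hypothetical loss on the binary output, chosen only for illustration.
    return (h - 0.2) ** 2


def sampled_gradient(a: float, rng: random.Random) -> float:
    """One noisy sample of d E[loss] / d a, using (h - p) * loss(h)."""
    p = sigmoid(a)
    h = 1 if rng.random() < p else 0  # sample the binary stochastic neuron
    return (h - p) * toy_loss(h)


def exact_gradient(a: float) -> float:
    """Closed form: E[loss] = p * loss(1) + (1 - p) * loss(0), dp/da = p * (1 - p)."""
    p = sigmoid(a)
    return p * (1.0 - p) * (toy_loss(1) - toy_loss(0))


def monte_carlo_gradient(a: float, num_samples: int, seed: int = 0) -> float:
    """Average many noisy samples; the mean approaches the exact gradient."""
    rng = random.Random(seed)
    total = sum(sampled_gradient(a, rng) for _ in range(num_samples))
    return total / num_samples
```

No small-perturbation assumption is needed here: the estimator only requires sampling h and evaluating the loss, so it does not need the loss to be differentiable with respect to h. Individual samples are noisy, which is the high variance the abstract refers to.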

Proceedings Article•DOI•

High-dimensional sequence transduction

Nicolas Boulanger-Lewandowski¹, Yoshua Bengio¹, Pascal Vincent¹•Institutions (1)

Université de Montréal¹

26 May 2013

TL;DR: In this article, a probabilistic model based on a recurrent neural network is introduced that learns realistic output distributions given the input, and an efficient algorithm is devised to search for the global mode of that distribution.

Abstract: We investigate the problem of transforming an input sequence into a high-dimensional output sequence in order to transcribe polyphonic audio music into symbolic notation. We introduce a probabilistic model based on a recurrent neural network that is able to learn realistic output distributions given the input and we devise an efficient algorithm to search for the global mode of that distribution. The resulting method produces musically plausible transcriptions even under high levels of noise and drastically outperforms previous state-of-the-art approaches on five datasets of synthesized sounds and real recordings, approximately halving the test error rate.

Book Chapter•DOI•

Deep Learning of Representations

Yoshua Bengio¹, Aaron Courville¹•Institutions (1)

Université de Montréal¹

01 Jan 2013

TL;DR: This chapter reviews the main motivations and ideas behind deep learning algorithms and their representation-learning components, as well as recent results, and proposes a vision of challenges and hopes on the road ahead, focusing on the questions of invariance and disentangling.

Abstract: Unsupervised learning of representations has been found useful in many applications and benefits from several advantages, e.g., where there are many unlabeled examples and few labeled ones (semi-supervised learning), or where the unlabeled or labeled examples are from a distribution different but related to the one of interest (self-taught learning, multi-task learning, and domain adaptation). Some of these algorithms have successfully been used to learn a hierarchy of features, i.e., to build a deep architecture, either as initialization for a supervised predictor, or as a generative model. Deep learning algorithms can yield representations that are more abstract and better disentangle the hidden factors of variation underlying the unknown generating distribution, i.e., to capture invariances and discover non-local structure in that distribution. This chapter reviews the main motivations and ideas behind deep learning algorithms and their representation-learning components, as well as recent results in this area, and proposes a vision of challenges and hopes on the road ahead, focusing on the questions of invariance and disentangling.
