Top 18 papers published by Yoshua Bengio from Université de Montréal in 2010

Proceedings Article•

Understanding the difficulty of training deep feedforward neural networks

[...]

Xavier Glorot¹, Yoshua Bengio¹•Institutions (1)

31 Mar 2010

TL;DR: The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.

...read moreread less

Abstract: Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence. 1 Deep Neural Networks Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include Appearing in Proceedings of the 13 International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: WC Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009), suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures. Most of the recent experimental results with deep architecture are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pretraining (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pretraining is a particular form of initialization and it has a drastic impact).

...read moreread less

9,500 citations

Journal Article•

Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

[...]

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol - Show less +1 more

01 Mar 2010-Journal of Machine Learning Research

TL;DR: Denoising autoencoders as mentioned in this paper are trained locally to denoise corrupted versions of their inputs, which is a straightforward variation on the stacking of ordinary autoencoder.

...read moreread less

Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.

...read moreread less

4,814 citations

Proceedings Article•

Word Representations: A Simple and General Method for Semi-Supervised Learning

[...]

Joseph Turian¹, Lev-Arie Ratinov², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, University of Illinois at Urbana–Champaign²

11 Jul 2010

TL;DR: This work evaluates Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeds of words on both NER and chunking, and finds that each of the three word representations improves the accuracy of these baselines.

...read moreread less

Abstract: If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/

...read moreread less

2,243 citations

Journal Article•

Why Does Unsupervised Pre-training Help Deep Learning?

[...]

Dumitru Erhan¹, Yoshua Bengio¹, Aaron Courville¹, Pierre-Antoine Manzagol¹, Pascal Vincent¹, Samy Bengio² - Show less +2 more•Institutions (2)

Université de Montréal¹, Google²

01 Mar 2010-Journal of Machine Learning Research

TL;DR: In this paper, the authors empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples, and they suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization.

...read moreread less

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

...read moreread less

2,036 citations

Proceedings Article•DOI•

Theano: A CPU and GPU Math Compiler in Python

[...]

James Bergstra¹, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, Yoshua Bengio - Show less +5 more•Institutions (1)

Université de Montréal¹

01 Jan 2010

TL;DR: This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

...read moreread less

Abstract: Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy's syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy's syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), compiles them into dynamically loaded Python modules, all automatically. Common machine learn- ing algorithms implemented with Theano are from 1:6 to 7:5 faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU and between 6:5 and 44 faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

...read moreread less

939 citations

Proceedings Article•

Why Does Unsupervised Pre-training Help Deep Learning?

[...]

Dumitru Erhan¹, Aaron Courville¹, Yoshua Bengio¹, Pascal Vincent¹•Institutions (1)

Université de Montréal¹

31 Mar 2010

627 citations

Journal Article•DOI•

Deep belief networks are compact universal approximators

[...]

Nicolas Le Roux¹, Yoshua Bengio²•Institutions (2)

Microsoft¹, Université de Montréal²

01 Aug 2010-Neural Computation

TL;DR: It is proved that deep but narrow feedforward neural networks with sigmoidal units can represent any Boolean expression.

...read moreread less

Abstract: Deep belief networks (DBN) are generative models with many layers of hidden causal variables, recently introduced by Hinton, Osindero, and Teh (2006), along with a greedy layer-wise unsupervised learning algorithm. Building on Le Roux and Bengio (2008) and Sutskever and Hinton (2008), we show that deep but narrow generative networks do not require more parameters than shallow ones to achieve universal approximation. Exploiting the proof technique, we prove that deep but narrow feedforward neural networks with sigmoidal units can represent any Boolean expression.

...read moreread less

170 citations

Proceedings Article•

Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines

[...]

Guillaume Desjardins¹, Aaron Courville¹, Yoshua Bengio¹, Pascal Vincent¹, Olivier Delalleau¹ - Show less +1 more•Institutions (1)

Université de Montréal¹

31 Mar 2010

TL;DR: This work explores the use of tempered Markov Chain Monte-Carlo for sampling in RBMs and finds both through visualization of samples and measures of likelihood that it helps both sampling and learning.

...read moreread less

Abstract: Alternating Gibbs sampling is the most common scheme used for sampling from Restricted Boltzmann Machines (RBM), a crucial component in deep architectures such as Deep Belief Networks. However, we find that it often does a very poor job of rendering the diversity of modes captured by the trained model. We suspect that this hinders the advantage that could in principle be brought by training algorithms relying on Gibbs sampling for uncovering spurious modes, such as the Persistent Contrastive Divergence algorithm. To alleviate this problem, we explore the use of tempered Markov Chain Monte-Carlo for sampling in RBMs. We find both through visualization of samples and measures of likelihood that it helps both sampling and learning.

...read moreread less

134 citations

Journal Article•

Understanding the difficulty of training deep feedforward neural networks

[...]

Xavier Glorot¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

01 Jan 2010-Journal of Machine Learning Research

TL;DR: In this article, the authors show that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation.

...read moreread less

Abstract: Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence. 1 Deep Neural Networks Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include Appearing in Proceedings of the 13 International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: WC Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009), suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures. Most of the recent experimental results with deep architecture are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pretraining (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pretraining is a particular form of initialization and it has a drastic impact).

...read moreread less

125 citations

Parallel Tempering for Training of Restricted Boltzmann Machines

[...]

Guillaume Desjardins, Aaron Courville, Yoshua Bengio, Pascal Vincent, Olivier Delalleau - Show less +1 more

01 Jan 2010

TL;DR: This work explores the use of tempered Markov Chain Monte-Carlo for sampling in RBMs and finds both through visualization of samples and measures of likelihood on a toy dataset that it helps both sampling and learning.

...read moreread less

Abstract: Alternating Gibbs sampling between visible and latent units is the most common scheme used for sampling from Restricted Boltzmann Machines (RBM), a crucial component in deep architectures such as Deep Belief Networks (DBN). However, we find that it often does a very poor job of rendering the diversity of modes captured by the trained model. We suspect that this property hinders RBM training methods such as the Contrastive Divergence and Persistent Contrastive Divergence algorithm that rely on Gibbs sampling to approximate the likelihood gradient. To alleviate this problem, we explore the use of tempered Markov Chain Monte-Carlo for sampling in RBMs. We find both through visualization of samples and measures of likelihood on a toy dataset that it helps both sampling and learning.

...read moreread less

77 citations

Journal Article•DOI•

Decision trees do not generalize to new variations

[...]

Yoshua Bengio¹, Olivier Delalleau¹, Clarence Simard¹•Institutions (1)

Université de Montréal¹

01 Nov 2010

TL;DR: It is demonstrated formally that decision trees can be seriously hurt by the curse of dimensionality in a sense that is a bit different from other nonparametric statistical methods, but most importantly that they cannot generalize to variations not seen in the training set.

...read moreread less

Abstract: e de MontrMontreal, Canada The family of decision tree learning algorithms is among the most widespread and studied. Motivated by the desire to develop learning algorithms that can generalize when learning highly varying functions such as those presumably needed to achieve artificial intelligence, we study some theoretical limitations of decision trees. We demonstrate formally that they can be seriously hurt by the curse of dimensionality in a sense that is a bit different from other nonparametric statistical methods, but most importantly, that they cannot generalize to variations not seen in the training set. This is because a decision tree creates a partition of the input space and needs at least one example in each of the regions associated with a leaf to make a sensible prediction in that region. A better understanding of the fundamental reasons for this limitation suggests that one should use forests or even deeper architectures instead of trees, which provide a form of distributed representation and can generalize to variations not encountered in the training data.

...read moreread less

Proceedings Article•

Learning tags that vary within a song

[...]

Michael I. Mandel¹, Douglas Eck¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

01 Jan 2010

TL;DR: It is found that the agreement between different people’s tags decreases as the distance between the parts of a song that they heard increases, and a conditional restricted Boltzmann machine is described to model this relationship.

...read moreread less

Abstract: This paper examines the relationship between human generated tags describing different parts of the same song. These tags were collected using Amazon’s Mechanical Turk service. We find that the agreement between different people’s tags decreases as the distance between the parts of a song that they heard increases. To model these tags and these relationships, we describe a conditional restricted Boltzmann machine. Using this model to fill in tags that should probably be present given a context of other tags, we train automatic tag classifiers (autotaggers) that outperform those trained on the original data.

...read moreread less

Posted Content•

Adaptive Parallel Tempering for Stochastic Maximum Likelihood Learning of RBMs

[...]

Guillaume Desjardins, Aaron Courville, Yoshua Bengio

15 Dec 2010-arXiv: Machine Learning

TL;DR: This paper shows that the choice of optimal temperatures can be automated by minimizing average return time while chains can be spawned dynamically, as needed, thus minimizing the computational overhead, and results in better likelihood scores.

...read moreread less

Abstract: Restricted Boltzmann Machines (RBM) have attracted a lot of attention of late, as one the principle building blocks of deep networks. Training RBMs remains problematic however, because of the intractibility of their partition function. The maximum likelihood gradient requires a very robust sampler which can accurately sample from the model despite the loss of ergodicity often incurred during learning. While using Parallel Tempering in the negative phase of Stochastic Maximum Likelihood (SML-PT) helps address the issue, it imposes a trade-off between computational complexity and high ergodicity, and requires careful hand-tuning of the temperatures. In this paper, we show that this trade-off is unnecessary. The choice of optimal temperatures can be automated by minimizing average return time (a concept first proposed by [Katzgraber et al., 2006]) while chains can be spawned dynamically, as needed, thus minimizing the computational overhead. We show on a synthetic dataset, that this results in better likelihood scores.

...read moreread less

Journal Article•DOI•

Tractable multivariate binary density estimation and the restricted boltzmann forest

[...]

Hugo Larochelle¹, Yoshua Bengio², Joseph Turian²•Institutions (2)

University of Toronto¹, Université de Montréal²

01 Sep 2010-Neural Computation

TL;DR: This work argues that for the problem of tractable density estimation, the restricted Boltzmann machine (RBM) provides a competitive framework for multivariate binary density modeling and presents the restricted Bolzmann forest (RBForest), which replaces the binary variables in the hidden layer of RBMs with groups of tree-structured binary variables.

...read moreread less

Abstract: We investigate the problem of estimating the density function of multivariate binary data. In particular, we focus on models for which computing the estimated probability of any data point is tractable. In such a setting, previous work has mostly concentrated on mixture modeling approaches. We argue that for the problem of tractable density estimation, the restricted Boltzmann machine (RBM) provides a competitive framework for multivariate binary density modeling. With this in mind, we also generalize the RBM framework and present the restricted Boltzmann forest (RBForest), which replaces the binary variables in the hidden layer of RBMs with groups of tree-structured binary variables. This extension allows us to obtain models that have more modeling capacity but remain tractable. In experiments on several data sets, we demonstrate the competitiveness of this approach and study some of its properties.

...read moreread less

Journal Article•DOI•

Alternative time representation in dopamine models.

[...]

Francois Rivest¹, John F. Kalaska¹, Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

01 Feb 2010-Journal of Computational Neuroscience

TL;DR: A general rate-based learning model based on long short-term memory networks that learns a time representation when needed that reproduces the known finding that trace conditioning is more difficult than delay conditioning and that the learned representation of the task can be highly dependent on the types of trials experienced during training.

...read moreread less

Abstract: Dopaminergic neuron activity has been modeled during learning and appetitive behavior, most commonly using the temporal-difference (TD) algorithm. However, a proper representation of elapsed time and of the exact task is usually required for the model to work. Most models use timing elements such as delay-line representations of time that are not biologically realistic for intervals in the range of seconds. The interval-timing literature provides several alternatives. One of them is that timing could emerge from general network dynamics, instead of coming from a dedicated circuit. Here, we present a general rate-based learning model based on long short-term memory (LSTM) networks that learns a time representation when needed. Using a naive network learning its environment in conjunction with TD, we reproduce dopamine activity in appetitive trace conditioning with a constant CS-US interval, including probe trials with unexpected delays. The proposed model learns a representation of the environment dynamics in an adaptive biologically plausible framework, without recourse to delay lines or other special-purpose circuits. Instead, the model predicts that the task-dependent representation of time is learned by experience, is encoded in ramp-like changes in single-neuron activity distributed across small neural networks, and reflects a temporal integration mechanism resulting from the inherent dynamics of recurrent loops within the network. The model also reproduces the known finding that trace conditioning is more difficult than delay conditioning and that the learned representation of the task can be highly dependent on the types of trials experienced during training. Finally, it suggests that the phasic dopaminergic signal could facilitate learning in the cortex.

...read moreread less

Improving First and Second-Order Methods by Modeling Uncertainty

[...]

Nicolas Le Roux, Yoshua Bengio, Andrew Fitzgibbon

01 Jan 2010

Posted Content•

Deep Self-Taught Learning for Handwritten Character Recognition

[...]

Frédéric Bastien, Yoshua Bengio, Arnaud Bergeron, Nicolas Boulanger-Lewandowski, Thomas M. Breuel, Youssouf Chherawala, Moustapha Cissé, Myriam Côté, Dumitru Erhan, Jeremy Eustache, Xavier Glorot, Xavier Muller, Sylvain Pannetier Lebeuf, Razvan Pascanu, Salah Rifai, François Savard, Guillaume Sicard - Show less +13 more

18 Sep 2010-arXiv: Learning

TL;DR: It is shown that deep learners benefit more from out-of-distribution examples than a corresponding shallow learner, at least in the area of handwritten character recognition.

...read moreread less

Abstract: Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by out-of-distribution examples. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that deep learners benefit more from out-of-distribution examples than a corresponding shallow learner, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.

...read moreread less

A simple and general method for semi-supervised learning

[...]

Joseph Turian, Lev Ratinov, Yoshua Bengio

01 Jan 2010

TL;DR: This work evaluates Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeds of words on both NER and chunking, and finds that each of the three word representations improves the accuracy of these baselines.

...read moreread less

Abstract: If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining di erent word representations. You can download our word features, for o -the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize. com/projects/wordreprs/

...read moreread less

Showing papers by "Yoshua Bengio published in 2010"