
Showing papers by "Yoshua Bengio" published in 2009


Book
01 Jan 2009
TL;DR: The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting as building blocks unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, which are used to construct deeper models such as Deep Belief Networks.
Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

7,767 citations
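
To make the notion of composing multiple levels of non-linear operations concrete, here is a minimal numpy sketch (our illustration, not taken from the book); the layer sizes, tanh non-linearity, and random weights are arbitrary placeholder choices.

```python
import numpy as np

def deep_forward(x, weights, biases):
    """Compose several non-linear levels: each layer re-represents
    the features produced by the layer below."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)   # one level of non-linear features
    return h

# Toy example: a 3-level architecture mapping 8 inputs to 4 top-level features.
rng = np.random.default_rng(0)
sizes = [8, 16, 8, 4]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.1 for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
top_features = deep_forward(rng.standard_normal(8), weights, biases)
print(top_features.shape)  # (4,)
```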


Proceedings ArticleDOI
14 Jun 2009
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
Abstract: Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).

4,588 citations
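
The training strategy described above can be sketched as a schedule that presents easy examples first and gradually admits harder ones. The following is a hedged illustration, not the paper's exact protocol; the difficulty score, number of stages, and batch sizes are placeholder assumptions.

```python
import numpy as np

def curriculum_batches(examples, difficulty, n_stages=5, batch_size=32, rng=None):
    """Sketch of a curriculum: start training on the easiest examples and
    gradually admit harder ones, stage by stage."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(difficulty)            # easiest first
    for stage in range(1, n_stages + 1):
        pool = order[: int(len(order) * stage / n_stages)]  # widen the pool
        for _ in range(100):                  # a fixed number of updates per stage
            batch = rng.choice(pool, size=min(batch_size, len(pool)), replace=False)
            yield examples[batch]             # feed to any gradient-based learner

# Usage: examples could be sentences sorted by length, images by noise level, etc.
X = np.random.randn(1000, 10)
difficulty = np.abs(X).sum(axis=1)            # a stand-in difficulty measure
for batch in curriculum_batches(X, difficulty):
    pass  # train_step(batch) would go here
```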


Journal ArticleDOI
TL;DR: These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input.
Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms.

1,124 citations
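
A rough numpy sketch of the greedy layer-wise idea, using tied-weight autoassociators rather than RBMs (one of the two variants the paper studies); the squared-error loss, learning rate, and layer sizes are illustrative assumptions, and a real implementation would then fine-tune the stacked weights with a supervised criterion.

```python
import numpy as np

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=50, rng=None):
    """One greedy stage: fit a single-hidden-layer autoassociator to X and
    return its encoder weights (tied decoder, squared-error loss)."""
    rng = rng or np.random.default_rng(0)
    n_vis = X.shape[1]
    W = rng.standard_normal((n_vis, n_hidden)) * 0.01
    b, c = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        H = np.tanh(X @ W + b)            # encode
        R = H @ W.T + c                   # decode with tied weights
        err = R - X
        dA = (err @ W) * (1 - H ** 2)     # gradient through the tanh encoder
        W -= lr * (X.T @ dA + err.T @ H) / len(X)
        b -= lr * dA.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Stack the layers: each stage is trained on the codes of the previous one,
    and the resulting weights would initialize a deep supervised network."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        params.append((W, b))
        H = np.tanh(H @ W + b)            # representation fed to the next stage
    return params

params = greedy_pretrain(np.random.randn(200, 20), [15, 10])
```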


Proceedings Article
15 Apr 2009
TL;DR: The experiments confirm and clarify the advantage of unsupervised pre-training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.
Abstract: Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

408 citations


Journal ArticleDOI
TL;DR: An expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM) is studied, where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables; the residual term is shown to converge to zero, justifying the use of a truncation to only a short Gibbs chain.
Abstract: We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation---running only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k is a good descent direction even for small k.

227 citations
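
The truncated-Gibbs-chain idea behind CD-k can be sketched for a binary-binary RBM as follows; this is a generic CD-k illustration under the standard RBM parametrization, not code from the paper, and the toy dimensions and learning rate are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd_k_gradient(v0, W, b, c, k=1):
    """CD-k sketch for a binary-binary RBM: run a short Gibbs chain of k steps
    from the data and use the truncated chain to estimate the log-likelihood
    gradient (positive-phase minus truncated negative-phase statistics)."""
    ph0 = sigmoid(v0 @ W + c)                     # P(h=1 | v0)
    v, ph = v0, ph0
    for _ in range(k):                            # alternating Gibbs chain
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)
    dW = v0.T @ ph0 - v.T @ ph
    db = (v0 - v).sum(axis=0)
    dc = (ph0 - ph).sum(axis=0)
    return dW, db, dc

# Toy usage: one CD-1 gradient step on random binary data.
n_vis, n_hid = 6, 4
W = rng.standard_normal((n_vis, n_hid)) * 0.01
b, c = np.zeros(n_vis), np.zeros(n_hid)
v0 = (rng.random((10, n_vis)) < 0.5).astype(float)
dW, db, dc = cd_k_gradient(v0, W, b, c, k=1)
W += 0.1 * dW / len(v0); b += 0.1 * db / len(v0); c += 0.1 * dc / len(v0)
```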


Journal ArticleDOI
01 Mar 2009-Extremes
TL;DR: In this article, a hybrid Pareto distribution that can be used in a mixture model is proposed to extend the generalized Pareto (GP) to the whole real axis, and the mixture of hybrid Paretos offers an alternate way to estimate the tail index which is comparable to the one estimated with the standard GP methodology.
Abstract: Density estimators that can adapt to asymmetric heavy tails are required in many applications such as finance and insurance. Extreme value theory (EVT) has developed principled methods based on asymptotic results to estimate the tails of most distributions. However, the finite sample approximation might introduce a severe bias in many cases. Moreover, the full range of the distribution is often needed, not only the tail area. On the other hand, non-parametric methods, while being powerful where data are abundant, fail to extrapolate properly in the tail area. We put forward a non-parametric density estimator that brings together the strengths of non-parametric density estimation and of EVT. A hybrid Pareto distribution that can be used in a mixture model is proposed to extend the generalized Pareto (GP) to the whole real axis. Experiments on simulated data show the following. On one hand, the mixture of hybrid Paretos converges faster in terms of log-likelihood and provides good estimates of the tail of the distributions when compared with other density estimators including the GP distribution. On the other hand, the mixture of hybrid Paretos offers an alternate way to estimate the tail index which is comparable to the one estimated with the standard GP methodology. The mixture of hybrids is also evaluated on the Danish fire insurance data set.

92 citations
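
A simplified sketch of the hybrid construction: a Gaussian body below a junction point and a generalized Pareto (GP) upper tail above it, scaled for continuity at the junction and renormalized. The published hybrid Pareto ties the junction point and GP scale to the Gaussian parameters; here they are left as free, hand-picked values for clarity.

```python
import numpy as np
from scipy.stats import norm, genpareto

def hybrid_pareto_pdf(x, mu=0.0, sigma=1.0, xi=0.3, u=1.5):
    """Gaussian body below u, GP upper tail above u, continuous at u and
    renormalized to integrate to one (an illustrative variant, not the paper's
    exact parametrization)."""
    x = np.asarray(x, dtype=float)
    body = norm.pdf(x, mu, sigma)
    beta = sigma  # a convenient, hand-picked GP scale
    # Scale the GP tail so the density matches the Gaussian at the junction u.
    tail = norm.pdf(u, mu, sigma) * genpareto.pdf(x - u, xi, scale=beta) \
           / genpareto.pdf(0.0, xi, scale=beta)
    pdf = np.where(x <= u, body, tail)
    mass = norm.cdf(u, mu, sigma) + norm.pdf(u, mu, sigma) * beta
    return pdf / mass

xs = np.linspace(-4, 10, 2000)
print((hybrid_pareto_pdf(xs) * (xs[1] - xs[0])).sum())  # close to 1 over a wide range
```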


Proceedings Article
07 Dec 2009
TL;DR: A new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1 is introduced, along with a pretraining strategy for learning slow, decorrelated features that yields orientation-selective features similar to the receptive fields of complex cells.
Abstract: We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind of model achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.

76 citations
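
One standard physiological rate model for complex cells is the "energy model", which pools the squared responses of a small group of linear filters; the sketch below uses that generic form and is not necessarily the exact activation proposed in the paper. Filter shapes and sizes are placeholder assumptions.

```python
import numpy as np

def complex_cell_unit(x, W, b, eps=1e-8):
    """Energy-model-style complex-cell unit: pool the squared responses of a
    small group of linear filters and take a square root."""
    s = x @ W + b                          # responses of the filters in the group
    return np.sqrt(np.sum(s ** 2, axis=-1) + eps)

# One hidden "complex cell" built from a pair of filters over a 16-pixel patch.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2)) * 0.1
b = np.zeros(2)
patches = rng.standard_normal((5, 16))     # a batch of 5 patches
print(complex_cell_unit(patches, W, b))    # one pooled response per patch
```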


Journal Article
TL;DR: A class of functions similar to multi-layer neural networks, but constrained to be non-decreasing in its arguments and convex in one of them while remaining a universal approximator of Lipschitz functions with these properties, is proposed and applied to the task of modelling the price of call options.
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but that (1) has those properties and (2) is a universal approximator of Lipschitz functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.

64 citations
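
A sketch in the spirit of the constrained class: positivity of the relevant weights is enforced by exponentiating free parameters, a softplus of an affine function of the convex argument preserves convexity in that argument, and sigmoid factors keep the output non-decreasing in the other argument. This is an illustrative construction, not the paper's exact parametrization.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)      # convex, increasing, smooth

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constrained_net(x1, x2, params):
    """Small network that is non-decreasing in both inputs and convex in x1.
    exp(.) keeps every coefficient positive, which preserves monotonicity;
    softplus of an affine function of x1 preserves convexity in x1."""
    a, beta, c, gamma, d, b0 = params
    out = np.exp(b0)
    for i in range(len(a)):
        out = out + np.exp(a[i]) * softplus(np.exp(beta[i]) * x1 + c[i]) \
                                 * sigmoid(np.exp(gamma[i]) * x2 + d[i])
    return out

# Toy evaluation with 3 hidden units and unoptimized random parameters.
rng = np.random.default_rng(0)
params = tuple(rng.standard_normal(3) for _ in range(5)) + (rng.standard_normal(),)
print(constrained_net(np.array([0.5, 1.0]), np.array([0.2, 0.3]), params))
```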


Proceedings ArticleDOI
31 May 2009
TL;DR: Quadratic filters, a simplification of a theoretical model of V1 complex cells, reliably increase accuracy and logistic regression with quadratic filters outperforms a standard single hidden layer neural network.
Abstract: We experiment with several chunking models. Deeper architectures achieve better generalization. Quadratic filters, a simplification of a theoretical model of V1 complex cells, reliably increase accuracy. In fact, logistic regression with quadratic filters outperforms a standard single hidden layer neural network. Adding quadratic filters to logistic regression is almost as effective as feature engineering. Despite predicting each output label independently, our model is competitive with ones that use previous decisions.

41 citations
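
As a rough illustration of adding quadratic filters to logistic regression, the sketch below appends the magnitudes of a few linear filter responses to the raw features before fitting the classifier; the filters here are random placeholders rather than the paper's learned V1-inspired quadratic filters, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def quadratic_filter_features(X, V, eps=1e-8):
    """Append simple quadratic-filter responses sqrt((v_j . x)^2) to the raw
    features; a stand-in for the paper's V1-inspired quadratic filters."""
    Q = np.sqrt((X @ V.T) ** 2 + eps)
    return np.hstack([X, Q])

# Toy usage: random "token window" features and 3 chunk labels.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((300, 20)), rng.integers(0, 3, size=300)
V = rng.standard_normal((8, 20)) * 0.1          # 8 placeholder quadratic filters
clf = LogisticRegression(max_iter=500).fit(quadratic_filter_features(X, V), y)
print(clf.score(quadratic_filter_features(X, V), y))
```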


Journal ArticleDOI
TL;DR: This paper proposes a conditional mixture model with hybrid Pareto components to estimate p(Y|X=x), and shows experimentally that this novel approach better models the conditional density in terms of likelihood, compared to competing algorithms: conditional mixture models with other types of components and a classical kernel-based nonparametric model.
Abstract: In many cases, we observe some variables X that contain predictive information over a scalar variable of interest Y, with (X, Y) pairs observed in a training set. We can take advantage of this information to estimate the conditional density p(Y|X=x). In this paper, we propose a conditional mixture model with hybrid Pareto components to estimate p(Y|X=x). The hybrid Pareto is a Gaussian whose upper tail has been replaced by a generalized Pareto tail. A third parameter, in addition to the location and spread parameters of the Gaussian, controls the heaviness of the upper tail. Using the hybrid Pareto in a mixture model results in a nonparametric estimator that can adapt to multimodality, asymmetry, and heavy tails. A conditional density estimator is built by modeling the parameters of the mixture estimator as functions of X. We use a neural network to implement these functions. Such conditional density estimators have important applications in many domains such as finance and insurance. We show experimentally that this novel approach better models the conditional density in terms of likelihood, compared to competing algorithms: conditional mixture models with other types of components and a classical kernel-based nonparametric model.

22 citations
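
The conditional estimator can be sketched as a mixture density network: a small neural network maps x to mixture weights, locations, and spreads. The sketch below uses Gaussian components to stay short; the paper's estimator instead uses hybrid Pareto components (which add a tail-index output per component), and all sizes and parameters here are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mdn_params(x, W1, b1, W2, b2):
    """One-hidden-layer network mapping x to mixture weights, locations, and
    log-spreads (a generic mixture-density-network sketch)."""
    h = np.tanh(x @ W1 + b1)
    logits, mu, log_sigma = np.split(h @ W2 + b2, 3, axis=-1)
    pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
    pi = pi / pi.sum(axis=-1, keepdims=True)          # softmax mixture weights
    return pi, mu, np.exp(log_sigma)

def conditional_nll(y, pi, mu, sigma):
    """Negative log-likelihood of y under the x-dependent mixture; with hybrid
    Pareto components, norm.pdf would be replaced by that component density."""
    dens = (pi * norm.pdf(y[:, None], mu, sigma)).sum(axis=-1)
    return -np.log(dens + 1e-12).mean()

# Toy forward pass: 2-d inputs, 3 mixture components, untrained random weights.
rng = np.random.default_rng(0)
n_comp, n_hidden = 3, 8
W1, b1 = rng.standard_normal((2, n_hidden)) * 0.1, np.zeros(n_hidden)
W2, b2 = rng.standard_normal((n_hidden, 3 * n_comp)) * 0.1, np.zeros(3 * n_comp)
X, y = rng.standard_normal((50, 2)), rng.standard_normal(50)
print(conditional_nll(y, *mdn_params(X, W1, b1, W2, b2)))
```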


01 Jan 2009
TL;DR: This work compares two recently introduced state-of-the-art machine learning algorithms, Support Vector Machines and Discriminative Restricted Boltzmann Machines, shows how to apply them to this difficult acoustic classification task, and obtains classification accuracy results that could make these techniques suitable for fielding on autonomous devices.
Abstract: Machine learning classification algorithms are relevant to a large number of Army classification problems, including the determination of a weapon class from a detonation acoustic signature. However, much of this work has focused on classification of events from small weapons used for asymmetric warfare, which have been of importance in recent years. In this work we consider classification of very different weapon classes, such as mortars, rockets, and RPGs, which are difficult to reliably classify with standard techniques since they tend to have similar acoustic signatures. To address this problem, we compare two recently introduced state-of-the-art machine learning algorithms, Support Vector Machines and Discriminative Restricted Boltzmann Machines, and show how to use them to solve this difficult acoustic classification task. We obtain classification accuracy results that could make these techniques suitable for fielding on autonomous devices. Discriminative Restricted Boltzmann Machines appear to yield slightly better accuracy than Support Vector Machines, and are less sensitive to the choice of signal preprocessing and model hyperparameters. Importantly, we also address methodological issues that one faces in order to rigorously compare several classifiers on limited data collected from field trials; these questions are of significance to any application of machine learning methods to Army problems. Approved for public release; distribution is unlimited.
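
One common way to compare classifiers rigorously on limited data is repeated stratified cross-validation, so that each model is scored on the same splits; the sketch below illustrates this with an SVM on synthetic stand-in data (the Discriminative RBM is not available in scikit-learn, and this is not the report's exact protocol).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical stand-in data: acoustic feature vectors and weapon-class labels.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((120, 40)), rng.integers(0, 3, size=120)

# Repeated stratified cross-validation gives a distribution of accuracies,
# which supports a fairer comparison between classifiers on small datasets.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(C=1.0, gamma="scale"), X, y, cv=cv)
print(f"SVM accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# A second classifier would be evaluated on the same splits so that the paired
# per-fold scores can be compared directly.
```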

Proceedings Article
07 Dec 2009
TL;DR: This work extends the latent factor model framework to two or more unbounded layers of latent factors so that each layer defines a conditional factorial prior distribution over the binary latent variables of the layer below via a noisy-or mechanism.
Abstract: The Indian Buffet Process is a Bayesian nonparametric approach that models objects as arising from an infinite number of latent factors. Here we extend the latent factor model framework to two or more unbounded layers of latent factors. From a generative perspective, each layer defines a conditional factorial prior distribution over the binary latent variables of the layer below via a noisy-or mechanism. We explore the properties of the model with two empirical studies, one digit recognition task and one music tag data experiment.
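
A generative sketch of the layered construction: binary latent factors at the top, and each lower layer activated through a noisy-or over its active parents. Finite layer widths, connection probabilities, and the leak term below are placeholder assumptions; the paper's layers are unbounded via the Indian Buffet Process.

```python
import numpy as np

def sample_noisy_or_layers(layer_sizes, rng=None, leak=0.01, on_prob=0.2):
    """Sample stacked binary latent factors with a noisy-or conditional between
    layers (finite widths here, for illustration only)."""
    rng = rng or np.random.default_rng(0)
    z = (rng.random(layer_sizes[0]) < on_prob).astype(float)   # top-layer prior
    layers = [z]
    for below, above in zip(layer_sizes[1:], layer_sizes[:-1]):
        Q = rng.random((above, below)) * 0.5    # per-connection activation probs
        # Noisy-or: a unit stays off only if every active parent's cause fails.
        p_off = (1.0 - leak) * np.prod((1.0 - Q) ** z[:, None], axis=0)
        z = (rng.random(below) < 1.0 - p_off).astype(float)
        layers.append(z)
    return layers

for layer in sample_noisy_or_layers([5, 10, 20]):
    print(layer.astype(int))
```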