
Showing papers by "Yoshua Bengio" published in 2009


Book
01 Jan 2009
TL;DR: The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting as building blocks unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, which are used to construct deeper models such as Deep Belief Networks.
Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

7,767 citations
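
To make the notion of composing multiple levels of non-linear operations concrete, here is a minimal numpy sketch (our illustration, not taken from the book); the layer sizes, tanh non-linearity, and random weights are arbitrary placeholder choices.

```python
import numpy as np

def deep_forward(x, weights, biases):
    """Compose several non-linear levels: each layer re-represents
    the features produced by the layer below."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)   # one level of non-linear features
    return h

# Toy example: a 3-level architecture mapping 8 inputs to 4 top-level features.
rng = np.random.default_rng(0)
sizes = [8, 16, 8, 4]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.1 for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
top_features = deep_forward(rng.standard_normal(8), weights, biases)
print(top_features.shape)  # (4,)
```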


Proceedings ArticleDOI
14 Jun 2009
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
Abstract: Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).

4,588 citations
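
The training strategy described above can be sketched as a schedule that presents easy examples first and gradually admits harder ones. The following is a hedged illustration, not the paper's exact protocol; the difficulty score, number of stages, and batch sizes are placeholder assumptions.

```python
import numpy as np

def curriculum_batches(examples, difficulty, n_stages=5, batch_size=32, rng=None):
    """Sketch of a curriculum: start training on the easiest examples and
    gradually admit harder ones, stage by stage."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(difficulty)            # easiest first
    for stage in range(1, n_stages + 1):
        pool = order[: int(len(order) * stage / n_stages)]  # widen the pool
        for _ in range(100):                  # a fixed number of updates per stage
            batch = rng.choice(pool, size=min(batch_size, len(pool)), replace=False)
            yield examples[batch]             # feed to any gradient-based learner

# Usage: examples could be sentences sorted by length, images by noise level, etc.
X = np.random.randn(1000, 10)
difficulty = np.abs(X).sum(axis=1)            # a stand-in difficulty measure
for batch in curriculum_batches(X, difficulty):
    pass  # train_step(batch) would go here
```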


Journal ArticleDOI
TL;DR: These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input.
Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms.

1,124 citations
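
A rough numpy sketch of the greedy layer-wise idea, using tied-weight autoassociators rather than RBMs (one of the two variants the paper studies); the squared-error loss, learning rate, and layer sizes are illustrative assumptions, and a real implementation would then fine-tune the stacked weights with a supervised criterion.

```python
import numpy as np

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=50, rng=None):
    """One greedy stage: fit a single-hidden-layer autoassociator to X and
    return its encoder weights (tied decoder, squared-error loss)."""
    rng = rng or np.random.default_rng(0)
    n_vis = X.shape[1]
    W = rng.standard_normal((n_vis, n_hidden)) * 0.01
    b, c = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        H = np.tanh(X @ W + b)            # encode
        R = H @ W.T + c                   # decode with tied weights
        err = R - X
        dA = (err @ W) * (1 - H ** 2)     # gradient through the tanh encoder
        W -= lr * (X.T @ dA + err.T @ H) / len(X)
        b -= lr * dA.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Stack the layers: each stage is trained on the codes of the previous one,
    and the resulting weights would initialize a deep supervised network."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        params.append((W, b))
        H = np.tanh(H @ W + b)            # representation fed to the next stage
    return params

params = greedy_pretrain(np.random.randn(200, 20), [15, 10])
```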


Proceedings Article
15 Apr 2009
TL;DR: The experiments confirm and clarify the advantage of unsupervised pre-training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.
Abstract: Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

408 citations


Journal ArticleDOI
TL;DR: An expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM) is studied, where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables; the residual term is shown to converge to zero, justifying the use of a truncation to only a short Gibbs chain.
Abstract: We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation---running only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k is a good descent direction even for small k.

227 citations
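
The truncated-Gibbs-chain idea behind CD-k can be sketched for a binary-binary RBM as follows; this is a generic CD-k illustration under the standard RBM parametrization, not code from the paper, and the toy dimensions and learning rate are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd_k_gradient(v0, W, b, c, k=1):
    """CD-k sketch for a binary-binary RBM: run a short Gibbs chain of k steps
    from the data and use the truncated chain to estimate the log-likelihood
    gradient (positive-phase minus truncated negative-phase statistics)."""
    ph0 = sigmoid(v0 @ W + c)                     # P(h=1 | v0)
    v, ph = v0, ph0
    for _ in range(k):                            # alternating Gibbs chain
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)
    dW = v0.T @ ph0 - v.T @ ph
    db = (v0 - v).sum(axis=0)
    dc = (ph0 - ph).sum(axis=0)
    return dW, db, dc

# Toy usage: one CD-1 gradient step on random binary data.
n_vis, n_hid = 6, 4
W = rng.standard_normal((n_vis, n_hid)) * 0.01
b, c = np.zeros(n_vis), np.zeros(n_hid)
v0 = (rng.random((10, n_vis)) < 0.5).astype(float)
dW, db, dc = cd_k_gradient(v0, W, b, c, k=1)
W += 0.1 * dW / len(v0); b += 0.1 * db / len(v0); c += 0.1 * dc / len(v0)
```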


Journal ArticleDOI
01 Mar 2009-Extremes
TL;DR: In this article, a hybrid Pareto distribution that can be used in a mixture model is proposed to extend the generalized Pareto (GP) to the whole real axis, and the mixture of hybrid Paretos offers an alternate way to estimate the tail index which is comparable to the one estimated with the standard GP methodology.
Abstract: Density estimators that can adapt to asymmetric heavy tails are required in many applications such as finance and insurance. Extreme value theory (EVT) has developed principled methods based on asymptotic results to estimate the tails of most distributions. However, the finite sample approximation might introduce a severe bias in many cases. Moreover, the full range of the distribution is often needed, not only the tail area. On the other hand, non-parametric methods, while being powerful where data are abundant, fail to extrapolate properly in the tail area. We put forward a non-parametric density estimator that brings together the strengths of non-parametric density estimation and of EVT. A hybrid Pareto distribution that can be used in a mixture model is proposed to extend the generalized Pareto (GP) to the whole real axis. Experiments on simulated data show the following. On one hand, the mixture of hybrid Paretos converges faster in terms of log-likelihood and provides good estimates of the tail of the distributions when compared with other density estimators including the GP distribution. On the other hand, the mixture of hybrid Paretos offers an alternate way to estimate the tail index which is comparable to the one estimated with the standard GP methodology. The mixture of hybrids is also evaluated on the Danish fire insurance data set.

92 citations
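
A simplified sketch of the hybrid construction: a Gaussian body below a junction point and a generalized Pareto (GP) upper tail above it, scaled for continuity at the junction and renormalized. The published hybrid Pareto ties the junction point and GP scale to the Gaussian parameters; here they are left as free, hand-picked values for clarity.

```python
import numpy as np
from scipy.stats import norm, genpareto

def hybrid_pareto_pdf(x, mu=0.0, sigma=1.0, xi=0.3, u=1.5):
    """Gaussian body below u, GP upper tail above u, continuous at u and
    renormalized to integrate to one (an illustrative variant, not the paper's
    exact parametrization)."""
    x = np.asarray(x, dtype=float)
    body = norm.pdf(x, mu, sigma)
    beta = sigma  # a convenient, hand-picked GP scale
    # Scale the GP tail so the density matches the Gaussian at the junction u.
    tail = norm.pdf(u, mu, sigma) * genpareto.pdf(x - u, xi, scale=beta) \
           / genpareto.pdf(0.0, xi, scale=beta)
    pdf = np.where(x <= u, body, tail)
    mass = norm.cdf(u, mu, sigma) + norm.pdf(u, mu, sigma) * beta
    return pdf / mass

xs = np.linspace(-4, 10, 2000)
print((hybrid_pareto_pdf(xs) * (xs[1] - xs[0])).sum())  # close to 1 over a wide range
```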


Proceedings Article
07 Dec 2009
TL;DR: A new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1 is introduced, along with a pretraining strategy for learning slow, decorrelated features that yields orientation-selective features similar to the receptive fields of complex cells.
Abstract: We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind of model achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.

76 citations
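
One standard physiological rate model for complex cells is the "energy model", which pools the squared responses of a small group of linear filters; the sketch below uses that generic form and is not necessarily the exact activation proposed in the paper. Filter shapes and sizes are placeholder assumptions.

```python
import numpy as np

def complex_cell_unit(x, W, b, eps=1e-8):
    """Energy-model-style complex-cell unit: pool the squared responses of a
    small group of linear filters and take a square root."""
    s = x @ W + b                          # responses of the filters in the group
    return np.sqrt(np.sum(s ** 2, axis=-1) + eps)

# One hidden "complex cell" built from a pair of filters over a 16-pixel patch.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2)) * 0.1
b = np.zeros(2)
patches = rng.standard_normal((5, 16))     # a batch of 5 patches
print(complex_cell_unit(patches, W, b))    # one pooled response per patch
```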


Journal Article
TL;DR: A class of functions similar to multi-layer neural networks, but constrained to be non-decreasing in its arguments and convex in one of them while remaining a universal approximator of Lipschitz functions with these properties, is proposed and applied to the task of modelling the price of call options.
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but that (1) has those properties and (2) is a universal approximator of Lipschitz functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.

64 citations
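
A sketch in the spirit of the constrained class: positivity of the relevant weights is enforced by exponentiating free parameters, a softplus of an affine function of the convex argument preserves convexity in that argument, and sigmoid factors keep the output non-decreasing in the other argument. This is an illustrative construction, not the paper's exact parametrization.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)      # convex, increasing, smooth

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constrained_net(x1, x2, params):
    """Small network that is non-decreasing in both inputs and convex in x1.
    exp(.) keeps every coefficient positive, which preserves monotonicity;
    softplus of an affine function of x1 preserves convexity in x1."""
    a, beta, c, gamma, d, b0 = params
    out = np.exp(b0)
    for i in range(len(a)):
        out = out + np.exp(a[i]) * softplus(np.exp(beta[i]) * x1 + c[i]) \
                                 * sigmoid(np.exp(gamma[i]) * x2 + d[i])
    return out

# Toy evaluation with 3 hidden units and unoptimized random parameters.
rng = np.random.default_rng(0)
params = tuple(rng.standard_normal(3) for _ in range(5)) + (rng.standard_normal(),)
print(constrained_net(np.array([0.5, 1.0]), np.array([0.2, 0.3]), params))
```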


Proceedings ArticleDOI
31 May 2009
TL;DR: Quadratic filters, a simplification of a theoretical model of V1 complex cells, reliably increase accuracy and logistic regression with quadratic filters outperforms a standard single hidden layer neural network.
Abstract: We experiment with several chunking models. Deeper architectures achieve better generalization. Quadratic filters, a simplification of a theoretical model of V1 complex cells, reliably increase accuracy. In fact, logistic regression with quadratic filters outperforms a standard single hidden layer neural network. Adding quadratic filters to logistic regression is almost as effective as feature engineering. Despite predicting each output label independently, our model is competitive with ones that use previous decisions.

41 citations
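
As a rough illustration of adding quadratic filters to logistic regression, the sketch below appends the magnitudes of a few linear filter responses to the raw features before fitting the classifier; the filters here are random placeholders rather than the paper's learned V1-inspired quadratic filters, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def quadratic_filter_features(X, V, eps=1e-8):
    """Append simple quadratic-filter responses sqrt((v_j . x)^2) to the raw
    features; a stand-in for the paper's V1-inspired quadratic filters."""
    Q = np.sqrt((X @ V.T) ** 2 + eps)
    return np.hstack([X, Q])

# Toy usage: random "token window" features and 3 chunk labels.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((300, 20)), rng.integers(0, 3, size=300)
V = rng.standard_normal((8, 20)) * 0.1          # 8 placeholder quadratic filters
clf = LogisticRegression(max_iter=500).fit(quadratic_filter_features(X, V), y)
print(clf.score(quadratic_filter_features(X, V), y))
```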


Journal ArticleDOI
TL;DR: This paper proposes a conditional mixture model with hybrid Pareto components to estimate p(Y|X=x), and shows experimentally that this novel approach better models the conditional density in terms of likelihood, compared to competing algorithms: conditional mixture models with other types of components and a classical kernel-based nonparametric model.
Abstract: In many cases, we observe some variables X that contain predictive information over a scalar variable of interest Y, with (X, Y) pairs observed in a training set. We can take advantage of this information to estimate the conditional density p(Y|X=x). In this paper, we propose a conditional mixture model with hybrid Pareto components to estimate p(Y|X=x). The hybrid Pareto is a Gaussian whose upper tail has been replaced by a generalized Pareto tail. A third parameter, in addition to the location and spread parameters of the Gaussian, controls the heaviness of the upper tail. Using the hybrid Pareto in a mixture model results in a nonparametric estimator that can adapt to multimodality, asymmetry, and heavy tails. A conditional density estimator is built by modeling the parameters of the mixture estimator as functions of X. We use a neural network to implement these functions. Such conditional density estimators have important applications in many domains such as finance and insurance. We show experimentally that this novel approach better models the conditional density in terms of likelihood, compared to competing algorithms: conditional mixture models with other types of components and a classical kernel-based nonparametric model.

22 citations
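
The conditional estimator can be sketched as a mixture density network: a small neural network maps x to mixture weights, locations, and spreads. The sketch below uses Gaussian components to stay short; the paper's estimator instead uses hybrid Pareto components (which add a tail-index output per component), and all sizes and parameters here are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mdn_params(x, W1, b1, W2, b2):
    """One-hidden-layer network mapping x to mixture weights, locations, and
    log-spreads (a generic mixture-density-network sketch)."""
    h = np.tanh(x @ W1 + b1)
    logits, mu, log_sigma = np.split(h @ W2 + b2, 3, axis=-1)
    pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
    pi = pi / pi.sum(axis=-1, keepdims=True)          # softmax mixture weights
    return pi, mu, np.exp(log_sigma)

def conditional_nll(y, pi, mu, sigma):
    """Negative log-likelihood of y under the x-dependent mixture; with hybrid
    Pareto components, norm.pdf would be replaced by that component density."""
    dens = (pi * norm.pdf(y[:, None], mu, sigma)).sum(axis=-1)
    return -np.log(dens + 1e-12).mean()

# Toy forward pass: 2-d inputs, 3 mixture components, untrained random weights.
rng = np.random.default_rng(0)
n_comp, n_hidden = 3, 8
W1, b1 = rng.standard_normal((2, n_hidden)) * 0.1, np.zeros(n_hidden)
W2, b2 = rng.standard_normal((n_hidden, 3 * n_comp)) * 0.1, np.zeros(3 * n_comp)
X, y = rng.standard_normal((50, 2)), rng.standard_normal(50)
print(conditional_nll(y, *mdn_params(X, W1, b1, W2, b2)))
```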


01 Jan 2009
TL;DR: This work compares two recently introduced state-of-the-art machine learning algorithms, Support Vector Machines and Discriminative Restricted Boltzmann Machines, shows how to apply them to this difficult acoustic classification task, and obtains classification accuracy results that could make these techniques suitable for fielding on autonomous devices.
Abstract: Machine learning classification algorithms are relevant to a large number of Army classification problems, including the determination of a weapon class from a detonation acoustic signature. However, much of this work has focused on classification of events from small weapons used for asymmetric warfare, which have been of importance in recent years. In this work we consider classification of very different weapon classes, such as mortars, rockets, and RPGs, which are difficult to reliably classify with standard techniques since they tend to have similar acoustic signatures. To address this problem, we compare two recently introduced state-of-the-art machine learning algorithms, Support Vector Machines and Discriminative Restricted Boltzmann Machines, and show how to use them to solve this difficult acoustic classification task. We obtain classification accuracy results that could make these techniques suitable for fielding on autonomous devices. Discriminative Restricted Boltzmann Machines appear to yield slightly better accuracy than Support Vector Machines, and are less sensitive to the choice of signal preprocessing and model hyperparameters. Importantly, we also address methodological issues that one faces in order to rigorously compare several classifiers on limited data collected from field trials; these questions are of significance to any application of machine learning methods to Army problems. Approved for public release; distribution is unlimited.
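
One common way to compare classifiers rigorously on limited data is repeated stratified cross-validation, so that each model is scored on the same splits; the sketch below illustrates this with an SVM on synthetic stand-in data (the Discriminative RBM is not available in scikit-learn, and this is not the report's exact protocol).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical stand-in data: acoustic feature vectors and weapon-class labels.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((120, 40)), rng.integers(0, 3, size=120)

# Repeated stratified cross-validation gives a distribution of accuracies,
# which supports a fairer comparison between classifiers on small datasets.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(C=1.0, gamma="scale"), X, y, cv=cv)
print(f"SVM accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# A second classifier would be evaluated on the same splits so that the paired
# per-fold scores can be compared directly.
```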

Proceedings Article
07 Dec 2009
TL;DR: This work extends the latent factor model framework to two or more unbounded layers of latent factors so that each layer defines a conditional factorial prior distribution over the binary latent variables of the layer below via a noisy-or mechanism.
Abstract: The Indian Buffet Process is a Bayesian nonparametric approach that models objects as arising from an infinite number of latent factors. Here we extend the latent factor model framework to two or more unbounded layers of latent factors. From a generative perspective, each layer defines a conditional factorial prior distribution over the binary latent variables of the layer below via a noisy-or mechanism. We explore the properties of the model with two empirical studies, one digit recognition task and one music tag data experiment.
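
A generative sketch of the layered construction: binary latent factors at the top, and each lower layer activated through a noisy-or over its active parents. Finite layer widths, connection probabilities, and the leak term below are placeholder assumptions; the paper's layers are unbounded via the Indian Buffet Process.

```python
import numpy as np

def sample_noisy_or_layers(layer_sizes, rng=None, leak=0.01, on_prob=0.2):
    """Sample stacked binary latent factors with a noisy-or conditional between
    layers (finite widths here, for illustration only)."""
    rng = rng or np.random.default_rng(0)
    z = (rng.random(layer_sizes[0]) < on_prob).astype(float)   # top-layer prior
    layers = [z]
    for below, above in zip(layer_sizes[1:], layer_sizes[:-1]):
        Q = rng.random((above, below)) * 0.5    # per-connection activation probs
        # Noisy-or: a unit stays off only if every active parent's cause fails.
        p_off = (1.0 - leak) * np.prod((1.0 - Q) ** z[:, None], axis=0)
        z = (rng.random(below) < 1.0 - p_off).astype(float)
        layers.append(z)
    return layers

for layer in sample_noisy_or_layers([5, 10, 20]):
    print(layer.astype(int))
```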