
Showing papers by "Yoshua Bengio published in 2007"


01 Jan 2007
TL;DR: It is argued that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence.
Abstract: One long-term goal of machine learning research is to produce methods that are applicable to highly complex tasks, such as perception (vision, audition), reasoning, intelligent control, and other artificially intelligent behaviors. We argue that in order to progress toward this goal, the Machine Learning community must endeavor to discover algorithms that can learn highly complex functions, with minimal need for prior knowledge, and with minimal human intervention. We present mathematical and empirical evidence suggesting that many popular approaches to non-parametric learning, particularly kernel methods, are fundamentally limited in their ability to learn complex high-dimensional functions. Our analysis focuses on two problems. First, kernel machines are shallow architectures, in which one large layer of simple template matchers is followed by a single layer of trainable coefficients. We argue that shallow architectures can be very inefficient in terms of required number of computational elements and examples. Second, we analyze a limitation of kernel machines with a local kernel, linked to the curse of dimensionality, that applies to supervised, unsupervised (manifold learning) and semi-supervised kernel machines. Using empirical results on invariant image recognition tasks, kernel methods are compared with deep architectures, in which lower-level features or concepts are progressively combined into more abstract and higher-level representations. We argue that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence.
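
To make the contrast concrete, the sketch below (plain NumPy, with toy shapes and randomly initialized parameters that are purely illustrative, not taken from the paper) shows a kernel predictor with a local Gaussian kernel as one layer of template matchers against the training points followed by a trainable weighted sum, next to a deep network that composes several layers of learned features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n "training" points in d dimensions and one query point.
n, d = 200, 20
X_train = rng.standard_normal((n, d))
alpha = rng.standard_normal(n)      # trainable output coefficients
x = rng.standard_normal(d)          # query point

def kernel_machine(x, X_train, alpha, sigma=1.0):
    """Shallow architecture: one layer of local template matchers
    (Gaussian kernel responses to every training point) followed by a
    single trainable linear combination."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    k = np.exp(-sq_dist / (2.0 * sigma ** 2))   # n template matches
    return k @ alpha                            # weighted sum

def deep_net(x, hidden_weights, readout):
    """Deep architecture: features of features, each layer recombining
    what the previous layer computed."""
    h = x
    for W in hidden_weights:
        h = np.tanh(W @ h)                      # learned intermediate features
    return readout @ h                          # final linear readout

hidden_weights = [rng.standard_normal((30, d)) * 0.1,
                  rng.standard_normal((30, 30)) * 0.1]
readout = rng.standard_normal(30) * 0.1

print("kernel machine output:", kernel_machine(x, X_train, alpha))
print("deep net output:      ", deep_net(x, hidden_weights, readout))
```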

1,163 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: A series of experiments indicate that these models with deep architectures show promise in solving harder learning problems that exhibit many factors of variation.
Abstract: Recently, several learning algorithms relying on models with deep architectures have been proposed. Though they have demonstrated impressive performance, to date, they have only been evaluated on relatively simple problems such as digit recognition in a controlled environment, for which many machine learning algorithms already report reasonable results. Here, we present a series of experiments which indicate that these models show promise in solving harder learning problems that exhibit many factors of variation. These models are compared with well-established algorithms such as Support Vector Machines and single hidden-layer feed-forward neural networks.

1,122 citations


Proceedings Article
03 Dec 2007
TL;DR: An efficient, general, online approximation to natural gradient descent suited to large-scale problems; experiments show much faster convergence, in computation time and in number of iterations, with TONGA than with stochastic gradient descent, even on very large datasets.
Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.
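
The sketch below illustrates only the basic natural-gradient idea on a toy regression problem: precondition the stochastic gradient by a damped empirical covariance of per-example gradients. It is plain NumPy with made-up sizes and constants, and it does not reproduce TONGA's low-rank online approximation, which is what makes the approach practical at scale.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear regression problem with unit-variance noise.
n, d = 512, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + rng.standard_normal(n)

w = np.zeros(d)
lr, damping, batch = 0.5, 1e-2, 64

print("initial parameter error:", np.linalg.norm(w - w_true))
for step in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    residual = Xb @ w - yb
    per_example_grads = Xb * residual[:, None]   # gradients of 0.5*(x.w - y)^2
    g = per_example_grads.mean(axis=0)           # ordinary stochastic gradient
    # Damped empirical (uncentered) covariance of the per-example gradients,
    # used here as a stand-in for the metric defining the natural gradient.
    C = per_example_grads.T @ per_example_grads / batch + damping * np.eye(d)
    natural_dir = np.linalg.solve(C, g)          # C^{-1} g
    w -= lr * natural_dir
print("final parameter error:  ", np.linalg.norm(w - w_true))
```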

208 citations


Proceedings Article
11 Mar 2007
TL;DR: In the first approach proposed, a finite parametrization is possible, allowing gradient-based learning; in the second, the resulting kernel machine can be made hyperparameter-free and still generalizes despite the absence of explicit regularization.
Abstract: This article extends neural networks to the case of an uncountable number of hidden units, in several ways. In the first approach proposed, a finite parametrization is possible, allowing gradient-based learning. While having the same number of parameters as an ordinary neural network, its internal structure suggests that it can represent some smooth functions much more compactly. Under mild assumptions, we also find better error bounds than with ordinary neural networks. Furthermore, this parametrization may help reduce the problem of saturation of the neurons. In a second approach, the input-to-hidden weights are fully nonparametric, yielding a kernel machine for which we demonstrate a simple kernel formula. Interestingly, the resulting kernel machine can be made hyperparameter-free and still generalizes in spite of an absence of explicit regularization.
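
As a numerical caricature of the first approach (not the paper's closed-form parametrization), the sketch below treats the hidden layer as a continuum of units indexed by t in [0, 1], with the input-to-hidden weights given by an affine function of t, and approximates the output integral by a simple Riemann sum; all functions and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 5
x = rng.standard_normal(d)

# Hidden units indexed by a continuous variable t in [0, 1].  Illustrative
# choice (not the paper's parametrization): the input-to-hidden weight
# vector is an affine function of t, w(t) = w0 + t * (w1 - w0), and the
# output weight is a fixed function alpha(t).
w0 = rng.standard_normal(d)
w1 = rng.standard_normal(d)

def alpha(t):
    return np.cos(2.0 * np.pi * t)          # output-weight "function"

def continuous_net(x, num_points=10_000):
    # f(x) = integral over [0, 1] of alpha(t) * tanh(w(t) . x) dt,
    # approximated here by a plain Riemann sum over a fine grid.
    t = np.linspace(0.0, 1.0, num_points)
    w_t = w0[None, :] + t[:, None] * (w1 - w0)[None, :]
    hidden = np.tanh(w_t @ x)               # a "continuum" of hidden activations
    return float(np.mean(alpha(t) * hidden))

print("continuous-net output:", continuous_net(x))
```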

59 citations


Proceedings Article
03 Dec 2007
TL;DR: The surprising result presented here is that as few as about a thousand images are enough to approximately recover the relative locations of about a thousand pixels.
Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example, suppose that the pixels had been permuted in a fixed but unknown way; could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only is the answer yes, but as few as about a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topology-extraction approaches and show how having the two-dimensional topology can be exploited.
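
A minimal sketch of the pipeline, under simplifying assumptions: synthetic blob images stand in for natural images, pixel-pixel correlation is used as the distributional similarity, and classical MDS stands in for the manifold learning step (the paper compares several topology-extraction approaches). Everything here is NumPy-only and illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "natural" images: an 8x8 grid lit by a Gaussian blob at a
# random position, so nearby pixels have correlated intensities.
side, n_images = 8, 1000
rows, cols = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
coords = np.stack([rows.ravel(), cols.ravel()], axis=1)   # true 2-D pixel locations

centers = rng.uniform(0, side - 1, size=(n_images, 2))
d2 = ((coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)
images = np.exp(-d2 / 4.0)                                # (n_images, 64)

# Hide the topology: apply a fixed but unknown pixel permutation.
perm = rng.permutation(side * side)
images = images[:, perm]

# Distributional similarity between pixels: correlation of intensities.
corr = np.corrcoef(images.T)
dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))    # dissimilarity matrix

# Classical MDS (a simple stand-in for the manifold learning step):
# double-center the squared dissimilarities and keep the top-2 eigenvectors.
n = dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist ** 2) @ J
eigval, eigvec = np.linalg.eigh(B)
embedding = eigvec[:, -2:] * np.sqrt(np.maximum(eigval[-2:], 0.0))

# The recovered layout should resemble the true grid up to rotation,
# reflection and some distortion; one quick check is the correlation
# between recovered and true inter-pixel distances.
true_d = np.linalg.norm(coords[perm][:, None] - coords[perm][None, :], axis=-1)
rec_d = np.linalg.norm(embedding[:, None] - embedding[None, :], axis=-1)
print("distance correlation:", np.corrcoef(true_d.ravel(), rec_d.ravel())[0, 1])
```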

24 citations


01 Jan 2007
TL;DR: In this article, the authors present a methodology to optimize several hyperparameters based on the computation of the gradient of a model selection criterion with respect to the hyperparameters; in the case of a quadratic training criterion, this gradient is computed efficiently by back-propagating through a Cholesky decomposition.
Abstract: Many machine learning algorithms can be formulated as the minimization of a training criterion which involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this paper we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters is efficiently computed by back-propagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient involving second derivatives of the training criterion.
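
For the quadratic case, the hypergradient has a simple closed form via the implicit function theorem. The sketch below computes it for ridge regression with a validation-error selection criterion and checks it against finite differences; it uses a plain linear solve where the paper back-propagates through a Cholesky decomposition of the same matrix, and all problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Ridge regression: a quadratic training criterion with one
# hyperparameter, the weight decay lam.
n, m, d = 100, 50, 20
X_tr, X_va = rng.standard_normal((n, d)), rng.standard_normal((m, d))
w_true = rng.standard_normal(d)
y_tr = X_tr @ w_true + 0.5 * rng.standard_normal(n)
y_va = X_va @ w_true + 0.5 * rng.standard_normal(m)

def train(lam):
    # Minimizer of ||X_tr w - y_tr||^2 + lam ||w||^2.
    A = X_tr.T @ X_tr + lam * np.eye(d)
    return np.linalg.solve(A, X_tr.T @ y_tr), A

def val_error(w):
    r = X_va @ w - y_va
    return r @ r / m

def hyper_gradient(lam):
    # Implicit function theorem: differentiating (X'X + lam I) w = X'y
    # with respect to lam gives dw/dlam = -(X'X + lam I)^{-1} w, hence
    # dC/dlam = (dw/dlam) . grad_w C for the selection criterion C.
    w, A = train(lam)
    grad_w = 2.0 * X_va.T @ (X_va @ w - y_va) / m
    dw_dlam = -np.linalg.solve(A, w)
    return dw_dlam @ grad_w

lam, eps = 1.0, 1e-4
fd = (val_error(train(lam + eps)[0]) - val_error(train(lam - eps)[0])) / (2 * eps)
print("analytic hypergradient:", hyper_gradient(lam))
print("finite difference:     ", fd)
```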

20 citations


Book ChapterDOI
TL;DR: This chapter focuses on what is required for the learning of complex behaviors, which the authors believe involves learning highly varying functions in a mathematical sense, and brings forward two types of arguments which convey the message that many currently popular machine learning approaches to learning flexible functions have fundamental limitations that render them inappropriate for learning highly varying functions.
Abstract: A common goal of computational neuroscience and of artificial intelligence research based on statistical learning algorithms is the discovery and understanding of computational principles that could explain what we consider adaptive intelligence, in animals as well as in machines. This chapter focuses on what is required for the learning of complex behaviors. We believe it involves the learning of highly varying functions, in a mathematical sense. We bring forward two types of arguments which convey the message that many currently popular machine learning approaches to learning flexible functions have fundamental limitations that render them inappropriate for learning highly varying functions. The first issue concerns the representation of such functions with what we call shallow model architectures. We discuss limitations of shallow architectures, such as so-called kernel machines, boosting algorithms, and one-hidden-layer artificial neural networks. The second issue is more focused and concerns kernel machines with a local kernel (the type used most often in practice) that act like a collection of template-matching units. We present mathematical results on such computational architectures showing that they have a limitation similar to those already proved for older non-parametric methods, and connected to the so-called curse of dimensionality. Though it has long been believed that efficient learning in deep architectures is difficult, recently proposed computational principles for learning in deep architectures may offer a breakthrough.

20 citations


Proceedings Article
03 Dec 2007
TL;DR: A functional representation of time series is introduced which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets and a complete covariance matrix between forecasts at several time-steps is available.
Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads.
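
A generic Gaussian-process regression sketch (squared-exponential kernel, NumPy only, toy series) showing the part the abstract emphasizes: the posterior over several future time-steps comes with a full covariance matrix between the forecasts, not just per-step variances. The paper's functional representation and the spread-trading application are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(a, b, lengthscale=2.0, variance=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Observed series (time, value) and the future horizon to forecast.
t_obs = np.arange(0.0, 30.0)
y_obs = np.sin(0.3 * t_obs) + 0.1 * rng.standard_normal(t_obs.size)
t_fut = np.arange(30.0, 36.0)          # several future time-steps at once

noise = 0.1 ** 2
K = rbf(t_obs, t_obs) + noise * np.eye(t_obs.size)
K_s = rbf(t_obs, t_fut)
K_ss = rbf(t_fut, t_fut)

# Standard GP posterior: mean vector over the horizon and, crucially,
# the full covariance matrix *between* the forecasts at different steps.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
mean = K_s.T @ alpha
v = np.linalg.solve(L, K_s)
cov = K_ss - v.T @ v

print("forecast mean:", np.round(mean, 3))
print("forecast covariance (between horizons):")
print(np.round(cov, 3))
```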

18 citations



01 Jan 2007
TL;DR: Motivation: understanding intelligence, building AI, scaling to large scale learning of complex functions, and measuring the impact of artificial intelligence on the real world.
Abstract: Motivation: understanding intelligence, building AI, scaling to large scale learning of complex functions

4 citations


Journal ArticleDOI
TL;DR: It is shown that using a non-additive criterion (incremental Sharpe Ratio) yields a noisy K-best-paths extraction problem that can give substantially improved performance.
Abstract: We describe a general method to transform a non-Markovian sequential decision problem into a supervised learning problem using a K-best-paths algorithm. We consider an application in financial portfolio management where we can train a controller to directly optimize a Sharpe Ratio (or other risk-averse non-additive) utility function. We illustrate the approach by demonstrating experimental results using a kernel-based controller architecture that would not normally be considered in traditional reinforcement learning or approximate dynamic programming. We further show that using a non-additive criterion (incremental Sharpe Ratio) yields a noisy K-best-paths extraction problem, which can give substantially improved performance.
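
The sketch below shows K-best-paths extraction in the easy, additive case: a small decision trellis where keeping the K highest-scoring prefixes at each step is exact. The paper's point is that a non-additive criterion such as the incremental Sharpe Ratio breaks this additivity, turning the extraction into the noisy problem the abstract mentions; the trellis, rewards and sizes here are illustrative.

```python
import heapq
import numpy as np

rng = np.random.default_rng(6)

# A small decision trellis: at each of T steps we choose one of A actions,
# and (in this sketch only) the score of a path is additive over steps.
T, A, K = 6, 3, 5
reward = rng.standard_normal((T, A))

# Keep the K highest-scoring partial paths at every step.  With an
# additive criterion this is exact: the k-th best complete path must
# extend one of the k best prefixes.  A non-additive criterion such as
# the incremental Sharpe Ratio breaks this property.
best = [(0.0, [])]
for t in range(T):
    candidates = [(score + reward[t, a], path + [a])
                  for score, path in best for a in range(A)]
    best = heapq.nlargest(K, candidates, key=lambda c: c[0])

for score, path in best:
    print(f"score {score:7.3f}  actions {path}")
```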

Proceedings Article
11 Mar 2007
TL;DR: This work draws from Extreme Value Theory the tools to build a hybrid unimodal density with a parameter controlling the heaviness of the upper tail: a Gaussian whose upper tail has been replaced by a generalized Pareto tail.
Abstract: We propose an estimator for the conditional density p(Y|X) that can adapt for asymmetric heavy tails which might depend on X. Such estimators have important applications in finance and insurance. We draw from Extreme Value Theory the tools to build a hybrid unimodal density having a parameter controlling the heaviness of the upper tail. This hybrid is a Gaussian whose upper tail has been replaced by a generalized Pareto tail. We use this hybrid in a multi-modal mixture in order to obtain a nonparametric density estimator that can easily adapt for heavy-tailed data. To obtain a conditional density estimator, the parameters of the mixture estimator can be seen as functions of X and these functions learned. We show experimentally that this approach better models the conditional density in terms of likelihood than competing algorithms: conditional mixture models with other types of components, and multivariate nonparametric models.
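
A simplified, unconditional sketch of the hybrid density: a Gaussian body whose upper tail, above a threshold u, is replaced by a generalized Pareto tail scaled for continuity at u and renormalized. The continuity and normalization choices here are one reasonable option, not necessarily the paper's, the mixture and conditional machinery is omitted, and all parameter values are illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def gaussian_cdf(y, mu, sigma):
    return 0.5 * (1.0 + erf((y - mu) / (sigma * sqrt(2))))

def hybrid_pdf(y, mu=0.0, sigma=1.0, u=1.5, xi=0.3, beta=1.0):
    """Gaussian body below threshold u, generalized Pareto upper tail above u.

    The tail is scaled so the density is continuous at u, and the whole
    density is renormalized to integrate to one.  xi > 0 controls how
    heavy the upper tail is.
    """
    y = np.asarray(y, dtype=float)
    phi_u = gaussian_pdf(u, mu, sigma)
    body = gaussian_pdf(y, mu, sigma)
    z = np.maximum(y - u, 0.0)
    tail = phi_u * (1.0 + xi * z / beta) ** (-1.0 / xi - 1.0)
    unnorm = np.where(y <= u, body, tail)
    # Total mass: Gaussian mass below u plus the integral of the scaled tail,
    # which works out to beta * phi_u.
    mass = gaussian_cdf(u, mu, sigma) + beta * phi_u
    return unnorm / mass

ys = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
print("hybrid density:", np.round(hybrid_pdf(ys), 5))
print("pure Gaussian: ", np.round(gaussian_pdf(ys, 0.0, 1.0), 5))
```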

Patent
20 Jun 2007
TL;DR: A method of decoding coded data into decoded data is presented, where the decoded data is represented by a symbol string of more probable symbols and less probable symbols.
Abstract: The invention provides a method of decoding coded data into decoded data. The decoded data is represented by a symbol string of more probable symbols ("MPS") and less probable symbols ("LPS"). The method comprises receiving a segment of coded data as a binary fraction. For a position in the decoded symbol string, an interval of possible values of the coded data is defined, the interval bounded by 1 and a lower bound. A test variable (1030) is computed that divides the interval into two sub-intervals according to relative probabilities that the symbol should be occupied by the MPS or the LPS. A first sub-interval extends from 1 to the test variable and is associated with the MPS. A second sub-interval extends from the test variable to the lower bound and is associated with the LPS. A fence variable (2030) is computed to be the lesser of the coded data segment and ½, and when the test variable (2000) is less than the fence variable, an MPS is placed in the position (2060).
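
The sketch below is not the patented procedure (in particular it omits the fence-variable shortcut); it only illustrates the interval-subdivision idea the abstract builds on: the current interval is split at a test point according to the symbol probabilities, the upper sub-interval belongs to the MPS and the lower one to the LPS, and the coded binary fraction selects which sub-interval, hence which symbol, comes next. The probability and the toy message are assumptions.

```python
# A minimal arithmetic coder/decoder over two symbols ("MPS"/"LPS"),
# illustrating interval subdivision only; the patented fence-variable
# shortcut is NOT reproduced here.

P_MPS = 0.8            # probability of the more probable symbol (illustrative)

def encode(symbols):
    low, high = 0.0, 1.0
    for s in symbols:
        split = high - P_MPS * (high - low)   # test point inside [low, high)
        if s == "MPS":
            low = split                       # MPS -> upper sub-interval
        else:
            high = split                      # LPS -> lower sub-interval
    return (low + high) / 2.0                 # any value in the final interval

def decode(value, n):
    low, high = 0.0, 1.0
    out = []
    for _ in range(n):
        split = high - P_MPS * (high - low)
        if value >= split:                    # coded value falls in MPS sub-interval
            out.append("MPS")
            low = split
        else:
            out.append("LPS")
            high = split
    return out

msg = ["MPS", "MPS", "LPS", "MPS", "LPS", "MPS", "MPS", "MPS"]
code = encode(msg)
print("coded value  :", code)
print("decoded      :", decode(code, len(msg)))
print("round trip ok:", decode(code, len(msg)) == msg)
```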