Author

Yoshua Bengio

Bio: Yoshua Bengio is an academic researcher from Université de Montréal. The author has contributed to research in the topics Artificial neural network & Deep learning. The author has an h-index of 202 and has co-authored 1,033 publications receiving 420,313 citations. Previous affiliations of Yoshua Bengio include McGill University & Centre de Recherches Mathématiques.


Papers
Journal ArticleDOI
TL;DR: Stochastic GFlowNets decompose state transitions into two steps and learn a dynamics model to capture environmental stochasticity, extending GFlowNets to more general tasks with stochastic dynamics.
Abstract: Generative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of "inference as control". They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.
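The two-step decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the network sizes, the discrete successor-state set, and all variable names are assumptions made for the example.

```python
# A sketch, assuming discrete states/actions; not the paper's code.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, N_NEXT_STATES = 16, 4, 32  # illustrative sizes

# Step 1 (agent): a forward policy P_F(a | s) chooses an action; it would be
# trained with a GFlowNet objective that uses the learned dynamics below.
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, N_ACTIONS))

# Step 2 (environment): the next state is sampled stochastically, so a
# dynamics model P_hat(s' | s, a) is fit by maximum likelihood to capture it.
dynamics = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.ReLU(),
                         nn.Linear(64, N_NEXT_STATES))

def dynamics_loss(s, a_onehot, next_state_idx):
    """Negative log-likelihood (cross-entropy) of observed transitions."""
    logits = dynamics(torch.cat([s, a_onehot], dim=-1))
    return nn.functional.cross_entropy(logits, next_state_idx)

# Toy batch of observed transitions (random placeholders).
s = torch.randn(8, STATE_DIM)
a = nn.functional.one_hot(torch.randint(N_ACTIONS, (8,)), N_ACTIONS).float()
s_next = torch.randint(N_NEXT_STATES, (8,))
dynamics_loss(s, a, s_next).backward()
```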

3 citations

Posted Content
TL;DR: In this paper, the authors propose an empirical and hypothesis-free method to compare different option pricing systems by having them trade against each other or against the market, and use this criterion to train a non-parametric statistical model (here based on neural networks) to estimate a price for the option that maximizes the expected utility when trading against the market.
Abstract: Prior work on option pricing falls mostly in two categories: it either relies on strong distributional or economic assumptions, or it tries to mimic the Black-Scholes formula through statistical models, trained to fit today's market price based on information available today. The work presented here is closer to the second category but its objective is different: predict the future value of the option, and establish its current value based on a trading scenario. This work thus innovates in two ways: first, it proposes an empirical and hypothesis-free method to compare different option pricing systems (by having them trade against each other or against the market); second, it uses this criterion to train a non-parametric statistical model (here based on neural networks) to estimate a price for the option that maximizes the expected utility when trading against the market. Note that the price will depend on the utility function and current portfolio (i.e. current risks) of the trading agent. Preliminary experiments are presented on the S&P 500 options.
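The training criterion described here (maximizing expected utility when trading against the market) can be sketched as below. The soft position rule, the exponential utility, the three input features, and all names are assumptions made for illustration; the paper's actual trading scenario and utility are not reproduced.

```python
# Hedged sketch only: the feature set, soft position rule, and exponential
# utility are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

pricer = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(pricer.parameters(), lr=1e-3)
risk_aversion, scale = 1.0, 0.1

def expected_utility(features, market_price, payoff):
    model_price = pricer(features).squeeze(-1)
    # Buy when the model values the option above the market price, sell when
    # below; a soft (tanh) position keeps the criterion differentiable.
    position = torch.tanh((model_price - market_price) / scale)
    pnl = position * (payoff - market_price)
    return (-torch.exp(-risk_aversion * pnl)).mean()  # exponential utility

# Toy batch: features could be e.g. moneyness, time to maturity, volatility.
features, market_price, payoff = torch.randn(64, 3), torch.rand(64), torch.rand(64)
loss = -expected_utility(features, market_price, payoff)  # maximize utility
loss.backward()
opt.step()
```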

3 citations

Book ChapterDOI
TL;DR: A cognitively relevant model for automatic speech recognition is proposed that combines a local-representation subnetwork and a distributed-representation subnetwork, which correspond respectively to a fast-learning and a slow-learning capability.
Abstract: The purpose of this chapter is to study the application of some connectionist models to automatic speech recognition. Ways to take advantage of a-priori knowledge in the design of those models are first considered. Then algorithms for some recurrent networks are described, since they are well suited to handling temporal dependences such as those found in speech. Some simple methods that accelerate the convergence of gradient descent with the back-propagation algorithm are discussed. An alternative approach to speeding up the networks is offered by systems based on Radial Basis Functions (local representation). Detailed results of several experiments with these networks on the recognition of phonemes from the TIMIT database are presented. In conclusion, a cognitively relevant model is proposed. This model combines a local-representation subnetwork and a distributed-representation subnetwork, which correspond respectively to a fast-learning and a slow-learning capability.
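For the "local representation" part, a Radial Basis Function layer responds only near stored centers, which is what allows fast learning of the output weights. The sketch below is a generic RBF fit on random placeholder data; the feature dimensions, centers, and width are assumptions, not the chapter's phoneme experiments.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian radial basis activations: each unit fires only near its own
    center, i.e. a local representation of the input space."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Toy usage: fix the centers, then learn only the linear output weights
# (the fast-learning part). Data and sizes are placeholders.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 12))               # e.g. frame-level acoustic features
y = rng.integers(0, 2, size=(100, 1)).astype(float)
centers = X[rng.choice(len(X), 10, replace=False)]
Phi = rbf_features(X, centers, width=1.0)
W, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # closed-form output weights
```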

3 citations

Posted Content
TL;DR: This work presents a lower-bound for the likelihood of the generative model and shows that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized.
Abstract: Efficient unsupervised training and inference in deep generative models remains a challenging problem. One basic approach, called Helmholtz machine, involves training a top-down directed generative model together with a bottom-up auxiliary model that is trained to help perform approximate inference. Recent results indicate that better results can be obtained with better approximate inference procedures. Instead of employing more powerful procedures, we here propose to regularize the generative model to stay close to the class of distributions that can be efficiently inverted by the approximate inference model. We achieve this by interpreting both the top-down and the bottom-up directed models as approximate inference distributions and by defining the model distribution to be the geometric mean of these two. We present a lower bound for the likelihood of this model and we show that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized. We demonstrate that we can use this approach to fit generative models with many layers of hidden binary stochastic variables to complex training distributions and that this method prefers significantly deeper architectures while it supports orders of magnitude more efficient approximate inference than other approaches.
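The link between the geometric-mean construction and the Bhattacharyya distance can be seen on a toy discrete example: the normalizer of sqrt(p * q) is the Bhattacharyya coefficient, so tightening the bound is the same as shrinking the Bhattacharyya distance. The two distributions below are random stand-ins, not the paper's top-down and bottom-up models.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(10); p /= p.sum()      # stand-in for the top-down distribution
q = rng.random(10); q /= q.sum()      # stand-in for the bottom-up distribution

unnormalized = np.sqrt(p * q)         # geometric mean of the two distributions
Z = unnormalized.sum()                # Bhattacharyya coefficient, always <= 1
p_star = unnormalized / Z             # normalized geometric-mean distribution
bhattacharyya_distance = -np.log(Z)   # zero exactly when p == q
```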

3 citations


Cited by
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
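The residual reformulation in this abstract amounts to layers computing F(x) and the block outputting F(x) + x through an identity shortcut. Below is a minimal same-channel basic block as a sketch; the published networks also use strided and projection shortcuts and bottleneck blocks, which are omitted here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal basic block: output = relu(F(x) + x), so the stacked layers
    learn a residual F(x) with reference to the input rather than an
    unreferenced mapping."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # identity shortcut

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))  # output keeps the input shape
```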

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, requires little memory, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
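The update rule the abstract summarizes, written out as a short NumPy sketch with the paper's default hyper-parameters (the toy objective is only for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, bias correction, then a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second raw-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2.
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```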

111,197 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
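The "constant error carousel" and multiplicative gates described here correspond to an additively updated cell state whose access is controlled by learned gates. The step below uses the now-standard formulation with a forget gate, which was added after the original 1997 paper; weights and sizes are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: the cell state c is updated additively (the constant
    error carousel), and multiplicative gates decide what to write and what
    to expose as output."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c_prev + i * g           # additive cell-state update
    h = o * np.tanh(c)               # gated output
    return h, c

# Toy dimensions and random weights, for illustration only.
X_DIM, H_DIM = 3, 4
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4 * H_DIM, X_DIM + H_DIM)), np.zeros(4 * H_DIM)
h, c = lstm_step(rng.standard_normal(X_DIM), np.zeros(H_DIM), np.zeros(H_DIM), W, b)
```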

72,897 citations

Journal ArticleDOI
28 May 2015 - Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
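"Using backpropagation to indicate how a machine should change its internal parameters" can be made concrete with a two-layer network trained by hand-coded gradients; the data, sizes, and learning rate below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))                  # a batch of inputs
y = rng.standard_normal((32, 1))                  # regression targets
W1 = 0.1 * rng.standard_normal((5, 8))
W2 = 0.1 * rng.standard_normal((8, 1))

for step in range(200):
    # Forward pass: each layer computes its representation from the previous one.
    h = np.tanh(X @ W1)
    y_hat = h @ W2
    err = (y_hat - y) / len(X)                    # gradient of 0.5 * mean squared error
    # Backward pass: the chain rule propagates the error back to each weight.
    grad_W2 = h.T @ err
    grad_W1 = X.T @ ((err @ W2.T) * (1 - h ** 2))
    W2 -= 0.1 * grad_W2
    W1 -= 0.1 * grad_W1
```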

46,982 citations

Posted Content
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

44,703 citations