Top 29 papers published by Ilya Sutskever from OpenAI in 2016

Journal Article•DOI•

Mastering the game of Go with deep neural networks and tree search

David Silver¹, Aja Huang¹, Chris J. Maddison¹, Arthur Guez¹, Laurent Sifre¹, George van den Driessche¹, Julian Schrittwieser¹, Ioannis Antonoglou¹, Veda Panneershelvam¹, Marc Lanctot¹, Sander Dieleman¹, Dominik Grewe¹, John Nham¹, Nal Kalchbrenner¹, Ilya Sutskever¹, Timothy P. Lillicrap¹, Madeleine Leach¹, Koray Kavukcuoglu¹, Thore Graepel¹, Demis Hassabis¹•Institutions (1)

Google¹

28 Jan 2016-Nature

TL;DR: Using a new search algorithm that combines Monte Carlo simulation with value and policy networks, the program AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the human European Go champion by 5 games to 0, the first time that a computer program has defeated a human professional player in the full-sized game of Go.

Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
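
The abstract does not spell out how the search combines the two networks. As a rough illustration only, the sketch below shows one common way to do it: pick the child that maximizes the mean value plus an exploration bonus weighted by the policy prior, and score a leaf by mixing the value-network estimate with a rollout outcome. The function names, the constant `c_puct`, the `mixing` weight, and the exact form of the bonus are assumptions made for this sketch, not details stated in the listing.

```python
import math


def select_action(prior, visit_count, mean_value, c_puct=5.0):
    """Pick the child with the highest value plus prior-weighted exploration bonus.

    prior[a]       -- policy-network probability of action a (assumed given)
    visit_count[a] -- how often action a has been tried from this node
    mean_value[a]  -- average evaluation of simulations that went through a
    """
    total_visits = sum(visit_count)
    best_action, best_score = None, -math.inf
    for a in range(len(prior)):
        # The bonus shrinks as an action is visited more often.
        bonus = c_puct * prior[a] * math.sqrt(total_visits) / (1 + visit_count[a])
        score = mean_value[a] + bonus
        if score > best_score:
            best_action, best_score = a, score
    return best_action


def leaf_value(value_net_estimate, rollout_outcome, mixing=0.5):
    """Blend the value-network estimate with a Monte Carlo rollout result."""
    return (1 - mixing) * value_net_estimate + mixing * rollout_outcome
```

With a small `c_puct` the rule mostly exploits actions that already look good; a larger value lets a strong prior pull the search toward moves that have rarely been tried.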

14,377 citations

Posted Content•

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Xi Chen¹, Yan Duan¹, Rein Houthooft¹, John Schulman¹, Ilya Sutskever², Pieter Abbeel¹•Institutions (2)

University of California, Berkeley¹, OpenAI²

12 Jun 2016-arXiv: Learning

TL;DR: InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation; its training procedure can be interpreted as a variation of the Wake-Sleep algorithm.

Abstract: This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
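
The lower bound mentioned in the abstract can be written out roughly as follows. The notation is an assumption introduced here for illustration and does not appear in the listing: $c$ is the small subset of latent variables, $z$ the remaining noise, $G$ the generator, $D$ the discriminator, $Q$ an auxiliary distribution that approximates the posterior over $c$, $H(c)$ the entropy of the code prior, $V(D, G)$ the usual GAN objective, and $\lambda$ a weighting hyperparameter.

```latex
% Variational lower bound on the mutual information between code and sample
I\big(c;\, G(z, c)\big) \;\ge\;
  \mathbb{E}_{c \sim P(c),\; x \sim G(z, c)}\big[\log Q(c \mid x)\big] + H(c)
  \;=\; L_I(G, Q)

% GAN objective regularized by the bound
\min_{G,\, Q} \; \max_{D} \;\; V(D, G) \;-\; \lambda\, L_I(G, Q)
```

Because $L_I$ only needs samples of $c$ and $x$ and an evaluation of $\log Q(c \mid x)$, it can be optimized with ordinary stochastic gradients alongside the adversarial loss.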

2,409 citations

Proceedings Article•

InfoGAN: interpretable representation learning by information maximizing generative adversarial nets

Xi Chen¹, Yan Duan¹, Rein Houthooft¹, John Schulman¹, Ilya Sutskever², Pieter Abbeel¹•Institutions (2)

University of California, Berkeley¹, OpenAI²

05 Dec 2016

TL;DR: InfoGAN is an information-theoretic extension to the GAN that learns disentangled representations in a completely unsupervised manner; it also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.

Abstract: This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound of the mutual information objective that can be optimized efficiently. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods. For an up-to-date version of this paper, please see https://arxiv.org/abs/1606.03657.

2,290 citations

Proceedings Article•

Improved Variational Inference with Inverse Autoregressive Flow

Durk P. Kingma, Tim Salimans¹, Rafal Jozefowicz², Xi Chen³, Ilya Sutskever², Max Welling⁴•Institutions (4)

OpenAI¹, Google², University of California, Berkeley³, University of Amsterdam⁴

01 Jan 2016

TL;DR: A new type of normalizing flow, inverse autoregressive flow (IAF), is proposed that, in contrast to earlier published flows, scales well to high-dimensional latent spaces and significantly improves upon diagonal Gaussian approximate posteriors.

Abstract: The framework of normalizing flows provides a general strategy for flexible variational inference of posteriors over latent variables. We propose a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. In experiments, we show that IAF significantly improves upon diagonal Gaussian approximate posteriors. In addition, we demonstrate that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.

901 citations

Proceedings Article•

Improving Variational Inference with Inverse Autoregressive Flow

Diederik P. Kingma¹, Tim Salimans², Rafal Jozefowicz³, Xi Chen⁴, Ilya Sutskever³, Max Welling⁵•Institutions (5)

University of Amsterdam¹, OpenAI², Google³, University of California, Berkeley⁴, Canadian Institute for Advanced Research⁵

15 Jun 2016

TL;DR: This article proposed a data transformation called inverse autoregressive flows (IAF) to transform a simple distribution over the latent variables into a much more flexible distribution, while still allowing us to compute the resulting variables' probability density function.

Abstract: We propose a simple and scalable method for improving the flexibility of variational inference through a transformation with autoregressive neural networks. Autoregressive neural networks, such as RNNs or the PixelCNN, are very powerful models and potentially interesting for use as variational posterior approximation. However, ancestral sampling in such networks is a long sequential operation, and therefore typically very slow on modern parallel hardware, such as GPUs. We show that by inverting autoregressive neural networks we can obtain equally powerful posterior models from which we can sample efficiently on modern hardware. We show that such data transformations, inverse autoregressive flows (IAF), can be used to transform a simple distribution over the latent variables into a much more flexible distribution, while still allowing us to compute the resulting variables' probability density function. The method is simple to implement, can be made arbitrarily flexible and, in contrast with previous work, is well applicable to models with high-dimensional latent spaces, such as convolutional generative models. The method is applied to a novel deep architecture of variational auto-encoders. In experiments with natural images, we demonstrate that autoregressive flow leads to significant performance gains.
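
To make the sampling argument concrete, here is a minimal sketch of one flow step, under simplifying assumptions: the autoregressive network is replaced by two strictly lower-triangular linear maps (`w_mu` and `w_log_sigma`, hypothetical names), so that output dimension i depends only on input dimensions before i. The forward step computes every dimension from the input at once, the log-determinant is just the sum of the log-scales, and only the inverse has to run sequentially.

```python
import math


def iaf_step(z, w_mu, w_log_sigma):
    """One flow step: z_new[i] = mu_i(z[:i]) + sigma_i(z[:i]) * z[i].

    Every mu_i and sigma_i reads only the *input* z, so all dimensions can be
    computed in parallel. Returns the new sample and log|det(dz_new/dz)|.
    """
    n = len(z)
    mu = [sum(w_mu[i][j] * z[j] for j in range(i)) for i in range(n)]
    log_sigma = [sum(w_log_sigma[i][j] * z[j] for j in range(i)) for i in range(n)]
    z_new = [mu[i] + math.exp(log_sigma[i]) * z[i] for i in range(n)]
    # The Jacobian is triangular with sigma on the diagonal.
    log_det = sum(log_sigma)
    return z_new, log_det


def iaf_inverse(z_new, w_mu, w_log_sigma):
    """Recover z from z_new; this direction is inherently sequential."""
    z = []
    for i in range(len(z_new)):
        mu_i = sum(w_mu[i][j] * z[j] for j in range(i))
        log_sigma_i = sum(w_log_sigma[i][j] * z[j] for j in range(i))
        z.append((z_new[i] - mu_i) * math.exp(-log_sigma_i))
    return z
```

If the input sample has log-density log q(z), the transformed sample has log-density log q(z) minus the returned log-determinant, which is how the density of the more flexible posterior stays computable.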

767 citations

Posted Content•

RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel

04 Nov 2016-arXiv: Artificial Intelligence

TL;DR: This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data; the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.

Abstract: Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.

668 citations

Posted Content•

Continuous Deep Q-Learning with Model-based Acceleration

Shixiang Gu¹, Timothy P. Lillicrap², Ilya Sutskever², Sergey Levine²•Institutions (2)

Max Planck Society¹, Google²

02 Mar 2016-arXiv: Learning

TL;DR: This paper proposed normalized advantage functions (NAF) as an alternative to the more commonly used policy gradient and actor-critic methods to accelerate model-free reinforcement learning for continuous control tasks.

Abstract: Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.
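
The abstract does not give the parameterization, so the following is only a sketch of the general idea as commonly described: write the Q-function as a state value plus an advantage that is a negative quadratic in the action, so the maximizing action is available in closed form. In a real agent the quantities `mu`, `l_factor`, and `state_value` would be outputs of a neural network for the current state; here they are plain numbers, and all names are hypothetical.

```python
def naf_q_value(action, mu, l_factor, state_value):
    """Q(x, u) = V(x) - 0.5 * (u - mu)^T P (u - mu), with P = L L^T.

    action, mu  -- lists of floats of equal length
    l_factor    -- lower-triangular matrix L as a list of rows
    state_value -- scalar V(x)
    """
    n = len(action)
    diff = [a - m for a, m in zip(action, mu)]
    # (u - mu)^T L L^T (u - mu) equals the squared norm of L^T (u - mu).
    projected = [sum(l_factor[i][j] * diff[i] for i in range(n)) for j in range(n)]
    advantage = -0.5 * sum(v * v for v in projected)
    return state_value + advantage


def greedy_action(mu):
    """The advantage is never positive and is zero at u = mu, so mu is greedy."""
    return list(mu)
```

Because the greedy action is simply `mu`, the max over actions needed for the Q-learning target reduces to reading off `state_value` for the next state, which is what makes Q-learning with experience replay practical for continuous actions in this form.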

618 citations

Proceedings Article•

Multi-task Sequence to Sequence Learning

Minh-Thang Luong¹, Quoc V. Le¹, Ilya Sutskever¹, Oriol Vinyals¹, Lukasz Kaiser¹•Institutions (1)

Google¹

01 Jan 2016

TL;DR: The results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks, and reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context.

Abstract: Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the one-to-many setting - where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting - useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting - where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation. Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.

524 citations

Proceedings Article•

Continuous deep Q-learning with model-based acceleration

Shixiang Gu¹, Timothy P. Lillicrap², Ilya Sutskever², Sergey Levine²•Institutions (2)

Max Planck Society¹, Google²

19 Jun 2016

TL;DR: This paper derives a continuous variant of the Q-learning algorithm, which it calls normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and shows that it substantially improves performance on a set of simulated robotic control tasks.

Abstract: Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.

437 citations

Posted Content•

Variational Lossy Autoencoder

Xi Chen¹, Diederik P. Kingma², Tim Salimans², Yan Duan¹, Prafulla Dhariwal², John Schulman¹, Ilya Sutskever³, Pieter Abbeel¹•Institutions (3)

University of California, Berkeley¹, OpenAI², Google³

08 Nov 2016-arXiv: Learning

TL;DR: This paper combines VAE with neural autoregressive models such as RNN, MADE and PixelRNN/CNN to learn a global representation of 2D images that describes only global structure and discards information about detailed texture.

Abstract: Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations by combining Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN. Our proposed VAE model allows us to have control over what the global latent code can learn and, by designing the architecture accordingly, we can force the global latent code to discard irrelevant information such as texture in 2D images, and hence the VAE only "autoencodes" data in a lossy fashion. In addition, by leveraging autoregressive models as both prior distribution $p(z)$ and decoding distribution $p(x|z)$, we can greatly improve generative modeling performance of VAEs, achieving new state-of-the-art results on MNIST, OMNIGLOT and Caltech-101 Silhouettes density estimation tasks.

385 citations

Proceedings Article•

Variational Lossy Autoencoder

Xi Chen¹, Diederik P. Kingma², Tim Salimans², Yan Duan¹, Prafulla Dhariwal², John Schulman¹, Ilya Sutskever³, Pieter Abbeel¹•Institutions (3)

University of California, Berkeley¹, OpenAI², Google³

04 Nov 2016

TL;DR: This paper presents a simple but principled method to learn global representations by combining Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN, which also greatly improves the generative modeling performance of VAEs.

Abstract: Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations by combining Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN. Our proposed VAE model allows us to have control over what the global latent code can learn and, by designing the architecture accordingly, we can force the global latent code to discard irrelevant information such as texture in 2D images, and hence the VAE only "autoencodes" data in a lossy fashion. In addition, by leveraging autoregressive models as both prior distribution $p(z)$ and decoding distribution $p(x|z)$, we can greatly improve generative modeling performance of VAEs, achieving new state-of-the-art results on MNIST, OMNIGLOT and Caltech-101 Silhouettes density estimation tasks.

Posted Content•

Improving Variational Inference with Inverse Autoregressive Flow

Diederik P. Kingma¹, Tim Salimans², Rafal Jozefowicz³, Xi Chen⁴, Ilya Sutskever³, Max Welling⁵•Institutions (5)

University of Amsterdam¹, OpenAI², Google³, University of California, Berkeley⁴, Canadian Institute for Advanced Research⁵

15 Jun 2016-arXiv: Learning

TL;DR: This paper proposed a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces, and demonstrated that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.

Abstract: The framework of normalizing flows provides a general strategy for flexible variational inference of posteriors over latent variables. We propose a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. In experiments, we show that IAF significantly improves upon diagonal Gaussian approximate posteriors. In addition, we demonstrate that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.

Proceedings Article•

Neural Programmer: Inducing Latent Programs with Gradient Descent

Arvind Neelakantan¹, Quoc V. Le², Ilya Sutskever²•Institutions (2)

University of Massachusetts Amherst¹, Google²

01 Jan 2016

TL;DR: Neural Programmer is an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations; it can call these operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations.

Abstract: Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.

Proceedings Article•

Neural GPUs Learn Algorithms

Łukasz Kaiser¹, Ilya Sutskever¹•Institutions (1)

Google¹

01 Jan 2016

TL;DR: It is shown that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances, and a technique for training deep recurrent networks: parameter sharing relaxation is introduced.

Abstract: Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal, NTMs have a weakness that is caused by their sequential nature: they are not parallel and are hard to train due to their large depth when unfolded. We present a neural network architecture to address this problem: the Neural GPU. It is based on a type of convolutional gated recurrent unit and, like the NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly parallel, which makes it easier to train and efficient to run. An essential property of algorithms is their ability to handle inputs of arbitrary size. We show that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances. We verified it on a number of tasks including long addition and long multiplication of numbers represented in binary. We train the Neural GPU on numbers with up to 20 bits and observe no errors whatsoever while testing it, even on much longer numbers. To achieve these results we introduce a technique for training deep recurrent networks: parameter sharing relaxation. We also found a small amount of dropout and gradient noise to have a large positive effect on learning and generalization.

Proceedings Article•

MuProp: Unbiased Backpropagation for Stochastic Neural Networks

Shixiang Gu¹,², Sergey Levine³, Ilya Sutskever³, Andriy Mnih³•Institutions (3)

Max Planck Society¹, University of Cambridge², Google³

01 Jan 2016

TL;DR: MuProp is an unbiased gradient estimator for stochastic networks that improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network.

Abstract: Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks.

InfoGAN: interpretable representation learning by information maximizing Generative Adversarial Nets

Xi Chen¹, Yan Duan¹, Rein Houthooft¹, John Schulman¹, Ilya Sutskever², Pieter Abbeel¹•Institutions (2)

University of California, Berkeley¹, OpenAI²

01 Jan 2016

TL;DR: Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

Abstract: This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound of the mutual information objective that can be optimized efficiently. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods.

Proceedings Article•

An Online Sequence-to-Sequence Model Using Partial Conditioning

Navdeep Jaitly¹, Quoc V. Le¹, Oriol Vinyals¹, Ilya Sutskever¹, David Sussillo², Samy Bengio¹•Institutions (2)

Google¹, Stanford University²

01 Jan 2016

TL;DR: The Neural Transducer can make incremental predictions as more input arrives, without redoing the entire computation, unlike standard sequence-to-sequence models, which are unsuitable for tasks that require incremental predictions as more data arrives or that have long input and output sequences.

Abstract: Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.

Journal Article•

Neural Random Access Machines.

Karol Kurach¹, Marcin Andrychowicz¹, Ilya Sutskever¹•Institutions (1)

Google¹

01 Jan 2016-Ercim News

TL;DR: The proposed model can learn to solve algorithmic tasks that require pointer manipulation and dereferencing, can operate on simple data structures like linked lists and binary trees, and, for easier tasks, learns solutions that generalize to sequences of arbitrary length.

Abstract: In this paper, we propose and investigate a new neural network architecture called Neural Random Access Machine. It can manipulate and dereference pointers to an external variable-size random-access memory. The model is trained from pure input-output examples using backpropagation. We evaluate the new model on a number of simple algorithmic tasks whose solutions require pointer manipulation and dereferencing. Our results show that the proposed model can learn to solve algorithmic tasks of such type and is capable of operating on simple data structures like linked-lists and binary trees. For easier tasks, the learned solutions generalize to sequences of arbitrary length. Moreover, memory access during inference can be done in a constant time under some assumptions.


Posted Content•

A Neural Transducer

Navdeep Jaitly, David Sussillo, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, Samy Bengio

01 Jan 2016-arXiv: Learning

TL;DR: A Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation, and performs well for long sequences even when attention mechanisms are not used.


Abstract: Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.


Proceedings Article•

Improving Variational Autoencoders with Inverse Autoregressive Flow

Diederik P. Kingma¹, Tim Salimans², Rafal Jozefowicz³, Xi Chen⁴, Ilya Sutskever³, Max Welling¹•Institutions (4)

University of Amsterdam¹, OpenAI², Google³, University of California, Berkeley⁴

01 Jan 2016

TL;DR: In experiments with natural images, inverse autoregressive flow (IAF) is shown to give significant performance gains; unlike previous work, the method applies well to models with high-dimensional latent spaces, such as convolutional generative models.


Abstract: We propose a simple and scalable method for improving the flexibility of variational inference through a transformation with autoregressive neural networks. Autoregressive neural networks, such as RNNs or the PixelCNN, are very powerful models and potentially interesting for use as variational posterior approximation. However, ancestral sampling in such networks is a long sequential operation, and therefore typically very slow on modern parallel hardware, such as GPUs. We show that by inverting autoregressive neural networks we can obtain equally powerful posterior models from which we can sample efficiently on modern hardware. We show that such data transformations, inverse autoregressive flows (IAF), can be used to transform a simple distribution over the latent variables into a much more flexible distribution, while still allowing us to compute the resulting variables' probability density function. The method is simple to implement, can be made arbitrarily flexible and, in contrast with previous work, is well applicable to models with high-dimensional latent spaces, such as convolutional generative models. The method is applied to a novel deep architecture of variational auto-encoders. In experiments with natural images, we demonstrate that autoregressive flow leads to significant performance gains.
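The abstract's central idea, a transformation whose parameters are computed autoregressively from the input so that sampling needs no sequential dependence on the output, can be illustrated with a toy scale-and-shift step. This is only a sketch under stated assumptions: `shift_fn` and `scale_fn` are hypothetical stand-ins for an autoregressive network, and the scale-and-shift form is one common way to write such a step, not necessarily the exact parameterization used in the paper.

```python
import math


def iaf_step(z, shift_fn, scale_fn):
    """Apply one toy flow step: new_z[i] = mu_i + sigma_i * z[i].

    mu_i and sigma_i depend only on the *input* prefix z[:i], so every
    coordinate could be computed in parallel; the loop is for clarity.
    The Jacobian is triangular with sigma_i on the diagonal, so its
    log-determinant is the sum of log(sigma_i).
    """
    new_z, log_det = [], 0.0
    for i, z_i in enumerate(z):
        mu = shift_fn(z[:i])       # hypothetical autoregressive shift
        sigma = scale_fn(z[:i])    # hypothetical autoregressive scale, must be > 0
        new_z.append(mu + sigma * z_i)
        log_det += math.log(sigma)
    return new_z, log_det


def toy_shift(prefix):
    # Assumption: any function of the prefix would do here.
    return 0.5 * sum(prefix)


def toy_scale(prefix):
    # Sigmoid keeps the scale strictly positive.
    return 1.0 / (1.0 + math.exp(-sum(prefix)))


z0 = [1.0, -2.0]
z1, log_det = iaf_step(z0, toy_shift, toy_scale)
# Density bookkeeping: log q(z1) = log q(z0) - log_det
```

Because the log-determinant is available in closed form, the density of the transformed variable can still be tracked, which is the property the abstract highlights.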


Posted Content•

Learning Online Alignments with Continuous Rewards Policy Gradient


Yuping Luo¹, Chung-Cheng Chiu², Navdeep Jaitly², Ilya Sutskever•Institutions (2)

Tsinghua University¹, Google²

03 Aug 2016-arXiv: Learning

TL;DR: The model uses hard binary stochastic decisions to select the timesteps at which outputs are produced, and it is trained to make these decisions with a standard policy gradient method.


Abstract: Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.
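To make the phrase "hard binary stochastic decisions ... trained using a standard policy gradient method" concrete, the following is a toy sketch of the score-function (REINFORCE) estimator for per-timestep Bernoulli emit decisions. It is not the paper's model: the logits, the reward, and the function names are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_emit_decisions(logits, rng):
    """Sample one binary emit/no-emit decision per timestep."""
    probs = [sigmoid(l) for l in logits]
    decisions = [1 if rng.random() < p else 0 for p in probs]
    return decisions, probs


def reinforce_gradient(decisions, probs, reward, baseline=0.0):
    """Score-function gradient with respect to each logit.

    For b ~ Bernoulli(sigmoid(logit)), d/dlogit log p(b) = b - p,
    so each gradient term is (reward - baseline) * (b - p).
    """
    return [(reward - baseline) * (b - p) for b, p in zip(decisions, probs)]


rng = random.Random(0)
logits = [0.0, 1.0, -1.0]           # hypothetical per-timestep emit logits
decisions, probs = sample_emit_decisions(logits, rng)
reward = float(sum(decisions))      # toy reward, an assumption for the demo
grads = reinforce_gradient(decisions, probs, reward)
```

In practice a baseline is usually subtracted from the reward to reduce the variance of this estimator; the `baseline` argument leaves room for that.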


Proceedings Article•

Neural Random-Access Machines


Karol Kurach¹, Marcin Andrychowicz¹, Ilya Sutskever¹•Institutions (1)

Google¹

01 Jan 2016

TL;DR: A Neural Random-Access Machine is proposed that can manipulate and dereference pointers to an external variable-size random-access memory; the model is trained from pure input-output examples using backpropagation.


Abstract: In this paper, we propose and investigate a new neural network architecture called Neural Random Access Machine. It can manipulate and dereference pointers to an external variable-size random-access memory. The model is trained from pure input-output examples using backpropagation. We evaluate the new model on a number of simple algorithmic tasks whose solutions require pointer manipulation and dereferencing. Our results show that the proposed model can learn to solve algorithmic tasks of such type and is capable of operating on simple data structures like linked-lists and binary trees. For easier tasks, the learned solutions generalize to sequences of arbitrary length. Moreover, memory access during inference can be done in a constant time under some assumptions.


Posted Content•

Extensions and Limitations of the Neural GPU


Eric Price, Wojciech Zaremba, Ilya Sutskever

04 Nov 2016-arXiv: Neural and Evolutionary Computing

TL;DR: It is found that Neural GPUs that correctly generalize to arbitrarily long numbers still fail to compute the correct answer on highly-symmetric, atypical inputs: for example, a Neural GPU that achieves near-perfect generalization on decimal multiplication of up to 100-digit long numbers can fail on $000000\dots002 \times 000000\dots002$ while succeeding at $2 \times 2$.


Abstract: The Neural GPU is a recent model that can learn algorithms such as multi-digit binary addition and binary multiplication in a way that generalizes to inputs of arbitrary length. We show that there are two simple ways of improving the performance of the Neural GPU: by carefully designing a curriculum, and by increasing model size. The latter requires a memory efficient implementation, as a naive implementation of the Neural GPU is memory intensive. We find that these techniques increase the set of algorithmic problems that can be solved by the Neural GPU: we have been able to learn to perform all the arithmetic operations (and generalize to arbitrarily long numbers) when the arguments are given in the decimal representation (which, surprisingly, has not been possible before). We have also been able to train the Neural GPU to evaluate long arithmetic expressions with multiple operands that require respecting the precedence order of the operands, although these have succeeded only in their binary representation, and not with perfect accuracy. In addition, we gain insight into the Neural GPU by investigating its failure modes. We find that Neural GPUs that correctly generalize to arbitrarily long numbers still fail to compute the correct answer on highly-symmetric, atypical inputs: for example, a Neural GPU that achieves near-perfect generalization on decimal multiplication of up to 100-digit long numbers can fail on $000000\dots002 \times 000000\dots002$ while succeeding at $2 \times 2$. These failure modes are reminiscent of adversarial examples.
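The failure case quoted above is striking because the two inputs denote the same arithmetic problem. A short sketch makes that explicit; the helper names and the 100-digit width are illustrative assumptions, and the code only builds the inputs and computes the ground truth, it does not run a Neural GPU.

```python
def padded_operand(value: int, width: int) -> str:
    """Render a non-negative integer as a zero-padded decimal string."""
    return str(value).zfill(width)


def true_product(a: str, b: str) -> int:
    """Ground-truth product of two decimal strings (leading zeros ignored)."""
    return int(a) * int(b)


short_a, short_b = "2", "2"
long_a = padded_operand(2, 100)   # "000...002", 100 digits wide
long_b = padded_operand(2, 100)

short_answer = true_product(short_a, short_b)
long_answer = true_product(long_a, long_b)
```

Both products equal 4, so a model that succeeds on the short form but fails on the padded form is reacting to the surface representation rather than the underlying numbers.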


Patent•

Training neural networks on partitioned training data


Ilya Sutskever¹, Wojciech Zaremba¹•Institutions (1)

Google¹

07 Apr 2016

Patent•

Convolutional gated recurrent neural networks


Lukasz Kaiser¹, Ilya Sutskever¹•Institutions (1)

Google¹

11 Nov 2016

Patent•

Generating target sequences from input sequences using partial conditioning

Navdeep Jaitly¹, Quoc V. Le¹, Oriol Vinyals¹, Samuel Bengio¹, Ilya Sutskever¹•Institutions (1)

Google¹

11 Nov 2016

TL;DR: A system can be configured to perform tasks such as converting recorded speech to a sequence of phonemes that represent the speech, translating an input sequence of words in one language into a corresponding sequence of words in another language, or predicting a target sequence of words that follow an input sequence of words in a language (e.g., a language model).


Abstract: A system can be configured to perform tasks such as converting recorded speech to a sequence of phonemes that represent the speech, converting an input sequence of graphemes into a target sequence of phonemes, translating an input sequence of words in one language into a corresponding sequence of words in another language, or predicting a target sequence of words that follow an input sequence of words in a language (e.g., a language model). In a speech recognizer, the RNN system may be used to convert speech to a target sequence of phonemes in real-time so that a transcription of the speech can be generated and presented to a user, even before the user has completed uttering the entire speech input.


Patent•

Predicting likelihoods of conditions being satisfied using recurrent neural networks


Greg S. Corrado¹, Ilya Sutskever¹, Jeffrey Dean¹•Institutions (1)

Google¹

26 Jul 2016

Patent•

Selecting actions to be performed by a reinforcement learning agent using tree search

Thore Graepel¹, Shih-chieh Huang, David Silver, Arthur Guez, Laurent Sifre, Ilya Sutskever, Chris J. Maddison•Institutions (1)

Google¹

29 Sep 2016

TL;DR: A value neural network is trained to generate a value score for the state of an environment, representing the predicted long-term reward resulting from the environment being in that state.


Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media, for training a value neural network that is configured to receive an observation characterizing a state of an environment being interacted with by an agent and to process the observation in accordance with parameters of the value neural network to generate a value score. One of the systems performs operations that include training a supervised learning policy neural network; initializing initial values of parameters of a reinforcement learning policy neural network having a same architecture as the supervised learning policy network to the trained values of the parameters of the supervised learning policy neural network; training the reinforcement learning policy neural network on second training data; and training the value neural network to generate a value score for the state of the environment that represents a predicted long-term reward resulting from the environment being in the state.


Patent•

Augmenting neural networks with external memory using reinforcement learning

Ilya Sutskever¹, Ivo Danihelka¹, Alex Graves¹, Gregory Duncan Wayne¹, Wojciech Zaremba¹•Institutions (1)

Google¹

30 Dec 2016
