Showing papers on "Reinforcement learning published in 2017"

PDF

Open Access

Posted Content•

[...]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov - Show less +1 more

20 Jul 2017-arXiv: Learning

TL;DR: A new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent, are proposed.

...read moreread less

Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

...read moreread less

9,020 citations

Journal Article•DOI•

Mastering the game of Go without human knowledge

[...]

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis - Show less +13 more

19 Oct 2017-Nature

TL;DR: An algorithm based solely on reinforcement learning is introduced, without human data, guidance or domain knowledge beyond game rules, that achieves superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

...read moreread less

Abstract: A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo. Starting from zero knowledge and without human data, AlphaGo Zero was able to teach itself to play Go and to develop novel strategies that provide new insights into the oldest of games. To beat world champions at the game of Go, the computer program AlphaGo has relied largely on supervised learning from millions of human expert moves. David Silver and colleagues have now produced a system called AlphaGo Zero, which is based purely on reinforcement learning and learns solely from self-play. Starting from random moves, it can reach superhuman level in just a couple of days of training and five million games of self-play, and can now beat all previous versions of AlphaGo. Because the machine independently discovers the same fundamental principles of the game that took humans millennia to conceptualize, the work suggests that such principles have some universal character, beyond human bias.

...read moreread less

7,818 citations

Proceedings Article•DOI•

Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization

[...]

Ramprasaath R. Selvaraju¹, Michael Cogswell², Abhishek Das², Ramakrishna Vedantam¹, Devi Parikh³, Dhruv Batra¹ - Show less +2 more•Institutions (3)

Virginia Tech¹, Georgia Institute of Technology², Facebook³

01 Oct 2017

TL;DR: This work combines existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and applies it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.

...read moreread less

Abstract: We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad- CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multi-modal inputs (e.g. visual question answering) or reinforcement learning, without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are more faithful to the underlying model, and (d) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show even non-attention based models can localize inputs. Finally, we design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at https: //github.com/ramprs/grad-cam/ along with a demo on CloudCV [2] and video at youtu.be/COjUB9Izk6E.

...read moreread less

7,556 citations

Proceedings Article•

Model-agnostic meta-learning for fast adaptation of deep networks

[...]

Chelsea Finn¹, Pieter Abbeel¹, Sergey Levine¹•Institutions (1)

University of California, Berkeley¹

06 Aug 2017

TL;DR: An algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning is proposed.

...read moreread less

Abstract: We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.

...read moreread less

7,027 citations

Book•

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

[...]

Aurélien Géron

13 Mar 2017

TL;DR: This practical book shows you how to implement programs capable of learning from data by using concrete examples, minimal theory, and two production-ready Python frameworks-scikit-learn and TensorFlow-author Aurelien Geron helps you gain an intuitive understanding of the concepts and tools for building intelligent systems.

...read moreread less

Abstract: Through a series of recent breakthroughs, deep learning has boosted the entire field of machine learning. Now, even programmers who know close to nothing about this technology can use simple, efficient tools to implement programs capable of learning from data. This practical book shows you how. By using concrete examples, minimal theory, and two production-ready Python frameworks-scikit-learn and TensorFlow-author Aurelien Geron helps you gain an intuitive understanding of the concepts and tools for building intelligent systems. You'll learn a range of techniques, starting with simple linear regression and progressing to deep neural networks. With exercises in each chapter to help you apply what you've learned, all you need is programming experience to get started. Explore the machine learning landscape, particularly neural nets Use scikit-learn to track an example machine-learning project end-to-end Explore several training models, including support vector machines, decision trees, random forests, and ensemble methods Use the TensorFlow library to build and train neural nets Dive into neural net architectures, including convolutional nets, recurrent nets, and deep reinforcement learning Learn techniques for training and scaling deep neural nets Apply practical code examples without acquiring excessive machine learning theory or algorithm details

...read moreread less

1,870 citations

Proceedings Article•

Seqgan: sequence generative adversarial nets with policy gradient

[...]

Lantao Yu¹, Weinan Zhang¹, Jun Wang², Yong Yu¹•Institutions (2)

Shanghai Jiao Tong University¹, University College London²

13 Feb 2017

TL;DR: SeqGAN as mentioned in this paper models the data generator as a stochastic policy in reinforcement learning (RL), and the RL reward signal comes from the discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search.

...read moreread less

Abstract: As a new way of training generative models, Generative Adversarial Net (GAN) that uses a discriminative model to guide the training of the generative model has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is for generating sequences of discrete tokens. A major reason lies in that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is nontrivial to balance its current score and the future one once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.

...read moreread less

1,869 citations

Journal Article•DOI•

Deep Reinforcement Learning: A Brief Survey

[...]

Kai Arulkumaran¹, Marc Peter Deisenroth¹, Miles Brundage², Anil A. Bharath¹•Institutions (2)

Imperial College London¹, University of Oxford²

09 Nov 2017-IEEE Signal Processing Magazine

TL;DR: Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher-level understanding of the visual world as discussed by the authors.

...read moreread less

Abstract: Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higherlevel understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.

...read moreread less

1,743 citations

Journal Article•DOI•

A brief survey of deep reinforcement learning

[...]

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil A. Bharath

09 Nov 2017-arXiv: Learning

TL;DR: This survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor critic, and highlight the unique advantages of deep neural networks, focusing on visual understanding via RL.

...read moreread less

Abstract: Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep $Q$-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.

...read moreread less

1,707 citations

Posted Content•

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

[...]

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch - Show less +2 more

07 Jun 2017-arXiv: Learning

TL;DR: An adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination is presented.

...read moreread less

Abstract: We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.

...read moreread less

1,477 citations

Proceedings Article•

Hindsight Experience Replay

[...]

Marcin Andrychowicz¹, Filip Wolski¹, Alex Ray¹, Jonas Schneider¹, Rachel Fong¹, Peter Welinder¹, Bob McGrew¹, Josh Tobin², OpenAI Pieter Abbeel, Wojciech Zaremba¹ - Show less +6 more•Institutions (2)

OpenAI¹, University of California, Berkeley²

05 Jul 2017

TL;DR: A novel technique is presented which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering and may be seen as a form of implicit curriculum.

...read moreread less

Abstract: Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task. The video presenting our experiments is available at https://goo.gl/SMrQnI.

...read moreread less

1,457 citations

Proceedings Article•DOI•

Target-driven visual navigation in indoor scenes using deep reinforcement learning

[...]

Yuke Zhu¹, Roozbeh Mottaghi², Eric Kolve², Joseph J. Lim¹, Abhinav Gupta², Li Fei-Fei¹, Ali Farhadi² - Show less +3 more•Institutions (2)

Stanford University¹, Allen Institute for Artificial Intelligence²

01 May 2017

TL;DR: This article proposed an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization and generalizes across targets and scenes.

...read moreread less

Abstract: Two less addressed issues of deep reinforcement learning are (1) lack of generalization capability to new goals, and (2) data inefficiency, i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to be applied to real-world scenarios. In this paper, we address these two issues and apply our model to target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects. Hence, we can collect a huge number of training samples efficiently. We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), (4) is end-to-end trainable and does not need feature engineering, feature matching between frames or 3D reconstruction of the environment.

...read moreread less

Proceedings Article•DOI•

Self-Critical Sequence Training for Image Captioning

[...]

Steven J. Rennie¹, Etienne Marcheret¹, Youssef Mroueh¹, Jarret Ross¹, Vaibhava Goel¹ - Show less +1 more•Institutions (1)

IBM¹

21 Jul 2017

TL;DR: This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.

...read moreread less

Abstract: Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a baseline to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) is avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test-time is highly effective. Our results on the MSCOCO evaluation sever establish a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7.

...read moreread less

Proceedings Article•

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

[...]

Ryan Lowe¹, Yi Wu², Aviv Tamar², Jean Harb¹, OpenAI Pieter Abbeel, Igor Mordatch³ - Show less +2 more•Institutions (3)

McGill University¹, University of California, Berkeley², OpenAI³

07 Jun 2017

TL;DR: In this article, an actor-critic method was used to learn multi-agent coordination policies in cooperative and competitive multi-player RL games, where agent populations are able to discover various physical and informational coordination strategies.

...read moreread less

Posted Content•

Evolution Strategies as a Scalable Alternative to Reinforcement Learning.

[...]

Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever

10 Mar 2017-arXiv: Machine Learning

TL;DR: This work explores the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients, and highlights several advantages of ES as a blackbox optimization technique.

...read moreread less

Abstract: We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.

...read moreread less

Posted Content•

A Survey on Multi-Task Learning

[...]

Yu Zhang¹, Qiang Yang¹•Institutions (1)

Hong Kong University of Science and Technology¹

25 Jul 2017-arXiv: Learning

TL;DR: Multi-task learning (MTL) as mentioned in this paper is a learning paradigm in machine learning and its aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks.

...read moreread less

Abstract: Multi-Task Learning (MTL) is a learning paradigm in machine learning and its aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey for MTL from the perspective of algorithmic modeling, applications and theoretical analyses. For algorithmic modeling, we give a definition of MTL and then classify different MTL algorithms into five categories, including feature learning approach, low-rank approach, task clustering approach, task relation learning approach and decomposition approach as well as discussing the characteristics of each approach. In order to improve the performance of learning tasks further, MTL can be combined with other learning paradigms including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, we review online, parallel and distributed MTL models as well as dimensionality reduction and feature hashing to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance and we review representative works in this paper. Finally, we present theoretical analyses and discuss several future directions for MTL.

...read moreread less

Proceedings Article•DOI•

Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates

[...]

Shixiang Gu¹, Ethan Holly¹, Timothy P. Lillicrap¹, Sergey Levine¹•Institutions (1)

Google¹

01 May 2017

TL;DR: In this article, a deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots.

...read moreread less

Abstract: Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.

...read moreread less

Posted Content•

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

[...]

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, Demis Hassabis - Show less +9 more

05 Dec 2017-arXiv: Artificial Intelligence

TL;DR: This paper generalises the approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains, and convincingly defeated a world-champion program in each case.

...read moreread less

Abstract: The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

...read moreread less

Posted Content•

Rainbow: Combining Improvements in Deep Reinforcement Learning

[...]

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, David Silver - Show less +6 more

06 Oct 2017-arXiv: Artificial Intelligence

TL;DR: This paper examines six extensions to the DQN algorithm and empirically studies their combination, showing that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance.

...read moreread less

Abstract: The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.

...read moreread less

Posted Content•

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

[...]

Chelsea Finn¹, Pieter Abbeel¹, Sergey Levine¹•Institutions (1)

University of California, Berkeley¹

09 Mar 2017-arXiv: Learning

TL;DR: Model-Agnostic Meta-Learning (MAML) as discussed by the authors is a meta-learning approach that aims to train a model on a variety of learning tasks such that it can solve new learning tasks using only a small number of training samples.

...read moreread less

Posted Content•

Deep Reinforcement Learning: An Overview

[...]

Yuxi Li

25 Jan 2017-arXiv: Learning

TL;DR: This work discusses core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration, and important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn.

...read moreread less

Abstract: We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions. Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.

...read moreread less

Proceedings Article•

Rainbow: Combining Improvements in Deep Reinforcement Learning

[...]

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, David Silver - Show less +6 more

06 Oct 2017

TL;DR: In this article, the authors examined six extensions to the DQN algorithm and empirically studied their combination, showing that the combination provided state-of-the-art performance on the Atari 2600 benchmark.

...read moreread less

Posted Content•

A Simple Neural Attentive Meta-Learner

[...]

Nikhil Mishra¹, Mostafa Rohaninejad, Xi Chen¹, Pieter Abbeel¹•Institutions (1)

University of California, Berkeley¹

11 Jul 2017-arXiv: Artificial Intelligence

TL;DR: This work proposes a class of simple and generic meta-learner architectures that use a novel combination of temporal convolutions and soft attention; the former to aggregate information from past experience and the latter to pinpoint specific pieces of information.

...read moreread less

Abstract: Deep neural networks excel in regimes with large amounts of data, but tend to struggle when data is scarce or when they need to adapt quickly to changes in the task. In response, recent work in meta-learning proposes training a meta-learner on a distribution of similar tasks, in the hopes of generalization to novel but related tasks by learning a high-level strategy that captures the essence of the problem it is asked to solve. However, many recent meta-learning approaches are extensively hand-designed, either using architectures specialized to a particular application, or hard-coding algorithmic components that constrain how the meta-learner solves the task. We propose a class of simple and generic meta-learner architectures that use a novel combination of temporal convolutions and soft attention; the former to aggregate information from past experience and the latter to pinpoint specific pieces of information. In the most extensive set of meta-learning experiments to date, we evaluate the resulting Simple Neural AttentIve Learner (or SNAIL) on several heavily-benchmarked tasks. On all tasks, in both supervised and reinforcement learning, SNAIL attains state-of-the-art performance by significant margins.

...read moreread less

Posted Content•

Meta-SGD: Learning to Learn Quickly for Few Shot Learning.

[...]

Zhenguo Li, Fengwei Zhou, Fei Chen, Hang Li

31 Jul 2017-arXiv: Learning

TL;DR: Meta-SGD, an SGD-like, easily trainable meta-learner that can initialize and adapt any differentiable learner in just one step, shows highly competitive performance for few-shot learning on regression, classification, and reinforcement learning.

...read moreread less

Abstract: Few-shot learning is challenging for learning algorithms that learn each task in isolation and from scratch. In contrast, meta-learning learns from many related tasks a meta-learner that can learn a new task more accurately and faster with fewer examples, where the choice of meta-learners is crucial. In this paper, we develop Meta-SGD, an SGD-like, easily trainable meta-learner that can initialize and adapt any differentiable learner in just one step, on both supervised learning and reinforcement learning. Compared to the popular meta-learner LSTM, Meta-SGD is conceptually simpler, easier to implement, and can be learned more efficiently. Compared to the latest meta-learner MAML, Meta-SGD has a much higher capacity by learning to learn not just the learner initialization, but also the learner update direction and learning rate, all in a single meta-learning process. Meta-SGD shows highly competitive performance for few-shot learning on regression, classification, and reinforcement learning.

...read moreread less

Proceedings Article•

Neural Combinatorial Optimization with Reinforcement Learning

[...]

Irwan Bello¹, Hieu Pham¹, Quoc V. Le², Mohammad Norouzi³, Samy Bengio¹ - Show less +1 more•Institutions (3)

Google¹, Tel Aviv University², Bu-Ali Sina University³

17 Feb 2017

TL;DR: A framework to tackle combinatorial optimization problems using neural networks and reinforcement learning, and Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes.

...read moreread less

Abstract: This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items.

...read moreread less

Proceedings Article•

Constrained policy optimization

[...]

Joshua Achiam¹, David Held¹, Aviv Tamar¹, Pieter Abbeel¹•Institutions (1)

University of California, Berkeley¹

06 Aug 2017

TL;DR: Constrained Policy Optimization (CPO) as discussed by the authors is the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration.

...read moreread less

Abstract: For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2016; Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

...read moreread less

Journal Article•DOI•

Deep Reinforcement Learning framework for Autonomous Driving

[...]

Ahmad El Sallab, Mohammed Abdou, Etienne Perot, Senthil Yogamani

08 Apr 2017-electronic imaging

TL;DR: The proposed framework for autonomous driving using deep reinforcement learning incorporates Recurrent Neural Networks for information integration, enabling the car to handle partially observable scenarios and integrates the recent work on attention models to focus on relevant information, thereby reducing the computational complexity for deployment on embedded hardware.

...read moreread less

Abstract: Reinforcement learning is considered to be a strong AI paradigm which can be used to teach machines through interaction with the environment and learning from their mistakes. Despite its perceived utility, it has not yet been successfully applied in automotive applications. Motivated by the successful demonstrations of learning of Atari games and Go by Google DeepMind, we propose a framework for autonomous driving using deep reinforcement learning. This is of particular relevance as it is difficult to pose autonomous driving as a supervised learning problem due to strong interactions with the environment including other vehicles, pedestrians and roadworks. As it is a relatively new area of research for autonomous driving, we provide a short overview of deep reinforcement learning and then describe our proposed framework. It incorporates Recurrent Neural Networks for information integration, enabling the car to handle partially observable scenarios. It also integrates the recent work on attention models to focus on relevant information, thereby reducing the computational complexity for deployment on embedded hardware. The framework was tested in an open source 3D car racing simulator called TORCS. Our simulation results demonstrate learning of autonomous maneuvering in a scenario of complex road curvatures and simple interaction of other vehicles.

...read moreread less

Posted Content•

Emergence of Locomotion Behaviours in Rich Environments

[...]

Nicolas Heess, Dhruva Tb, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, David Silver - Show less +8 more

07 Jul 2017-arXiv: Artificial Intelligence

TL;DR: This paper explores how a rich environment can help to promote the learning of complex behavior, and finds that this encourages the emergence of robust behaviours that perform well across a suite of tasks.

...read moreread less

Abstract: The reinforcement learning paradigm allows, in principle, for complex behaviours to be learned directly from simple reward signals. In practice, however, it is common to carefully hand-design the reward function to encourage a particular solution, or to derive it from demonstration data. In this paper explore how a rich environment can help to promote the learning of complex behavior. Specifically, we train agents in diverse environmental contexts, and find that this encourages the emergence of robust behaviours that perform well across a suite of tasks. We demonstrate this principle for locomotion -- behaviours that are known for their sensitivity to the choice of reward. We train several simulated bodies on a diverse set of challenging terrains and obstacles, using a simple reward function based on forward progress. Using a novel scalable variant of policy gradient reinforcement learning, our agents learn to run, jump, crouch and turn as required by the environment without explicit reward-based guidance. A visual depiction of highlights of the learned behavior can be viewed following this https URL .

...read moreread less

Proceedings Article•

A Distributional Perspective on Reinforcement Learning

[...]

Marc G. Bellemare¹, Will Dabney², Rémi Munos¹•Institutions (2)

Google¹, Amazon.com²

17 Jul 2017

TL;DR: This paper argues for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent, and designs a new algorithm which applies Bellman's equation to the learning of approximate value distributions.

...read moreread less

Abstract: In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.

...read moreread less

Book Chapter•DOI•

Cooperative Multi-agent Control Using Deep Reinforcement Learning

[...]

Jayesh K. Gupta¹, M. Egorov¹, Mykel J. Kochenderfer¹•Institutions (1)

Stanford University¹

08 May 2017

TL;DR: It is shown that policy gradient methods tend to outperform both temporal-difference and actor-critic methods and that curriculum learning is vital to scaling reinforcement learning algorithms in complex multi-agent domains.

...read moreread less

Abstract: This work considers the problem of learning cooperative policies in complex, partially observable domains without explicit communication. We extend three classes of single-agent deep reinforcement learning algorithms based on policy gradient, temporal-difference error, and actor-critic methods to cooperative multi-agent systems. To effectively scale these algorithms beyond a trivial number of agents, we combine them with a multi-agent variant of curriculum learning. The algorithms are benchmarked on a suite of cooperative control tasks, including tasks with discrete and continuous actions, as well as tasks with dozens of cooperating agents. We report the performance of the algorithms using different neural architectures, training procedures, and reward structures. We show that policy gradient methods tend to outperform both temporal-difference and actor-critic methods and that curriculum learning is vital to scaling reinforcement learning algorithms in complex multi-agent domains.

...read moreread less

Posted Content•

AI2-THOR: An Interactive 3D Environment for Visual AI

[...]

Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, Ali Farhadi - Show less +2 more

14 Dec 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks and facilitate building visually intelligent models.

...read moreread less

Abstract: We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at this http URL. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

...read moreread less

Collapse