Open Access · Posted Content
Benchmarking Batch Deep Reinforcement Learning Algorithms.
TLDR
This paper benchmarks the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy, and finds that many of these algorithms underperform DQN trained online with the same amount of data.
Abstract:
Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting: learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performance under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show that it outperforms all existing algorithms at this task.
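The discrete-action adaptation of Batch-Constrained Q-learning described above restricts the greedy action to those the behavioral policy takes with sufficient probability. A minimal sketch of that action-filtering step, assuming a relative-probability threshold `tau` (function and parameter names are illustrative, not the paper's code):

```python
import numpy as np

def bcq_discrete_action(q_values, behavior_probs, tau=0.3):
    """Pick the greedy action among actions the behavior policy is
    likely enough to take (discrete BCQ-style filtering sketch)."""
    # Allow only actions whose behavior probability is within a
    # fraction `tau` of the most probable behavior action.
    mask = behavior_probs / behavior_probs.max() > tau
    constrained_q = np.where(mask, q_values, -np.inf)
    return int(np.argmax(constrained_q))

q = np.array([1.0, 5.0, 2.0])
p = np.array([0.7, 0.05, 0.25])   # behavior policy rarely takes action 1
print(bcq_discrete_action(q, p))  # action 1 is masked out; picks action 2
```

Unconstrained Q-learning would pick action 1 here; the filter rules it out because the data contains too few examples of it for its Q-value to be trusted.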
Citations
Journal Article
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
TL;DR: This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.
Posted Content
An Optimistic Perspective on Offline Reinforcement Learning
TL;DR: It is demonstrated that recent off-policy deep RL algorithms, even when trained solely on this replay dataset, outperform the fully trained DQN agent. Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates, is also presented.
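REM's core mechanism, per the summary above, is training against random convex combinations of several Q-value heads. A hedged sketch of just the mixing step (names are illustrative):

```python
import numpy as np

def rem_q_values(q_heads, rng):
    """Mix K Q-value heads with a random convex combination, the core
    idea of Random Ensemble Mixture (illustrative sketch)."""
    alpha = rng.random(len(q_heads))
    alpha = alpha / alpha.sum()              # weights >= 0, summing to 1
    return sum(a * q for a, q in zip(alpha, q_heads))

rng = np.random.default_rng(0)
heads = [np.array([1.0, 2.0]), np.array([3.0, 6.0])]
mixed = rem_q_values(heads, rng)  # each entry lies between the heads' values
```

Because the weights form a convex combination, the mixed Q-value for each action is bounded by the per-head minimum and maximum, which is what lets a fresh random mixture serve as the target on every training step.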
Posted Content
Acme: A Research Framework for Distributed Reinforcement Learning
Matthew W. Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, Sarah Henderson, Alexander Novikov, Sergio Gomez Colmenarejo, Serkan Cabi, Caglar Gulcehre, Tom Le Paine, Andrew Cowie, Ziyu Wang, Bilal Piot, Nando de Freitas, +19 more
TL;DR: It is shown that the design decisions behind Acme lead to agents that can be scaled both up and down and that, for the most part, greater levels of parallelization result in agents with equivalent performance, just faster.
Posted Content
A Theoretical Analysis of Deep Q-Learning
TL;DR: In this paper, the authors make the first attempt to theoretically understand the deep Q-network (DQN) algorithm (Mnih et al., 2015) from both algorithmic and statistical perspectives, focusing on a slight simplification of DQN that fully captures its key features.
Proceedings Article
Critic Regularized Regression
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh Merel, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, Nando de Freitas, +10 more
TL;DR: This paper proposes a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR), and finds that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
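CRR, as summarized above, learns the policy by regression onto dataset actions, weighted by the critic's advantage estimate. A minimal sketch of such a weighting rule, covering the exponential and binary variants the paper describes (names and the clipping constant are illustrative):

```python
import numpy as np

def crr_weight(q_sa, v_s, mode="exp", beta=1.0, clip=20.0):
    """Weight applied to the behavior-cloning loss for one (s, a) pair,
    in the spirit of critic-regularized regression (sketch)."""
    adv = q_sa - v_s                      # advantage estimate from the critic
    if mode == "binary":
        return float(adv > 0.0)          # imitate only improving actions
    return float(min(np.exp(adv / beta), clip))  # exponential weight, clipped
```

Actions the critic rates above the state value get amplified in the regression target, while the binary variant drops non-improving actions entirely.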
References
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
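The update rule Adam describes maintains exponential moving averages of the gradient and its element-wise square, applies bias correction, and scales the step accordingly. A single-step sketch of that rule:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: biased moment estimates, bias correction,
    then an SGD-style step scaled by the second moment."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

On the first step with a unit gradient, the bias-corrected moments are both 1, so the parameter moves by approximately the learning rate regardless of the gradient's raw scale.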
Journal ArticleDOI
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis
TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Book
Dynamic Programming
TL;DR: The more the authors study the information processing aspects of the mind, the more perplexed and impressed they become, and it will be a very long time before they understand these processes sufficiently to reproduce them.
Journal ArticleDOI
Robust Estimation of a Location Parameter
TL;DR: In this article, a new approach toward a theory of robust estimation is presented, which treats in detail the asymptotic theory of estimating a location parameter for contaminated normal distributions, and exhibits estimators that are asymptotically most robust (in a sense to be specified) among all translation-invariant estimators.
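The robust estimator from this reference underlies the Huber loss, which is commonly used in DQN-style training to keep large Bellman errors from dominating the update. A minimal sketch:

```python
def huber_loss(x, delta=1.0):
    """Huber's loss: quadratic for small residuals, linear for large
    ones, so outliers influence the fit less than under squared error."""
    a = abs(x)
    if a <= delta:
        return 0.5 * x * x
    return delta * (a - 0.5 * delta)
```

The two branches meet with matching value and slope at `|x| = delta`, so the loss stays smooth while capping the gradient magnitude at `delta`.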