Open Access Journal Article
Temporal Difference Uncertainties as a Signal for Exploration
Sebastian Flennerhag, Jane X. Wang, Pablo Sprechmann, Francesco Visin, Alexandre Galashov, Steven Kapturowski, Diana Borsa, Nicolas Heess, Andre Barreto, Razvan Pascanu +9 more
TLDR
A novel method for estimating uncertainty over the value function that induces a distribution over temporal difference errors, incorporates the resulting uncertainty as an intrinsic reward, and treats exploration as a separate learning problem driven by the agent's temporal difference uncertainties.

Abstract:
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging as the exploration problem itself. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments, and find that our proposed form of exploration facilitates efficient exploration.
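The core signal described in the abstract can be illustrated with a minimal sketch, assuming an ensemble of value-parameter samples stands in for the agent's parameter posterior (the ensemble `V`, the state indexing, and `td_uncertainty` are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K value-function parameter samples approximate the
# posterior over the agent's parameters.
K, num_states = 8, 5
V = rng.normal(size=(K, num_states))   # ensemble value estimates V_k(s)
gamma = 0.99

def td_uncertainty(s, r, s_next):
    """Variance of TD errors across the ensemble for a fixed transition.

    Conditioning on the observed (s, r, s') transition isolates uncertainty
    that stems from the agent's parameters rather than from the dynamics.
    """
    deltas = r + gamma * V[:, s_next] - V[:, s]   # one TD error per member
    return deltas.var()

# Used as an intrinsic reward for a separate exploration policy, this signal
# vanishes as the ensemble agrees, i.e. in the limit of perfect values.
r_int = td_uncertainty(s=0, r=1.0, s_next=3)
```

Because the signal is conditioned on a transition the agent has already taken, it cannot be acted on directly; a separate exploration policy is trained to seek transitions where it is large.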
Citations
Proceedings Article (DOI)
Semantic Exploration from Language Abstractions and Pretrained Representations
Allison C. Tam, Neil C. Rabinowitz, Andrew K. Lampinen, Nicholas Roy, Stephanie C.Y. Chan, DJ Strouse, Jane X. Wang, Andrea Banino, Felix Hill +8 more
TL;DR: This work evaluates vision-language representations, pretrained on natural image captioning datasets, and shows that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments.
Proceedings Article
Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation
TL;DR: This work proposes a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision, and introduces inverse-variance RL, a Bayesian framework which combines probabilistic ensembles and Batch Inverse Variance weighting.
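The inverse-variance weighting idea in that TL;DR can be sketched in a few lines, assuming per-sample variance estimates are already available (the function name and shapes here are illustrative, not the paper's API):

```python
import numpy as np

def inverse_variance_loss(td_errors, variances, eps=1e-8):
    """Batch loss with inverse-variance weights: targets with higher
    estimated uncertainty contribute less, mitigating noisy supervision."""
    w = 1.0 / (variances + eps)
    w = w / w.sum()                 # normalize weights over the batch
    return (w * td_errors**2).sum()

# A confident (low-variance) target dominates; a noisy one is downweighted.
errs = np.array([1.0, 1.0])
variances = np.array([0.1, 10.0])
loss = inverse_variance_loss(errs, variances)
```

With equal errors the weighted loss reduces to the plain squared error, while a large error on a high-variance target is suppressed relative to the same error on a low-variance one.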
Proceedings Article (DOI)
Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning
Dilip Arumugam, Benjamin Van Roy +1 more
TL;DR: An algorithm is introduced that iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model, and an information-theoretic, Bayesian regret bound is proved for this algorithm that holds for any finite-horizon, episodic sequential decision-making problem.
Posted Content
Reinforcement Learning, Bit by Bit
TL;DR: This paper develops concepts and establishes a regret bound that together offer principled guidance on what information to seek, how to seek that information, and how to retain information in reinforcement learning agents, then designs simple agents that build on these ideas and presents computational results that demonstrate improvements in data efficiency.
Posted Content
Learning more skills through optimistic exploration
TL;DR: This article proposed DISDAIN, an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement, which directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples.
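The discriminator-disagreement reward described in that TL;DR amounts to an ensemble information gain: the entropy of the averaged prediction minus the average entropy of the members. A minimal sketch, assuming each discriminator outputs a probability distribution over skills (names here are illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy along the last axis, in nats."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def disagreement_bonus(probs):
    """Ensemble disagreement: entropy of the mean prediction minus the mean
    entropy of the members. Zero when all discriminators agree, positive
    otherwise, approximating the epistemic part of the uncertainty."""
    mean_p = probs.mean(axis=0)
    return entropy(mean_p) - entropy(probs).mean()

# Two hypothetical discriminators over 2 skills.
agree = np.array([[0.9, 0.1], [0.9, 0.1]])
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
```

Rewarding the policy for this bonus steers it toward states the discriminators have not yet learned to classify consistently.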
References
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba +1 more
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
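The adaptive moment estimates mentioned in the TL;DR are exponential moving averages of the gradient and its square, with bias correction. A minimal single-step sketch (the quadratic toy objective is illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update from adaptive estimates of the first two moments."""
    m = beta1 * m + (1 - beta1) * g          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # biased second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias-corrected moments
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 from x = 1; the gradient is 2x.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Note that on the very first step the bias-corrected update has magnitude close to `lr` regardless of the gradient's scale, one reason Adam is robust to gradient rescaling.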
Journal Article (DOI)
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis +18 more
TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Deep reinforcement learning with double Q-learning
TL;DR: In this article, the authors show that the DQN algorithm suffers from substantial overestimation in some games in the Atari 2600 domain, and they propose a specific adaptation to the algorithm and show that this algorithm not only reduces the observed overestimations, but also leads to much better performance on several games.
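The adaptation described in that TL;DR is the double Q-learning target: the online network selects the next action, the target network evaluates it, decoupling selection from evaluation to reduce overestimation. A minimal sketch (the toy action values are illustrative):

```python
import numpy as np

gamma = 0.99

def double_q_target(r, q_online_next, q_target_next, done):
    """Double Q-learning target: argmax from the online net, value from the
    target net. A plain max target would use q_target_next.max() instead,
    which is biased upward under estimation noise."""
    a_star = np.argmax(q_online_next)    # action chosen by online network
    return r + (1.0 - done) * gamma * q_target_next[a_star]

# Toy next-state action values for a hypothetical 3-action problem.
q_online_next = np.array([1.0, 2.0, 0.5])
q_target_next = np.array([0.8, 1.5, 3.0])
y = double_q_target(r=1.0, q_online_next=q_online_next,
                    q_target_next=q_target_next, done=0.0)
```

Here the online network picks action 1, so the target uses 1.5 rather than the target network's maximum of 3.0.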
Journal Article (DOI)
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
Proceedings Article
Maximum entropy inverse reinforcement learning
TL;DR: A probabilistic approach based on the principle of maximum entropy is developed; it provides a well-defined, globally normalized distribution over decision sequences while offering the same performance guarantees as existing methods.
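The globally normalized distribution in that TL;DR weights each trajectory exponentially by its return, P(τ) ∝ exp(R(τ)). A minimal sketch over an enumerable set of trajectories (the toy returns are illustrative; real MaxEnt IRL computes the normalizer with dynamic programming rather than enumeration):

```python
import numpy as np

def maxent_trajectory_probs(returns):
    """Maximum-entropy distribution over decision sequences:
    P(tau) proportional to exp(R(tau)), so higher-return trajectories are
    exponentially more likely, with no other preference imposed."""
    z = np.exp(returns - returns.max())   # subtract max for stability
    return z / z.sum()

# Three hypothetical trajectories with returns 1, 2, and 3.
p = maxent_trajectory_probs(np.array([1.0, 2.0, 3.0]))
```

Each unit of extra return multiplies a trajectory's probability by e, which is what makes the distribution well defined and globally normalized.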