Open Access Journal Article

Temporal Difference Uncertainties as a Signal for Exploration

TLDR
A novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors, incorporates this uncertainty as an intrinsic reward, and treats exploration as a separate learning problem induced by the agent's temporal difference uncertainties.
Abstract
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging as the exploration problem itself. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments, and find that our proposed form of exploration facilitates efficient exploration.
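To make the signal concrete, here is a minimal sketch, assuming an ensemble of value functions with linear heads (the ensemble size, head form, and reward scale are illustrative assumptions, not the paper's architecture): the ensemble is evaluated on one fixed transition, and the spread of the resulting temporal difference errors is used as an intrinsic reward for a separate exploration policy.

```python
import numpy as np

def make_value_fn(obs_dim, seed):
    """One linear value head; a stand-in for an ensemble member's value network."""
    w = np.random.default_rng(seed).normal(scale=0.1, size=obs_dim)
    return lambda s: float(s @ w)

def td_uncertainty(ensemble, s, r, s_next, gamma=0.99):
    """Intrinsic reward: spread of TD errors across ensemble members for a
    *fixed* transition (s, r, s_next), so that parameter uncertainty, not
    transition noise, drives the signal."""
    td_errors = [r + gamma * v(s_next) - v(s) for v in ensemble]
    return float(np.std(td_errors))

# Toy usage on a single random transition.
rng = np.random.default_rng(0)
obs_dim = 4
ensemble = [make_value_fn(obs_dim, seed=k) for k in range(10)]
s, s_next = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
r_ext = 1.0
r_int = td_uncertainty(ensemble, s, r_ext, s_next)
print(f"intrinsic reward from TD-error disagreement: {r_int:.3f}")
```

Because the transition is held fixed, environment stochasticity does not inflate the bonus, and as value estimates converge the ensemble's TD errors agree, so the intrinsic reward shrinks toward zero, consistent with the vanishing curriculum described in the abstract.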



Citations
Proceedings Article (DOI)

Semantic Exploration from Language Abstractions and Pretrained Representations

TL;DR: This work evaluates vision-language representations, pretrained on natural image captioning datasets, and shows that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments.
Proceedings Article

Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation

TL;DR: This work proposes a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision, and introduces inverse-variance RL, a Bayesian framework which combines probabilistic ensembles and Batch Inverse Variance weighting.
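As a rough sketch of the inverse-variance idea (the framework couples probabilistic ensembles with Batch Inverse Variance weighting; the epsilon constant, normalisation, and toy ensemble below are assumptions for illustration), each sample's TD loss can be down-weighted by the estimated variance of its target:

```python
import numpy as np

def inverse_variance_loss(q_pred, target_mean, target_var, eps=1e-2):
    """Weight each sample's squared TD error by the inverse of its estimated
    target variance, so noisy targets contribute less to the update.
    `eps` (an assumed constant) keeps the weights bounded."""
    weights = 1.0 / (target_var + eps)
    weights = weights / weights.sum()  # normalise across the batch
    return float(np.sum(weights * (q_pred - target_mean) ** 2))

# Toy batch: an ensemble of 5 target estimates for each of 8 samples.
rng = np.random.default_rng(1)
ensemble_targets = rng.normal(size=(5, 8))  # (ensemble members, batch)
q_pred = rng.normal(size=8)
loss = inverse_variance_loss(q_pred,
                             ensemble_targets.mean(axis=0),
                             ensemble_targets.var(axis=0))
print(f"inverse-variance weighted TD loss: {loss:.4f}")
```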
Proceedings Article (DOI)

Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

TL;DR: An algorithm is introduced that iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model, and an information-theoretic, Bayesian regret bound is proved for this algorithm that holds for any finite-horizon, episodic sequential decision-making problem.
Posted Content

Reinforcement Learning, Bit by Bit

TL;DR: This paper develops concepts and establishes a regret bound that together offer principled guidance on what information to seek, how to seek that information, and how to retain information in reinforcement learning agents; it then designs simple agents that build on these ideas and presents computational results demonstrating improvements in data efficiency.
Posted Content

Learning more skills through optimistic exploration

TL;DR: This article proposes DISDAIN, an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement, which directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples.
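A minimal sketch of an ensemble-disagreement bonus of the kind described above, assuming the discriminators output categorical distributions over skills; the entropy-of-mean minus mean-entropy form is a standard information-gain estimate and is not claimed to be the paper's exact objective:

```python
import numpy as np

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def disagreement_bonus(discriminator_probs):
    """Information-gain style bonus: entropy of the averaged skill prediction
    minus the average per-member entropy. It is zero when all members agree
    and grows with epistemic disagreement."""
    mean_p = discriminator_probs.mean(axis=0)
    return entropy(mean_p) - float(np.mean([entropy(p) for p in discriminator_probs]))

# Toy example: 4 discriminators predicting over 8 skills for one state.
rng = np.random.default_rng(2)
logits = rng.normal(size=(4, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(f"disagreement bonus: {disagreement_bonus(probs):.3f}")
```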
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
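For reference, a bare-bones sketch of a single Adam step on one parameter array, using the first- and second-moment estimates with bias correction; the hyperparameter defaults are the commonly cited ones, and the quadratic toy objective is only for illustration:

```python
import numpy as np

def adam_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    its elementwise square (v), bias-corrected, then a per-coordinate step."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return params - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage: minimise f(x) = ||x||^2 from a random start.
x = np.random.default_rng(3).normal(size=5)
state = {"t": 0, "m": np.zeros_like(x), "v": np.zeros_like(x)}
for _ in range(2000):
    x = adam_step(x, 2 * x, state, lr=0.05)  # gradient of ||x||^2 is 2x
print(f"final squared norm: {float(x @ x):.6f}")
```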
Journal Article (DOI)

Human-level control through deep reinforcement learning

TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

Deep reinforcement learning with double Q-learning

TL;DR: In this article, the authors show that the DQN algorithm suffers from substantial overestimation in some games in the Atari 2600 domain, and they propose a specific adaptation to the algorithm, showing that it not only reduces the observed overestimations but also leads to much better performance on several games.
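A short sketch of the double-Q target this adaptation is usually implemented with in deep RL: the online network selects the greedy next action and the target network evaluates it, decoupling action selection from value estimation. The tabular Q arrays below stand in for the two networks and are assumptions for illustration.

```python
import numpy as np

def double_q_target(r, s_next, done, q_online, q_target, gamma=0.99):
    """Double-Q target: the online network picks the greedy action,
    the target network supplies its value."""
    a_star = int(np.argmax(q_online[s_next]))
    return r + (0.0 if done else gamma * float(q_target[s_next, a_star]))

# Toy example with 3 states and 2 actions.
rng = np.random.default_rng(4)
q_online = rng.normal(size=(3, 2))
q_target = rng.normal(size=(3, 2))
print(double_q_target(r=1.0, s_next=2, done=False,
                      q_online=q_online, q_target=q_target))
```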
Proceedings Article

Maximum entropy inverse reinforcement learning

TL;DR: A probabilistic approach based on the principle of maximum entropy is developed that provides a well-defined, globally normalized distribution over decision sequences while offering the same performance guarantees as existing methods.
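In the standard presentation of this maximum-entropy formulation (assuming, as is usual, a reward that is linear in trajectory features, $r_\theta(\tau) = \theta^{\top} f_{\tau}$), trajectories are weighted exponentially by reward and the log-likelihood gradient reduces to a gap between demonstrated and expected feature counts:

```latex
P(\tau \mid \theta) = \frac{\exp\!\big(\theta^{\top} f_{\tau}\big)}{Z(\theta)},
\qquad
\nabla_{\theta} \mathcal{L}(\theta) = \tilde{f} - \sum_{\tau} P(\tau \mid \theta)\, f_{\tau}
```

where $\tilde{f}$ is the empirical feature expectation of the demonstrations and $Z(\theta)$ is the partition function that globally normalizes the distribution over decision sequences; following this gradient matches the learner's expected feature counts to the expert's while committing to no structure beyond that constraint.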