Open Access · Posted Content
(More) Efficient Reinforcement Learning via Posterior Sampling
TL;DR
An Õ(τS√AT) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Abstract
Most provably efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient, and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
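The episodic loop described in the abstract (sample an MDP from the posterior, follow its optimal policy for one episode) can be sketched for the tabular case as below. This is a minimal illustration, not the paper's implementation: the function name, the Dirichlet transition posterior, and the unit-variance Gaussian reward posterior are all assumptions chosen for brevity.

```python
import numpy as np

def psrl_episode(counts, reward_sums, S, A, horizon, rng):
    """One PSRL planning step on a tabular MDP (illustrative sketch).

    counts[s, a, s']  -- Dirichlet pseudo-counts for transitions
    reward_sums[s, a] -- summed observed rewards, used with a simple
                         Gaussian posterior over mean rewards (assumed)
    Returns a nonstationary policy: policy[h, s] = action at step h.
    """
    # Sample one transition model from the Dirichlet posterior.
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])

    # Sample mean rewards from a crude Normal posterior (assumption).
    n = counts.sum(axis=2)
    R = rng.normal(reward_sums / n, 1.0 / np.sqrt(n))

    # Finite-horizon value iteration on the *sampled* MDP.
    policy = np.zeros((horizon, S), dtype=int)
    V = np.zeros(S)
    for h in reversed(range(horizon)):
        Q = R + P @ V              # Q[s, a] = r(s,a) + E[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

After the episode, the agent would increment `counts` and `reward_sums` with the observed transitions and rewards, which is what makes the posterior concentrate over time.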
Citations
Proceedings Article
Deep exploration via bootstrapped DQN
TL;DR: Bootstrapped DQN combines deep exploration with deep neural networks, learning exponentially faster than any dithering strategy; it is a promising approach to efficient exploration with generalization.
Posted Content
RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning
TL;DR: This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data, encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.
Posted Content
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
TL;DR: Proposes an off-policy meta-RL algorithm that disentangles task inference and control by performing online probabilistic filtering over latent task variables to infer how to solve a new task from small amounts of experience.
Posted Content
Randomized Prior Functions for Deep Reinforcement Learning
TL;DR: Proposes adding a randomized, untrainable "prior" network to each ensemble member; this approach is provably efficient with linear representations, is illustrated with nonlinear representations, and scales to large-scale problems far better than previous attempts.
Proceedings Article
Thompson Sampling for Complex Online Problems
TL;DR: Proves a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action, and observation spaces and a likelihood function over them, and derives improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear reward feedback from subsets.
References
Journal ArticleDOI
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
Proceedings Article
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle, Lihong Li, et al.
TL;DR: Presents empirical results using Thompson sampling on simulated and real data, showing that it is highly competitive and should be part of the standard baselines to compare against.
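The Thompson sampling baseline evaluated in that paper can be sketched for the Beta-Bernoulli bandit case as follows. This is a generic illustration of the technique, not the paper's experimental code; the function name and the uniform Beta(1, 1) prior are assumptions.

```python
import numpy as np

def thompson_bernoulli(true_means, steps, rng):
    """Beta-Bernoulli Thompson sampling on a K-armed bandit (sketch).

    Each arm keeps a Beta posterior over its success probability;
    at every step we sample one mean per arm and play the argmax.
    Returns the average observed reward.
    """
    K = len(true_means)
    successes = np.ones(K)   # Beta(1, 1) uniform prior (assumption)
    failures = np.ones(K)
    total = 0.0
    for _ in range(steps):
        theta = rng.beta(successes, failures)  # one posterior sample per arm
        arm = int(theta.argmax())
        r = rng.random() < true_means[arm]     # Bernoulli reward
        successes[arm] += r
        failures[arm] += 1 - r
        total += r
    return total / steps
```

Because arms are chosen in proportion to the posterior probability that they are optimal, exploration decays automatically as the posteriors concentrate, which is the same principle PSRL lifts from bandits to MDPs.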
Book
Stochastic systems : estimation, identification, and adaptive control
P. R. Kumar, Pravin Varaiya, et al.
TL;DR: A textbook treatment of estimation, system identification, and adaptive control for stochastic systems.
Journal ArticleDOI
R-max - a general polynomial time algorithm for near-optimal reinforcement learning
TL;DR: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the "optimism under uncertainty" bias used in many RL algorithms.
Journal Article
Near-optimal Regret Bounds for Reinforcement Learning
TL;DR: For undiscounted reinforcement learning in Markov decision processes (MDPs), presents an algorithm with total regret Õ(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.