Open Access · Posted Content

(More) Efficient Reinforcement Learning via Posterior Sampling

TLDR
An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
Abstract
Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
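To make the episodic loop concrete, the following is a minimal tabular sketch of posterior sampling, not the authors' implementation: it assumes a finite MDP with known rewards, a Dirichlet prior over transition probabilities, and hypothetical `env_reset`/`env_step` callables for interacting with the environment.

```python
import numpy as np

def psrl_episode(counts, rewards, S, A, tau, env_reset, env_step, rng):
    """One PSRL episode: sample an MDP from the posterior, plan, then act.

    counts[s, a, s'] are Dirichlet posterior parameters over transitions
    (prior plus observed transition counts); rewards[s, a] are assumed
    known here to keep the sketch short.
    """
    # 1. Sample a transition model from the Dirichlet posterior.
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                  for s in range(S)])

    # 2. Compute the optimal policy for the sampled MDP by finite-horizon
    #    backward induction over the episode length tau.
    V = np.zeros(S)
    policy = np.zeros((tau, S), dtype=int)
    for t in reversed(range(tau)):
        Q = rewards + P @ V          # Q[s, a] = r(s, a) + sum_s' P[s, a, s'] V[s']
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)

    # 3. Follow that policy for the episode and update the transition counts.
    s = env_reset()
    for t in range(tau):
        a = policy[t, s]
        s_next, _ = env_step(s, a)
        counts[s, a, s_next] += 1
        s = s_next
    return counts
```

Note that a single model is sampled per episode rather than per step, so the agent commits to one coherent exploration strategy for the whole episode.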


Citations
Proceedings Article

Deep exploration via bootstrapped DQN

TL;DR: Bootstrapped DQN combines deep exploration with deep neural networks for exponentially faster learning than any dithering strategy, offering a promising approach to efficient exploration with generalization.
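As a rough illustration of the idea (and only an illustration; the paper uses deep Q-networks, not tables), the sketch below keeps K tabular Q-functions, samples one head per episode to act greedily, and trains each head on a Bernoulli-bootstrapped subset of transitions. The `env_reset`/`env_step` callables and hyperparameters are hypothetical.

```python
import numpy as np

def bootstrapped_q_episode(Q_heads, env_reset, env_step, tau, alpha, gamma, rng):
    """Tabular analogue of bootstrapped exploration: pick one ensemble head
    for the episode, act greedily with it, and update each head only on the
    transitions assigned to it by a Bernoulli bootstrap mask.

    Q_heads has shape (K, S, A).
    """
    K = Q_heads.shape[0]
    k = rng.integers(K)                  # head used for acting this episode
    s = env_reset()
    for _ in range(tau):
        a = int(Q_heads[k, s].argmax())
        s_next, r = env_step(s, a)
        mask = rng.random(K) < 0.5       # bootstrap: each head sees ~half the data
        for j in np.nonzero(mask)[0]:
            target = r + gamma * Q_heads[j, s_next].max()
            Q_heads[j, s, a] += alpha * (target - Q_heads[j, s, a])
        s = s_next
    return Q_heads
```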
Posted Content

RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning

TL;DR: This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data; the fast algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.
Posted Content

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

TL;DR: This paper proposes an off-policy meta-RL algorithm that disentangles task inference and control by performing online probabilistic filtering of latent task variables, allowing the agent to infer how to solve a new task from small amounts of experience.
Posted Content

Randomized Prior Functions for Deep Reinforcement Learning

TL;DR: This paper proposes adding a randomized, untrainable "prior" network to each ensemble member and shows that this approach is efficient with linear representations, provides simple illustrations of its efficacy with nonlinear representations, and scales to large-scale problems far better than previous attempts.
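A minimal sketch of the additive-prior idea using linear models (the paper works with deep networks; the class and parameter names here are illustrative): each ensemble member's prediction is its trainable model plus a scaled, frozen, randomly initialised prior, and gradient updates touch only the trainable part.

```python
import numpy as np

class LinearMemberWithPrior:
    """One ensemble member: a trainable linear model plus a fixed, randomly
    initialised 'prior' model whose output is added but never trained."""

    def __init__(self, dim, beta, rng):
        self.w = np.zeros(dim)               # trainable weights
        self.w_prior = rng.normal(size=dim)  # frozen random prior weights
        self.beta = beta                     # prior scale

    def predict(self, x):
        return x @ self.w + self.beta * (x @ self.w_prior)

    def sgd_step(self, x, y, lr=0.1):
        # Gradient step on the squared error of the combined prediction;
        # only the trainable weights are updated, the prior stays fixed.
        err = self.predict(x) - y
        self.w -= lr * err * x
```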
Proceedings Article

Thompson Sampling for Complex Online Problems

TL;DR: A frequentist regret bound for Thompson sampling is proved in a very general setting involving parameter, action, and observation spaces and a likelihood function over them; improved regret bounds are derived for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear reward feedback from subsets.
References
Proceedings Article

An Empirical Evaluation of Thompson Sampling

TL;DR: Empirical results using Thompson sampling on simulated and real data are presented, showing that it is highly competitive and should be part of the standard set of baselines to compare against.
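For the Bernoulli-bandit setting studied empirically in that paper, Thompson sampling reduces to a few lines. The sketch below assumes independent Beta(1, 1) priors and a hypothetical `pull(arm)` function that returns a 0/1 reward.

```python
import numpy as np

def thompson_bernoulli(pull, n_arms, horizon, rng):
    """Thompson sampling for a Bernoulli bandit with independent Beta(1, 1)
    priors: sample a mean for each arm, play the argmax, update the posterior."""
    alpha = np.ones(n_arms)   # 1 + observed successes per arm
    beta = np.ones(n_arms)    # 1 + observed failures per arm
    total = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        arm = int(theta.argmax())
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total
```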
Book

Stochastic systems : estimation, identification, and adaptive control

TL;DR: A textbook treatment of estimation, system identification, and adaptive control for stochastic systems.
Journal ArticleDOI

R-max - a general polynomial time algorithm for near-optimal reinforcement learning

TL;DR: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the "optimism under uncertainty" bias used in many RL algorithms.
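A tabular sketch of the optimism-under-uncertainty idea behind R-MAX, under simplifying assumptions (discounted planning rather than the paper's average-reward analysis; the threshold m, discount gamma, and iteration count are illustrative): state-action pairs visited fewer than m times are treated as yielding the maximum reward, so planning is drawn toward them.

```python
import numpy as np

def rmax_policy(N, trans, rew_sum, S, A, m, r_max, gamma, iters=200):
    """Plan greedily in the R-max 'optimistic' model.

    N[s, a] are visit counts, trans[s, a, s'] are transition counts, and
    rew_sum[s, a] are summed observed rewards. Pairs with N < m are treated
    as giving r_max forever.
    """
    known = N >= m
    R = np.where(known, rew_sum / np.maximum(N, 1), r_max)
    P = np.zeros((S, A, S))
    P[known] = trans[known] / N[known][:, None]
    # Value iteration; for unknown pairs the continuation value is the
    # optimistic fixed point r_max / (1 - gamma).
    V = np.zeros(S)
    v_opt = r_max / (1.0 - gamma)
    for _ in range(iters):
        Q = R + gamma * np.where(known, P @ V, v_opt)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)   # greedy policy in the optimistic model
```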
Journal Article

Near-optimal Regret Bounds for Reinforcement Learning

TL;DR: For undiscounted reinforcement learning in Markov decision processes (MDPs), this paper presents a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D.