Open Access · Posted Content
(More) Efficient Reinforcement Learning via Posterior Sampling
TL;DR
An Õ(τS√AT) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Abstract
Most provably efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient, and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
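The episodic loop described in the abstract (sample an MDP from the posterior, follow its optimal policy for one episode) can be sketched for the tabular case as below. This is a minimal illustration, not the paper's implementation: the function name, the Dirichlet transition posterior, and the unit-variance Gaussian reward posterior are all assumptions chosen for brevity.

```python
import numpy as np

def psrl_episode(counts, reward_sums, S, A, horizon, rng):
    """One PSRL planning step on a tabular MDP (illustrative sketch).

    counts[s, a, s']  -- Dirichlet pseudo-counts for transitions
    reward_sums[s, a] -- summed observed rewards, used with a simple
                         Gaussian posterior over mean rewards (assumed)
    Returns a nonstationary policy: policy[h, s] = action at step h.
    """
    # Sample one transition model from the Dirichlet posterior.
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])

    # Sample mean rewards from a crude Normal posterior (assumption).
    n = counts.sum(axis=2)
    R = rng.normal(reward_sums / n, 1.0 / np.sqrt(n))

    # Finite-horizon value iteration on the *sampled* MDP.
    policy = np.zeros((horizon, S), dtype=int)
    V = np.zeros(S)
    for h in reversed(range(horizon)):
        Q = R + P @ V              # Q[s, a] = r(s,a) + E[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```

After the episode, the agent would increment `counts` and `reward_sums` with the observed transitions and rewards, which is what makes the posterior concentrate over time.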
Citations
Proceedings Article
Deep exploration via bootstrapped DQN
TL;DR: Bootstrapped DQN combines deep exploration with deep neural networks, learning exponentially faster than any dithering strategy; it is a promising approach to efficient exploration with generalization.
Posted Content
RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning
TL;DR: This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data, encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.
Posted Content
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
TL;DR: Proposes an off-policy meta-RL algorithm that disentangles task inference and control by performing online probabilistic filtering over latent task variables to infer how to solve a new task from small amounts of experience.
Posted Content
Randomized Prior Functions for Deep Reinforcement Learning
TL;DR: Proposes adding a randomized, untrainable "prior" network to each ensemble member; this approach is provably efficient with linear representations, is illustrated with nonlinear representations, and scales to large-scale problems far better than previous attempts.
Proceedings Article
Thompson Sampling for Complex Online Problems
TL;DR: Proves a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action, and observation spaces and a likelihood function over them, and derives improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear reward feedback from subsets.
References
Journal ArticleDOI
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
Proceedings Article
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle, Lihong Li, et al.
TL;DR: Presents empirical results using Thompson sampling on simulated and real data, showing that it is highly competitive and should be part of the standard baselines to compare against.
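The Thompson sampling baseline evaluated in that paper can be sketched for the Beta-Bernoulli bandit case as follows. This is a generic illustration of the technique, not the paper's experimental code; the function name and the uniform Beta(1, 1) prior are assumptions.

```python
import numpy as np

def thompson_bernoulli(true_means, steps, rng):
    """Beta-Bernoulli Thompson sampling on a K-armed bandit (sketch).

    Each arm keeps a Beta posterior over its success probability;
    at every step we sample one mean per arm and play the argmax.
    Returns the average observed reward.
    """
    K = len(true_means)
    successes = np.ones(K)   # Beta(1, 1) uniform prior (assumption)
    failures = np.ones(K)
    total = 0.0
    for _ in range(steps):
        theta = rng.beta(successes, failures)  # one posterior sample per arm
        arm = int(theta.argmax())
        r = rng.random() < true_means[arm]     # Bernoulli reward
        successes[arm] += r
        failures[arm] += 1 - r
        total += r
    return total / steps
```

Because arms are chosen in proportion to the posterior probability that they are optimal, exploration decays automatically as the posteriors concentrate, which is the same principle PSRL lifts from bandits to MDPs.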
Book
Stochastic systems : estimation, identification, and adaptive control
P. R. Kumar, Pravin Varaiya, et al.
TL;DR: A textbook treatment of estimation, system identification, and adaptive control for stochastic systems.
Journal ArticleDOI
R-max - a general polynomial time algorithm for near-optimal reinforcement learning
TL;DR: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the "optimism under uncertainty" bias used in many RL algorithms.
Journal Article
Near-optimal Regret Bounds for Reinforcement Learning
TL;DR: For undiscounted reinforcement learning in Markov decision processes (MDPs), presents an algorithm with total regret Õ(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.