Journal ArticleDOI
Optimal adaptive policies for Markov decision processes
TLDR
This paper gives the explicit form for a class of adaptive policies that possess optimal increase-rate properties for the total expected finite-horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law.
Abstract
In this paper we consider the problem of adaptive control for Markov Decision Processes. We give the explicit form for a class of adaptive policies that possess optimal increase-rate properties for the total expected finite-horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of the proposed policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average-reward optimality equations.
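The index-based action choice described in the abstract can be illustrated with a minimal sketch. All names here are hypothetical: the paper's indices inflate the right-hand side of the estimated average-reward optimality equations, whereas this sketch uses a generic visit-count inflation term for illustration only.

```python
import math

def inflated_index(q_est, visits, t, c=2.0):
    """Hypothetical index: estimated value plus an inflation term that
    shrinks as the state-action pair is visited more often."""
    if visits == 0:
        return float("inf")  # force at least one trial of each action
    return q_est + c * math.sqrt(math.log(t) / visits)

def choose_action(state, q_est, counts, t):
    """Pick the action with the largest inflated index at `state`.

    q_est[state][a] is the estimated value of action a;
    counts[state][a] is how often (state, a) has been tried by time t.
    """
    return max(q_est[state],
               key=lambda a: inflated_index(q_est[state][a], counts[state][a], t))
```

Under-sampled actions receive large indices and are tried first; as counts grow, the inflation vanishes and the estimated-value term dominates, which is the mechanism behind the optimal growth-rate guarantees.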
Citations
Book
Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems
TL;DR: In this article, the authors focus on regret analysis in the context of multi-armed bandit problems, where the central tension is between staying with the option that gave the highest payoff in the past and exploring new options that might give higher payoffs in the future.
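The exploration-exploitation balance this TL;DR describes is commonly resolved by index policies such as UCB1; the following is a minimal sketch (notation is illustrative, not taken from the book):

```python
import math

def ucb1(rewards_sum, pulls, t):
    """UCB1 index for one arm: empirical mean plus an exploration bonus
    that shrinks as the arm accumulates pulls."""
    if pulls == 0:
        return float("inf")  # untried arms are pulled first
    return rewards_sum / pulls + math.sqrt(2 * math.log(t) / pulls)

def select_arm(stats, t):
    """stats: list of (rewards_sum, pulls) per arm; returns the arm index
    with the largest UCB1 value at round t."""
    return max(range(len(stats)), key=lambda i: ucb1(stats[i][0], stats[i][1], t))
```

The bonus term guarantees every arm is sampled logarithmically often, which is what yields the logarithmic regret bounds analyzed in this literature.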
Journal Article
Near-optimal Regret Bounds for Reinforcement Learning
TL;DR: For undiscounted reinforcement learning in Markov decision processes (MDPs), this paper presented a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.
Proceedings Article
Deep exploration via bootstrapped DQN
TL;DR: Bootstrapped DQN, as discussed by the authors, combines deep exploration with deep neural networks, yielding exponentially faster learning than any dithering strategy and offering a promising approach to efficient exploration with generalization.
Proceedings Article
Minimax regret bounds for reinforcement learning
TL;DR: The problem of provably optimal exploration in reinforcement learning for finite-horizon MDPs is considered, and an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$, where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions, and $T$ the number of time steps.
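An optimistic modification to value iteration of the kind this TL;DR describes adds an exploration bonus at every backup and clips values at the maximum achievable return. Below is a hedged sketch assuming tabular estimates; `P_hat`, `R_hat`, `counts`, and the bonus shape are all illustrative, not the paper's exact construction.

```python
import math

def optimistic_backup(H, P_hat, R_hat, counts, t):
    """One optimistic value-iteration pass over a finite-horizon MDP.

    P_hat[s][a]: estimated next-state distribution (dict state -> prob),
    R_hat[s][a]: estimated mean reward, counts[s][a]: visit count.
    Returns Q[h][s][a] with an exploration bonus added at each backup.
    """
    states = list(P_hat.keys())
    V = {s: 0.0 for s in states}   # value beyond the horizon is zero
    Q = []
    for h in range(H - 1, -1, -1):  # backward induction over steps
        Qh = {}
        for s in states:
            Qh[s] = {}
            for a in P_hat[s]:
                n = max(counts[s][a], 1)
                bonus = H * math.sqrt(math.log(t) / n)  # hypothetical bonus shape
                ev = sum(p * V[s2] for s2, p in P_hat[s][a].items())
                # optimism plus clipping at the largest possible return
                Qh[s][a] = min(R_hat[s][a] + ev + bonus, H - h)
        V = {s: max(Qh[s].values()) for s in states}
        Q.append(Qh)
    Q.reverse()  # Q[0] corresponds to the first step
    return Q
```

Acting greedily with respect to these optimistic Q-values drives the agent toward under-visited state-action pairs, which is the mechanism behind the stated regret bound.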
Posted Content
Deep Exploration via Bootstrapped DQN
TL;DR: Bootstrapped DQN, as mentioned in this paper, is a simple algorithm that explores in a computationally and statistically efficient manner through the use of randomized value functions, which can lead to exponentially faster learning.
References
Book
Large Deviations Techniques and Applications
Amir Dembo, Ofer Zeitouni +1 more
TL;DR: The LDP for abstract empirical measures and its applications, the finite-dimensional case, and applications of the empirical-measures LDP are presented.
Journal ArticleDOI
Asymptotically efficient adaptive allocation rules
Tze Leung Lai, Herbert Robbins +1 more
Journal ArticleDOI
Some aspects of the sequential design of experiments
TL;DR: The authors propose a theory of sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves, a major advance over fixed-sample designs.
BookDOI
Entropy, large deviations, and statistical mechanics
TL;DR: In this paper, the authors introduce the concept of large deviations for random variables with a finite state space, generalizing the notion of large deviations for random vectors.
Book
Applied Linear Algebra
TL;DR: In this article, the authors introduce geometric vectors and vector spaces, as well as linear transformations and matrix algebra, and apply them to solving equations and finding inverses.