An Upper Bound on the Loss from Approximate Optimal-Value Functions
Satinder Singh, Richard C. Yee
TLDR
An upper bound on performance loss is derived that is slightly tighter than that in Bertsekas (1987), and the extension of the bound to Q-learning is shown to provide a partial theoretical rationale for the approximation of value functions.

Abstract
Many reinforcement learning approaches can be formulated using the theory of Markov decision processes and the associated method of dynamic programming (DP). The value of this theoretical understanding, however, is tempered by many practical concerns. One important question is whether DP-based approaches that use function approximation rather than lookup tables can avoid catastrophic effects on performance. This note presents a result of Bertsekas (1987) which guarantees that small errors in the approximation of a task's optimal value function cannot produce arbitrarily bad performance when actions are selected by a greedy policy. We derive an upper bound on performance loss that is slightly tighter than that in Bertsekas (1987), and we show the extension of the bound to Q-learning (Watkins, 1989). These results provide a partial theoretical rationale for the approximation of value functions, an issue of great practical importance in reinforcement learning.
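For context, the bound discussed in the abstract is usually stated as follows (a sketch based on standard statements of the result, with notation chosen here rather than copied from the paper): let γ be the discount factor, V* the optimal value function, and V̂ an approximation with maximum-norm error ε.

```latex
% If the approximation error is bounded,
%   \|\hat{V} - V^{*}\|_{\infty} \le \varepsilon,
% then the greedy policy \pi with respect to \hat{V} satisfies
V^{*}(s) - V^{\pi}(s) \;\le\; \frac{2\gamma\varepsilon}{1-\gamma}
\qquad \text{for all states } s.
% The Q-learning analogue: \|\hat{Q} - Q^{*}\|_{\infty} \le \varepsilon
% implies a performance loss of at most 2\varepsilon/(1-\gamma)
% for the policy that is greedy with respect to \hat{Q}.
```

The key point is that the loss is linear in the approximation error ε, though the 1/(1−γ) factor can make the bound loose when γ is close to 1.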
Citations
Journal Article
Machine learning
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal Article
Learning to act using real-time dynamic programming
TL;DR: An algorithm based on dynamic programming, called Real-Time DP, is introduced, by which an embedded system can improve its performance with experience; the results also illuminate aspects of other DP-based reinforcement-learning methods, such as Watkins' Q-learning algorithm.
Book
Algorithms for Reinforcement Learning
TL;DR: This book focuses on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming, and gives a fairly comprehensive catalog of learning problems, and describes the core ideas, followed by the discussion of their theoretical properties and limitations.
Journal Article
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
TL;DR: This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.
Dissertation
On the Sample Complexity of Reinforcement Learning
TL;DR: Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.
References
Journal Article
Machine learning
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal Article
Technical Note: Q-Learning
Chris Watkins, Peter Dayan
TL;DR: This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
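The convergence conditions in this summary (every state–action pair sampled repeatedly, action-values stored discretely) can be illustrated with a minimal tabular Q-learning sweep. The two-state MDP, step size, and sweep count below are illustrative assumptions, not taken from the paper:

```python
import itertools

# Hypothetical deterministic MDP: states {0, 1}, actions {0, 1}.
# Taking action a moves to state a and yields reward 1 if a == 1, else 0.
gamma = 0.9   # discount factor
alpha = 0.5   # step size
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # discrete (tabular) Q

for _ in range(200):  # repeatedly sample every (state, action) pair
    for s, a in itertools.product((0, 1), (0, 1)):
        s_next, r = a, float(a == 1)
        target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # one-step Q-learning update

# Analytically, Q*(s, 1) = 1 / (1 - gamma) = 10 and Q*(s, 0) = gamma * 10 = 9,
# and the tabular updates converge toward these values.
```

Because the dynamics here are deterministic and every pair is swept, the iteration behaves like value iteration and converges quickly; the theorem covers the harder stochastic case with decaying step sizes.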
Journal Article
Learning to Predict by the Methods of Temporal Differences
TL;DR: This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior – and proves their convergence and optimality for special cases and relation to supervised-learning methods.