Lucian Buşoniu, Robert Babuška, Bart De Schutter, and Damien Ernst

Reinforcement Learning and Dynamic Programming Using Function Approximators
Preface
Control systems are making a tremendous impact on our society. Though invisible
to most users, they are essential for the operation of nearly all devices – from basic
home appliances to aircraft and nuclear power plants. Apart from technical systems,
the principles of control are routinely applied and exploited in a variety of disciplines
such as economics, medicine, social sciences, and artificial intelligence.
A common denominator in the diverse applications of control is the need to influence or modify the behavior of dynamic systems to attain prespecified goals. One approach to achieve this is to assign a numerical performance index to each state trajectory of the system. The control problem is then solved by searching for a control policy that drives the system along trajectories corresponding to the best value of the performance index. This approach essentially reduces the problem of finding good control policies to the search for solutions of a mathematical optimization problem.
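For concreteness, one widely used performance index (the notation here is illustrative, not drawn from this preface) is the discounted cumulative reward collected along a trajectory, which the sought policy maximizes:

```latex
J^{\pi}(x_0) = \sum_{t=0}^{\infty} \gamma^{t}\, \rho\bigl(x_t, \pi(x_t)\bigr),
\qquad x_{t+1} = f\bigl(x_t, \pi(x_t)\bigr),
\qquad \pi^{*} \in \arg\max_{\pi} J^{\pi}(x_0)
```

with discount factor $\gamma \in [0,1)$, reward function $\rho$, and system dynamics $f$.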
Early work in the field of optimal control dates back to the 1940s with the pioneering research of Pontryagin and Bellman. Dynamic programming (DP), introduced by Bellman, is still among the state-of-the-art tools commonly used to solve optimal control problems when a system model is available. The alternative idea of finding a solution in the absence of a model was explored as early as the 1960s. In the 1980s, a revival of interest in this model-free paradigm led to the development of the field of reinforcement learning (RL). The central theme in RL research is the design of algorithms that learn control policies solely from the knowledge of transition samples or trajectories, which are collected beforehand or by online interaction with the system. Most approaches developed to tackle the RL problem are closely related to DP algorithms.
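To illustrate the idea of learning a policy purely from transition samples, here is a minimal sketch of tabular Q-learning on a toy two-state problem. The MDP, function names, and parameter values are invented here for illustration; they are not taken from the book.

```python
import random

# A toy 2-state, 2-action MDP, invented purely for illustration:
# taking action 1 in state 0 moves to state 1 and yields reward 1;
# every other transition leads back to state 0 with reward 0.
def step(state, action):
    if state == 0 and action == 1:
        return 1, 1.0  # (next state, reward)
    return 0, 0.0

# Tabular Q-learning: estimates Q(s, a) from observed transitions alone,
# without ever using a model of the dynamics.
def q_learning(steps=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    state = 0
    for _ in range(steps):
        # epsilon-greedy exploration
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # temporal-difference update toward the one-step bootstrapped target
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action]
        )
        state = next_state
    return Q

Q = q_learning()
# The learned greedy policy in state 0 prefers action 1, the rewarding transition.
assert Q[0][1] > Q[0][0]
```

The update rule above is the sample-based counterpart of the DP value-iteration backup, which is one way the close relation between RL and DP mentioned here shows up in practice.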
A core obstacle in DP and RL is that solutions cannot be represented exactly for problems with large discrete state-action spaces or continuous spaces. Instead, compact representations relying on function approximators must be used. This challenge was already recognized while the first DP techniques were being developed. However, it has only been in recent years – largely in parallel with advances in RL – that approximation-based methods have grown in diversity, maturity, and efficiency, enabling RL and DP to scale up to realistic problems.
This book provides an accessible in-depth treatment of reinforcement learning
and dynamic programming methods using function approximators. We start with a
concise introduction to classical DP and RL, in order to build the foundation for
the remainder of the book. Next, we present an extensive review of state-of-the-art
approaches to DP and RL with approximation. Theoretical guarantees are provided
on the solutions obtained, and numerical examples and comparisons are used to illustrate the properties of the individual methods. The remaining three chapters are
dedicated to a detailed presentation of representative algorithms from the three major classes of techniques: value iteration, policy iteration, and policy search. The
properties and the performance of these algorithms are highlighted in simulation and
experimental studies on a range of control applications.
We believe that this balanced combination of practical algorithms, theoretical analysis, and comprehensive examples makes our book suitable not only for researchers, teachers, and graduate students in the fields of optimal and adaptive control, machine learning and artificial intelligence, but also for practitioners seeking novel strategies for solving challenging real-life control problems.
This book can be read in several ways. Readers unfamiliar with the field are advised to start with Chapter 1 for a gentle introduction, and continue with Chapter 2 (which discusses classical DP and RL) and Chapter 3 (which considers approximation-based methods). Those who are familiar with the basic concepts of RL and DP may consult the list of notations given at the end of the book, and then start directly with Chapter 3. This first part of the book is sufficient to get an overview of the field. Thereafter, readers can pick any combination of Chapters 4 to 6, depending on their interests: approximate value iteration (Chapter 4), approximate policy iteration and online learning (Chapter 5), or approximate policy search (Chapter 6).
Supplementary information relevant to this book, including a complete archive
of the computer code used in the experimental studies, is available at the Web site:
http://www.dcsc.tudelft.nl/rlbook/
Comments, suggestions, or questions concerning the book or the Web site are wel-
come. Interested readers are encouraged to get in touch with the authors using the
contact information on the Web site.
The authors have been inspired over the years by many scientists who undoubtedly left their mark on this book; in particular by Louis Wehenkel, Pierre Geurts, Guy-Bart Stan, Rémi Munos, Martin Riedmiller, and Michail Lagoudakis. Pierre Geurts also provided the computer program for building ensembles of regression trees, used in several examples in the book. This work would not have been possible without our colleagues, students, and the excellent professional environments at the Delft Center for Systems and Control of the Delft University of Technology, the Netherlands, the Montefiore Institute of the University of Liège, Belgium, and at Supélec Rennes, France. Among our colleagues in Delft, Justin Rice deserves special mention for carefully proofreading the manuscript. To all these people we extend our sincere thanks.
We thank Sam Ge for giving us the opportunity to publish our book with Taylor
& Francis CRC Press, and the editorial and production team at Taylor & Francis for
their valuable help. We gratefully acknowledge the financial support of the BSIK-ICIS project “Interactive Collaborative Information Systems” (grant no. BSIK03024)
and the Dutch funding organizations NWO and STW. Damien Ernst is a Research
Associate of the FRS-FNRS, the financial support of which he acknowledges. We
appreciate the kind permission offered by the IEEE to reproduce material from our
previous works over which they hold copyright.
Finally, we thank our families for their continual understanding, patience, and
support.
Lucian Buşoniu
Robert Babuška
Bart De Schutter
Damien Ernst
November 2009