Reinforcement Learning: An Introduction

Home
/
Papers
/
Reinforcement Learning: An Introduction

Book•

Reinforcement Learning: An Introduction

Richard S. Sutton¹, Andrew G. Barto•Institutions (1)

01 Jan 1988-

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

read less

Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Book•

Deep Learning

[...]

Ian Goodfellow¹, Yoshua Bengio², Aaron Courville²•Institutions (2)

Google¹, Université de Montréal²

18 Nov 2016

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.

...read moreread less

Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

...read moreread less

38,208 citations

Journal Article•DOI•

Human-level control through deep reinforcement learning

[...]

Volodymyr Mnih¹, Koray Kavukcuoglu¹, David Silver¹, Andrei Rusu¹, Joel Veness¹, Marc G. Bellemare¹, Alex Graves¹, Martin Riedmiller¹, Andreas K. Fidjeland¹, Georg Ostrovski¹, Stig Petersen¹, Charles Beattie¹, Amir Sadik¹, Ioannis Antonoglou¹, Helen King¹, Dharshan Kumaran¹, Daan Wierstra¹, Shane Legg¹, Demis Hassabis¹ - Show less +15 more•Institutions (1)

Google¹

26 Feb 2015-Nature

TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

...read moreread less

Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

...read moreread less

23,074 citations

Journal Article•DOI•

Deep learning in neural networks

[...]

Jürgen Schmidhuber¹•Institutions (1)

University of Lugano¹

01 Jan 2015-Neural Networks

TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, review deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

...read moreread less

14,635 citations

Cites background from "Reinforcement Learning: An Introduc..."

...Such NNs learn to perceive/encode/predict/ classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g., Kaelbling et al., 1996; Sutton & Barto, 1998; Wiering & van Otterlo, 2012)....
[...]
...The latter is often explained in a probabilistic framework (e.g., Sutton & Barto, 1998), but its basic idea can already be conveyed in a deterministic setting....
[...]
...Such NNs learn to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (e.g., Kaelbling et al., 1996; Sutton and Barto, 1998)....
[...]
...Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Baird, 1994; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamarı́a et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010)....
[...]
...This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec....
[...]

Journal Article•DOI•

Mastering the game of Go with deep neural networks and tree search

[...]

David Silver¹, Aja Huang¹, Chris J. Maddison¹, Arthur Guez¹, Laurent Sifre¹, George van den Driessche¹, Julian Schrittwieser¹, Ioannis Antonoglou¹, Veda Panneershelvam¹, Marc Lanctot¹, Sander Dieleman¹, Dominik Grewe¹, John Nham¹, Nal Kalchbrenner¹, Ilya Sutskever¹, Timothy P. Lillicrap¹, Madeleine Leach¹, Koray Kavukcuoglu¹, Thore Graepel¹, Demis Hassabis¹ - Show less +16 more•Institutions (1)

Google¹

28 Jan 2016-Nature

TL;DR: Using this search algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0.5, the first time that a computer program has defeated a human professional player in the full-sized game of Go.

...read moreread less

Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of stateof-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

...read moreread less

14,377 citations

Pattern Recognition and Machine Learning

[...]

Christopher M. Bishop¹•Institutions (1)

Microsoft¹

01 Jan 2006

TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models and combining models in the context of machine learning and classification.

...read moreread less

Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

...read moreread less

10,141 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Higher Order Conditioning with Constant Motivation

[...]

Glen Finch, Elmer Culler

01 Oct 1934-American Journal of Psychology

26 citations

Journal Article•DOI•

Deconstructing the law of effect

[...]

Charles R. Gallistel¹•Institutions (1)

Rutgers University¹

01 Aug 2005-Games and Economic Behavior

TL;DR: Measurement of the time course of adjustments to step changes in the reward-scheduling environment show that adjustments can be as abrupt as the changes that drive them, and can occur with the minimum possible latency.

...read moreread less

26 citations

Book Chapter•DOI•

An RLS-based natural actor-critic algorithm for locomotion of a two-linked robot arm

[...]

Jooyoung Park¹, Jong-Ho Kim¹, Daesung Kang¹•Institutions (1)

Korea University¹

15 Dec 2005

TL;DR: An actor-critic type algorithm utilizing the RLS(recursive least-squares) method, which is one of the most efficient techniques for adaptive signal processing, together with natural policy gradient, showed better performance than the conventional stochastic gradient ascent algorithm.

...read moreread less

Abstract: Recently, actor-critic methods have drawn much interests in the area of reinforcement learning, and several algorithms have been studied along the line of the actor-critic strategy. This paper studies an actor-critic type algorithm utilizing the RLS(recursive least-squares) method, which is one of the most efficient techniques for adaptive signal processing, together with natural policy gradient. In the actor part of the studied algorithm, we follow the strategy of performing parameter update via the natural gradient method, while in its update for the critic part, the recursive least-squares method is employed in order to make the parameter estimation for the value functions more efficient. The studied algorithm was applied to locomotion of a two-linked robot arm, and showed better performance compared to the conventional stochastic gradient ascent algorithm.

...read moreread less

26 citations

Journal Article•DOI•

An N-player sequential stochastic game with identical payoffs

[...]

Kumpati S. Narendra¹, R. M. Wheeler¹•Institutions (1)

Yale University¹

01 Nov 1983

TL;DR: It is shown that the expected change in each player's payoff is nonnegative at every instant, so that the group improves its performance monotonically, which appears to have important implications in decentralized decision-making in large complex systems.

...read moreread less

Abstract: A sequential stochastic game among an arbitrary number of players in which all players' payoffs are identical is analyzed. The players are unaware that they are in a game and hence they have no knowledge of other players' strategies or the payoff structure. At each instant the players use a simple learning algorithm to update their mixed strategy choices based entirely on the response of a random environment. It is shown that the expected change in each player's payoff is nonnegative at every instant, so that the group improves its performance monotonically. This result appears to have important implications in decentralized decision-making in large complex systems.

...read moreread less

26 citations

"Reinforcement Learning: An Introduc..." refers methods in this paper

...4 Graybiel (2000) is a brief primer on the basal ganglia. The experiments mentioned that involve optogenetic activation of dopamine neurons were conducted by Tsai, Zhang, Adamantidis, Stuber, Bonci, de Lecea, and Deisseroth (2009), Steinberg, Keiflin, Boivin, Witten, Deisseroth, and Janak (2013), and Claridge-Chang, Roorda, Vrontou, Sjulson, Li, Hirsh, and Miesenböck (2009)....
[...]
...7 Gradient bandit algorithms are a special case of the gradient-based reinforcement learning algorithms introduced by Williams (1992), and that later developed into the actor–critic and policy-gradient algorithms that we treat later in this book....
[...]
...3 Gradient-descent methods for minimizing mean-squared error in supervised learning are well known. Widrow and Hoff (1960) introduced the least-meansquare (LMS) algorithm, which is the prototypical incremental gradientdescent algorithm....
[...]
...3 The optimality of the TD algorithm under batch training was established by Sutton (1988). Illuminating this result is Barnard’s (1993) derivation of the TD algorithm as a combination of one step of an incremental method for learning a model of the Markov chain and one step of a method for computing predictions from the model....
[...]

H-Learning: A Reinforcement Learning Method for Optimizing Undiscounted Average Reward

[...]

Prasad Tadepalli, DoKyeong Ok

01 Jan 1994

TL;DR: This paper introduces a model-bases reinforcement learning method called H-learning, which optimizes undiscounted average reward, and compares it with three other reinforcement learning methods in the domain of scheduling Automatic Guided Vehicles, and transportation robots used in modern manufacturing plants and facilities.

...read moreread less

Abstract: In this paper, we introduce a model-bases reinforcement learning method called H-learning, which optimizes undiscounted average reward. We compare it with three other reinforcement learning methods in the domain of scheduling Automatic Guided Vehicles, and transportation robots used in modern manufacturing plants and facilities. The four methods differ along two dimensions. They are either model-based or model-free, and optimize discounted total reward or undiscounted average reward. Our experimental results indicate that H-learning is more robust with respect to changes in the domain parameters, and in many cases, converges in fewer steps to better average reward per time step than all the other methods. An added advantage is that unlike the other methods it does not have any parameters to tune.

...read moreread less

25 citations