Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.
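
For readers new to the area, the temporal-difference idea at the heart of Part II fits in a few lines. Below is a minimal, illustrative sketch of tabular TD(0) value estimation; the Gym-style `env.reset()`/`env.step()` interface and the `policy` callable are assumptions for the sketch, not anything specified by the book.

```python
def td0_value_estimation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = {}  # state -> estimated value; unseen states default to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
            V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
            state = next_state
    return V
```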


Citations
Book
18 Nov 2016
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
26 Feb 2015-Nature
TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
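
The agent described above rests on the standard one-step Q-learning target, with a neural network supplying the action values. The sketch below shows only how such targets might be formed; `target_net` and the batch layout are hypothetical stand-ins, and the actual agent involves many further details (replay memory, preprocessing, optimization) reported in the paper.

```python
import numpy as np

def dqn_targets(batch, target_net, gamma=0.99):
    """One-step Q-learning targets r + gamma * max_a' Q(s', a'; theta^-),
    bootstrapped from a separate, periodically updated target network --
    one of the stabilizing devices the paper reports."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            targets.append(reward)
        else:
            targets.append(reward + gamma * float(np.max(target_net(next_state))))
    return np.array(targets)
```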

23,074 citations

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning, and evolutionary computation, as well as indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "Reinforcement Learning: An Introduction"

  • ...Such NNs learn to perceive/encode/predict/ classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g., Kaelbling et al., 1996; Sutton & Barto, 1998; Wiering & van Otterlo, 2012)....


  • ...The latter is often explained in a probabilistic framework (e.g., Sutton & Barto, 1998), but its basic idea can already be conveyed in a deterministic setting....


  • ...Such NNs learn to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (e.g., Kaelbling et al., 1996; Sutton and Barto, 1998)....


  • ...Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Baird, 1994; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamaría et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010)....


  • ...This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec....


Journal ArticleDOI
28 Jan 2016-Nature
TL;DR: Using this search algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the human European Go champion by 5 games to 0, the first time that a computer program has defeated a human professional player in the full-sized game of Go.
Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
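
Inside the tree, the search described above selects moves by trading a child's value estimate off against the policy network's prior for it. The sketch below illustrates a PUCT-style selection rule of that general shape; the `node` structure, field names, and constant are hypothetical, and the paper's exact formula and constants differ in detail.

```python
import math

def select_move(node, c_puct=1.0):
    """Choose argmax_a [ Q(s, a) + u(s, a) ], where the exploration bonus
    u(s, a) grows with the policy prior P(s, a) and shrinks as the child
    accumulates visits."""
    total_visits = sum(c.visit_count for c in node.children.values())

    def score(action):
        child = node.children[action]
        q = child.total_value / child.visit_count if child.visit_count else 0.0
        u = c_puct * node.priors[action] * math.sqrt(total_visits) / (1 + child.visit_count)
        return q + u

    return max(node.children, key=score)
```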

14,377 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are covered, along with neural networks, kernel methods, graphical models, approximate inference, sampling methods, and techniques for combining models.
Abstract: Contents: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

References
Proceedings ArticleDOI
13 Dec 1989
TL;DR: A review of the field of neuroengineering as a whole is presented, highlighting the importance of neurocontrol and neuroidentification and describing the five major architectures in use today in neurocontrol (in robotics in particular).
Abstract: A review is presented of the field of neuroengineering as a whole, highlighting the importance of neurocontrol and neuroidentification. Then a description is given of the five major architectures in use today in neurocontrol (in robotics, in particular) and a few areas for future research. The author concludes with comments on neuroidentification.

325 citations

Journal ArticleDOI
TL;DR: It is concluded that real-time learning mechanisms that do not require evaluative feedback from the environment are fundamental to natural intelligence and may have implications for artificial intelligence.
Abstract: A neuronal model of classical conditioning is proposed. The model is most easily described by contrasting it with a still influential neuronal model first analyzed by Hebb (1949). It is proposed that the Hebbian model be modified in three ways to yield a model more in accordance with animal learning phenomena. First, instead of correlating pre- and postsynaptic levels of activity, changes in pre- and postsynaptic levels of activity should be correlated to determine the changes in synaptic efficacy that represent learning. Second, instead of correlating approximately simultaneous pre- and postsynaptic signals, earlier changes in presynaptic signals should be correlated with later changes in postsynaptic signals. Third, a change in the efficacy of a synapse should be proportional to the current efficacy of the synapse, accounting for the initial positive acceleration in the S-shaped acquisition curves observed in animal learning. The resulting model, termed a drive-reinforcement model of single neuron function, suggests that nervous system activity can be understood in terms of two classes of neuronal signals: drives that are defined to be signal levels and reinforcers that are defined to be changes in signal levels. Defining drives and reinforcers in this way, in conjunction with the neuronal model, suggests a basis for a neurobiological theory of learning. The proposed neuronal model is an extension of the Sutton-Barto (1981) model, which in turn can be seen as a temporally refined extension of the Rescorla-Wagner (1972) model. It is shown that the proposed neuronal model predicts a wide range of classical conditioning phenomena, including delay and trace conditioning, conditioned and unconditioned stimulus duration and amplitude effects, partial reinforcement effects, interstimulus interval effects, second-order conditioning, conditioned inhibition, extinction, reacquisition effects, backward conditioning, blocking, overshadowing, compound conditioning, and discriminative stimulus effects. The neuronal model also eliminates some inconsistencies with the experimental evidence that occur with the Rescorla-Wagner and Sutton-Barto models. Implications of the neuronal model for animal learning theory, connectionist and neural network modeling, artificial intelligence, adaptive control theory, and adaptive signal processing are discussed. It is concluded that real-time learning mechanisms that do not require evaluative feedback from the environment are fundamental to natural intelligence and may have implications for artificial intelligence. Experimental tests of the model are suggested.
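
The three modifications described above translate fairly directly into an update rule: a change in postsynaptic activity is correlated with earlier changes in presynaptic activity, scaled by the current magnitude of the synaptic efficacy. The following is a rough, hypothetical sketch of such a rule, not the paper's exact model; the time constants and the treatment of signal changes are simplified.

```python
def drive_reinforcement_step(w, x_history, dy, coeffs):
    """One sketched update: dw_i = dy * sum_j c_j * |w_i| * dx_i(t - j).

    w         -- current synaptic efficacies (list of floats)
    x_history -- past presynaptic signal vectors, most recent last
    dy        -- change in postsynaptic activity at the current time step
    coeffs    -- c_1..c_tau, one learning-rate coefficient per lookback step
    """
    new_w = list(w)
    for i, w_i in enumerate(w):
        delta = 0.0
        for j, c in enumerate(coeffs, start=1):
            if len(x_history) > j + 1:
                # Earlier change in the presynaptic signal, j steps back;
                # only increases are treated as eligible in this sketch.
                dx = x_history[-1 - j][i] - x_history[-2 - j][i]
                # Efficacy change proportional to current |w_i|
                # (the model's third modification).
                delta += c * abs(w_i) * max(dx, 0.0)
        new_w[i] = w_i + dy * delta
    return new_w
```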

324 citations

Book ChapterDOI
TL;DR: This chapter explores the numerical methods for solving dynamic programming (DP) problems and focuses on continuous Markov decision processes (MDPs) because these problems arise frequently in economic applications.
Abstract: This chapter explores the numerical methods for solving dynamic programming (DP) problems. The DP framework has been extensively used in economics because it is sufficiently rich to model almost any problem involving sequential decision making over time and under uncertainty. The chapter focuses on continuous Markov decision processes (MDPs) because these problems arise frequently in economic applications. Although complexity theory suggests a number of useful algorithms, the theory has relatively little to say about important practical issues, such as determining the point at which various exponential-time algorithms such as Chebyshev approximation methods start to blow up, making it optimal to switch to polynomial-time algorithms. In future work, it will be essential to provide numerical comparisons of a broader range of methods over a broader range of test problems, including problems of moderate to high dimensionality.
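
As background, the discrete workhorse underneath most of the methods the chapter surveys is value iteration: discretize the continuous MDP, then iterate the Bellman optimality operator to a fixed point. A minimal sketch under that assumption (the array layout and names are ours, not the chapter's):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Solve a discretized MDP by iterating V <- max_a (R[a] + gamma * P[a] V).

    P -- array of shape (n_actions, n_states, n_states), P[a, s, s']
    R -- array of shape (n_actions, n_states), expected reward for a in s
    """
    _, n_states = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)   # Q[a, s] via batched matrix-vector products
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # optimal values, greedy policy
        V = V_new
```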

319 citations


"Reinforcement Learning: An Introduc..." refers methods in this paper

  • ...At about the same time as Samuel's work, Bellman and Dreyfus (1959) proposed using function approximation methods with DP....


  • ...) There is now a fairly extensive literature on function approximation methods and DP, such as multigrid methods and methods using splines and orthogonal polynomials (e.g., Bellman and Dreyfus, 1959; Bellman, Kalaba, and Kotkin, 1973; Daniel, 1976; Whitt, 1978; Reetz, 1977; Schweitzer and Seidmann, 1985; Chow and Tsitsiklis, 1991; Kushner and Dupuis, 1992; Rust, 1996)....


  • ...Dynamic programming has been extensively developed in the last four decades, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983). Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 1995; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). Bryson (1996) provides a detailed authoritative history of optimal control....


Journal ArticleDOI
01 May 1985
TL;DR: A class of learning tasks is described that combines aspects of learning automata tasks and supervised learning pattern-classification tasks, called associative reinforcement learning tasks, and an algorithm is presented, called the associative reward-penalty, or AR-P, algorithm, for which a form of optimal performance is proved.
Abstract: A class of learning tasks is described that combines aspects of learning automata tasks and supervised learning pattern-classification tasks. These tasks are called associative reinforcement learning tasks. An algorithm is presented, called the associative reward-penalty, or AR-P, algorithm, for which a form of optimal performance is proved. This algorithm simultaneously generalizes a class of stochastic learning automata and a class of supervised learning pattern-classification methods related to the Robbins-Monro stochastic approximation procedure. The relevance of this hybrid algorithm is discussed with respect to the collective behaviour of learning automata and the behaviour of networks of pattern-classifying adaptive elements. Simulation results are presented that illustrate the associative reinforcement learning task and the performance of the AR-P algorithm as compared with that of several existing algorithms.
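
The AR-P update itself is compact: after reward, move the weights toward reproducing the emitted action; after penalty, move them away with a smaller step. A minimal sketch for a single stochastic unit, under an assumed logistic noise model (the paper permits other noise distributions):

```python
import numpy as np

def ar_p_step(w, x, y, rewarded, rho=0.5, lam=0.01):
    """One associative reward-penalty (AR-P) update for a stochastic unit
    with output y in {-1, +1}.  We assume Pr(y = +1 | x) = sigmoid(w . x);
    the logistic form is our assumption, not the paper's requirement.

    rewarded -- True if the environment signalled reward, False for penalty
    lam      -- penalty-step factor, 0 < lam <= 1
    """
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    expected_y = 2.0 * p - 1.0  # E[y | w, x] for y in {-1, +1}
    if rewarded:
        return w + rho * (y - expected_y) * x
    return w + rho * lam * (-y - expected_y) * x
```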

319 citations


"Reinforcement Learning: An Introduc..." refers background in this paper

  • ...The term associative reinforcement learning has also been used for associative search (Barto and Anandan, 1985), but we prefer to reserve that term as a synonym for the full reinforcement learning problem (as in Sutton, 1984)....


  • ...that we and colleagues accomplished was directed toward showing that reinforcement learning and supervised learning were indeed different (Barto, Sutton, and Brouwer, 1981; Barto and Sutton, 1981b; Barto and Anandan, 1985)....


Journal ArticleDOI
TL;DR: The striking similarities in teaching signals and learning behavior between the computational and biological results suggest that dopamine-like reward responses may serve as effective teaching signals for learning behavioral tasks that are typical for primate cognitive behavior, such as spatial delayed responding.

314 citations


"Reinforcement Learning: An Introduc..." refers background in this paper

  • ...1–2 Most of the specific material from these sections is from Sutton (1988), including the TD(0) algorithm, the random walk example, and the term “temporal-difference learning”....
