Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.
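
As a rough illustration of the temporal-difference methods the book covers in Part II, here is a minimal sketch of tabular TD(0) prediction in Python. The gym-style environment interface (reset/step), the policy callable, and the step-size and discount values are illustrative assumptions, not taken from the book.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) sketch: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    Assumes a gym-style env exposing reset() -> state and
    step(action) -> (next_state, reward, done), plus a policy callable.
    """
    V = defaultdict(float)  # state-value estimates, default 0.0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD error: bootstrapped one-step target minus current estimate
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```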


Citations
Book
18 Nov 2016
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
26 Feb 2015-Nature
TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

23,074 citations
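
For context on the deep Q-network described above, the following is a minimal sketch of a one-step Q-learning loss on a replay batch, written with PyTorch. The network objects, the batch layout, and the hyperparameter values are illustrative assumptions, and the sketch omits the paper's full training loop (experience replay management, periodic target-network updates, and frame preprocessing).

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One-step Q-learning loss on a replay batch (sketch).

    `batch` is assumed to be a dict of tensors: states [B, ...], actions [B] (long),
    rewards [B], next_states [B, ...], dones [B] (0./1. floats).
    """
    states, actions = batch["states"], batch["actions"]
    rewards, next_states, dones = batch["rewards"], batch["next_states"], batch["dones"]

    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Huber (smooth L1) loss stands in here for the paper's error clipping
    return nn.functional.smooth_l1_loss(q_sa, targets)
```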

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "Reinforcement Learning: An Introduction"

  • ...Such NNs learn to perceive/encode/predict/classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g., Kaelbling et al., 1996; Sutton & Barto, 1998; Wiering & van Otterlo, 2012)....

    [...]

  • ...The latter is often explained in a probabilistic framework (e.g., Sutton & Barto, 1998), but its basic idea can already be conveyed in a deterministic setting....

    [...]

  • ...Such NNs learn to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (e.g., Kaelbling et al., 1996; Sutton and Barto, 1998)....

    [...]

  • ...Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Baird, 1994; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamaría et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010)....

    [...]

  • ...This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec....

    [...]

Journal ArticleDOI
28 Jan 2016-Nature
TL;DR: Using this search algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0, the first time that a computer program has defeated a human professional player in the full-sized game of Go.
Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

14,377 citations
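
As a rough sketch of how a policy prior and a value estimate can be combined inside tree search, the following shows a PUCT-style child-selection rule in Python. The node fields (prior, visit_count, value_sum) and the exploration constant are illustrative assumptions, not the paper's exact formulation.

```python
import math

def select_child(node, c_puct=1.0):
    """PUCT-style selection (sketch): pick the child maximizing Q + U.

    Each child is assumed to carry a `prior` (policy-network probability of the
    move), a `visit_count`, and a `value_sum` of accumulated evaluations.
    """
    total_visits = sum(child.visit_count for child in node.children)
    best_score, best_child = float("-inf"), None
    for child in node.children:
        # Mean evaluation of this move so far (0 if never visited)
        q = child.value_sum / child.visit_count if child.visit_count else 0.0
        # Exploration bonus: large for high-prior, rarely visited moves
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_score, best_child = q + u, child
    return best_child
```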

Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are given in this article, along with a discussion of combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

References
Book ChapterDOI
TL;DR: This chapter presents the formulation of learning models and their application to the synthesis of control systems and pattern-recognition systems in such a manner that these systems may be said to possess attributes of learning.
Abstract: Publisher Summary This chapter presents the formulation of the learning models and its application to the synthesis of control systems and pattern-recognition systems in such a manner that these systems may be said to possess attributes of learning. Learning may be defined as the process by which an activity originates or is changed through reaction to an encountered situation which is not due to “native” response tendencies, maturation, or temporary states of the organism. The chapter also discusses the basic properties of reinforcement-learning pattern recognition systems. There are two types of control systems—open-loop and closed-loop. Just as for learning control systems, correspondences exist between elements of stochastic learning theory and pattern recognition systems. Mathematical learning models combine the mathematical properties and psychological concepts of learning. They are formulated from experimental results and attempt to predict learning behavior quantitatively. One of the most important principles in all learning theory is the law of reinforcement.

119 citations

Proceedings Article
08 Jul 1997
TL;DR: A new, faster algorithm is presented, based on a multiresolution search of a quickly constructible augmented kd-tree, to make exact LWPR predictions, and an approximation is introduced that achieves up to a two-orders-of-magnitude speedup with negligible accuracy losses.
Abstract: Locally weighted polynomial regression (LWPR) is a popular instance-based algorithm for learning continuous non-linear mappings. For more than two or three inputs and for more than a few thousand datapoints the computational expense of predictions is daunting. We discuss drawbacks with previous approaches to dealing with this problem, and present a new algorithm based on a multiresolution search of a quickly constructible augmented kd-tree. Without needing to rebuild the tree, we can make fast predictions with arbitrary local weighting functions, arbitrary kernel widths and arbitrary queries. The paper begins with a new, faster, algorithm for exact LWPR predictions. Next we introduce an approximation that achieves up to a two-orders-of-magnitude speedup with negligible accuracy losses. Increasing a certain approximation parameter achieves greater speedups still, but with a correspondingly larger accuracy degradation. This is nevertheless useful during operations such as the early stages of model selection and locating optima of a fitted surface. We also show how the approximations can permit real-time query-specific optimization of the kernel width. We conclude with a brief discussion of potential extensions for tractable instance-based learning on datasets that are too large to fit in a computer's main memory. 1 Locally Weighted Polynomial Regression Locally weighted polynomial regression (LWPR) is a form of instance-based (a.k.a. memory-based) algorithm for learning continuous non-linear mappings from real-valued input vectors to real-valued output vectors. It is particularly appropriate for learning complex highly non-linear functions of up to about 30 inputs from noisy data. Popularized in the statistics literature in the past decades (Cleveland and Delvin, 1988; Grosse, 1989; Atkeson et al., 1997a) it is enjoying increasing use in applications such as learning robot dynamics (Moore, 1992; Schaal and Atkeson, 1994) and learning process models. Both classical and Bayesian linear regression analysis tools can be extended to work in the locally weighted framework (Hastie and Tibshirani, 1990), providing confidence intervals on predictions, on gradient estimates and on noise estimates, all important when a learned mapping is to be used by a controller (Atkeson et al., 1997b; Schneider, 1997). Let us review LWPR. We begin with linear regression on one input and one output. Global linear regression (left of Figure 1) finds the line that minimizes the sum squared residuals. If this is represented as

117 citations
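
To make the abstract's starting point concrete, here is a minimal sketch of a naive locally weighted linear regression prediction, the expensive per-query computation the paper accelerates with kd-trees. The Gaussian kernel and fixed bandwidth are illustrative choices, not the paper's specific weighting scheme.

```python
import numpy as np

def lwr_predict(X, y, query, bandwidth=1.0):
    """Naive locally weighted linear regression at one query point (sketch).

    X: (N, D) inputs, y: (N,) targets, query: (D,). Each training point is
    weighted by a Gaussian kernel on its distance to the query, then a
    weighted least-squares linear model is fit and evaluated at the query.
    """
    weights = np.exp(-0.5 * (np.linalg.norm(X - query, axis=1) / bandwidth) ** 2)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    W = np.diag(weights)
    # Weighted normal equations: (Xb^T W Xb) beta = Xb^T W y
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return np.append(query, 1.0) @ beta
```

Repeating this fit for every query is what makes naive predictions costly; the paper's tree-based search avoids touching every training point per query.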

Journal ArticleDOI
TL;DR: This paper presents a learning algorithm for neural networks, called Alopex, which uses local correlations between changes in individual weights and changes in the global error measure, and shows that learning times are comparable to those for standard gradient descent methods.
Abstract: We present a learning algorithm for neural networks, called Alopex. Instead of error gradient, Alopex uses local correlations between changes in individual weights and changes in the global error measure. The algorithm does not make any assumptions about transfer functions of individual neurons, and does not explicitly depend on the functional form of the error measure. Hence, it can be used in networks with arbitrary transfer functions and for minimizing a large class of error measures. The learning algorithm is the same for feedforward and recurrent networks. All the weights in a network are updated simultaneously, using only local computations. This allows complete parallelization of the algorithm. The algorithm is stochastic and it uses a “temperature” parameter in a manner similar to that in simulated annealing. A heuristic “annealing schedule” is presented that is effective in finding global minima of error surfaces. In this paper, we report extensive simulation studies illustrating these advantages and show that learning times are comparable to those for standard gradient descent methods. Feedforward networks trained with Alopex are used to solve the MONK's problems and symmetry problems. Recurrent networks trained with the same algorithm are used for solving temporal XOR problems. Scaling properties of the algorithm are demonstrated using encoder problems of different sizes and advantages of appropriate error measures are illustrated using a variety of problems.

117 citations
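
The following is a simplified sketch of the correlation-based update the abstract describes: each weight takes a small ±delta step whose direction is biased by the correlation between that weight's previous step and the previous change in the error, with a temperature controlling the randomness. The exact probability form and annealing schedule used in the paper may differ from this sketch.

```python
import numpy as np

def alopex_step(weights, prev_steps, prev_delta_error, delta=0.01, temperature=1.0, rng=None):
    """One Alopex-style weight update (simplified sketch).

    Every weight moves by +delta or -delta. A weight is more likely to move
    upward when, on the previous iteration, its step was anti-correlated with
    the change in error (pushing it up seemed to lower the error); the
    temperature controls how deterministic this bias is.
    """
    rng = rng or np.random.default_rng()
    # Correlation between each weight's previous step and the previous error change
    correlation = prev_steps * prev_delta_error
    # Probability of a positive step: low if increasing the weight tended to raise the error
    p_up = 1.0 / (1.0 + np.exp(correlation / temperature))
    steps = np.where(rng.random(weights.shape) < p_up, delta, -delta)
    return weights + steps, steps
```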

Book
31 Dec 1986
TL;DR: A novel algorithm is examined that combines aspects of reinforcement learning and a data-directed search for useful weights, and is shown to outperform reinforcement-learning algorithms.
Abstract: The difficulties of learning in multilayered networks of computational units have limited the use of connectionist systems in complex domains. This dissertation elucidates the issues of learning in a network's hidden units, and reviews methods for addressing these issues that have been developed through the years. Issues of learning in hidden units are shown to be analogous to learning issues for multilayer systems employing symbolic representations. Comparisons of a number of algorithms for learning in hidden units are made by applying them in a consistent manner to several tasks. Recently developed algorithms, including Rumelhart et al.'s error back-propagation algorithm and Barto et al.'s reinforcement-learning algorithms, learn the solutions to the tasks much more successfully than methods of the past. A novel algorithm is examined that combines aspects of reinforcement learning and a data-directed search for useful weights, and is shown to outperform reinforcement-learning algorithms. A connectionist framework for the learning of strategies is described which combines the error back-propagation algorithm for learning in hidden units with Sutton's AHC algorithm to learn evaluation functions and with a reinforcement-learning algorithm to learn search heuristics. The generality of this hybrid system is demonstrated through successful applications.

115 citations


"Reinforcement Learning: An Introduc..." refers background or methods in this paper

  • ...The pole-balancing example is from Michie and Chambers (1968) and Barto, Sutton, and Anderson (1983)....

    [...]

  • ..., Kohonen, 1977; Anderson, Silverstein, Ritz, and Jones, 1977) to reinforcement learning. Barto, Anderson, and Sutton (1982) used a two-layer ANN to learn a nonlinear control policy, and emphasized the first layer’s role of learning a suitable representation. Hampson (1983, 1989) was an early proponent of multilayer ANNs for learning value functions. Barto, Sutton, and Anderson (1983) presented an actor-critic algorithm in the form of an ANN learning to balance a simulated pole (see Sections 15.7 and 15.8). Barto and Anandan (1985) introduced a stochastic version of Widrow, Gupta, and Maitra’s (1973) selective bootstrap algorithm called the associative reward-penalty (AR-P) algorithm. Barto (1985, 1986) and Barto and Jordan (1987) described multi-layer ANNs consisting of AR-P units trained with a globally-broadcast reinforcement signal to learn classification rules that are not linearly separable....

    [...]

  • ...At this time we developed a method for using temporal-difference learning in trial-and-error learning, known as the actor–critic architecture, and applied this method to Michie and Chambers’s pole-balancing problem (Barto, Sutton, and Anderson, 1983). This method was extensively studied in Sutton’s (1984) Ph.D. dissertation and extended to use backpropagation neural networks in Anderson’s (1986) Ph....

    [...]

  • ...At this time we developed a method for using temporal-difference learning in trial-and-error learning, known as the actor–critic architecture, and applied this method to Michie and Chambers’s pole-balancing problem (Barto, Sutton, and Anderson, 1983). This method was extensively studied in Sutton’s (1984) Ph.D. dissertation and extended to use backpropagation neural networks in Anderson’s (1986) Ph.D. dissertation. Around this time, Holland (1986) incorporated temporal-difference ideas explicitly into his classifier systems....

    [...]

  • ..., Kohonen, 1977; Anderson, Silverstein, Ritz, and Jones, 1977) to reinforcement learning. Barto, Anderson, and Sutton (1982) used a two-layer ANN to learn a nonlinear control policy, and emphasized the first layer’s role of learning a suitable representation. Hampson (1983, 1989) was an early proponent of multilayer ANNs for learning value functions. Barto, Sutton, and Anderson (1983) presented an actor-critic algorithm in the form of an ANN learning to balance a simulated pole (see Sections 15....

    [...]

Proceedings Article
27 Nov 1995
TL;DR: Experimental tests show that this TDNN-TD(λ) network can match the performance of the previous hand-engineered system, and both neural network approaches significantly outperform the best previous (non-learning) solution to this problem.
Abstract: Job-shop scheduling is an important task for manufacturing industries. We are interested in the particular task of scheduling payload processing for NASA's space shuttle program. This paper summarizes our previous work on formulating this task for solution by the reinforcement learning algorithm TD(λ). A shortcoming of this previous work was its reliance on hand-engineered input features. This paper shows how to extend the time-delay neural network (TDNN) architecture to apply it to irregular-length schedules. Experimental tests show that this TDNN-TD(λ) network can match the performance of our previous hand-engineered system. The tests also show that both neural network approaches significantly outperform the best previous (non-learning) solution to this problem in terms of the quality of the resulting schedules and the number of search steps required to construct them.

114 citations
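
As background for the TD(λ) algorithm mentioned above, here is a minimal sketch of tabular TD(λ) prediction with accumulating eligibility traces. The paper itself trains a neural-network (TDNN) value function rather than a table, and the environment/policy interface and parameter values here are illustrative assumptions.

```python
from collections import defaultdict

def td_lambda_prediction(env, policy, episodes=200, alpha=0.05, gamma=1.0, lam=0.7):
    """Tabular TD(lambda) with accumulating eligibility traces (sketch).

    Each visited state keeps a trace that decays by gamma * lambda per step;
    every TD error updates all traced states in proportion to their traces.
    """
    V = defaultdict(float)
    for _ in range(episodes):
        traces = defaultdict(float)
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            traces[state] += 1.0  # accumulate the trace for the current state
            for s in list(traces):
                V[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam  # decay every trace
            state = next_state
    return V
```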