Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.
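For readers who want a concrete picture of the tabular solution methods mentioned above (dynamic programming, Monte Carlo, temporal-difference learning), below is a minimal sketch of one-step temporal-difference control (Q-learning) on a generic finite MDP. The Gym-style environment interface (`env.reset()`, `env.step()`) and the hyperparameter values are illustrative assumptions, not taken from the book.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular one-step Q-learning (a temporal-difference method).

    Assumes a Gym-style environment: env.reset() -> state (int),
    env.step(action) -> (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))  # action-value estimates

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q[state]))

            next_state, reward, done = env.step(action)

            # TD target bootstraps from the greedy value of the next state
            target = reward + (0.0 if done else gamma * np.max(q[next_state]))
            q[state, action] += alpha * (target - q[state, action])
            state = next_state

    return q
```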


Citations
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.
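The "hierarchy of concepts" idea described above can be illustrated with a toy two-layer feedforward network, where each layer builds features on top of the previous layer's output. The layer sizes and activation choice below are arbitrary assumptions for illustration only.

```python
import numpy as np

def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of a toy two-layer feedforward network:
    the hidden layer learns simple features, the output layer combines them."""
    h = np.maximum(0.0, x @ w1 + b1)   # hidden layer with ReLU nonlinearity
    return h @ w2 + b2                 # linear output layer

# Example with arbitrary sizes: 4 inputs -> 8 hidden units -> 2 outputs
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
w2, b2 = rng.normal(size=(8, 2)) * 0.1, np.zeros(2)
print(mlp_forward(rng.normal(size=(3, 4)), w1, b1, w2, b2).shape)  # (3, 2)
```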

38,208 citations

Journal ArticleDOI
26 Feb 2015-Nature
TL;DR: This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
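A minimal sketch of the core deep Q-network update (semi-gradient Q-learning with a separate target network), assuming PyTorch. The small fully connected network, optimizer, and hyperparameters are illustrative assumptions; the paper itself uses a convolutional network on Atari pixels with experience replay.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_q_net(obs_dim, n_actions):
    # Small fully connected Q-network (a stand-in for the paper's conv net on pixels)
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a batch of transitions (s, a, r, s', done).

    s, s2: float tensors of states; a: long tensor of action indices;
    r, done: float tensors (done is 1.0 at episode ends).
    """
    s, a, r, s2, done = batch
    # Q(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the slowly updated target network
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```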

23,074 citations

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning and evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "Reinforcement Learning: An Introduc..."

  • ...Such NNs learn to perceive/encode/predict/ classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g., Kaelbling et al., 1996; Sutton & Barto, 1998; Wiering & van Otterlo, 2012)....

    [...]

  • ...The latter is often explained in a probabilistic framework (e.g., Sutton & Barto, 1998), but its basic idea can already be conveyed in a deterministic setting....

    [...]

  • ...Such NNs learn to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (e.g., Kaelbling et al., 1996; Sutton and Barto, 1998)....

    [...]

  • ...Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Baird, 1994; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamarı́a et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010)....

    [...]

  • ...This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec....

    [...]

Journal ArticleDOI
28 Jan 2016-Nature
TL;DR: Using this search algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0, the first time that a computer program has defeated a human professional player in the full-sized game of Go.
Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
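The way the search combines a policy prior with value estimates can be sketched with a PUCT-style selection rule over a node's children. The node data structure and the constant c_puct below are illustrative assumptions, not the paper's exact formulation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                      # P(s, a) from the policy network
    visit_count: int = 0              # N(s, a)
    value_sum: float = 0.0            # accumulated value-network / rollout estimates
    children: dict = field(default_factory=dict)

    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.0):
    """Pick the child maximizing Q(s, a) plus an exploration bonus weighted by the prior."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        return child.q() + u

    return max(node.children.items(), key=score)
```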

14,377 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are given in this article, along with a discussion of combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

References
Proceedings Article
04 Aug 2001
TL;DR: R-MAX as mentioned in this paper is a model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time, where the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model.
Abstract: R-MAX is a simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, the model is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the "optimism under uncertainty" bias used in many RL algorithms. (4) It is much simpler and more general than Brafman and Tennenholtz's LSG algorithm for learning in single-controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) It is the only algorithm for near-optimal learning in repeated games known to be polynomial, providing a much simpler and more efficient alternative to previous algorithms by Banos and by Megiddo.
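A compact sketch of the optimistic-initialization and model-based planning idea described above, for a finite MDP. The fictitious absorbing state, the omitted "known" bookkeeping, and the value-iteration planner are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def rmax_init(n_states, n_actions, r_max):
    """Optimistic initial model: every (state, action) pair is assumed to yield
    the maximal reward and lead to a fictitious absorbing state until it has
    been visited often enough to be declared 'known' and replaced by
    empirical estimates (that bookkeeping is omitted here)."""
    n_total = n_states + 1                      # last index = fictitious state
    rewards = np.full((n_total, n_actions), float(r_max))
    trans = np.zeros((n_total, n_actions, n_total))
    trans[:, :, -1] = 1.0                       # every pair loops into the fictitious state
    return rewards, trans

def plan(rewards, trans, gamma=0.95, iters=200):
    """Value iteration in the (possibly optimistic) model; the agent then acts
    greedily with respect to the returned action values."""
    v = np.zeros(trans.shape[0])
    for _ in range(iters):
        q = rewards + gamma * trans @ v         # Q = R + gamma * T V
        v = q.max(axis=1)
    return q
```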

923 citations

Journal ArticleDOI
TL;DR: In this article, a model for memory based on neurophysiological considerations is reviewed, in which neurons associate two patterns of neural activity by incrementing synaptic connectivity proportionally to the product of pre- and postsynaptic activity, forming a matrix of synaptic connectivities.
Abstract: A previously proposed model for memory based on neurophysiological considerations is reviewed. We assume that (a) nervous system activity is usefully represented as the set of simultaneous individual neuron activities in a group of neurons; (b) different memory traces make use of the same synapses; and (c) synapses associate two patterns of neural activity by incrementing synaptic connectivity proportionally to the product of pre- and postsynaptic activity, forming a matrix of synaptic connectivities. We extend this model by (a) introducing positive feedback of a set of neurons onto itself and (b) allowing the individual neurons to saturate. A hybrid model, partly analog and partly binary, arises. The system has certain characteristics reminiscent of analysis by distinctive features. Next, we apply the model to "categorical perception." Finally, we discuss probability learning. The model can predict overshooting, recency data, and probabilities occurring in systems with more than two events with reasonably good accuracy.
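The feedback-with-saturation dynamics described above (the "brain-state-in-a-box" idea) can be sketched as a simple iterated map: feed the state back through the connectivity matrix, then clip each component. The weight matrix, gain, and bounds below are illustrative assumptions.

```python
import numpy as np

def bsb_step(x, w, alpha=0.3, bound=1.0):
    """One update of the saturating positive-feedback model:
    amplify the state through the connectivity matrix, then saturate."""
    return np.clip(x + alpha * (w @ x), -bound, bound)

# Store one pattern in the connectivity matrix via an outer product (Hebbian-style)
pattern = np.array([1.0, -1.0, 1.0, 1.0])
w = np.outer(pattern, pattern) / pattern.size

# A noisy partial state is driven toward the saturated corner resembling the stored pattern
x = pattern * 0.3 + np.array([0.1, 0.05, -0.1, 0.0])
for _ in range(20):
    x = bsb_step(x, w)
print(x)  # approaches [1, -1, 1, 1]
```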

908 citations

Journal ArticleDOI
TL;DR: The possibility of agreement on a theoretical framework, at least in certain intensively studied areas, may be maximized by defining concepts in terms of experimentally manipulable variables and developing the consequences of assumptions by strict mathematical reasoning as mentioned in this paper.
Abstract: Improved experimental techniques for the study of conditioning and simple discrimination learning enable the present day investigator to obtain data which are sufficiently orderly and reproducible to support exact quantitative predictions of behavior. Analogy with other sciences suggests that full utilization of these techniques in the analysis of learning processes will depend to some extent upon a comparable refinement of theoretical concepts and methods. The necessary interplay between theory and experiment has been hindered, however, by the fact that none of the many current theories of learning commands general agreement among researchers. It seems likely that progress toward a common frame of reference will be slow so long as most theories are built around verbally defined hypothetical constructs which are not susceptible to unequivocal verification. While awaiting resolution of the many apparent disparities among competing theories, it may be advantageous to systematize well established empirical relationships at a peripheral, statistical level of analysis. The possibility of agreement on a theoretical framework, at least in certain intensively studied areas, may be maximized by defining concepts in terms of experimentally manipulable variables, and developing the consequences of assumptions by strict mathematical reasoning. This essay will introduce a series of

908 citations


"Reinforcement Learning: An Introduc..." refers background in this paper

  • ...In psychology, bandit problems have played roles in statistical learning theory (e.g., Bush and Mosteller, 1955; Estes, 1950)....

    [...]

Journal ArticleDOI
TL;DR: The data suggest that the hippocampus does not represent space as a passive computation, but rather that hippocampal spatial processing is an active process likely regulated by cognitive mechanisms.
Abstract: Neural ensembles were recorded from the CA3 region of rats running on T-based decision tasks. Examination of neural representations of space at fast time scales revealed a transient but repeatable phenomenon as rats made a decision: the location reconstructed from the neural ensemble swept forward, first down one path and then the other. Estimated representations were coherent and preferentially swept ahead of the animal rather than behind the animal, implying it represented future possibilities rather than recently traveled paths. Similar phenomena occurred at other important decisions (such as in recovery from an error). Local field potentials from these sites contained pronounced theta and gamma frequencies, but no sharp wave frequencies. Forward-shifted spatial representations were influenced by task demands and experience. These data suggest that the hippocampus does not represent space as a passive computation, but rather that hippocampal spatial processing is an active process likely regulated by cognitive mechanisms.

897 citations


"Reinforcement Learning: An Introduc..." refers background in this paper

  • ...Early work on using estimates of the upper confidence bound to select actions was done by Lai and Robbins (1985), Kaelbling (1993b), and Agrawal (1995)....

    [...]

  • ...When a rat pauses at a choice point in a maze, the representation of space in the hippocampus sweeps forward (and not backwards) along the possible paths the animal can take from that point (Johnson and Redish, 2007)....

    [...]

Proceedings Article
24 Jul 1998
TL;DR: New algorithms for reinforcement learning are presented and it is proved that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes.
Abstract: We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

889 citations