Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.
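As an illustration of the temporal-difference methods covered in Part II, the sketch below shows tabular Q-learning against a hypothetical environment interface; the names `reset`, `step`, and `actions` are assumptions for illustration, not taken from the book.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn action values from interaction alone."""
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # TD update toward the greedy value of the next state
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The update moves Q(s, a) toward the target r + gamma * max_a Q(s', a), while the epsilon-greedy policy balances exploration and exploitation.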


Citations
Proceedings ArticleDOI
21 Jun 2019
TL;DR: LTS (Learn to schedule), a deep reinforcement learning (DRL) based scheduling approach that can dynamically adapt to the variation of both viewer traffic and CDN performance, is proposed.
Abstract: With the growing demand for crowdsourced live streaming (CLS), how to schedule large-scale, dynamic viewers effectively among different Content Delivery Network (CDN) providers has become one of the most significant challenges for CLS platforms. Although abundant algorithms have been proposed in recent years, they suffer from a critical limitation: due to their inaccurate feature engineering or naive rules, they cannot optimally schedule viewers. To address this concern, we propose LTS (Learn to schedule), a deep reinforcement learning (DRL) based scheduling approach that can dynamically adapt to the variation of both viewer traffic and CDN performance. Through an extensive evaluation on real data from a leading CLS platform in China, we demonstrate that LTS improves the average quality of experience (QoE) over the state-of-the-art approach by 8.71%-15.63%.

18 citations


Cites background or methods from "Reinforcement Learning: An Introduction"

  • ...DQN proposed in [12] makes it possible for RL to learn directly from high dimensional inputs, which is a massive step forward from traditional RL [14]....

  • ...In our experiment, we use UCB-1 [14] and DiscountedUCB [6], both of which use a moving average to estimate the expected performance; the difference is that the latter gives more weight to more recent measurements....

  • ...For UCB-1 and Discounted-UCB, a critical limitation also separates them from the optimal solution: since they use a straightforward way to estimate the changes of either workload patterns or CDN provider performance (e.g., a moving average), they cannot fully utilize the historical information.... (a minimal sketch of both estimators follows below)

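The two bandit baselines quoted above differ only in how they average past observations. The sketch below illustrates that difference under assumed parameter names (gamma for the discount, c for the exploration constant); it is not the exact formulation used in the citing paper.

```python
import math

def ucb1_scores(counts, sums, t, c=2.0):
    """UCB-1: plain running average plus an exploration bonus."""
    return [sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
            for a in range(len(counts))]

class DiscountedUCB:
    """Discounted UCB: exponentially down-weights old rewards so that recent
    measurements dominate the estimate (gamma < 1 controls the memory length)."""

    def __init__(self, n_arms, gamma=0.95, c=2.0):
        self.n = [0.0] * n_arms       # discounted pull counts
        self.s = [0.0] * n_arms       # discounted reward sums
        self.gamma, self.c = gamma, c

    def update(self, arm, reward):
        # discount every arm's statistics, then add the new observation
        self.n = [self.gamma * v for v in self.n]
        self.s = [self.gamma * v for v in self.s]
        self.n[arm] += 1.0
        self.s[arm] += reward

    def scores(self):
        total = sum(self.n)
        return [self.s[a] / self.n[a] + math.sqrt(self.c * math.log(total) / self.n[a])
                if self.n[a] > 0 else float("inf")
                for a in range(len(self.n))]
```

With gamma close to 1 the discounted estimator approaches plain UCB-1; smaller values make it track recent CDN performance more aggressively.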

Posted Content
TL;DR: This work proposes a new definition of risk, called caution, as a penalty function added to the dual of the linear programming (LP) formulation of tabular RL, together with a block-coordinate augmentation of this approach, which improves the reliability of reward accumulation without additional computation compared to risk-neutral LP solvers.
Abstract: We study the estimation of risk-sensitive policies in reinforcement learning problems defined by Markov decision processes (MDPs) whose state and action spaces are countably finite. Prior efforts are predominantly afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. Caution measures the distributional risk of a policy as a function of the policy's long-term state occupancy distribution. To solve this problem in an online, model-free manner, we propose a stochastic variant of the primal-dual method that uses the Kullback-Leibler (KL) divergence as its proximal term. We establish that the number of iterations/samples required to attain approximately optimal solutions of this scheme matches tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burdens.
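For context, the dual (occupancy-measure) LP that such a caution penalty attaches to can be written as below. The notation (occupancy measure mu, penalty weight lambda, risk functional rho, initial distribution xi) is standard but assumed here, not copied from the paper.

```latex
% Occupancy-measure (dual) LP for a discounted MDP with reward r and transitions P,
% plus a caution penalty \rho on the state occupancy d_\mu.
% Notation is illustrative; the paper's exact formulation may differ.
\[
\begin{aligned}
\max_{\mu \ge 0}\quad & \sum_{s,a} \mu(s,a)\, r(s,a) \;-\; \lambda\, \rho\big(d_\mu\big),
\qquad d_\mu(s) \;=\; \sum_{a} \mu(s,a),\\
\text{s.t.}\quad & \sum_{a} \mu(s',a) \;=\; (1-\gamma)\,\xi(s') \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\, \mu(s,a)
\qquad \forall\, s'.
\end{aligned}
\]
```

A stochastic primal-dual scheme would then ascend in mu and descend in the multipliers of the flow constraints, with a KL proximal term keeping the occupancy iterates on the probability simplex.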

18 citations


Cites background from "Reinforcement Learning: An Introduction"

  • ...In reinforcement learning (RL) [2], an autonomous agent in a given state selects an action and then transitions to a new state randomly depending on its current state and action, at which point the environment reveals a reward....

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an approach based on deep reinforcement learning to approximate unitary operators as circuits, and showed that this approach decreases the execution time, potentially allowing real-time quantum compiling.
Abstract: The general problem of quantum compiling is to approximate any unitary transformation that describes the quantum computation as a sequence of elements selected from a finite base of universal quantum gates. The Solovay-Kitaev theorem guarantees the existence of such an approximating sequence, but solutions to the quantum compiling problem suffer from a tradeoff between the length of the sequences, the precompilation time, and the execution time. Traditional approaches are time-consuming and unsuitable to be employed during computation. Here, we propose a deep reinforcement learning method as an alternative strategy, which requires a single precompilation procedure to learn a general strategy to approximate single-qubit unitaries. We show that this approach reduces the overall execution time, improving the tradeoff between the length of the sequence and the execution time, potentially allowing real-time quantum compiling.
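To make the setup concrete, here is a toy sketch of the kind of environment such an agent could be trained in. The gate set (H, T, T-dagger), tolerance, episode length, and reward shape are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Finite universal gate set for single-qubit compilation (toy choice: H, T, T^dagger)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
GATES = {"H": H, "T": T, "Tdg": T.conj().T}

def distance(U, V):
    """Phase-invariant distance between two single-qubit unitaries."""
    return np.sqrt(max(0.0, 1 - abs(np.trace(U.conj().T @ V)) / 2))

class CompileEnv:
    """Toy episodic environment: append gates until the running product
    approximates the target unitary within `tol` or the sequence gets too long."""
    def __init__(self, target, tol=0.05, max_len=60):
        self.target, self.tol, self.max_len = target, tol, max_len

    def reset(self):
        self.seq, self.U = [], np.eye(2, dtype=complex)
        return distance(self.U, self.target)

    def step(self, gate_name):
        self.seq.append(gate_name)
        self.U = GATES[gate_name] @ self.U
        d = distance(self.U, self.target)
        done = d < self.tol or len(self.seq) >= self.max_len
        reward = 10.0 if d < self.tol else -0.1   # small step cost, bonus on success
        return d, reward, done
```

Any standard deep RL algorithm over the discrete gate set (e.g., a DQN-style agent) could then be trained against this interface.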

18 citations

Journal ArticleDOI
13 Feb 2021-Sensors
TL;DR: In this paper, a closed-loop learning flow for autonomous driving mini-vehicles that includes the target deployment environment in-the-loop is proposed, where a family of compact and high-throughput tinyCNNs is used to control the mini-vehicle, learning by imitating a computer vision algorithm in the target environment.
Abstract: Standard-sized autonomous vehicles have rapidly improved thanks to the breakthroughs of deep learning. However, scaling autonomous driving to mini-vehicles poses several challenges due to their limited on-board storage and computing capabilities. Moreover, autonomous systems lack robustness when deployed in dynamic environments where the underlying distribution is different from the distribution learned during training. To address these challenges, we propose a closed-loop learning flow for autonomous driving mini-vehicles that includes the target deployment environment in-the-loop. We leverage a family of compact and high-throughput tinyCNNs to control the mini-vehicle; they learn by imitating a computer vision algorithm, i.e., the expert, in the target environment. Thus, the tinyCNNs, having access only to an on-board fast-rate linear camera, gain robustness to lighting conditions and improve over time. Moreover, we introduce an online predictor that can choose between different tinyCNN models at runtime, trading accuracy against latency, which minimises the inference energy consumption by up to 3.2×. Finally, we leverage GAP8, a parallel ultra-low-power RISC-V-based micro-controller unit (MCU), to meet the real-time inference requirements. When running the family of tinyCNNs, our solution running on GAP8 outperforms any other implementation on the STM32L4 and NXP k64f (traditional single-core MCUs), reducing the latency by over 13× and the energy consumption by 92%.
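The core of the closed-loop flow is that a vision-based expert keeps labelling what the student network sees in the deployment environment. The sketch below is a generic, DAgger-style rendering of that idea; `env`, `expert`, and `student` are hypothetical interfaces, and the paper's actual data-collection policy may differ.

```python
# Minimal sketch of closed-loop imitation learning (DAgger-style aggregation is an
# assumption here; the paper's exact collection scheme may differ).
def closed_loop_imitation(env, expert, student, rounds=5, steps=1000):
    dataset = []                                   # (camera_frame, expert_action) pairs
    for r in range(rounds):
        frame = env.reset()
        for _ in range(steps):
            expert_action = expert(frame)          # computer-vision expert labels the frame
            dataset.append((frame, expert_action))
            # after the first round, the student drives so it sees its own mistakes
            action = expert_action if r == 0 else student.predict(frame)
            frame, done = env.step(action)
            if done:
                frame = env.reset()
        student.fit(dataset)                       # retrain on everything collected so far
    return student
```

Because the student is retrained on frames gathered while it drives, the collected distribution gradually matches the one it faces at deployment time.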

18 citations

Book ChapterDOI
02 Dec 2002
TL;DR: This paper reports on an efficient approach that allows independent reinforcement learning agents to reach a Pareto optimal Nash equilibrium with limited communication, and explores this technique in repeated common interest games with deterministic or stochastic outcomes.
Abstract: Coordination is an important issue in multi-agent systems when agents want to maximize their revenue. Often coordination is achieved through communication; however, communication has its price. We are interested in finding an approach where the communication between the agents is kept low, while a globally optimal behavior can still be found. In this paper we report on an efficient approach that allows independent reinforcement learning agents to reach a Pareto optimal Nash equilibrium with limited communication. The communication happens at regular time steps and is basically a signal for the agents to start an exploration phase. During each exploration phase, some agents exclude their current best action so as to give the team the opportunity to look for a possibly better Nash equilibrium. This technique of reducing the action space by exclusions was only recently introduced for finding periodical policies in games of conflicting interests. Here, we explore this technique in repeated common interest games with deterministic or stochastic outcomes.
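As a rough illustration of the exclusion idea (not the paper's exact algorithm), the sketch below has independent learners in a repeated common-interest matrix game; at the start of each exploration phase some agents drop their current best action so the team can test alternative joint actions. All names and parameters are assumptions.

```python
import random

def coordinate_by_exclusion(game, agents, phases=20, phase_len=200, epsilon=0.1):
    """Toy sketch: independent learners in a common-interest matrix game.
    At the start of each exploration phase (the 'communication' signal), some
    agents exclude their current best action so the team can test alternatives."""
    Q = [{a: 0.0 for a in agent_actions} for agent_actions in agents]
    best_joint, best_value = None, float("-inf")
    for _ in range(phases):
        # each agent independently decides whether to exclude its current best action
        excluded = [max(q, key=q.get) if random.random() < 0.5 else None for q in Q]
        for _ in range(phase_len):
            joint = []
            for i, q in enumerate(Q):
                options = [a for a in q if a != excluded[i]] or list(q)
                if random.random() < epsilon:
                    joint.append(random.choice(options))
                else:
                    joint.append(max(options, key=lambda a: q[a]))
            reward = game(tuple(joint))            # same reward for every agent
            for i, a in enumerate(joint):
                Q[i][a] += 0.1 * (reward - Q[i][a])
            if reward > best_value:
                best_joint, best_value = tuple(joint), reward
    return best_joint, best_value
```

For a two-agent common-interest game, `game` maps a joint action such as ('a', 'b') to the payoff shared by both agents.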

18 citations

References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
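The gating mechanism described above can be summarised in a few lines of NumPy. Note this is the modern formulation with a forget gate, which the original 1997 cell did not yet include; the single stacked weight matrix W and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell. The cell state c is the 'constant error
    carousel': it is updated additively, so gradients can flow across many time
    steps without vanishing. W has shape (4H, len(x) + H), b has shape (4H,)."""
    z = W @ np.concatenate([x, h_prev]) + b          # all four gates in one matmul
    H = h_prev.size
    f = sigmoid(z[0:H])                               # forget gate
    i = sigmoid(z[H:2*H])                             # input gate
    o = sigmoid(z[2*H:3*H])                           # output gate
    g = np.tanh(z[3*H:4*H])                           # candidate cell update
    c = f * c_prev + i * g                            # additive carousel update
    h = o * np.tanh(c)                                # gated output
    return h, c
```

Because c is updated additively rather than being repeatedly squashed through a nonlinearity, the error signal can be carried across the long time lags the abstract refers to.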

72,897 citations

Book
01 Sep 1988
TL;DR: This book brings together the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields.
Abstract: From the Publisher: This book brings together - in an informal and tutorial fashion - the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.

52,797 citations

Journal ArticleDOI
01 Jan 1998
TL;DR: In this article, graph transformer networks (GTNs) are proposed, which allow multi-module recognition systems to be trained globally using gradient-based methods; convolutional neural networks are shown to classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
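As a concrete reference point, a LeNet-style convolutional network for 28x28 digit images can be written in a few lines. This is a generic PyTorch sketch, not the exact LeNet-5 architecture or the GTN training pipeline from the paper.

```python
import torch
from torch import nn

class LeNetStyle(nn.Module):
    """LeNet-style convolutional network for 28x28 grayscale digits
    (a minimal sketch; layer sizes are illustrative)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

The convolution and pooling layers provide the invariance to small shifts and distortions of 2D shapes that the abstract highlights, while the final linear layers produce the class scores.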

42,067 citations

01 Jan 1989
TL;DR: This book brings together the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields.
Abstract: From the Publisher: This book brings together - in an informal and tutorial fashion - the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. Major concepts are illustrated with running examples, and major algorithms are illustrated by Pascal computer programs. No prior knowledge of GAs or genetics is assumed, and only a minimum of computer programming and mathematics background is required.

33,034 citations

Journal ArticleDOI
TL;DR: This book covers a broad range of topics for regular factorial designs and presents all of the material in very mathematical fashion and will surely become an invaluable resource for researchers and graduate students doing research in the design of factorial experiments.
Abstract: Pattern Recognition and Machine Learning. Technometrics, Vol. 49, No. 3 (2007), p. 366.

18,802 citations