
Showing papers on "Reinforcement learning" published in 2013


Posted Content
TL;DR: This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning; applied to seven Atari 2600 games, it outperforms all previous approaches on six of them and surpasses a human expert on three.
Abstract: We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
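A minimal PyTorch sketch of the idea described above: a convolutional network maps a stack of raw frames to one Q-value per action and is trained toward a bootstrapped Q-learning target. The layer sizes, the absence of a separate target network or replay buffer, and the random tensors standing in for preprocessed frames are illustrative simplifications, not the paper's exact setup.

```python
# Minimal sketch of a DQN-style update (illustrative sizes, not the paper's exact architecture).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, n_actions),      # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)

def q_learning_loss(q_net, batch, gamma: float = 0.99):
    """Squared TD error on a batch of (s, a, r, s', done) transitions."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    return ((q_sa - target) ** 2).mean()

# Usage with random tensors standing in for stacks of preprocessed frames:
q_net = QNetwork(n_actions=6)
s = torch.rand(32, 4, 84, 84); s2 = torch.rand(32, 4, 84, 84)
a = torch.randint(0, 6, (32,)); r = torch.rand(32); done = torch.zeros(32)
loss = q_learning_loss(q_net, (s, a, r, s2, done))
loss.backward()
```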

8,757 citations


Journal ArticleDOI
TL;DR: The Arcade Learning Environment (ALE) is both a challenge problem and a platform for evaluating the development of general, domain-independent AI technology; it provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players.
Abstract: In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
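For orientation, a minimal agent-environment loop against an ALE game, assuming the ale-py package and its Gymnasium bindings are installed; the environment id and the random policy are placeholders for one of the benchmark agents.

```python
# Minimal random-agent episode on an ALE game (assumes ale-py / gymnasium[atari] and ROMs
# are installed; the environment id is illustrative).
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")
obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()          # stand-in for a learning agent
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
env.close()
print(f"episode return: {episode_return}")
```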

2,429 citations


Journal ArticleDOI
TL;DR: This article attempts to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots, highlighting both key challenges in robot reinforcement learning and notable successes.
Abstract: Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning and notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail, we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.

2,391 citations


Book
15 Aug 2013
TL;DR: This work classifies model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy, and presents a unified view of existing algorithms.
Abstract: Policy search is a subfield in reinforcement learning which focuses on finding good parameters for a given policy parametrization. It is well suited for robotics as it can cope with high-dimensional state and action spaces, one of the main challenges in robot learning. We review recent successes of both model-free and model-based policy search in robot learning.Model-free policy search is a general approach to learn policies based on sampled trajectories. We classify model-free methods based on their policy evaluation strategy, policy update strategy, and exploration strategy and present a unified view on existing algorithms. Learning a policy is often easier than learning an accurate forward model, and, hence, model-free methods are more frequently used in practice. However, for each sampled trajectory, it is necessary to interact with the robot, which can be time consuming and challenging in practice. Model-based policy search addresses this problem by first learning a simulator of the robot's dynamics from data. Subsequently, the simulator generates trajectories that are used for policy learning. For both model-free and model-based policy search methods, we review their respective properties and their applicability to robotic systems.
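As a toy illustration of model-free policy search from sampled trajectories, the sketch below perturbs the parameters of a linear policy and keeps perturbations that improve the average return; the rollout function and the linear toy dynamics are assumptions made for the example, not taken from the book.

```python
# Toy model-free policy search: hill climbing on the parameters of a linear policy.
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, T=50):
    """Return of a linear policy u = theta @ x on a toy linear system (assumed dynamics)."""
    x, ret = np.array([1.0, 0.0]), 0.0
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    for _ in range(T):
        u = theta @ x
        x = A @ x + (B * u).ravel()
        ret -= x @ x + 0.01 * u * u          # negative quadratic cost as reward
    return ret

def policy_search(n_iters=200, sigma=0.1, n_rollouts=5):
    theta = np.zeros(2)
    best = np.mean([rollout(theta) for _ in range(n_rollouts)])
    for _ in range(n_iters):
        cand = theta + sigma * rng.standard_normal(2)      # exploration in parameter space
        score = np.mean([rollout(cand) for _ in range(n_rollouts)])
        if score > best:                                    # policy update: keep improvements
            theta, best = cand, score
    return theta, best

print(policy_search())
```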

903 citations


Journal ArticleDOI
TL;DR: When proper controls are made, dopamine neurons show only limited activations to punishers and do not code salience to a substantial extent; intact dopamine mechanisms are required for learning and postsynaptic plasticity.

572 citations


Journal ArticleDOI
TL;DR: This article reviews classical and recent developments of Adaptive Resonance Theory, and provides a synthesis of concepts, principles, mechanisms, architectures, and the interdisciplinary databases that they have helped to explain and predict.

459 citations


Journal ArticleDOI
TL;DR: An online adaptive reinforcement learning-based solution is developed for the infinite-horizon optimal control problem for continuous-time uncertain nonlinear systems, using a novel actor-critic-identifier (ACI) architecture to approximately solve the Hamilton-Jacobi-Bellman equation.

447 citations


Journal ArticleDOI
TL;DR: This paper presents the development and evaluation of a novel system of multiagent reinforcement learning for an integrated network of adaptive traffic signal controllers (MARLIN-ATSC), which yields an unprecedented reduction in average intersection delay.
Abstract: Population is steadily increasing worldwide, resulting in intractable traffic congestion in dense urban areas. Adaptive traffic signal control (ATSC) has shown strong potential to effectively alleviate urban traffic congestion by adjusting signal timing plans in real time in response to traffic fluctuations to achieve desirable objectives (e.g., minimize delay). Efficient and robust ATSC can be designed using a multiagent reinforcement learning (MARL) approach in which each controller (agent) is responsible for the control of traffic lights around a single traffic junction. Applying MARL approaches to the ATSC problem is associated with a few challenges as agents typically react to changes in the environment at the individual level, but the overall behavior of all agents may not be optimal. This paper presents the development and evaluation of a novel system of multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC). MARLIN-ATSC offers two possible modes: 1) independent mode, where each intersection controller works independently of other agents; and 2) integrated mode, where each controller coordinates signal control actions with neighboring intersections. MARLIN-ATSC is tested on a large-scale simulated network of 59 intersections in the lower downtown core of the City of Toronto, ON, Canada, for the morning rush hour. The results show unprecedented reduction in the average intersection delay ranging from 27% in mode 1 to 39% in mode 2 at the network level and travel-time savings of 15% in mode 1 and 26% in mode 2, along the busiest routes in Downtown Toronto.
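A toy sketch of the independent mode described above: one tabular Q-learning agent per intersection, each updating from its own local state and a delay-based reward. The state/action encoding and the step_intersection hook are hypothetical stand-ins for a traffic simulator; the integrated mode would additionally condition each agent on its neighbours' actions, which is not shown here.

```python
# Toy independent-mode multiagent Q-learning: one tabular agent per intersection.
# The `step_intersection` hook is a hypothetical stand-in for a traffic simulator.
import random
from collections import defaultdict

class IndependentQAgent:
    def __init__(self, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.n_actions, self.alpha, self.gamma, self.eps = n_actions, alpha, gamma, eps

    def act(self, state):
        if random.random() < self.eps:                        # epsilon-greedy exploration
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[state][a])

    def update(self, s, a, r, s2):
        target = r + self.gamma * max(self.q[s2])
        self.q[s][a] += self.alpha * (target - self.q[s][a])

def step_intersection(intersection, state, action):
    """Hypothetical simulator hook: returns (next_state, reward = -delay)."""
    delay = random.random() * (1 + action)                    # placeholder dynamics
    return (state + 1) % 4, -delay

agents = {i: IndependentQAgent(n_actions=2) for i in range(3)}   # e.g. keep / extend green
states = {i: 0 for i in agents}
for _ in range(1000):
    for i, agent in agents.items():
        a = agent.act(states[i])
        s2, r = step_intersection(i, states[i], a)
        agent.update(states[i], a, r, s2)
        states[i] = s2
```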

437 citations


Journal ArticleDOI
TL;DR: This paper presents an online policy iteration (PI) algorithm to learn the continuous-time optimal control solution for unknown constrained-input systems where two neural networks are tuned online and simultaneously to generate the optimal bounded control policy.
Abstract: This paper presents an online policy iteration (PI) algorithm to learn the continuous-time optimal control solution for unknown constrained-input systems. The proposed PI algorithm is implemented on an actor-critic structure where two neural networks (NNs) are tuned online and simultaneously to generate the optimal bounded control policy. The requirement of complete knowledge of the system dynamics is obviated by employing a novel NN identifier in conjunction with the actor and critic NNs. It is shown how the identifier weights estimation error affects the convergence of the critic NN. A novel learning rule is developed to guarantee that the identifier weights converge to small neighborhoods of their ideal values exponentially fast. To provide an easy-to-check persistence of excitation condition, the experience replay technique is used. That is, recorded past experiences are used simultaneously with current data for the adaptation of the identifier weights. Stability of the whole system consisting of the actor, critic, system state, and system identifier is guaranteed while all three networks undergo adaptation. Convergence to a near-optimal control law is also shown. The effectiveness of the proposed method is illustrated with a simulation example.

371 citations


Book
16 Apr 2013
TL;DR: This book presents a framework for building brain models centered on the Semantic Pointer Architecture, illustrates it with the Nengo modeling software, and evaluates it against existing theories of cognition.
Abstract: Contents
1 The science of cognition 1.1 The last 50 years 1.2 How we got here 1.3 Where we are 1.4 Questions and answers 1.5 Nengo: An introduction
Part I: How to build a brain
2 An introduction to brain building 2.1 Brain parts 2.2 A framework for building a brain 2.2.1 Representation 2.2.2 Transformation 2.2.3 Dynamics 2.2.4 The three principles 2.3 Levels 2.4 Nengo: Neural representation
3 Biological cognition - Semantics 3.1 The semantic pointer hypothesis 3.2 What is a semantic pointer? 3.3 Semantics: An overview 3.4 Shallow semantics 3.5 Deep semantics for perception 3.6 Deep semantics for action 3.7 The semantics of perception and action 3.8 Nengo: Neural computations
4 Biological cognition - Syntax 4.1 Structured representations 4.2 Binding without neurons 4.3 Binding with neurons 4.4 Manipulating structured representations 4.5 Learning structural manipulations 4.6 Clean-up memory and scaling 4.7 Example: Fluid intelligence 4.8 Deep semantics for cognition 4.9 Nengo: Structured representations in neurons
5 Biological cognition - Control 5.1 The flow of information 5.2 The basal ganglia 5.3 Basal ganglia, cortex, and thalamus 5.4 Example: Fixed sequences of actions 5.5 Attention and the routing of information 5.6 Example: Flexible sequences of actions 5.7 Timing and control 5.8 Example: The Tower of Hanoi 5.9 Nengo: Question answering
6 Biological cognition - Memory and learning 6.1 Extending cognition through time 6.2 Working memory 6.3 Example: Serial list memory 6.4 Biological learning 6.5 Example: Learning new actions 6.6 Example: Learning new syntactic manipulations 6.7 Nengo: Learning
7 The Semantic Pointer Architecture (SPA) 7.1 A summary of the SPA 7.2 A SPA unified network 7.3 Tasks 7.3.1 Recognition 7.3.2 Copy drawing 7.3.3 Reinforcement learning 7.3.4 Serial working memory 7.3.5 Counting 7.3.6 Question answering 7.3.7 Rapid variable creation 7.3.8 Fluid reasoning 7.3.9 Discussion 7.4 A unified view: Symbols and probabilities 7.5 Nengo: Advanced modeling methods
Part II: Is that how you build a brain?
8 Evaluating cognitive theories 8.1 Introduction 8.2 Core cognitive criteria (CCC) 8.2.1 Representational structure 8.2.1.1 Systematicity 8.2.1.2 Compositionality 8.2.1.3 Productivity 8.2.1.4 The massive binding problem 8.2.2 Performance concerns 8.2.2.1 Syntactic generalization 8.2.2.2 Robustness 8.2.2.3 Adaptability 8.2.2.4 Memory 8.2.2.5 Scalability 8.2.3 Scientific merit 8.2.3.1 Triangulation (contact with more sources of data) 8.2.3.2 Compactness 8.3 Conclusion 8.4 Nengo Bonus: How to build a brain - a practical guide
9 Theories of cognition 9.1 The state of the art 9.1.1 ACT-R 9.1.2 Synchrony-based approaches 9.1.3 Neural blackboard architecture (NBA) 9.1.4 The integrated connectionist/symbolic architecture (ICS) 9.1.5 Leabra 9.1.6 Dynamic field theory (DFT) 9.2 An evaluation 9.2.1 Representational structure 9.2.2 Performance concerns 9.2.3 Scientific merit 9.2.4 Summary 9.3 The same... 9.4 ...but different 9.5 The SPA versus the SOA
10 Consequences and challenges 10.1 Representation 10.2 Concepts 10.3 Inference 10.4 Dynamics 10.5 Challenges 10.6 Conclusion
A Mathematical notation and overview A.1 Vectors A.2 Vector spaces A.3 The dot product A.4 Basis of a vector space A.5 Linear transformations on vectors A.6 Time derivatives for dynamics
B Mathematical derivations for the NEF B.1 Representation B.1.1 Encoding B.1.2 Decoding B.2 Transformation B.3 Dynamics
C Further details on deep semantic models C.1 The perceptual model C.2 The motor model
D Mathematical derivations for the SPA D.1 Binding and unbinding HRRs D.2 Learning high-level transformations D.3 Ordinal serial encoding model D.4 Spike-timing dependent plasticity D.5 Number of neurons for representing structure
E SPA model details E.1 Tower of Hanoi
Bibliography
Index

362 citations


Proceedings Article
05 Dec 2013
TL;DR: This paper introduces Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels, and shows that it can outperform state-of-the-art approaches while remaining robust to infrequent and inconsistent human feedback.
Abstract: A long term goal of Interactive Reinforcement Learning is to incorporate nonexpert human feedback to solve complex tasks. Some state-of-the-art methods have approached this problem by mapping human information to rewards and values and iterating over them to compute better control policies. In this paper we argue for an alternate, more effective characterization of human feedback: Policy Shaping. We introduce Advise, a Bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct policy labels. We compare Advise to state-of-the-art approaches and show that it can outperform them and is robust to infrequent and inconsistent human feedback.
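A small sketch of how counts of "right"/"wrong" human labels can be turned into a policy and multiplied into the agent's own action distribution, in the spirit of the Advise rule summarized above. The consistency parameter C, the exponent form, and the renormalisation reflect one reading of the paper and should be checked against it.

```python
# Sketch of policy shaping from binary human feedback (hedged reading of the Advise rule).
import numpy as np

def feedback_policy(right, wrong, consistency=0.8):
    """P(a optimal | feedback) from per-action counts of 'right' / 'wrong' labels."""
    delta = np.asarray(right, float) - np.asarray(wrong, float)
    c = consistency
    p = c ** delta / (c ** delta + (1 - c) ** delta)
    return p / p.sum()

def combined_policy(agent_probs, right, wrong, consistency=0.8):
    """Multiply the agent's action distribution with the feedback policy and renormalise."""
    fb = feedback_policy(right, wrong, consistency)
    p = np.asarray(agent_probs) * fb
    return p / p.sum()

# Example: three actions; the human labelled action 0 'right' twice and action 2 'wrong' once.
agent_probs = np.array([0.3, 0.4, 0.3])      # e.g. from a Boltzmann distribution over Q-values
print(combined_policy(agent_probs, right=[2, 0, 0], wrong=[0, 0, 1]))
```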

Journal ArticleDOI
Fiery Cushman
TL;DR: A broad division between two algorithms for learning and choice derived from formal models of reinforcement learning provides an ideal framework for a dual-system theory in the moral domain.
Abstract: Dual-system approaches to psychology explain the fundamental properties of human judgment, decision making, and behavior across diverse domains. Yet, the appropriate characterization of each system is a source of debate. For instance, a large body of research on moral psychology makes use of the contrast between “emotional” and “rational/cognitive” processes, yet even the chief proponents of this division recognize its shortcomings. Largely independently, research in the computational neurosciences has identified a broad division between two algorithms for learning and choice derived from formal models of reinforcement learning. One assigns value to actions intrinsically based on past experience, while another derives representations of value from an internally represented causal model of the world. This division between action- and outcome-based value representation provides an ideal framework for a dual-system theory in the moral domain.

Posted Content
TL;DR: An Õ(τS√AT) bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
Abstract: Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
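A compact sketch of the PSRL loop on a small finite MDP: maintain a Dirichlet posterior over transitions and a crude Gaussian-mean posterior over rewards, sample one MDP per episode, solve it with finite-horizon value iteration, and act greedily with respect to that sample. The toy environment and the particular posterior choices are illustrative.

```python
# Sketch of posterior sampling for reinforcement learning (PSRL) on a toy finite MDP.
import numpy as np

rng = np.random.default_rng(0)
S, A, H, EPISODES = 5, 2, 10, 200

true_P = rng.dirichlet(np.ones(S), size=(S, A))       # unknown to the agent
true_R = rng.random((S, A))

trans_counts = np.ones((S, A, S))                     # Dirichlet(1, ..., 1) prior
r_sum = np.zeros((S, A)); r_n = np.ones((S, A))       # crude Gaussian-mean posterior

def solve(P, R, horizon):
    """Finite-horizon value iteration; returns a greedy policy per step."""
    V = np.zeros(S)
    pi = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P @ V                                  # Q[s,a] = R[s,a] + sum_s' P[s,a,s'] V[s']
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

for _ in range(EPISODES):
    # Sample one MDP from the posterior and solve it for this episode.
    P_hat = np.stack([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R_hat = r_sum / r_n + rng.standard_normal((S, A)) / np.sqrt(r_n)
    pi = solve(P_hat, R_hat, H)
    s = 0
    for h in range(H):
        a = pi[h, s]
        s2 = rng.choice(S, p=true_P[s, a])
        r = true_R[s, a]
        trans_counts[s, a, s2] += 1                    # posterior update
        r_sum[s, a] += r; r_n[s, a] += 1
        s = s2
```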

Journal ArticleDOI
TL;DR: It is demonstrated that having human decision makers perform a demanding secondary task engenders increased reliance on a model-free reinforcement-learning strategy, and competition between multiple learning systems can be controlled on a trial-by-trial basis by modulating the availability of cognitive resources.
Abstract: A number of accounts of human and animal behavior posit the operation of parallel and competing valuation systems in the control of choice behavior. In these accounts, a flexible but computationally expensive model-based reinforcement-learning system has been contrasted with a less flexible but more efficient model-free reinforcement-learning system. The factors governing which system controls behavior-and under what circumstances-are still unclear. Following the hypothesis that model-based reinforcement learning requires cognitive resources, we demonstrated that having human decision makers perform a demanding secondary task engenders increased reliance on a model-free reinforcement-learning strategy. Further, we showed that, across trials, people negotiate the trade-off between the two systems dynamically as a function of concurrent executive-function demands, and people's choice latencies reflect the computational expenses of the strategy they employ. These results demonstrate that competition between multiple learning systems can be controlled on a trial-by-trial basis by modulating the availability of cognitive resources.

Posted Content
TL;DR: This paper explicitly represents uncertainty about the parameters of the model and builds probability distributions over Q-values based on these; the distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.
Abstract: Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information - the expected improvement in future decision quality arising from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper we investigate ways of representing and reasoning about this uncertainty in algorithms where the system attempts to learn a model of its environment. We explicitly represent uncertainty about the parameters of the model and build probability distributions over Q-values based on these. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.
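The myopic value-of-information quantity can be approximated by Monte Carlo from sampled Q-values. The sketch below uses the standard value-of-perfect-information gain from Bayesian Q-learning, with the sample matrix standing in for the model-derived Q-value distributions; treat it as one plausible reading of the construction rather than the paper's exact algorithm.

```python
# Monte Carlo sketch of a myopic value-of-perfect-information (VPI) exploration bonus,
# computed from samples of each action's Q-value distribution (samples are illustrative).
import numpy as np

def vpi(q_samples):
    """q_samples: array of shape (n_actions, n_samples) drawn from per-action Q posteriors."""
    means = q_samples.mean(axis=1)
    order = np.argsort(means)[::-1]
    a1, a2 = order[0], order[1]                      # best and second-best by expected value
    bonus = np.zeros(len(means))
    for a in range(len(means)):
        q = q_samples[a]
        if a == a1:
            gain = np.maximum(means[a2] - q, 0.0)    # value of learning the best action is worse
        else:
            gain = np.maximum(q - means[a1], 0.0)    # value of learning this action is better
        bonus[a] = gain.mean()
    return bonus

rng = np.random.default_rng(0)
q_samples = rng.normal(loc=[[1.0], [0.9], [0.2]], scale=[[0.1], [0.5], [0.3]], size=(3, 10000))
scores = q_samples.mean(axis=1) + vpi(q_samples)     # balance exploitation and exploration
print(scores, "-> choose action", scores.argmax())
```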

Journal ArticleDOI
TL;DR: To synthesize fixed-final-time control-constrained optimal controllers for discrete-time nonlinear control-affine systems, a single neural network (NN)-based controller called the Finite-horizon Single Network Adaptive Critic is developed in this paper.
Abstract: To synthesize fixed-final-time control-constrained optimal controllers for discrete-time nonlinear control-affine systems, a single neural network (NN)-based controller called the Finite-horizon Single Network Adaptive Critic is developed in this paper. Inputs to the NN are the current system states and the time-to-go, and the network outputs are the costates that are used to compute optimal feedback control. Control constraints are handled through a nonquadratic cost function. Convergence proofs of: 1) the reinforcement learning-based training method to the optimal solution; 2) the training error; and 3) the network weights are provided. The resulting controller is shown to solve the associated time-varying Hamilton-Jacobi-Bellman equation and provide the fixed-final-time optimal solution. Performance of the new synthesis technique is demonstrated through different examples including an attitude control problem wherein a rigid spacecraft performs a finite-time attitude maneuver subject to control bounds. The new formulation has great potential for implementation since it consists of only one NN with single set of weights and it provides comprehensive feedback solutions online, though it is trained offline.

Book ChapterDOI
TL;DR: It is concluded that the brain maintains distinct model-based and model-free learning systems, with distinct neural substrates, which act in competitive balance to direct behavior.
Abstract: Motor learning can be framed theoretically as a problem of optimizing a movement policy in a potentially uncertain or changing environment. This is precisely the general problem studied in the field of reinforcement learning. Reinforcement learning theory proposes two distinct approaches to solving this general problem: Model-based approaches first identify the dynamics of the task or environment and then use this knowledge to compute the optimal movement policy. Model-free approaches, by contrast, directly identify successful policies through a process of trial and error. Here, we review existing literature on motor control in the light of this distinction. Motor learning research in the last decade has been dominated by studies that elicit learning through adaptation paradigms and find the results to be consistent with a model-based framework. Studying the behavior of patients in such adaptation paradigms has implicated the cerebellum as a prime candidate for the neural substrate of the internal models that subserve model-based control. A growing body of experimental results, however, demonstrates that not all of motor learning in conventional paradigms can be explained within model-based frameworks, but can be understood in terms of an additional component of learning driven by model-free reinforcement of successful actions. We conclude that the brain maintains distinct model-based and model-free learning systems, with distinct neural substrates, which act in competitive balance to direct behavior.

Journal ArticleDOI
TL;DR: A novel parallel Q-learning approach is presented, aimed at reducing the time taken to determine optimal policies whilst learning online, so that optimal scaling policies can be determined in a dynamic non-stationary environment.
Abstract: Public Infrastructure as a Service (IaaS) clouds such as Amazon, GoGrid and Rackspace deliver computational resources by means of virtualisation technologies. These technologies allow multiple independent virtual machines to reside in apparent isolation on the same physical host. Dynamically scaling applications running on IaaS clouds can lead to varied and unpredictable results because of the performance interference effects associated with co-located virtual machines. Determining appropriate scaling policies in a dynamic non-stationary environment is non-trivial. One principal advantage exhibited by IaaS clouds over their traditional hosting counterparts is the ability to scale resources on demand. However, a problem arises concerning resource allocation as to which resources should be added and removed when the underlying performance of the resource is in a constant state of flux. Decision-theoretic frameworks such as Markov Decision Processes are particularly suited to decision making under uncertainty. By applying a temporal-difference reinforcement learning algorithm known as Q-learning, optimal scaling policies can be determined. Additionally, reinforcement learning techniques typically suffer from curse of dimensionality problems, where the state space grows exponentially with each additional state variable. To address this challenge, we also present a novel parallel Q-learning approach aimed at reducing the time taken to determine optimal policies whilst learning online.
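A toy sketch of the Q-learning formulation for scaling decisions: the state couples a coarse load level with the current VM count, the actions remove, keep, or add a VM, and the reward trades off unmet demand against resource cost. The workload model and reward weights are assumptions made for the example; the paper's parallel variant distributes this learning across multiple learners and is not shown here.

```python
# Toy Q-learning auto-scaler: state = (load level, #VMs), actions = remove / keep / add a VM.
# Workload model and reward weights are illustrative assumptions.
import random

ACTIONS = (-1, 0, +1)
Q = {}
alpha, gamma, eps = 0.1, 0.9, 0.1

def qvals(s):
    return Q.setdefault(s, [0.0, 0.0, 0.0])

def reward(load, vms):
    sla_penalty = max(0.0, load - vms)      # unmet demand
    cost = 0.2 * vms                        # price of running VMs
    return -(sla_penalty + cost)

def step(load, vms, action):
    vms = min(10, max(1, vms + action))
    load = min(10, max(0, load + random.choice((-1, 0, 1))))   # drifting demand
    return load, vms

load, vms = 5, 3
for t in range(20000):
    s = (load, vms)
    if random.random() < eps:
        a_idx = random.randrange(3)                             # explore
    else:
        a_idx = max(range(3), key=lambda i: qvals(s)[i])        # exploit
    load, vms = step(load, vms, ACTIONS[a_idx])
    r = reward(load, vms)
    s2 = (load, vms)
    qvals(s)[a_idx] += alpha * (r + gamma * max(qvals(s2)) - qvals(s)[a_idx])
```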

Journal ArticleDOI
TL;DR: Numerical results are given to validate the theoretical findings, highlighting the inherent tradeoffs facing small cells, namely exploration/exploitation, myopic/foresighted behavior and complete/incomplete information.
Abstract: In this paper, a decentralized and self-organizing mechanism for small cell networks (such as micro-, femto- and picocells) is proposed. In particular, an application to the case in which small cell networks aim to mitigate the interference caused to the macrocell network, while maximizing their own spectral efficiencies, is presented. The proposed mechanism is based on new notions of reinforcement learning (RL) through which small cells jointly estimate their time-average performance and optimize their probability distributions with which they judiciously choose their transmit configurations. Here, a minimum signal to interference plus noise ratio (SINR) is guaranteed at the macrocell user equipment (UE), while the small cells maximize their individual performances. The proposed RL procedure is fully distributed as every small cell base station requires only an observation of its instantaneous performance which can be obtained from its UE. Furthermore, it is shown that the proposed mechanism always converges to an epsilon Nash equilibrium when all small cells share the same interest. In addition, this mechanism is shown to possess better convergence properties and incur less overhead than existing techniques such as best response dynamics, fictitious play or classical RL. Finally, numerical results are given to validate the theoretical findings, highlighting the inherent tradeoffs facing small cells, namely exploration/exploitation, myopic/foresighted behavior and complete/incomplete information.

Proceedings ArticleDOI
Tom Schaul
17 Oct 2013
TL;DR: It is shown how to learn competent behaviors when a model of the game dynamics is available or when it is not, when full state information is given to the agent or just subjective observations, when learning is interactive or in batch-mode, and for a number of different learning algorithms, including reinforcement learning and evolutionary search.
Abstract: We propose a powerful new tool for conducting research on computational intelligence and games. `PyVGDL' is a simple, high-level description language for 2D video games, and the accompanying software library permits parsing and instantly playing those games. The streamlined design of the language is based on defining locations and dynamics for simple building blocks, and the interaction effects when such objects collide, all of which are provided in a rich ontology. It can be used to quickly design games, without needing to deal with control structures, and the concise language is also accessible to generative approaches. We show how the dynamics of many classical games can be generated from a few lines of PyVGDL. The main objective of these generated games is to serve as diverse benchmark problems for learning and planning algorithms; so we provide a collection of interfaces for different types of learning agents, with visual or abstract observations, from a global or first-person viewpoint. To demonstrate the library's usefulness in a broad range of learning scenarios, we show how to learn competent behaviors when a model of the game dynamics is available or when it is not, when full state information is given to the agent or just subjective observations, when learning is interactive or in batch-mode, and for a number of different learning algorithms, including reinforcement learning and evolutionary search.

Book ChapterDOI
01 Jan 2013
TL;DR: This chapter argues that the answer to both questions is assuredly “yes” and that the machine learning framework of reinforcement learning is particularly appropriate for bringing learning together with what in animals one would call motivation.
Abstract: Psychologists distinguish between extrinsically motivated behavior, which is behavior undertaken to achieve some externally supplied reward, such as a prize, a high grade, or a high-paying job, and intrinsically motivated behavior, which is behavior done for its own sake. Is an analogous distinction meaningful for machine learning systems? Can we say of a machine learning system that it is motivated to learn, and if so, is it possible to provide it with an analog of intrinsic motivation? Despite the fact that a formal distinction between extrinsic and intrinsic motivation is elusive, this chapter argues that the answer to both questions is assuredly “yes” and that the machine learning framework of reinforcement learning is particularly appropriate for bringing learning together with what in animals one would call motivation. Despite the common perception that a reinforcement learning agent’s reward has to be extrinsic because the agent has a distinct input channel for reward signals, reinforcement learning provides a natural framework for incorporating principles of intrinsic motivation.
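One common way to realise this inside the standard reinforcement learning loop is to add an internally generated bonus, such as a novelty or prediction-error signal, to the extrinsic reward. The count-based bonus below is a generic illustration of that pattern, not the chapter's specific proposal.

```python
# Generic sketch: augmenting extrinsic reward with an intrinsic, novelty-based bonus.
from collections import defaultdict

visit_counts = defaultdict(int)

def intrinsic_bonus(state, beta=0.5):
    """Count-based novelty bonus: decays as a state becomes familiar."""
    visit_counts[state] += 1
    return beta / visit_counts[state] ** 0.5

def total_reward(state, extrinsic_reward):
    return extrinsic_reward + intrinsic_bonus(state)

# Example: an unvisited state earns a larger learning signal than a familiar one.
print(total_reward("s_new", 0.0), total_reward("s_new", 0.0), total_reward("s_new", 0.0))
```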

Journal ArticleDOI
05 Jul 2013 - Robotics
TL;DR: A summary of the state of the art of reinforcement learning in the context of robotics is given, covering both algorithms and policy representations.

Proceedings Article
05 Dec 2013
TL;DR: In this paper, the authors proposed posterior sampling for reinforcement learning (PSRL), which updates a prior distribution over Markov decision processes and takes one sample from this posterior, then follows the policy that is optimal for this sample during the episode.
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√AT) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.

Journal ArticleDOI
15 Sep 2013 - Energy
TL;DR: In this article, a 2-step-ahead reinforcement learning algorithm is proposed to plan the battery scheduling in a microgrid for energy distribution, with a local consumer, a renewable generator (wind turbine) and a storage facility (battery) connected to the external grid via a transformer.

Proceedings ArticleDOI
16 Apr 2013
TL;DR: The Chebyshev scalarization method overcomes the flaws of the linear scalarization function, as it can discover Pareto optimal solutions regardless of the shape of the front, i.e. convex as well as non-convex.
Abstract: In multi-objective problems, it is key to find compromising solutions that balance different objectives. The linear scalarization function is often utilized to translate the multi-objective nature of a problem into a standard, single-objective problem. Generally, it is noted that such a linear combination can only find solutions in convex areas of the Pareto front, therefore making the method inapplicable in situations where the shape of the front is not known beforehand, as is often the case. We propose a non-linear scalarization function, called the Chebyshev scalarization function, as a basis for action selection strategies in multi-objective reinforcement learning. The Chebyshev scalarization method overcomes the flaws of the linear scalarization function as it (i) can discover Pareto optimal solutions regardless of the shape of the front, i.e. convex as well as non-convex, (ii) obtains a better spread amongst the set of Pareto optimal solutions and (iii) is not particularly dependent on the actual weights used.
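Concretely, the Chebyshev scalarization scores an action by its weighted worst-case distance to a utopian reference point z*, and the greedy choice minimises that score. The sketch below shows scalarized action selection over a multi-objective Q-table; the Q-values, weights, and reference point are illustrative inputs.

```python
# Chebyshev-scalarized action selection for multi-objective Q-values.
import numpy as np

def chebyshev_scalarize(q_values, weights, z_star):
    """
    q_values: (n_actions, n_objectives) Q-vector per action
    weights:  (n_objectives,) preference weights
    z_star:   (n_objectives,) utopian reference point (slightly above the best observed values)
    Returns the scalarized score per action (smaller is better).
    """
    return np.max(weights * np.abs(q_values - z_star), axis=1)

def greedy_action(q_values, weights, z_star):
    return int(np.argmin(chebyshev_scalarize(q_values, weights, z_star)))

# Example with two objectives and three actions (numbers are illustrative); the balanced
# compromise action is selected.
q = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
print(greedy_action(q, weights=np.array([0.5, 0.5]), z_star=np.array([1.0, 1.0])))
```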

Proceedings ArticleDOI
06 Jul 2013
TL;DR: This paper scales up the authors' compressed network encoding, in which network weight matrices are represented indirectly as a set of Fourier-type coefficients, to tasks that require very large networks due to the high dimensionality of their input space.
Abstract: The idea of using evolutionary computation to train artificial neural networks, or neuroevolution (NE), for reinforcement learning (RL) tasks has now been around for over 20 years. However, as RL tasks become more challenging, the networks required become larger, as do their genomes. But scaling NE to large nets (i.e. tens of thousands of weights) is infeasible using direct encodings that map genes one-to-one to network components. In this paper, we scale up our compressed network encoding, in which network weight matrices are represented indirectly as a set of Fourier-type coefficients, to tasks that require very large networks due to the high dimensionality of their input space. The approach is demonstrated successfully on two reinforcement learning tasks in which the control networks receive visual input: (1) a vision-based version of the octopus control task requiring networks with over 3 thousand weights, and (2) a version of the TORCS driving game where networks with over 1 million weights are evolved to drive a car around a track using video images from the driver's perspective.
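The indirect encoding amounts to storing a small band of frequency-domain coefficients and expanding them into a full weight matrix with an inverse transform. The sketch below decodes a toy genome with a 2-D inverse DCT from SciPy; the coefficient layout in the top-left corner is an illustrative choice, not necessarily the paper's exact scheme.

```python
# Sketch of decoding a compressed genome (a few Fourier-type coefficients) into a large
# weight matrix via an inverse 2-D DCT; coefficient placement is an illustrative choice.
import numpy as np
from scipy.fft import idctn

def decode_weights(genome, shape):
    """Place low-frequency coefficients in the top-left corner and inverse-transform."""
    coeffs = np.zeros(shape)
    k = int(np.ceil(np.sqrt(len(genome))))           # fill a small k x k corner block
    block = np.zeros(k * k)
    block[:len(genome)] = genome
    coeffs[:k, :k] = block.reshape(k, k)
    return idctn(coeffs, norm="ortho")

genome = np.random.default_rng(0).standard_normal(16)    # 16 genes ...
W = decode_weights(genome, shape=(100, 200))              # ... expanded to 20,000 weights
print(W.shape)
```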

Journal ArticleDOI
TL;DR: A new adaptive dynamic programming approach is presented that integrates a reference network providing an internal goal representation to help the system's learning and optimization, offering an alternative to crafting the reinforcement signal manually from prior knowledge.
Abstract: In this paper, we present a new adaptive dynamic programming approach by integrating a reference network that provides an internal goal representation to help the system's learning and optimization. Specifically, we build the reference network on top of the critic network to form a dual critic network design that contains the detailed internal goal representation to help approximate the value function. This internal goal signal, working as the reinforcement signal for the critic network in our design, is adaptively generated by the reference network and can also be adjusted automatically. In this way, we provide an alternative choice rather than crafting the reinforcement signal manually from prior knowledge. In this paper, we adopt the online action-dependent heuristic dynamic programming (ADHDP) design and provide the detailed design of the dual critic network structure. Detailed Lyapunov stability analysis for our proposed approach is presented to support the proposed structure from a theoretical point of view. Furthermore, we also develop a virtual reality platform to demonstrate the real-time simulation of our approach under different disturbance situations. The overall adaptive learning performance has been tested on two tracking control benchmarks with a tracking filter. For comparative studies, we also present the tracking performance with the typical ADHDP, and the simulation results justify the improved performance with our approach.

Journal ArticleDOI
TL;DR: In simulations, this model can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance, and the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
Abstract: Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
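In this continuous-time TD setting, the critic's prediction error combines the instantaneous reward with the value estimate and its time derivative. A simple Euler-discretized version of that error, driving both critic and actor updates for a linear function approximator, is sketched below; the feature map, toy dynamics, and learning rates are illustrative, and no spiking neurons are modelled.

```python
# Euler-discretized continuous-time TD error and actor-critic updates with linear features.
# Feature map, learning rates, and Gaussian-noise exploration are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
tau, dt, alpha_v, alpha_pi = 1.0, 0.05, 0.05, 0.01

def features(x):
    return np.array([1.0, x, x * x])         # toy feature vector for a 1-D state

w_v = np.zeros(3)                             # critic weights: V(x) = w_v . phi(x)
w_pi = np.zeros(3)                            # actor weights:  mean action = w_pi . phi(x)

x = 0.5
for t in range(2000):
    phi = features(x)
    noise = rng.standard_normal()             # exploration around the actor's mean action
    u = w_pi @ phi + 0.3 * noise
    x_next = x + dt * (-x + u)                # toy 1-D dynamics
    r = -(x_next ** 2)                        # reward: stay near the origin

    # Continuous-time TD error, Euler-discretized:
    # delta = r - V(x)/tau + (V(x_next) - V(x)) / dt
    v, v_next = w_v @ phi, w_v @ features(x_next)
    delta = r - v / tau + (v_next - v) / dt

    w_v += alpha_v * delta * phi              # critic update
    w_pi += alpha_pi * delta * noise * phi    # actor update, modulated by the same TD signal
    x = x_next
```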

Journal ArticleDOI
TL;DR: The authors argue for a change in the conceptual framework within which neuroscientists approach the study of learning mechanisms in the brain, one in which learning is mediated by computations that make implicit commitments to physical and mathematical principles governing the domains where domain-specific cognitive mechanisms operate.
Abstract: From the traditional perspective of associative learning theory, the hypothesis linking modifications of synaptic transmission to learning and memory is plausible. It is less so from an information-processing perspective, in which learning is mediated by computations that make implicit commitments to physical and mathematical principles governing the domains where domain-specific cognitive mechanisms operate. We compare the properties of associative learning and memory to the properties of long-term potentiation, concluding that the properties of the latter do not explain the fundamental properties of the former. We briefly review the neuroscience of reinforcement learning, emphasizing the representational implications of the neuroscientific findings. We then review more extensively findings that confirm the existence of complex computations in three information-processing domains: probabilistic inference, the representation of uncertainty, and the representation of space. We argue for a change in the conceptual framework within which neuroscientists approach the study of learning mechanisms in the brain.

Proceedings ArticleDOI
03 Mar 2013
TL;DR: The hypothesis that effective and fluent human-robot teaming may be best achieved by modeling effective practices for human teamwork is supported.
Abstract: We design and evaluate human-robot cross-training, a strategy widely used and validated for effective human team training. Cross-training is an interactive planning method in which a human and a robot iteratively switch roles to learn a shared plan for a collaborative task. We first present a computational formulation of the robot's interrole knowledge and show that it is quantitatively comparable to the human mental model. Based on this encoding, we formulate human-robot cross-training and evaluate it in human subject experiments (n = 36). We compare human-robot cross-training to standard reinforcement learning techniques, and show that cross-training provides statistically significant improvements in quantitative team performance measures. Additionally, significant differences emerge in the perceived robot performance and human trust. These results support the hypothesis that effective and fluent human-robot teaming may be best achieved by modeling effective practices for human teamwork.