
Showing papers on "Reinforcement learning published in 2010"


Posted Content
TL;DR: Bayesian optimization, as presented in this tutorial, employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to obtain a posterior; this permits a utility-based selection of the next observation to make on the objective function, which must balance exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation).
Abstract: We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). We also present two detailed extensions of Bayesian optimization, with experiments—active user modelling with preferences, and hierarchical reinforcement learning—and a discussion of the pros and cons of Bayesian optimization based on our experiences.
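
As a rough illustration of the loop the abstract describes (a minimal sketch, not the authors' code), the following Python fragment maintains a Gaussian-process posterior over a toy objective and picks the next observation by expected improvement; the objective, kernel settings, and constants are invented for illustration.

# Hypothetical sketch of Bayesian optimization: GP posterior + expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                      # expensive black-box function (toy stand-in)
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

def expected_improvement(x_cand, gp, y_best, xi=0.01):
    mu, sigma = gp.predict(x_cand.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma     # we are maximizing the objective
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=3)          # small initial design
y = objective(X)
grid = np.linspace(0, 1, 500)

for _ in range(15):                    # sequential design loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6)
    gp.fit(X.reshape(-1, 1), y)        # posterior = prior combined with evidence
    ei = expected_improvement(grid, gp, y.max())
    x_next = grid[np.argmax(ei)]       # utility-based choice of the next observation
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print("best x found:", X[np.argmax(y)], "value:", y.max())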

1,425 citations


Book
25 Jun 2010
TL;DR: This book focuses on those reinforcement learning algorithms that build on the powerful theory of dynamic programming; it gives a fairly comprehensive catalog of learning problems, describes the core ideas, and discusses their theoretical properties and limitations.
Abstract: Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state-of-the-art algorithms, followed by a discussion of their theoretical properties and limitations.
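
A minimal sketch of the dynamic-programming core such algorithms build on: value iteration on a tiny tabular MDP. The transition probabilities, rewards, and discount factor below are made-up toy numbers, not taken from the book.

# Value iteration on a toy 2-state, 2-action MDP (all numbers invented).
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[a, s, s'] = transition probability, R[s, a] = expected immediate reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # action 0
              [[0.5, 0.5], [0.3, 0.7]]])     # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)   # one-step look-ahead values
    V_new = Q.max(axis=1)                          # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)
print("V*:", V, "greedy policy:", policy)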

1,146 citations


Journal ArticleDOI
27 May 2010-Neuron
TL;DR: Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task, the neural signature of a state prediction error (SPE) is found in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized reward prediction error (RPE) in the ventral striatum, supporting the existence of two distinct forms of learning signal in humans.

1,031 citations


Journal Article
TL;DR: For undiscounted reinforcement learning in Markov decision processes (MDPs), this paper presented a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of times, ℓ. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of O(ℓ^(1/3) T^(2/3) DS√A).
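
To make the diameter parameter concrete, here is a hedged sketch for the special case of a deterministic toy MDP, where the expected travel time between two states reduces to a shortest-path length computable by breadth-first search; the transition table is invented, and the general stochastic case requires more than BFS.

# Diameter D of a small *deterministic* MDP: max over state pairs of the BFS distance.
from collections import deque

# next_state[s][a] for 4 states and 2 actions (a made-up chain with a reset action)
next_state = [[1, 0], [2, 0], [3, 0], [3, 0]]

def shortest_steps(src, dst):
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        s, d = queue.popleft()
        for a in range(2):
            t = next_state[s][a]
            if t == dst:
                return d + 1
            if t not in seen:
                seen.add(t)
                queue.append((t, d + 1))
    return float('inf')                     # dst unreachable

n = len(next_state)
diameter = max(shortest_steps(s, t) for s in range(n) for t in range(n))
print("diameter D =", diameter)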

945 citations


Reference BookDOI
29 Apr 2010
TL;DR: Reinforcement Learning and Dynamic Programming Using Function Approximators provides a comprehensive and unparalleled exploration of the field of RL and DP, with a focus on continuous-variable problems.
Abstract: From household appliances to applications in robotics, engineered systems involving complex dynamics can only be as effective as the algorithms that control them. While Dynamic Programming (DP) has provided researchers with a way to optimally solve decision and control problems involving complex dynamic systems, its practical value was limited by algorithms that lacked the capacity to scale up to realistic problems. However, in recent years, dramatic developments in Reinforcement Learning (RL), the model-free counterpart of DP, changed our understanding of what is possible. Those developments led to the creation of reliable methods that can be applied even when a mathematical model of the system is unavailable, allowing researchers to solve challenging control problems in engineering, as well as in a variety of other disciplines, including economics, medicine, and artificial intelligence. Reinforcement Learning and Dynamic Programming Using Function Approximators provides a comprehensive and unparalleled exploration of the field of RL and DP. With a focus on continuous-variable problems, this seminal text details essential developments that have substantially altered the field over the past decade. In its pages, pioneering experts provide a concise introduction to classical RL and DP, followed by an extensive presentation of the state-of-the-art and novel methods in RL and DP with approximation. Combining algorithm development with theoretical guarantees, they elaborate on their work with illustrative examples and insightful comparisons. Three individual chapters are dedicated to representative algorithms from each of the major classes of techniques: value iteration, policy iteration, and policy search. The features and performance of these algorithms are highlighted in extensive experimental studies on a range of control applications. The recent development of applications involving complex systems has led to a surge of interest in RL and DP methods and the subsequent need for a quality resource on the subject. For graduate students and others new to the field, this book offers a thorough introduction to both the basics and emerging methods. And for those researchers and practitioners working in the fields of optimal and adaptive control, machine learning, artificial intelligence, and operations research, this resource offers a combination of practical algorithms, theoretical analysis, and comprehensive examples that they will be able to adapt and apply to their own work. Access the authors' website at www.dcsc.tudelft.nl/rlbook/ for additional material, including computer code used in the studies and information concerning new developments.

917 citations


Proceedings Article
11 Jul 2010
TL;DR: The Relative Entropy Policy Search (REPS) method is suggested; it differs significantly from previous policy gradient approaches, yields an exact update step, and works well on typical reinforcement learning benchmark problems.
Abstract: Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant policy gradients (Bagnell and Schneider 2003), many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest the Relative Entropy Policy Search (REPS) method. The resulting method differs significantly from previous policy gradient approaches and yields an exact update step. It works well on typical reinforcement learning benchmark problems.
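
A hedged sketch of the idea of bounding information loss, in the simplified bandit-style setting: the temperature of an exponential reweighting is chosen by minimizing the dual of the KL-constrained problem, and the policy is then re-fit by weighted maximum likelihood. The Gaussian "policy", toy reward, and constants are assumptions for illustration, not the paper's experiments.

# Simplified REPS-style weighting step with a KL bound epsilon.
import numpy as np
from scipy.optimize import minimize_scalar

epsilon = 0.5                              # bound on the information loss (KL divergence)
rng = np.random.default_rng(1)
mean, std = 0.0, 1.0                       # current Gaussian policy over one parameter

for it in range(20):
    theta = rng.normal(mean, std, size=200)          # sampled roll-outs
    R = -(theta - 2.0) ** 2                          # toy reward, peaked at theta = 2

    def dual(eta):                                   # dual function of the KL-constrained problem
        adv = R - R.max()                            # shift for numerical stability
        return eta * epsilon + R.max() + eta * np.log(np.mean(np.exp(adv / eta)))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e3), method='bounded').x
    w = np.exp((R - R.max()) / eta)                  # sample weights
    w /= w.sum()
    mean = np.sum(w * theta)                         # weighted maximum-likelihood update
    std = np.sqrt(np.sum(w * (theta - mean) ** 2) + 1e-6)

print("learned mean ~", round(mean, 2))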

641 citations


Book ChapterDOI
01 Jan 2010
TL;DR: This chapter reviews a representative selection of multi-agent reinforcement learning algorithms for fully cooperative, fully competitive, and more general (neither cooperative nor competitive) tasks.
Abstract: Multi-agent systems can be used to address problems in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must instead discover a solution on their own, using learning. A significant part of the research on multi-agent learning concerns reinforcement learning techniques. This chapter reviews a representative selection of multi-agent reinforcement learning algorithms for fully cooperative, fully competitive, and more general (neither cooperative nor competitive) tasks. The benefits and challenges of multi-agent reinforcement learning are described. A central challenge in the field is the formal statement of a multi-agent learning goal; this chapter reviews the learning goals proposed in the literature. The problem domains where multi-agent reinforcement learning techniques have been applied are briefly discussed. Several multi-agent reinforcement learning algorithms are applied to an illustrative example involving the coordinated transportation of an object by two cooperative robots. In an outlook for the multi-agent reinforcement learning field, a set of important open issues are identified, and promising research directions to address these issues are outlined.

548 citations


Journal Article
TL;DR: The framework of stochastic optimal control with path integrals is used to derive a novel approach to RL with parameterized policies; the resulting algorithm shows interesting similarities with previous RL research in the framework of probability matching and provides intuition for why the slightly heuristically motivated probability-matching approach can actually perform well.
Abstract: With the goal of generating more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests using the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2) currently offers one of the most efficient, numerically robust, and easy-to-implement algorithms for RL based on trajectory roll-outs.
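
A hedged, heavily simplified sketch of a PI2-style update: perturb the policy parameters with exploration noise, weight each roll-out by an exponential of its (normalized) cost, and take the probability-weighted average of the noise as the update, with no matrix inversion and no gradient learning rate. The quadratic cost and all constants are toy assumptions, not the paper's robot tasks.

# Reward-weighted (path-integral-style) parameter update on a toy cost function.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(5)                       # policy parameters (e.g., motor primitive weights)
target = np.array([1.0, -0.5, 2.0, 0.0, 0.3])
lam, noise_std, K = 1.0, 0.3, 30          # temperature, exploration noise, roll-outs per update

def rollout_cost(params):                 # stand-in for executing the policy on a system
    return np.sum((params - target) ** 2)

for it in range(100):
    eps = rng.normal(0.0, noise_std, size=(K, theta.size))   # exploration noise
    costs = np.array([rollout_cost(theta + e) for e in eps])
    s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)  # normalize costs
    P = np.exp(-s / lam)
    P /= P.sum()                          # soft-max weighting of roll-outs
    theta = theta + P @ eps               # probability-weighted parameter update

print("final cost:", rollout_cost(theta))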

520 citations


Journal ArticleDOI
TL;DR: Experimental results clearly demonstrate the advantages of multi-agent RL-based control over LQF governed isolated single-intersection control, thus paving the way for efficient distributed traffic signal control in complex settings.
Abstract: A challenging application of artificial intelligence systems involves the scheduling of traffic signals in multi-intersection vehicular networks. This paper introduces a novel use of a multi-agent system and reinforcement learning (RL) framework to obtain an efficient traffic signal control policy. The latter is aimed at minimising the average delay, congestion and likelihood of intersection cross-blocking. A five-intersection traffic network has been studied in which each intersection is governed by an autonomous intelligent agent. Two types of agents, a central agent and an outbound agent, were employed. The outbound agents schedule traffic signals by following the longest-queue-first (LQF) algorithm, which has been proved to guarantee stability and fairness, and collaborate with the central agent by providing it local traffic statistics. The central agent learns a value function driven by its local and neighbours' traffic conditions. The novel methodology proposed here utilises the Q-Learning algorithm with a feedforward neural network for value function approximation. Experimental results clearly demonstrate the advantages of multi-agent RL-based control over LQF governed isolated single-intersection control, thus paving the way for efficient distributed traffic signal control in complex settings.
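
A hedged sketch of the central agent's learning step described above: Q-learning with a small feed-forward network as the value-function approximator. The two-queue toy environment, network size, and learning constants are illustrative assumptions, not the paper's five-intersection PARAMICS-style simulation.

# Semi-gradient Q-learning with a one-hidden-layer network on a toy two-queue signal task.
import numpy as np

rng = np.random.default_rng(0)

def step(state, action):                            # toy traffic dynamics
    arrivals = rng.poisson(0.6, size=2)
    queues = state + arrivals
    queues[action] = max(0.0, queues[action] - 3)   # served approach discharges
    queues = np.minimum(queues, 20)
    reward = -queues.sum() / 20.0                   # negative (scaled) total queue = delay proxy
    return queues, reward

n_in, n_hid, n_act = 2, 16, 2                       # small feed-forward value network
W1 = rng.normal(0, 0.3, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.3, (n_act, n_hid)); b2 = np.zeros(n_act)

def forward(s):
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h                           # action values and hidden activations

gamma, alpha, eps = 0.9, 0.01, 0.1
state = np.zeros(2)
for t in range(20000):
    x = state / 20.0                                # scale inputs to [0, 1]
    q, h = forward(x)
    a = rng.integers(n_act) if rng.random() < eps else int(np.argmax(q))
    nxt, r = step(state, a)
    q_next, _ = forward(nxt / 20.0)
    td = r + gamma * q_next.max() - q[a]            # Q-learning temporal-difference error
    grad_h = alpha * td * W2[a] * (1 - h ** 2)      # backprop through the taken action only
    W2[a] += alpha * td * h;  b2[a] += alpha * td
    W1 += np.outer(grad_h, x); b1 += grad_h
    state = nxt

print("greedy action for queues [10, 2]:", int(np.argmax(forward(np.array([10, 2]) / 20.0)[0])))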

463 citations


Journal Article
TL;DR: An alternative way to approximate the maximum expected value for any set of random variables is introduced and the obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value.
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.

463 citations


Proceedings Article
06 Dec 2010
TL;DR: Double Q-learning as mentioned in this paper is a new off-policy reinforcement learning algorithm that uses the double estimator to approximate the maximum expected value for any set of random variables, and is shown to perform well in some settings in which Q-learning performs poorly due to its overestimation.
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
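
A minimal tabular sketch of the update described above: two value tables are kept, one selects the argmax action and the other evaluates it, which removes the positive bias of taking the max of a single noisy estimator. The toy environment and constants are invented for illustration.

# Tabular Double Q-learning on a toy cyclic environment with noisy rewards.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 3, 4, 0.95, 0.1, 0.1
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))

def step(s, a):
    # all actions have zero mean reward but high noise: the setting where a single
    # max-estimator overestimates action values
    return (s + 1) % n_states, rng.normal(0.0, 1.0)

s = 0
for t in range(50000):
    q = QA[s] + QB[s]                        # act greedily w.r.t. the sum of both tables
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q))
    s_next, r = step(s, a)
    if rng.random() < 0.5:                   # update A, evaluate the argmax with B
        a_star = int(np.argmax(QA[s_next]))
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                    # update B, evaluate the argmax with A
        b_star = int(np.argmax(QB[s_next]))
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
    s = s_next

print("QA+QB for state 0:", (QA + QB)[0])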

Journal ArticleDOI
Steven L. Scott
TL;DR: A heuristic for managing multi-armed bandits called randomized probability matching is described, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal.
Abstract: A multi-armed bandit is an experiment with the goal of accumulating rewards from a payoff distribution with unknown parameters that are to be learned sequentially. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods that are ‘optimal’ in simpler contexts. I summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.
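
A minimal sketch of randomized probability matching for a Bernoulli bandit: each arm keeps a Beta posterior, one draw is taken from each posterior, and the arm with the largest draw is played, so arms are chosen approximately in proportion to their posterior probability of being optimal. The arm payoff probabilities below are invented.

# Randomized probability matching (Thompson sampling) with Beta-Bernoulli posteriors.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.25, 0.45, 0.60])            # unknown payoff probabilities
alpha = np.ones(3); beta = np.ones(3)            # Beta(1, 1) priors per arm

pulls = np.zeros(3, dtype=int)
for t in range(5000):
    samples = rng.beta(alpha, beta)              # one posterior draw per arm
    arm = int(np.argmax(samples))                # probability-matching allocation
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward                         # conjugate posterior update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)                   # most pulls should concentrate on arm 2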

Journal ArticleDOI
TL;DR: A new optimal reward framework is defined that captures the pressure to design good primary reward functions that lead to evolutionary success across environments and shows that optimal primary reward signals may yield both emergent intrinsic and extrinsic motivation.
Abstract: There is great interest in building intrinsic motivation into artificial systems using the reinforcement learning framework. Yet, what intrinsic motivation may mean computationally, and how it may differ from extrinsic motivation, remains a murky and controversial subject. In this paper, we adopt an evolutionary perspective and define a new optimal reward framework that captures the pressure to design good primary reward functions that lead to evolutionary success across environments. The results of two computational experiments show that optimal primary reward signals may yield both emergent intrinsic and extrinsic motivation. The evolutionary perspective and the associated optimal reward framework thus lead to the conclusion that there are no hard and fast features distinguishing intrinsic and extrinsic reward computationally. Rather, the directness of the relationship between rewarding behavior and evolutionary success varies along a continuum.

Proceedings ArticleDOI
18 Jul 2010
TL;DR: A framework for combining the training of deep auto-encoders (for learning compact feature spaces) with recently-proposed batch-mode RL algorithms (for learning policies) is proposed, and an emphasis is put on the data-efficiency and on studying the properties of the feature spaces automatically constructed by the deep auto-encoder neural networks.
Abstract: This paper discusses the effectiveness of deep auto-encoder neural networks in visual reinforcement learning (RL) tasks. We propose a framework for combining the training of deep auto-encoders (for learning compact feature spaces) with recently-proposed batch-mode RL algorithms (for learning policies). An emphasis is put on the data-efficiency of this combination and on studying the properties of the feature spaces automatically constructed by the deep auto-encoders. These feature spaces are empirically shown to adequately capture similarities and spatial relations between observations and to allow useful policies to be learned. We propose several methods for improving the topology of the feature spaces by making use of task-dependent information. Finally, we present first results on successfully learning good control policies directly on synthesized and real images.

Journal ArticleDOI
TL;DR: It is shown that online Bayesian inference within a model that assumes an unbounded number of latent causes can characterize a diverse set of behavioral results from such manipulations, some of which pose problems for the model of Redish et al. (2007).
Abstract: A. Redish et al. (2007) proposed a reinforcement learning model of context-dependent learning and extinction in conditioning experiments, using the idea of "state classification" to categorize new observations into states. In the current article, the authors propose an interpretation of this idea in terms of normative statistical inference. They focus on renewal and latent inhibition, 2 conditioning paradigms in which contextual manipulations have been studied extensively, and show that online Bayesian inference within a model that assumes an unbounded number of latent causes can characterize a diverse set of behavioral results from such manipulations, some of which pose problems for the model of Redish et al. Moreover, in both paradigms, context dependence is absent in younger animals, or if hippocampal lesions are made prior to training. The authors suggest an explanation in terms of a restricted capacity to infer new causes.

Journal ArticleDOI
29 Apr 2010-Neuron
TL;DR: This work uses a reinforcement learning paradigm to demonstrate that more anterior regions along the rostro-caudal axis of frontal cortex support rule learning at higher levels of abstraction, and indicates that when humans confront new rule learning problems, this rostro-caudal division of labor supports the search for relationships between context and action at multiple levels of abstraction simultaneously.

Journal ArticleDOI
TL;DR: A neural model of action selection and decision making based on the theory of partially observable Markov decision processes (POMDPs) is proposed and suggests an important role for interactions between the neocortex and the basal ganglia in learning the mapping between probabilistic sensory representations and actions that maximize rewards.
Abstract: A fundamental problem faced by animals is learning to select actions based on noisy sensory information and incomplete knowledge of the world. It has been suggested that the brain engages in Bayesian inference during perception but how such probabilistic representations are used to select actions has remained unclear. Here we propose a neural model of action selection and decision making based on the theory of partially observable Markov decision processes (POMDPs). Actions are selected based not on a single “optimal” estimate of state but on the posterior distribution over states (the “belief” state). We show how such a model provides a unified framework for explaining experimental results in decision making that involve both information gathering and overt actions. The model utilizes temporal difference (TD) learning for maximizing expected reward. The resulting neural architecture posits an active role for the neocortex in belief computation while ascribing a role to the basal ganglia in belief representation, value computation, and action selection. When applied to the random dots motion discrimination task, model neurons representing belief exhibit responses similar to those of LIP neurons in primate neocortex. The appropriate threshold for switching from information gathering to overt actions emerges naturally during reward maximization. Additionally, the time course of reward prediction error in the model shares similarities with dopaminergic responses in the basal ganglia during the random dots task. For tasks with a deadline, the model learns a decision making strategy that changes with elapsed time, predicting a collapsing decision threshold consistent with some experimental studies. The model provides a new framework for understanding neural decision making and suggests an important role for interactions between the neocortex and the basal ganglia in learning the mapping between probabilistic sensory representations and actions that maximize rewards.
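
A hedged sketch of the two ingredients the abstract combines: a Bayesian belief update over hidden states and TD(0) learning of a value function that is linear in the belief. The two-state "dots" world, observation model, and learning constants are toy assumptions, not the paper's LIP/basal-ganglia model.

# Belief-state update plus TD(0) value learning on the belief, for a toy evidence task.
import numpy as np

rng = np.random.default_rng(0)
# hidden state: 0 = "dots move left", 1 = "dots move right" (fixed within a trial)
obs_model = np.array([[0.6, 0.4],      # P(observation | state), rows = states
                      [0.4, 0.6]])
gamma, alpha = 0.95, 0.05
w = np.zeros(2)                        # value weights: V(b) = w . b

for trial in range(5000):
    true_s = rng.integers(2)
    b = np.array([0.5, 0.5])           # uniform prior belief
    for t in range(10):                # "sample more evidence" steps
        o = rng.random() < obs_model[true_s, 1]          # noisy observation
        lik = obs_model[:, int(o)]
        b_next = lik * b
        b_next /= b_next.sum()                           # Bayesian belief update
        # TD(0) update of the belief-state value function (reward arrives only at the end)
        w += alpha * (0.0 + gamma * w @ b_next - w @ b) * b
        b = b_next
    choice = int(np.argmax(b))                           # commit after gathering evidence
    r = 1.0 if choice == true_s else 0.0
    w += alpha * (r - w @ b) * b                         # terminal TD update

print("value weights over belief:", w)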

Proceedings ArticleDOI
03 May 2010
TL;DR: This paper derives a novel approach to RL for parameterized control policies based on the framework of stochastic optimal control with path integrals, and argues that this new algorithm, Policy Improvement with Path Integrals (PI2), currently offers one of the most efficient, numerically robust, and easy-to-implement algorithms for RL in robotics.
Abstract: Reinforcement learning (RL) is one of the most general approaches to learning control. Its applicability to complex motor systems, however, has been largely impossible so far due to the computational difficulties that reinforcement learning encounters in high dimensional continuous state-action spaces. In this paper, we derive a novel approach to RL for parameterized control policies based on the framework of stochastic optimal control with path integrals. While solidly grounded in optimal control theory and estimation theory, the update equations for learning are surprisingly simple and have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a robot dog illustrates the functionality of our algorithm in a real-world scenario. We believe that our new algorithm, Policy Improvement with Path Integrals (PI2), offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL in robotics.

Proceedings ArticleDOI
03 Dec 2010
TL;DR: An approach allowing a robot to acquire new motor skills by learning the couplings across motor control variables through Expectation-Maximization based Reinforcement Learning is presented.
Abstract: We present an approach allowing a robot to acquire new motor skills by learning the couplings across motor control variables. The demonstrated skill is first encoded in a compact form through a modified version of Dynamic Movement Primitives (DMP) which encapsulates correlation information. Expectation-Maximization based Reinforcement Learning is then used to modulate the mixture of dynamical systems initialized from the user's demonstration. The approach is evaluated on a torque-controlled 7 DOFs Barrett WAM robotic arm. Two skill learning experiments are conducted: a reaching task where the robot needs to adapt the learned movement to avoid an obstacle, and a dynamic pancake-flipping task.

Journal ArticleDOI
TL;DR: This method estimates a likelihood gradient by sampling directly in parameter space, which leads to lower-variance gradient estimates than those obtained by regular policy gradient methods, and the improvement is shown to be largest when the parameter samples are drawn symmetrically.
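
A hedged sketch of the symmetric-sampling idea: a Gaussian over policy parameters is maintained, each perturbation is evaluated at mu+eps and mu-eps, and the reward difference drives the update, reducing the variance of the gradient estimate. The quadratic stand-in for an episode return and the step sizes are invented assumptions.

# Parameter-space exploration with symmetric samples (PGPE-style), on a toy return.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.5, -1.0, 2.0])

def episode_return(theta):                 # stand-in for running the policy for one episode
    return -np.sum((theta - target) ** 2)

mu = np.zeros(3)
sigma = np.ones(3)
alpha_mu, alpha_sigma, baseline = 0.05, 0.02, None

for it in range(2000):
    eps = rng.normal(0.0, sigma)
    r_plus, r_minus = episode_return(mu + eps), episode_return(mu - eps)
    r_mean = 0.5 * (r_plus + r_minus)
    baseline = r_mean if baseline is None else 0.99 * baseline + 0.01 * r_mean
    mu += alpha_mu * 0.5 * (r_plus - r_minus) * eps              # symmetric-difference step
    sigma += alpha_sigma * (r_mean - baseline) * (eps**2 - sigma**2) / sigma
    sigma = np.clip(sigma, 1e-3, 5.0)                            # keep exploration noise sane

print("learned parameters:", np.round(mu, 2))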

Journal ArticleDOI
TL;DR: This article surveys an emerging literature on 'structure learning' (using experience to infer the structure of a task) and how it can be of service to RL, with an emphasis on structure in perception and action.

Journal ArticleDOI
TL;DR: It was found that information and experience have a combined effect on drivers' route-choice behavior; informed participants were more prone to risk-seeking and had greater sensitivity to travel time variability, while non-informed participants appeared to be more risk-averse and less sensitive to variability.
Abstract: This paper presents a learning-based model of route-choice behavior when information is provided in real time. In a laboratory controlled experiment, participants made a long series of binary route-choice trials relying on real-time information and learning from their personal experience reinforced through feedback. A discrete choice model with a Mixed Logit specification, accounting for panel effects, was estimated based on the experiment’s data. It was found that information and experience have a combined effect on drivers’ route-choice behavior. Informed participants had faster learning rates and tended to base their decisions on memorization relating to previous outcomes whereas non-informed participants were slower in learning, required more exploration and tended to rely mostly on recent outcomes. Informed participants were more prone to risk-seeking and had greater sensitivity to travel time variability. In comparison, non-informed participants appeared to be more risk-averse and less sensitive to variability. These results have important policy implications on the design and implementation of ATIS initiatives. The advantage of incorporating insights from Prospect Theory and reinforced learning to improve the realism of travel behavior models is also discussed.

Proceedings ArticleDOI
10 May 2010
TL;DR: The fast learning exhibited within the tamer framework is leveraged to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent.
Abstract: As learning agents move from research labs to the real world, it is increasingly important that human users, including those without programming skills, be able to teach agents desired behaviors. Recently, the tamer framework was introduced for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals. Past work on tamer showed that shaping can greatly reduce the sample complexity required to learn a good policy, can enable lay users to teach agents the behaviors they desire, and can allow agents to learn within a Markov Decision Process (MDP) in the absence of a coded reward function. However, tamer does not allow this human training to be combined with autonomous learning based on such a coded reward function. This paper leverages the fast learning exhibited within the tamer framework to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent. We tested eight plausible tamer+rl methods for combining a previously learned human reinforcement function, H, with MDP reward in a reinforcement learning algorithm. This paper identifies which of these methods are most effective and analyzes their strengths and weaknesses. Results from these tamer+rl algorithms indicate better final performance and better cumulative performance than either a tamer agent or an RL agent alone.
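
A hedged sketch of one simple combination scheme in the spirit of those tested above: tabular Q-learning whose learning signal is the MDP reward plus a weighted, previously learned human-reinforcement model H(s, a), with the human influence annealed over episodes. The corridor task, the hand-coded H, and the weight schedule are invented for illustration and are not one of the paper's exact eight variants.

# Q-learning on a toy corridor, with the MDP reward augmented by a learned human model H.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 2                       # corridor: action 0 = left, 1 = right
goal = n_states - 1

# pretend H was learned earlier from human feedback and prefers moving right
H = np.zeros((n_states, n_actions)); H[:, 1] = 1.0

Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.95, 0.2, 0.1

for episode in range(300):
    w = max(0.0, 1.0 - episode / 200)             # anneal the human influence over time
    s = 0
    while s != goal:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = min(goal, s + 1) if a == 1 else max(0, s - 1)
        r_mdp = 1.0 if s_next == goal else 0.0
        r = r_mdp + w * H[s, a]                    # combined MDP-plus-human learning signal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy actions along the corridor:", np.argmax(Q, axis=1))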

Journal ArticleDOI
TL;DR: A form of real-time multiagent reinforcement learning, which is known as decentralized Q-learning, is proposed to manage the aggregated interference generated by multiple WRAN systems.
Abstract: This paper deals with the problem of aggregated interference generated by multiple cognitive radios (CRs) at the receivers of primary (licensed) users. In particular, we consider a secondary CR system based on the IEEE 802.22 standard for wireless regional area networks (WRANs), and we model it as a multiagent system where the multiple agents are the different secondary base stations in charge of controlling the secondary cells. We propose a form of real-time multiagent reinforcement learning, which is known as decentralized Q-learning, to manage the aggregated interference generated by multiple WRAN systems. We consider both situations of complete and partial information about the environment. By directly interacting with the surrounding environment in a distributed fashion, the multiagent system is able to learn, in the first case, an efficient policy to solve the problem and, in the second case, a reasonably good suboptimal policy. Computational and memory requirement considerations are also presented, discussing two different options for uploading and processing the learning information. Simulation results, which are presented for both the upstream and downstream cases, reveal that the proposed approach is able to fulfill the primary-user interference constraints, without introducing signaling overhead in the system.

Book ChapterDOI
Michel Tokic
21 Sep 2010
TL;DR: Preliminary results indicate that VDBE seems to be more parameter-robust than commonly used ad hoc approaches such as ε-greedy or softmax.
Abstract: This paper presents "Value-Difference Based Exploration" (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the exploration parameter of e-greedy in dependence of the temporal-difference error observed from value-function backups, which is considered as a measure of the agent's uncertainty about the environment. VDBE is evaluated on a multi-armed bandit task, which allows for insight into the behavior of the method. Preliminary results indicate that VDBE seems to be more parameter robust than commonly used ad hoc approaches such as e-greedy or softmax.

15 Jan 2010
TL;DR: It is shown that two new motor skills, i.e., Ball-in-a-Cup and Ball-Paddling, can be learned on a real Barrett WAM robot arm at a pace similar to human learning while achieving a significantly more reliable final performance.
Abstract: The acquisition and self-improvement of novel motor skills is among the most important problems in robotics. Motor primitives offer one of the most promising frameworks for the application of machine learning techniques in this context. Employing an improved form of the dynamic systems motor primitives originally introduced by Ijspeert et al. [2], we show how both discrete and rhythmic tasks can be learned using a concerted approach of both imitation and reinforcement learning. For doing so, we present both learning algorithms and representations targeted for the practical application in robotics. Furthermore, we show that it is possible to include a start-up phase in rhythmic primitives. We show that two new motor skills, i.e., Ball-in-a-Cup and Ball-Paddling, can be learned on a real Barrett WAM robot arm at a pace similar to human learning while achieving a significantly more reliable final performance.

Book
22 Nov 2010
TL;DR: First, PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available, is introduced; second, principled algorithms for robust filtering and smoothing in GP dynamic systems are proposed.
Abstract: This book examines Gaussian processes in both model-based reinforcement learning (RL) and inference in nonlinear dynamic systems. First, we introduce PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available. PILCO takes model uncertainties consistently into account during long-term planning to reduce model bias. Second, we propose principled algorithms for robust filtering and smoothing in GP dynamic systems.


Journal ArticleDOI
TL;DR: Experimental results showed that the reinforcement-learning-based method using the vehicle dynamic parameters feature outperforms the other algorithms, and that adding the other two features further improves prediction accuracy.

Journal ArticleDOI
TL;DR: The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvement in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), cooperative ensemble (CE) and actuated control.
Abstract: This study presents a distributed multi-agent-based traffic signal control for optimising green timing in an urban arterial road network to reduce the total travel time and delay experienced by vehicles. The proposed multi-agent architecture uses traffic data collected by sensors at each intersection, stored historical traffic patterns and data communicated from agents in adjacent intersections to compute the green time for a phase. Parameters such as the weights and threshold values used in computing the green time are fine-tuned by online reinforcement learning with the objective of reducing overall delay. PARAMICS software was used as a platform to simulate 29 signalised intersections in the Central Business District of Singapore and to test the performance of the proposed multi-agent traffic signal control for different traffic scenarios. The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvement in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), cooperative ensemble (CE) and actuated control.