
Showing papers on "Reinforcement learning published in 2010"


Posted Content
TL;DR: Bayesian optimization, as presented in this tutorial, employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to obtain a posterior; this permits a utility-based selection of the next observation to make on the objective function, which must balance exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation).
Abstract: We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). We also present two detailed extensions of Bayesian optimization, with experiments—active user modelling with preferences, and hierarchical reinforcement learning—and a discussion of the pros and cons of Bayesian optimization based on our experiences.
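
As a rough illustration of the loop the abstract describes (a minimal sketch, not the authors' code), the following Python fragment maintains a Gaussian-process posterior over a toy objective and picks the next observation by expected improvement; the objective, kernel settings, and constants are invented for illustration.

# Hypothetical sketch of Bayesian optimization: GP posterior + expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                      # expensive black-box function (toy stand-in)
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

def expected_improvement(x_cand, gp, y_best, xi=0.01):
    mu, sigma = gp.predict(x_cand.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma     # we are maximizing the objective
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=3)          # small initial design
y = objective(X)
grid = np.linspace(0, 1, 500)

for _ in range(15):                    # sequential design loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6)
    gp.fit(X.reshape(-1, 1), y)        # posterior = prior combined with evidence
    ei = expected_improvement(grid, gp, y.max())
    x_next = grid[np.argmax(ei)]       # utility-based choice of the next observation
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print("best x found:", X[np.argmax(y)], "value:", y.max())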

1,425 citations


Book
25 Jun 2010
TL;DR: This book focuses on those reinforcement learning algorithms that build on the powerful theory of dynamic programming; it gives a fairly comprehensive catalog of learning problems, describes the core ideas, and discusses their theoretical properties and limitations.
Abstract: Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state-of-the-art algorithms, followed by a discussion of their theoretical properties and limitations.
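
A minimal sketch of the dynamic-programming core such algorithms build on: value iteration on a tiny tabular MDP. The transition probabilities, rewards, and discount factor below are made-up toy numbers, not taken from the book.

# Value iteration on a toy 2-state, 2-action MDP (all numbers invented).
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[a, s, s'] = transition probability, R[s, a] = expected immediate reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # action 0
              [[0.5, 0.5], [0.3, 0.7]]])     # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)   # one-step look-ahead values
    V_new = Q.max(axis=1)                          # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)
print("V*:", V, "greedy policy:", policy)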

1,146 citations


Journal ArticleDOI
27 May 2010-Neuron
TL;DR: Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task, the neural signature of a state prediction error (SPE) is found in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized reward prediction error (RPE) in the ventral striatum, supporting the existence of two distinct forms of learning signal in humans.

1,031 citations


Journal Article
TL;DR: For undiscounted reinforcement learning in Markov decision processes (MDPs), this paper presented a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of times, ℓ. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of O(ℓ^(1/3) T^(2/3) DS√A).
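
To make the diameter parameter concrete, here is a hedged sketch for the special case of a deterministic toy MDP, where the expected travel time between two states reduces to a shortest-path length computable by breadth-first search; the transition table is invented, and the general stochastic case requires more than BFS.

# Diameter D of a small *deterministic* MDP: max over state pairs of the BFS distance.
from collections import deque

# next_state[s][a] for 4 states and 2 actions (a made-up chain with a reset action)
next_state = [[1, 0], [2, 0], [3, 0], [3, 0]]

def shortest_steps(src, dst):
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        s, d = queue.popleft()
        for a in range(2):
            t = next_state[s][a]
            if t == dst:
                return d + 1
            if t not in seen:
                seen.add(t)
                queue.append((t, d + 1))
    return float('inf')                     # dst unreachable

n = len(next_state)
diameter = max(shortest_steps(s, t) for s in range(n) for t in range(n))
print("diameter D =", diameter)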

945 citations


Reference BookDOI
29 Apr 2010
TL;DR: Reinforcement Learning and Dynamic Programming Using Function Approximators provides a comprehensive and unparalleled exploration of the field of RL and DP, with a focus on continuous-variable problems.
Abstract: From household appliances to applications in robotics, engineered systems involving complex dynamics can only be as effective as the algorithms that control them. While Dynamic Programming (DP) has provided researchers with a way to optimally solve decision and control problems involving complex dynamic systems, its practical value was limited by algorithms that lacked the capacity to scale up to realistic problems. However, in recent years, dramatic developments in Reinforcement Learning (RL), the model-free counterpart of DP, changed our understanding of what is possible. Those developments led to the creation of reliable methods that can be applied even when a mathematical model of the system is unavailable, allowing researchers to solve challenging control problems in engineering, as well as in a variety of other disciplines, including economics, medicine, and artificial intelligence. Reinforcement Learning and Dynamic Programming Using Function Approximators provides a comprehensive and unparalleled exploration of the field of RL and DP. With a focus on continuous-variable problems, this seminal text details essential developments that have substantially altered the field over the past decade. In its pages, pioneering experts provide a concise introduction to classical RL and DP, followed by an extensive presentation of the state-of-the-art and novel methods in RL and DP with approximation. Combining algorithm development with theoretical guarantees, they elaborate on their work with illustrative examples and insightful comparisons. Three individual chapters are dedicated to representative algorithms from each of the major classes of techniques: value iteration, policy iteration, and policy search. The features and performance of these algorithms are highlighted in extensive experimental studies on a range of control applications. The recent development of applications involving complex systems has led to a surge of interest in RL and DP methods and the subsequent need for a quality resource on the subject. For graduate students and others new to the field, this book offers a thorough introduction to both the basics and emerging methods. And for those researchers and practitioners working in the fields of optimal and adaptive control, machine learning, artificial intelligence, and operations research, this resource offers a combination of practical algorithms, theoretical analysis, and comprehensive examples that they will be able to adapt and apply to their own work. Access the authors' website at www.dcsc.tudelft.nl/rlbook/ for additional material, including computer code used in the studies and information concerning new developments.

917 citations


Proceedings Article
11 Jul 2010
TL;DR: The Relative Entropy Policy Search (REPS) method is suggested; it differs significantly from previous policy gradient approaches, yields an exact update step, and works well on typical reinforcement learning benchmark problems.
Abstract: Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant policy gradients (Bagnell and Schneider 2003), many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest the Relative Entropy Policy Search (REPS) method. The resulting method differs significantly from previous policy gradient approaches and yields an exact update step. It works well on typical reinforcement learning benchmark problems.
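
A hedged sketch of the idea of bounding information loss, in the simplified bandit-style setting: the temperature of an exponential reweighting is chosen by minimizing the dual of the KL-constrained problem, and the policy is then re-fit by weighted maximum likelihood. The Gaussian "policy", toy reward, and constants are assumptions for illustration, not the paper's experiments.

# Simplified REPS-style weighting step with a KL bound epsilon.
import numpy as np
from scipy.optimize import minimize_scalar

epsilon = 0.5                              # bound on the information loss (KL divergence)
rng = np.random.default_rng(1)
mean, std = 0.0, 1.0                       # current Gaussian policy over one parameter

for it in range(20):
    theta = rng.normal(mean, std, size=200)          # sampled roll-outs
    R = -(theta - 2.0) ** 2                          # toy reward, peaked at theta = 2

    def dual(eta):                                   # dual function of the KL-constrained problem
        adv = R - R.max()                            # shift for numerical stability
        return eta * epsilon + R.max() + eta * np.log(np.mean(np.exp(adv / eta)))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e3), method='bounded').x
    w = np.exp((R - R.max()) / eta)                  # sample weights
    w /= w.sum()
    mean = np.sum(w * theta)                         # weighted maximum-likelihood update
    std = np.sqrt(np.sum(w * (theta - mean) ** 2) + 1e-6)

print("learned mean ~", round(mean, 2))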

641 citations


Book ChapterDOI
01 Jan 2010
TL;DR: This chapter reviews a representative selection of multi-agent reinforcement learning algorithms for fully cooperative, fully competitive, and more general (neither cooperative nor competitive) tasks.
Abstract: Multi-agent systems can be used to address problems in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must instead discover a solution on their own, using learning. A significant part of the research on multi-agent learning concerns reinforcement learning techniques. This chapter reviews a representative selection of multi-agent reinforcement learning algorithms for fully cooperative, fully competitive, and more general (neither cooperative nor competitive) tasks. The benefits and challenges of multi-agent reinforcement learning are described. A central challenge in the field is the formal statement of a multi-agent learning goal; this chapter reviews the learning goals proposed in the literature. The problem domains where multi-agent reinforcement learning techniques have been applied are briefly discussed. Several multi-agent reinforcement learning algorithms are applied to an illustrative example involving the coordinated transportation of an object by two cooperative robots. In an outlook for the multi-agent reinforcement learning field, a set of important open issues are identified, and promising research directions to address these issues are outlined.

548 citations


Journal Article
TL;DR: The framework of stochastic optimal control with path integrals is used to derive a novel approach to RL with parameterized policies; the resulting algorithm shows interesting similarities with previous RL research in the framework of probability matching and provides intuition for why the slightly heuristically motivated probability-matching approach can actually perform well.
Abstract: With the goal of generating more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests using the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2) currently offers one of the most efficient, numerically robust, and easy-to-implement algorithms for RL based on trajectory roll-outs.
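
A hedged, heavily simplified sketch of a PI2-style update: perturb the policy parameters with exploration noise, weight each roll-out by an exponential of its (normalized) cost, and take the probability-weighted average of the noise as the update, with no matrix inversion and no gradient learning rate. The quadratic cost and all constants are toy assumptions, not the paper's robot tasks.

# Reward-weighted (path-integral-style) parameter update on a toy cost function.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(5)                       # policy parameters (e.g., motor primitive weights)
target = np.array([1.0, -0.5, 2.0, 0.0, 0.3])
lam, noise_std, K = 1.0, 0.3, 30          # temperature, exploration noise, roll-outs per update

def rollout_cost(params):                 # stand-in for executing the policy on a system
    return np.sum((params - target) ** 2)

for it in range(100):
    eps = rng.normal(0.0, noise_std, size=(K, theta.size))   # exploration noise
    costs = np.array([rollout_cost(theta + e) for e in eps])
    s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)  # normalize costs
    P = np.exp(-s / lam)
    P /= P.sum()                          # soft-max weighting of roll-outs
    theta = theta + P @ eps               # probability-weighted parameter update

print("final cost:", rollout_cost(theta))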

520 citations


Journal ArticleDOI
TL;DR: Experimental results clearly demonstrate the advantages of multi-agent RL-based control over LQF governed isolated single-intersection control, thus paving the way for efficient distributed traffic signal control in complex settings.
Abstract: A challenging application of artificial intelligence systems involves the scheduling of traffic signals in multi-intersection vehicular networks. This paper introduces a novel use of a multi-agent system and reinforcement learning (RL) framework to obtain an efficient traffic signal control policy. The latter is aimed at minimising the average delay, congestion and likelihood of intersection cross-blocking. A five-intersection traffic network has been studied in which each intersection is governed by an autonomous intelligent agent. Two types of agents, a central agent and an outbound agent, were employed. The outbound agents schedule traffic signals by following the longest-queue-first (LQF) algorithm, which has been proved to guarantee stability and fairness, and collaborate with the central agent by providing it local traffic statistics. The central agent learns a value function driven by its local and neighbours' traffic conditions. The novel methodology proposed here utilises the Q-Learning algorithm with a feedforward neural network for value function approximation. Experimental results clearly demonstrate the advantages of multi-agent RL-based control over LQF governed isolated single-intersection control, thus paving the way for efficient distributed traffic signal control in complex settings.
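
A hedged sketch of the central agent's learning step described above: Q-learning with a small feed-forward network as the value-function approximator. The two-queue toy environment, network size, and learning constants are illustrative assumptions, not the paper's five-intersection PARAMICS-style simulation.

# Semi-gradient Q-learning with a one-hidden-layer network on a toy two-queue signal task.
import numpy as np

rng = np.random.default_rng(0)

def step(state, action):                            # toy traffic dynamics
    arrivals = rng.poisson(0.6, size=2)
    queues = state + arrivals
    queues[action] = max(0.0, queues[action] - 3)   # served approach discharges
    queues = np.minimum(queues, 20)
    reward = -queues.sum() / 20.0                   # negative (scaled) total queue = delay proxy
    return queues, reward

n_in, n_hid, n_act = 2, 16, 2                       # small feed-forward value network
W1 = rng.normal(0, 0.3, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.3, (n_act, n_hid)); b2 = np.zeros(n_act)

def forward(s):
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h                           # action values and hidden activations

gamma, alpha, eps = 0.9, 0.01, 0.1
state = np.zeros(2)
for t in range(20000):
    x = state / 20.0                                # scale inputs to [0, 1]
    q, h = forward(x)
    a = rng.integers(n_act) if rng.random() < eps else int(np.argmax(q))
    nxt, r = step(state, a)
    q_next, _ = forward(nxt / 20.0)
    td = r + gamma * q_next.max() - q[a]            # Q-learning temporal-difference error
    grad_h = alpha * td * W2[a] * (1 - h ** 2)      # backprop through the taken action only
    W2[a] += alpha * td * h;  b2[a] += alpha * td
    W1 += np.outer(grad_h, x); b1 += grad_h
    state = nxt

print("greedy action for queues [10, 2]:", int(np.argmax(forward(np.array([10, 2]) / 20.0)[0])))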

463 citations


Journal Article
TL;DR: An alternative way to approximate the maximum expected value for any set of random variables is introduced and the obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value.
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.

463 citations


Proceedings Article
06 Dec 2010
TL;DR: Double Q-learning as mentioned in this paper is a new off-policy reinforcement learning algorithm that uses the double estimator to approximate the maximum expected value for any set of random variables, and is shown to perform well in some settings in which Q-learning performs poorly due to its overestimation.
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
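
A minimal tabular sketch of the update described above: two value tables are kept, one selects the argmax action and the other evaluates it, which removes the positive bias of taking the max of a single noisy estimator. The toy environment and constants are invented for illustration.

# Tabular Double Q-learning on a toy cyclic environment with noisy rewards.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 3, 4, 0.95, 0.1, 0.1
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))

def step(s, a):
    # all actions have zero mean reward but high noise: the setting where a single
    # max-estimator overestimates action values
    return (s + 1) % n_states, rng.normal(0.0, 1.0)

s = 0
for t in range(50000):
    q = QA[s] + QB[s]                        # act greedily w.r.t. the sum of both tables
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q))
    s_next, r = step(s, a)
    if rng.random() < 0.5:                   # update A, evaluate the argmax with B
        a_star = int(np.argmax(QA[s_next]))
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                    # update B, evaluate the argmax with A
        b_star = int(np.argmax(QB[s_next]))
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
    s = s_next

print("QA+QB for state 0:", (QA + QB)[0])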

Journal ArticleDOI
Steven L. Scott
TL;DR: A heuristic for managing multi-armed bandits called randomized probability matching is described, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal.
Abstract: A multi-armed bandit is an experiment with the goal of accumulating rewards from a payoff distribution with unknown parameters that are to be learned sequentially. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods that are ‘optimal’ in simpler contexts. I summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.
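
A minimal sketch of randomized probability matching for a Bernoulli bandit: each arm keeps a Beta posterior, one draw is taken from each posterior, and the arm with the largest draw is played, so arms are chosen approximately in proportion to their posterior probability of being optimal. The arm payoff probabilities below are invented.

# Randomized probability matching (Thompson sampling) with Beta-Bernoulli posteriors.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.25, 0.45, 0.60])            # unknown payoff probabilities
alpha = np.ones(3); beta = np.ones(3)            # Beta(1, 1) priors per arm

pulls = np.zeros(3, dtype=int)
for t in range(5000):
    samples = rng.beta(alpha, beta)              # one posterior draw per arm
    arm = int(np.argmax(samples))                # probability-matching allocation
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward                         # conjugate posterior update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)                   # most pulls should concentrate on arm 2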

Journal ArticleDOI
TL;DR: A new optimal reward framework is defined that captures the pressure to design good primary reward functions that lead to evolutionary success across environments and shows that optimal primary reward signals may yield both emergent intrinsic and extrinsic motivation.
Abstract: There is great interest in building intrinsic motivation into artificial systems using the reinforcement learning framework. Yet, what intrinsic motivation may mean computationally, and how it may differ from extrinsic motivation, remains a murky and controversial subject. In this paper, we adopt an evolutionary perspective and define a new optimal reward framework that captures the pressure to design good primary reward functions that lead to evolutionary success across environments. The results of two computational experiments show that optimal primary reward signals may yield both emergent intrinsic and extrinsic motivation. The evolutionary perspective and the associated optimal reward framework thus lead to the conclusion that there are no hard and fast features distinguishing intrinsic and extrinsic reward computationally. Rather, the directness of the relationship between rewarding behavior and evolutionary success varies along a continuum.

Proceedings ArticleDOI
18 Jul 2010
TL;DR: A framework for combining the training of deep auto-encoders (for learning compact feature spaces) with recently-proposed batch-mode RL algorithms (for learning policies) is proposed, and an emphasis is put on the data-efficiency and on studying the properties of the feature spaces automatically constructed by the deep auto-encoder neural networks.
Abstract: This paper discusses the effectiveness of deep auto-encoder neural networks in visual reinforcement learning (RL) tasks. We propose a framework for combining the training of deep auto-encoders (for learning compact feature spaces) with recently-proposed batch-mode RL algorithms (for learning policies). An emphasis is put on the data-efficiency of this combination and on studying the properties of the feature spaces automatically constructed by the deep auto-encoders. These feature spaces are empirically shown to adequately capture similarities and spatial relations between observations and to allow useful policies to be learned. We propose several methods for improving the topology of the feature spaces by making use of task-dependent information. Finally, we present first results on successfully learning good control policies directly on synthesized and real images.

Journal ArticleDOI
TL;DR: It is shown that online Bayesian inference within a model that assumes an unbounded number of latent causes can characterize a diverse set of behavioral results from such manipulations, some of which pose problems for the model of Redish et al. (2007).
Abstract: A. Redish et al. (2007) proposed a reinforcement learning model of context-dependent learning and extinction in conditioning experiments, using the idea of "state classification" to categorize new observations into states. In the current article, the authors propose an interpretation of this idea in terms of normative statistical inference. They focus on renewal and latent inhibition, 2 conditioning paradigms in which contextual manipulations have been studied extensively, and show that online Bayesian inference within a model that assumes an unbounded number of latent causes can characterize a diverse set of behavioral results from such manipulations, some of which pose problems for the model of Redish et al. Moreover, in both paradigms, context dependence is absent in younger animals, or if hippocampal lesions are made prior to training. The authors suggest an explanation in terms of a restricted capacity to infer new causes.

Journal ArticleDOI
29 Apr 2010-Neuron
TL;DR: This work uses a reinforcement learning paradigm to demonstrate that more anterior regions along the rostro-caudal axis of frontal cortex support rule learning at higher levels of abstraction, and indicates that when humans confront new rule learning problems, this rostro-caudal division of labor supports the search for relationships between context and action at multiple levels of abstraction simultaneously.

Journal ArticleDOI
TL;DR: A neural model of action selection and decision making based on the theory of partially observable Markov decision processes (POMDPs) is proposed and suggests an important role for interactions between the neocortex and the basal ganglia in learning the mapping between probabilistic sensory representations and actions that maximize rewards.
Abstract: A fundamental problem faced by animals is learning to select actions based on noisy sensory information and incomplete knowledge of the world. It has been suggested that the brain engages in Bayesian inference during perception but how such probabilistic representations are used to select actions has remained unclear. Here we propose a neural model of action selection and decision making based on the theory of partially observable Markov decision processes (POMDPs). Actions are selected based not on a single “optimal” estimate of state but on the posterior distribution over states (the “belief” state). We show how such a model provides a unified framework for explaining experimental results in decision making that involve both information gathering and overt actions. The model utilizes temporal difference (TD) learning for maximizing expected reward. The resulting neural architecture posits an active role for the neocortex in belief computation while ascribing a role to the basal ganglia in belief representation, value computation, and action selection. When applied to the random dots motion discrimination task, model neurons representing belief exhibit responses similar to those of LIP neurons in primate neocortex. The appropriate threshold for switching from information gathering to overt actions emerges naturally during reward maximization. Additionally, the time course of reward prediction error in the model shares similarities with dopaminergic responses in the basal ganglia during the random dots task. For tasks with a deadline, the model learns a decision making strategy that changes with elapsed time, predicting a collapsing decision threshold consistent with some experimental studies. The model provides a new framework for understanding neural decision making and suggests an important role for interactions between the neocortex and the basal ganglia in learning the mapping between probabilistic sensory representations and actions that maximize rewards.
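
A hedged sketch of the two ingredients the abstract combines: a Bayesian belief update over hidden states and TD(0) learning of a value function that is linear in the belief. The two-state "dots" world, observation model, and learning constants are toy assumptions, not the paper's LIP/basal-ganglia model.

# Belief-state update plus TD(0) value learning on the belief, for a toy evidence task.
import numpy as np

rng = np.random.default_rng(0)
# hidden state: 0 = "dots move left", 1 = "dots move right" (fixed within a trial)
obs_model = np.array([[0.6, 0.4],      # P(observation | state), rows = states
                      [0.4, 0.6]])
gamma, alpha = 0.95, 0.05
w = np.zeros(2)                        # value weights: V(b) = w . b

for trial in range(5000):
    true_s = rng.integers(2)
    b = np.array([0.5, 0.5])           # uniform prior belief
    for t in range(10):                # "sample more evidence" steps
        o = rng.random() < obs_model[true_s, 1]          # noisy observation
        lik = obs_model[:, int(o)]
        b_next = lik * b
        b_next /= b_next.sum()                           # Bayesian belief update
        # TD(0) update of the belief-state value function (reward arrives only at the end)
        w += alpha * (0.0 + gamma * w @ b_next - w @ b) * b
        b = b_next
    choice = int(np.argmax(b))                           # commit after gathering evidence
    r = 1.0 if choice == true_s else 0.0
    w += alpha * (r - w @ b) * b                         # terminal TD update

print("value weights over belief:", w)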

Proceedings ArticleDOI
03 May 2010
TL;DR: This paper derives a novel approach to RL for parameterized control policies based on the framework of stochastic optimal control with path integrals, and argues that this new algorithm, Policy Improvement with Path Integrals (PI2), currently offers one of the most efficient, numerically robust, and easy-to-implement algorithms for RL in robotics.
Abstract: Reinforcement learning (RL) is one of the most general approaches to learning control. Its applicability to complex motor systems, however, has been largely impossible so far due to the computational difficulties that reinforcement learning encounters in high dimensional continuous state-action spaces. In this paper, we derive a novel approach to RL for parameterized control policies based on the framework of stochastic optimal control with path integrals. While solidly grounded in optimal control theory and estimation theory, the update equations for learning are surprisingly simple and have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a robot dog illustrates the functionality of our algorithm in a real-world scenario. We believe that our new algorithm, Policy Improvement with Path Integrals (PI2), offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL in robotics.

Proceedings ArticleDOI
03 Dec 2010
TL;DR: An approach allowing a robot to acquire new motor skills by learning the couplings across motor control variables through Expectation-Maximization based Reinforcement Learning is presented.
Abstract: We present an approach allowing a robot to acquire new motor skills by learning the couplings across motor control variables. The demonstrated skill is first encoded in a compact form through a modified version of Dynamic Movement Primitives (DMP) which encapsulates correlation information. Expectation-Maximization based Reinforcement Learning is then used to modulate the mixture of dynamical systems initialized from the user's demonstration. The approach is evaluated on a torque-controlled 7 DOFs Barrett WAM robotic arm. Two skill learning experiments are conducted: a reaching task where the robot needs to adapt the learned movement to avoid an obstacle, and a dynamic pancake-flipping task.

Journal ArticleDOI
TL;DR: This method estimates a likelihood gradient by sampling directly in parameter space, which leads to lower-variance gradient estimates than those obtained by regular policy gradient methods, and the improvement is shown to be largest when the parameter samples are drawn symmetrically.
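
A hedged sketch of the symmetric-sampling idea: a Gaussian over policy parameters is maintained, each perturbation is evaluated at mu+eps and mu-eps, and the reward difference drives the update, reducing the variance of the gradient estimate. The quadratic stand-in for an episode return and the step sizes are invented assumptions.

# Parameter-space exploration with symmetric samples (PGPE-style), on a toy return.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.5, -1.0, 2.0])

def episode_return(theta):                 # stand-in for running the policy for one episode
    return -np.sum((theta - target) ** 2)

mu = np.zeros(3)
sigma = np.ones(3)
alpha_mu, alpha_sigma, baseline = 0.05, 0.02, None

for it in range(2000):
    eps = rng.normal(0.0, sigma)
    r_plus, r_minus = episode_return(mu + eps), episode_return(mu - eps)
    r_mean = 0.5 * (r_plus + r_minus)
    baseline = r_mean if baseline is None else 0.99 * baseline + 0.01 * r_mean
    mu += alpha_mu * 0.5 * (r_plus - r_minus) * eps              # symmetric-difference step
    sigma += alpha_sigma * (r_mean - baseline) * (eps**2 - sigma**2) / sigma
    sigma = np.clip(sigma, 1e-3, 5.0)                            # keep exploration noise sane

print("learned parameters:", np.round(mu, 2))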

Journal ArticleDOI
TL;DR: This article surveys an emerging literature on 'structure learning' (using experience to infer the structure of a task) and how it can be of service to RL, with an emphasis on structure in perception and action.

Journal ArticleDOI
TL;DR: It was found that information and experience have a combined effect on drivers' route-choice behavior; informed participants were more prone to risk-seeking and had greater sensitivity to travel time variability, while non-informed participants appeared to be more risk-averse and less sensitive to variability.
Abstract: This paper presents a learning-based model of route-choice behavior when information is provided in real time. In a laboratory controlled experiment, participants made a long series of binary route-choice trials relying on real-time information and learning from their personal experience reinforced through feedback. A discrete choice model with a Mixed Logit specification, accounting for panel effects, was estimated based on the experiment’s data. It was found that information and experience have a combined effect on drivers’ route-choice behavior. Informed participants had faster learning rates and tended to base their decisions on memorization relating to previous outcomes whereas non-informed participants were slower in learning, required more exploration and tended to rely mostly on recent outcomes. Informed participants were more prone to risk-seeking and had greater sensitivity to travel time variability. In comparison, non-informed participants appeared to be more risk-averse and less sensitive to variability. These results have important policy implications on the design and implementation of ATIS initiatives. The advantage of incorporating insights from Prospect Theory and reinforced learning to improve the realism of travel behavior models is also discussed.

Proceedings ArticleDOI
10 May 2010
TL;DR: The fast learning exhibited within the tamer framework is leveraged to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent.
Abstract: As learning agents move from research labs to the real world, it is increasingly important that human users, including those without programming skills, be able to teach agents desired behaviors. Recently, the tamer framework was introduced for designing agents that can be interactively shaped by human trainers who give only positive and negative feedback signals. Past work on tamer showed that shaping can greatly reduce the sample complexity required to learn a good policy, can enable lay users to teach agents the behaviors they desire, and can allow agents to learn within a Markov Decision Process (MDP) in the absence of a coded reward function. However, tamer does not allow this human training to be combined with autonomous learning based on such a coded reward function. This paper leverages the fast learning exhibited within the tamer framework to hasten a reinforcement learning (RL) algorithm's climb up the learning curve, effectively demonstrating that human reinforcement and MDP reward can be used in conjunction with one another by an autonomous agent. We tested eight plausible tamer+rl methods for combining a previously learned human reinforcement function, H, with MDP reward in a reinforcement learning algorithm. This paper identifies which of these methods are most effective and analyzes their strengths and weaknesses. Results from these tamer+rl algorithms indicate better final performance and better cumulative performance than either a tamer agent or an RL agent alone.
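
A hedged sketch of one simple combination scheme in the spirit of those tested above: tabular Q-learning whose learning signal is the MDP reward plus a weighted, previously learned human-reinforcement model H(s, a), with the human influence annealed over episodes. The corridor task, the hand-coded H, and the weight schedule are invented for illustration and are not one of the paper's exact eight variants.

# Q-learning on a toy corridor, with the MDP reward augmented by a learned human model H.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 2                       # corridor: action 0 = left, 1 = right
goal = n_states - 1

# pretend H was learned earlier from human feedback and prefers moving right
H = np.zeros((n_states, n_actions)); H[:, 1] = 1.0

Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.95, 0.2, 0.1

for episode in range(300):
    w = max(0.0, 1.0 - episode / 200)             # anneal the human influence over time
    s = 0
    while s != goal:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = min(goal, s + 1) if a == 1 else max(0, s - 1)
        r_mdp = 1.0 if s_next == goal else 0.0
        r = r_mdp + w * H[s, a]                    # combined MDP-plus-human learning signal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy actions along the corridor:", np.argmax(Q, axis=1))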

Journal ArticleDOI
TL;DR: A form of real-time multiagent reinforcement learning, which is known as decentralized Q-learning, is proposed to manage the aggregated interference generated by multiple WRAN systems.
Abstract: This paper deals with the problem of aggregated interference generated by multiple cognitive radios (CRs) at the receivers of primary (licensed) users. In particular, we consider a secondary CR system based on the IEEE 802.22 standard for wireless regional area networks (WRANs), and we model it as a multiagent system where the multiple agents are the different secondary base stations in charge of controlling the secondary cells. We propose a form of real-time multiagent reinforcement learning, which is known as decentralized Q-learning, to manage the aggregated interference generated by multiple WRAN systems. We consider both situations of complete and partial information about the environment. By directly interacting with the surrounding environment in a distributed fashion, the multiagent system is able to learn, in the first case, an efficient policy to solve the problem and, in the second case, a reasonably good suboptimal policy. Computational and memory requirement considerations are also presented, discussing two different options for uploading and processing the learning information. Simulation results, which are presented for both the upstream and downstream cases, reveal that the proposed approach is able to fulfill the primary-user interference constraints, without introducing signaling overhead in the system.

Book ChapterDOI
Michel Tokic
21 Sep 2010
TL;DR: Preliminary results indicate that VDBE seems to be more parameter-robust than commonly used ad hoc approaches such as ε-greedy or softmax.
Abstract: This paper presents "Value-Difference Based Exploration" (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the exploration parameter of e-greedy in dependence of the temporal-difference error observed from value-function backups, which is considered as a measure of the agent's uncertainty about the environment. VDBE is evaluated on a multi-armed bandit task, which allows for insight into the behavior of the method. Preliminary results indicate that VDBE seems to be more parameter robust than commonly used ad hoc approaches such as e-greedy or softmax.

15 Jan 2010
TL;DR: It is shown that two new motor skills, i.e., Ball-in-a-Cup and Ball-Paddling, can be learned on a real Barrett WAM robot arm at a pace similar to human learning while achieving a significantly more reliable final performance.
Abstract: The acquisition and self-improvement of novel motor skills is among the most important problems in robotics. Motor primitives offer one of the most promising frameworks for the application of machine learning techniques in this context. Employing an improved form of the dynamic systems motor primitives originally introduced by Ijspeert et al. [2], we show how both discrete and rhythmic tasks can be learned using a concerted approach of both imitation and reinforcement learning. For doing so, we present both learning algorithms and representations targeted for the practical application in robotics. Furthermore, we show that it is possible to include a start-up phase in rhythmic primitives. We show that two new motor skills, i.e., Ball-in-a-Cup and Ball-Paddling, can be learned on a real Barrett WAM robot arm at a pace similar to human learning while achieving a significantly more reliable final performance.

Book
22 Nov 2010
TL;DR: First, PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available, is introduced; second, principled algorithms for robust filtering and smoothing in GP dynamic systems are proposed.
Abstract: This book examines Gaussian processes in both model-based reinforcement learning (RL) and inference in nonlinear dynamic systems. First, we introduce PILCO, a fully Bayesian approach for efficient RL in continuous-valued state and action spaces when no expert knowledge is available. PILCO takes model uncertainties consistently into account during long-term planning to reduce model bias. Second, we propose principled algorithms for robust filtering and smoothing in GP dynamic systems.


Journal ArticleDOI
TL;DR: Experimental results showed that the reinforcement-learning-based method using the vehicle dynamic parameters feature outperforms the other algorithms, and that adding the other two features further improves prediction accuracy.

Journal ArticleDOI
TL;DR: The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvement in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), cooperative ensemble (CE) and actuated control.
Abstract: This study presents a distributed multi-agent-based traffic signal control for optimising green timing in an urban arterial road network to reduce the total travel time and delay experienced by vehicles. The proposed multi-agent architecture uses traffic data collected by sensors at each intersection, stored historical traffic patterns and data communicated from agents in adjacent intersections to compute the green time for a phase. Parameters such as the weights and threshold values used in computing the green time are fine-tuned by online reinforcement learning with the objective of reducing overall delay. PARAMICS software was used as a platform to simulate 29 signalised intersections in the Central Business District of Singapore and to test the performance of the proposed multi-agent traffic signal control for different traffic scenarios. The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvement in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), cooperative ensemble (CE) and actuated control.