
Showing papers on "Reinforcement learning" published in 2008


Journal ArticleDOI
01 Mar 2008
TL;DR: The benefits and challenges of MARL are described along with some of the problem domains where the MARL techniques have been applied, and an outlook for the field is provided.
Abstract: Multiagent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must, instead, discover a solution on their own, using learning. A significant part of the research on multiagent learning concerns reinforcement learning techniques. This paper provides a comprehensive survey of multiagent reinforcement learning (MARL). A central issue in the field is the formal statement of the multiagent learning goal. Different viewpoints on this issue have led to the proposal of many different goals, among which two focal points can be distinguished: stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents. The MARL algorithms described in the literature aim---either explicitly or implicitly---at one of these two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting. A representative selection of these algorithms is discussed in detail in this paper, together with the specific issues that arise in each category. Additionally, the benefits and challenges of MARL are described along with some of the problem domains where the MARL techniques have been applied. Finally, an outlook for the field is provided.

1,878 citations


Journal ArticleDOI
TL;DR: This paper examines learning of complex motor skills with human-like limbs, and combines the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning with the theory of stochastic policy gradient learning.

921 citations


Journal ArticleDOI
TL;DR: It is shown that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms.

659 citations


Journal ArticleDOI
TL;DR: This document reviews phenomenological models of short-term and long-term synaptic plasticity, in particular spike-timing dependent plasticity (STDP), and focuses on phenomenological synaptic models that are compatible with integrate-and-fire type neuron models where each neuron is described by a small number of variables.
Abstract: Synaptic plasticity is considered to be the biological substrate of learning and memory. In this document we review phenomenological models of short-term and long-term synaptic plasticity, in particular spike-timing dependent plasticity (STDP). The aim of the document is to provide a framework for classifying and evaluating different models of plasticity. We focus on phenomenological synaptic models that are compatible with integrate-and-fire type neuron models where each neuron is described by a small number of variables. This implies that synaptic update rules for short-term or long-term plasticity can only depend on spike timing and, potentially, on membrane potential, as well as on the value of the synaptic weight, or on low-pass filtered (temporally averaged) versions of the above variables. We examine the ability of the models to account for experimental data and to fulfill expectations derived from theoretical considerations. We further discuss their relations to teacher-based rules (supervised learning) and reward-based rules (reinforcement learning). All models discussed in this paper are suitable for large-scale network simulations.

606 citations
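
As a rough illustration of the pair-based STDP rule the abstract describes (compatible with integrate-and-fire models, the update depends only on spike timing via low-pass filtered traces), here is a minimal Python sketch; the amplitudes, time constants, and toy spike trains are assumptions for illustration, not values from the paper.

import numpy as np

A_PLUS, A_MINUS = 0.01, 0.012     # potentiation / depression amplitudes (assumed)
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # trace time constants in ms (assumed)

def stdp_update(w, pre_spikes, post_spikes, dt=1.0, w_min=0.0, w_max=1.0):
    """Update synaptic weight w given binary spike trains (arrays of 0/1)."""
    x_pre, x_post = 0.0, 0.0  # low-pass filtered spike traces
    for pre, post in zip(pre_spikes, post_spikes):
        x_pre += -x_pre * dt / TAU_PLUS + pre
        x_post += -x_post * dt / TAU_MINUS + post
        # pre-before-post potentiates, post-before-pre depresses
        w += A_PLUS * x_pre * post - A_MINUS * x_post * pre
        w = float(np.clip(w, w_min, w_max))
    return w

rng = np.random.default_rng(0)
pre = (rng.random(1000) < 0.02).astype(float)
post = np.roll(pre, 5)  # post fires ~5 ms after pre -> net potentiation
print(stdp_update(0.5, pre, post))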


Journal ArticleDOI
TL;DR: The latest dispatches from the forefront of reinforcement learning are reviewed, and some of the territories where lie monsters are mapped.

585 citations


Journal ArticleDOI
TL;DR: A theoretical analysis of Model-based Interval Estimation and a new variation called MBIE-EB are presented, with proofs of their efficiency even under worst-case conditions.

503 citations


Journal ArticleDOI
TL;DR: A well-known, coherent Bayesian approach to decision making is reviewed, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration, and paradigmatic psychological and neural examples of each problem are discussed.
Abstract: Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making.

491 citations


Journal ArticleDOI
01 Jun 2008
TL;DR: This work proposes a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL), and shows that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.
Abstract: Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior. We propose a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL) to overcome these limitations. Our RL-based memory controller observes the system state and estimates the long-term performance impact of each action it can take. In this way, the controller learns to optimize its scheduling policy on the fly to maximize long-term performance. Our results show that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average (up to 33%), and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.

484 citations
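
A minimal sketch of the idea, with the scheduler cast as a tabular Q-learning agent; the state features, action set, reward, and the toy dynamics in toy_step are illustrative assumptions, not the paper's actual DRAM timing model (the paper uses a hardware-friendly approximate implementation).

import random
from collections import defaultdict

ACTIONS = ["precharge", "activate", "read_write", "noop"]
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.05
Q = defaultdict(float)

def choose(state):
    if random.random() < EPS:                        # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def toy_step(state, action):
    """Stand-in dynamics: reward 1 when a read/write hits an open row."""
    queue, row_open = state
    if action == "read_write" and row_open and queue > 0:
        return 1.0, (queue - 1, random.random() < 0.5)
    if action == "activate":
        return 0.0, (queue, True)
    if action == "precharge":
        return 0.0, (queue, False)
    arrivals = 1 if random.random() < 0.3 else 0     # new requests queue up
    return 0.0, (min(queue + arrivals, 8), row_open)

state = (4, False)
for t in range(50000):
    action = choose(state)
    reward, nxt = toy_step(state, action)
    best_next = max(Q[(nxt, a)] for a in ACTIONS)    # one-step Q-learning backup
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = nxt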


Book ChapterDOI
08 Dec 2008
TL;DR: This paper extends previous work on policy learning from the immediate reward case to episodic reinforcement learning, resulting in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives.
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAM™ robot arm.

411 citations
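
The abstract's EM-inspired policy search admits a very compact sketch: perturb the policy parameters, weight each perturbation by its (positive) episodic return, and average. The quadratic toy return, noise level, and batch size below are assumptions for illustration, not the paper's setup.

import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(5)   # policy parameters (e.g., motor-primitive weights)
sigma = 0.3           # exploration noise in parameter space

def episode_return(params):
    target = np.array([0.5, -0.2, 0.1, 0.8, -0.4])
    return np.exp(-np.sum((params - target) ** 2))  # positive "improper" return

for it in range(200):
    eps = sigma * rng.standard_normal((20, theta.size))          # exploration
    returns = np.array([episode_return(theta + e) for e in eps])
    # E- and M-step collapse into a return-weighted average of the noise
    theta += (returns[:, None] * eps).sum(0) / returns.sum()
print(episode_return(theta))   # approaches 1 as theta nears the optimum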


Journal ArticleDOI
TL;DR: The importance of understanding the human-teacher/robot-learner partnership in order to design algorithms that support how people want to teach and simultaneously improve the robot's learning behavior is demonstrated.

403 citations


Proceedings Article
08 Dec 2008
TL;DR: This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps.
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm.
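
To make the diameter parameter D concrete, here is a sketch that computes it for a small *deterministic* MDP, where the expected travel time under the best policy reduces to graph shortest paths (general MDPs need a stochastic shortest-path computation); the 4-state chain is an invented example.

import numpy as np

S, A = 4, 2
nxt = np.array([[1, 0], [2, 1], [3, 1], [3, 2]])  # nxt[s, a] = successor state

dist = np.full((S, S), np.inf)
for s in range(S):
    dist[s, s] = 0.0
for _ in range(S):                 # Bellman-Ford-style sweeps over actions
    for s in range(S):
        for a in range(A):
            dist[s] = np.minimum(dist[s], 1 + dist[nxt[s, a]])
D = dist.max()
print("diameter D =", D)           # the regret bound scales as O(D*S*sqrt(A*T))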

Journal Article
TL;DR: This paper compares a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large state spaces and hidden state.
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks.
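
A heavily simplified sketch of cooperative synapse coevolution: one subpopulation per network weight, candidate networks assembled row-wise, selection on whole-network fitness, then each weight column permuted so weights recombine into new collaborations. The fitness function, population sizes, and unconditional permutation are toy assumptions (CoSyNE permutes probabilistically based on fitness).

import numpy as np

rng = np.random.default_rng(2)
n_weights, pop = 8, 32
subpops = rng.standard_normal((pop, n_weights))   # row i = candidate network

def fitness(w):                    # stand-in for, e.g., a pole-balancing return
    return -np.sum((w - 0.7) ** 2)

for gen in range(100):
    scores = np.array([fitness(w) for w in subpops])
    elite = subpops[np.argsort(scores)[-pop // 4:]]           # keep top quarter
    children = elite[rng.integers(0, len(elite), pop - len(elite))]
    children = children + 0.1 * rng.standard_normal(children.shape)
    subpops = np.vstack([elite, children])
    # CoSyNE's key step: permute each weight column independently so weights
    # from different networks are recombined in the next generation
    for j in range(n_weights):
        subpops[:, j] = rng.permutation(subpops[:, j])
print(max(fitness(w) for w in subpops))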

Journal ArticleDOI
01 Mar 2008
TL;DR: This paper mathematically and experimentally proves that the simultaneous consideration of randomness and opposition is more advantageous than pure randomness, and applies this scheme to accelerate differential evolution (DE).
Abstract: For many soft computing methods, we need to generate random numbers to use either as initial estimates or during the learning and search process. Recently, results for evolutionary algorithms, reinforcement learning and neural networks have been reported which indicate that the simultaneous consideration of randomness and opposition is more advantageous than pure randomness. This new scheme, called opposition-based learning, has the apparent effect of accelerating soft computing algorithms. This paper mathematically and also experimentally proves this advantage and, as an application, applies it to accelerate differential evolution (DE). By taking advantage of random numbers and their opposites, the optimization, search or learning process in many soft computing techniques can be accelerated when there is no a priori knowledge about the solution. The mathematical proofs and the results of conducted experiments confirm each other.
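
A minimal sketch of opposition-based initialization for DE: for each random candidate x in [a, b], the opposite point a + b - x is also evaluated and the better half is kept before standard DE/rand/1/bin iterations begin. The sphere objective and DE constants are the usual textbook defaults, assumed here for illustration.

import numpy as np

rng = np.random.default_rng(3)
dim, NP, a, b = 5, 20, -5.0, 5.0
f = lambda x: np.sum(x ** 2)                     # toy objective (sphere)

X = rng.uniform(a, b, (NP, dim))
O = a + b - X                                    # opposite population
both = np.vstack([X, O])
pop = both[np.argsort([f(x) for x in both])[:NP]]  # fittest NP of the 2*NP

F, CR = 0.5, 0.9                                 # standard DE/rand/1/bin step
for gen in range(100):
    for i in range(NP):
        r1, r2, r3 = pop[rng.choice(NP, 3, replace=False)]
        mutant = r1 + F * (r2 - r3)
        cross = rng.random(dim) < CR
        trial = np.where(cross, mutant, pop[i])
        if f(trial) < f(pop[i]):                 # greedy selection
            pop[i] = trial
print(min(f(x) for x in pop))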

Proceedings ArticleDOI
05 Jul 2008
TL;DR: Object-Oriented MDPs (OO-MDPs) are introduced, a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities; a learning algorithm for deterministic OO-MDPs is presented and a polynomial bound on its sample complexity is proved.
Abstract: Rich representations in reinforcement learning have been studied for the purpose of enabling generalization and making learning feasible in large state spaces. We introduce Object-Oriented MDPs (OO-MDPs), a representation based on objects and their interactions, which is a natural way of modeling environments and offers important generalization opportunities. We introduce a learning algorithm for deterministic OO-MDPs and prove a polynomial bound on its sample complexity. We illustrate the performance gains of our representation and algorithm in the well-known Taxi domain, plus a real-life videogame.
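A small sketch of what an object-oriented state looks like, using the Taxi domain the abstract mentions; the class and relation names are illustrative assumptions (the paper's formalism defines object classes, attributes, and relations whose truth values condition learned transition effects).

from dataclasses import dataclass

@dataclass
class Taxi:
    x: int
    y: int

@dataclass
class Passenger:
    x: int
    y: int
    in_taxi: bool

def touching(o1, o2) -> bool:
    """A relation over object attributes, used in OO-MDP transition conditions."""
    return abs(o1.x - o2.x) + abs(o1.y - o2.y) <= 1

state = {"taxi": Taxi(0, 0), "passenger": Passenger(0, 1, False)}
# A learned rule might read: if touching(taxi, passenger) and action == "pickup",
# the effect sets passenger.in_taxi = True -- and the same rule generalizes to
# any grid size or passenger location, which is the source of the sample savings.
print(touching(state["taxi"], state["passenger"]))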

Journal ArticleDOI
TL;DR: The techniques covered are case-based reasoning, rule-based systems, artificial neural networks, fuzzy models, genetic algorithms, cellular automata, multi-agent systems, swarm intelligence, reinforcement learning and hybrid systems.

Journal ArticleDOI
Daeyeol Lee
TL;DR: Molecular genetic studies have also begun to identify genetic mechanisms for personal traits related to reinforcement learning and complex social decision making, further illuminating the biological basis of social behavior.
Abstract: Decision making in a social group has two distinguishing features. First, humans and other animals routinely alter their behavior in response to changes in their physical and social environment. As a result, the outcomes of decisions that depend on the behavior of multiple decision makers are difficult to predict and require highly adaptive decision-making strategies. Second, decision makers may have preferences regarding consequences to other individuals and therefore choose their actions to improve or reduce the well-being of others. Many neurobiological studies have exploited game theory to probe the neural basis of decision making and suggested that these features of social decision making might be reflected in the functions of brain areas involved in reward evaluation and reinforcement learning. Molecular genetic studies have also begun to identify genetic mechanisms for personal traits related to reinforcement learning and complex social decision making, further illuminating the biological basis of social behavior.

Journal ArticleDOI
TL;DR: The demonstration that reinforcement learning applies selectively to formally equivalent aspects of task performance supports broader consideration of two-system models in analyses of learning and decision making. The results also suggest a privileged role for surface geometry in determining spatial context and support the idea of a "geometric module," albeit for location rather than orientation.
Abstract: Associative reinforcement provides a powerful explanation of learned behavior. However, an unproven but long-held conjecture holds that spatial learning can occur incidentally rather than by reinforcement. Using a carefully controlled virtual-reality object-location memory task, we formally demonstrate that locations are concurrently learned relative to both local landmarks and local boundaries but that landmark-learning obeys associative reinforcement (showing “overshadowing” and “blocking” or “learned irrelevance”), whereas boundary-learning is incidental, showing neither overshadowing nor blocking nor learned irrelevance. Crucially, both types of learning occur at similar rates and do not reflect differences in levels of performance, cue salience, or instructions. These distinct types of learning likely reflect the distinct neural systems implicated in processing of landmarks and boundaries: the striatum and hippocampus, respectively [Doeller CF, King JA, Burgess N (2008) Proc Natl Acad Sci USA 105:5915–5920]. In turn, our results suggest the use of fundamentally different learning rules by these two systems, potentially explaining their differential roles in procedural and declarative memory more generally. Our results suggest a privileged role for surface geometry in determining spatial context and support the idea of a “geometric module,” albeit for location rather than orientation. Finally, the demonstration that reinforcement learning applies selectively to formally equivalent aspects of task-performance supports broader consideration of two-system models in analyses of learning and decision making.

Book
06 Nov 2008
TL;DR: Reactive Search and Intelligent Optimization is an excellent introduction to the main principles of reactive search, as well as an attempt to develop some fresh intuition for the approaches.
Abstract: Reactive Search integrates sub-symbolic machine learning techniques into search heuristics for solving complex optimization problems. By automatically adjusting the working parameters, a reactive search self-tunes and adapts, effectively learning by doing until a solution is found. Intelligent Optimization, a superset of Reactive Search, concerns online and off-line schemes based on the use of memory, adaptation, incremental development of models, experimental algorithms applied to optimization, intelligent tuning and design of heuristics. Reactive Search and Intelligent Optimization is an excellent introduction to the main principles of reactive search, as well as an attempt to develop some fresh intuition for the approaches. The book looks at different optimization possibilities with an emphasis on opportunities for learning and self-tuning strategies. While focusing more on methods than on problems, problems are introduced wherever they help make the discussion more concrete, or when a specific problem has been widely studied by reactive search and intelligent optimization heuristics. Individual chapters cover reacting on the neighborhood; reacting on the annealing schedule; reactive prohibitions; model-based search; reacting on the objective function; relationships between reactive search and reinforcement learning; and much more. Each chapter is structured to show basic issues and algorithms; the parameters critical for the success of the different methods discussed; and opportunities and schemes for the automated tuning of these parameters. Anyone working in decision making in business, engineering, economics or science will find a wealth of information here.

Journal ArticleDOI
01 Oct 2008
TL;DR: The results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems and show that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism.
Abstract: The key approaches for machine learning, particularly learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value-updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of the eigen action is determined by the probability amplitude, which is updated in parallel according to rewards. Some related characteristics of QRL such as convergence, optimality, and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given, and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. This paper is also an effective exploration of the application of quantum computation to artificial intelligence.
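
A classical toy sketch of the quantum-inspired mechanism: actions carry probability amplitudes, an action is "measured" with probability |amplitude|^2, and amplitudes of rewarded actions are strengthened and renormalized. The additive update constant k is an assumption standing in for the paper's Grover-style amplitude amplification.

import numpy as np

rng = np.random.default_rng(4)
n_actions = 4
amp = np.ones(n_actions) / np.sqrt(n_actions)   # uniform superposition

def measure(amp):
    return rng.choice(len(amp), p=amp ** 2)     # collapse postulate

def reinforce(amp, action, reward, k=0.1):
    amp = amp.copy()
    amp[action] += k * reward                   # strengthen rewarded eigen action
    return amp / np.linalg.norm(amp)            # keep sum of |amp|^2 equal to 1

for t in range(500):
    a = measure(amp)
    r = 1.0 if a == 2 else 0.0                  # toy bandit: action 2 pays off
    amp = reinforce(amp, a, r)
print(amp ** 2)  # probability mass concentrates on the rewarded action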

Proceedings ArticleDOI
05 Jul 2008
TL;DR: The convergence properties of several variations of Q-learning when combined with function approximation are analyzed, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings.
Abstract: We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.
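
For reference, a sketch of the method class the paper analyzes, Q-learning with linear function approximation: Q(s, a) is approximated by phi(s, a) . theta and updated with a semi-gradient TD rule. The toy environment and random features are assumptions; convergence in general requires conditions like those the paper identifies.

import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, d = 10, 2, 6
phi = rng.standard_normal((n_states, n_actions, d)) / np.sqrt(d)  # features
theta = np.zeros(d)
alpha, gamma = 0.05, 0.9

def q(s, a):
    return phi[s, a] @ theta

s = 0
for t in range(5000):
    if rng.random() < 0.1:                                   # epsilon-greedy
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax([q(s, b) for b in range(n_actions)]))
    s2 = (s + 1) % n_states if a == 0 else int(rng.integers(n_states))
    r = 1.0 if s2 == n_states - 1 else 0.0
    td = r + gamma * max(q(s2, b) for b in range(n_actions)) - q(s, a)
    theta += alpha * td * phi[s, a]                          # semi-gradient step
    s = s2
print(theta)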

Journal ArticleDOI
09 Apr 2008-PLOS ONE
TL;DR: The results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.
Abstract: Cooperation plays a key role in the evolution of complex systems. However, the level of cooperation extensively varies with the topology of agent networks in the widely used models of repeated games. Here we show that cooperation remains rather stable by applying the reinforcement learning strategy adoption rule, Q-learning on a variety of random, regular, small-world, scale-free and modular network models in repeated, multi-agent Prisoner's Dilemma and Hawk-Dove games. Furthermore, we found that using the above model systems other long-term learning strategy adoption rules also promote cooperation, while introducing a low level of noise (as a model of innovation) to the strategy adoption rules makes the level of cooperation less dependent on the actual network topology. Our results demonstrate that long-term learning and random elements in the strategy adoption rules, when acting together, extend the range of network topologies enabling the development of cooperation at a wider range of costs and temptations. These results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.
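
A minimal sketch of Q-learning as a strategy-adoption rule in a networked repeated Prisoner's Dilemma: each node keeps Q-values over {cooperate, defect}, plays its neighbors, and acts randomly with a small "innovation" probability. The payoff table, ring topology, and constants are toy assumptions, not the paper's parameterization.

import numpy as np

rng = np.random.default_rng(6)
N, alpha, eps = 50, 0.1, 0.02
payoff = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
Q = {i: {"C": 0.0, "D": 0.0} for i in range(N)}
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}  # ring network

for rnd in range(2000):
    acts = {i: ("C" if rng.random() < 0.5 else "D") if rng.random() < eps  # innovation
            else max(Q[i], key=Q[i].get) for i in range(N)}
    for i in range(N):
        r = float(np.mean([payoff[(acts[i], acts[j])] for j in neighbors[i]]))
        Q[i][acts[i]] += alpha * (r - Q[i][acts[i]])           # strategy update
print(sum(acts[i] == "C" for i in range(N)) / N)               # cooperation level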

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input.
Abstract: In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.
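
The evaluation step at the core of the paper can be sketched as fitting a linear value function by minimizing an empirical Bellman-residual-style loss along a single trajectory. The toy chain, cosine features, and plain least-squares shortcut are assumptions; the paper's estimator and finite-sample analysis treat the bias of this loss far more carefully.

import numpy as np

rng = np.random.default_rng(7)
gamma, d, T = 0.9, 4, 2000
states = rng.random(T + 1)                       # 1-d continuous trajectory
rewards = states[:-1]                            # toy rewards along the path
feat = lambda s: np.array([np.cos(k * np.pi * s) for k in range(d)])
Phi = np.array([feat(s) for s in states[:-1]])   # features of visited states
Phi2 = np.array([feat(s) for s in states[1:]])   # features of successors
A = Phi - gamma * Phi2                           # residual design matrix
# min_w sum_t (V(s_t) - r_t - gamma * V(s_{t+1}))^2  ==  min_w ||A w - r||^2
w = np.linalg.lstsq(A, rewards, rcond=None)[0]
print(w)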

Journal ArticleDOI
TL;DR: It is demonstrated in numerical simulations that an appropriate feedback controller can be acquired within a few thousand trials, and that the controller obtained in simulation achieves stable walking with a physical robot in the real world.
Abstract: In this paper we describe a learning framework for a central pattern generator (CPG)-based biped locomotion controller using a policy gradient method. Our goals in this study are to achieve CPG-based biped walking with a 3D hardware humanoid and to develop an efficient learning algorithm with CPG by reducing the dimensionality of the state space used for learning. We demonstrate in numerical simulations that an appropriate feedback controller can be acquired within a few thousand trials, and that the controller obtained in simulation achieves stable walking with a physical robot in the real world. Numerical simulations and hardware experiments evaluate the walking velocity and stability. The results suggest that the learning algorithm is capable of adapting to environmental changes. Furthermore, we present an online learning scheme with an initial policy for a hardware robot to improve the controller within 200 iterations.
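
The learning loop itself is compact: a policy-gradient update of a few feedback gains that modulate the CPG. Below is a sketch with a score-function gradient estimate; the stand-in rollout_return objective and all constants are assumptions, since the paper evaluates real CPG dynamics on a humanoid.

import numpy as np

rng = np.random.default_rng(8)
K = np.zeros(3)                          # low-dimensional feedback gains

def rollout_return(gains):               # toy stand-in for walking performance
    best = np.array([0.4, -0.3, 0.9])
    return -np.sum((gains - best) ** 2) + 0.01 * rng.standard_normal()

alpha, sigma, n = 0.05, 0.1, 10
for trial in range(2000):
    eps = sigma * rng.standard_normal((n, K.size))       # perturbed rollouts
    R = np.array([rollout_return(K + e) for e in eps])
    grad = (R - R.mean()) @ eps / (n * sigma ** 2)       # baseline-subtracted
    K += alpha * grad
print(K)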

Journal ArticleDOI
TL;DR: Behavioral, neuropsychological, functional neuroimaging, and computational studies of basal ganglia and dopamine contributions to learning in humans implicate the basal ganglia in incremental, feedback-based learning that involves integrating information across multiple experiences.

Proceedings ArticleDOI
05 Jul 2008
TL;DR: It is shown that linear value-function approximation is equivalent to a form of linear model approximation, and a relationship between the model-approximation error and the Bellman error is derived, which can guide feature selection for model improvement and/or value- function improvement.
Abstract: We show that linear value-function approximation is equivalent to a form of linear model approximation. We then derive a relationship between the model-approximation error and the Bellman error, and show how this relationship can guide feature selection for model improvement and/or value-function improvement. We also show how these results give insight into the behavior of existing feature-selection algorithms.
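
The stated equivalence can be checked numerically in a few lines: fit a linear model in feature space (F predicts next features, b predicts reward), solve that model exactly, and compare with the usual linear fixed-point (LSTD-style) solution on the same data. The random data below are an assumption purely to exercise the identity.

import numpy as np

rng = np.random.default_rng(9)
n, d, gamma = 500, 5, 0.9
Phi = rng.standard_normal((n, d))        # features of visited states
Phi2 = rng.standard_normal((n, d))       # features of successor states
r = rng.standard_normal(n)

# (1) least-squares linear model: Phi @ F ~ Phi2 and Phi @ b ~ r
F = np.linalg.lstsq(Phi, Phi2, rcond=None)[0]
b = np.linalg.lstsq(Phi, r, rcond=None)[0]
w_model = np.linalg.solve(np.eye(d) - gamma * F, b)   # exact value of the model

# (2) linear TD fixed point on the same data
w_lstd = np.linalg.solve(Phi.T @ (Phi - gamma * Phi2), Phi.T @ r)
print(np.allclose(w_model, w_lstd))      # -> True: the two solutions coincide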

Journal ArticleDOI
01 Aug 2008
TL;DR: Several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms are described.
Abstract: This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms.
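
Two of the paper's combination rules, majority voting and Boltzmann multiplication, reduce to a few lines over the action preferences produced by the individual learners. The example probability vectors are placeholders; in the paper each one would come from a different RL algorithm's derived policy.

import numpy as np

def majority_voting(policies):
    """policies: list of action-probability vectors, one per RL algorithm."""
    votes = np.zeros_like(policies[0])
    for p in policies:
        votes[np.argmax(p)] += 1.0          # each learner votes for its greedy action
    return votes / votes.sum()

def boltzmann_multiplication(policies):
    prod = np.prod(np.vstack(policies), axis=0)  # multiply action probabilities
    return prod / prod.sum()

pols = [np.array([0.6, 0.3, 0.1]),   # e.g. Q-learning's policy
        np.array([0.2, 0.5, 0.3]),   # e.g. Sarsa's policy
        np.array([0.5, 0.4, 0.1])]   # e.g. actor-critic's policy
print(majority_voting(pols), boltzmann_multiplication(pols))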

Book ChapterDOI
15 Sep 2008
TL;DR: In this paper, the authors extend this approach to include explicit coordination between neighboring traffic lights, which is achieved using the max-plus algorithm, which estimates the optimal joint action by sending locally optimized messages among connected agents.
Abstract: Since traffic jams are ubiquitous in the modern world, optimizing the behavior of traffic lights for efficient traffic flow is a critically important goal. Though most current traffic lights use simple heuristic protocols, more efficient controllers can be discovered automatically via multiagent reinforcement learning, where each agent controls a single traffic light. However, in previous work on this approach, agents select only locally optimal actions without coordinating their behavior. This paper extends this approach to include explicit coordination between neighboring traffic lights. Coordination is achieved using the max-plus algorithm, which estimates the optimal joint action by sending locally optimized messages among connected agents. This paper presents the first application of max-plus to a large-scale problem and thus verifies its efficacy in realistic settings. It also provides empirical evidence that max-plus performs well on cyclic graphs, though it has been proven to converge only for tree-structured graphs. Furthermore, it provides a new understanding of the properties a traffic network must have for such coordination to be beneficial and shows that max-plus outperforms previous methods on networks that possess those properties.
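
A minimal sketch of max-plus message passing on a three-light chain 1 - 2 - 3, where each pair of neighbors shares a payoff table over joint actions (0 = north-south green, 1 = east-west green). The payoff tables are toy assumptions; on this tree one forward and one backward pass is exact, and the paper's contribution is running the same iteration on large cyclic networks.

import numpy as np

f12 = np.array([[2.0, 0.0], [0.0, 3.0]])   # payoff f12[a1, a2]
f23 = np.array([[1.0, 0.0], [0.0, 2.0]])   # payoff f23[a2, a3]

# mu_{i->j}(a_j) = max_{a_i} [ f_ij(a_i, a_j) + messages into i from others ]
mu_1to2 = f12.max(axis=0)                        # max over a1, indexed by a2
mu_3to2 = f23.max(axis=1)                        # max over a3, indexed by a2
mu_2to1 = (f12 + mu_3to2[None, :]).max(axis=1)   # indexed by a1
mu_2to3 = (f23 + mu_1to2[:, None]).max(axis=0)   # indexed by a3

a1 = int(np.argmax(mu_2to1))                     # each agent maximizes its belief
a2 = int(np.argmax(mu_1to2 + mu_3to2))
a3 = int(np.argmax(mu_2to3))
print(a1, a2, a3)   # -> 1 1 1, the jointly optimal green configuration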

Proceedings ArticleDOI
05 Jul 2008
TL;DR: The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where instead of repeatedly backing up maximal expected rewards, it back up the set of expected rewards that are maximal for some set of linear preferences.
Abstract: We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for all linear preference assignments over the multiple reward criteria at once. The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector, w). We present the algorithm along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector, w.
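
A sketch of the vector-valued backup: instead of a scalar max, each state keeps the set of expected-reward vectors that are optimal for at least one weight vector w. Pruning here checks optimality on sampled w's for simplicity, whereas the paper prunes exactly with convex hull operations; the two-criteria toy choice is an assumption.

import numpy as np

gamma = 0.9
W = [np.array([t, 1 - t]) for t in np.linspace(0, 1, 51)]  # sampled preferences

def prune(vectors):
    """Keep vectors that are best for at least one sampled preference w."""
    keep = {max(range(len(vectors)), key=lambda i: w @ vectors[i]) for w in W}
    return [vectors[i] for i in sorted(keep)]

rewards = {"left": np.array([1.0, 0.0]),    # two actions with opposite trade-offs
           "right": np.array([0.0, 1.0])}
V_next = [np.zeros(2)]                      # successor state's vector set
Q_sets = {a: prune([r + gamma * v for v in V_next]) for a, r in rewards.items()}
V0 = prune([q for qs in Q_sets.values() for q in qs])
print(V0)   # both vectors survive: each is optimal for some preference w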

Proceedings ArticleDOI
05 Jul 2008
TL;DR: A novel algorithm is introduced that transfers samples from the source tasks that are most similar to the target task, and it is shown empirically that, following the proposed approach, the transfer of samples is effective in reducing the learning complexity.
Abstract: The main objective of transfer in reinforcement learning is to reduce the complexity of learning the solution of a target task by effectively reusing the knowledge retained from solving a set of source tasks. In this paper, we introduce a novel algorithm that transfers samples (i.e., tuples 〈s, a, s', r〉) from source to target tasks. Under the assumption that tasks have similar transition models and reward functions, we propose a method to select samples from the source tasks that are most similar to the target task, and then to use them as input for batch reinforcement-learning algorithms. As a result, the number of samples an agent needs to collect from the target task to learn its solution is reduced. We empirically show that, following the proposed approach, the transfer of samples is effective in reducing the learning complexity, even when some source tasks are significantly different from the target task.
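
A rough sketch of the transfer idea: pool (s, a, s', r) tuples from several source tasks, score how compatible each is with the few transitions observed in the target task, and hand the best-scoring ones to a batch RL learner. The Gaussian compatibility score and 1-d toy tasks are assumptions standing in for the paper's principled task-similarity measure.

import numpy as np

rng = np.random.default_rng(11)

def make_task(drift, n=300):
    s = rng.random(n)
    a = rng.integers(0, 2, n).astype(float)
    s2 = s + drift * (2 * a - 1) + 0.01 * rng.standard_normal(n)
    r = -np.abs(s2 - 0.5)
    return np.stack([s, a, s2, r], axis=1)          # rows of (s, a, s', r)

sources = [make_task(0.1), make_task(0.11), make_task(0.5)]  # last: dissimilar
target = make_task(0.1)[:30]                        # only a few target samples

def compatibility(sample, target, bw=0.05):
    s, a, s2, _ = sample
    mask = (np.abs(target[:, 0] - s) < bw) & (target[:, 1] == a)
    if not mask.any():
        return 0.0
    return float(np.exp(-np.mean((target[mask, 2] - s2) ** 2) / bw ** 2))

pool = np.vstack(sources)
scores = np.array([compatibility(x, target) for x in pool])
transfer = pool[np.argsort(scores)[-200:]]          # best samples for batch RL
print(len(transfer))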

Journal ArticleDOI
TL;DR: The proposed method works in the setting of learning resolved motion rate control on a real, physical Mitsubishi PA-10 medical robotics arm and demonstrates feasibility for complex high degree-of-freedom robots.
Abstract: One of the most general frameworks for phrasing control problems for complex, redundant robots is operational-space control. However, while this framework is of essential importance for robotics and well understood from an analytical point of view, it can be prohibitively hard to achieve accurate control in the face of modeling errors, which are inevitable in complex robots (e.g. humanoid robots). In this paper, we suggest a learning approach for operational-space control as a direct inverse model learning problem. A first important insight for this paper is that a physically correct solution to the inverse problem with redundant degrees of freedom does exist when learning of the inverse map is performed in a suitable piecewise linear way. The second crucial component of our work is based on the insight that many operational-space controllers can be understood in terms of a constrained optimal control problem. The cost function associated with this optimal control problem allows us to formulate a learning algorithm that automatically synthesizes a globally consistent desired resolution of redundancy while learning the operational-space controller. From the machine learning point of view, this learning problem corresponds to a reinforcement learning problem that maximizes an immediate reward. We employ an expectation-maximization policy search algorithm in order to solve this problem. Evaluations on a three degrees-of-freedom robot arm are used to illustrate the suggested approach. The application to a physically realistic simulator of the anthropomorphic SARCOS Master arm demonstrates feasibility for complex high degree-of-freedom robots. We also show that the proposed method works in the setting of learning resolved motion rate control on a real, physical Mitsubishi PA-10 medical robotics arm.
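
The machine-learning core of the abstract, RL with an immediate reward solved by an EM-style policy search, can be sketched as reward-weighted regression on a toy 1-d "robot": exploratory joint velocities are scored by an exponentiated operational-space cost, and the linear policy is refit toward high-reward actions. The toy Jacobian, cost weights, and policy form are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(12)

def reward(qdot, xdot_des, J=np.array([1.0, 0.5])):
    task_err = (J @ qdot - xdot_des) ** 2     # operational-space tracking error
    effort = 0.1 * qdot @ qdot                # cost that resolves the redundancy
    return np.exp(-(task_err + effort))       # exponentiated immediate cost

W = np.zeros(2)                               # linear policy: qdot = W * xdot
for it in range(300):
    xdot = rng.uniform(-1, 1, 50)                            # desired velocities
    qdots = np.outer(xdot, W) + 0.2 * rng.standard_normal((50, 2))  # exploration
    r = np.array([reward(q, x) for q, x in zip(qdots, xdot)])
    # reward-weighted regression: refit W toward the high-reward actions
    W = (r[:, None] * qdots * xdot[:, None]).sum(0) / (r * xdot ** 2).sum()
print(W)   # approaches the reward-optimal, consistent inverse mapping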