
Showing papers on "Reinforcement learning published in 2005"


Journal ArticleDOI
TL;DR: This survey attempts to draw from multi-agent learning work in a spectrum of areas, including RL, evolutionary computation, game theory, complex systems, agent modeling, and robotics, and finds that this broad view leads to a division of the work into two categories.
Abstract: Cooperative multi-agent systems (MAS) are ones in which several agents attempt, through their interaction, to jointly solve tasks or to maximize utility. Due to the interactions among the agents, multi-agent problem complexity can rise rapidly with the number of agents or their behavioral sophistication. The challenge this presents to the task of programming solutions to MAS problems has spawned increasing interest in machine learning techniques to automate the search and optimization process. We provide a broad survey of the cooperative multi-agent learning literature. Previous surveys of this area have largely focused on issues common to specific subareas (for example, reinforcement learning (RL) or robotics). In this survey we attempt to draw from multi-agent learning work in a spectrum of areas, including RL, evolutionary computation, game theory, complex systems, agent modeling, and robotics. We find that this broad view leads to a division of the work into two categories, each with its own special issues: applying a single learner to discover joint solutions to multi-agent problems (team learning), or using multiple simultaneous learners, often one per agent (concurrent learning). Additionally, we discuss direct and indirect communication in connection with learning, plus open issues in task decomposition, scalability, and adaptive dynamics. We conclude with a presentation of multi-agent learning problem domains, and a list of multi-agent learning resources.

1,283 citations


Journal Article
TL;DR: Within this framework, several classical tree-based supervised learning methods and two newly proposed ensemble algorithms, namely extremely and totally randomized trees, are described; the ensemble methods based on regression trees are found to perform well in extracting relevant information about the optimal control policy from sets of four-tuples.
Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x_t, u_t, r_t, x_{t+1}) where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.

1,079 citations
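
As a concrete illustration of the batch-mode setup above, here is a minimal sketch of fitted Q iteration over a set of four-tuples, using scikit-learn's ExtraTreesRegressor as a stand-in for the paper's extremely randomized trees; the array layout, hyperparameters, and greedy-policy helper are illustrative assumptions rather than the authors' experimental setup.

```python
# Minimal fitted Q iteration sketch over a batch of four-tuples (x_t, u_t, r_t, x_{t+1}).
# ExtraTreesRegressor stands in for the extremely randomized trees studied in the paper;
# all array names and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(X, U, R, X_next, actions, gamma=0.95, n_iters=50):
    """X: (N, d) states, U: (N,) discrete actions, R: (N,) rewards, X_next: (N, d)."""
    inputs = np.column_stack([X, U])            # regress Q on (state, action)
    targets = R.copy()                          # first iteration: Q_1 = expected reward
    model = None
    for _ in range(n_iters):
        model = ExtraTreesRegressor(n_estimators=50, min_samples_leaf=2)
        model.fit(inputs, targets)
        # Bellman backup: r + gamma * max_a' Q_k(x', a')
        q_next = np.column_stack([
            model.predict(np.column_stack([X_next, np.full(len(X_next), a)]))
            for a in actions
        ])
        targets = R + gamma * q_next.max(axis=1)
    return model

def greedy_action(model, x, actions):
    q = [model.predict(np.hstack([x, a]).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmax(q))]
```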


Book ChapterDOI
03 Oct 2005
TL;DR: NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron, is introduced, and it is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.
Abstract: This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network based Reinforcement Learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.

944 citations
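
The NFQ idea of storing transitions and repeatedly retraining on Bellman targets can be sketched as follows; scikit-learn's MLPRegressor is used here as a convenient stand-in for the Rprop-trained multi-layer perceptron of the paper, and the transition format and hyperparameters are assumptions.

```python
# Sketch of Neural Fitted Q iteration: batch retraining of an MLP on all stored transitions.
# MLPRegressor (default Adam training) stands in for the Rprop-trained net in the paper;
# the transition format and hyperparameters are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq(transitions, actions, gamma=0.95, n_iters=100):
    """transitions: list of (state, action, cost, next_state) with states as 1-D arrays."""
    S = np.array([s for s, a, c, s2 in transitions])
    A = np.array([a for s, a, c, s2 in transitions])
    C = np.array([c for s, a, c, s2 in transitions])
    S2 = np.array([s2 for s, a, c, s2 in transitions])
    X = np.column_stack([S, A])
    net = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=2000)
    targets = C.copy()                       # iteration 0: Q = immediate cost
    for _ in range(n_iters):
        net.fit(X, targets)
        q_next = np.column_stack([
            net.predict(np.column_stack([S2, np.full(len(S2), a)])) for a in actions
        ])
        targets = C + gamma * q_next.min(axis=1)   # NFQ minimizes path cost
    return net
```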


Journal ArticleDOI
Abstract: (2005). Information Theory, Inference, and Learning Algorithms. Journal of the American Statistical Association: Vol. 100, No. 472, pp. 1461-1462.

740 citations


Journal ArticleDOI
TL;DR: This model of reinforcement learning among cognitive strategies (RELACS) captures the 3 deviations, the learning curves, and the effect of information on uncertainty avoidance and outperforms other models in fitting the data and in predicting behavior in other experiments.
Abstract: Analysis of binary choice behavior in iterated tasks with immediate feedback reveals robust deviations from maximization that can be described as indications of 3 effects: (a) a payoff variability effect, in which high payoff variability seems to move choice behavior toward random choice; (b) underweighting of rare events, in which alternatives that yield the best payoffs most of the time are attractive even when they are associated with a lower expected return; and (c) loss aversion, in which alternatives that minimize the probability of losses can be more attractive than those that maximize expected payoffs. The results are closer to probability matching than to maximization. Best approximation is provided with a model of reinforcement learning among cognitive strategies (RELACS). This model captures the 3 deviations, the learning curves, and the effect of information on uncertainty avoidance. It outperforms other models in fitting the data and in predicting behavior in other experiments.

446 citations


Proceedings ArticleDOI
07 Aug 2005
TL;DR: An approach is presented in which supervised learning is first used to estimate depths from single monocular images, yielding an algorithm that learns monocular vision cues which accurately estimate the relative depths of obstacles in a scene.
Abstract: We consider the task of driving a remote control car at high speeds through unstructured outdoor environments. We present an approach in which supervised learning is first used to estimate depths from single monocular images. The learning algorithm can be trained either on real camera images labeled with ground-truth distances to the closest obstacles, or on a training set consisting of synthetic graphics images. The resulting algorithm is able to learn monocular vision cues that accurately estimate the relative depths of obstacles in a scene. Reinforcement learning/policy search is then applied within a simulator that renders synthetic scenes. This learns a control policy that selects a steering direction as a function of the vision system's output. We present results evaluating the predictive ability of the algorithm both on held out test data, and in actual autonomous driving experiments.

435 citations


Journal ArticleDOI
TL;DR: The application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer results in agents that significantly outperform a range of benchmark policies.
Abstract: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.

430 citations
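
A compact sketch of Sarsa(λ) with linear tile-coding function approximation, in the spirit of the setup above; the simple grid tile coder below stands in for the CMAC-style coder used in keepaway, and the environment interface and all parameters are assumptions.

```python
# Sarsa(lambda) with linear function approximation over a simple tile coding.
# This grid tile coder is a simplified stand-in for the CMAC used in the keepaway work;
# the environment interface (env.reset/env.step) and all parameters are assumptions.
import numpy as np

class TileCoder:
    def __init__(self, lows, highs, n_tilings=8, n_tiles=8):
        self.lows, self.highs = np.array(lows, float), np.array(highs, float)
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.dim = len(lows)
        self.features_per_tiling = n_tiles ** self.dim

    def features(self, state):
        """Return active feature indices (one per tiling)."""
        scaled = (np.array(state) - self.lows) / (self.highs - self.lows) * self.n_tiles
        active = []
        for t in range(self.n_tilings):
            offset = t / self.n_tilings                      # diagonal tiling offsets
            coords = np.clip((scaled + offset).astype(int), 0, self.n_tiles - 1)
            idx = int(np.ravel_multi_index(coords, (self.n_tiles,) * self.dim))
            active.append(t * self.features_per_tiling + idx)
        return active

def sarsa_lambda(env, coder, n_actions, episodes=500,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.05):
    n_feat = coder.n_tilings * coder.features_per_tiling
    w = np.zeros((n_actions, n_feat))

    def q(s_feat, a):
        return w[a, s_feat].sum()

    def choose(s_feat):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s_feat, a) for a in range(n_actions)]))

    for _ in range(episodes):
        z = np.zeros_like(w)                                 # eligibility traces
        s_feat = coder.features(env.reset())
        a = choose(s_feat)
        done = False
        while not done:
            s2, r, done = env.step(a)
            delta = r - q(s_feat, a)
            z[a, s_feat] = 1.0                               # replacing traces
            if not done:
                s2_feat = coder.features(s2)
                a2 = choose(s2_feat)
                delta += gamma * q(s2_feat, a2)
            w += (alpha / coder.n_tilings) * delta * z
            z *= gamma * lam
            if not done:
                s_feat, a = s2_feat, a2
    return w
```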


Proceedings ArticleDOI
07 Aug 2005
TL;DR: A SARSA-based extension of GPTD, termed GPSARSA, is presented that allows the selection of actions and the gradual improvement of policies without requiring a world model.
Abstract: Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.

402 citations
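
The GPTD generative model underlying this work can be illustrated in batch form: with r_t = v(x_t) − γ v(x_{t+1}) + noise, the rewards are a linear transformation of the latent values, so standard Gaussian-process conditioning yields the value posterior. The sketch below computes the posterior mean at the visited states; it omits the paper's online sparsification and the GPSARSA state-action extension, and the Gaussian kernel and simple noise model are illustrative assumptions.

```python
# Batch illustration of the GPTD value posterior: with r_t = v(x_t) - gamma*v(x_{t+1}) + noise,
# the rewards are a linear transformation H of the latent values, so GP conditioning applies.
# Online sparsification and the GPSARSA state-action extension from the paper are omitted;
# the kernel and noise level are illustrative assumptions.
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gptd_posterior_mean(states, rewards, gamma=0.95, sigma=0.1, lengthscale=1.0):
    """states: (T+1, d) visited states of one trajectory; rewards: (T,) observed rewards."""
    T = len(rewards)
    K = rbf_kernel(states, states, lengthscale)          # prior covariance of the values
    H = np.zeros((T, T + 1))
    for t in range(T):                                   # r_t = v(x_t) - gamma * v(x_{t+1}) + n_t
        H[t, t], H[t, t + 1] = 1.0, -gamma
    S = H @ K @ H.T + sigma ** 2 * np.eye(T)             # covariance of the observed rewards
    alpha = H.T @ np.linalg.solve(S, rewards)
    return K @ alpha                                     # posterior mean of v at the visited states
```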


Book ChapterDOI
01 Aug 2005
TL;DR: A novel reinforcement learning technique based on natural stochastic policy gradients allows a general approach to improving DMPs by trial-and-error learning with respect to almost arbitrary optimization criteria; the different ingredients of the DMP approach are demonstrated in various examples.
Abstract: This paper discusses a comprehensive framework for modular motor control based on a recently developed theory of dynamic movement primitives (DMP). DMPs are a formulation of movement primitives with autonomous nonlinear differential equations, whose time evolution creates smooth kinematic control policies. Model-based control theory is used to convert the outputs of these policies into motor commands. By means of coupling terms, on-line modifications can be incorporated into the time evolution of the differential equations, thus providing a rather flexible and reactive framework for motor planning and execution. The linear parameterization of DMPs lends itself naturally to supervised learning from demonstration. Moreover, the temporal, scale, and translation invariance of the differential equations with respect to these parameters provides a useful means for movement recognition. A novel reinforcement learning technique based on natural stochastic policy gradients allows a general approach of improving DMPs by trial and error learning with respect to almost arbitrary optimization criteria. We demonstrate the different ingredients of the DMP approach in various examples, involving skill learning from demonstration on the humanoid robot DB, and learning biped walking from demonstration in simulation, including self-improvement of the movement patterns towards energy efficiency through resonance tuning.

381 citations
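
A minimal numerical sketch of a single discrete DMP helps make the formulation concrete, showing the canonical and transformation systems with a locally weighted forcing term; the gains, basis placement, and weights below are illustrative assumptions rather than values from the paper.

```python
# Minimal integration of one discrete dynamic movement primitive (DMP):
# canonical system x' = -a_x * x and transformation system
#   tau * v' = a_z * (b_z * (g - y) - v) + f(x),   tau * y' = v,
# with a locally weighted forcing term f. Gains, basis placement, and the weights
# are illustrative assumptions, not values from the paper.
import numpy as np

def integrate_dmp(y0, g, weights, tau=1.0, dt=0.001, a_z=25.0, b_z=6.25, a_x=1.0):
    n_basis = len(weights)
    centers = np.exp(-a_x * np.linspace(0, 1, n_basis))   # basis centers spread in phase space
    widths = n_basis ** 1.5 / centers                     # common width heuristic
    y, v, x = y0, 0.0, 1.0
    trajectory = [y]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        f = (psi @ weights) / (psi.sum() + 1e-10) * x * (g - y0)   # forcing term
        v_dot = (a_z * (b_z * (g - y) - v) + f) / tau
        y_dot = v / tau
        x_dot = -a_x * x / tau
        v, y, x = v + v_dot * dt, y + y_dot * dt, x + x_dot * dt
        trajectory.append(y)
    return np.array(trajectory)

# With zero weights the DMP is a pure point attractor converging smoothly to the goal g.
traj = integrate_dmp(y0=0.0, g=1.0, weights=np.zeros(10))
```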


Journal ArticleDOI
TL;DR: The model suggests that long-term sensitivity enhancements to task-relevant or irrelevant stimuli occur as a result of timely interactions between diffused signals triggered by task performance and signals produced by stimulus presentation.

357 citations


Book ChapterDOI
03 Oct 2005
TL;DR: The Natural Actor-Critic is a model-free reinforcement learning architecture in which actor updates are based on stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression.
Abstract: This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
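
An episodic flavor of the natural actor-critic can be sketched compactly: for a Gaussian policy with linear mean, regressing episode returns on the summed compatible features (the score function) recovers the natural gradient as the least-squares weight vector. The toy environment, policy form, and parameters below are assumptions, and the sketch omits the paper's LSTD-Q(λ) critic.

```python
# Episodic natural-gradient actor-critic sketch for a Gaussian policy with linear mean.
# The natural gradient is recovered as the least-squares solution w of
#   R_e ~ sum_t grad_theta log pi(a_t|s_t)^T w + baseline,
# i.e. a regression of episode returns on the summed compatible features.
# The toy 1-D environment, policy form, and all parameters are illustrative assumptions.
import numpy as np

def run_episode(theta, sigma=0.3, horizon=20):
    """Toy 1-D regulation task: the state drifts by the action, reward = -(s^2 + 0.1*a^2)."""
    s, ret, score = np.random.randn(), 0.0, np.zeros_like(theta)
    for _ in range(horizon):
        feats = np.array([s, 1.0])
        a = theta @ feats + sigma * np.random.randn()
        score += (a - theta @ feats) / sigma ** 2 * feats   # grad_theta log pi(a|s)
        ret += -(s ** 2 + 0.1 * a ** 2)
        s = 0.9 * s + a
    return score, ret

def natural_actor_critic(n_updates=200, episodes_per_update=30, lr=0.05):
    theta = np.zeros(2)
    for _ in range(n_updates):
        Psi, R = [], []
        for _ in range(episodes_per_update):
            score, ret = run_episode(theta)
            Psi.append(np.append(score, 1.0))               # last column: constant baseline
            R.append(ret)
        sol, *_ = np.linalg.lstsq(np.array(Psi), np.array(R), rcond=None)
        theta += lr * sol[:-1]                              # natural gradient step on the actor
    return theta
```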

Proceedings ArticleDOI
05 Dec 2005
TL;DR: Integral sliding mode and reinforcement learning control are presented as two design techniques for accommodating the nonlinear disturbances of outdoor altitude control; both result in greatly improved performance over classical control techniques.
Abstract: The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC) is a multi-vehicle testbed currently comprised of two quadrotors, also called X4-flyers, with capacity for eight. This paper presents a comparison of control design techniques, specifically for outdoor altitude control, in and above ground effect, that accommodate the unique dynamics of the aircraft. Due to the complex airflow induced by the four interacting rotors, classical linear techniques failed to provide sufficient stability. Integral sliding mode and reinforcement learning control are presented as two design techniques for accommodating the nonlinear disturbances. The methods both result in greatly improved performance over classical control techniques.
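
A bare-bones integral sliding mode altitude loop for a point-mass vehicle model gives a feel for the first technique; the model, gains, disturbance, and boundary-layer width below are illustrative assumptions, not the STARMAC design.

```python
# Bare-bones integral sliding mode altitude controller for a point-mass model
#   m * z_ddot = u - m * g + d(t),   with d(t) an unknown bounded disturbance.
# Sliding surface: s = e_dot + lam * e + k_i * integral(e), with e = z_ref - z.
# The thrust adds a saturated switching term to a feedback-linearizing nominal term.
# Model, gains, and boundary-layer width are assumptions, not the STARMAC design.
import numpy as np

def sat(x, width):
    return np.clip(x / width, -1.0, 1.0)

def simulate(z_ref=1.0, m=0.6, g=9.81, lam=3.0, k_i=1.0, K=4.0, phi=0.1,
             dt=0.01, T=10.0):
    z, z_dot, e_int = 0.0, 0.0, 0.0
    log = []
    for k in range(int(T / dt)):
        e = z_ref - z
        e_dot = -z_dot                                   # z_ref is constant
        e_int += e * dt
        s = e_dot + lam * e + k_i * e_int
        # nominal (feedback-linearizing) term + switching term on the sliding surface
        u = m * (g + lam * e_dot + k_i * e) + K * sat(s, phi)
        d = 0.5 * np.sin(2 * np.pi * 0.2 * k * dt)       # bounded unmodeled disturbance
        z_ddot = (u - m * g + d) / m
        z_dot += z_ddot * dt
        z += z_dot * dt
        log.append(z)
    return np.array(log)
```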

Proceedings ArticleDOI
07 Aug 2005
TL;DR: This paper considers the apprenticeship learning setting in which a teacher demonstration of the task is available, and shows that, given the initial demonstration, no explicit exploration is necessary, and the student can attain near-optimal performance simply by repeatedly executing "exploitation policies" that try to maximize rewards.
Abstract: We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E^3 (Kearns and Singh, 2002) learn near-optimal policies by using "exploration policies" to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing "exploitation policies" that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.
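
For the finite-state MDP case, the "exploit only" loop can be sketched as: fit a model to the teacher's demonstration plus the agent's own experience, compute a greedy policy for the fitted model, execute it, and refit. The environment and demonstration interfaces below are assumptions.

```python
# Finite-MDP sketch of the exploit-only apprenticeship loop: fit a model to the teacher's
# demonstration plus the agent's own experience, compute a greedy policy by value iteration,
# execute it, and refit. The env/demonstration interfaces are assumptions.
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards; returns a greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * P @ V           # (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def apprenticeship_exploit(env, demo, n_states, n_actions, n_rounds=20,
                           horizon=100, gamma=0.95):
    counts = np.ones((n_states, n_actions, n_states)) * 1e-3   # smoothed transition counts
    r_sum = np.zeros((n_states, n_actions))
    r_cnt = np.zeros((n_states, n_actions)) + 1e-6

    def record(s, a, r, s2):
        counts[s, a, s2] += 1.0
        r_sum[s, a] += r
        r_cnt[s, a] += 1.0

    for (s, a, r, s2) in demo:                                  # teacher demonstration
        record(s, a, r, s2)
    for _ in range(n_rounds):
        P = counts / counts.sum(axis=2, keepdims=True)
        R = r_sum / r_cnt
        policy = value_iteration(P, R, gamma)                   # greedy exploitation policy
        s = env.reset()
        for _ in range(horizon):                                # run it, collect more data
            a = int(policy[s])
            s2, r, done = env.step(a)
            record(s, a, r, s2)
            s = s2
            if done:
                break
    return policy
```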

Journal ArticleDOI
TL;DR: A model-free, heuristic reinforcement learning algorithm is presented that aims at finding good deterministic policies by weighting the original value function against the risk; it was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column.
Abstract: In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
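
One way to sketch the weighted two-criteria idea: learn the usual Q-function alongside a risk estimate ρ(s, a) for the probability of ever reaching an error state, act greedily on Q − ξρ, and adapt ξ toward the user-specified risk threshold. The interfaces, the ξ adaptation rule, and all parameters below are assumptions rather than the paper's exact algorithm.

```python
# Sketch of a risk-sensitive Q-learning variant: alongside Q, learn a risk estimate rho(s, a),
# the probability of ever entering an error state under the current policy, and act greedily
# on the weighted criterion Q - xi * rho. The weight xi is nudged up when the observed risk
# exceeds the threshold omega and down otherwise. Interfaces, the xi adaptation rule, and all
# parameters are illustrative assumptions, not the paper's exact algorithm.
import numpy as np

def risk_sensitive_q_learning(env, n_states, n_actions, error_states, omega=0.05,
                              episodes=2000, alpha=0.1, gamma=0.95, eps=0.1, xi_step=0.01):
    Q = np.zeros((n_states, n_actions))
    rho = np.zeros((n_states, n_actions))      # estimated probability of reaching an error state
    xi = 0.0                                   # weight on the risk criterion
    for _ in range(episodes):
        s, done, entered_error = env.reset(), False, False
        while not done:
            weighted = Q[s] - xi * rho[s]
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(weighted.argmax())
            s2, r, done = env.step(a)
            is_error = s2 in error_states
            entered_error |= is_error
            a2 = int((Q[s2] - xi * rho[s2]).argmax())
            q_target = r if done else r + gamma * Q[s2, a2]
            risk_target = 1.0 if is_error else (0.0 if done else rho[s2, a2])
            Q[s, a] += alpha * (q_target - Q[s, a])
            rho[s, a] += alpha * (risk_target - rho[s, a])
            s = s2
        # crude stochastic adaptation of the risk weight toward the target risk level omega
        xi = max(0.0, xi + xi_step * (1.0 if entered_error else -omega))
    return Q, rho, xi
```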

Proceedings ArticleDOI
07 Aug 2005
TL;DR: A new subgoal-based method for automatically creating useful skills in reinforcement learning that identifies subgoals by partitioning local state transition graphs---those that are constructed using only the most recent experiences of the agent.
Abstract: We present a new subgoal-based method for automatically creating useful skills in reinforcement learning. Our method identifies subgoals by partitioning local state transition graphs---those that are constructed using only the most recent experiences of the agent. The local scope of our subgoal discovery method allows it to successfully identify the type of subgoals we seek---states that lie between two densely-connected regions of the state space while producing an algorithm with low computational cost.
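
The local-graph idea can be illustrated in a few lines with networkx: keep only the most recent transitions, build a directed graph from them, and flag states that lie between densely connected regions. Betweenness centrality is used below as a simple stand-in for the paper's partitioning-based measure; the window size and ranking cutoff are assumptions.

```python
# Sketch of local-graph subgoal discovery: build a graph from only the agent's most recent
# transitions and flag states that lie between densely connected regions. Betweenness
# centrality is a simple stand-in for the paper's graph-partitioning measure; the window
# size and top-k cutoff are illustrative assumptions.
from collections import deque
import networkx as nx

class LocalSubgoalDiscovery:
    def __init__(self, window=500, top_k=3):
        self.recent = deque(maxlen=window)     # only the most recent transitions are kept
        self.top_k = top_k

    def observe(self, s, s_next):
        self.recent.append((s, s_next))

    def candidate_subgoals(self):
        g = nx.DiGraph()
        g.add_edges_from(self.recent)          # local state transition graph
        if g.number_of_nodes() < 3:
            return []
        centrality = nx.betweenness_centrality(g)
        ranked = sorted(centrality, key=centrality.get, reverse=True)
        return ranked[: self.top_k]            # states most likely to act as "doorways"

# Usage: call observe(s, s') after every step and periodically query candidate_subgoals()
# to seed option/skill creation for the flagged states.
```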

Journal Article
TL;DR: It is found that the magnitude of ERPs after losing to the computer opponent predicted whether subjects would change decision behavior on the subsequent trial, and FRNs to decision outcomes were disproportionately larger over the motor cortex contralateral to the response hand that was used to make the decision.

Journal ArticleDOI
TL;DR: This work shows that this so-called credit assignment problem can be solved by a new role for attention in learning and shows that the new scheme, called attention-gated reinforcement learning (AGREL), is as efficient as supervised learning in classification tasks.
Abstract: Animal learning is associated with changes in the efficacy of connections between neurons. The rules that govern this plasticity can be tested in neural networks. Rules that train neural networks to map stimuli onto outputs are given by supervised learning and reinforcement learning theories. Supervised learning is efficient but biologically implausible. In contrast, reinforcement learning is biologically plausible but comparatively inefficient. It lacks a mechanism that can identify units at early processing levels that play a decisive role in the stimulus-response mapping. Here we show that this so-called credit assignment problem can be solved by a new role for attention in learning. There are two factors in our new learning scheme that determine synaptic plasticity: (1) a reinforcement signal that is homogeneous across the network and depends on the amount of reward obtained after a trial, and (2) an attentional feedback signal from the output layer that limits plasticity to those units at earlier processing levels that are crucial for the stimulus-response mapping. The new scheme is called attention-gated reinforcement learning (AGREL). We show that it is as efficient as supervised learning in classification tasks. AGREL is biologically realistic and integrates the role of feedback connections, attention effects, synaptic plasticity, and reinforcement learning signals into a coherent framework.
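
A highly simplified single-hidden-layer sketch in the spirit of AGREL: the output unit is selected stochastically, a global reward-prediction-error signal δ is broadcast, and hidden-layer plasticity is gated by feedback from the selected output unit only. The published model's expansive feedback function and other details are omitted; dimensions and learning rates are assumptions.

```python
# Highly simplified sketch in the spirit of AGREL: a stochastic winner-take-all output,
# a global reward-prediction-error signal delta, and hidden-layer plasticity gated by
# feedback from the selected output unit only. The published model's expansive feedback
# function and other details are omitted; dimensions and learning rates are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agrel_like_train(X, labels, n_hidden=20, n_classes=3, beta=0.1, epochs=50):
    V = 0.1 * rng.standard_normal((n_hidden, X.shape[1]))   # input -> hidden weights
    W = 0.1 * rng.standard_normal((n_classes, n_hidden))    # hidden -> output weights
    for _ in range(epochs):
        for x, label in zip(X, labels):
            h = sigmoid(V @ x)
            logits = W @ h
            z = np.exp(logits - logits.max()); z /= z.sum() # output probabilities
            k = rng.choice(n_classes, p=z)                  # stochastic classification
            r = 1.0 if k == label else 0.0                  # global reward signal
            delta = r - z[k]                                # reward-prediction error
            W[k] += beta * delta * h                        # only the selected unit's weights
            # hidden plasticity gated by feedback from the selected output unit
            V += beta * delta * (W[k] * h * (1 - h))[:, None] * x[None, :]
    return V, W
```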

Journal ArticleDOI
TL;DR: This review compares methods for temporal sequence learning (TSL) across the disciplines machine-control, classical conditioning, neuronal models for TSL as well as spike-timing-dependent plasticity (STDP) and focuses on to what degree are reward-based and correlation-based learning related.
Abstract: In this review, we compare methods for temporal sequence learning (TSL) across the disciplines machine-control, classical conditioning, neuronal models for TSL as well as spike-timing-dependent plasticity (STDP). This review introduces the most influential models and focuses on two questions: To what degree are reward-based (e.g., TD learning) and correlation-based (Hebbian) learning related? and How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g., actor-critic architectures). Here the problem of evaluative (rewards) versus nonevaluative (correlations) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question, we compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal-ganglia, thalamus, and cortex) and the molecular biophysics of glutamatergic and dopaminergic synapses. Finally, we discuss the different algorithms used to model STDP and compare them to reward-based learning rules. Certain similarities are found in spite of the strongly different timescales. Here we focus on the biophysics of the different calcium-release mechanisms known to be involved in STDP.

Journal ArticleDOI
Alan Beggs1
TL;DR: This paper examines the convergence of payoffs and strategies in Erev and Roth's model of reinforcement learning and shows that it guarantees that the lim sup of a player's average payoffs is at least his minmax payoff.

Journal ArticleDOI
TL;DR: This paper examines methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy using the Bellman approximation error as an optimization criterion.
Abstract: Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis function is important for the success of the algorithm. In the present paper we examine methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis function while simultaneously adapting the (non-linear) basis function parameters. We present two algorithms for this problem. The first uses a gradient-based approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
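
The gradient-based variant can be sketched as joint gradient descent on the empirical Bellman error with respect to both the linear weights and the Gaussian RBF centers, over a fixed batch of transitions gathered under the fixed policy; the Cross-Entropy variant is omitted, and the batch format, bandwidth, and step sizes are assumptions.

```python
# Gradient-based sketch of basis adaptation for policy evaluation: V(s) = sum_i w_i * phi_i(s)
# with Gaussian RBF features, and joint gradient descent on the mean Bellman error
#   E = mean_t ( r_t + gamma * V(s_{t+1}) - V(s_t) )^2
# with respect to both the weights w and the RBF centers. The paper's Cross-Entropy variant
# is omitted; batch format, bandwidth, and step sizes are illustrative assumptions.
import numpy as np

def rbf(S, centers, bandwidth):
    d2 = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)                 # (T, n_basis)

def adapt_basis(S, R, S_next, n_basis=10, gamma=0.95, bandwidth=0.5,
                lr_w=1e-2, lr_c=1e-3, iters=2000):
    dim = S.shape[1]
    centers = S[np.random.choice(len(S), n_basis)] + 0.01 * np.random.randn(n_basis, dim)
    w = np.zeros(n_basis)
    for _ in range(iters):
        phi, phi_next = rbf(S, centers, bandwidth), rbf(S_next, centers, bandwidth)
        delta = R + gamma * phi_next @ w - phi @ w            # Bellman residuals, (T,)
        # gradient of the mean squared residual w.r.t. w
        grad_w = 2 * (gamma * phi_next - phi).T @ delta / len(S)
        # gradient w.r.t. each center via d phi_i(s)/d c_i = phi_i(s) * (s - c_i) / bandwidth^2
        g_next = (phi_next * delta[:, None])[..., None] * (S_next[:, None, :] - centers[None])
        g_cur = (phi * delta[:, None])[..., None] * (S[:, None, :] - centers[None])
        grad_c = 2 * w[:, None] * (gamma * g_next - g_cur).sum(0) / bandwidth ** 2 / len(S)
        w -= lr_w * grad_w
        centers -= lr_c * grad_c
    return w, centers
```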

Journal ArticleDOI
TL;DR: This work considers Q-learning with function approximation for this setting and derives an upper bound on the generalization error in terms of quantities minimized by a Q- Learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q- learning and the goal of learning a policy that maximizes the value function.
Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.

Journal ArticleDOI
TL;DR: Encouraging results are provided that show the potential of RL for application to agent-based production scheduling in manufacturing systems, including three example cases in which the best dispatching rules have been previously defined.

Book ChapterDOI
28 Jan 2005

Journal ArticleDOI
TL;DR: This work considers the behavior of value-based learning agents in the multi-agent multi-armed bandit problem, and shows that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached.
Abstract: The single-agent multi-armed bandit problem can be solved by an agent that learns the values of each action using reinforcement learning. However, the multi-agent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behavior of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached. We introduce a particular value-based learning algorithm, which we call individual Q-learning, and use stochastic approximation to study the asymptotic behavior, showing that strategies will converge to a Nash distribution almost surely in 2-player zero-sum games and 2-player partnership games. Player-dependent learning rates are then considered, and it is shown that this extension converges in some games for which many algorithms, including the basic algorithm initially considered, fail to converge.
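
Individual Q-learning with a smooth (Boltzmann) best response can be demonstrated in a 2-player zero-sum game such as matching pennies, where empirical play settles near the 50/50 Nash distribution; the temperature and learning rate below are illustrative assumptions.

```python
# Individual Q-learning with a Boltzmann (smooth best response) policy in matching pennies.
# Each player only tracks Q-values for its own two actions and updates the played action
# from its own received payoff; empirical play settles near the 50/50 Nash distribution.
# Temperature and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
payoff_row = np.array([[1, -1], [-1, 1]])       # matching pennies; the column player gets -payoff

def softmax(q, temperature):
    z = np.exp(q / temperature)
    return z / z.sum()

def individual_q_learning(rounds=50000, alpha=0.05, temperature=0.5):
    q_row, q_col = np.zeros(2), np.zeros(2)
    counts = np.zeros((2, 2))
    for _ in range(rounds):
        p_row, p_col = softmax(q_row, temperature), softmax(q_col, temperature)
        a_row = rng.choice(2, p=p_row)
        a_col = rng.choice(2, p=p_col)
        r = payoff_row[a_row, a_col]
        q_row[a_row] += alpha * (r - q_row[a_row])      # each player updates only its own action
        q_col[a_col] += alpha * (-r - q_col[a_col])
        counts[a_row, a_col] += 1
    return counts / rounds

print(individual_q_learning())   # empirical joint frequencies, roughly 0.25 each
```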

Journal ArticleDOI
TL;DR: In this article, the effect of the feedback mechanism on route-choice decision-making under uncertainty was discussed, and the experimental results were compared to the predictions of two static models (random utility maximization and cumulative prospect theory) and two dynamic models (stochastic fictitious play and reinforcement learning).
Abstract: This paper discusses the effect of the feedback mechanism on route-choice decision-making under uncertainty. Recent ITS (intelligent transportation systems) applications have highlighted the need for better models of the behavioral processes involved in travel decisions. However, travel behavior, and specifically route-choice decision-making, is usually modeled using normative models instead of descriptive models. Common route-choice models are based on the assumption of utility maximization. In this work, route-choice laboratory experiments and computer simulations were conducted in order to analyze route-choice behavior in iterative tasks with immediate feedback. The experimental results were compared to the predictions of two static models (random utility maximization and cumulative prospect theory) and two dynamic models (stochastic fictitious play and reinforcement learning). Based on the experimental results, it is shown that the higher the variance in travel times, the lower is the travelers’ sensitivity to travel time differences. These results are in conflict with the paradigm about travel time variability and risk-taking behavior. The empirical results may be explained by the payoff variability effect: high payoff variability seems to move choice behavior toward random choice.

Book
01 Jan 2005
TL;DR: This book presents an analysis of the XCS Classifier System and its applications in binary classification problems, reinforcement learning problems, and cognitive learning classifier systems.
Abstract: Contents: Prerequisites; Simple Learning Classifier Systems; The XCS Classifier System; How XCS Works: Ensuring Effective Evolutionary Pressures; When XCS Works: Towards Computational Complexity; Effective XCS Search: Building Block Processing; XCS in Binary Classification Problems; XCS in Multi-Valued Problems; XCS in Reinforcement Learning Problems; Facetwise LCS Design; Towards Cognitive Learning Classifier Systems; Summary and Conclusions.

Journal ArticleDOI
01 May 2005
TL;DR: This work evaluates an implementation of CRL in a routing protocol for mobile ad hoc networks, called SAMPLE, and shows how feedback in the selection of links by routing agents enables SAMPLE to adapt and optimize its routing behavior to varying network conditions and properties, resulting in optimization of network throughput.
Abstract: Designers face many system optimization problems when building distributed systems. Traditionally, designers have relied on optimization techniques that require either prior knowledge or centrally managed runtime knowledge of the system's environment, but such techniques are not viable in dynamic networks where topology, resource, and node availability are subject to frequent and unpredictable change. To address this problem, we propose collaborative reinforcement learning (CRL) as a technique that enables groups of reinforcement learning agents to solve system optimization problems online in dynamic, decentralized networks. We evaluate an implementation of CRL in a routing protocol for mobile ad hoc networks, called SAMPLE. Simulation results show how feedback in the selection of links by routing agents enables SAMPLE to adapt and optimize its routing behavior to varying network conditions and properties, resulting in optimization of network throughput. In the experiments, SAMPLE displays emergent properties such as traffic flows that exploit stable routes and reroute around areas of wireless interference or congestion. SAMPLE is an example of a complex adaptive distributed system.
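
The flavor of learned routing can be conveyed with a classic Q-routing-style update, in which each node keeps an estimated delivery time per (destination, neighbor) and updates it from the chosen neighbor's advertised best estimate plus the experienced link delay. This is a simple stand-in for SAMPLE's collaborative reinforcement learning, not the protocol itself; the topology, delays, and exploration scheme below are assumptions.

```python
# Q-routing-style sketch of learned packet routing: each node keeps Q[dest][neighbor],
# an estimate of remaining delivery time via that neighbor, and updates it from the chosen
# neighbor's advertised best estimate plus the experienced link delay. A classic stand-in
# for the flavor of SAMPLE's collaborative reinforcement learning, not the protocol itself;
# topology, delays, and epsilon-greedy exploration are illustrative assumptions.
import random
from collections import defaultdict

class QRoutingNode:
    def __init__(self, name, neighbors, alpha=0.5, epsilon=0.1):
        self.name, self.neighbors = name, neighbors
        self.alpha, self.epsilon = alpha, epsilon
        self.q = defaultdict(lambda: {n: 1.0 for n in neighbors})   # Q[dest][neighbor]

    def best_estimate(self, dest):
        return 0.0 if dest == self.name else min(self.q[dest].values())

    def choose_next_hop(self, dest):
        if random.random() < self.epsilon:
            return random.choice(self.neighbors)
        return min(self.q[dest], key=self.q[dest].get)

    def update(self, dest, neighbor, link_delay, neighbor_estimate):
        target = link_delay + neighbor_estimate
        self.q[dest][neighbor] += self.alpha * (target - self.q[dest][neighbor])

def route_packet(nodes, links, src, dest, max_hops=50):
    """nodes: dict name -> QRoutingNode; links[(a, b)] -> current delay of link a -> b."""
    current, hops = src, 0
    while current != dest and hops < max_hops:
        node = nodes[current]
        nxt = node.choose_next_hop(dest)
        delay = links[(current, nxt)]
        node.update(dest, nxt, delay, nodes[nxt].best_estimate(dest))
        current, hops = nxt, hops + 1
    return hops
```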

Journal ArticleDOI
TL;DR: An architectural modification to Soar is described that gives a Soar agent the opportunity to learn statistical information about the past success of its actions and utilize this information when selecting an operator.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior---rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information).
Abstract: We present an efficient "sparse sampling" technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making while controlling computational cost. The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior---rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information). The outcome is a flexible, practical technique for improving action selection in simple reinforcement learning scenarios.
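
The Bayesian lookahead idea is easiest to see for Bernoulli arms with Beta posteriors: score each action by its expected immediate reward plus the value of the updated belief state, expanded to a fixed depth over posterior-predictive outcomes. With only two outcomes per arm the branches are enumerated exactly below rather than sampled sparsely as in the paper; priors and depth are illustrative assumptions.

```python
# Bayesian lookahead for Bernoulli bandits: each arm has a Beta(alpha, beta) posterior, and an
# action is scored by its expected immediate reward plus the value of the resulting belief
# state, expanded to a fixed depth over posterior-predictive outcomes. With two outcomes per
# arm the branches are enumerated exactly here rather than sampled sparsely as in the paper;
# priors and depth are illustrative assumptions.

def belief_value(belief, depth):
    """belief: tuple of (alpha, beta) pairs, one per arm; depth-limited lookahead value."""
    if depth == 0:
        return 0.0
    return max(action_value(belief, a, depth) for a in range(len(belief)))

def action_value(belief, arm, depth):
    alpha, beta = belief[arm]
    p_success = alpha / (alpha + beta)              # posterior-predictive success probability
    # successor beliefs after observing a success / failure on this arm
    b_succ = belief[:arm] + ((alpha + 1, beta),) + belief[arm + 1:]
    b_fail = belief[:arm] + ((alpha, beta + 1),) + belief[arm + 1:]
    return (p_success * (1.0 + belief_value(b_succ, depth - 1))
            + (1 - p_success) * belief_value(b_fail, depth - 1))

def bayes_lookahead_action(belief, depth=3):
    return max(range(len(belief)), key=lambda a: action_value(belief, a, depth))

# Usage: start from uniform Beta(1, 1) priors, act, then update the chosen arm's counts.
belief = ((1, 1), (1, 1), (1, 1))
print(bayes_lookahead_action(belief, depth=3))
```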

Proceedings ArticleDOI
07 Aug 2005
TL;DR: This paper considers the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance, and proposes a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it.
Abstract: The current framework of reinforcement learning is based on maximizing the expected returns based on scalar rewards. But in many real world situations, tradeoffs must be made among multiple objectives. Moreover, the agent's preferences between different objectives may vary with time. In this paper, we consider the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance. We propose a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it. The idea is that although there are infinitely many weight vectors, they may be well-covered by a small number of optimal policies. We show this empirically in two domains: a version of the Buridan's ass problem and network routing.
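
The storage-and-selection step can be illustrated directly: keep each learned policy together with its vector of per-objective expected returns, and for a new weight vector return the stored policy maximizing the weighted sum as the starting point for further improvement. The types and the improvement hook below are assumptions.

```python
# Sketch of the storage-and-selection step for time-varying objective weights: each stored
# policy carries its estimated value vector (one expected return per objective), and for a
# new weight vector we return the stored policy with the best weighted value as the starting
# point for further improvement. Types and the improvement hook are illustrative assumptions.
import numpy as np

class PolicyLibrary:
    def __init__(self):
        self.entries = []                      # list of (policy, value_vector) pairs

    def add(self, policy, value_vector):
        self.entries.append((policy, np.asarray(value_vector, dtype=float)))

    def select(self, weights):
        """Return the stored policy maximizing weights . value_vector."""
        weights = np.asarray(weights, dtype=float)
        scores = [weights @ v for _, v in self.entries]
        return self.entries[int(np.argmax(scores))][0]

# Usage: library.add(pi_a, [10.0, 2.0]); library.add(pi_b, [4.0, 9.0]);
# library.select([0.2, 0.8]) returns pi_b, which the learner can then improve for these weights.
```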