
Showing papers on "Reinforcement learning" published in 2001


BookDOI
01 Jan 2001
TL;DR: This book presents the first comprehensive treatment of Monte Carlo techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection.
Abstract: Monte Carlo methods are revolutionizing the on-line analysis of data in fields as diverse as financial modeling, target tracking and computer vision. These methods, appearing under the names of bootstrap filters, condensation, optimal Monte Carlo filters, particle filters and survival of the fittest, have made it possible to solve numerically many complex, non-standard problems that were previously intractable. This book presents the first comprehensive treatment of these techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection, computer vision, semiconductor design, population biology, dynamic Bayesian networks, and time series analysis. This will be of great value to students, researchers and practitioners, who have some basic knowledge of probability. Arnaud Doucet received the Ph. D. degree from the University of Paris-XI Orsay in 1997. From 1998 to 2000, he conducted research at the Signal Processing Group of Cambridge University, UK. He is currently an assistant professor at the Department of Electrical Engineering of Melbourne University, Australia. His research interests include Bayesian statistics, dynamic models and Monte Carlo methods. Nando de Freitas obtained a Ph.D. degree in information engineering from Cambridge University in 1999. He is presently a research associate with the artificial intelligence group of the University of California at Berkeley. His main research interests are in Bayesian statistics and the application of on-line and batch Monte Carlo methods to machine learning. Neil Gordon obtained a Ph.D. in Statistics from Imperial College, University of London in 1993. He is with the Pattern and Information Processing group at the Defence Evaluation and Research Agency in the United Kingdom. His research interests are in time series, statistical data analysis, and pattern recognition with a particular emphasis on target tracking and missile guidance.

6,574 citations
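
The sequential Monte Carlo machinery the book surveys is easiest to see in the bootstrap filter. Below is a minimal sketch for a hypothetical 1-D tracking problem; the random-walk dynamics, Gaussian observation model, and all names are illustrative assumptions, not an example taken from the book.

import numpy as np

rng = np.random.default_rng(0)

def transition(x):
    # assumed random-walk dynamics
    return x + rng.normal(0.0, 1.0, size=x.shape)

def obs_loglik(y, x, sigma=1.0):
    # assumed Gaussian observation model: y = x + noise
    return -0.5 * ((y - x) / sigma) ** 2

def bootstrap_filter(ys, n_particles=500):
    particles = rng.normal(0.0, 5.0, size=n_particles)   # diffuse prior
    means = []
    for y in ys:
        particles = transition(particles)                 # propagate
        logw = obs_loglik(y, particles)                   # weight by likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * particles))               # filtered estimate
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]                        # resample ("survival of the fittest")
    return np.array(means)

# usage: track a hidden random walk from noisy observations
true_x = np.cumsum(rng.normal(size=50))
ys = true_x + rng.normal(size=50)
print(bootstrap_filter(ys)[:5])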


Proceedings Article
04 Aug 2001
TL;DR: R-MAX as mentioned in this paper is a model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time, where the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model.
Abstract: R-MAX is a simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate, model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, the model is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the "optimism under uncertainty" bias used in many RL algorithms. (4) It is much simpler and more general than Brafman and Tennenholtz's LSG algorithm for learning in single-controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) It is the only algorithm for near-optimal learning in repeated games known to be polynomial, providing a much simpler and more efficient alternative to previous algorithms by Banos and by Megiddo.

923 citations
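
As a rough illustration of the optimistic-initialization idea, here is a toy tabular sketch in the spirit of R-MAX. It uses a discounted criterion and full value-iteration replanning for simplicity, whereas the paper analyzes average reward over stochastic games; the class name, the known-ness threshold m, and the other constants are assumptions for illustration only.

import numpy as np
from collections import defaultdict

class RMaxSketch:
    def __init__(self, n_states, n_actions, r_max=1.0, m=10, gamma=0.95):
        self.nS, self.nA, self.r_max, self.m, self.gamma = n_states, n_actions, r_max, m, gamma
        self.counts = defaultdict(int)                       # visits to (s, a)
        self.trans = defaultdict(lambda: defaultdict(int))   # (s, a) -> next-state counts
        self.rew_sum = defaultdict(float)

    def update(self, s, a, r, s2):
        if self.counts[(s, a)] < self.m:                     # only learn until (s, a) is "known"
            self.counts[(s, a)] += 1
            self.trans[(s, a)][s2] += 1
            self.rew_sum[(s, a)] += r

    def plan(self, n_iter=200):
        # Optimistic model: unknown (s, a) pairs keep the maximal value forever.
        Q = np.full((self.nS, self.nA), self.r_max / (1 - self.gamma))
        for _ in range(n_iter):
            V = Q.max(axis=1)
            for s in range(self.nS):
                for a in range(self.nA):
                    c = self.counts[(s, a)]
                    if c >= self.m:                          # known: use the empirical model
                        r = self.rew_sum[(s, a)] / c
                        ev = sum(n / c * V[s2] for s2, n in self.trans[(s, a)].items())
                        Q[s, a] = r + self.gamma * ev
        return Q

    def act(self, s):
        # replan from the current model (costly, but fine for a sketch)
        return int(self.plan().argmax(axis=1)[s])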


Journal ArticleDOI
TL;DR: The need for motor learning, what is learned and how it is represented, and the mechanisms of learning are explored, relating these computational issues to empirical studies on motor learning in humans.

665 citations


Journal ArticleDOI
TL;DR: In this article, a generic online learning control system based on the fundamental principle of reinforcement learning, or more specifically neural dynamic programming, is presented, with a systematic treatment for developing such a control system.
Abstract: This paper focuses on a systematic treatment for developing a generic online learning control system based on the fundamental principle of reinforcement learning or, more specifically, neural dynamic programming. This online learning system improves its performance over time in two aspects: 1) it learns from its own mistakes through the reinforcement signal from the external environment and tries to reinforce its actions to improve future performance; and 2) system states associated with positive reinforcement are memorized through a network learning process so that, in the future, similar states will be more strongly associated with a control action leading to a positive reinforcement. A successful candidate online learning control design is introduced. Real-time learning algorithms are derived for the individual components in the learning system. Some analytical insight is provided to give guidelines on the learning process taking place in each module of the online learning control system.

634 citations
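
A bare-bones sketch of the critic/action structure described above, with linear function approximators standing in for the paper's neural networks: the critic's temporal-difference error acts as the internal reinforcement signal that adjusts the action-selection weights. All names and step sizes are assumptions, not the authors' design.

import numpy as np

class LinearActorCritic:
    def __init__(self, n_features, n_actions, alpha_v=0.1, alpha_pi=0.01, gamma=0.95):
        self.w = np.zeros(n_features)                    # critic weights
        self.theta = np.zeros((n_actions, n_features))   # action-network weights
        self.alpha_v, self.alpha_pi, self.gamma = alpha_v, alpha_pi, gamma

    def policy(self, phi):
        prefs = self.theta @ phi                         # softmax over action preferences
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    def act(self, phi, rng):
        return int(rng.choice(len(self.theta), p=self.policy(phi)))

    def step(self, phi, a, r, phi_next, done):
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        delta = r + self.gamma * v_next - v              # TD error = internal reinforcement
        self.w += self.alpha_v * delta * phi             # critic update
        p = self.policy(phi)
        grad = -np.outer(p, phi)                         # softmax log-gradient
        grad[a] += phi
        self.theta += self.alpha_pi * delta * grad       # action-network update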


Journal ArticleDOI
TL;DR: In this article, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies is proposed.
Abstract: Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1] (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.

587 citations
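
The core of the estimator is small enough to sketch: an eligibility trace of score functions discounted by β, with the product r_{t+1} z_{t+1} averaged over time, so storage is on the order of the number of policy parameters. The env and policy interfaces below are assumptions; only the trace-and-average structure follows the algorithm described above.

import numpy as np

def gpomdp_gradient(env, policy, grad_log_pi, theta, beta=0.9, T=10_000):
    """Return a biased estimate of the gradient of the average reward."""
    z = np.zeros_like(theta)       # eligibility trace of score functions
    delta = np.zeros_like(theta)   # running average of r_{t+1} * z_{t+1}
    obs = env.reset()
    for t in range(1, T + 1):
        a = policy(theta, obs)                      # sample action from the parameterized policy
        z = beta * z + grad_log_pi(theta, obs, a)   # discount old credit by beta
        obs, r = env.step(a)                        # observe next observation and reward
        delta += (r * z - delta) / t                # incremental average
    return delta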


Proceedings ArticleDOI
14 Oct 2001
TL;DR: This work considers the free-rider problem that arises in peer-to-peer file sharing networks such as Napster, namely that individual users are provided with no incentive for adding value to the network; it constructs a formal game-theoretic model of the system and analyzes equilibria of user strategies under several novel payment mechanisms.
Abstract: We consider the free-rider problem that arises in peer-to-peer file sharing networks such as Napster: the problem that individual users are provided with no incentive for adding value to the network. We examine the design implications of the assumption that users will selfishly act to maximize their own rewards, by constructing a formal game theoretic model of the system and analyzing equilibria of user strategies under several novel payment mechanisms. We support and extend upon our theoretical predictions with experimental results from a multi-agent reinforcement learning model.

561 citations


Proceedings Article
28 Jun 2001
TL;DR: This paper presents a method by which a reinforcement learning agent can automatically discover certain types of subgoals online and is able to accelerate learning on the current task and to transfer its expertise to other, related tasks through the reuse of its ability to attain subgoals.
Abstract: This paper presents a method by which a reinforcement learning agent can automatically discover certain types of subgoals online. By creating useful new subgoals while learning, the agent is able to accelerate learning on the current task and to transfer its expertise to other, related tasks through the reuse of its ability to attain subgoals. The agent discovers subgoals based on commonalities across multiple paths to a solution. We cast the task of finding these commonalities as a multiple-instance learning problem and use the concept of diverse density to find solutions. We illustrate this approach using several gridworld tasks.

496 citations
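
A crude sketch of the intuition: states that appear on most successful paths but rarely on unsuccessful ones are candidate subgoals (for example, a doorway cell in a two-room gridworld). The frequency ratio below is only a stand-in for the actual diverse-density computation, and all names are hypothetical.

from collections import Counter

def candidate_subgoals(successful_paths, failed_paths, top_k=3):
    # count in how many successful / failed trajectories each state appears
    pos = Counter(s for path in successful_paths for s in set(path))
    neg = Counter(s for path in failed_paths for s in set(path))
    n_pos, n_neg = len(successful_paths), max(len(failed_paths), 1)
    # high score: common across solutions, rare in failures (proxy for diverse density)
    scores = {s: (pos[s] / n_pos) * (1.0 - neg[s] / n_neg) for s in pos}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]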


Journal ArticleDOI
Michael L. Littman1
TL;DR: A set of reinforcement-learning algorithms based on estimating value functions and convergence theorems for these algorithms are described and presented in a way that makes it easy to reason about the behavior of simultaneous learners in a shared environment.

404 citations


Journal ArticleDOI
TL;DR: It is demonstrated how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs.
Abstract: We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision-making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.

396 citations
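
A stripped-down sketch of the recurrent trader structure described above: the position F_t = tanh(w·x_t) feeds back the previous position, each step's trading return charges a transaction cost for changing position, and the weights follow gradient ascent. For brevity this version maximizes cumulative profit rather than the differential Sharpe ratio the paper optimizes, so the window size, learning rate, and cost level are illustrative assumptions only.

import numpy as np

def rrl_train(returns, window=8, lr=0.01, cost=0.001, epochs=100):
    returns = np.asarray(returns, dtype=float)
    w = np.zeros(window + 2)                           # weights for [bias, past returns, F_{t-1}]
    for _ in range(epochs):
        F_prev, dF_prev = 0.0, np.zeros_like(w)
        grad, profit = np.zeros_like(w), 0.0
        for t in range(window, len(returns)):
            x = np.concatenate(([1.0], returns[t - window:t], [F_prev]))
            F = np.tanh(w @ x)                         # position in [-1, 1]
            dF = (1 - F ** 2) * (x + w[-1] * dF_prev)  # recurrent gradient dF_t/dw
            # trading return: hold F_prev over step t, pay for changing the position
            profit += F_prev * returns[t] - cost * abs(F - F_prev)
            grad += returns[t] * dF_prev - cost * np.sign(F - F_prev) * (dF - dF_prev)
            F_prev, dF_prev = F, dF
        w += lr * grad                                 # gradient ascent on cumulative profit
    return w, profit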


Journal ArticleDOI
TL;DR: Relational reinforcement learning (RRL), as presented in this paper, is a learning technique that combines reinforcement learning with relational learning or inductive logic programming and can potentially be applied to a new range of learning tasks.
Abstract: Relational reinforcement learning is presented, a learning technique that combines reinforcement learning with relational learning or inductive logic programming. Due to the use of a more expressive representation language to represent states, actions and Q-functions, relational reinforcement learning can be potentially applied to a new range of learning tasks. One such task that we investigate is planning in the blocks world, where it is assumed that the effects of the actions are unknown to the agent and the agent has to learn a policy. Within this simple domain we show that relational reinforcement learning solves some existing problems with reinforcement learning. In particular, relational reinforcement learning allows us to employ structural representations, to abstract from specific goals pursued and to exploit the results of previous learning phases when addressing new (more complex) situations.

395 citations


Proceedings Article
28 Jun 2001
TL;DR: The first algorithm for off-policy temporal-difference learning that is stable with linear function approximation is introduced and it is proved that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation to the action-value function for an arbitrary target policy.
Abstract: We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.

Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet for which the approximate values found by Q-learning diverge to infinity. This problem prompted the development of residual gradient methods (Baird, 1995), which are stable but much slower than Q-learning, and fitted value iteration (Gordon, 1995, 1999), which is also stable but limited to restricted, weaker-than-linear function approximators. Of course, Q-learning has been used with linear function approximation since its invention (Watkins, 1989), often with good results, but the soundness of this approach is no longer an open question. There exist non-pathological Markov decision processes for which it diverges; it is absolutely unsound in this sense. A sensible response is to turn to some of the other reinforcement learning methods, such as Sarsa, that are also efficient and for which soundness remains a possibility. An important distinction here is between methods that must follow the policy they are learning about, called on-policy methods, and those that can learn from behavior generated by a different policy, called off-policy methods. Q-learning is an off-policy method in that it learns the optimal policy even when actions are selected according to a more exploratory or even random policy. Q-learning requires only that all actions be tried in all states, whereas on-policy methods like Sarsa require that they be selected with specific probabilities. Although the off-policy capability of Q-learning is appealing, it is also the source of at least part of its instability problems. For example, in one version of Baird's counterexample, the TD(λ) algorithm, which underlies both Q-learning and Sarsa, is applied with linear function approximation to learn the action-value function Qπ for a given policy π.

Operating in an on-policy mode, updating state–action pairs according to the same distribution with which they would be experienced under π, this method is stable and convergent near the best possible solution (Tsitsiklis and Van Roy, 1997; Tadic, 2001). However, if state–action pairs are updated according to a different distribution, say that generated by following the greedy policy, then the estimated values again diverge to infinity. This and related counterexamples suggest that at least some of the reason for the instability of Q-learning is that it is an off-policy method; they also make it clear that this part of the problem can be studied in a purely policy-evaluation context. Despite these problems, there remains substantial reason for interest in off-policy learning methods. Several researchers have argued for an ambitious extension of reinforcement learning ideas into modular, multi-scale, and hierarchical architectures (Sutton, Precup & Singh, 1999; Parr, 1998; Parr & Russell, 1998; Dietterich, 2000). These architectures rely on off-policy learning to learn about multiple subgoals and multiple ways of behaving from the singular stream of experience. For these approaches to be feasible, some efficient way of combining off-policy learning and function approximation must be found. Because the problems with current off-policy methods become apparent in a policy-evaluation setting, it is there that we focus in this paper. In previous work we considered multi-step off-policy policy evaluation in the tabular case. In this paper we introduce the first off-policy policy evaluation method consistent with linear function approximation. Our mathematical development focuses on the episodic case, and in fact on a single episode. Given a starting state and action, we show that the expected off-policy update under our algorithm is the same as the expected on-policy update under conventional TD(λ). This, together with some variance conditions, allows us to prove convergence and bounds on the error in the asymptotic approximation identical to those obtained by Tsitsiklis and Van Roy (1997; Bertsekas and Tsitsiklis, 1996).

1. Notation and Main Result. We consider the standard episodic reinforcement learning framework (see, e.g., Sutton & Barto, 1998) in which a learning agent interacts with a Markov decision process (MDP). Our notation focuses on a single episode of T time steps, s0, a0, r1, s1, a1, r2, . . . , rT, sT, with states st ∈ S, actions at ∈ A, and rewards rt ∈ ℝ. We take the initial state and action, s0 and a0, to be given arbitrarily. Given a state and action, st and at, the next reward, rt+1, is a random variable with mean r̄(st, at), and the next state, st+1, is chosen with probabilities p(st+1 | st, at). The final state is a special terminal state that may not occur on any preceding time step. Given a state st, 0 < t < T, the action at is selected according to probability π(st, at) or b(st, at), depending on whether policy π or policy b is in force. We always use π to denote the target policy, the policy that we are learning about. In the on-policy case, π is also used to generate the actions of the episode. In the off-policy case, the actions are instead generated by b, which we call the behavior policy. In either case, we seek an approximation to the action-value function Q : S × A → ℝ for the target policy π: Q(s, a) = Eπ{ rt+1 + γ rt+2 + · · · + γ^(T−t−1) rT | st = s, at = a }, where 0 ≤ γ ≤ 1 is a discount-rate parameter. We consider approximations that are linear in a set of feature vectors {φsa}, s ∈ S, a ∈ A: Q(s, a) ≈ θ⊤φsa = Σi θ(i) φsa(i).
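
For intuition only, here is a much-simplified sketch of importance-sampling-corrected TD(λ) over state–action pairs with linear features: each update is weighted by the product of ratios π(a|s)/b(a|s) accumulated so far in the episode. The paper's actual algorithm applies the corrections through the eligibility trace more carefully (and adds variance-reduction variants); every interface name here is an assumption.

import numpy as np

def off_policy_td_lambda(episode, features, pi, b, theta, alpha=0.05, gamma=1.0, lam=0.8):
    """episode: list of (s, a, r, s_next, a_next, done) transitions generated by behavior policy b."""
    e = np.zeros_like(theta)   # eligibility trace over features
    rho = 1.0                  # running product of importance-sampling ratios
    for (s, a, r, s2, a2, done) in episode:
        rho *= pi(s, a) / b(s, a)
        phi = features(s, a)
        phi2 = np.zeros_like(theta) if done else features(s2, a2)
        delta = r + gamma * theta @ phi2 - theta @ phi   # TD error on action values
        e = gamma * lam * e + phi
        theta += alpha * rho * delta * e                 # update weighted by the IS correction
    return theta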

Proceedings Article
04 Aug 2001
TL;DR: This paper introduces two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence, and contributes a new learning algorithm, WoLF policy hill-climbing, that is proven to be rational.
Abstract: This paper investigates the problem of policy learning in multiagent environments using the stochastic game framework, which we briefly overview. We introduce two properties as desirable for a learning agent when in the presence of other learning agents, namely rationality and convergence. We examine existing reinforcement learning algorithms according to these two properties and notice that they fail to simultaneously meet both criteria. We then contribute a new learning algorithm, WoLF policy hill-climbing, that is based on a simple principle: “learn quickly while losing, slowly while winning.” The algorithm is proven to be rational and we present empirical results for a number of stochastic games showing that the algorithm converges.
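
A compact sketch of the WoLF ("win or learn fast") principle for a single-state (matrix) game: policy hill-climbing moves probability toward the greedy action with a small step when the current policy does at least as well as the average policy on the learned Q values, and a larger step otherwise. The step sizes and the simple clip-and-renormalize projection are assumptions for illustration.

import numpy as np

class WoLFPHC:
    def __init__(self, n_actions, alpha=0.1, delta_win=0.01, delta_lose=0.04):
        self.Q = np.zeros(n_actions)
        self.pi = np.full(n_actions, 1.0 / n_actions)       # current mixed policy
        self.avg_pi = np.full(n_actions, 1.0 / n_actions)   # running average policy
        self.count = 0
        self.alpha, self.dw, self.dl = alpha, delta_win, delta_lose

    def act(self, rng):
        return int(rng.choice(len(self.pi), p=self.pi))

    def update(self, a, r):
        self.Q[a] += self.alpha * (r - self.Q[a])           # stateless Q update
        self.count += 1
        self.avg_pi += (self.pi - self.avg_pi) / self.count
        winning = self.pi @ self.Q >= self.avg_pi @ self.Q
        delta = self.dw if winning else self.dl             # learn fast while losing
        n = len(self.pi)
        best = int(self.Q.argmax())
        self.pi -= delta / (n - 1)                          # shrink all actions a little
        self.pi[best] += delta * n / (n - 1)                # net +delta on the greedy action
        self.pi = np.clip(self.pi, 0.0, None)
        self.pi /= self.pi.sum()                            # project back onto the simplex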

Proceedings ArticleDOI
21 May 2001
TL;DR: This work considers algorithms that evaluate and synthesize controllers under distributions of Markovian models and demonstrates the presented learning control algorithm by flying an autonomous helicopter and shows that the controller learned is robust and delivers good performance in this real-world domain.
Abstract: Many control problems in the robotics field can be cast as partially observed Markovian decision problems (POMDPs), an optimal control formalism. Finding optimal solutions to such problems in general, however, is known to be intractable. It has often been observed that in practice, simple structured controllers suffice for good sub-optimal control, and recent research in the artificial intelligence community has focused on policy search methods as techniques for finding sub-optimal controllers when such structured controllers do exist. Traditional model-based reinforcement learning algorithms make a certainty-equivalence assumption on their learned models and calculate optimal policies for a maximum-likelihood Markovian model. We consider algorithms that evaluate and synthesize controllers under distributions of Markovian models. Previous work has demonstrated that algorithms that maximize mean reward with respect to model uncertainty lead to safer and more robust controllers. We briefly consider other performance criteria that emphasize robustness and exploration in the search for controllers, and note the relation with experiment design and active learning. To validate the power of the approach on a robotic application we demonstrate the presented learning control algorithm by flying an autonomous helicopter. We show that the controller learned is robust and delivers good performance in this real-world domain.
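
The central idea lends itself to a very small sketch: rather than scoring a candidate controller on a single maximum-likelihood model, score it by its mean return over models sampled from a distribution over Markovian dynamics, and let policy search optimize that estimate. The sample_model and simulate interfaces below are hypothetical.

import numpy as np

def mean_reward_under_model_uncertainty(controller, sample_model, simulate,
                                        n_models=50, horizon=200, rng=None):
    rng = rng or np.random.default_rng(0)
    returns = []
    for _ in range(n_models):
        model = sample_model(rng)                     # draw dynamics from the model posterior
        returns.append(simulate(model, controller, horizon, rng))
    return float(np.mean(returns))                    # criterion the policy search maximizes

# policy search then picks controller parameters that maximize this estimate,
# e.g. via random search or any gradient-free optimizer.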

Journal ArticleDOI
TL;DR: This paper proposes a hierarchical reinforcement learning architecture that realizes practical learning speed in real hardware control tasks and applies it to a three-link, two-joint robot for the task of learning to stand up by trial and error.

Book ChapterDOI
TL;DR: This work constructs a formal game theoretic model of the peer-to-peer file sharing system, analyzing equilibria of user strategies under several novel payment mechanisms, and supports and extends this work with results from experiments with a multi-agent reinforcement learning model.
Abstract: We consider the free-rider problem in peer-to-peer file sharing networks such as Napster: that individual users are provided with no incentive for adding value to the network. We examine the design implications of the assumption that users will selfishly act to maximize their own rewards, by constructing a formal game theoretic model of the system and analyzing equilibria of user strategies under several novel payment mechanisms. We support and extend this work with results from experiments with a multi-agent reinforcement learning model.

Proceedings Article
Bram Bakker1
03 Jan 2001
TL;DR: Model-free RL-LSTM using Advantage (λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events.
Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage (λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task.

Proceedings Article
02 Aug 2001
TL;DR: This work incorporates a reward baseline into the learning system, and shows that it affects variance without introducing further bias, and finds the optimal constant reward baseline is equal to the long-term average expected reward.
Abstract: There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
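
The claim above is easy to see in a REINFORCE-style sketch: subtracting a constant baseline from the episode return changes the variance of the update without changing its expectation, and tracking the running average reward gives a baseline close to the variance-minimizing constant one. The environment and policy interfaces are assumptions.

import numpy as np

def policy_gradient_with_baseline(env, policy, grad_log_pi, theta,
                                  episodes=1000, lr=0.01):
    baseline = 0.0
    for k in range(1, episodes + 1):
        obs, done = env.reset(), False
        score, G = np.zeros_like(theta), 0.0
        while not done:
            a = policy(theta, obs)
            score += grad_log_pi(theta, obs, a)        # sum of score functions over the episode
            obs, r, done = env.step(a)
            G += r                                     # episode return
        theta += lr * (G - baseline) * score           # baseline shifts the reward, not the mean
        baseline += (G - baseline) / k                 # running average reward as the baseline
    return theta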

Proceedings Article
Peter Stone1, Richard S. Sutton1
28 Jun 2001
TL;DR: This work describes the application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer and demonstrates the generality of the approach by applying it to a number of task variations.
Abstract: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate, while the takers learn when to charge the ball-holder and when to cover possible passing lanes. Our agents learned policies that significantly out-performed a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
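
The learning machinery described above reduces to linear Sarsa(λ) over sparse binary features. A sketch follows; the env interface and the tiles(obs, action) function (returning indices of active tiles, as in hashed tile coding) are assumptions, and keepaway-specific details such as the SMDP option durations are omitted.

import numpy as np

def sarsa_lambda(env, n_actions, tiles, n_weights, episodes=200,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(n_weights)

    def q(obs, a):
        return theta[tiles(obs, a)].sum()              # sum of active tile weights

    def choose(obs):
        if rng.random() < epsilon:                     # epsilon-greedy exploration
            return int(rng.integers(n_actions))
        return int(np.argmax([q(obs, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros(n_weights)                        # replacing-style eligibility trace
        obs, done = env.reset(), False
        a = choose(obs)
        while not done:
            active = tiles(obs, a)
            e[active] = 1.0
            obs2, r, done = env.step(a)
            delta = r - q(obs, a)
            if not done:
                a2 = choose(obs2)
                delta += gamma * q(obs2, a2)
                obs, a = obs2, a2
            theta += alpha * delta * e                 # TD update on all traced weights
            e *= gamma * lam
    return theta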

Book
20 Jun 2001
TL;DR: This book discusses the foundations of Multi-agent Systems, agent-Based Modelling of Ecosystems for Sustainable Resource Management, and a multi-agent Study of Interethnic Cooperation.
Abstract: Foundations of Multi-agent Systems.- Perspectives on Organizations in Multi-agent Systems.- Multi-agent Infrastructure, Agent Discovery, Middle Agents for Web Services and Interoperation.- Logical Foundations of Agent-Based Computing.- Standardizing Agent Communication.- Standardizing Agent Interoperability: The FIPA Approach.- Distributed Problem Solving and Planning.- Automated Negotiation and Decision Making in Multiagent Environments.- Agents' Advanced Features for Negotiation and Coordination.- Social Behaviour, Meta-reasoning, and Learning.- Towards Heterogeneous Agent Teams.- Social Knowledge in Multi-agent Systems.- Machine Learning and Inductive Logic Programming for Multi-agent Systems.- Relational Reinforcement Learning.- From Statistics to Emergence: Exercises in Systems Modularity.- Emotions and Agents.- Applications.- Multi-agent Coordination and Control Using Stigmergy Applied to Manufacturing Control.- Virtual Enterprise Modeling and Support Infrastructures: Applying Multi-agent System Approaches.- Specialised Agent Applications.- Agent-Based Modelling of Ecosystems for Sustainable Resource Management.- Cooperating Physical Robots: A Lesson in Playing Robotic Soccer.- A Multi-agent Study of Interethnic Cooperation.

Journal ArticleDOI
TL;DR: In simulation experiments, the learning algorithm of the neural network adjusts the fuzzy controller by fine-tuning the form and location of the membership functions, and is found to be successful at constant traffic volumes.

Journal ArticleDOI
TL;DR: It is found that although all methods are able to generate significant in-sample and out-of-sample profits when transaction costs are zero, the genetic algorithm approach is superior for non-zero transaction costs, although none of the methods produce significant profits at realistic transaction costs.
Abstract: We consider strategies which use a collection of popular technical indicators as input and seek a profitable trading rule defined in terms of them. We consider two popular computational learning approaches, reinforcement learning and genetic programming, and compare them to a pair of simpler methods: the exact solution of an appropriate Markov decision problem, and a simple heuristic. We find that although all methods are able to generate significant in-sample and out-of-sample profits when transaction costs are zero, the genetic algorithm approach is superior for non-zero transaction costs, although none of the methods produce significant profits at realistic transaction costs. We also find that there is a substantial danger of overfitting if in-sample learning is not constrained.

Proceedings ArticleDOI
28 May 2001
TL;DR: The application of reinforcement learning is described to allow Cobot to proactively take actions in this complex social environment, and adapt his behavior from multiple sources of human reward.
Abstract: We report on our reinforcement learning work on Cobot, a software agent that resides in the well-known online chat community LambdaMOO. Our initial work on Cobot provided him with the ability to collect social statistics and report them to users in a reactive manner. Here we describe our application of reinforcement learning to allow Cobot to proactively take actions in this complex social environment, and adapt his behavior from multiple sources of human reward. After 5 months of training, Cobot received 3171 reward and punishment events from 254 different LambdaMOO users, and learned nontrivial preferences for a number of users. Cobot modifies his behavior based on his current state in an attempt to maximize reward. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment.

Journal ArticleDOI
TL;DR: A model of unknown functions impinging on intentional actions through a high-level form of (multi-agent) reinforcement learning is sketched, arguing that, in order to reproduce some behaviour, its effects should not necessarily be 'good', i.e. useful for the goal of the agent or of some higher macro-system.

Proceedings ArticleDOI
21 May 2001
TL;DR: This work describes the use of task primitives in robot learning from observation; a framework is developed that uses observed data to initially learn a task, and the agent then goes on to increase its performance through repeated task performance (learning from practice).
Abstract: This paper describes the use of task primitives in robot learning from observation. A framework is developed that uses observed data to initially learn a task, and the agent then goes on to increase its performance through repeated task performance (learning from practice). Data that is collected while the human performs a task is parsed into small parts of the task called primitives. Modules are created for each primitive that encode the movements required during the performance of the primitive, and when and where the primitives are performed. The feasibility of this method is currently being tested with agents that learn to play a virtual and an actual air hockey game.

Proceedings ArticleDOI
28 May 2001
TL;DR: This paper applies this hierarchical multi-agent reinforcement learning algorithm to a complex AGV scheduling task and compares its performance and speed with other learning approaches, including flat multi- agent, single agent using MAXQ, selfish multiple agents usingMAXQ, as well as several well-known AGV heuristics like "first come first serve", "highest queue first" and "nearest station first".
Abstract: In this paper we investigate the use of hierarchical reinforcement learning to speed up the acquisition of cooperative multi-agent tasks. We extend the MAXQ framework to the multi-agent case. Each agent uses the same MAXQ hierarchy to decompose a task into sub-tasks. Learning is decentralized, with each agent learning three interrelated skills: how to perform subtasks, which order to do them in, and how to coordinate with other agents. Coordination skills among agents are learned by using joint actions at the highest level(s) of the hierarchy. The Q nodes at the highest level(s) of the hierarchy are configured to represent the joint task-action space among multiple agents. In this approach, each agent only knows what other agents are doing at the level of sub-tasks, and is unaware of lower level (primitive) actions. This hierarchical approach allows agents to learn coordination faster by sharing information at the level of sub-tasks, rather than attempting to learn coordination taking into account primitive joint state-action values. We apply this hierarchical multi-agent reinforcement learning algorithm to a complex AGV scheduling task and compare its performance and speed with other learning approaches, including flat multi-agent, single agent using MAXQ, selfish multiple agents using MAXQ (where each agent acts independently without communicating with the other agents), as well as several well-known AGV heuristics like "first come first serve", "highest queue first" and "nearest station first". We also compare the tradeoffs in learning speed vs. performance of modeling joint action values at multiple levels in the MAXQ hierarchy.

Journal ArticleDOI
TL;DR: This article describes agent-centered search (also called real-time search or local search) and illustrates this planning paradigm with examples and discusses the design and properties of several agent- centered search methods, focusing on robot exploration and localization.
Abstract: In this article, I describe agent-centered search (also called real-time search or local search) and illustrate this planning paradigm with examples. Agent-centered search methods interleave planning and plan execution and restrict planning to the part of the domain around the current state of the agent, for example, the current location of a mobile robot or the current board position of a game. These methods can execute actions in the presence of time constraints and often have a small sum of planning and execution cost, both because they trade off planning and execution cost and because they allow agents to gather information early in nondeterministic domains, which reduces the amount of planning they have to perform for unencountered situations. These advantages become important as more intelligent systems are interfaced with the world and have to operate autonomously in complex environments. Agent-centered search methods have been applied to a variety of domains, including traditional search, strips-type planning, moving-target search, planning with totally and partially observable Markov decision process models, reinforcement learning, constraint satisfaction, and robot navigation. I discuss the design and properties of several agent-centered search methods, focusing on robot exploration and localization.
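
As a concrete instance of agent-centered search, here is a small LRTA*-style sketch: lookahead is restricted to the successors of the current state, the stored heuristic value of that state is raised toward the best one-step estimate, and the agent immediately executes the chosen move. The successors, cost, and h interfaces are assumptions.

def lrta_star(start, goal, successors, cost, h, max_steps=10_000):
    """successors(s) -> iterable of next states; h is a mutable dict of heuristic values."""
    s, path = start, [start]
    for _ in range(max_steps):
        if s == goal:
            return path
        # one-step lookahead restricted to the neighborhood of the current state
        best_next = min(successors(s), key=lambda s2: cost(s, s2) + h.get(s2, 0.0))
        # value-update ("learning") step: raise h(s) toward the best neighbor estimate
        h[s] = max(h.get(s, 0.0), cost(s, best_next) + h.get(best_next, 0.0))
        s = best_next                                  # commit to the action and move on
        path.append(s)
    return path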

Journal ArticleDOI
TL;DR: In the robot soccer environment, the effectiveness and applicability of modular Q-learning and the uni-vector field method are verified by real experiments using five micro-robots.

Book ChapterDOI
01 Jan 2001
TL;DR: The paper describes the overall approach pursued by the Karlsruhe Brainstormers simulator league team and first empirical results are presented for 2 against 2 attack situations.
Abstract: Our long-term goal is to build a robot soccer team where the decision making part is based completely on Reinforcement Learning (RL) methods. The paper describes the overall approach pursued by the Karlsruhe Brainstormers simulator league team. Main parts of basic decision making are meanwhile solved using RL techniques. On the tactical level, first empirical results are presented for 2 against 2 attack situations.

Journal ArticleDOI
TL;DR: Recent advances in motor control and learning are introduced, namely, the role of the basal ganglia in acquisition of goal-directed behaviors, learning of internal models by the cerebellum, and decomposition of complex tasks by the competition of predictive models.
Abstract: A new theory was postulated that the cerebellum, the basal ganglia, and the cerebral cortex have evolved to implement different kinds of learning algorithms: the cerebellum for supervised learning, the basal ganglia for reinforcement learning, and the cerebral cortex for unsupervised learning. Here, we introduce recent advances in motor control and learning, namely, the role of the basal ganglia in acquisition of goal-directed behaviors, learning of internal models by the cerebellum, and decomposition of complex tasks by the competition of predictive models.

Journal ArticleDOI
01 Feb 2001
TL;DR: It is described how learning neural networks can support an approach in which agents are assigned goals when the simulation starts and then pursue these goals autonomously and adaptively, and it is shown that goal-based learning may be used effectively in this context.
Abstract: We propose that learning agents (LAs) be incorporated into simulation environments in order to model the adaptive behavior of humans. These LAs adapt to specific circumstances and events during the simulation run. They would select tasks to be accomplished among a given set of tasks as the simulation progresses, or synthesize tasks for themselves based on their observations of the environment and on information they may receive from other agents. We investigate an approach in which agents are assigned goals when the simulation starts and then pursue these goals autonomously and adaptively. During the simulation, agents progressively improve their ability to accomplish their goals effectively and safely. Agents learn from their own observations and from the experience of other agents with whom they exchange information. Each LA starts with a given representation of the simulation environment from which it progressively constructs its own internal representation and uses it to make decisions. The paper describes how learning neural networks can support this approach and shows that goal-based learning may be used effectively in this context. An example simulation is presented in which agents represent manned vehicles; they are assigned the goal of traversing a dangerous metropolitan grid safely and rapidly using goal-based reinforcement learning with neural networks, and the approach is compared to three other algorithms.