Journal ArticleDOI

Reinforcement learning for RoboCup soccer keepaway

01 Sep 2005 · Adaptive Behavior (SAGE Publications) · Vol. 13, Iss. 3, pp. 165-188
TL;DR: The application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer results in agents that significantly outperform a range of benchmark policies.
Abstract: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.

Summary (5 min read)

1 Introduction

  • Reinforcement learning (Sutton & Barto, 1998) is a theoretically-grounded machine learning method designed to allow an autonomous agent to maximize its long-term reward via repeated experimentation in, and interaction with, its environment.
  • There is hidden state, meaning that each agent has only a partial world view at any given moment.
  • RoboCup soccer is a large and difficult instance of many of the issues which have been addressed in small, isolated cases in previous reinforcement learning research.

2 Keepaway Soccer

  • The authors consider a subtask of RoboCup soccer, keepaway, in which one team, the keepers, tries to maintain possession of the ball within a limited region, while the opposing team, the takers, attempts to gain possession.
  • From a learning perspective, these restrictions necessitate that teammates simultaneously learn independent control policies: they are not able to communicate their experiences or control policies during the course of learning.
  • Thus multiagent learning methods by which the team shares policy information are not applicable.
  • At the beginning of each episode, the coach resets the location of the ball and of the players semi-randomly within the region of play: the takers start in one corner, three randomly chosen keepers are placed in the three remaining corners (any additional keepers start in the center), and the ball is placed next to the keeper in the top left corner (sketched after this list).
  • One advantage of keepaway is that it is more suitable for directly comparing different machine learning methods than is the full robot soccer task.
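
The reset described above can be made concrete with a small sketch. Everything here (the function name, the coordinate convention, the one-metre ball offset) is an illustrative assumption rather than the coach's actual code.

```python
import random

def reset_episode(region_size, num_keepers, num_takers):
    """Illustrative sketch of the coach's semi-random reset (not the actual coach code).

    Assumes a square region with corners (0, 0), (0, s), (s, 0), (s, s) and at
    least three keepers, as in every keepaway configuration studied in the paper.
    """
    s = float(region_size)
    bottom_left, top_left = (0.0, 0.0), (0.0, s)
    free_corners = [top_left, (s, 0.0), (s, s)]

    # All takers start in one corner (bottom left).
    taker_positions = [bottom_left] * num_takers

    # Three randomly chosen keepers take the remaining corners; extras start in the center.
    keeper_positions = [(s / 2.0, s / 2.0)] * num_keepers
    for keeper_idx, corner in zip(random.sample(range(num_keepers), 3), free_corners):
        keeper_positions[keeper_idx] = corner

    # The ball is placed next to the keeper in the top left corner (1 m offset is an assumption).
    ball_position = (top_left[0] + 1.0, top_left[1] - 1.0)
    return keeper_positions, taker_positions, ball_position
```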

3 Mapping Keepaway onto Reinforcement Learning

  • The authors' keepaway problem maps fairly directly onto the discrete-time, episodic, reinforcement-learning framework.
  • The macro-actions are built from skills such as PassBall(k): kick the ball directly towards keeper k; and GetOpen(): move to a position that is free from opponents and open for a pass from the ball's current position (using SPAR; Veloso, Stone, & Bowling, 1999).
  • SMDP macro-actions that consist of a subpolicy and termination condition over an underlying decision process, as here, have been termed options (Sutton, Precup, & Singh, 1999); a minimal sketch of this structure follows this list.
  • Because the players learn simultaneously without any shared knowledge, in what follows the authors present the task from an individual’s perspective.
  • From the point of view of an individual, then, an episode consists of a sequence of states, actions, and rewards selected and occurring at the macro-action boundaries, s0, a0, r1, s1, a1, r2, …, rj, sj, where ai is chosen based on some, presumably incomplete, perception of si, and sj is the terminal state in which the takers have possession or the ball has gone out of bounds.
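
The option formulation referenced above can be captured in a few lines. The sketch below is a generic illustration in Python, with an assumed env.step interface, not the implementation used in the paper; β is shown as a deterministic termination test for simplicity, whereas the definition allows stochastic termination.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any
PrimitiveAction = Any

@dataclass
class Option:
    """An SMDP macro-action in the options sense: a sub-policy plus start/stop conditions."""
    can_start: Callable[[State], bool]          # initiation set I: is the option available in s?
    policy: Callable[[State], PrimitiveAction]  # pi: primitive action to take in s while running
    terminates: Callable[[State], bool]         # beta: termination condition (deterministic here)

def run_option(env, option: Option, state: State):
    """Run one option to termination; return the state, summed reward, duration, and done flag.

    env.step(a) -> (next_state, reward, done) is an assumed environment interface.
    """
    total_reward, steps = 0.0, 0
    while True:
        state, reward, done = env.step(option.policy(state))
        total_reward += reward
        steps += 1
        if done or option.terminates(state):
            return state, total_reward, steps, done
```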

3.1 Keepers

  • Here the authors lay out the keepers’ policy space in terms of the macro-actions from which they can select.
  • Keepers not in possession of the ball are required to select the Receive action.
  • A keeper that has just passed the ball then likewise behaves and terminates as in the Receive action.
  • Examples of policies within this space are provided by their benchmark policies (sketched after this list), e.g. Random: choose randomly among the n macro-actions, each with probability 1/n.
  • On these steps the keeper determines a set of state variables, computed based on the positions of the keepers, the takers, and the center of the playing region.
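
The benchmark policies named above (Random, Always Hold, Hand-coded) act only on the SMDP steps when a keeper has the ball, choosing among HoldBall and the PassBall(k) macro-actions. The sketch below is a simplified illustration; the 5 m threshold, the openness scoring, and the state interface are assumptions, not the authors' exact rules.

```python
import random

MACRO_ACTIONS = ["HoldBall()", "PassBall(K2)", "PassBall(K3)"]  # e.g. for three keepers

def random_policy(state):
    # Random benchmark: each of the n macro-actions with probability 1/n.
    return random.choice(MACRO_ACTIONS)

def always_hold_policy(state):
    # Always Hold benchmark: never pass.
    return "HoldBall()"

def hand_coded_policy(state):
    """Rough stand-in for the Hand-coded benchmark; thresholds and scoring are assumptions.

    Assumed state interface:
      state.nearest_taker_dist -- distance from the ball holder K1 to the closest taker
      state.pass_angle(k)      -- openness of the passing lane from K1 to teammate k
    """
    if state.nearest_taker_dist > 5.0:   # no immediate pressure: keep the ball
        return "HoldBall()"
    # Under pressure: pass to the teammate with the widest (most open) passing lane.
    best = max((2, 3), key=state.pass_angle)
    return f"PassBall(K{best})"
```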

3.2 Takers

  • Although the takers' behaviors are pre-specified rather than learned, the authors specify them within the same framework for the sake of completeness.
  • The takers are relatively simple, choosing only macro-actions of minimum duration (one step, or as few as possible given server misses) that exactly mirror low-level skills.
  • Hand-coded-T: if no other taker can get to the ball faster than this taker, or this taker is the closest or second closest taker to the ball, choose the GoToBall action; otherwise choose a BlockPass(k) action (a sketch follows this list).
  • The takers’ state variables are similar to those of the keepers.
  • T1 is the taker that is computing the state variables, and T2 – Tm are the other takers ordered by increasing distance from K1.
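
For completeness, the Hand-coded-T rule above can be written as a short decision procedure. The GoToBall branch follows the bullet directly; the choice of which keeper to cover in the BlockPass branch is an assumed heuristic, as is the state interface.

```python
def hand_coded_taker(state, my_id):
    """Sketch of the pre-specified Hand-coded-T taker (illustrative, not the authors' code).

    Assumed state interface:
      state.fastest_taker_to_ball()       -- id of the taker that can reach the ball first
      state.takers_by_distance_to_ball()  -- taker ids ordered by increasing distance to the ball
      state.receivers_by_openness()       -- keepers without the ball, most open first
    """
    fastest = state.fastest_taker_to_ball()
    closest_two = state.takers_by_distance_to_ball()[:2]
    if my_id == fastest or my_id in closest_two:
        return "GoToBall()"
    # Remaining takers mark a potential receiver; which keeper to block is an assumed heuristic.
    target = state.receivers_by_openness()[0]
    return f"BlockPass({target})"
```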

4 Reinforcement Learning Algorithm

  • The authors use the SMDP version of the Sarsa(λ) algorithm with linear tile-coding function approximation (also known as CMACs) and replacing eligibility traces (see Albus, 1981; Rummery & Niranjan, 1994; Sutton & Barto, 1998).
  • Each player learns simultaneously and independently from its own actions and its own perception of the state.
  • Note that as a result, the value of a player’s decision depends in general on the current quality of its teammates’ control policies, which themselves change over time.
  • In the remainder of this section the authors introduce Sarsa(λ) (Section 4.1) and CMACs (Section 4.2) before presenting the full details of their learning algorithm in Section 4.3.

4.1 Sarsa(λ)

  • Sarsa(λ) is an on-policy learning method, meaning that the learning procedure estimates Q(s, a), the value of executing action a from state s, subject to the current policy being executed by the agent.
  • Meanwhile, the agent continually updates the policy according to the changing estimates of Q(s, a).
  • In its basic form, Sarsa(λ) is defined by the update rules reproduced after this list (Sutton & Barto, 1998, Section 7.5); the values in e(s, a), known as eligibility traces, store the credit that past action choices should receive for current rewards, and the parameter λ governs how much credit is delivered back to them.
  • This alternate orientation requires a different perspective on the standard algorithm.
  • These three routines are presented in detail in Section 4.3.
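
For reference, the textbook Sarsa(λ) update with replacing eligibility traces (Sutton & Barto, 1998) has the following form, where α is the step size, γ the discount factor, and λ the trace-decay parameter; the paper's episodic SMDP variant applies these updates at macro-action boundaries.

```latex
\begin{aligned}
\delta_t &= r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \\
e(s_t, a_t) &\leftarrow 1 \qquad \text{(replacing trace for the visited state-action pair)} \\
Q(s, a) &\leftarrow Q(s, a) + \alpha\, \delta_t\, e(s, a) \qquad \text{for all } s, a \\
e(s, a) &\leftarrow \gamma \lambda\, e(s, a) \qquad \text{for all } s, a
\end{aligned}
```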

4.2 Function Approximation

  • The basic Sarsa(λ) algorithm assumes that each action can be tried in each state infinitely often so as to fully and accurately populate the table of Q-values.
  • Rather, the agent needs to learn, based on limited experiences, how to act in new situations.
  • Many different function approximators exist and have been used successfully (Sutton & Barto, 1998, Section 8).
  • Tile coding allows us to take arbitrary groups of continuous state variables and lay infinite, axis-parallel tilings over them; a minimal tile coder is sketched after this list.
  • Thus the primary memory (weight) vector and the eligibility trace vector have only this many nonzero elements.
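
A minimal sketch of the tile coding just described: several offset tilings are laid over a small group of continuous variables, each tiling contributes exactly one active tile, and Q(s, a) is the sum of the corresponding weights. The uniform offsets, the hashing into a fixed table, and the helper names are illustrative assumptions rather than the authors' CMAC implementation.

```python
def active_tiles(state_vars, tile_widths, num_tilings, memory_size, action_id):
    """Return one hashed tile index per tiling for the given variables and action."""
    tiles = []
    for tiling in range(num_tilings):
        coords = []
        for var, width in zip(state_vars, tile_widths):
            offset = (tiling / num_tilings) * width      # each tiling shifted by a fraction of a tile
            coords.append(int((var + offset) // width))  # which tile this variable falls in
        # Hash (tiling, action, tile coordinates) into a fixed-size table; collisions are accepted.
        tiles.append(hash((tiling, action_id, tuple(coords))) % memory_size)
    return tiles

def q_value(weights, tiles):
    # Linear function approximation: Q(s, a) is the sum of the weights of the active tiles.
    return sum(weights[t] for t in tiles)
```

With, say, 32 tilings, only 32 weights are consulted per state-action pair, which keeps both evaluation and the eligibility-trace bookkeeping cheap even though the weight table itself may hold many thousands of entries.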

4.3 Algorithmic Detail

  • The authors present the full details of their approach as well as the parameter values they chose, and how they arrived at them.
  • Finally, in lines 7–8, the eligibility traces for each active tile of the selected action are set to 1, allowing the weights of these tiles to receive learning updates in the following step.
  • RLstep is run on each SMDP step (and so only when some keeper has the ball); a sketch of this step follows this list.
  • Second, in line 10, the authors begin to calculate the error in their action value estimates by computing the difference between r, the reward they received, and QLastAction, the expected return of the previous SMDP step.
  • In previous work (Stone et al., 2001), the authors experimented systematically with a range of values for the step-size parameter.
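
Putting the pieces together, the following sketch shows the shape of one SMDP Sarsa(λ) step with linear tile coding, ε-greedy selection, and replacing traces, reusing the active_tiles helper from the sketch in Section 4.2 above. Keepaway's reward is elapsed time with no discounting, so γ is taken as 1 here; the agent attribute names, the trace-truncation threshold, and other details are assumptions, not the authors' code.

```python
import random

def rl_step(agent, reward, state_vars):
    """One SMDP Sarsa(lambda) step for the keeper with the ball (illustrative sketch).

    Assumed agent attributes: w (weight list), z (dict tile -> trace), last_q,
    actions, tile_widths, num_tilings, alpha, lam, epsilon.
    """
    # Begin the TD error with the reward accumulated since the previous macro-action choice.
    delta = reward - agent.last_q

    # Evaluate every macro-action from the current state via its active tiles
    # (active_tiles is the helper sketched in Section 4.2 above).
    tiles_per_action = {
        a: active_tiles(state_vars, agent.tile_widths, agent.num_tilings, len(agent.w), a)
        for a in agent.actions
    }
    q = {a: sum(agent.w[t] for t in tiles_per_action[a]) for a in agent.actions}

    # Epsilon-greedy selection over macro-actions.
    if random.random() < agent.epsilon:
        action = random.choice(agent.actions)
    else:
        action = max(q, key=q.get)

    # Complete the TD error (undiscounted, time-based reward, so gamma = 1 here)
    # and update every weight that has a nonzero eligibility trace.
    delta += q[action]
    for tile, trace in agent.z.items():
        agent.w[tile] += (agent.alpha / agent.num_tilings) * delta * trace

    # Decay all traces, dropping tiny ones, then set replacing traces to 1 for the
    # active tiles of the selected action so they receive the next update.
    for tile in list(agent.z):
        agent.z[tile] *= agent.lam
        if agent.z[tile] < 1e-3:      # truncation threshold is an assumption
            del agent.z[tile]
    for tile in tiles_per_action[action]:
        agent.z[tile] = 1.0

    agent.last_q = q[action]
    return action
```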

5 Empirical Results

  • In this section the authors report their empirical results in the keepaway domain.
  • In previous work (Stone et al., 2001), the authors first learned a value function for the case in which the agents all used fixed, hand-coded policies.
  • Here the authors extend those results by reporting performance with the full set of sensory challenges presented by the RoboCup simulator.
  • Section 5.1 summarizes their previously reported initial results.
  • The authors then outline a series of follow-up questions in Section 5.2 and address them empirically in Section 5.3.

5.1 Initial Results

  • In the RoboCup soccer simulator, agents typically have limited and noisy sensors: Each player can see objects within a 90° view cone, and the precision of an object’s sensed location degrades with distance.
  • To benchmark the performance of the learned keepers, the authors first ran the three benchmark keeper policies, Random, Always Hold, and Hand-coded,8 as laid out in Section 3.1.
  • Figure 9 shows histograms of the lengths of the episodes generated by these policies.
  • All learning runs quickly found a much better policy than any of the benchmark policies, including the hand-coded policy.
  • The learning curves still appear to be rising after 40 hours.

5.2 Follow-up Questions

  • The initial results in Section 5.1 represent the main result of this work.
  • They demonstrate the power and robustness of distributed SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ.

1. Does the learning approach described above continue to work if the agents are limited to noisy, narrowed vision?

  • The authors' initial simplification of complete, noiseless vision was convenient for two reasons.
  • First, the learning agents had no uncertainty about the state of the world, effectively changing the world from partially observable to fully observable.
  • Second, as a result, the agents did not need to incorporate any information-gathering actions into their policies, thus simplifying them considerably.
  • For these techniques to scale up to more difficult problems such as RoboCup soccer, agents must be able to learn using sensory information that is often incomplete and noisy.

2. How does a learned policy perform in comparison to a hand-coded policy that has been manually tuned?

  • The authors' initial results compared learned policies to a hand-coded policy that was not tuned at all.
  • This policy was able to perform only slightly better than Random.
  • In particular, the authors seek to discover here whether learning as opposed to handcoding policies (i) leads to a superior solution; (ii) saves effort but produces a similarly effective solution; or (iii) trades off manual effort against performance.

3. How robust are these methods to differing field sizes?

  • The cost of manually tuning a hand-coded policy can generally be tolerated for single problems.
  • However, having to retune the policy every time the problem specification is slightly changed can become quite cumbersome.
  • Reinforcement learning becomes an especially valuable tool for RoboCup soccer if it can handle domain alterations more robustly than hand-coded solutions.

4. How dependent are the results on the state representation?

  • The choice of input representation can have a dramatic effect on the performance and computation time of a machine learning solution.
  • For this reason, the representation is typically chosen with great care.
  • It is often difficult to detect and avoid redundant and irrelevant information.
  • Ideally, the learning algorithm would be able to detect the relevance of its state variables on its own.

6. How well do the results scale to larger problems?

  • The overall goal of this line of research is to develop reinforcement learning techniques that will scale to 11v11 soccer on a full-sized field.
  • Previous results in the keepaway domain have typically included just three keepers and at most two takers.
  • The largest learned keepaway solution that the authors know of is in the 4v3 scenario presented in Section 5.1.
  • This research examines whether current methods can scale up beyond that.

7. Is the source of the difficulty the learning task itself, or the fact that multiple agents are learning simultaneously?

  • Keepaway is a multiagent task in which all of the agents learn simultaneously and independently.
  • On the surface, it is unclear whether the learning challenge stems mainly from the fact that the agents are learning to interact with one another, or mainly from the difficulty of the task itself.
  • Perhaps it is just as hard for an individual agent to learn to collaborate with previously trained experts as it is for the agent to learn simultaneously with other learners.
  • Or, on the other hand, perhaps the fewer agents that are learning, and the more that are pre-trained, the quicker the learning happens.

5.3 Detailed Studies

  • This section addresses each of the questions listed in Section 5.2 with focused experiments in the keepaway domain.
  • The learned policies also outperform the authors' Hand-coded policy, which they describe in detail in the next section.
  • This policy was able to hold the ball for an average of 8.2 s. From Figure 12 the authors can see that the keepers are able to learn policies that outperform their initial Hand-coded policy and exhibit performance roughly as good as (perhaps slightly better than) the tuned version.
  • As a starting point, notice that their Hand-coded policy uses only a small subset of the 13 state variables mentioned previously.
  • The learning curves for all nine runs are shown in Figure 17.

7 Conclusion

  • This article presents an application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to a complex, multiagent task in a stochastic, dynamic environment.
  • With remarkably few training episodes, simultaneously learning agents achieve significantly better performance than a range of benchmark policies, including a reasonable hand-coded policy, and comparable performance to a tuned hand-coded policy.
  • The authors' ongoing research aims to build on the work reported in this article in many ways.
  • Preliminary work in this direction demonstrates the utility of transferring learned behaviors from 3v2 to 4v3 and 5v4 scenarios (Taylor & Stone, 2005).


DOI: 10.1177/105971230501300301

Reinforcement Learning for RoboCup Soccer Keepaway
Peter Stone¹, Richard S. Sutton², Gregory Kuhlmann¹
¹ Department of Computer Sciences, The University of Texas at Austin
² Department of Computing Science, University of Alberta
Keywords: multiagent systems · machine learning · multiagent learning · reinforcement learning · robot soccer
1 Introduction
Reinforcement learning (Sutton & Barto, 1998) is a the-
oretically-grounded machine learning method designed
to allow an autonomous agent to maximize its long-
term reward via repeated experimentation in, and
interaction with, its environment. Under certain condi-
tions, reinforcement learning is guaranteed to enable
the agent to converge to an optimal control policy, and
has been empirically demonstrated to do so in a series
of relatively simple testbed domains. Despite its
appeal, reinforcement learning can be difficult to scale
up to larger domains due to the exponential growth of
states in the number of state variables (the “curse of
dimensionality”). A limited number of successes have
been reported in large-scale domains, including back-
gammon (Tesauro, 1994), elevator control (Crites &
Barto, 1996), and helicopter control (Bagnell & Schnei-
der, 2001). This article contributes to the list of rein-
forcement learning successes, demonstrating that it can
apply successfully to a complex multiagent task, namely
keepaway, a subtask of RoboCup simulated soccer.
RoboCup simulated soccer has been used as the
basis for successful international competitions and re-
search challenges (Kitano et al., 1997). As presented in
detail by Stone (2000), it is a fully distributed, multi-
agent domain with both teammates and adversaries.
There is hidden state, meaning that each agent has only
a partial world view at any given moment. The agents
also have noisy sensors and actuators, meaning that
they do not perceive the world exactly as it is, nor can
they affect the world exactly as intended. In addition,
the perception and action cycles are asynchronous,
prohibiting the traditional AI paradigm of using per-
ceptual input to trigger actions. Communication oppor-
tunities are limited, and the agents must make their
decisions in real-time. These italicized domain char-
acteristics combine to make simulated robot soccer a
realistic and challenging domain.
In principle, modern reinforcement learning meth-
ods are reasonably well suited to meeting the chal-
lenges of RoboCup simulated soccer. Reinforcement
learning is all about sequential decision making, achiev-
ing delayed goals, and handling noise and stochastic-
ity. It is also oriented toward making decisions relatively
rapidly rather than relying on extensive deliberation or
meta-reasoning. There is a substantial body of rein-
forcement learning research on multiagent decision
making, and soccer is an example of the relatively
benign case in which all agents on the same team share
the same goal. In this case it is often feasible for each
agent to learn independently, sharing only a common
reward signal. The large state space remains a prob-
lem, but can, in principle, be handled using function
approximation, which we discuss further below. Robo-
Cup soccer is a large and difficult instance of many of
the issues which have been addressed in small, isolated
cases in previous reinforcement learning research.
Despite substantial previous work (e.g., Andou, 1998;
Stone & Veloso, 1999; Uchibe, 1999; Riedmiller et al.,
2001), the extent to which modern reinforcement learn-
ing methods can meet these challenges remains an open
question.
Perhaps the most pressing challenge in RoboCup
simulated soccer is the large state space, which requires
some kind of general function approximation. Stone
and Veloso (1999) and others have applied state aggre-
gation approaches, but these are not well suited to
learning complex functions. In addition, the theory of
reinforcement learning with function approximation is
not yet well understood (e.g., see Sutton & Barto, 1998;
Baird & Moore, 1999; Sutton, McAllester, Singh, &
Mansour, 2000). Perhaps the best understood of cur-
rent methods is linear Sarsa(λ) (Sutton & Barto, 1998),
which we use here. This method is not guaranteed to
converge to the optimal policy in all cases, but several
lines of evidence suggest that it is near a good solution
(Gordon, 2001; Tsitsiklis & Van Roy, 1997; Sutton,
1996) and recent results show that it does indeed con-
verge, as long as the action-selection policy is continu-
ous (Perkins & Precup, 2003). Certainly it has advantages
over off-policy methods such as Q-learning (Watkins,
1989), which can be unstable with linear and other
kinds of function approximation. An important open
question is whether Sarsa’s failure to converge is of
practical importance or is merely a theoretical curios-
ity. Only tests on large-state-space applications such
as RoboCup soccer will answer this question.
In this article we begin to scale reinforcement
learning up to RoboCup simulated soccer. We con-
sider a subtask of soccer involving 5–9 players rather
than the full 22. This is the task of keepaway, in which
one team merely seeks to keep control of the ball for as
long as possible. The main contribution of this article
is that it considers a problem that is at the limits of
what reinforcement learning methods can tractably
handle and presents successful results using, mainly, a
single approach, namely episodic SMDP Sarsa(λ) with
linear tile-coding function approximation and variable
λ. Extensive experiments are presented demonstrating
the effectiveness of this approach relative to several
benchmarks.
The remainder of the article is organized as follows.
In the next section we describe keepaway and how we
build on prior work in RoboCup soccer to formulate
this problem at an intermediate level above that of the
lowest level actions and perceptions. In Section 3 we
map this task onto an episodic reinforcement learning
framework. In Sections 4 and 5 we describe our learn-
ing algorithm in detail and our results respectively.
Related work is discussed further in Section 6 and Sec-
tion 7 concludes.
2 Keepaway Soccer
We consider a subtask of RoboCup soccer, keepaway,
in which one team, the keepers, tries to maintain pos-
session of the ball within a limited region, while the
opposing team, the takers, attempts to gain posses-
sion. Whenever the takers take possession or the ball
leaves the region, the episode ends and the players are
reset for another episode (with the keepers being
given possession of the ball again).
Parameters of the task include the size of the
region, the number of keepers, and the number of tak-
ers. Figure 1 shows screen shots of episodes with
three keepers and two takers (called 3 vs. 2, or 3v2 for
short) playing in a 20 m by 20 m region and 4v3 in a
30 m by 30 m region.
All of the work reported in this article uses the
standard RoboCup soccer simulator
(Noda, Matsubara,
Hiraki, & Frank, 1998). Agents in the RoboCup simu-
lator receive visual perceptions every 150 ms indi-
cating the relative distance and angle to visible objects
in the world, such as the ball and other agents. They
may execute a parameterized primitive action such as
turn(angle), dash(power), or kick(power, angle) every
100 ms. Thus the agents must sense and act asynchro-
nously. Random noise is injected into all sensations
and actions. Individual agents must be controlled by
separate processes, with no inter-agent communica-
tion permitted other than via the simulator itself, which
enforces communication bandwidth and range con-
straints. From a learning perspective, these restrictions
necessitate that teammates simultaneously learn inde-
pendent control policies: They are not able to commu-
nicate their experiences or control policies during the
course of learning, nor is any agent able to make deci-
sions for the whole team. Thus multiagent learning
methods by which the team shares policy information
are not applicable. Full details of the RoboCup simula-
tor are presented by Chen et al. (2003).
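
The asynchronous sense/act cycle just described (visual perceptions roughly every 150 ms, one parameterized primitive command per 100 ms cycle) is the low-level substrate beneath the macro-actions used for learning. The loop below is only a schematic client with placeholder poll/send calls; it is not the rcssserver protocol or the authors' agent code.

```python
import time

def client_loop(connection, choose_command):
    """Schematic agent loop for an asynchronous soccer-server client (illustrative only).

    connection.poll_perception() and connection.send(...) are placeholders for a real
    network client; choose_command maps the latest (possibly stale) world model to one
    of the parameterized primitives, e.g. ("turn", angle), ("dash", power), or
    ("kick", power, angle).
    """
    world_model = None
    while True:
        # Perceptions arrive on their own schedule (about every 150 ms) and may be noisy.
        perception = connection.poll_perception()
        if perception is not None:
            world_model = perception

        # One primitive command may be sent each 100 ms cycle, even if no new
        # perception has arrived since the last command.
        if world_model is not None:
            connection.send(choose_command(world_model))
        time.sleep(0.1)
```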
For the keepaway task, an omniscient coach agent
manages the play, ending episodes when a taker gains
possession of the ball for a set period of time or when
the ball goes outside of the region. At the beginning of
each episode, the coach resets the location of the ball
and of the players semi-randomly within the region of
play as follows. The takers all start in one corner (bot-
tom left). Three randomly chosen keepers are placed
one in each of the three remaining corners, and any
keepers beyond three are placed in the center of the
region. The ball is initially placed next to the keeper in
the top left corner. A sample starting configuration with
four keepers and three takers is shown in Figure 1.
Keepaway is a subproblem of the complete robot
soccer domain. The principal simplifications are that
there are fewer players involved; they are playing in a
smaller area; and the players are always focused on the
same high-level goal—they don’t need to balance
offensive and defensive considerations. In addition, in
this article we focus on learning parts of the keepers
policies when playing against fixed, pre-specified tak-
ers. Nevertheless, the skills needed to play keepaway
well are also very useful in the full problem of robot
soccer. Indeed, ATT-CMUnited-2000—the 3rd-place
finishing team in the RoboCup-2000 simulator league—
incorporated a successful hand-coded solution to an
11v11 keepaway task (Stone & McAllester, 2001).
Figure 1 Left: A screen shot from the middle of a 3 vs. 2 keepaway episode in a 20 m x 20 m region. Right: A starting configuration for a 4 vs. 3 keepaway episode in a 30 m x 30 m region.

One advantage of keepaway is that it is more suitable
for directly comparing different machine learning
methods than is the full robot soccer task. In addition
to the reinforcement learning approaches mentioned
above, machine learning techniques including genetic
programming, neural networks, and decision trees
have been incorporated in RoboCup teams (e.g., see
Luke, Hohn, Farris, Jackson, & Hendler, 1998; Andre
& Teller, 1999; Stone, 2000). A frustration with these
and other machine learning approaches to RoboCup is
that they are all embedded within disparate systems,
and often address different subtasks of the full soccer
problem. Therefore, they are difficult to compare in any
meaningful way. Keepaway is simple enough that it
can be successfully learned in its entirety, yet complex
enough that straightforward solutions are inadequate.
Therefore it is an excellent candidate for a machine
learning benchmark problem. We provide all the nec-
essary source code as well as step-by-step tutorials for
implementing learning experiments in keepaway at
http://www.cs.utexas.edu/~AustinVilla/sim/keepaway/.
3 Mapping Keepaway onto
Reinforcement Learning
Our keepaway problem maps fairly directly onto the
discrete-time, episodic, reinforcement-learning frame-
work. The RoboCup soccer simulator operates in dis-
crete time steps, t = 0, 1, 2, …, each representing 100
ms of simulated time. When one episode ends (e.g., the
ball is lost to the takers), another begins, giving rise to
a series of episodes. Each player learns independently
and may perceive the world differently. For each player,
an episode begins when the player is first asked to
make a decision and ends when possession of the ball
is lost by the keepers.
As a way of incorporating domain knowledge,
our learners choose not from the simulator’s primitive
actions, but from higher level macro-actions based
closely on skills used in the CMUnited-99 team.
These
skills include
HoldBall(): Remain stationary while keeping posses-
sion of the ball in a position that is as far away
from the opponents as possible.
PassBall(k): Kick the ball directly towards keeper k.
GetOpen(): Move to a position that is free from oppo-
nents and open for a pass from the ball’s current
position (using SPAR (Veloso, Stone, & Bowling,
1999)).
GoToBall(): Intercept a moving ball or move directly
towards a stationary ball.
BlockPass(k): Move to a position between the keeper
with the ball and keeper k.
All of these skills except PassBall(k) are simple func-
tions from state to a corresponding primitive action;
an invocation of one of these normally controls behav-
ior for a single time step. PassBall(k), however, requires
an extended sequence of primitive actions, using a
series of kicks to position the ball, and then accelerate
it in the desired direction (Stone, 2000); a single invo-
cation of PassBall(k) influences behavior for several
time steps. Moreover, even the simpler skills may last
more than one time step because the player occasion-
ally misses the step following them; the simulator
occasionally misses commands; or the player may
find itself in a situation requiring it to take a specific
action, for instance to self-localize. In these cases
there is no new opportunity for decision-making until
two or more steps after invoking the skill. To handle
such possibilities, it is convenient to treat the prob-
lem as a semi-Markov decision process, or SMDP
(Puterman, 1994; Bradtke & Duff, 1995). An SMDP
evolves in a sequence of jumps from the initiation of
each SMDP macro-action to its termination one or
more time steps later, at which time the next SMDP
macro-action is initiated. SMDP macro-actions that
consist of a subpolicy and termination condition over
an underlying decision process, as here, have been
termed options (Sutton, Precup, & Singh, 1999). Formally,

Options consist of three components: a policy π : S × A_p → [0, 1], a termination condition β : S⁺ → [0, 1], and an initiation set I ⊆ S. An option ⟨I, π, β⟩ is available in state s_t if and only if s_t ∈ I. If the option is taken, then actions are selected according to π until the option terminates stochastically according to β.

(Sutton et al., 1999)

In this context, S is the set of primitive states and A_p is the set of primitive actions in the domain. π(s, a) is the probability of selecting primitive action a when in

Citations
Journal ArticleDOI
TL;DR: This article presents a framework that classifies transfer learning methods in terms of their capabilities and goals, and then uses it to survey the existing literature, as well as to suggest future directions for transfer learning work.
Abstract: The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.

1,634 citations


Cites background or methods from "Reinforcement learning for RoboCup ..."

  • ...If the agent only receives observations and does not know the true state, the agent may approximate its true state as the observation (cf., Stone et al., 2005), or it may learn using the Partially Observable Markov Decision Process (POMDP) (cf., Kaelbling et al., 1998) problem formulation,…


  • ...…TDGammon Tesauro 1994, job shop scheduling Zhang and Dietterich 1995, elevator control Crites and Barto 1996, helicopter control Ng et al. 2004, marble maze control Bentivegna et al. 2004, Robot Soccer Keepaway Stone et al. 2005, and quadruped locomotion Saggar et al. 2007 and Kolter et al. 2008)....


  • ...First, in recent years RL techniques have achieved notable successes in difficult tasks which other machine learning techniques are either unable or ill-equipped to address (e.g., TDGammon Tesauro 1994, job shop scheduling Zhang and Dietterich 1995, elevator control Crites and Barto 1996, helicopter control Ng et al. 2004, marble maze control Bentivegna et al. 2004, Robot Soccer Keepaway Stone et al. 2005, and quadruped locomotion Saggar et al. 2007 and Kolter et al. 2008)....


Journal ArticleDOI
TL;DR: A set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting using the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) and introduces different model-free reinforcement-learning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph.
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.

332 citations

Journal ArticleDOI
14 Apr 2009
TL;DR: This paper considers the evolution of cognitive radio architecture (CRA) in the context of motivating use cases such as public safety and sentient spaces to characterize CRA with an interdisciplinary perspective where machine perception in visual, acoustic, speech, and natural language text domains provide cues to the automatic detection of stereotypical situations.
Abstract: The radio research community has aggressively embraced cognitive radio for dynamic radio spectrum management to enhance spectrum usage, e.g., in ISM bands and as secondary users in unused TV bands, but the needs of the mobile wireless user have not been addressed as thoroughly on the question of high quality of information (QoI) as a function of place, time, and social setting (e.g. commuting, shopping, or in need of medical assistance). This paper considers the evolution of cognitive radio architecture (CRA) in the context of motivating use cases such as public safety and sentient spaces to characterize CRA with an interdisciplinary perspective where machine perception in visual, acoustic, speech, and natural language text domains provide cues to the automatic detection of stereotypical situations, enabling radio nodes to select from among radio bands and modes more intelligently and enabling cognitive wireless networks to deliver higher QoI within social and technical constraints, made more cost effective via embedded and distributed computational intelligence.

294 citations


Cites methods from "Reinforcement learning for RoboCup ..."

  • ...Early planning systems used rule bases to solve simple planning problems like the monkey and the bananas [38], stimulating the development of an entire subculture of planning technologies now integrated into a broad range of applications from factory automation to autonomous vehicles [39] and RoboCup Soccer [40], integrating learning with planning [41]....


Journal ArticleDOI
TL;DR: Several variants of the general batch learning framework are discussed, particularly tailored to the use of multilayer perceptrons to approximate value functions over continuous state spaces, which are successfully used to learn crucial skills in soccer-playing robots participating in the RoboCup competitions.
Abstract: Batch reinforcement learning methods provide a powerful framework for learning efficiently and effectively in autonomous robots. The paper reviews some recent work of the authors aiming at the successful application of reinforcement learning in a challenging and complex domain. It discusses several variants of the general batch learning framework, particularly tailored to the use of multilayer perceptrons to approximate value functions over continuous state spaces. The batch learning framework is successfully used to learn crucial skills in our soccer-playing robots participating in the RoboCup competitions. This is demonstrated on three different case studies.

281 citations


Cites background or methods from "Reinforcement learning for RoboCup ..."

  • ...[Figure 1: Scene from a game in the RoboCup middle size league.] The article is organized as follows: first, we will give a description of the batch RL framework and in particular describe three variants that have been proven useful for the application to challenging RL problems....


  • ...Furthermore, Stone’s keep-away-game is a popular standardized reinforcement learning problem derived from the simulation league (Stone et al. 2005)....


  • ...A lot of learning tasks inspired by or directly derived from RoboCup have been used in proof-of-concepts of reinforcement learning methods (Asada et al. 1999; Stone et al. 2005)....


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter provides a formalization of the general transfer problem, the main settings which have been investigated so far, and the most important approaches to transfer in reinforcement learning.
Abstract: Transfer in reinforcement learning is a novel research area that focuses on the development of methods to transfer knowledge from a set of source tasks to a target task. Whenever the tasks are similar, the transferred knowledge can be used by a learning algorithm to solve the target task and significantly improve its performance (e.g., by reducing the number of samples needed to achieve a nearly optimal performance). In this chapter we provide a formalization of the general transfer problem, we identify the main settings which have been investigated so far, and we review the most important approaches to transfer in reinforcement learning.

272 citations

References
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations


"Reinforcement learning for RoboCup ..." refers background or methods in this paper

  • ...We use the SMDP version of the Sarsa(λ) algorithm with linear tile-coding function approximation (also known as CMACs) and replacing eligibility traces (see Albus, 1981; Rummery & Niranjan, 1994; Sutton & Barto, 1998)....


  • ...Perhaps the best understood of current methods is linear Sarsa(λ) (Sutton & Barto, 1998), which we use here....


  • ...In its basic form, Sarsa(λ) is defined as follows (Sutton & Barto, 1998, Section 7.5): Here, α is a learning rate parameter and γ is a discount factor governing the weight placed on future, as opposed to immediate, rewards. The values in e(s, a), known as eligibility traces, store the credit that…


  • ...Reinforcement learning (Sutton & Barto, 1998) is a theoretically-grounded machine learning method designed to allow an autonomous agent to maximize its longterm reward via repeated experimentation in, and interaction with, its environment....


  • ...Many different function approximators exist and have been used successfully (Sutton & Barto, 1998, Section 8)....


Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations


"Reinforcement learning for RoboCup ..." refers methods in this paper

  • ...…combine a ball-interception behavior trained with a back-propagation neural network; a pass-evaluation behavior trained with the C4.5 decision tree training algorithm (Quinlan, 1993); and a pass-decision behavior trained with TPOT-RL (mentioned above) into a single, successful team (Stone, 2000)....


Book
15 Apr 1994
TL;DR: Puterman as discussed by the authors provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models, focusing primarily on infinite horizon discrete time models and models with discrete time spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous time discrete state models.
Abstract: From the Publisher: The past decade has seen considerable theoretical and applied research on Markov decision processes, as well as the growing use of these models in ecology, economics, communications engineering, and other fields where outcomes are uncertain and sequential decision-making processes are needed. A timely response to this increased activity, Martin L. Puterman's new work provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models. It discusses all major research directions in the field, highlights many significant applications of Markov decision processes models, and explores numerous important topics that have previously been neglected or given cursory coverage in the literature. Markov Decision Processes focuses primarily on infinite horizon discrete time models and models with discrete time spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous-time discrete state models. The book is organized around optimality criteria, using a common framework centered on the optimality (Bellman) equation for presenting results. The results are presented in a "theorem-proof" format and elaborated on through both discussion and examples, including results that are not available in any other book. A two-state Markov decision process model, presented in Chapter 3, is analyzed repeatedly throughout the book and demonstrates many results and algorithms. Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria. It also explores several topics that have received little or no attention in other books, including modified policy iteration, multichain models with average reward criterion, and sensitive optimality. In addition, a Bibliographic Remarks section in each chapter comments on relevant historic

11,625 citations

Proceedings Article
29 Nov 1999
TL;DR: This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Abstract: Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

5,492 citations


"Reinforcement learning for RoboCup ..." refers background in this paper

  • ...An important direction for future research is to explore whether reinforcement learning techniques can be extended to keepaway with large, discrete, or continuous, parameterized action spaces, perhaps using policy gradient methods (Sutton et al., 2000)....



Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Reinforcement learning for RoboCup soccer keepaway"?

In this paper, Stone et al. showed that reinforcement learning can be used to solve the problem of keepaway in RoboCup simulated soccer.

Taken as a whole, the experiments reported in this article demonstrate the possibility of multiple independent agents learning simultaneously in a complex environment using reinforcement learning after a small number of trials: a possibility result and success story for reinforcement learning. An important direction for future research is to explore whether reinforcement learning techniques can be extended to keepaway with large, discrete, or continuous, parameterized action spaces, perhaps using policy gradient methods (Sutton et al., 2000). This latter possibility would enable passes in front of a teammate so that it can move to meet the ball.

A key challenge for applying RL in environments with large state spaces is to be able to generalize the state representation so as to make learning work in practice despite a relatively sparse sample of the state space. 

An advantage of tile coding is that it allows us ultimately to learn weights associated with discrete, binary features, thus eliminating issues of scaling among features of different types. 

By overlaying multiple tilings it is possible to achieve quick generalization while maintaining the ability to learn fine distinctions. 

Using real robots, Uchibe (1999) used reinforcement learning methods to learn to shoot a ball into a goal while avoiding an opponent. 

The main leverage for both factored and hierarchical approaches is that they allow the agent to ignore the parts of its state that are irrelevant to its current decision (Andre & Russell, 2002). 

Although the need for manual tuning of parameters is precisely what the authors try to avoid by using machine learning, to assess properly the value of learning, it is important to compare the performance of a learned policy to that of a benchmark that has been carefully thought out. 

Although the authors found that the keepers were able to achieve better than random performance with as few as one state variable, the 5 variables used in the hand-coded policy seem to be minimal for peak performance.

The degree to which the player is open is calculated as a linear combination of the teammate’s distance to its nearest opponent, and the angle between the teammate, K1, and the opponent closest to the passing line. 
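
Read as a formula, that openness measure is a weighted sum of a distance term and an angle term; a hedged sketch follows, in which the weights and the sign convention are not given in this excerpt and are placeholders here:

```latex
\text{openness}(K_k) \;=\; w_1 \, d\!\left(K_k,\ \text{nearest opponent to } K_k\right)
  \;+\; w_2 \, \angle\!\left(K_k,\ K_1,\ \text{opponent closest to the passing line}\right)
```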

Each time step in which the player receives no new information about a variable, the variable’s confidence is multiplied by a decay rate (0.99 in their experiments). 

Due to the complex policy representation (thousands of weights), it is difficult to characterize objectively the extent to which the independent learners specialize or learn different, perhaps complementary, policies. 

Because the Hand-coded policy did quite well without using the remaining variables, the authors wondered if perhaps the unused state variables were not essential for the keepaway task.