Reinforcement learning for RoboCup soccer keepaway
Summary
1 Introduction
- Reinforcement learning (Sutton & Barto, 1998) is a theoretically-grounded machine learning method designed to allow an autonomous agent to maximize its longterm reward via repeated experimentation in, and interaction with, its environment.
- There is hidden state, meaning that each agent has only a partial world view at any given moment.
- RoboCup soccer is a large and difficult instance of many of the issues which have been addressed in small, isolated cases in previous reinforcement learning research.
2 Keepaway Soccer
- The authors consider a subtask of RoboCup soccer, keepaway, in which one team, the keepers, tries to maintain possession of the ball within a limited region, while the opposing team, the takers, attempts to gain possession.
- From a learning perspective, these restrictions necessitate that teammates simultaneously learn independent control policies.
- Thus multiagent learning methods by which the team shares policy information are not applicable.
- At the beginning of each episode, the coach resets the location of the ball and of the players semi-randomly within the region of play (an illustrative reset is sketched after this list).
- One advantage of keepaway is that it is more suitable for directly comparing different machine learning methods than is the full robot soccer task.
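To make the episode structure concrete, the following sketch shows what such a semi-random reset might look like. It is only an illustration: the corner-based placement, the region size, and the jitter amount are assumptions, since the coach's exact placement rules are not given in this summary.

```python
import random

REGION = 20.0  # hypothetical 20m x 20m region; the field size varies across experiments

def reset_episode(num_keepers=3, num_takers=2, jitter=1.0):
    """Illustrative semi-random reset for 3v2: keepers near three corners
    (ball beside the first), takers near the remaining corner. The real
    coach's placement rules may differ from this assumption."""
    corners = [(0.0, 0.0), (0.0, REGION), (REGION, 0.0), (REGION, REGION)]
    keepers = [(x + random.uniform(-jitter, jitter),
                y + random.uniform(-jitter, jitter))
               for x, y in corners[:num_keepers]]
    takers = [(corners[-1][0] + random.uniform(-jitter, jitter),
               corners[-1][1] + random.uniform(-jitter, jitter))
              for _ in range(num_takers)]
    ball = keepers[0]  # the ball starts next to one keeper
    return keepers, takers, ball
```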
3 Mapping Keepaway onto Reinforcement Learning
- The authors' keepaway problem maps fairly directly onto the discrete-time, episodic, reinforcement-learning framework.
- PassBall(k): Kick the ball directly towards keeper k.
- GetOpen: Move to a position that is free from opponents and open for a pass from the ball's current position (using SPAR (Veloso, Stone, & Bowling, 1999)).
- SMDP macro-actions that consist of a subpolicy and termination condition over an underlying decision process, as here, have been termed options (Sutton, Precup, & Singh, 1999).
- Because the players learn simultaneously without any shared knowledge, in what follows the authors present the task from an individual’s perspective.
- From the point of view of an individual, then, an episode consists of a sequence of states, actions, and rewards selected and occurring at the macro-action boundaries: s_0, a_0, r_1, s_1, a_1, r_2, ..., r_j, s_j, where a_i is chosen based on some, presumably incomplete, perception of s_i, and s_j is the terminal state in which the takers have possession or the ball has gone out of bounds.
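The options formalism above can be summarized in a few lines of code. This is a generic sketch, not the authors' implementation: the Option fields and the env interface (env.step, env.episode_over) are assumed names, while the reward convention (one unit per elapsed primitive step) mirrors keepaway's objective of maximizing possession time.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any  # stands in for the agent's (partial) perception of the world

@dataclass
class Option:
    """An option: a subpolicy plus a termination condition
    (Sutton, Precup, & Singh, 1999)."""
    name: str
    policy: Callable[[State], str]        # maps a state to a primitive command
    terminates: Callable[[State], bool]   # True when control returns to the learner

def run_option(env, option, state):
    """Execute one macro-action; the SMDP reward here is the number of
    elapsed primitive steps, matching keepaway's possession-time objective."""
    steps = 0
    while not option.terminates(state) and not env.episode_over(state):
        state = env.step(option.policy(state))  # hypothetical environment interface
        steps += 1
    return state, steps  # (next decision state, SMDP reward)
```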
3.1 Keepers
- Here the authors lay out the keepers’ policy space in terms of the macro-actions from which they can select.
- Keepers not in possession of the ball are required to select the Receive action.
- A keeper that passes the ball then behaves and terminates as in the Receive action.
- Examples of policies within this space are provided by their benchmark policies. Random: choose randomly among the n macro-actions, each with probability 1/n (see the sketch after this list).
- On these steps the keeper determines a set of state variables, computed based on the positions of the keepers, the takers, and the center of the playing region.
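For concreteness, the benchmark keeper policies might be sketched as follows. Only their shape is taken from the text; the state encoding, the openness field, and the 10 m threshold in the hand-coded variant are illustrative assumptions, not the authors' tuned values.

```python
import random

def random_policy(macro_actions):
    """Random benchmark: each of the n macro-actions with probability 1/n."""
    return random.choice(macro_actions)

def always_hold_policy(macro_actions):
    """Always Hold benchmark: keep the ball until a taker takes it."""
    return "HoldBall"

def hand_coded_policy(state, macro_actions):
    """Shape of the Hand-coded benchmark (thresholds are illustrative):
    hold when no taker is close, else pass to the teammate judged most open."""
    if state["dist_nearest_taker"] > 10.0:   # hypothetical threshold
        return "HoldBall"
    k = max(state["teammates"], key=lambda t: t["openness"])
    return "PassBall({})".format(k["id"])
```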
3.2 Takers
- Although the takers are pre-specified rather than learned, the authors describe their behaviors within the same framework for the sake of completeness.
- The takers are relatively simple, choosing only macro-actions of minimum duration (one step, or as few as possible given server misses) that exactly mirror low-level skills.
- Hand-coded-T: if no other taker can get to the ball faster than this taker, or if this taker is the closest or second-closest taker to the ball, choose the GoToBall action.
- The takers’ state variables are similar to those of the keepers.
- T1 is the taker that is computing the state variables, and T2 – Tm are the other takers ordered by increasing distance from K1.
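A minimal sketch of that ordering convention, assuming positions are (x, y) tuples and that K1, the keeper with the ball, is first in the keepers list:

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def order_players(me, teammates, keepers):
    """Illustrative ordering used for the state variables: T1 is the player
    computing them; the remaining takers are sorted by increasing distance
    from K1, the keeper with the ball."""
    k1 = keepers[0]
    others = sorted(teammates, key=lambda t: dist(t, k1))
    return [me] + others
```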
4 Reinforcement Learning Algorithm
- The authors use the SMDP version of the Sarsa(λ) algorithm with linear tile-coding function approximation (also known as CMACs) and replacing eligibility traces (see Albus, 1981; Rummery & Niranjan, 1994; Sutton & Barto, 1998).
- Each player learns simultaneously and independently from its own actions and its own perception of the state.
- Note that as a result, the value of a player’s decision depends in general on the current quality of its teammates’ control policies, which themselves change over time.
- In the remainder of this section the authors introduce Sarsa(λ) (Section 4.1) and CMACs (Section 4.2) before presenting the full details of their learning algorithm in Section 4.3.
4.1 Sarsa(λ)
- Sarsa(λ) is an on-policy learning method, meaning that the learning procedure estimates Q(s, a), the value of executing action a from state s, subject to the current policy being executed by the agent.
- Meanwhile, the agent continually updates the policy according to the changing estimates of Q(s, a).
- In its basic form, Sarsa(λ) is defined by the following updates after each transition (s, a, r, s', a') (Sutton & Barto, 1998, Section 7.5): δ = r + γQ(s', a') − Q(s, a); e(s, a) ← 1 (replacing traces); and, for every state–action pair, Q ← Q + αδe followed by e ← γλe. Here, α is a learning rate parameter and γ is a discount factor governing the weight placed on future, as opposed to immediate, rewards. The values in e(s, a), known as eligibility traces, store the credit that past action choices should receive for current rewards; the parameter λ governs how much credit is delivered back to them. A minimal code sketch follows this list.
- This agent-centered orientation requires a different perspective on the standard algorithm, reorganizing it into routines run at the start of each episode, on each SMDP step, and at the end of each episode.
- These three routines are presented in detail in Section 4.3.
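As promised above, here is a minimal tabular Sarsa(λ) with replacing traces and an ε-greedy policy. It assumes a generic episodic env interface (reset/step) and default parameters chosen purely for illustration; the keepaway agents replace the Q table with the tile-coded function approximation of Section 4.2.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, episodes, alpha=0.125, gamma=1.0, lam=0.5, eps=0.01):
    """Minimal tabular Sarsa(lambda) with replacing traces; the env interface
    (reset() -> s, step(a) -> (s', r, done)) is assumed, not the soccer server's."""
    Q = defaultdict(float)

    def choose(s):  # epsilon-greedy with respect to the current estimates
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)            # eligibility traces
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = choose(s2) if not done else None
            delta = r - Q[(s, a)] + (0.0 if done else gamma * Q[(s2, a2)])
            e[(s, a)] = 1.0               # replacing (not accumulating) trace
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam     # decay credit for older decisions
            s, a = s2, a2
    return Q
```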
4.2 Function Approximation
- The basic Sarsa(λ) algorithm assumes that each action can be tried in each state infinitely often so as to fully and accurately populate the table of Q-values. In a continuous state space such as keepaway's this is impossible: most states will never be visited even once.
- Rather, the agent needs to learn, based on limited experiences, how to act in new situations.
- Many different function approximators exist and have been used successfully (Sutton & Barto, 1998, Section 8).
- Tile coding allows us to take arbitrary groups of continuous state variables and lay infinite, axis-parallel tilings over them.
- Thus the primary memory (weight) vector and the eligibility trace vector are sparse: each state activates exactly one tile per tiling, so only a small, fixed number of elements are nonzero at each step. A sketch follows this list.
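Here is a sketch of the tile-coding idea for two state variables. The tiling count, tile width, and hash size are placeholders, and real CMAC implementations choose their tiling offsets more carefully; the point is that each state activates exactly one tile per tiling, so lookups and updates touch only num_tilings weights.

```python
def active_tiles(x, y, num_tilings=32, tile_width=3.0, hash_size=8192):
    """Sketch of CMAC tile coding over two continuous variables: each of the
    overlapping tilings is offset by a fraction of the tile width and
    contributes exactly one active tile, so a state activates num_tilings
    features out of hash_size. All parameters are illustrative."""
    tiles = []
    for t in range(num_tilings):
        offset = t * tile_width / num_tilings            # shift each tiling
        col = int((x + offset) // tile_width)
        row = int((y + offset) // tile_width)
        tiles.append(hash((t, col, row)) % hash_size)    # hashing bounds memory
    return tiles

def q_value(theta, tiles):
    """The value of an action is just the sum of its active tiles' weights."""
    return sum(theta[i] for i in tiles)
```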
4.3 Algorithmic Detail
- The authors present the full details of their approach as well as the parameter values they chose, and how they arrived at them.
- Finally, in lines 7–8, the eligibility traces for each active tile of the selected action are set to 1, allowing the weights of these tiles to receive learning updates in the following step.
- RLstep is run on each SMDP step (and so only when some keeper has the ball); a simplified sketch of this routine follows the list.
- Second, in line 10, the authors begin to calculate the error in their action value estimates by computing the difference between r, the reward they received, and QLastAction, the expected return of the previous SMDP step.
- In previous work (Stone et al., 2001), the authors experimented systematically with a range of values for the step-size parameter.
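Putting Sections 4.1 and 4.2 together, one SMDP step for a keeper might look roughly like this. It is a simplified reading of the RLstep routine, not a transcription of it: agent.tiles_for, the trace cutoff, and the field names are assumptions, and γ is taken as 1 since keepaway is episodic with a time-based reward.

```python
import random

def rl_step(agent, reward, state_vars):
    """Simplified reading of RLstep: finish the update for the previous
    macro-action, then pick the next one. agent carries theta (weights),
    e (traces), q_last, and the exploration/learning parameters."""
    # Start the TD error: reward received minus the previous estimate.
    delta = reward - agent.q_last

    # Evaluate each action from the tiles active in the new state.
    tiles = {a: agent.tiles_for(state_vars, a) for a in agent.actions}
    q = {a: sum(agent.theta[i] for i in tiles[a]) for a in agent.actions}
    action = (random.choice(agent.actions) if random.random() < agent.eps
              else max(q, key=q.get))

    # Finish the TD error (gamma = 1 in episodic keepaway) and update
    # every weight with a nonzero trace.
    delta += q[action]
    for i, trace in agent.e.items():
        agent.theta[i] += agent.alpha * delta * trace

    # Decay old traces, then set the chosen action's active tiles to 1.
    agent.e = {i: t * agent.lam for i, t in agent.e.items() if t * agent.lam > 1e-4}
    for i in tiles[action]:
        agent.e[i] = 1.0                  # replacing traces

    agent.q_last = q[action]
    return action
```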
5 Empirical Results
- In this section the authors report their empirical results in the keepaway domain.
- In previous work (Stone et al., 2001), the authors first learned a value function for the case in which the agents all used fixed, hand-coded policies.
- Here the authors extend those results by reporting performance with the full set of sensory challenges presented by the RoboCup simulator.
- Section 5.1 summarizes their previously reported initial results.
- The authors then outline a series of follow-up questions in Section 5.2 and address them empirically in Section 5.3.
5.1 Initial Results
- In the RoboCup soccer simulator, agents typically have limited and noisy sensors: each player can see objects within a 90° view cone, and the precision of an object's sensed location degrades with distance (a toy sensing model is sketched after this list).
- To benchmark the performance of the learned keepers, the authors first ran the three benchmark keeper policies, Random, Always Hold, and Hand-coded, as laid out in Section 3.1.
- Figure 9 shows histograms of the lengths of the episodes generated by these policies.
- All learning runs quickly found a much better policy than any of the benchmark policies, including the hand-coded policy.
- The learning curves still appear to be rising after 40 hours.
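As a toy illustration of the sensor limitations noted at the start of this subsection, the model below hides objects outside a 90° view cone and adds range noise that grows with distance. It is purely illustrative; the soccer server's actual quantization and noise rules differ.

```python
import math
import random

def sense(viewer_pos, viewer_dir, obj_pos, view_cone=90.0, noise_scale=0.05):
    """Illustrative sensor model (an assumption, not the server's): objects
    outside the view cone are unseen, and the sensed range of visible
    objects gets Gaussian noise proportional to distance."""
    dx, dy = obj_pos[0] - viewer_pos[0], obj_pos[1] - viewer_pos[1]
    angle = math.degrees(math.atan2(dy, dx)) - viewer_dir
    angle = (angle + 180) % 360 - 180            # wrap into (-180, 180]
    if abs(angle) > view_cone / 2:
        return None                              # outside the 90-degree cone
    d = math.hypot(dx, dy)
    return d + random.gauss(0.0, noise_scale * d), angle
```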
5.2 Follow-up Questions
- The initial results in Section 5.1 represent the main result of this work.
- They demonstrate the power and robustness of distributed SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ.
1. Does the learning approach described above continue to work if the agents are limited to noisy, narrowed vision?
- The authors' initial complete, noiseless-vision simplification was convenient for two reasons.
- First, the learning agents had no uncertainty about the state of the world, effectively changing the world from partially observable to fully observable.
- Second, as a result, the agents did not need to incorporate any information-gathering actions into their policies, thus simplifying them considerably.
- For these techniques to scale up to more difficult problems such as RoboCup soccer, agents must be able to learn using sensory information that is often incomplete and noisy.
2. How does a learned policy perform in comparison to a hand-coded policy that has been manually tuned?
- The authors' initial results compared learned policies to a hand-coded policy that was not tuned at all.
- This policy was able to perform only slightly better than Random.
- In particular, the authors seek to discover here whether learning, as opposed to hand-coding policies, (i) leads to a superior solution; (ii) saves effort but produces a similarly effective solution; or (iii) trades off manual effort against performance.
3. How robust are these methods to differing field sizes?
- The cost of manually tuning a hand-coded policy can generally be tolerated for single problems.
- However, having to retune the policy every time the problem specification is slightly changed can become quite cumbersome.
- Reinforcement learning becomes an especially valuable tool for RoboCup soccer if it can handle domain alterations more robustly than hand-coded solutions.
4. How dependent are the results on the state representation?
- The choice of input representation can have a dramatic effect on the performance and computation time of a machine learning solution.
- For this reason, the representation is typically chosen with great care.
- It is often difficult to detect and avoid redundant and irrelevant information.
- Ideally, the learning algorithm would be able to detect the relevance of its state variables on its own.
6. How well do the results scale to larger problems?
- The overall goal of this line of research is to develop reinforcement learning techniques that will scale to 11v11 soccer on a full-sized field.
- Previous results in the keepaway domain have typically included just three keepers and at most two takers.
- The largest learned keepaway solution that the authors know of is in the 4v3 scenario presented in Section 5.1.
- This research examines whether current methods can scale up beyond that.
7. Is the source of the difficulty the learning task itself, or the fact that multiple agents are learning simultaneously?
- Keepaway is a multiagent task in which all of the agents learn simultaneously and independently.
- On the surface, it is unclear whether the learning challenge stems mainly from the fact that the agents are learning to interact with one another, or mainly from the difficulty of the task itself.
- Perhaps it is just as hard for an individual agent to learn to collaborate with previously trained experts as it is for the agent to learn simultaneously with other learners.
- Or, on the other hand, perhaps the fewer agents that are learning, and the more that are pre-trained, the quicker the learning happens.
5.3 Detailed Studies
- This section addresses each of the questions listed in Section 5.2 with focused experiments in the keepaway domain.
- The learned policies also outperform their Hand-coded policy, which the authors describe in detail in the next section.
- This policy was able to hold the ball for an average of 8.2 s. From Figure 12 the authors can see that the keepers are able to learn policies that outperform their initial Hand-coded policy and exhibit performance roughly as good as (perhaps slightly better than) the tuned version.
- As a starting point, notice that their Hand-coded policy uses only a small subset of the 13 state variables mentioned previously.
- The learning curves for all nine runs are shown in Figure 17.
7 Conclusion
- This article presents an application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to a complex, multiagent task in a stochastic, dynamic environment.
- With remarkably few training episodes, simultaneously learning agents achieve significantly better performance than a range of benchmark policies, including a reasonable hand-coded policy, and comparable performance to a tuned hand-coded policy.
- The authors' ongoing research aims to build upon the work reported in this article in many ways.
- Preliminary work in this direction demonstrates the utility of transferring learned behaviors from 3v2 to 4v3 and 5v4 scenarios (Taylor & Stone, 2005).
Frequently Asked Questions (13)
Q2. What are the future works mentioned in the paper "Reinforcement learning for robocup soccer keepaway" ?
Taken as a whole, the experiments reported in this article demonstrate the possibility of multiple independent agents learning simultaneously in a complex environment using reinforcement learning after a small number of trials. They represent a possibility result and success story for reinforcement learning. An important direction for future research is to explore whether reinforcement learning techniques can be extended to keepaway with large, discrete, or continuous, parameterized action spaces, perhaps using policy gradient methods (Sutton et al., 2000). This latter possibility would enable passes in front of a teammate so that it can move to meet the ball.
Q3. What is the key challenge for applying RL in environments with large state spaces?
A key challenge for applying RL in environments with large state spaces is to be able to generalize the state representation so as to make learning work in practice despite a relatively sparse sample of the state space.
Q4. What is the advantage of tile coding?
An advantage of tile coding is that it allows us ultimately to learn weights associated with discrete, binary features, thus eliminating issues of scaling among features of different types.
Q5. How can the authors achieve quick generalization while maintaining the ability to learn fine distinctions?
By overlaying multiple tilings it is possible to achieve quick generalization while maintaining the ability to learn fine distinctions.
Q6. How did Uchibe learn to shoot a ball into a goal?
Using real robots, Uchibe (1999) used reinforcement learning methods to learn to shoot a ball into a goal while avoiding an opponent.
Q7. What is the main leverage for both hierarchical and factored approaches?
The main leverage for both factored and hierarchical approaches is that they allow the agent to ignore the parts of its state that are irrelevant to its current decision (Andre & Russell, 2002).
Q8. What is the importance of comparing the performance of a learned policy to that of a carefully thought-out benchmark?
Although the need for manual tuning of parameters is precisely what the authors try to avoid by using machine learning, to assess properly the value of learning, it is important to compare the performance of a learned policy to that of a benchmark that has been carefully thought out.
Q9. What is the effect of the hand-coded policy on the keepers?
Although the authors found that the keepers were able to achieve better than random performance with as little as 1 state variable, the 5 variables used in the hand-coded policy seem to be minimal for peak performance.
Q10. How is the degree to which a player is open calculated?
The degree to which the player is open is calculated as a linear combination of the teammate’s distance to its nearest opponent, and the angle between the teammate, K1, and the opponent closest to the passing line.
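A direct sketch of that calculation follows, with the two weights left as assumptions since the summary does not give their values:

```python
import math

def openness(teammate, k1, opponents, w_dist=1.0, w_angle=1.0):
    """Openness as described above: a linear combination of the teammate's
    distance to its nearest opponent and the angle at K1 between the
    teammate and the opponent closest to the passing line. The weights
    w_dist and w_angle are assumptions."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def angle_at(vertex, p, q):  # angle p-vertex-q in degrees
        a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
        a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
        return abs(math.degrees((a1 - a2 + math.pi) % (2 * math.pi) - math.pi))

    nearest = min(opponents, key=lambda o: dist(teammate, o))
    blocker = min(opponents, key=lambda o: angle_at(k1, teammate, o))
    return w_dist * dist(teammate, nearest) + w_angle * angle_at(k1, teammate, blocker)
```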
Q11. What happens when the player receives no new information about a variable?
Each time step in which the player receives no new information about a variable, the variable’s confidence is multiplied by a decay rate (0.99 in their experiments).
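In code, that confidence bookkeeping is a one-liner; the reset to 1.0 on a fresh reading is an assumption about how full confidence is represented:

```python
def update_confidence(conf, got_new_reading, decay=0.99):
    """Reset confidence on a fresh observation (assumed to be 1.0); otherwise
    multiply by the decay rate (0.99 in the authors' experiments)."""
    return 1.0 if got_new_reading else conf * decay
```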
Q12. Why is it difficult to characterize objectively the extent to which the independent learners specialize?
Due to the complex policy representation (thousands of weights), it is difficult to characterize objectively the extent to which the independent learners specialize or learn different, perhaps complementary, policies.
Q13. What is the main reason why the hand-coded policy did so well?
Because the Hand-coded policy did quite well without using the remaining variables, the authors wondered if perhaps the unused state variables were not essential for the keepaway task.