Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman
Brown University / Bellcore
Department of Computer Science
Brown University
Providence, RI 02912-1910
mlittman@cs.brown.edu
Abstract

In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.
1 INTRODUCTION
No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that justifies it is inappropriate for multi-agent environments. The theory of Markov decision processes (MDP's) [Barto et al., 1989, Howard, 1960], which underlies much of the recent work on reinforcement learning, assumes that the agent's environment is stationary and as such contains no other adaptive agents.
The theory of games [von Neumann and Morgenstern, 1947] is explicitly designed for reasoning about multi-agent systems. Markov games (see, e.g., [Van Der Wal, 1981]) are an extension of game theory to MDP-like environments. This paper considers the consequences of using the Markov game framework in place of MDP's in reinforcement learning. Only the specific case of two-player zero-sum games is addressed, but even in this restricted version there are insights that can be applied to open questions in the field of reinforcement learning.
2 DEFINITIONS
An MDP [Howard, 1960] is defined by a set of states, $S$, and actions, $A$. A transition function, $T: S \times A \rightarrow PD(S)$, defines the effects of the various actions on the state of the environment. ($PD(S)$ represents the set of discrete probability distributions over the set $S$.) The reward function, $R: S \times A \rightarrow \Re$, specifies the agent's task.
In broad terms, the agent's objective is to find a policy mapping its interaction history to a current choice of action so as to maximize the expected sum of discounted reward, $E\{\sum_{j=0}^{\infty} \gamma^j r_{t+j}\}$, where $r_{t+j}$ is the reward received $j$ steps into the future. A discount factor, $0 \le \gamma < 1$, controls how much effect future rewards have on the optimal decisions, with small values of $\gamma$ emphasizing near-term gain and larger values giving significant weight to later rewards.
In its general form, a Markov game, sometimes called a stochastic game [Owen, 1982], is defined by a set of states, $S$, and a collection of action sets, $A_1, \ldots, A_k$, one for each agent in the environment. State transitions are controlled by the current state and one action from each agent: $T: S \times A_1 \times \cdots \times A_k \rightarrow PD(S)$. Each agent also has an associated reward function, $R_i: S \times A_1 \times \cdots \times A_k \rightarrow \Re$, for agent $i$, and attempts to maximize its expected sum of discounted rewards, $E\{\sum_{j=0}^{\infty} \gamma^j r_{i,t+j}\}$, where $r_{i,t+j}$ is the reward received $j$ steps into the future by agent $i$.
In this paper, we consider a well-studied specialization in which there are only two agents and they have diametrically opposed goals. This allows us to use a single reward function that one agent tries to maximize and the other, called the opponent, tries to minimize. In this paper, we use $A$ to denote the agent's action set, $O$ to denote the opponent's action set, and $R(s, a, o)$ to denote the immediate reward to the agent for taking action $a \in A$ in state $s \in S$ when its opponent takes action $o \in O$.

Adopting this specialization, which we call a two-player zero-sum Markov game, simplifies the mathematics but makes it impossible to consider important phenomena such as cooperation. However, it is a first step and can be considered a strict generalization of both MDP's (when $|O| = 1$) and matrix games (when $|S| = 1$).
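For concreteness, the pieces of a two-player zero-sum Markov game can be collected into a single data structure. The sketch below is purely illustrative (the class and field names are not from the paper) and is used only to fix notation.

```python
# A minimal container (illustrative only) for a two-player zero-sum Markov game.
from dataclasses import dataclass
import numpy as np

@dataclass
class ZeroSumMarkovGame:
    n_states: int      # |S|
    n_actions: int     # |A|, the agent's actions
    n_opponent: int    # |O|, the opponent's actions
    T: np.ndarray      # T[s, a, o, s'] = probability of moving to s'
    R: np.ndarray      # R[s, a, o]     = immediate reward to the agent
    gamma: float       # discount factor

# Setting n_opponent = 1 recovers an MDP; setting n_states = 1 recovers a matrix game.
```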

As in MDP’s, the discount factor,
, can be thought of
as the probability that the game will be allowed to con-
tinue after the current move. It is possible to define a no-
tion of undiscounted rewards
[
Schwartz, 1993
]
, but not all
Markov games have optimal strategies in the undiscounted
case
[
Owen, 1982
]
. This is because, in many games, it is
best to postpone risky actions indefinitely. For current pur-
poses, thediscount factorhas the desirableeffect of goading
the players into trying to win sooner rather than later.
3 OPTIMAL POLICIES
The previous section defined the agent's objective as maximizing the expected sum of discounted reward. There are subtleties in applying this definition to Markov games, however. First, we consider the parallel scenario in MDP's.

In an MDP, an optimal policy is one that maximizes the expected sum of discounted reward and is undominated, meaning that there is no state from which any other policy can achieve a better expected sum of discounted reward. Every MDP has at least one optimal policy and, of the optimal policies for a given MDP, at least one is stationary and deterministic. This means that, for any MDP, there is a policy $\pi: S \rightarrow A$ that is optimal. The policy $\pi$ is called stationary since it does not change as a function of time, and it is called deterministic since the same action is always chosen whenever the agent is in state $s$, for all $s \in S$.
For many Markov games, there is no policy that is undominated because performance depends critically on the choice of opponent. In the game theory literature, the resolution to this dilemma is to eliminate the choice and evaluate each policy with respect to the opponent that makes it look the worst. This performance measure prefers conservative strategies that can force any opponent to a draw over more daring ones that accrue a great deal of reward against some opponents and lose a great deal to others. This is the essence of minimax: behave so as to maximize your reward in the worst case.
Given this definition of optimality, Markov games have several important properties. Like MDP's, every Markov game has a non-empty set of optimal policies, at least one of which is stationary. Unlike MDP's, there need not be a deterministic optimal policy. Instead, the optimal stationary policy is sometimes probabilistic, mapping states to discrete probability distributions over actions, $\pi: S \rightarrow PD(A)$. A classic example is "rock, paper, scissors," in which any deterministic policy can be consistently defeated.

The idea that optimal policies are sometimes stochastic may seem strange to readers familiar with MDP's or games with alternating turns like backgammon or tic-tac-toe, since in these frameworks there is always a deterministic policy that does no worse than the best probabilistic one. The need for probabilistic action choice stems from the agent's uncertainty about its opponent's current move and its need to avoid being "second-guessed."
                          Agent
                  rock   paper   scissors
           rock     0       1        -1
Opponent  paper    -1       0         1
       scissors     1      -1         0

Table 1: The matrix game for "rock, paper, scissors."
    $\pi_{paper} - \pi_{scissors} \ge V$   (vs. rock)
    $-\pi_{rock} + \pi_{scissors} \ge V$   (vs. paper)
    $\pi_{rock} - \pi_{paper} \ge V$   (vs. scissors)
    $\pi_{rock} + \pi_{paper} + \pi_{scissors} = 1$

Table 2: Linear constraints on the solution to a matrix game.
4 FINDING OPTIMAL POLICIES
This section reviews methods for finding optimal policies for matrix games, MDP's, and Markov games. It uses a uniform notation that is intended to emphasize the similarities between the three frameworks. To avoid confusion, function names that appear more than once appear with different numbers of arguments each time.
4.1 MATRIX GAMES
At the core of the theory of games is the matrix game defined by a matrix, $R$, of instantaneous rewards. Component $R_{i,j}$ is the reward to the agent for choosing action $j$ when its opponent chooses action $i$. The agent strives to maximize its expected reward while the opponent tries to minimize it. Table 1 gives the matrix game corresponding to "rock, paper, scissors."
The agent’s policy is a probability distributionover actions,
2
PD
(
A
)
. For “rock, paper, scissors,
is made up of
3 components:
rock
,
paper
, and
scissors
. According to the
notion of optimality discussed earlier, the optimal agent’s
minimum expected reward should be as large as possible.
How can we find a policy that achieves this? Imagine that
we would be satisfied with a policy that is guaranteed an
expected score of
V
no matter which action the opponent
chooses. The inequalities in Table 2, with
0, constrain
the components of
to represent exactly those policies—
any solution to the inequalities would suffice.
For $\pi$ to be optimal, we must identify the largest $V$ for which there is some value of $\pi$ that makes the constraints hold. Linear programming (see, e.g., [Strang, 1980]) is a general technique for solving problems of this kind. In this example, linear programming finds a value of 0 for $V$ and (1/3, 1/3, 1/3) for $\pi$. We can abbreviate this linear program as:
$$V = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} R_{o,a}\, \pi_a,$$

where $\sum_a R_{o,a} \pi_a$ expresses the expected reward to the agent for using policy $\pi$ against the opponent's action $o$.
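This linear program can be handed directly to an off-the-shelf solver. The following sketch, which assumes NumPy and SciPy are available, encodes the constraints of Table 2 for the reward matrix of Table 1; the particular encoding (variable ordering, use of scipy.optimize.linprog) is an illustrative choice, not something prescribed by the paper.

```python
# Solving the "rock, paper, scissors" matrix game with linear programming.
import numpy as np
from scipy.optimize import linprog

# R[o, a]: reward to the agent for action a when the opponent plays o (Table 1).
R = np.array([[ 0,  1, -1],   # opponent plays rock
              [-1,  0,  1],   # opponent plays paper
              [ 1, -1,  0]])  # opponent plays scissors
n_opp, n_agent = R.shape

# Variables: x = (pi_1, ..., pi_n, V).  Maximize V, i.e. minimize -V.
c = np.zeros(n_agent + 1)
c[-1] = -1.0

# For every opponent action o:  V - sum_a R[o, a] * pi_a <= 0
A_ub = np.hstack([-R, np.ones((n_opp, 1))])
b_ub = np.zeros(n_opp)

# The policy components sum to one.
A_eq = np.hstack([np.ones((1, n_agent)), np.zeros((1, 1))])
b_eq = np.array([1.0])

bounds = [(0, None)] * n_agent + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

pi, V = res.x[:-1], res.x[-1]
print(pi, V)   # approximately (1/3, 1/3, 1/3) and 0
```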

4.2 MDP’s
There is a host of methods for solving MDP's. This section describes a general method known as value iteration [Bertsekas, 1987].
The value of a state, $V(s)$, is the total expected discounted reward attained by the optimal policy starting from state $s \in S$. States for which $V(s)$ is large are "good" in that a smart agent can collect a great deal of reward starting from those states. The quality of a state-action pair, $Q(s, a)$, is the total expected discounted reward attained by the non-stationary policy that takes action $a \in A$ from state $s \in S$ and then follows the optimal policy from then on. These functions satisfy the following recursive relationship for all $a$ and $s$:
$$Q(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s') \qquad (1)$$

$$V(s) = \max_{a' \in A} Q(s, a') \qquad (2)$$
This says that the quality of a state-action pair is the immediate reward plus the discounted value of all succeeding states weighted by their likelihood. The value of a state is the quality of the best action for that state. It follows that knowing $Q$ is enough to specify an optimal policy, since in each state we can choose the action with the highest $Q$-value.
The method of value iteration starts with estimates for $Q$ and $V$ and generates new estimates by treating the equal signs in Equations 1–2 as assignment operators. It can be shown that the estimated values for $Q$ and $V$ converge to their true values [Bertsekas, 1987].
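As a concrete illustration of treating the equal signs in Equations 1–2 as assignments, here is a minimal tabular value iteration sketch; the tiny two-state MDP is invented purely for illustration.

```python
# Value iteration (Equations 1-2) on an invented two-state, two-action MDP.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))   # T[s, a, s']
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.5, 0.5]; T[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],                        # R[s, a]
              [2.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T.dot(V)        # Equation 1 as an assignment
    V_new = Q.max(axis=1)           # Equation 2 as an assignment
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)           # greedy (deterministic) optimal policy
print(V, policy)
```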
4.3 MARKOV GAMES
Given $Q(s, a)$, an agent can maximize its reward using the "greedy" strategy of always choosing the action with the highest $Q$-value. This strategy is greedy because it treats $Q(s, a)$ as a surrogate for immediate reward and then acts to maximize its immediate gain. It is optimal because the $Q$-function is an accurate summary of future rewards.
A similar observation can be used for Markov games once we redefine $V(s)$ to be the expected reward for the optimal policy starting from state $s$, and $Q(s, a, o)$ as the expected reward for taking action $a$ when the opponent chooses $o$ from state $s$ and continuing optimally thereafter. We can then treat the $Q(s, a, o)$ values as immediate payoffs in an unrelated sequence of matrix games (one for each state, $s$), each of which can be solved optimally using the techniques of Section 4.1.
Thus, the value of a state $s \in S$ in a Markov game is

$$V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o)\, \pi_a,$$

and the quality of action $a$ against action $o$ in state $s$ is

$$Q(s, a, o) = R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s')\, V(s').$$
The resulting recursive equations look much like Equations 1–2, and indeed the analogous value iteration algorithm can be shown to converge to the correct values [Owen, 1982].

It is worth noting that in games with alternating turns, the value function need not be computed by linear programming since there is an optimal deterministic policy. In this case we can write $V(s) = \max_a \min_o Q(s, a, o)$.
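For a single state of an alternating-turn game, this backup is just a max over the row minima of that state's Q matrix. A brief NumPy sketch with invented numbers, purely for illustration:

```python
# Per-state backup for alternating-turn games: no linear program is needed.
import numpy as np

Q_s = np.array([[0.0, -1.0],
                [0.5,  0.2]])            # Q(s, a, o) estimates; values are made up
V_s = Q_s.min(axis=1).max()              # V(s) = max_a min_o Q(s, a, o)
best_a = int(Q_s.min(axis=1).argmax())   # a deterministic optimal action in state s
print(V_s, best_a)                       # 0.2, action 1
```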
5 LEARNING OPTIMAL POLICIES
Traditionally, solving an MDP using value iteration involves applying Equations 1–2 simultaneously over all $s \in S$. Watkins [Watkins, 1989] proposed an alternative approach that involves performing the updates asynchronously without the use of the transition function, $T$.
In this Q-learning formulation, an update is performed by an agent whenever it receives a reward of $r$ when making a transition from $s$ to $s'$ after taking action $a$. The update is $Q(s, a) := r + \gamma V(s')$, which takes the place of Equation 1. The probability with which this happens is precisely $T(s, a, s')$, which is why it is possible for an agent to carry out the appropriate update without explicitly using $T$. This learning rule converges to the correct values for $Q$ and $V$, assuming that every action is tried in every state infinitely often and that new estimates are blended with previous ones using a slow enough exponentially weighted average [Watkins and Dayan, 1992].
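A minimal tabular sketch of this update rule is shown below; the state and action counts, learning rate, and function name are illustrative, not taken from the paper.

```python
# Tabular Q-learning update: blend the sample r + gamma * V(s') into Q(s, a).
import numpy as np

n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One asynchronous Q-learning update for the observed transition."""
    V_next = Q[s_next].max()                       # V(s') = max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V_next)

# Example: from state 0, action 1 yields reward 1.0 and lands in state 2.
q_update(0, 1, 1.0, 2)
```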
It is straightforward, though seemingly novel, to apply the same technique to solving Markov games. A completely specified version of the algorithm is given in Figure 1. The variables in the figure warrant explanation since some are given to the algorithm as part of the environment, others are internal to the algorithm, and still others are parameters of the algorithm itself.
Variables from the environment are: the state set, S; the action set, A; the opponent's action set, O; and the discount factor, gamma. The variables internal to the learner are: a learning rate, alpha, which is initialized to 1.0 and decays over time; the agent's estimate of the $Q$-function, Q; the agent's estimate of the $V$-function, V; and the agent's current policy for state $s$, pi[s,.]. The remaining variables are parameters of the algorithm: explor controls how often the agent will deviate from its current policy to ensure that the state space is adequately explored, and decay controls the rate at which the learning rate decays.
This algorithm is called minimax-Q since it is essentially identical to the standard Q-learning algorithm with a minimax replacing the max.
6 EXPERIMENTS
This section demonstrates the minimax-Q learning algorithm using a simple two-player zero-sum Markov game modeled after the game of soccer.

Initialize:
  For all s in S, a in A, and o in O,
    Let Q[s,a,o] := 1
  For all s in S,
    Let V[s] := 1
  For all s in S, a in A,
    Let pi[s,a] := 1/|A|
  Let alpha := 1.0

Choose an action:
  With probability explor, return an action uniformly at random.
  Otherwise, if current state is s,
    Return action a with probability pi[s,a].

Learn:
  After receiving reward rew for moving from state s to s'
  via action a and opponent's action o,
    Let Q[s,a,o] := (1-alpha) * Q[s,a,o] + alpha * (rew + gamma * V[s'])
    Use linear programming to find pi[s,.] such that:
      pi[s,.] := argmax{pi'[s,.], min{o', sum{a', pi'[s,a'] * Q[s,a',o']}}}
    Let V[s] := min{o', sum{a', pi[s,a'] * Q[s,a',o']}}
    Let alpha := alpha * decay

Figure 1: The minimax-Q algorithm.
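For readers who prefer executable code, the following is a hedged Python sketch of the update in Figure 1, assuming NumPy and SciPy are available. The environment sizes, the solve_matrix_game helper, and the use of scipy.optimize.linprog are illustrative choices; the update rule itself follows the figure.

```python
# A self-contained sketch of the minimax-Q update (Figure 1); sizes are invented.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, n_opp = 10, 5, 5
gamma, explor, decay = 0.9, 0.2, 0.9999954

Q = np.ones((n_states, n_actions, n_opp))          # Q[s, a, o]
V = np.ones(n_states)                              # V[s]
pi = np.full((n_states, n_actions), 1.0 / n_actions)
alpha = 1.0

def solve_matrix_game(M):
    """Return (pi, v) maximizing the worst-case expected value of the |A| x |O| matrix M."""
    n_a, n_o = M.shape
    c = np.zeros(n_a + 1); c[-1] = -1.0            # maximize v
    A_ub = np.hstack([-M.T, np.ones((n_o, 1))])    # v - sum_a M[a, o] * pi_a <= 0 for all o
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n_a + [(None, None)])
    return res.x[:-1], res.x[-1]

def choose_action(s, rng):
    """With probability explor act uniformly at random, otherwise follow pi[s]."""
    if rng.random() < explor:
        return rng.integers(n_actions)
    return rng.choice(n_actions, p=pi[s])

def learn(s, a, o, rew, s_next):
    """The minimax-Q update from Figure 1."""
    global alpha
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (rew + gamma * V[s_next])
    pi[s], V[s] = solve_matrix_game(Q[s])          # linear program for state s
    alpha *= decay

# Example usage with made-up transition data.
rng = np.random.default_rng(0)
a = choose_action(0, rng)
learn(0, a, 2, rew=1.0, s_next=3)
```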
Figure 2: An initial board (left) and a situation requiring a probabilistic choice for A (right).

6.1 SOCCER
The game is played on a 4x5 grid as depicted in Figure 2. The two players, A and B, occupy distinct squares of the grid and can choose one of 5 actions on each turn: N, S, E, W, and stand. Once both players have selected their actions, the two moves are executed in random order.

The circle in the figures represents the "ball." When the player with the ball steps into the appropriate goal (left for A, right for B), that player scores a point and the board is reset to the configuration shown in the left half of the figure. Possession of the ball goes to one or the other player at random.
When a player executes an action that would take it to the
square occupied by the other player, possession of the ball
goes to the stationary player and the move does not take
place. A good defensive maneuver, then, is to stand where
the other player wants to go. Goals are worth one point
and the discount factor is set to 0.9, which makes scoring
sooner somewhat better than scoring later.
For an agent on the offensive to do better than breaking even against an unknown defender, the agent must use a probabilistic policy. For instance, in the example situation shown in the right half of Figure 2, any deterministic choice for A can be blocked indefinitely by a clever opponent. Only by choosing randomly between stand and S can the agent guarantee an opening and therefore an opportunity to score.
6.2 TRAINING AND TESTING
Four different policies were learned, two using the
minimax-Q algorithm and two using Q-learning. For each
learning algorithm, one learner was trained against a ran-
dom opponent and the other against another learner of
identical design. The resulting policies were named MR,
MM, QR, and QQ for minimax-Q trained against random,
minimax-Q trained against minimax-Q, Q trained against
random, and Q trained against Q.
The minimax-Q algorithm (MR, MM) was as described in Figure 1 with explor = 0.2 and decay = $10^{\log_{10}(0.01)/10^6}$ = 0.9999954, and learning took place for one million steps. (The value of decay was chosen so that the learning rate reached 0.01 at the end of the run.) The Q-learning algorithm (QR, QQ) was identical except that a "max" operator was used in place of the minimax and the $Q$-table did not keep information about the opponent's action. Parameters were set identically to the minimax-Q case.
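As a quick sanity check of the quoted constant, the decay value can be recomputed directly; the snippet below is illustrative only.

```python
# alpha starts at 1.0, is multiplied by decay on every learning step,
# and should reach 0.01 after 10^6 steps.
import math

decay = 10 ** (math.log10(0.01) / 10**6)   # equivalently 0.01 ** 1e-6
print(decay)            # ~0.9999954
print(decay ** 10**6)   # ~0.01
```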
For MR and QR, the opponent for training was a fixed policy that chose actions uniformly at random. For MM and QQ, the opponent was another learner identical to the first but with separate $Q$- and $V$-tables.
The resulting policies were evaluated in three ways. First,
each policy was run head-to-head with a random policy
for one hundred thousand steps. To emulate the discount
factor, every step had a 0.1 probability of being declared a
draw. Wins and losses against the random opponent were
tabulated.
The second test was a head-to-head competition with a
hand-built policy. This policy was deterministic and had
simple rules for scoring and blocking. In 100,000 steps, it
completed 5600 games against the random opponent and
won 99.5% of them.
The third test used Q-learning to train a “challenger” op-
ponent for each of MR, MM, QR and QQ. The training
procedure for the challengers followed that of QR where
the “champion” policy was held fixed while the challenger
was trained against it. The resulting policies were then
evaluated against their respective champions. This test was
repeated three times to ensure stability with only the first
reported here. All evaluations were repeated three times
and averaged.
6.3 RESULTS
Table 3 summarizes the results. The columns marked "games" list the number of completed games in 100,000 steps and the columns marked "% won" list the percentage won by the associated policy. Percentages close to 50 indicate that the contest was nearly a draw.
All the policies did quite well when tested against the random opponent. The QR policy's performance was quite remarkable, however, since it completed more games than the other policies and won nearly all of them. This might be expected since QR was trained specifically to beat this opponent, whereas MR, though trained in competition with the random policy, chooses actions with an idealized opponent in mind.
Against the hand-built policy, MM and MR did well,
roughly breaking even. The MM policy did marginally
better. In the limit, this should not be the case since an
agent trained by the minimax-Q algorithm should be in-
sensitive to the opponent against which it was trained and
always behave so as to maximize its score in the worst case.
The fact that there was a difference suggests that the algo-
rithm had not converged on the optimal policy yet. Prior
to convergence, the opponent can make a big difference to
the behavior of a minimax-Q agent since playing against
a strong opponent means the training will take place in
important parts of the state space.
The performance of the QQ and QR policies against the
hand-built policy was strikingly different. This points out
an important consequence of not using a minimax criterion.
A close look at the two policies indicated that QQ, by luck,
implemented a defense that was perfect against the hand-
built policy. The QR policy, on the other hand, happened
to converge on a strategy that was not appropriate. Against
a slightly different opponent, the tables would have been
turned.
The fact that the QQ policy did so well against the ran-
dom and hand-built opponents, especially compared to the

References

[von Neumann and Morgenstern, 1947] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.

[Watkins and Dayan, 1992] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8(3–4):279–292, 1992.