Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman
Brown University / Bellcore
Department of Computer Science
Brown University
Providence, RI 02912-1910
mlittman@cs.brown.edu
Abstract

In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.
1 INTRODUCTION
No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that justifies it is inappropriate for multi-agent environments. The theory of Markov decision processes (MDP's) [Barto et al., 1989, Howard, 1960], which underlies much of the recent work on reinforcement learning, assumes that the agent's environment is stationary and as such contains no other adaptive agents.
The theory of games [von Neumann and Morgenstern, 1947] is explicitly designed for reasoning about multi-agent systems. Markov games (see, e.g., [Van Der Wal, 1981]) are an extension of game theory to MDP-like environments. This paper considers the consequences of using the Markov game framework in place of MDP's in reinforcement learning. Only the specific case of two-player zero-sum games is addressed, but even in this restricted version there are insights that can be applied to open questions in the field of reinforcement learning.
2 DEFINITIONS
An MDP [Howard, 1960] is defined by a set of states, $S$, and actions, $A$. A transition function, $T: S \times A \rightarrow PD(S)$, defines the effects of the various actions on the state of the environment. ($PD(S)$ represents the set of discrete probability distributions over the set $S$.) The reward function, $R: S \times A \rightarrow \Re$, specifies the agent's task.
In broad terms, the agent's objective is to find a policy mapping its interaction history to a current choice of action so as to maximize the expected sum of discounted reward, $E\{\sum_{j=0}^{\infty} \gamma^j r_{t+j}\}$, where $r_{t+j}$ is the reward received $j$ steps into the future. A discount factor, $0 \le \gamma < 1$, controls how much effect future rewards have on the optimal decisions, with small values of $\gamma$ emphasizing near-term gain and larger values giving significant weight to later rewards.
In its general form, a Markov game, sometimes called a stochastic game [Owen, 1982], is defined by a set of states, $S$, and a collection of action sets, $A_1, \ldots, A_k$, one for each agent in the environment. State transitions are controlled by the current state and one action from each agent: $T: S \times A_1 \times \cdots \times A_k \rightarrow PD(S)$. Each agent also has an associated reward function, $R_i: S \times A_1 \times \cdots \times A_k \rightarrow \Re$, for agent $i$, and attempts to maximize its expected sum of discounted rewards, $E\{\sum_{j=0}^{\infty} \gamma^j r_{i,t+j}\}$, where $r_{i,t+j}$ is the reward received $j$ steps into the future by agent $i$.
In this paper, we consider a well-studied specialization in which there are only two agents and they have diametrically opposed goals. This allows us to use a single reward function that one agent tries to maximize and the other, called the opponent, tries to minimize. In this paper, we use $A$ to denote the agent's action set, $O$ to denote the opponent's action set, and $R(s, a, o)$ to denote the immediate reward to the agent for taking action $a \in A$ in state $s \in S$ when its opponent takes action $o \in O$.

Adopting this specialization, which we call a two-player zero-sum Markov game, simplifies the mathematics but makes it impossible to consider important phenomena such as cooperation. However, it is a first step and can be considered a strict generalization of both MDP's (when $|O| = 1$) and matrix games (when $|S| = 1$).
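For concreteness, the pieces of a two-player zero-sum Markov game can be collected into a single data structure. The sketch below is purely illustrative (the class and field names are not from the paper) and is used only to fix notation.

```python
# A minimal container (illustrative only) for a two-player zero-sum Markov game.
from dataclasses import dataclass
import numpy as np

@dataclass
class ZeroSumMarkovGame:
    n_states: int      # |S|
    n_actions: int     # |A|, the agent's actions
    n_opponent: int    # |O|, the opponent's actions
    T: np.ndarray      # T[s, a, o, s'] = probability of moving to s'
    R: np.ndarray      # R[s, a, o]     = immediate reward to the agent
    gamma: float       # discount factor

# Setting n_opponent = 1 recovers an MDP; setting n_states = 1 recovers a matrix game.
```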

As in MDP’s, the discount factor,
, can be thought of
as the probability that the game will be allowed to con-
tinue after the current move. It is possible to define a no-
tion of undiscounted rewards
[
Schwartz, 1993
]
, but not all
Markov games have optimal strategies in the undiscounted
case
[
Owen, 1982
]
. This is because, in many games, it is
best to postpone risky actions indefinitely. For current pur-
poses, thediscount factorhas the desirableeffect of goading
the players into trying to win sooner rather than later.
3 OPTIMAL POLICIES
The previous section defined the agent's objective as maximizing the expected sum of discounted reward. There are subtleties in applying this definition to Markov games, however. First, we consider the parallel scenario in MDP's.

In an MDP, an optimal policy is one that maximizes the expected sum of discounted reward and is undominated, meaning that there is no state from which any other policy can achieve a better expected sum of discounted reward. Every MDP has at least one optimal policy and, of the optimal policies for a given MDP, at least one is stationary and deterministic. This means that, for any MDP, there is a policy $\pi: S \rightarrow A$ that is optimal. The policy $\pi$ is called stationary since it does not change as a function of time, and it is called deterministic since the same action is always chosen whenever the agent is in state $s$, for all $s \in S$.
For many Markov games, there is no policy that is undominated because performance depends critically on the choice of opponent. In the game theory literature, the resolution to this dilemma is to eliminate the choice and evaluate each policy with respect to the opponent that makes it look the worst. This performance measure prefers conservative strategies that can force any opponent to a draw over more daring ones that accrue a great deal of reward against some opponents and lose a great deal to others. This is the essence of minimax: behave so as to maximize your reward in the worst case.
Given this definition of optimality, Markov games have several important properties. Like MDP's, every Markov game has a non-empty set of optimal policies, at least one of which is stationary. Unlike MDP's, there need not be a deterministic optimal policy. Instead, the optimal stationary policy is sometimes probabilistic, mapping states to discrete probability distributions over actions, $\pi: S \rightarrow PD(A)$. A classic example is "rock, paper, scissors," in which any deterministic policy can be consistently defeated.

The idea that optimal policies are sometimes stochastic may seem strange to readers familiar with MDP's or games with alternating turns like backgammon or tic-tac-toe, since in these frameworks there is always a deterministic policy that does no worse than the best probabilistic one. The need for probabilistic action choice stems from the agent's uncertainty about its opponent's current move and its need to avoid being "second-guessed."
                          Agent
                  rock   paper   scissors
           rock     0       1        -1
Opponent  paper    -1       0         1
       scissors     1      -1         0

Table 1: The matrix game for "rock, paper, scissors."
    $\pi_{paper} - \pi_{scissors} \ge V$   (vs. rock)
    $-\pi_{rock} + \pi_{scissors} \ge V$   (vs. paper)
    $\pi_{rock} - \pi_{paper} \ge V$   (vs. scissors)
    $\pi_{rock} + \pi_{paper} + \pi_{scissors} = 1$

Table 2: Linear constraints on the solution to a matrix game.
4 FINDING OPTIMAL POLICIES
This section reviews methods for finding optimal policies for matrix games, MDP's, and Markov games. It uses a uniform notation that is intended to emphasize the similarities between the three frameworks. To avoid confusion, function names that appear more than once appear with different numbers of arguments each time.
4.1 MATRIX GAMES
At the core of the theory of games is the matrix game defined by a matrix, $R$, of instantaneous rewards. Component $R_{i,j}$ is the reward to the agent for choosing action $j$ when its opponent chooses action $i$. The agent strives to maximize its expected reward while the opponent tries to minimize it. Table 1 gives the matrix game corresponding to "rock, paper, scissors."
The agent’s policy is a probability distributionover actions,
2
PD
(
A
)
. For “rock, paper, scissors,
is made up of
3 components:
rock
,
paper
, and
scissors
. According to the
notion of optimality discussed earlier, the optimal agent’s
minimum expected reward should be as large as possible.
How can we find a policy that achieves this? Imagine that
we would be satisfied with a policy that is guaranteed an
expected score of
V
no matter which action the opponent
chooses. The inequalities in Table 2, with
0, constrain
the components of
to represent exactly those policies—
any solution to the inequalities would suffice.
For $\pi$ to be optimal, we must identify the largest $V$ for which there is some value of $\pi$ that makes the constraints hold. Linear programming (see, e.g., [Strang, 1980]) is a general technique for solving problems of this kind. In this example, linear programming finds a value of 0 for $V$ and (1/3, 1/3, 1/3) for $\pi$. We can abbreviate this linear program as:
$$V = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} R_{o,a}\, \pi_a,$$

where $\sum_a R_{o,a} \pi_a$ expresses the expected reward to the agent for using policy $\pi$ against the opponent's action $o$.
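This linear program can be handed directly to an off-the-shelf solver. The following sketch, which assumes NumPy and SciPy are available, encodes the constraints of Table 2 for the reward matrix of Table 1; the particular encoding (variable ordering, use of scipy.optimize.linprog) is an illustrative choice, not something prescribed by the paper.

```python
# Solving the "rock, paper, scissors" matrix game with linear programming.
import numpy as np
from scipy.optimize import linprog

# R[o, a]: reward to the agent for action a when the opponent plays o (Table 1).
R = np.array([[ 0,  1, -1],   # opponent plays rock
              [-1,  0,  1],   # opponent plays paper
              [ 1, -1,  0]])  # opponent plays scissors
n_opp, n_agent = R.shape

# Variables: x = (pi_1, ..., pi_n, V).  Maximize V, i.e. minimize -V.
c = np.zeros(n_agent + 1)
c[-1] = -1.0

# For every opponent action o:  V - sum_a R[o, a] * pi_a <= 0
A_ub = np.hstack([-R, np.ones((n_opp, 1))])
b_ub = np.zeros(n_opp)

# The policy components sum to one.
A_eq = np.hstack([np.ones((1, n_agent)), np.zeros((1, 1))])
b_eq = np.array([1.0])

bounds = [(0, None)] * n_agent + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

pi, V = res.x[:-1], res.x[-1]
print(pi, V)   # approximately (1/3, 1/3, 1/3) and 0
```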

4.2 MDP’s
There is a host of methods for solving MDP's. This section describes a general method known as value iteration [Bertsekas, 1987].
The value of a state, $V(s)$, is the total expected discounted reward attained by the optimal policy starting from state $s \in S$. States for which $V(s)$ is large are "good" in that a smart agent can collect a great deal of reward starting from those states. The quality of a state-action pair, $Q(s, a)$, is the total expected discounted reward attained by the non-stationary policy that takes action $a \in A$ from state $s \in S$ and then follows the optimal policy from then on. These functions satisfy the following recursive relationship for all $a$ and $s$:
$$Q(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s') \qquad (1)$$

$$V(s) = \max_{a' \in A} Q(s, a') \qquad (2)$$
This says that the quality of a state-action pair is the immediate reward plus the discounted value of all succeeding states weighted by their likelihood. The value of a state is the quality of the best action for that state. It follows that knowing $Q$ is enough to specify an optimal policy, since in each state we can choose the action with the highest $Q$-value.
The method of value iteration starts with estimates for $Q$ and $V$ and generates new estimates by treating the equal signs in Equations 1–2 as assignment operators. It can be shown that the estimated values for $Q$ and $V$ converge to their true values [Bertsekas, 1987].
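As a concrete illustration of treating the equal signs in Equations 1–2 as assignments, here is a minimal tabular value iteration sketch; the tiny two-state MDP is invented purely for illustration.

```python
# Value iteration (Equations 1-2) on an invented two-state, two-action MDP.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))   # T[s, a, s']
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.5, 0.5]; T[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],                        # R[s, a]
              [2.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T.dot(V)        # Equation 1 as an assignment
    V_new = Q.max(axis=1)           # Equation 2 as an assignment
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)           # greedy (deterministic) optimal policy
print(V, policy)
```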
4.3 MARKOV GAMES
Given $Q(s, a)$, an agent can maximize its reward using the "greedy" strategy of always choosing the action with the highest $Q$-value. This strategy is greedy because it treats $Q(s, a)$ as a surrogate for immediate reward and then acts to maximize its immediate gain. It is optimal because the $Q$-function is an accurate summary of future rewards.
A similar observation can be used for Markov games once we redefine $V(s)$ to be the expected reward for the optimal policy starting from state $s$, and $Q(s, a, o)$ as the expected reward for taking action $a$ when the opponent chooses $o$ from state $s$ and continuing optimally thereafter. We can then treat the $Q(s, a, o)$ values as immediate payoffs in an unrelated sequence of matrix games (one for each state, $s$), each of which can be solved optimally using the techniques of Section 4.1.
Thus, the value of a state $s \in S$ in a Markov game is

$$V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o)\, \pi_a,$$

and the quality of action $a$ against action $o$ in state $s$ is

$$Q(s, a, o) = R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s')\, V(s').$$
The resulting recursive equations look much like Equations 1–2, and indeed the analogous value iteration algorithm can be shown to converge to the correct values [Owen, 1982].

It is worth noting that in games with alternating turns, the value function need not be computed by linear programming since there is an optimal deterministic policy. In this case we can write $V(s) = \max_a \min_o Q(s, a, o)$.
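For a single state of an alternating-turn game, this backup is just a max over the row minima of that state's Q matrix. A brief NumPy sketch with invented numbers, purely for illustration:

```python
# Per-state backup for alternating-turn games: no linear program is needed.
import numpy as np

Q_s = np.array([[0.0, -1.0],
                [0.5,  0.2]])            # Q(s, a, o) estimates; values are made up
V_s = Q_s.min(axis=1).max()              # V(s) = max_a min_o Q(s, a, o)
best_a = int(Q_s.min(axis=1).argmax())   # a deterministic optimal action in state s
print(V_s, best_a)                       # 0.2, action 1
```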
5 LEARNING OPTIMAL POLICIES
Traditionally, solving an MDP using value iteration involves applying Equations 1–2 simultaneously over all $s \in S$. Watkins [Watkins, 1989] proposed an alternative approach that involves performing the updates asynchronously without the use of the transition function, $T$.
In this Q-learning formulation, an update is performed by an agent whenever it receives a reward of $r$ when making a transition from $s$ to $s'$ after taking action $a$. The update is $Q(s, a) := r + \gamma V(s')$, which takes the place of Equation 1. The probability with which this happens is precisely $T(s, a, s')$, which is why it is possible for an agent to carry out the appropriate update without explicitly using $T$. This learning rule converges to the correct values for $Q$ and $V$, assuming that every action is tried in every state infinitely often and that new estimates are blended with previous ones using a slow enough exponentially weighted average [Watkins and Dayan, 1992].
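A minimal tabular sketch of this update rule is shown below; the state and action counts, learning rate, and function name are illustrative, not taken from the paper.

```python
# Tabular Q-learning update: blend the sample r + gamma * V(s') into Q(s, a).
import numpy as np

n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One asynchronous Q-learning update for the observed transition."""
    V_next = Q[s_next].max()                       # V(s') = max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V_next)

# Example: from state 0, action 1 yields reward 1.0 and lands in state 2.
q_update(0, 1, 1.0, 2)
```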
It is straightforward, though seemingly novel, to apply the same technique to solving Markov games. A completely specified version of the algorithm is given in Figure 1. The variables in the figure warrant explanation since some are given to the algorithm as part of the environment, others are internal to the algorithm, and still others are parameters of the algorithm itself.
Variables from the environment are: the state set, S; the action set, A; the opponent's action set, O; and the discount factor, gamma. The variables internal to the learner are: a learning rate, alpha, which is initialized to 1.0 and decays over time; the agent's estimate of the $Q$-function, Q; the agent's estimate of the $V$-function, V; and the agent's current policy for state $s$, pi[s,.]. The remaining variables are parameters of the algorithm: explor controls how often the agent will deviate from its current policy to ensure that the state space is adequately explored, and decay controls the rate at which the learning rate decays.
This algorithm is called minimax-Q since it is essentially identical to the standard Q-learning algorithm with a minimax replacing the max.
6 EXPERIMENTS
This section demonstrates the minimax-Q learning algorithm using a simple two-player zero-sum Markov game modeled after the game of soccer.

Initialize:
  For all s in S, a in A, and o in O,
    Let Q[s,a,o] := 1
  For all s in S,
    Let V[s] := 1
  For all s in S, a in A,
    Let pi[s,a] := 1/|A|
  Let alpha := 1.0

Choose an action:
  With probability explor, return an action uniformly at random.
  Otherwise, if current state is s,
    Return action a with probability pi[s,a].

Learn:
  After receiving reward rew for moving from state s to s'
  via action a and opponent's action o,
    Let Q[s,a,o] := (1-alpha) * Q[s,a,o] + alpha * (rew + gamma * V[s'])
    Use linear programming to find pi[s,.] such that:
      pi[s,.] := argmax{pi'[s,.], min{o', sum{a', pi'[s,a'] * Q[s,a',o']}}}
    Let V[s] := min{o', sum{a', pi[s,a'] * Q[s,a',o']}}
    Let alpha := alpha * decay

Figure 1: The minimax-Q algorithm.
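For readers who prefer executable code, the following is a hedged Python sketch of the update in Figure 1, assuming NumPy and SciPy are available. The environment sizes, the solve_matrix_game helper, and the use of scipy.optimize.linprog are illustrative choices; the update rule itself follows the figure.

```python
# A self-contained sketch of the minimax-Q update (Figure 1); sizes are invented.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, n_opp = 10, 5, 5
gamma, explor, decay = 0.9, 0.2, 0.9999954

Q = np.ones((n_states, n_actions, n_opp))          # Q[s, a, o]
V = np.ones(n_states)                              # V[s]
pi = np.full((n_states, n_actions), 1.0 / n_actions)
alpha = 1.0

def solve_matrix_game(M):
    """Return (pi, v) maximizing the worst-case expected value of the |A| x |O| matrix M."""
    n_a, n_o = M.shape
    c = np.zeros(n_a + 1); c[-1] = -1.0            # maximize v
    A_ub = np.hstack([-M.T, np.ones((n_o, 1))])    # v - sum_a M[a, o] * pi_a <= 0 for all o
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n_a + [(None, None)])
    return res.x[:-1], res.x[-1]

def choose_action(s, rng):
    """With probability explor act uniformly at random, otherwise follow pi[s]."""
    if rng.random() < explor:
        return rng.integers(n_actions)
    return rng.choice(n_actions, p=pi[s])

def learn(s, a, o, rew, s_next):
    """The minimax-Q update from Figure 1."""
    global alpha
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (rew + gamma * V[s_next])
    pi[s], V[s] = solve_matrix_game(Q[s])          # linear program for state s
    alpha *= decay

# Example usage with made-up transition data.
rng = np.random.default_rng(0)
a = choose_action(0, rng)
learn(0, a, 2, rew=1.0, s_next=3)
```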
Figure 2: An initial board (left) and a situation requiring a probabilistic choice for A (right).

6.1 SOCCER
The game is played on a 4x5 grid as depicted in Figure 2. The two players, A and B, occupy distinct squares of the grid and can choose one of 5 actions on each turn: N, S, E, W, and stand. Once both players have selected their actions, the two moves are executed in random order.

The circle in the figures represents the "ball." When the player with the ball steps into the appropriate goal (left for A, right for B), that player scores a point and the board is reset to the configuration shown in the left half of the figure. Possession of the ball goes to one or the other player at random.
When a player executes an action that would take it to the
square occupied by the other player, possession of the ball
goes to the stationary player and the move does not take
place. A good defensive maneuver, then, is to stand where
the other player wants to go. Goals are worth one point
and the discount factor is set to 0.9, which makes scoring
sooner somewhat better than scoring later.
For an agent on the offensive to do better than breaking even against an unknown defender, the agent must use a probabilistic policy. For instance, in the example situation shown in the right half of Figure 2, any deterministic choice for A can be blocked indefinitely by a clever opponent. Only by choosing randomly between stand and S can the agent guarantee an opening and therefore an opportunity to score.
6.2 TRAINING AND TESTING
Four different policies were learned, two using the
minimax-Q algorithm and two using Q-learning. For each
learning algorithm, one learner was trained against a ran-
dom opponent and the other against another learner of
identical design. The resulting policies were named MR,
MM, QR, and QQ for minimax-Q trained against random,
minimax-Q trained against minimax-Q, Q trained against
random, and Q trained against Q.
The minimax-Q algorithm (MR, MM) was as described in Figure 1 with explor = 0.2 and decay = $10^{\log_{10}(0.01)/10^6}$ = 0.9999954, and learning took place for one million steps. (The value of decay was chosen so that the learning rate reached 0.01 at the end of the run.) The Q-learning algorithm (QR, QQ) was identical except that a "max" operator was used in place of the minimax and the $Q$-table did not keep information about the opponent's action. Parameters were set identically to the minimax-Q case.
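As a quick sanity check of the quoted constant, the decay value can be recomputed directly; the snippet below is illustrative only.

```python
# alpha starts at 1.0, is multiplied by decay on every learning step,
# and should reach 0.01 after 10^6 steps.
import math

decay = 10 ** (math.log10(0.01) / 10**6)   # equivalently 0.01 ** 1e-6
print(decay)            # ~0.9999954
print(decay ** 10**6)   # ~0.01
```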
For MR and QR, the opponent for training was a fixed policy that chose actions uniformly at random. For MM and QQ, the opponent was another learner identical to the first but with separate $Q$- and $V$-tables.
The resulting policies were evaluated in three ways. First,
each policy was run head-to-head with a random policy
for one hundred thousand steps. To emulate the discount
factor, every step had a 0.1 probability of being declared a
draw. Wins and losses against the random opponent were
tabulated.
The second test was a head-to-head competition with a
hand-built policy. This policy was deterministic and had
simple rules for scoring and blocking. In 100,000 steps, it
completed 5600 games against the random opponent and
won 99.5% of them.
The third test used Q-learning to train a “challenger” op-
ponent for each of MR, MM, QR and QQ. The training
procedure for the challengers followed that of QR where
the “champion” policy was held fixed while the challenger
was trained against it. The resulting policies were then
evaluated against their respective champions. This test was
repeated three times to ensure stability with only the first
reported here. All evaluations were repeated three times
and averaged.
6.3 RESULTS
Table 3 summarizes the results. The columns marked "games" list the number of completed games in 100,000 steps and the columns marked "% won" list the percentage won by the associated policy. Percentages close to 50 indicate that the contest was nearly a draw.
All the policies did quite well when tested against the random opponent. The QR policy's performance was quite remarkable, however, since it completed more games than the other policies and won nearly all of them. This might be expected since QR was trained specifically to beat this opponent, whereas MR, though trained in competition with the random policy, chooses actions with an idealized opponent in mind.
Against the hand-built policy, MM and MR did well,
roughly breaking even. The MM policy did marginally
better. In the limit, this should not be the case since an
agent trained by the minimax-Q algorithm should be in-
sensitive to the opponent against which it was trained and
always behave so as to maximize its score in the worst case.
The fact that there was a difference suggests that the algo-
rithm had not converged on the optimal policy yet. Prior
to convergence, the opponent can make a big difference to
the behavior of a minimax-Q agent since playing against
a strong opponent means the training will take place in
important parts of the state space.
The performance of the QQ and QR policies against the
hand-built policy was strikingly different. This points out
an important consequence of not using a minimax criterion.
A close look at the two policies indicated that QQ, by luck,
implemented a defense that was perfect against the hand-
built policy. The QR policy, on the other hand, happened
to converge on a strategy that was not appropriate. Against
a slightly different opponent, the tables would have been
turned.
The fact that the QQ policy did so well against the ran-
dom and hand-built opponents, especially compared to the

References

[von Neumann and Morgenstern, 1947] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.

[Watkins and Dayan, 1992] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8(3–4):279–292, 1992.