Decentralized Reinforcement Learning Control of a Robotic Manipulator

Lucian Buşoniu, Bart De Schutter, Robert Babuška
Delft Center for Systems and Control
Delft University of Technology
2628 CD Delft, The Netherlands
Email: {i.l.busoniu,b.deschutter,r.babuska}@tudelft.nl
Abstract Multi-agent systems are rapidly finding applications
in a variety of domains, including robotics, distributed control,
telecommunications, etc. Learning approaches to multi-agent
control, many of them based on reinforcement learning (RL),
are investigated in complex domains such as teams of mobile
robots. However, the application of decentralized RL to low-level
control tasks is not as intensively studied. In this paper, we
investigate centralized and decentralized RL, emphasizing the
challenges and potential advantages of the latter. These are then
illustrated on an example: learning to control a two-link rigid
manipulator. Some open issues and future research directions in
decentralized RL are outlined.
Keywords—multi-agent learning, decentralized control, reinforcement learning
I. INTRODUCTION
A multi-agent system (MAS) is a collection of interacting
agents that share a common environment (operate on a common process), which they perceive through sensors and upon which they act through actuators [1]. In contrast to the classical control paradigm, which uses a single controller acting on the process, in a MAS control is distributed among the autonomous agents.
MAS can arise naturally as a viable representation of the
considered system. This is the case with e.g., teams of mobile
robots, where the agents are the robots and the process is
their environment [2], [3]. MAS can also provide alternative
solutions for systems that are typically regarded as centralized,
e.g., resource management: each resource may be managed by
a dedicated agent [4] or several agents may negotiate access
to passive resources [5]. Another application field of MAS
is decentralized, distributed control, e.g., for traffic or power
networks.
Decentralized, multi-agent solutions offer several potential advantages over centralized ones [2]:
- Speed-up, resulting from parallel computation.
- Robustness to single-point failures, if redundancy is built into the system.
- Scalability, resulting from modularity.
MAS also pose certain challenges, many of which do not appear in centralized control. The agents have to coordinate their individual behaviors, such that a coherent joint behavior results that is beneficial for the system. Conflicting goals, inter-agent communication, and incomplete agent views over the process are issues that may also play a role.
The multi-agent control task is often too complex to be
solved effectively by agents with pre-programmed behaviors.
Agents can do better by learning new behaviors, such that
their performance gradually improves [6], [7]. Learning can
be performed either online, while the agents actually try to
solve the task, or offline, typically by using a task model to
generate simulated experience.
Reinforcement learning (RL) [8] is a simple and general
framework that can be applied to the multi-agent learning
problem. In this framework, the performance of each agent is
rewarded by a scalar signal, which the agent aims to maximize.
A significant body of research on multi-agent RL has evolved
over the last decade (see e.g., [7], [9], [10]).
In this paper, we investigate the single-agent, centralized RL
task, and its multi-agent, decentralized counterpart. We focus
on cooperative low-level control tasks. To our knowledge,
decentralized RL control has not been applied to such tasks.
We describe the challenge of coordinating multiple RL agents,
and briefly mention the approaches proposed in the literature.
We present some potential advantages of multi-agent RL. Most
of these advantages extend beyond RL to the general multi-
agent learning setting.
We illustrate the differences between centralized and multi-
agent RL on an example involving learning to control a two-
link rigid manipulator. Finally, we present some open research
issues and directions for future work.
The rest of the paper is organized as follows. Section II
introduces the basic concepts of RL. Cooperative decentralized
RL is then discussed in Section III. Section IV introduces
the two-link rigid manipulator and presents the results of RL
control on this process. Section V concludes the paper.
II. REINFORCEMENT LEARNING
In this section we introduce the main concepts of centralized
and multi-agent RL for deterministic processes. This presen-
tation is based on [8], [11].
A. Centralized RL
The theoretical model of the centralized (single-agent) RL
task is the Markov decision process.
Definition 1: A Markov decision process is a tuple ⟨X, U, f, ρ⟩ where: X is the discrete set of process states, U is the discrete set of agent actions, f : X × U → X is the

state transition function, and ρ : X × U → R is the reward function.
The process changes state from x_k to x_{k+1} as a result of action u_k, according to the state transition function f. The agent receives (possibly delayed) feedback on its performance via the scalar reward signal r_k ∈ R, according to the reward function ρ. The agent chooses actions according to its policy h : X → U.
The learning goal is the maximization, at each time step k, of the discounted return:

  R_k = Σ_{j=0}^{∞} γ^j r_{k+j+1},    (1)
where γ ∈ (0, 1) is the discount factor. The action-value function (Q-function), Q^h : X × U → R, is the expected return of a state-action pair under a given policy: Q^h(x, u) = E{R_k | x_k = x, u_k = u, h}. The agent can maximize its return by first computing the optimal Q-function, defined as Q^*(x, u) = max_h Q^h(x, u), and then choosing actions by the greedy policy h^*(x) = arg max_u Q^*(x, u), which is optimal (ties are broken randomly).
The central result upon which RL algorithms rely is that Q^* satisfies the Bellman optimality recursion:

  Q^*(x, u) = ρ(x, u) + γ max_{u' ∈ U} Q^*(f(x, u), u'),    ∀x, u.    (2)
Value iteration is an offline, model-based algorithm that turns this recursion into an update rule:

  Q_{ℓ+1}(x, u) = ρ(x, u) + γ max_{u' ∈ U} Q_ℓ(f(x, u), u'),    ∀x, u,    (3)

where ℓ is the iteration index. Q_0 can be initialized arbitrarily. The sequence Q_ℓ provably converges to Q^*.
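To make the update (3) concrete, the following is a minimal sketch of tabular value iteration in Python; the state/action counts, the model functions f and rho, and the convergence threshold are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def value_iteration(n_states, n_actions, f, rho, gamma=0.98, tol=1e-2):
    """Tabular value iteration for a deterministic MDP, following update (3).

    f(x, u)   -> next state index (deterministic transition function)
    rho(x, u) -> scalar reward
    """
    Q = np.zeros((n_states, n_actions))          # Q_0 initialized arbitrarily (here: zeros)
    while True:
        Q_new = np.empty_like(Q)
        for x in range(n_states):
            for u in range(n_actions):
                # Bellman backup: immediate reward plus discounted best value of the successor
                Q_new[x, u] = rho(x, u) + gamma * np.max(Q[f(x, u)])
        if np.max(np.abs(Q_new - Q)) <= tol:     # stop when the update barely changes Q
            return Q_new
        Q = Q_new

def greedy_policy(Q):
    """h*(x) = arg max_u Q*(x, u), per state."""
    return np.argmax(Q, axis=1)
```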
Q-learning is an online algorithm that iteratively estimates Q^* by interaction with the process, using observed rewards r_k and pairs of subsequent states x_k, x_{k+1} [12]:

  Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α [ r_{k+1} + γ max_{u' ∈ U} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ],    (4)

where α ∈ (0, 1] is the learning rate. The sequence Q_k provably converges to Q^* under certain conditions, including that the agent keeps trying all actions in all states with nonzero probability [12]. This means that the agent must sometimes explore, i.e., perform other actions than those dictated by the current greedy policy.
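As an illustration of update (4), here is a minimal sketch of tabular Q-learning with ε-greedy exploration; the environment interface step(x, u) and all parameter values are assumptions for the example, not specifications from the paper.

```python
import numpy as np

def q_learning(n_states, n_actions, step, n_steps=10000,
               alpha=0.1, gamma=0.98, epsilon=0.1, x0=0):
    """Online Q-learning, following update (4).

    step(x, u) -> (x_next, r) is the process being controlled (unknown to the learner).
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    x = x0
    for _ in range(n_steps):
        # epsilon-greedy exploration: sometimes deviate from the current greedy policy
        if rng.random() < epsilon:
            u = int(rng.integers(n_actions))
        else:
            u = int(np.argmax(Q[x]))
        x_next, r = step(x, u)
        # temporal-difference update toward r + gamma * max_u' Q(x_next, u')
        Q[x, u] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, u])
        x = x_next
    return Q
```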
B. Multi-Agent RL
The generalization of the Markov decision process to the
multi-agent case is the Markov game.
Definition 2: A Markov game is a tuple ⟨A, X, {U_i}_{i∈A}, f, {ρ_i}_{i∈A}⟩ where: A = {1, . . . , n} is the set of n agents, X is the discrete set of process states, {U_i}_{i∈A} are the discrete sets of actions available to the agents, yielding the joint action set U = ×_{i∈A} U_i, f : X × U → X is the state transition function, and ρ_i : X × U → R, i ∈ A, are the reward functions of the agents.
Note that the state transitions, agent rewards r_{i,k}, and thus also the agent returns R_{i,k}, depend on the joint action u_k = [u_{1,k}^T, . . . , u_{n,k}^T]^T, u_k ∈ U, u_{i,k} ∈ U_i. The policies h_i : X × U_i → [0, 1] form together the joint policy h. The Q-function of each agent depends on the joint action and is conditioned on the joint policy, Q_i^h : X × U → R.
A fully cooperative Markov game is a game where the agents have identical reward functions, ρ_1 = . . . = ρ_n. In this case, the learning goal is the maximization of the common discounted return. In the general case, the reward functions of the agents may differ. Even agents which form a team may encounter situations where their immediate interests are in conflict, e.g., when they need to share some resource. As the returns of the agents are correlated, they cannot be maximized independently. Formulating a good learning goal in such a situation is a difficult open problem (see e.g., [13]–[15]).
III. COOPERATIVE DECENTRALIZED RL CONTROL
This section briefly reviews approaches to solving the
coordination issue in decentralized RL, and then mentions
some of the potential advantages of decentralized RL.
A. The Coordination Problem
Coordination requires that all agents coherently choose their
part of a desirable joint policy. This is not trivial, even if the
task is fully cooperative. To see this, assume all agents learn in
parallel the common optimal Q-function with, e.g., Q-learning:
  Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α [ r_{k+1} + γ max_{u' ∈ U} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ].    (5)
Then, in principle, they could use the greedy policy to
maximize the common return. However, greedy action selec-
tion breaks ties randomly, which means that in the absence
of additional mechanisms, different agents may break a tie
in different ways, and the resulting joint action may be
suboptimal.
The multi-agent RL algorithms in the literature solve this
problem in various ways.
Coordination-free methods bypass the issue. For instance, in fully cooperative tasks, the Team Q-learning algorithm [16] assumes that the optimal joint actions are unique (which will rarely be the case). Then, (5) can directly be used.
The agents can be indirectly steered toward coordination. To this purpose, some algorithms learn empirical models of the other agents and adapt to these models [17]. Others use heuristics to bias the agents toward actions that promise to yield good reward [18]. Yet others directly search through the space of policies using gradient-based methods [11].
The action choices of the agents can also be explicitly coordinated or negotiated:
- Social conventions [19] and roles [20] restrict the action choices of the agents.
- Coordination graphs explicitly represent where coordination between agents is required, thus preventing the agents from engaging in unnecessary coordination activities [21].
- Communication is used to negotiate action choices, either alone or in combination with the above techniques.
B. Potential Advantages of Decentralized RL
If the coordination problem is efficiently solved, learning
speed might be higher for decentralized learners. This is
because each agent i searches an action space U_i. A centralized learner solving the same problem searches the joint action space U = U_1 × · · · × U_n, which is exponentially larger.
This difference will be even more significant in tasks where
not all the state information is relevant to all the learning
agents. For instance, in a team of mobile robots, at a given
time, the position and velocity of robots that are far away from
the considered robot might not be interesting for it. In such
tasks, the learning agents can consider only the relevant state
components and thus further decrease the size of the problem
they need to solve [22].
Memory and processing time requirements will also be
smaller for smaller problem sizes.
If several learners solve similar tasks, then they could gain
further benefit from sharing their experience or knowledge.
IV. EXAMPLE: TWO-LINK RIGID MANIPULATOR
A. Manipulator Model
The two-link manipulator, depicted in Fig. 1, is described by the nonlinear fourth-order model:

  M(θ)θ̈ + C(θ, θ̇)θ̇ + G(θ) = τ,    (6)
where θ = [θ_1, θ_2]^T, τ = [τ_1, τ_2]^T. The system has two control inputs, the torques in the two joints, τ_1 and τ_2, and four measured outputs: the link angles, θ_1, θ_2, and their angular speeds, θ̇_1, θ̇_2.
Fig. 1. Schematic drawing of the two-link rigid manipulator (links of lengths l_1, l_2 and masses m_1, m_2, with a motor in each joint).

The mass matrix M(θ), the Coriolis and centrifugal forces matrix C(θ, θ̇), and the gravity vector G(θ) are:
  M(θ) = [ P_1 + P_2 + 2P_3 cos θ_2    P_2 + P_3 cos θ_2
           P_2 + P_3 cos θ_2           P_2               ]    (7)

  C(θ, θ̇) = [ b_1 − P_3 θ̇_2 sin θ_2    −P_3 (θ̇_1 + θ̇_2) sin θ_2
              P_3 θ̇_1 sin θ_2          b_2                      ]    (8)

  G(θ) = [ g_1 sin θ_1 + g_2 sin(θ_1 + θ_2)
           g_2 sin(θ_1 + θ_2)              ]    (9)
The meaning and values of the physical parameters of the system are given in Table I. Using these, the rest of the parameters in (6) can be computed by:

  P_1 = m_1 c_1^2 + m_2 l_1^2 + I_1,    P_2 = m_2 c_2^2 + I_2,    P_3 = m_2 l_1 c_2,
  g_1 = (m_1 c_1 + m_2 l_1) g,    g_2 = m_2 c_2 g.    (10)
In the sequel, it is assumed that the manipulator operates
in a horizontal plane, leading to G(θ) = 0. Furthermore, the
following simplifications are adopted in (6):
1) Coriolis and centrifugal forces are neglected, leading to C(θ, θ̇) = diag[b_1, b_2];
2) θ̈_1 is neglected in the equation for θ̈_2;
3) the friction in the second joint is neglected in the equation for θ̈_1.
After these simplifications, the dynamics of the manipulator can be approximated by:

  θ̈_1 = [ P_2 (τ_1 − b_1 θ̇_1) − (P_2 + P_3 cos θ_2) τ_2 ] / [ P_2 (P_1 + P_2 + 2P_3 cos θ_2) ]
  θ̈_2 = τ_2 / P_2 − b_2 θ̇_2    (11)

The complete process state is given by x = [θ^T, θ̇^T]^T. If centralized control is used, the command is u = τ; for decentralized control with one agent controlling each joint motor, the agent commands are u_1 = τ_1, u_2 = τ_2.
TABLE I
PHYSICAL PARAMETERS OF THE MANIPULATOR
Symbol      Parameter                                Value
g           gravitational acceleration               9.81 m/s^2
l_1         length of first link                     0.1 m
l_2         length of second link                    0.1 m
m_1         mass of first link                       1.25 kg
m_2         mass of second link                      1 kg
I_1         inertia of first link                    0.004 kgm^2
I_2         inertia of second link                   0.003 kgm^2
c_1         center of mass of first link             0.05 m
c_2         center of mass of second link            0.05 m
b_1         damping in first joint                   0.1 kg/s
b_2         damping in second joint                  0.02 kg/s
τ_1,max     maximum torque of first joint motor      0.2 Nm
τ_2,max     maximum torque of second joint motor     0.1 Nm
θ̇_1,max     maximum angular speed of first link      2π rad/sec
θ̇_2,max     maximum angular speed of second link     2π rad/sec

B. RL Control
The control goal is the stabilization of the system around θ = θ̇ = 0 in minimum time, with a tolerance of ±5 · π/180 rad for the angles, and ±0.1 rad/sec for the angular speeds.
To apply RL in the form presented in Section II, the time axis, as well as the continuous state and action components of the manipulator, must first be discretized. Time is discretized with a sampling time of T_s = 0.05 sec; this gives the discrete system dynamics f. Each state component is quantized in fuzzy bins, and three torque values are considered for each joint: −τ_{i,max} (maximal torque clockwise), 0, and τ_{i,max} (maximal torque counter-clockwise).
One Q-value is stored for each combination of bin centers
and torque values. The Q-values of continuous states are then
interpolated between these center Q-values, using the degrees
of membership to each fuzzy bin as interpolation weights. If
e.g., the Q-function has the form Q(θ_2, θ̇_2, τ_2), the Q-values of a continuous state [θ_{2,k}, θ̇_{2,k}]^T are computed by:

  Q̃(θ_{2,k}, θ̇_{2,k}, τ_2) = Σ_{m=1,...,N_{θ2}} Σ_{n=1,...,N_{θ̇2}} μ_{θ2,m}(θ_{2,k}) μ_{θ̇2,n}(θ̇_{2,k}) · Q(m, n, τ_2),    ∀τ_2    (12)
where e.g., μ_{θ̇2,n}(θ̇_{2,k}) is the membership degree of θ̇_{2,k} in the n-th bin. For triangular membership functions, this can be computed as:

  μ_{θ̇2,n}(θ̇_{2,k}) =
    max(0, (c_{n+1} − θ̇_{2,k}) / (c_{n+1} − c_n)),    if n = 1
    max(0, min((θ̇_{2,k} − c_{n−1}) / (c_n − c_{n−1}), (c_{n+1} − θ̇_{2,k}) / (c_{n+1} − c_n))),    if 1 < n < N_{θ̇2}
    max(0, (θ̇_{2,k} − c_{n−1}) / (c_n − c_{n−1})),    if n = N_{θ̇2}    (13)

where c_n is the center of the n-th bin; see Fig. 2 for an example.
Fig. 2. Example of quantization in fuzzy bins with triangular membership functions for θ̇_2 (7 bins with centers c_1 = −2π, c_2, . . . , c_6, c_7 = 2π rad/sec; the membership function μ_{θ̇2,6} is highlighted).
Such a set of bins is completely determined by a vector of bin center coordinates. For θ̇_1 and θ̇_2, 7 bins are used, with their centers at [−360, −180, −30, 0, 30, 180, 360] · π/180 rad/sec. For θ_1 and θ_2, 12 bins are used, with their centers at [−180, −130, −80, −30, −15, −5, 0, 5, 15, 30, 80, 130] · π/180 rad; there is no ‘last’ or ‘first’ bin, because the angles evolve on a circle manifold [−π, π). The −π point is identical to π, so the ‘last’ bin is a neighbor of the ‘first’.
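The triangular membership computation (13) and the bin centers above can be sketched in Python as follows; this is a minimal illustration, the function name is not from the paper, and the circular treatment of the first/last angle bins mentioned above is not handled here.

```python
import numpy as np

# Bin centers used in the paper (in rad and rad/sec)
theta_centers = np.array([-180, -130, -80, -30, -15, -5, 0, 5, 15, 30, 80, 130]) * np.pi / 180
thetadot_centers = np.array([-360, -180, -30, 0, 30, 180, 360]) * np.pi / 180

def memberships(value, centers):
    """Degrees of membership of `value` in triangular fuzzy bins, following (13).

    For values between the outermost centers, the memberships sum to 1.
    """
    N = len(centers)
    mu = np.zeros(N)
    for n in range(N):
        if n == 0:
            mu[n] = max(0.0, (centers[1] - value) / (centers[1] - centers[0]))
        elif n == N - 1:
            mu[n] = max(0.0, (value - centers[-2]) / (centers[-1] - centers[-2]))
        else:
            left = (value - centers[n - 1]) / (centers[n] - centers[n - 1])
            right = (centers[n + 1] - value) / (centers[n + 1] - centers[n])
            mu[n] = max(0.0, min(left, right))
    return mu

# Interpolation (12): Q_tilde = sum_m sum_n mu_theta[m] * mu_thetadot[n] * Q[m, n, tau_index]
```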
Algorithm 1 Fuzzy value iteration for a SISO RL controller
1: Q_0(m, u_j) = 0, for m = 1, . . . , N_X, j = 1, . . . , N_U
2: ℓ = 0
3: repeat
4:   for m = 1, . . . , N_X, j = 1, . . . , N_U do
5:     Q_{ℓ+1}(m, u_j) = ρ(c_m, u_j) + γ Σ_{m̃=1}^{N_X} μ_{x,m̃}(f(c_m, u_j)) max_{ũ_j} Q_ℓ(m̃, ũ_j)
6:   end for
7:   ℓ = ℓ + 1
8: until ‖Q_ℓ − Q_{ℓ−1}‖ ≤ δ
The optimal Q-functions for both the centralized and decen-
tralized case are computed with a version of value iteration (3)
which is altered to accommodate the fuzzy representation of
the state. The complete algorithm is given in Alg. 1. For easier
readability, the RL controller is assumed single-input single-
output, but the extension to multiple states and / or outputs is
straightforward. The discount factor is set to γ = 0.98, and
the threshold value to δ = 0.01.
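A minimal Python sketch of Algorithm 1 for a SISO controller is shown below. It reuses the hypothetical memberships helper from the earlier sketch, and the interfaces of f and rho are assumptions; the paper specifies only the algorithm, not an implementation.

```python
import numpy as np

def fuzzy_value_iteration(centers, actions, f, rho, gamma=0.98, delta=0.01):
    """Fuzzy value iteration (Algorithm 1) for a SISO RL controller.

    centers : bin center coordinates c_1..c_NX (the fuzzy state representation)
    actions : discrete action values u_1..u_NU
    f(x, u) : next continuous state;  rho(x, u) : reward
    """
    NX, NU = len(centers), len(actions)
    Q = np.zeros((NX, NU))
    while True:
        Q_new = np.empty_like(Q)
        for m in range(NX):
            for j in range(NU):
                x_next = f(centers[m], actions[j])
                mu = memberships(x_next, centers)   # membership degrees of f(c_m, u_j)
                # line 5: reward plus discounted, membership-weighted best value
                Q_new[m, j] = rho(centers[m], actions[j]) + gamma * np.dot(mu, Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) <= delta:      # until ||Q_l - Q_{l-1}|| <= delta
            return Q_new
        Q = Q_new
```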
The control action in state x_k is computed as follows (assuming as above a SISO controller):

  u_k = h(x_k) = Σ_{m=1}^{N_X} μ_{x,m}(x_k) · arg max_{ũ_j} Q(m, ũ_j)    (14)
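Continuing the sketch above, the interpolated control law (14) could look as follows; the helper is hypothetical and assumes scalar actions, so that the membership-weighted combination of the greedy actions is well defined.

```python
import numpy as np

def fuzzy_policy(x, centers, actions, Q):
    """Control action for continuous state x, following (14):
    a membership-weighted combination of the greedy action at each bin center."""
    mu = memberships(x, centers)
    greedy = np.array([actions[j] for j in np.argmax(Q, axis=1)])  # arg max_u Q(m, u) per bin
    return float(np.dot(mu, greedy))
```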
Centralized RL. The reward function ρ for the centralized learner computes rewards by:

  r_k = 0      if |θ_{i,k}| ≤ 5 · π/180 rad and |θ̇_{i,k}| ≤ 0.1 rad/sec, for i ∈ {1, 2}
  r_k = −0.5   otherwise    (15)
The centralized policy for solving the two-link manipulator task must be of the form:

  [τ_1, τ_2]^T = h(θ_1, θ_2, θ̇_1, θ̇_2)    (16)

Therefore, the centralized learner uses a Q-table of the form Q(θ_1, θ_2, θ̇_1, θ̇_2, τ_1, τ_2).
The policy computed by value iteration is applied to the system starting from the initial state x_0 = [−1, −3, 0, 0]^T. The resulting command, state, and reward signals are given in Fig. 3(a).
Decentralized RL. In the decentralized case, the rewards are computed separately for the two agents:

  r_{i,k} = 0      if |θ_{i,k}| ≤ 5 · π/180 rad and |θ̇_{i,k}| ≤ 0.1 rad/sec
  r_{i,k} = −0.5   otherwise    (17)
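A sketch of the reward computations (15) and (17); the helper names are hypothetical, while the tolerance values are those stated in the text.

```python
import numpy as np

ANG_TOL = 5 * np.pi / 180   # +/- 5 degrees, in rad
VEL_TOL = 0.1               # rad/sec

def reward_centralized(x):
    """Reward (15): 0 only if both links are within the tolerance band, else -0.5."""
    th1, th2, th1d, th2d = x
    ok = (abs(th1) <= ANG_TOL and abs(th2) <= ANG_TOL and
          abs(th1d) <= VEL_TOL and abs(th2d) <= VEL_TOL)
    return 0.0 if ok else -0.5

def reward_agent(x, i):
    """Reward (17) for agent i (1 or 2): judged only on its own link."""
    th, thd = x[i - 1], x[i + 1]
    return 0.0 if (abs(th) <= ANG_TOL and abs(thd) <= VEL_TOL) else -0.5
```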
For decentralized control, the system (11) creates an asymmetric setting. Agent 2 can choose its action τ_{2,k} by only considering the second link's state, whereas agent 1 needs to take into account θ_{2,k} and τ_{2,k} besides the first link's state.

Fig. 3. State, command, and reward signals for RL control (link angles [rad], link velocities [rad/sec], commanded torques in joints 1 and 2 [Nm], and reward, versus t [sec]). (a) Centralized RL (thin line: link 1, thick line: link 2). (b) Decentralized RL (thin line: link/agent 1, thick line: link/agent 2).

If agent 2 is always the first to choose its action, and agent 1 can learn about this action before it is actually taken (e.g., by communication), then the two agents can learn control policies of the following form:
  τ_2 = h_2(θ_2, θ̇_2)
  τ_1 = h_1(θ_1, θ_2, θ̇_1, τ_2)    (18)
Therefore, the two agents use Q-tables of the form Q_2(θ_2, θ̇_2, τ_2) and Q_1(θ_1, θ_2, θ̇_1, τ_2, τ_1), respectively. Value iteration is applied first for agent 2, and the resulting policy is used in value iteration for agent 1.
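The table sizes implied by the discretization above illustrate the reduction in problem size: with 12 angle bins, 7 velocity bins, and 3 torque values per joint, the centralized table has 12·12·7·7·3·3 = 63,504 entries, versus 12·12·7·3·3 = 9,072 for agent 1 and 12·7·3 = 252 for agent 2. A sketch of the two-stage decentralized procedure, with hypothetical array shapes following the Q-table forms above:

```python
import numpy as np

# Q-table shapes implied by the bin and action counts above (hypothetical array layout)
N_TH, N_THD, N_TAU = 12, 7, 3
Q_central = np.zeros((N_TH, N_TH, N_THD, N_THD, N_TAU, N_TAU))   # 63,504 entries
Q_agent2 = np.zeros((N_TH, N_THD, N_TAU))                        # 252 entries
Q_agent1 = np.zeros((N_TH, N_TH, N_THD, N_TAU, N_TAU))           # 9,072 entries

# Two-stage decentralized learning, as described in the text:
# 1) run value iteration for agent 2 on Q_agent2 and extract its policy h2(th2, th2d);
# 2) run value iteration for agent 1 on Q_agent1, treating tau2 = h2(th2, th2d) as a
#    known part of the environment when evaluating the transition of the first link.
```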
The policies computed in this way are applied to the system starting from the initial state x_0 = [−1, −3, 0, 0]^T. The resulting command, state, and reward signals are given in Fig. 3(b).
C. Discussion
Value iteration converges in 125 iterations for the centralized case, 192 iterations for agent 1, and 49 iterations for agent 2. The learning speeds are therefore comparable for centralized and decentralized learning in this application. Agent 2 of course converges relatively faster, as its state-action space is much smaller.
Both the centralized and the decentralized policies stabilize the system in 1.2 seconds. The steady-state angle offsets are all within the imposed 5 degrees tolerance bound. Notice that in Fig. 3(b), the first link is stabilized slightly faster than in Fig. 3(a), where both links are stabilized at around the same time. This is because decentralized learners are rewarded separately (17), and have an incentive to stabilize their respective links faster.
The form of coordination used by the two agents is explicit: agent 1 is informed of the action chosen by agent 2 (e.g., by communication) before selecting its own action.