Decentralized Reinforcement Learning Control of a Robotic Manipulator

Lucian Buşoniu, Bart De Schutter, Robert Babuška
Delft Center for Systems and Control
Delft University of Technology
2628 CD Delft, The Netherlands
Email: {i.l.busoniu,b.deschutter,r.babuska}@tudelft.nl
Abstract Multi-agent systems are rapidly finding applications
in a variety of domains, including robotics, distributed control,
telecommunications, etc. Learning approaches to multi-agent
control, many of them based on reinforcement learning (RL),
are investigated in complex domains such as teams of mobile
robots. However, the application of decentralized RL to low-level
control tasks is not as intensively studied. In this paper, we
investigate centralized and decentralized RL, emphasizing the
challenges and potential advantages of the latter. These are then
illustrated on an example: learning to control a two-link rigid
manipulator. Some open issues and future research directions in
decentralized RL are outlined.
Keywords—multi-agent learning, decentralized control, reinforcement learning
I. INTRODUCTION
A multi-agent system (MAS) is a collection of interacting
agents that share a common environment (operate on a common process), which they perceive through sensors and upon which they act through actuators [1]. In contrast to the classical control paradigm, which uses a single controller acting on the process, in a MAS control is distributed among the autonomous agents.
MAS can arise naturally as a viable representation of the
considered system. This is the case with e.g., teams of mobile
robots, where the agents are the robots and the process is
their environment [2], [3]. MAS can also provide alternative
solutions for systems that are typically regarded as centralized,
e.g., resource management: each resource may be managed by
a dedicated agent [4] or several agents may negotiate access
to passive resources [5]. Another application field of MAS
is decentralized, distributed control, e.g., for traffic or power
networks.
Decentralized, multi-agent solutions offer several potential advantages over centralized ones [2]:
- Speed-up, resulting from parallel computation.
- Robustness to single-point failures, if redundancy is built into the system.
- Scalability, resulting from modularity.
MAS also pose certain challenges, many of which do not appear in centralized control. The agents have to coordinate their individual behaviors, such that a coherent joint behavior results that is beneficial for the system. Conflicting goals, inter-agent communication, and incomplete agent views over the process are issues that may also play a role.
The multi-agent control task is often too complex to be
solved effectively by agents with pre-programmed behaviors.
Agents can do better by learning new behaviors, such that
their performance gradually improves [6], [7]. Learning can
be performed either online, while the agents actually try to
solve the task, or offline, typically by using a task model to
generate simulated experience.
Reinforcement learning (RL) [8] is a simple and general
framework that can be applied to the multi-agent learning
problem. In this framework, the performance of each agent is
rewarded by a scalar signal, which the agent aims to maximize.
A significant body of research on multi-agent RL has evolved
over the last decade (see e.g., [7], [9], [10]).
In this paper, we investigate the single-agent, centralized RL
task, and its multi-agent, decentralized counterpart. We focus
on cooperative low-level control tasks. To our knowledge,
decentralized RL control has not been applied to such tasks.
We describe the challenge of coordinating multiple RL agents,
and briefly mention the approaches proposed in the literature.
We present some potential advantages of multi-agent RL. Most
of these advantages extend beyond RL to the general multi-
agent learning setting.
We illustrate the differences between centralized and multi-
agent RL on an example involving learning to control a two-
link rigid manipulator. Finally, we present some open research
issues and directions for future work.
The rest of the paper is organized as follows. Section II
introduces the basic concepts of RL. Cooperative decentralized
RL is then discussed in Section III. Section IV introduces
the two-link rigid manipulator and presents the results of RL
control on this process. Section V concludes the paper.
II. REINFORCEMENT LEARNING
In this section we introduce the main concepts of centralized
and multi-agent RL for deterministic processes. This presen-
tation is based on [8], [11].
A. Centralized RL
The theoretical model of the centralized (single-agent) RL
task is the Markov decision process.
Definition 1: A Markov decision process is a tuple ⟨X, U, f, ρ⟩ where: X is the discrete set of process states, U is the discrete set of agent actions, f : X × U → X is the

state transition function, and ρ : X × U → R is the reward function.
The process changes state from x_k to x_{k+1} as a result of action u_k, according to the state transition function f. The agent receives (possibly delayed) feedback on its performance via the scalar reward signal r_k ∈ R, according to the reward function ρ. The agent chooses actions according to its policy h : X → U.
The learning goal is the maximization, at each time step k, of the discounted return:

  R_k = Σ_{j=0}^{∞} γ^j r_{k+j+1},    (1)
where γ ∈ (0, 1) is the discount factor. The action-value function (Q-function), Q^h : X × U → R, is the expected return of a state-action pair under a given policy: Q^h(x, u) = E{R_k | x_k = x, u_k = u, h}. The agent can maximize its return by first computing the optimal Q-function, defined as Q^*(x, u) = max_h Q^h(x, u), and then choosing actions by the greedy policy h^*(x) = arg max_u Q^*(x, u), which is optimal (ties are broken randomly).
The central result upon which RL algorithms rely is that Q^* satisfies the Bellman optimality recursion:

  Q^*(x, u) = ρ(x, u) + γ max_{u' ∈ U} Q^*(f(x, u), u'),    ∀x, u.    (2)
Value iteration is an offline, model-based algorithm that turns this recursion into an update rule:

  Q_{ℓ+1}(x, u) = ρ(x, u) + γ max_{u' ∈ U} Q_ℓ(f(x, u), u'),    ∀x, u,    (3)

where ℓ is the iteration index. Q_0 can be initialized arbitrarily. The sequence Q_ℓ provably converges to Q^*.
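To make the update (3) concrete, the following is a minimal sketch of tabular value iteration in Python; the state/action counts, the model functions f and rho, and the convergence threshold are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def value_iteration(n_states, n_actions, f, rho, gamma=0.98, tol=1e-2):
    """Tabular value iteration for a deterministic MDP, following update (3).

    f(x, u)   -> next state index (deterministic transition function)
    rho(x, u) -> scalar reward
    """
    Q = np.zeros((n_states, n_actions))          # Q_0 initialized arbitrarily (here: zeros)
    while True:
        Q_new = np.empty_like(Q)
        for x in range(n_states):
            for u in range(n_actions):
                # Bellman backup: immediate reward plus discounted best value of the successor
                Q_new[x, u] = rho(x, u) + gamma * np.max(Q[f(x, u)])
        if np.max(np.abs(Q_new - Q)) <= tol:     # stop when the update barely changes Q
            return Q_new
        Q = Q_new

def greedy_policy(Q):
    """h*(x) = arg max_u Q*(x, u), per state."""
    return np.argmax(Q, axis=1)
```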
Q-learning is an online algorithm that iteratively estimates Q^* by interaction with the process, using observed rewards r_k and pairs of subsequent states x_k, x_{k+1} [12]:

  Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α [ r_{k+1} + γ max_{u' ∈ U} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ],    (4)

where α ∈ (0, 1] is the learning rate. The sequence Q_k provably converges to Q^* under certain conditions, including that the agent keeps trying all actions in all states with nonzero probability [12]. This means that the agent must sometimes explore, i.e., perform other actions than those dictated by the current greedy policy.
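As an illustration of update (4), here is a minimal sketch of tabular Q-learning with ε-greedy exploration; the environment interface step(x, u) and all parameter values are assumptions for the example, not specifications from the paper.

```python
import numpy as np

def q_learning(n_states, n_actions, step, n_steps=10000,
               alpha=0.1, gamma=0.98, epsilon=0.1, x0=0):
    """Online Q-learning, following update (4).

    step(x, u) -> (x_next, r) is the process being controlled (unknown to the learner).
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    x = x0
    for _ in range(n_steps):
        # epsilon-greedy exploration: sometimes deviate from the current greedy policy
        if rng.random() < epsilon:
            u = int(rng.integers(n_actions))
        else:
            u = int(np.argmax(Q[x]))
        x_next, r = step(x, u)
        # temporal-difference update toward r + gamma * max_u' Q(x_next, u')
        Q[x, u] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, u])
        x = x_next
    return Q
```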
B. Multi-Agent RL
The generalization of the Markov decision process to the
multi-agent case is the Markov game.
Definition 2: A Markov game is a tuple ⟨A, X, {U_i}_{i∈A}, f, {ρ_i}_{i∈A}⟩ where: A = {1, . . . , n} is the set of n agents, X is the discrete set of process states, {U_i}_{i∈A} are the discrete sets of actions available to the agents, yielding the joint action set U = ×_{i∈A} U_i, f : X × U → X is the state transition function, and ρ_i : X × U → R, i ∈ A, are the reward functions of the agents.
Note that the state transitions, agent rewards r_{i,k}, and thus also the agent returns R_{i,k}, depend on the joint action u_k = [u_{1,k}^T, . . . , u_{n,k}^T]^T, u_k ∈ U, u_{i,k} ∈ U_i. The policies h_i : X × U_i → [0, 1] form together the joint policy h. The Q-function of each agent depends on the joint action and is conditioned on the joint policy, Q_i^h : X × U → R.
A fully cooperative Markov game is a game where the agents have identical reward functions, ρ_1 = . . . = ρ_n. In this case, the learning goal is the maximization of the common discounted return. In the general case, the reward functions of the agents may differ. Even agents which form a team may encounter situations where their immediate interests are in conflict, e.g., when they need to share some resource. As the returns of the agents are correlated, they cannot be maximized independently. Formulating a good learning goal in such a situation is a difficult open problem (see e.g., [13]–[15]).
III. COOPERATIVE DECENTRALIZED RL CONTROL
This section briefly reviews approaches to solving the
coordination issue in decentralized RL, and then mentions
some of the potential advantages of decentralized RL.
A. The Coordination Problem
Coordination requires that all agents coherently choose their
part of a desirable joint policy. This is not trivial, even if the
task is fully cooperative. To see this, assume all agents learn in
parallel the common optimal Q-function with, e.g., Q-learning:
  Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α [ r_{k+1} + γ max_{u' ∈ U} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ].    (5)
Then, in principle, they could use the greedy policy to
maximize the common return. However, greedy action selec-
tion breaks ties randomly, which means that in the absence
of additional mechanisms, different agents may break a tie
in different ways, and the resulting joint action may be
suboptimal.
The multi-agent RL algorithms in the literature solve this
problem in various ways.
Coordination-free methods bypass the issue. For instance, in fully cooperative tasks, the Team Q-learning algorithm [16] assumes that the optimal joint actions are unique (which will rarely be the case). Then, (5) can directly be used.
The agents can be indirectly steered toward coordination. To this purpose, some algorithms learn empirical models of the other agents and adapt to these models [17]. Others use heuristics to bias the agents toward actions that promise to yield good reward [18]. Yet others directly search through the space of policies using gradient-based methods [11].
The action choices of the agents can also be explicitly coordinated or negotiated:
- Social conventions [19] and roles [20] restrict the action choices of the agents.
- Coordination graphs explicitly represent where coordination between agents is required, thus preventing the agents from engaging in unnecessary coordination activities [21].
- Communication is used to negotiate action choices, either alone or in combination with the above techniques.
B. Potential Advantages of Decentralized RL
If the coordination problem is efficiently solved, learning
speed might be higher for decentralized learners. This is
because each agent i searches an action space U_i. A centralized learner solving the same problem searches the joint action space U = U_1 × · · · × U_n, which is exponentially larger.
This difference will be even more significant in tasks where
not all the state information is relevant to all the learning
agents. For instance, in a team of mobile robots, at a given
time, the position and velocity of robots that are far away from
the considered robot might not be interesting for it. In such
tasks, the learning agents can consider only the relevant state
components and thus further decrease the size of the problem
they need to solve [22].
Memory and processing time requirements will also be
smaller for smaller problem sizes.
If several learners solve similar tasks, then they could gain
further benefit from sharing their experience or knowledge.
IV. EXAMPLE: TWO-LINK RIGID MANIPULATOR
A. Manipulator Model
The two-link manipulator, depicted in Fig. 1, is described by the nonlinear fourth-order model:

  M(θ)θ̈ + C(θ, θ̇)θ̇ + G(θ) = τ,    (6)
where θ = [θ_1, θ_2]^T, τ = [τ_1, τ_2]^T. The system has two control inputs, the torques in the two joints, τ_1 and τ_2, and four measured outputs: the link angles, θ_1, θ_2, and their angular speeds, θ̇_1, θ̇_2.
Fig. 1. Schematic drawing of the two-link rigid manipulator (links of lengths l_1, l_2 and masses m_1, m_2, with a motor in each joint).

The mass matrix M(θ), the Coriolis and centrifugal forces matrix C(θ, θ̇), and the gravity vector G(θ) are:
  M(θ) = [ P_1 + P_2 + 2P_3 cos θ_2    P_2 + P_3 cos θ_2
           P_2 + P_3 cos θ_2           P_2               ]    (7)

  C(θ, θ̇) = [ b_1 − P_3 θ̇_2 sin θ_2    −P_3 (θ̇_1 + θ̇_2) sin θ_2
              P_3 θ̇_1 sin θ_2          b_2                      ]    (8)

  G(θ) = [ g_1 sin θ_1 + g_2 sin(θ_1 + θ_2)
           g_2 sin(θ_1 + θ_2)              ]    (9)
The meaning and values of the physical parameters of the system are given in Table I. Using these, the rest of the parameters in (6) can be computed by:

  P_1 = m_1 c_1^2 + m_2 l_1^2 + I_1,    P_2 = m_2 c_2^2 + I_2,    P_3 = m_2 l_1 c_2,
  g_1 = (m_1 c_1 + m_2 l_1) g,    g_2 = m_2 c_2 g.    (10)
In the sequel, it is assumed that the manipulator operates
in a horizontal plane, leading to G(θ) = 0. Furthermore, the
following simplifications are adopted in (6):
1) Coriolis and centrifugal forces are neglected, leading to C(θ, θ̇) = diag[b_1, b_2];
2) θ̈_1 is neglected in the equation for θ̈_2;
3) the friction in the second joint is neglected in the equation for θ̈_1.
After these simplifications, the dynamics of the manipulator can be approximated by:

  θ̈_1 = [ P_2 (τ_1 − b_1 θ̇_1) − (P_2 + P_3 cos θ_2) τ_2 ] / [ P_2 (P_1 + P_2 + 2P_3 cos θ_2) ]
  θ̈_2 = τ_2 / P_2 − b_2 θ̇_2    (11)

The complete process state is given by x = [θ^T, θ̇^T]^T. If centralized control is used, the command is u = τ; for decentralized control with one agent controlling each joint motor, the agent commands are u_1 = τ_1, u_2 = τ_2.
TABLE I
PHYSICAL PARAMETERS OF THE MANIPULATOR
Symbol      Parameter                                Value
g           gravitational acceleration               9.81 m/s^2
l_1         length of first link                     0.1 m
l_2         length of second link                    0.1 m
m_1         mass of first link                       1.25 kg
m_2         mass of second link                      1 kg
I_1         inertia of first link                    0.004 kgm^2
I_2         inertia of second link                   0.003 kgm^2
c_1         center of mass of first link             0.05 m
c_2         center of mass of second link            0.05 m
b_1         damping in first joint                   0.1 kg/s
b_2         damping in second joint                  0.02 kg/s
τ_1,max     maximum torque of first joint motor      0.2 Nm
τ_2,max     maximum torque of second joint motor     0.1 Nm
θ̇_1,max     maximum angular speed of first link      2π rad/sec
θ̇_2,max     maximum angular speed of second link     2π rad/sec

B. RL Control
The control goal is the stabilization of the system around θ = θ̇ = 0 in minimum time, with a tolerance of ±5 · π/180 rad for the angles, and ±0.1 rad/sec for the angular speeds.
To apply RL in the form presented in Section II, the time axis, as well as the continuous state and action components of the manipulator, must first be discretized. Time is discretized with a sampling time of T_s = 0.05 sec; this gives the discrete system dynamics f. Each state component is quantized in fuzzy bins, and three torque values are considered for each joint: −τ_{i,max} (maximal torque clockwise), 0, and τ_{i,max} (maximal torque counter-clockwise).
One Q-value is stored for each combination of bin centers
and torque values. The Q-values of continuous states are then
interpolated between these center Q-values, using the degrees
of membership to each fuzzy bin as interpolation weights. If
e.g., the Q-function has the form Q(θ_2, θ̇_2, τ_2), the Q-values of a continuous state [θ_{2,k}, θ̇_{2,k}]^T are computed by:

  Q̃(θ_{2,k}, θ̇_{2,k}, τ_2) = Σ_{m=1,...,N_{θ2}} Σ_{n=1,...,N_{θ̇2}} μ_{θ2,m}(θ_{2,k}) μ_{θ̇2,n}(θ̇_{2,k}) · Q(m, n, τ_2),    ∀τ_2    (12)
where e.g., μ_{θ̇2,n}(θ̇_{2,k}) is the membership degree of θ̇_{2,k} in the n-th bin. For triangular membership functions, this can be computed as:

  μ_{θ̇2,n}(θ̇_{2,k}) =
    max(0, (c_{n+1} − θ̇_{2,k}) / (c_{n+1} − c_n)),    if n = 1
    max(0, min((θ̇_{2,k} − c_{n−1}) / (c_n − c_{n−1}), (c_{n+1} − θ̇_{2,k}) / (c_{n+1} − c_n))),    if 1 < n < N_{θ̇2}
    max(0, (θ̇_{2,k} − c_{n−1}) / (c_n − c_{n−1})),    if n = N_{θ̇2}    (13)

where c_n is the center of the n-th bin; see Fig. 2 for an example.
Fig. 2. Example of quantization in fuzzy bins with triangular membership functions for θ̇_2 (7 bins with centers c_1 = −2π, c_2, . . . , c_6, c_7 = 2π rad/sec; the membership function μ_{θ̇2,6} is highlighted).
Such a set of bins is completely determined by a vector of bin center coordinates. For θ̇_1 and θ̇_2, 7 bins are used, with their centers at [−360, −180, −30, 0, 30, 180, 360] · π/180 rad/sec. For θ_1 and θ_2, 12 bins are used, with their centers at [−180, −130, −80, −30, −15, −5, 0, 5, 15, 30, 80, 130] · π/180 rad; there is no ‘last’ or ‘first’ bin, because the angles evolve on a circle manifold [−π, π). The −π point is identical to π, so the ‘last’ bin is a neighbor of the ‘first’.
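The triangular membership computation (13) and the bin centers above can be sketched in Python as follows; this is a minimal illustration, the function name is not from the paper, and the circular treatment of the first/last angle bins mentioned above is not handled here.

```python
import numpy as np

# Bin centers used in the paper (in rad and rad/sec)
theta_centers = np.array([-180, -130, -80, -30, -15, -5, 0, 5, 15, 30, 80, 130]) * np.pi / 180
thetadot_centers = np.array([-360, -180, -30, 0, 30, 180, 360]) * np.pi / 180

def memberships(value, centers):
    """Degrees of membership of `value` in triangular fuzzy bins, following (13).

    For values between the outermost centers, the memberships sum to 1.
    """
    N = len(centers)
    mu = np.zeros(N)
    for n in range(N):
        if n == 0:
            mu[n] = max(0.0, (centers[1] - value) / (centers[1] - centers[0]))
        elif n == N - 1:
            mu[n] = max(0.0, (value - centers[-2]) / (centers[-1] - centers[-2]))
        else:
            left = (value - centers[n - 1]) / (centers[n] - centers[n - 1])
            right = (centers[n + 1] - value) / (centers[n + 1] - centers[n])
            mu[n] = max(0.0, min(left, right))
    return mu

# Interpolation (12): Q_tilde = sum_m sum_n mu_theta[m] * mu_thetadot[n] * Q[m, n, tau_index]
```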
Algorithm 1 Fuzzy value iteration for a SISO RL controller
1: Q_0(m, u_j) = 0, for m = 1, . . . , N_X, j = 1, . . . , N_U
2: ℓ = 0
3: repeat
4:   for m = 1, . . . , N_X, j = 1, . . . , N_U do
5:     Q_{ℓ+1}(m, u_j) = ρ(c_m, u_j) + γ Σ_{m̃=1}^{N_X} μ_{x,m̃}(f(c_m, u_j)) max_{ũ_j} Q_ℓ(m̃, ũ_j)
6:   end for
7:   ℓ = ℓ + 1
8: until ‖Q_ℓ − Q_{ℓ−1}‖ ≤ δ
The optimal Q-functions for both the centralized and decen-
tralized case are computed with a version of value iteration (3)
which is altered to accommodate the fuzzy representation of
the state. The complete algorithm is given in Alg. 1. For easier
readability, the RL controller is assumed single-input single-
output, but the extension to multiple states and / or outputs is
straightforward. The discount factor is set to γ = 0.98, and
the threshold value to δ = 0.01.
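A minimal Python sketch of Algorithm 1 for a SISO controller is shown below. It reuses the hypothetical memberships helper from the earlier sketch, and the interfaces of f and rho are assumptions; the paper specifies only the algorithm, not an implementation.

```python
import numpy as np

def fuzzy_value_iteration(centers, actions, f, rho, gamma=0.98, delta=0.01):
    """Fuzzy value iteration (Algorithm 1) for a SISO RL controller.

    centers : bin center coordinates c_1..c_NX (the fuzzy state representation)
    actions : discrete action values u_1..u_NU
    f(x, u) : next continuous state;  rho(x, u) : reward
    """
    NX, NU = len(centers), len(actions)
    Q = np.zeros((NX, NU))
    while True:
        Q_new = np.empty_like(Q)
        for m in range(NX):
            for j in range(NU):
                x_next = f(centers[m], actions[j])
                mu = memberships(x_next, centers)   # membership degrees of f(c_m, u_j)
                # line 5: reward plus discounted, membership-weighted best value
                Q_new[m, j] = rho(centers[m], actions[j]) + gamma * np.dot(mu, Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) <= delta:      # until ||Q_l - Q_{l-1}|| <= delta
            return Q_new
        Q = Q_new
```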
The control action in state x_k is computed as follows (assuming as above a SISO controller):

  u_k = h(x_k) = Σ_{m=1}^{N_X} μ_{x,m}(x_k) · arg max_{ũ_j} Q(m, ũ_j)    (14)
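Continuing the sketch above, the interpolated control law (14) could look as follows; the helper is hypothetical and assumes scalar actions, so that the membership-weighted combination of the greedy actions is well defined.

```python
import numpy as np

def fuzzy_policy(x, centers, actions, Q):
    """Control action for continuous state x, following (14):
    a membership-weighted combination of the greedy action at each bin center."""
    mu = memberships(x, centers)
    greedy = np.array([actions[j] for j in np.argmax(Q, axis=1)])  # arg max_u Q(m, u) per bin
    return float(np.dot(mu, greedy))
```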
Centralized RL. The reward function ρ for the centralized learner computes rewards by:

  r_k = 0      if |θ_{i,k}| ≤ 5 · π/180 rad and |θ̇_{i,k}| ≤ 0.1 rad/sec, for i ∈ {1, 2}
  r_k = −0.5   otherwise    (15)
The centralized policy for solving the two-link manipulator task must be of the form:

  [τ_1, τ_2]^T = h(θ_1, θ_2, θ̇_1, θ̇_2)    (16)

Therefore, the centralized learner uses a Q-table of the form Q(θ_1, θ_2, θ̇_1, θ̇_2, τ_1, τ_2).
The policy computed by value iteration is applied to the system starting from the initial state x_0 = [−1, −3, 0, 0]^T. The resulting command, state, and reward signals are given in Fig. 3(a).
Decentralized RL. In the decentralized case, the rewards are computed separately for the two agents:

  r_{i,k} = 0      if |θ_{i,k}| ≤ 5 · π/180 rad and |θ̇_{i,k}| ≤ 0.1 rad/sec
  r_{i,k} = −0.5   otherwise    (17)
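A sketch of the reward computations (15) and (17); the helper names are hypothetical, while the tolerance values are those stated in the text.

```python
import numpy as np

ANG_TOL = 5 * np.pi / 180   # +/- 5 degrees, in rad
VEL_TOL = 0.1               # rad/sec

def reward_centralized(x):
    """Reward (15): 0 only if both links are within the tolerance band, else -0.5."""
    th1, th2, th1d, th2d = x
    ok = (abs(th1) <= ANG_TOL and abs(th2) <= ANG_TOL and
          abs(th1d) <= VEL_TOL and abs(th2d) <= VEL_TOL)
    return 0.0 if ok else -0.5

def reward_agent(x, i):
    """Reward (17) for agent i (1 or 2): judged only on its own link."""
    th, thd = x[i - 1], x[i + 1]
    return 0.0 if (abs(th) <= ANG_TOL and abs(thd) <= VEL_TOL) else -0.5
```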
For decentralized control, the system (11) creates an asymmetric setting. Agent 2 can choose its action τ_{2,k} by only considering the second link's state, whereas agent 1 needs to take into account θ_{2,k} and τ_{2,k} besides the first link's state.

Fig. 3. State, command, and reward signals for RL control (link angles [rad], link velocities [rad/sec], commanded torques in joints 1 and 2 [Nm], and reward, versus t [sec]). (a) Centralized RL (thin line: link 1, thick line: link 2). (b) Decentralized RL (thin line: link/agent 1, thick line: link/agent 2).

If agent 2 is always the first to choose its action, and agent 1 can learn about this action before it is actually taken (e.g., by communication), then the two agents can learn control policies of the following form:
  τ_2 = h_2(θ_2, θ̇_2)
  τ_1 = h_1(θ_1, θ_2, θ̇_1, τ_2)    (18)
Therefore, the two agents use Q-tables of the form Q_2(θ_2, θ̇_2, τ_2) and Q_1(θ_1, θ_2, θ̇_1, τ_2, τ_1), respectively. Value iteration is applied first for agent 2, and the resulting policy is used in value iteration for agent 1.
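The table sizes implied by the discretization above illustrate the reduction in problem size: with 12 angle bins, 7 velocity bins, and 3 torque values per joint, the centralized table has 12·12·7·7·3·3 = 63,504 entries, versus 12·12·7·3·3 = 9,072 for agent 1 and 12·7·3 = 252 for agent 2. A sketch of the two-stage decentralized procedure, with hypothetical array shapes following the Q-table forms above:

```python
import numpy as np

# Q-table shapes implied by the bin and action counts above (hypothetical array layout)
N_TH, N_THD, N_TAU = 12, 7, 3
Q_central = np.zeros((N_TH, N_TH, N_THD, N_THD, N_TAU, N_TAU))   # 63,504 entries
Q_agent2 = np.zeros((N_TH, N_THD, N_TAU))                        # 252 entries
Q_agent1 = np.zeros((N_TH, N_TH, N_THD, N_TAU, N_TAU))           # 9,072 entries

# Two-stage decentralized learning, as described in the text:
# 1) run value iteration for agent 2 on Q_agent2 and extract its policy h2(th2, th2d);
# 2) run value iteration for agent 1 on Q_agent1, treating tau2 = h2(th2, th2d) as a
#    known part of the environment when evaluating the transition of the first link.
```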
The policies computed in this way are applied to the system starting from the initial state x_0 = [−1, −3, 0, 0]^T. The resulting command, state, and reward signals are given in Fig. 3(b).
C. Discussion
Value iteration converges in 125 iterations for the centralized case, 192 iterations for agent 1, and 49 iterations for agent 2. The learning speeds are therefore comparable for centralized and decentralized learning in this application. Agent 2 of course converges relatively faster, as its state-action space is much smaller.
Both the centralized and the decentralized policies stabilize the system in 1.2 seconds. The steady-state angle offsets are all within the imposed 5 degrees tolerance bound. Notice that in Fig. 3(b), the first link is stabilized slightly faster than in Fig. 3(a), where both links are stabilized at around the same time. This is because decentralized learners are rewarded separately (17), and have an incentive to stabilize their respective links faster.
The form of coordination used by the two agents is explicit: agent 1 is informed of the action chosen by agent 2 (e.g., by communication) before selecting its own action.