
Improved deep reinforcement learning for robotics through distribution-based experience retention

Delft University of Technology
Improved deep reinforcement learning for robotics through distribution-based experience retention
de Bruin, Tim; Kober, Jens; Tuyls, Karl; Babuška, Robert
DOI: 10.1109/IROS.2016.7759581
Publication date: 2016
Document Version: Accepted author manuscript
Published in: Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Citation (APA): de Bruin, T., Kober, J., Tuyls, K., & Babuška, R. (2016). Improved deep reinforcement learning for robotics through distribution-based experience retention. In D-S. Kwon, C-G. Kang, & I. H. Suh (Eds.), Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS): IROS 2016 (pp. 3947-3952). IEEE. https://doi.org/10.1109/IROS.2016.7759581

Tim de Bruin¹, Jens Kober¹, Karl Tuyls²,¹, Robert Babuška¹
Abstract: Recent years have seen a growing interest in the use of deep neural networks as function approximators in reinforcement learning. In this paper, an experience replay method is proposed that ensures that the distribution of the experiences used for training is between that of the policy and a uniform distribution. Through experiments on a magnetic manipulation task it is shown that the method reduces the need for sustained exhaustive exploration during learning. This makes it attractive in scenarios where sustained exploration is infeasible or undesirable, such as for physical systems like robots and for lifelong learning. The method is also shown to improve the generalization performance of the trained policy, which can make it attractive for transfer learning. Finally, for small experience databases the method performs favorably when compared to the recently proposed alternative of using the temporal difference error to determine the experience sample distribution, which makes it an attractive option for robots with limited memory capacity.
I. INTRODUCTION
Modern day robots are increasingly required to adapt to
changing circumstances and to learn how to behave in new
and complex environments. Reinforcement Learning (RL)
provides a powerful framework that enables them to do this
with minimal prior knowledge about their environment or
their own dynamics [1]. When applying RL to problems
with medium to large state and action dimensions, function
approximators are needed to keep the process tractable. Deep
neural networks have recently had great successes as function
approximators in RL both for robotics [2] and beyond [3],
[4].
When RL is used to learn directly from trial and error,
the amount of interaction required for the robot to learn
good behavior policies can be prohibitively large. Experience
Replay (ER) [5] is a technique that can help to overcome this
problem by allowing interaction experiences to be re-used.
This can make RL more sample efficient, and has proven
to be important to make RL with deep neural networks as
function approximators work in practice [6], [7].
As reported previously [8], RL with deep neural network
function approximators can fail when the experiences that
are used to train the neural networks are not diverse enough.
When learning online, or when using experience replay in
¹All authors are with the Delft Center for Systems and Control, Delft University of Technology. {t.d.debruin, j.kober, r.babuska}@tudelft.nl
²Karl Tuyls is with the Department of Computer Science, University of Liverpool. K.Tuyls@liverpool.ac.uk
This work is part of the research programme Deep Learning for Robust Robot Control (DL-Force) with project number 656.000.003, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).
Fig. 1: Magnetic manipulation setup. The horizontal position
of the ball needs to be controlled via the four coils. The
results in this paper are from a simulation model of this
setup.
the standard manner, in which experiences are added in a
First In First Out (FIFO) manner and sampled uniformly,
this effectively translates to a requirement to always keep
on exploring. This can be problematic when using RL on
physical systems such as robots, where continued thorough
exploration leads to increased wear or even damage of the
system. Additionally, this can lead to bad task-performance
of the robot while learning.
In this paper a method is proposed in which two ex-
perience replay databases are deployed. One is filled with
experiences in the standard FIFO manner, while in the other
one the experiences are overwritten with new experiences
in order to get an approximately uniform distribution over
the state-action space. By sampling experiences from both
databases when training the deep neural networks, the detri-
mental effects of reduced exploration can be limited. This
method is tested on a simulated magnetic manipulation task
(Figure 1).
This work is closely related to [9] where an experience
replay strategy is proposed in which all experiences are
saved, but the sampling procedure is based on the temporal
difference error. We show that when a small database is used,
the temporal difference error does not yield good results.
In [3] a model free RL method with deep neural network
function approximation was proposed in which no experience
replay was used. However, this method requires several
different exploration policies to be followed simultaneously,
which is impractical outside of simulation.
The remainder of this paper is organized as follows (see also the accompanying video): Section II explains the deep reinforcement learning
method, some preliminaries about experience replay and our
proposed extension. In Section III the magnetic manipulator
problem on which our method is tested is discussed. Then in
Section IV we examine the properties and the performance
of our method on the magnetic manipulator problem. We
also compare the method to alternatives.
II. METHOD
The experience replay method proposed in this paper is
used in combination with the Deep Deterministic Policy
Gradient (DDPG) algorithm presented in [6]. The results are
however expected to apply similarly to other deep reinforce-
ment learning methods that make use of experience replay.
A. Deep Deterministic Policy Gradient (DDPG)
The DDPG method is an off-policy actor-critic reinforce-
ment learning algorithm. Actor-critic algorithms are inter-
esting for robot control, as they allow for continuous action
spaces. This means that smooth actuator control signals can
be learned.
The algorithm uses an actor and a critic. The actor π attempts to determine the real-valued control action a^π ∈ ℝ^n that will maximize the expected sum of future rewards r, based on the current state of the system s ∈ ℝ^m; a^π = π(s). The critic Q predicts the expected discounted sum of future rewards when taking action a(k) in state s(k) at time k and the policy π is followed for all future time steps:
$$Q^\pi(s, a) = \mathbb{E}\left[ r\big(s(k), a, s(k+1)\big) + \sum_{j=k+1}^{\infty} \gamma^{\,j-k}\, r\big(s(j), \pi(s(j)), s(j+1)\big) \;\middle|\; s(k) = s \right] \qquad (1)$$
with 0 ≤ γ < 1 the discount factor, which is used to ensure this sum is finite.
The actor and critic functions are approximated by neural
networks with parameter vectors ζ and ξ respectively. The
critic network weights ξ are updated to minimize the squared
temporal difference error:
$$L(\xi) = \Big( r + \gamma\, Q\big(s', \pi(s'\,|\,\zeta')\,\big|\,\xi'\big) - Q(s, a\,|\,\xi) \Big)^2 \qquad (2)$$
where s = s(k), s′ = s(k+1) and r = r(s, a, s′) for brevity.
The parameter vectors ζ′ and ξ′ are copies of ζ and ξ that are updated with a low-pass filter to slowly track ζ and ξ:
$$\xi' \leftarrow \tau\,\xi + (1 - \tau)\,\xi' \qquad (3)$$
$$\zeta' \leftarrow \tau\,\zeta + (1 - \tau)\,\zeta' \qquad (4)$$
This improves the stability of the learning algorithm [6]. The parameter τ determines how quickly the ζ′ and ξ′ track ζ and ξ. Values of τ close to one result in fast yet unstable learning, whereas small values of τ result in slow yet stable learning. Here τ = 10⁻² is used.
The actor network is updated in the direction that will
maximize the expected reward according to the critic:
$$\Delta\zeta \propto \nabla_a Q(s, a\,|\,\xi)\big|_{s=s(k),\,a=\pi(s(k)|\zeta)}\; \nabla_\zeta\, \pi(s\,|\,\zeta)\big|_{s=s(k)} \qquad (5)$$
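To make updates (2)-(5) concrete, the following is a minimal sketch of one DDPG training step with soft target updates (3)-(4). It assumes PyTorch, small fully connected networks, and illustrative dimensions and learning rates; it is not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Small fully connected network (illustrative architecture)."""
    def __init__(self, in_dim, out_dim, hidden=64, out_act=None):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
        self.out_act = out_act

    def forward(self, x):
        y = self.net(x)
        return self.out_act(y) if self.out_act is not None else y

state_dim, action_dim = 2, 4      # illustrative: (position, velocity) and 4 coil currents
gamma, tau = 0.98, 1e-2           # discount factor and target filter constant, as in (3)-(4)

actor    = MLP(state_dim, action_dim, out_act=torch.sigmoid)   # pi(s | zeta), bounded actions
critic   = MLP(state_dim + action_dim, 1)                      # Q(s, a | xi)
actor_t  = MLP(state_dim, action_dim, out_act=torch.sigmoid)   # pi(s | zeta')
critic_t = MLP(state_dim + action_dim, 1)                      # Q(s, a | xi')
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One mini-batch update; s, a, r, s_next are float tensors, r of shape (batch, 1)."""
    # Critic update: minimise the squared temporal difference error (2).
    with torch.no_grad():
        q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
        q_target = r + gamma * q_next
    td_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    critic_opt.zero_grad()
    td_loss.backward()
    critic_opt.step()

    # Actor update: follow the critic's gradient with respect to the action, as in (5).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks slowly track the online networks, (3)-(4).
    with torch.no_grad():
        for p, p_t in zip(actor.parameters(), actor_t.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
        for p, p_t in zip(critic.parameters(), critic_t.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```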
B. Experience Replay
The use of an off-policy algorithm is very relevant
for robotics as it allows for experience replay [5] to be
used. When using experience replay, the experience tuples
⟨s, a, s′, r⟩ from the interaction with the system are stored
in a database. During the training of the neural networks,
the experiences are sampled from this database, allowing
them to be used multiple times. The addition of experience
replay aids the learning in several ways. The first benefit
is the increased sample efficiency by allowing samples to be
reused. Additionally, in the context of neural networks, expe-
rience replay allows for mini-batch updates which improves
the computational efficiency, especially when the training is
performed on a GPU.
On top of the efficiency gains that experience replay
brings, it also improves the stability of RL algorithms that
make use of neural network function approximators such as
DQN [7] and DDPG [6]. One way in which the database
helps stabilize the learning process is that it is used to break
the temporal correlations of the neural network learning
updates. Without an experience database, the updates of
(2), (5) would be based on subsequent experience samples
from the system. These samples are highly correlated since
the state of the system does not change much between
consecutive time-steps. For real-time control, this effect is
even more pronounced with high sampling frequencies. The
problem this poses to the learning process is that most mini-
batch optimization algorithms are based on the assumption
of independent and identically distributed (i.i.d.) data [6].
Learning from subsequent samples would violate this i.i.d.
assumption and cause the updates to the network parameters
to have a high variance, leading to slower and potentially
less stable learning [10]. By saving the experiences over a period of time and updating the neural networks with mini-batches of experiences that are sampled uniformly at random from the database, this problem is alleviated.
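For illustration, a minimal sketch of such a standard FIFO experience replay buffer with uniform mini-batch sampling is given below; the class and method names are hypothetical.

```python
import random
from collections import deque

class FIFOReplayBuffer:
    """Standard FIFO experience replay: the oldest tuples are overwritten first."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # deque drops the oldest item when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random mini-batch; sampling across many past time steps
        # breaks the temporal correlation between consecutive experiences.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```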
1) Effects of the Experience Sample Distributions: In this
paper we use a deterministic policy in combination with
an actor-critic algorithm that uses Q-learning updates. This
means that in theory, no importance sampling is needed to
compensate for the fact that we are sampling our experiences
off-policy [11].
However, we are using deep neural networks as global
function approximators for the actor and the critic. After
every episode E the critic network is updated to minimize
an estimate of the loss function (2), the empirical loss:
$$E(\xi) = \frac{1}{|D|} \sum_{i \in D} \Big( r_i + \gamma\, Q\big(s'_i, \pi(s'_i\,|\,\zeta')\,\big|\,\xi'\big) - Q(s_i, a_i\,|\,\xi) \Big)^2 \qquad (6)$$
Here i indexes the samples in the experience replay database D after episode E. The distribution of the samples over the
state-action space clearly determines the contribution of the
approximation accuracy in these regions to the empirical loss.
An approximation of Q that is very precise in a part of the
state-action space that has many samples but imprecise in a
region with few experience samples might result in a low
empirical loss. Meanwhile, an approximation that is more or
less correct everywhere might result in a higher empirical
loss. From (5) it can be seen that when the critic is not
accurate for a certain region of the state action space, the
updates to the actor will also likely be wrong.
Additionally, even if a neural network has previously
learned to do a task well, it can forget this knowledge
completely when learning a new task, even when the new
task is related to the old one [12]. In the case of the
DDPG algorithm, even if the critic can accurately predict
the expected future sum of rewards for parts of the state-
action space, this ability can disappear when it is no longer
trained on data from this part of the state-action space, as the
same parameters apply to the other parts of the state-action
space as well and might be changed to reduce the temporal
difference error there.
We earlier observed [8] that when the experiences are
sampled by exclusively following a deterministic policy
without exploration, even a good one, the DDPG method
fails. Since sufficient exploration prevents this problem, this
seems to imply that having a value function and policy
that generalize to the whole state-action space to at least
some extent is important. Therefore, we would like to have
at least some sample density over the whole state-action
space. We are however mostly interested in those areas of
the state-action space that would actually be encountered
when performing the task. We therefore want most of our
experiences to be in this region. For optimal performance
we therefore need to find a trade-off between both criteria.
Furthermore, these properties of the database distribution
should ideally hold after all episodes E.
In Section IV experiments are shown that investigate the
influence of the experience sample distribution over the state-
action space. These experiments indeed show that the ideal
distribution is likely to be somewhere between the distribu-
tion that results from simply following the most recent policy
with some exploration and a uniform distribution over the
state-action space.
2) Distribution Based Experience Retention: The most
common experience replay method, which is used in the
DQN [7] and DDPG [6] papers, is to use a database D that
is overwritten in a First In First Out (FIFO) fashion. The
experiences are then sampled from this database uniformly at random. This will, however, in general not yield a desirable
distribution of the experiences in the sampled batches over
the state-action space. In fact, as will be shown in Section IV,
when at some point during the training the amount of explo-
ration is reduced too far, the performance of the controller
policy will decrease.
Maintaining high levels of exploration might place infeasible demands on physical systems such as robots. On
these systems, continued extensive exploration might cause
increased wear or damage or be simply impossible because
the robot is required to perform a task adequately while
learning is under way.
In this paper, a method is proposed to maintain a desirable
distribution over the state-action space of the experiences in
the batches used to update the neural networks.

Fig. 2: Experimental setup schematic (ball position y and coil currents u₁ ... u₄).
The proposed method is based on having two experience databases of limited size. The first database D^π is overwritten in the standard FIFO manner. The distribution of the experience samples in this database will therefore correspond approximately to the current policy.
For the second database D^U, the experiences in the database are overwritten by the new experiences in such a way that we approximate a uniform distribution over the state-action space. To do this, after this experience database has been filled to capacity, each new experience will overwrite the experience i already in this database that is most likely under the distribution induced by the experiences j already contained in D^U. The following distance metric, employing kernel density estimation [13] on the experiences in the database D^U, is used to determine which experience will be overwritten:
$$i_{\mathrm{overwrite}} = \operatorname*{argmax}_{i \in D}\; \frac{1}{|D|} \sum_{j \in D} e^{-\sum_{d=1}^{D_N} (i_d - j_d)^2 / C_d} \qquad (7)$$
where d are the dimensions in the state-action space, D_N is the total dimensionality of the state-action space and C_d is a dimension-dependent scaling constant. Here, C_d is chosen as |d|/C with |d| the size of the considered part of that state-action dimension. C is a constant that is dependent on the size of the database and the properties of the distribution. It is chosen manually based on the approximation quality of the sample distribution.
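The overwrite rule (7) can be implemented directly with NumPy. The sketch below is illustrative rather than the authors' implementation; the database layout and the helper names are assumptions.

```python
import numpy as np

def index_to_overwrite(db, c):
    """Index of the experience that is most likely under the kernel density
    estimate induced by the experiences already in D^U, as in (7).

    db : (N, D_N) array, each row a state-action point in the database
    c  : (D_N,)  array of per-dimension scaling constants C_d
    """
    diff = db[:, None, :] - db[None, :, :]        # pairwise differences (i - j)
    dist = np.sum(diff ** 2 / c, axis=-1)         # sum_d (i_d - j_d)^2 / C_d
    density = np.exp(-dist).mean(axis=1)          # average kernel value over j
    return int(np.argmax(density))                # densest, most redundant experience

def insert_uniform(db, new_point, capacity, c):
    """Add new_point to D^U; once at capacity, overwrite the densest point."""
    new_point = np.asarray(new_point, dtype=float)
    if len(db) < capacity:
        db.append(new_point)
    else:
        db[index_to_overwrite(np.asarray(db), c)] = new_point
    return db
```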
When training the neural networks, experiences are drawn uniformly at random from D^U with probability β and uniformly at random from D^π with probability (1 − β). The constant β represents a trade-off between generalization performance and task performance. Additionally, the value of β could be increased when the amount of exploration is reduced, to prevent loss of performance.
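A possible way to compose a training mini-batch from the two databases is sketched below; the function name and the database representation (lists of experience tuples) are hypothetical.

```python
import random

def sample_batch(db_pi, db_uniform, batch_size, beta):
    """Draw each experience from D^U with probability beta, else from D^pi."""
    batch = []
    for _ in range(batch_size):
        source = db_uniform if random.random() < beta else db_pi
        batch.append(random.choice(source))
    return batch
```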
III. EXPERIMENTS
To test the proposed method, control policies are learned
for a simulated magnetic manipulation task.
A. MAGMAN
Magnetic manipulation is contactless, which opens up
new possibilities for actuation on a micro scale and in
environments where it is not possible to use traditional
actuators. Examples of this type of application are medical micro- and nanorobots [14].
Our magnetic manipulation setup (Figures 1 and 2) has
four electromagnets in a line. The current through the
electromagnet coils is controlled to dynamically shape the
magnetic field above the electromagnets and so to position a
steel ball accurately and quickly to a desired set point. The
ball position is measured by a laser sensor.
The horizontal acceleration of the ball is given by:
$$\ddot{y} = -\frac{b}{m}\,\dot{y} + \frac{1}{m} \sum_{i=1}^{4} g(y, i)\, u_i \qquad (8)$$
with
$$g(y, i) = \frac{-c_1\,(y - 0.025\,i)}{\big((y - 0.025\,i)^2 + c_2\big)^3}. \qquad (9)$$
Here, y denotes the position of the ball, ẏ its velocity and ÿ the acceleration. With u_i the current through coil i = 1, 2, 3, 4, g(y, i) is the nonlinear magnetic force equation, m [kg] the ball mass, and b [Ns/m] the viscous friction of the ball on the rail. The model parameters are listed in Table I.
TABLE I: Magnetic manipulation system parameters

Model parameter       Symbol   Value           Unit
Ball mass             m        3.200 · 10⁻²    kg
Viscous damping       b        1.613 · 10⁻²    Ns/m
Empirical parameter   c₁       5.520 · 10⁻¹⁰   Nm⁵A⁻¹
Empirical parameter   c₂       1.750 · 10⁻⁴    m²
Sampling period       Tₛ       0.02            s
The reinforcement learning state s is given by the position and velocity of the ball. The action a is defined as the vector of currents u₁ . . . uₙ ∈ [0, 0.6] to the coils. The reward function is defined as:
$$r(s) = -\big(100\,|y - y_r| + 5\,|\dot{y}|\big) \qquad (10)$$
where the reference position y_r is set to y_r = 0.035 m.
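For illustration, the dynamics (8)-(9), the parameters of Table I and the reward (10) can be simulated with a simple forward Euler scheme at the sampling period Tₛ. The sketch below (with four coils, as in (8)) is an assumption about the discretization, not the simulator used for the reported results.

```python
import numpy as np

M, B = 3.200e-2, 1.613e-2          # ball mass [kg], viscous damping [Ns/m] (Table I)
C1, C2 = 5.520e-10, 1.750e-4       # empirical magnetic force parameters (Table I)
TS, Y_REF = 0.02, 0.035            # sampling period [s], reference position [m]

def g(y, i):
    """Nonlinear force term of coil i (centred at 0.025*i metres), eq. (9)."""
    d = y - 0.025 * i
    return -C1 * d / (d ** 2 + C2) ** 3

def step(y, y_dot, u):
    """One forward Euler step of the ball dynamics (8); u holds the 4 coil currents."""
    force = sum(g(y, i) * u[i - 1] for i in range(1, 5))
    y_ddot = (-B * y_dot + force) / M
    return y + TS * y_dot, y_dot + TS * y_ddot

def reward(y, y_dot):
    """Reward (10): penalise the distance to the reference and the ball's velocity."""
    return -(100.0 * abs(y - Y_REF) + 5.0 * abs(y_dot))
```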
For the theoretical experiments, simulations have been
performed with 3 coils. In these experiments the ball always
starts with the position and velocity equal to zero. We
measure the performance of the controller in two ways:
• The task performance: the average reward for an episode when using the same initial conditions as were used during training.
• The generalization performance: the average reward for an episode when starting from several different initial positions and velocities.
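Both measures can be estimated by rolling out the learned policy from different sets of initial conditions. The sketch below is illustrative and reuses the hypothetical step and reward helpers from the dynamics sketch above.

```python
import numpy as np

def average_episode_reward(policy, initial_conditions, episode_length=200):
    """Mean per-step reward per episode, averaged over a set of initial states.

    Evaluating from the training initial condition (position and velocity zero)
    gives the task performance; evaluating from a grid of other initial
    positions and velocities gives the generalization performance.
    """
    returns = []
    for y, y_dot in initial_conditions:
        total = 0.0
        for _ in range(episode_length):
            u = policy(np.array([y, y_dot]))    # coil currents chosen by the policy
            y, y_dot = step(y, y_dot, u)        # dynamics sketched above
            total += reward(y, y_dot)
        returns.append(total / episode_length)
    return float(np.mean(returns))
```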
On physical systems such as robots, continued thorough
exploration is not always desirable or even feasible. To reflect
this fact, the amount of exploration is decayed exponentially per episode in all experiments.
IV. RESULTS
To investigate the merits of the method proposed in Sec-
tion II, several experiments are conducted on the magnetic
manipulation problem described in Section III.
Fig. 3: Influence of the database distribution on the learning performance: (a) performance on the training task, (b) performance on the generalization task. Means and 90% confidence bounds shown for 30 trials.
A. Distribution Effects
In Section II-B.1 we theorized that the ideal distribution
of the experiences in the mini-batches over the state-action
space would be somewhere between a uniform distribution
and the distribution resulting from the policy. We now test
this hypothesis experimentally.
Trials are conducted with an experience replay database
that is overwritten in the standard FIFO manner. However,
with a probability α the experience that results from inter-
acting with the system is replaced with a hypothetical expe-
rience before being written to the database. The hypothetical
experience is synthesized by choosing the state and action
uniformly random from the state and action spaces. The next
state and the reward are known since a simulation is used.
In general this is not the case, but here it serves to test the
desirability of the theoretical database distribution.
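A sketch of this replacement mechanism is given below, assuming a simulator function that returns the next state and reward for an arbitrary state-action pair; all names are hypothetical.

```python
import random
import numpy as np

def experience_to_store(real_experience, alpha, state_low, state_high,
                        action_low, action_high, simulate):
    """With probability alpha, store a synthetic uniformly drawn experience
    instead of the real one (only possible because a simulator is available)."""
    if random.random() >= alpha:
        return real_experience
    s = np.random.uniform(state_low, state_high)        # uniform random state
    a = np.random.uniform(action_low, action_high)      # uniform random action
    s_next, r = simulate(s, a)                           # simulator provides the outcome
    return (s, a, r, s_next)
```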
The average results of 30 repetitions of this experiment
are shown in Figure 3, for different values of α. For α = 0
we get the standard FIFO method. Here, training is based
on the experiences from the 10 most recent episodes. The

References
- V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, 2015.
- D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.
- E. Parzen, "On estimation of a probability density function and mode," The Annals of Mathematical Statistics, 1962.
- V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning (ICML), 2016.
- J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, 2013.