
Improved deep reinforcement learning for robotics through distribution-based experience retention

Delft University of Technology
Improved deep reinforcement learning for robotics through distribution-based experience retention
de Bruin, Tim; Kober, Jens; Tuyls, Karl; Babuška, Robert
DOI: 10.1109/IROS.2016.7759581
Publication date: 2016
Document Version: Accepted author manuscript
Published in: Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Citation (APA): de Bruin, T., Kober, J., Tuyls, K., & Babuška, R. (2016). Improved deep reinforcement learning for robotics through distribution-based experience retention. In D-S. Kwon, C-G. Kang, & I. H. Suh (Eds.), Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS): IROS 2016 (pp. 3947-3952). IEEE. https://doi.org/10.1109/IROS.2016.7759581

Tim de Bruin¹, Jens Kober¹, Karl Tuyls²,¹, Robert Babuška¹
Abstract: Recent years have seen a growing interest in the use of deep neural networks as function approximators in reinforcement learning. In this paper, an experience replay method is proposed that ensures that the distribution of the experiences used for training is between that of the policy and a uniform distribution. Through experiments on a magnetic manipulation task it is shown that the method reduces the need for sustained exhaustive exploration during learning. This makes it attractive in scenarios where sustained exploration is infeasible or undesirable, such as for physical systems like robots and for lifelong learning. The method is also shown to improve the generalization performance of the trained policy, which can make it attractive for transfer learning. Finally, for small experience databases the method performs favorably when compared to the recently proposed alternative of using the temporal difference error to determine the experience sample distribution, which makes it an attractive option for robots with limited memory capacity.
I. INTRODUCTION
Modern day robots are increasingly required to adapt to
changing circumstances and to learn how to behave in new
and complex environments. Reinforcement Learning (RL)
provides a powerful framework that enables them to do this
with minimal prior knowledge about their environment or
their own dynamics [1]. When applying RL to problems
with medium to large state and action dimensions, function
approximators are needed to keep the process tractable. Deep
neural networks have recently had great successes as function
approximators in RL both for robotics [2] and beyond [3],
[4].
When RL is used to learn directly from trial and error,
the amount of interaction required for the robot to learn
good behavior policies can be prohibitively large. Experience
Replay (ER) [5] is a technique that can help to overcome this
problem by allowing interaction experiences to be re-used.
This can make RL more sample efficient, and has proven
to be important to make RL with deep neural networks as
function approximators work in practice [6], [7].
As reported previously [8], RL with deep neural network
function approximators can fail when the experiences that
are used to train the neural networks are not diverse enough.
When learning online, or when using experience replay in
¹All authors are with the Delft Center for Systems and Control, Delft University of Technology. {t.d.debruin, j.kober, r.babuska}@tudelft.nl
²Karl Tuyls is with the Department of Computer Science, University of Liverpool. K.Tuyls@liverpool.ac.uk
This work is part of the research programme Deep Learning for Robust Robot Control (DL-Force) with project number 656.000.003, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).
Fig. 1: Magnetic manipulation setup. The horizontal position
of the ball needs to be controlled via the four coils. The
results in this paper are from a simulation model of this
setup.
the standard manner, in which experiences are added in a
First In First Out (FIFO) manner and sampled uniformly,
this effectively translates to a requirement to always keep
on exploring. This can be problematic when using RL on
physical systems such as robots, where continued thorough
exploration leads to increased wear or even damage of the
system. Additionally, this can lead to bad task-performance
of the robot while learning.
In this paper a method is proposed in which two ex-
perience replay databases are deployed. One is filled with
experiences in the standard FIFO manner, while in the other
one the experiences are overwritten with new experiences
in order to get an approximately uniform distribution over
the state-action space. By sampling experiences from both
databases when training the deep neural networks, the detri-
mental effects of reduced exploration can be limited. This
method is tested on a simulated magnetic manipulation task
(Figure 1).
This work is closely related to [9] where an experience
replay strategy is proposed in which all experiences are
saved, but the sampling procedure is based on the temporal
difference error. We show that when a small database is used,
the temporal difference error does not yield good results.
In [3] a model free RL method with deep neural network
function approximation was proposed in which no experience
replay was used. However, this method requires several
different exploration policies to be followed simultaneously,
which is impractical outside of simulation.
The remainder of this paper is organized as follows (see also the accompanying video): Section II explains the deep reinforcement learning
method, some preliminaries about experience replay and our
proposed extension. In Section III the magnetic manipulator
problem on which our method is tested is discussed. Then in
Section IV we examine the properties and the performance
of our method on the magnetic manipulator problem. We
also compare the method to alternatives.
II. METHOD
The experience replay method proposed in this paper is
used in combination with the Deep Deterministic Policy
Gradient (DDPG) algorithm presented in [6]. The results are
however expected to apply similarly to other deep reinforce-
ment learning methods that make use of experience replay.
A. Deep Deterministic Policy Gradient (DDPG)
The DDPG method is an off-policy actor-critic reinforce-
ment learning algorithm. Actor-critic algorithms are inter-
esting for robot control, as they allow for continuous action
spaces. This means that smooth actuator control signals can
be learned.
The algorithm uses an actor and a critic. The actor π attempts to determine the real-valued control action a^π ∈ ℝ^n that will maximize the expected sum of future rewards r, based on the current state of the system s ∈ ℝ^m; a^π = π(s). The critic Q predicts the expected discounted sum of future rewards when taking action a(k) in state s(k) at time k and the policy π is followed for all future time steps:
$$Q^\pi(s, a) = \mathbb{E}\left[ r\big(s(k), a, s(k+1)\big) + \sum_{j=k+1}^{\infty} \gamma^{\,j-k}\, r\big(s(j), \pi(s(j)), s(j+1)\big) \;\middle|\; s(k) = s \right] \qquad (1)$$
with 0 ≤ γ < 1 the discount factor, which is used to ensure this sum is finite.
The actor and critic functions are approximated by neural
networks with parameter vectors ζ and ξ respectively. The
critic network weights ξ are updated to minimize the squared
temporal difference error:
$$L(\xi) = \Big( r + \gamma\, Q\big(s', \pi(s'\,|\,\zeta')\,\big|\,\xi'\big) - Q(s, a\,|\,\xi) \Big)^2 \qquad (2)$$
where s = s(k), s′ = s(k+1) and r = r(s, a, s′) for brevity.
The parameter vectors ζ′ and ξ′ are copies of ζ and ξ that are updated with a low-pass filter to slowly track ζ and ξ:
$$\xi' \leftarrow \tau\,\xi + (1 - \tau)\,\xi' \qquad (3)$$
$$\zeta' \leftarrow \tau\,\zeta + (1 - \tau)\,\zeta' \qquad (4)$$
This improves the stability of the learning algorithm [6]. The parameter τ determines how quickly the ζ′ and ξ′ track ζ and ξ. Values of τ close to one result in fast yet unstable learning, whereas small values of τ result in slow yet stable learning. Here τ = 10⁻² is used.
The actor network is updated in the direction that will
maximize the expected reward according to the critic:
$$\Delta\zeta \propto \nabla_a Q(s, a\,|\,\xi)\big|_{s=s(k),\,a=\pi(s(k)|\zeta)}\; \nabla_\zeta\, \pi(s\,|\,\zeta)\big|_{s=s(k)} \qquad (5)$$
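To make updates (2)-(5) concrete, the following is a minimal sketch of one DDPG training step with soft target updates (3)-(4). It assumes PyTorch, small fully connected networks, and illustrative dimensions and learning rates; it is not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Small fully connected network (illustrative architecture)."""
    def __init__(self, in_dim, out_dim, hidden=64, out_act=None):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
        self.out_act = out_act

    def forward(self, x):
        y = self.net(x)
        return self.out_act(y) if self.out_act is not None else y

state_dim, action_dim = 2, 4      # illustrative: (position, velocity) and 4 coil currents
gamma, tau = 0.98, 1e-2           # discount factor and target filter constant, as in (3)-(4)

actor    = MLP(state_dim, action_dim, out_act=torch.sigmoid)   # pi(s | zeta), bounded actions
critic   = MLP(state_dim + action_dim, 1)                      # Q(s, a | xi)
actor_t  = MLP(state_dim, action_dim, out_act=torch.sigmoid)   # pi(s | zeta')
critic_t = MLP(state_dim + action_dim, 1)                      # Q(s, a | xi')
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One mini-batch update; s, a, r, s_next are float tensors, r of shape (batch, 1)."""
    # Critic update: minimise the squared temporal difference error (2).
    with torch.no_grad():
        q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
        q_target = r + gamma * q_next
    td_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    critic_opt.zero_grad()
    td_loss.backward()
    critic_opt.step()

    # Actor update: follow the critic's gradient with respect to the action, as in (5).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks slowly track the online networks, (3)-(4).
    with torch.no_grad():
        for p, p_t in zip(actor.parameters(), actor_t.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
        for p, p_t in zip(critic.parameters(), critic_t.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```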
B. Experience Replay
The use of an off-policy algorithm is very relevant
for robotics as it allows for experience replay [5] to be
used. When using experience replay, the experience tuples
⟨s, a, s′, r⟩ from the interaction with the system are stored
in a database. During the training of the neural networks,
the experiences are sampled from this database, allowing
them to be used multiple times. The addition of experience
replay aids the learning in several ways. The first benefit
is the increased sample efficiency by allowing samples to be
reused. Additionally, in the context of neural networks, expe-
rience replay allows for mini-batch updates which improves
the computational efficiency, especially when the training is
performed on a GPU.
On top of the efficiency gains that experience replay
brings, it also improves the stability of RL algorithms that
make use of neural network function approximators such as
DQN [7] and DDPG [6]. One way in which the database
helps stabilize the learning process is that it is used to break
the temporal correlations of the neural network learning
updates. Without an experience database, the updates of
(2), (5) would be based on subsequent experience samples
from the system. These samples are highly correlated since
the state of the system does not change much between
consecutive time-steps. For real-time control, this effect is
even more pronounced with high sampling frequencies. The
problem this poses to the learning process is that most mini-
batch optimization algorithms are based on the assumption
of independent and identically distributed (i.i.d.) data [6].
Learning from subsequent samples would violate this i.i.d.
assumption and cause the updates to the network parameters
to have a high variance, leading to slower and potentially
less stable learning [10]. By saving the experiences over a period of time and updating the neural networks with mini-batches of experiences that are sampled uniformly at random from the database, this problem is alleviated.
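For illustration, a minimal sketch of such a standard FIFO experience replay buffer with uniform mini-batch sampling is given below; the class and method names are hypothetical.

```python
import random
from collections import deque

class FIFOReplayBuffer:
    """Standard FIFO experience replay: the oldest tuples are overwritten first."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # deque drops the oldest item when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random mini-batch; sampling across many past time steps
        # breaks the temporal correlation between consecutive experiences.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```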
1) Effects of the Experience Sample Distributions: In this
paper we use a deterministic policy in combination with
an actor-critic algorithm that uses Q-learning updates. This
means that in theory, no importance sampling is needed to
compensate for the fact that we are sampling our experiences
off-policy [11].
However, we are using deep neural networks as global
function approximators for the actor and the critic. After
every episode E the critic network is updated to minimize
an estimate of the loss function (2), the empirical loss:
$$E(\xi) = \frac{1}{|D|} \sum_{i \in D} \Big( r_i + \gamma\, Q\big(s'_i, \pi(s'_i\,|\,\zeta')\,\big|\,\xi'\big) - Q(s_i, a_i\,|\,\xi) \Big)^2 \qquad (6)$$
Here i indexes the samples in the experience replay database D after episode E. The distribution of the samples over the
state-action space clearly determines the contribution of the
approximation accuracy in these regions to the empirical loss.
An approximation of Q that is very precise in a part of the
state-action space that has many samples but imprecise in a
region with few experience samples might result in a low
empirical loss. Meanwhile, an approximation that is more or
less correct everywhere might result in a higher empirical
loss. From (5) it can be seen that when the critic is not
accurate for a certain region of the state action space, the
updates to the actor will also likely be wrong.
Additionally, even if a neural network has previously
learned to do a task well, it can forget this knowledge
completely when learning a new task, even when the new
task is related to the old one [12]. In the case of the
DDPG algorithm, even if the critic can accurately predict
the expected future sum of rewards for parts of the state-
action space, this ability can disappear when it is no longer
trained on data from this part of the state-action space, as the
same parameters apply to the other parts of the state-action
space as well and might be changed to reduce the temporal
difference error there.
We earlier observed [8] that when the experiences are
sampled by exclusively following a deterministic policy
without exploration, even a good one, the DDPG method
fails. Since sufficient exploration prevents this problem, this
seems to imply that having a value function and policy
that generalize to the whole state-action space to at least
some extent is important. Therefore, we would like to have
at least some sample density over the whole state-action
space. We are however mostly interested in those areas of
the state-action space that would actually be encountered
when performing the task. We therefore want most of our
experiences to be in this region. For optimal performance
we therefore need to find a trade-off between both criteria.
Furthermore, these properties of the database distribution
should ideally hold after all episodes E.
In Section IV experiments are shown that investigate the
influence of the experience sample distribution over the state-
action space. These experiments indeed show that the ideal
distribution is likely to be somewhere between the distribu-
tion that results from simply following the most recent policy
with some exploration and a uniform distribution over the
state-action space.
2) Distribution Based Experience Retention: The most
common experience replay method, which is used in the
DQN [7] and DDPG [6] papers, is to use a database D that
is overwritten in a First In First Out (FIFO) fashion. The
experiences are then sampled from this database uniformly at random. This will, however, in general not yield a desirable
distribution of the experiences in the sampled batches over
the state-action space. In fact, as will be shown in Section IV,
when at some point during the training the amount of explo-
ration is reduced too far, the performance of the controller
policy will decrease.
Maintaining high levels of exploration might place infeasible demands on physical systems such as robots. On
these systems, continued extensive exploration might cause
increased wear or damage or be simply impossible because
the robot is required to perform a task adequately while
learning is under way.
In this paper, a method is proposed to maintain a desirable
distribution over the state-action space of the experiences in
the batches used to update the neural networks.

Fig. 2: Experimental setup schematic (ball position y and coil currents u₁ ... u₄).
The proposed method is based on having two experience databases of limited size. The first database D^π is overwritten in the standard FIFO manner. The distribution of the experience samples in this database will therefore correspond approximately to the current policy.
For the second database D^U, the experiences in the database are overwritten by the new experiences in such a way that we approximate a uniform distribution over the state-action space. To do this, after this experience database has been filled to capacity, each new experience will overwrite the experience i already in this database that is most likely under the distribution induced by the experiences j already contained in D^U. The following distance metric, employing kernel density estimation [13] on the experiences in the database D^U, is used to determine which experience will be overwritten:
$$i_{\mathrm{overwrite}} = \operatorname*{argmax}_{i \in D}\; \frac{1}{|D|} \sum_{j \in D} e^{-\sum_{d=1}^{D_N} (i_d - j_d)^2 / C_d} \qquad (7)$$
where d are the dimensions in the state-action space, D_N is the total dimensionality of the state-action space and C_d is a dimension-dependent scaling constant. Here, C_d is chosen as |d|/C with |d| the size of the considered part of that state-action dimension. C is a constant that is dependent on the size of the database and the properties of the distribution. It is chosen manually based on the approximation quality of the sample distribution.
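The overwrite rule (7) can be implemented directly with NumPy. The sketch below is illustrative rather than the authors' implementation; the database layout and the helper names are assumptions.

```python
import numpy as np

def index_to_overwrite(db, c):
    """Index of the experience that is most likely under the kernel density
    estimate induced by the experiences already in D^U, as in (7).

    db : (N, D_N) array, each row a state-action point in the database
    c  : (D_N,)  array of per-dimension scaling constants C_d
    """
    diff = db[:, None, :] - db[None, :, :]        # pairwise differences (i - j)
    dist = np.sum(diff ** 2 / c, axis=-1)         # sum_d (i_d - j_d)^2 / C_d
    density = np.exp(-dist).mean(axis=1)          # average kernel value over j
    return int(np.argmax(density))                # densest, most redundant experience

def insert_uniform(db, new_point, capacity, c):
    """Add new_point to D^U; once at capacity, overwrite the densest point."""
    new_point = np.asarray(new_point, dtype=float)
    if len(db) < capacity:
        db.append(new_point)
    else:
        db[index_to_overwrite(np.asarray(db), c)] = new_point
    return db
```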
When training the neural networks, experiences are drawn uniformly at random from D^U with probability β and uniformly at random from D^π with probability (1 − β). The constant β represents a trade-off between generalization performance and task performance. Additionally, the value of β could be increased when the amount of exploration is reduced, to prevent loss of performance.
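A possible way to compose a training mini-batch from the two databases is sketched below; the function name and the database representation (lists of experience tuples) are hypothetical.

```python
import random

def sample_batch(db_pi, db_uniform, batch_size, beta):
    """Draw each experience from D^U with probability beta, else from D^pi."""
    batch = []
    for _ in range(batch_size):
        source = db_uniform if random.random() < beta else db_pi
        batch.append(random.choice(source))
    return batch
```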
III. EXPERIMENTS
To test the proposed method, control policies are learned
for a simulated magnetic manipulation task.
A. MAGMAN
Magnetic manipulation is contactless, which opens up
new possibilities for actuation on a micro scale and in
environments where it is not possible to use traditional
actuators. Examples of this type of application are medical micro- and nanorobots [14].
Our magnetic manipulation setup (Figures 1 and 2) has
four electromagnets in a line. The current through the
electromagnet coils is controlled to dynamically shape the
magnetic field above the electromagnets and so to position a
steel ball accurately and quickly to a desired set point. The
ball position is measured by a laser sensor.
The horizontal acceleration of the ball is given by:
$$\ddot{y} = -\frac{b}{m}\,\dot{y} + \frac{1}{m} \sum_{i=1}^{4} g(y, i)\, u_i \qquad (8)$$
with
$$g(y, i) = \frac{-c_1\,(y - 0.025\,i)}{\big((y - 0.025\,i)^2 + c_2\big)^3}. \qquad (9)$$
Here, y denotes the position of the ball, ẏ its velocity and ÿ the acceleration. With u_i the current through coil i = 1, 2, 3, 4, g(y, i) is the nonlinear magnetic force equation, m [kg] the ball mass, and b [Ns/m] the viscous friction of the ball on the rail. The model parameters are listed in Table I.
TABLE I: Magnetic manipulation system parameters

Model parameter       Symbol   Value           Unit
Ball mass             m        3.200 · 10⁻²    kg
Viscous damping       b        1.613 · 10⁻²    Ns/m
Empirical parameter   c₁       5.520 · 10⁻¹⁰   Nm⁵A⁻¹
Empirical parameter   c₂       1.750 · 10⁻⁴    m²
Sampling period       Tₛ       0.02            s
The reinforcement learning state s is given by the position and velocity of the ball. The action a is defined as the vector of currents u₁ . . . uₙ ∈ [0, 0.6] to the coils. The reward function is defined as:
$$r(s) = -\big(100\,|y - y_r| + 5\,|\dot{y}|\big) \qquad (10)$$
where the reference position y_r is set to y_r = 0.035 m.
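For illustration, the dynamics (8)-(9), the parameters of Table I and the reward (10) can be simulated with a simple forward Euler scheme at the sampling period Tₛ. The sketch below (with four coils, as in (8)) is an assumption about the discretization, not the simulator used for the reported results.

```python
import numpy as np

M, B = 3.200e-2, 1.613e-2          # ball mass [kg], viscous damping [Ns/m] (Table I)
C1, C2 = 5.520e-10, 1.750e-4       # empirical magnetic force parameters (Table I)
TS, Y_REF = 0.02, 0.035            # sampling period [s], reference position [m]

def g(y, i):
    """Nonlinear force term of coil i (centred at 0.025*i metres), eq. (9)."""
    d = y - 0.025 * i
    return -C1 * d / (d ** 2 + C2) ** 3

def step(y, y_dot, u):
    """One forward Euler step of the ball dynamics (8); u holds the 4 coil currents."""
    force = sum(g(y, i) * u[i - 1] for i in range(1, 5))
    y_ddot = (-B * y_dot + force) / M
    return y + TS * y_dot, y_dot + TS * y_ddot

def reward(y, y_dot):
    """Reward (10): penalise the distance to the reference and the ball's velocity."""
    return -(100.0 * abs(y - Y_REF) + 5.0 * abs(y_dot))
```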
For the theoretical experiments, simulations have been
performed with 3 coils. In these experiments the ball always
starts with the position and velocity equal to zero. We
measure the performance of the controller in two ways:
• The task performance: the average reward for an episode when using the same initial conditions as were used during training.
• The generalization performance: the average reward for an episode when starting from several different initial positions and velocities.
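Both measures can be estimated by rolling out the learned policy from different sets of initial conditions. The sketch below is illustrative and reuses the hypothetical step and reward helpers from the dynamics sketch above.

```python
import numpy as np

def average_episode_reward(policy, initial_conditions, episode_length=200):
    """Mean per-step reward per episode, averaged over a set of initial states.

    Evaluating from the training initial condition (position and velocity zero)
    gives the task performance; evaluating from a grid of other initial
    positions and velocities gives the generalization performance.
    """
    returns = []
    for y, y_dot in initial_conditions:
        total = 0.0
        for _ in range(episode_length):
            u = policy(np.array([y, y_dot]))    # coil currents chosen by the policy
            y, y_dot = step(y, y_dot, u)        # dynamics sketched above
            total += reward(y, y_dot)
        returns.append(total / episode_length)
    return float(np.mean(returns))
```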
On physical systems such as robots, continued thorough
exploration is not always desirable or even feasible. To reflect
this fact, the amount of exploration is decayed exponentially per episode in all experiments.
IV. RESULTS
To investigate the merits of the method proposed in Sec-
tion II, several experiments are conducted on the magnetic
manipulation problem described in Section III.
Fig. 3: Influence of the database distribution on the learning performance: (a) performance on the training task, (b) performance on the generalization task. Means and 90% confidence bounds shown for 30 trials.
A. Distribution Effects
In Section II-B.1 we theorized that the ideal distribution
of the experiences in the mini-batches over the state-action
space would be somewhere between a uniform distribution
and the distribution resulting from the policy. We now test
this hypothesis experimentally.
Trials are conducted with an experience replay database
that is overwritten in the standard FIFO manner. However,
with a probability α the experience that results from inter-
acting with the system is replaced with a hypothetical expe-
rience before being written to the database. The hypothetical
experience is synthesized by choosing the state and action
uniformly random from the state and action spaces. The next
state and the reward are known since a simulation is used.
In general this is not the case, but here it serves to test the
desirability of the theoretical database distribution.
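A sketch of this replacement mechanism is given below, assuming a simulator function that returns the next state and reward for an arbitrary state-action pair; all names are hypothetical.

```python
import random
import numpy as np

def experience_to_store(real_experience, alpha, state_low, state_high,
                        action_low, action_high, simulate):
    """With probability alpha, store a synthetic uniformly drawn experience
    instead of the real one (only possible because a simulator is available)."""
    if random.random() >= alpha:
        return real_experience
    s = np.random.uniform(state_low, state_high)        # uniform random state
    a = np.random.uniform(action_low, action_high)      # uniform random action
    s_next, r = simulate(s, a)                           # simulator provides the outcome
    return (s, a, r, s_next)
```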
The average results of 30 repetitions of this experiment
are shown in Figure 3, for different values of α. For α = 0
we get the standard FIFO method. Here, training is based
on the experiences from the 10 most recent episodes. The

References
- V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, 2015.
- D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.
- E. Parzen, "On estimation of a probability density function and mode," The Annals of Mathematical Statistics, 1962.
- V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning (ICML), 2016.
- J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, 2013.