
MENDEL Soft Computing Journal, Volume 27, No. 1, June 2021, Brno, Czech Republic
https://doi.org/10.13164/mendel.2021.1.001
ISSN: 1803-3814 (Printed), 2571-3701 (Online)
Comparison of Multiple Reinforcement Learning and Deep Reinforcement Learning Methods for the Task Aimed at Achieving the Goal

Roman Parak, Radomil Matousek
Institute of Automation and Computer Science, Brno University of Technology, Czech Republic
Roman.Parak@vutbr.cz, RMatousek@vutbr.cz

Abstract
Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) methods are a promising approach to solving complex tasks in the real world with physical robots. In this paper, we compare several reinforcement learning (Q-Learning, SARSA) and deep reinforcement learning (Deep Q-Network, Deep SARSA) methods for a task aimed at achieving a goal using the UR3 robotic arm. The main optimization problem of this experiment is to find the best solution for each RL/DRL scenario, i.e., to minimize the Euclidean distance accuracy error and to smooth the resulting path with the Bézier spline method. The simulation and the real-world application are controlled by the Robot Operating System (ROS). The learning environment is implemented using the OpenAI Gym library, which uses the RVIZ simulation tool and the Gazebo 3D modeling tool for dynamics and kinematics.

Keywords: Reinforcement Learning, Deep neural network, Motion planning, Bézier spline, Robotics, UR3.

Received: 1 March 2021
Accepted: 9 May 2021
Published: 21 June 2021
1 Introduction
Robotics as a field of science has been evolving for the past several years, and modern robots operating in the real world should learn new tasks autonomously and flexibly and adapt smoothly to changes. These requirements create new challenges in the field of robot control systems. For this purpose, reinforcement learning (RL) methods such as Q-Learning, SARSA (State–action–reward–state–action), etc. are commonly used [27]. A limitation of these learning methods is the need for a large amount of memory.
In recent years, there has been an increase in deep neural network (DNN) methods in several areas of science, technology, medicine, and more, along with significant advances in Deep Reinforcement Learning (DRL) techniques [11]. DRL overcomes the limitations of simple RL methods by combining parallel computation with embedded deep neural networks.
Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) methods are a promising approach to solving complex tasks in the real world with physical robots. RL/DRL methods are also used in real-world applications, such as in the gaming industry for the game of Go [24], as well as in robotic applications for manipulation [21], goal achievement [5, 22], Human-Robot Collaboration [9], and more [17].
Planning the trajectory of a robotic arm is one of the most basic and challenging research topics in robotics and has attracted considerable interest from research institutes in recent decades. Traditional task and motion planning methods, such as RRT [18] and RRT* [31], can solve complex tasks but require full state observability, need a lot of time for problem solving, and are not adapted to dynamic scene changes. Advanced RL/DRL techniques can solve motion planning tasks for multiple-axis industrial robots [23].
Figure 1: Experimental task aimed at achieving the goal using the UR3 robot. The purple box ($A_{target}$) is approximately the restricted area from which the targets are sampled, and the yellow box ($A_{search}$) represents the area of safe movement. The distance between the start ($P_i$) and target ($P_t$) points is described by the Euclidean distance.
In this paper, we propose several RL/DRL methods for the task aimed at achieving the goal using the collaborative robotic arm UR3 from Universal Robots, more precisely a 6-axis robotic arm [28]. The basic scene of our experiment and the real robot in the initial position are shown in Fig. 1.
We summarize the related work in the areas of reinforcement learning, deep reinforcement learning, motion planning, etc. (Section 2 Related Work), and we also summarize the methods needed to create our work (Section 3 Methods).
In the main part of the work, we focus on solving the problem of achieving the goal using advanced motion planning methods (Section 4 Experiments and Results). Our approach compares different learning techniques (Reinforcement Learning / Deep Reinforcement Learning) to find the trajectory from the initial position to the target position, and smooths the resulting trajectory using a Bézier spline (B-spline) curve.
In the final part of the paper, we focus on the challenges we have encountered, the current limitations, and future extensions of our work (Section 5 Conclusion and Future Work).
2 Related Work
Our approach to finding a point in Cartesian space using multiple RL/DRL techniques is based on previous work in the areas of reinforcement learning, deep reinforcement learning, and motion planning. In the following section, we briefly discuss previous work on each of the relevant topics.
In research in the field of robotic motion planning, the concept of machine learning emerges. In particular, Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) techniques are an area of growing interest for the robotics research community.
Researchers at Erle Robotics have created a framework extending OpenAI Gym [6] for testing various RL/DRL algorithms on robots [32]. Various robot simulation tools are used as extensions of the OpenAI toolkit, with Gazebo [1, 15] and PyBullet [8] being the most commonly used today. The connection between the robotic tool with a robust physics core and the Gym toolkit is created using ROS (Robot Operating System) [25].
RL/DRL methods are used in several real-world robotic experiments. One approach focuses on unscrewing operations in the robotic disassembly of electronic waste using the Q-Learning method [16]; other approaches have used a robotic arm to achieve a goal using the Deep Reinforcement Learning methods DQN (Deep Q-Network) [5] and TRPO (Trust Region Policy Optimization) [22] and tested the result of the experiment in a real application. Some approaches use 2D/3D cameras and other sensors to observe the robotic environment [5, 10, 12], others use only dynamic simulation with a specified environment [13, 16], or use real-time robot learning techniques [19].
Motion planning is one of the most fundamental research topics in robotics. Some approaches use traditional planning methods, such as RRT [18] and RRT* [31], where structured tree methods are used to find a curve from point A to point B. Other approaches use modern techniques, such as RL/DRL, but both families of methods use Bézier curves to characterize complex trajectories and smooth motion planning [23].
3 Methods
This section provides a brief introduction to the theory of Reinforcement Learning and Deep Reinforcement Learning, as well as to path smoothing using the Bézier spline curve. In the first subsections, we present two RL methods (Q-Learning, SARSA) and two DRL methods (Deep Q-Network, Deep SARSA); the last subsection is focused on trajectory smoothing.
3.1 Markov Decision Process
A Markov Decision Process (MDP) is a classical formulation of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations or states, and through those future rewards [27]. MDPs involve delayed reward and the need to trade off immediate and delayed reward.
An MDP is described by four basic elements, $(s_t, a_t, P(s_{t+1}|s_t, a_t), R(s_{t+1}|s_t, a_t))$, where $s_t$ and $s_{t+1}$ represent the current and next state, $a_t$ represents the action, $P(s_{t+1}|s_t, a_t)$ is the probability of transitioning to state $s_{t+1}$ when taking action $a_t$ in state $s_t$, and $R(s_{t+1}|s_t, a_t)$ is the immediate reward received from the environment after the transition from $s_t$ to $s_{t+1}$. The agent and the environment interact in a sequence of discrete time steps $t = 0, 1, \ldots$ The transition probability in the MDP depends only on the current state $s_t$ and the chosen action $a_t$ [11, 27, 33].
Figure 2: Basic structure of the agent-environment interaction in a Markov decision process (MDP) [11, 27].
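As an illustration of the agent-environment loop in Fig. 2, the following minimal Python sketch runs one episode through the classic OpenAI Gym interface (env.reset() / env.step()), which is also the interface used later in this paper. The environment name and the random policy are placeholders for the example only, not the implementation used in this work.

```python
import gym

# Minimal sketch of the MDP agent-environment loop of Fig. 2,
# using the classic Gym API. "CartPole-v1" is only a stand-in
# environment chosen for illustration.
env = gym.make("CartPole-v1")

state = env.reset()           # initial state s_0
done = False
episode_return = 0.0

while not done:
    # The agent observes s_t and selects an action a_t
    # (here a random policy, purely for illustration).
    action = env.action_space.sample()

    # The environment returns s_{t+1} and the immediate reward
    # R(s_{t+1} | s_t, a_t), as in the MDP definition above.
    next_state, reward, done, info = env.step(action)

    episode_return += reward
    state = next_state

print("Episode return:", episode_return)
```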
3.2 Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning that deals with sequential decision making. The main task is to learn how agents ought to take sequences of actions in an environment in order to maximize cumulative rewards. Markov decision processes (MDPs) are an ideal mathematical formulation for RL problems, for which a direct learning methodology to achieve the goal is proposed. The agent takes into account not only the current reward, but also the cumulative reward in the subsequent learning states [11, 27, 33].
The agent and the MDP together form a sequence that contains a number of state-action pairs, represented as $\tau = ((s_0, a_0), (s_1, a_1), \ldots)$. The return is defined as the discounted return for the sequence $\tau$ at time step $t$:
$$R_t(\tau) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}, \qquad (1)$$
where $\gamma$ is a discount factor ($0 \le \gamma \le 1$) and $r_{t'}$ is the reward at time step $t'$.
The main optimization problem of the RL method is to find the optimal policy $\pi^*$, which is defined as the policy maximizing the expected return:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi} [R_t(\tau)]. \qquad (2)$$
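As a small numeric illustration of Eq. (1), the discounted return of a reward sequence can be computed directly (the reward values below are made up):

```python
def discounted_return(rewards, gamma=0.75, t=0):
    """Discounted return R_t(tau) of Eq. (1) for rewards r_t, r_{t+1}, ...
    collected along a trajectory tau (gamma = 0.75 as in Tab. 1)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Example: R_0 = 0.1 + 0.75 * 0.2 + 0.75^2 * 1.0 = 0.8125
print(discounted_return([0.1, 0.2, 1.0]))
```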
Q-Learning (QL):
is one of the most popular Reinforcement Learning methods. The QL method was developed as an off-policy TD (temporal difference) control algorithm, defined by
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right], \qquad (3)$$
where $\alpha$ is the learning rate ($0 < \alpha \le 1$), $\max_a Q(s_{t+1}, a)$ is the estimate of the optimal future value, and the other parameters are described in the previous section [11, 27].
State–action–reward–state–action (SARSA):
is an on-policy TD control method, very similar to the previous Q-Learning method. The SARSA method is characterized by using the next action actually taken [27]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]. \qquad (4)$$
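A minimal tabular sketch of the updates in Eqs. (3) and (4) is given below; the Q-table shape and the epsilon-greedy behaviour policy are illustrative assumptions (the paper does not publish its implementation), with the learning rate and discount factor taken from Tab. 1.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.75):
    """Off-policy TD update of Eq. (3): bootstrap with max_a Q(s', a)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.3, gamma=0.75):
    """On-policy TD update of Eq. (4): bootstrap with the action actually taken."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Behaviour policy typically used with both methods during training."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

# Example: a Q-table for a discretized environment with 1000 states
# and the 6 discrete actions used in the UR3 experiment (Section 4.3).
Q = np.zeros((1000, 6))
```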
3.3 Deep Reinforcement Learning
Deep reinforcement learning (DRL) combines artificial neural networks (ANN) with a reinforcement learning architecture that allows agents to learn the best possible actions in a given environment to achieve their goals. The main idea of this approach is to approximate the value function and optimize the objective by mapping state-action pairs to expected rewards [11, 33]. This area of research can solve a wide range of complex decision-making tasks that were previously out of reach for a machine.
Training a DRL agent with auxiliary tasks within a jointly learned representation can significantly increase the sample efficiency of learning. This is based on simultaneously maximizing a number of pseudo-reward functions, such as immediately predicting the reward ($\gamma = 0$), predicting changes in the next observation, or predicting the activation of some hidden unit of the agent's neural network [11].
Deep Q-Network (DQN):
is a combination of Q-learning with a deep convolutional ANN, i.e., a multi-layer, deep ANN specialized in processing spatial data fields. For a given input state $s$, the output of the neural network is a vector of action values $Q(s, a; \theta)$, where $\theta$ are the network parameters [11, 29]. The two most important components of the DQN algorithm are the use of a target network and the use of experience replay. The update used by DQN is then
$$Q(s_t, a_t; \theta_t) \leftarrow Q(s_t, a_t; \theta_t) + \alpha \left[ \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta_t^-) - Q(s_t, a_t; \theta_t) \right)^2 \right], \qquad (5)$$
where the target network with parameters $\theta_t^-$ is the same as the online network, except that its parameters are copied from the online network every $\tau$ steps, so that $\theta_t^- = \theta_t$, and are kept fixed on all other steps [29].
Deep SARSA (DSARSA):
is similar to the previous method. In the learning structure, the value function is approximated with a convolutional neural network (CNN) that, like DQN, uses a Q-network to obtain the Q-values. The function is represented by a CNN with weights $\theta$ and an output that represents the Q-value of each action [20]:
$$Q(s_t, a_t; \theta_t) \leftarrow Q(s_t, a_t; \theta_t) + \alpha \left[ \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta_t^-) - Q(s_t, a_t; \theta_t) \right)^2 \right]. \qquad (6)$$
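To make Eqs. (5) and (6) concrete, the following sketch computes the DQN and Deep SARSA bootstrap targets with a target network and performs one gradient step on a replay minibatch. The paper does not specify its deep learning framework or network architecture; PyTorch and the small fully connected network below are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative Q-network mapping a state to Q-values for the
    6 discrete actions (the paper does not detail its architecture)."""
    def __init__(self, state_dim=3, n_actions=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_targets(target_net, rewards, next_states, gamma=0.95):
    # Eq. (5): bootstrap with max_a Q(s_{t+1}, a; theta^-) from the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q

def deep_sarsa_targets(target_net, rewards, next_states, next_actions, gamma=0.95):
    # Eq. (6): bootstrap with Q(s_{t+1}, a_{t+1}; theta^-) for the action actually taken.
    with torch.no_grad():
        next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    return rewards + gamma * next_q

def dqn_train_step(online_net, target_net, optimizer, batch, gamma=0.95):
    """One gradient step on a replay-buffer minibatch (DQN case)."""
    states, actions, rewards, next_states = batch
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_pred, dqn_targets(target_net, rewards, next_states, gamma))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```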
3.4 Bézier-Spline Curves
Bézier curves use Bernstein polynomials, which were described in 1912 by the Russian mathematician Sergei Bernstein [30].
B-spline curves, like Bézier curves, use polynomials to generate a curve segment. The main difference between simple Bézier curves and a B-spline is that a B-spline uses a series of control points to determine the local geometry of the curve. This property ensures that only a small portion of the curve is changed when a control point is moved [30]. A Bézier spline curve of degree $p$ is defined by $n + 1$ control points $P_0, P_1, \ldots, P_n$ [2]:
$$B(t) = \sum_{i=0}^{n} N_{i,p}(t) P_i, \qquad (7)$$
where $N_{i,p}(t)$ is the normalized B-spline basis function of degree $p$ defined over the knots.
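In practice, fitting and resampling such a spline can be done with standard libraries. The sketch below uses scipy.interpolate, which is an assumption for illustration (the paper does not name the library used for smoothing).

```python
import numpy as np
from scipy import interpolate

def smooth_path_bspline(waypoints, degree=3, smoothing=0.0, n_samples=200):
    """Fit a B-spline of the given degree through a raw 3D TCP path and
    resample it densely; smoothing > 0 trades exactness for smoothness.
    `waypoints` is an (N, 3) array of x, y, z points (illustrative interface)."""
    waypoints = np.asarray(waypoints)
    # splprep expects a list of coordinate arrays: [x, y, z]
    tck, _ = interpolate.splprep(
        [waypoints[:, 0], waypoints[:, 1], waypoints[:, 2]],
        k=degree, s=smoothing)
    u_fine = np.linspace(0.0, 1.0, n_samples)
    x, y, z = interpolate.splev(u_fine, tck)
    return np.stack([x, y, z], axis=1)

# Example: smooth a short, jagged path made of discrete 5 mm steps.
raw_path = [[0.000, 0.000, 0.000], [0.005, 0.000, 0.000], [0.005, 0.005, 0.000],
            [0.010, 0.005, 0.005], [0.010, 0.010, 0.005], [0.015, 0.010, 0.010]]
smooth = smooth_path_bspline(raw_path, smoothing=1e-6)
```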
4 Experiments and Results
In this section, we present an application that uses several RL/DRL techniques for the problem of reaching a point in Cartesian space with a collaborative manipulator with 6 DOF (degrees of freedom). The experiment is demonstrated in simulation and in the real world.
4.1 Setting up the learning environment
For our experiment, we use the collaborative robotic arm UR3 from Universal Robots (Fig. 1), more precisely a 6-axis robotic arm with a working radius of 500 mm (19.7 in), a payload of 3 kg (6.6 lb), and a repeatability of ±0.1 mm [28].
The simulation is controlled through communication between the RVIZ simulation tool [7, 26] and the Gazebo 3D modeling tool [1, 15]. In our experiment, we use the UR3 URDF (Unified Robot Description Format) model without a gripper and the official ROS driver for Universal Robots [3], which is used to control both the real and the simulated robot.
The main structure of our experiment is shown in Fig. 1. Fig. 3 shows a simplified structure of the UR3 model in the RVIZ simulation tool. The blue sphere represents the initial position of the robot, and the target to be reached by the robot is represented by a red sphere. The purple box ($A_{target}$) is approximately the restricted area from which the targets are sampled, and the yellow box ($A_{search}$) represents the area of safe movement. The safe area is defined with respect to the individual robot model to avoid collisions with the target area (imagine that the target area is a bin in an object selection problem).
Figure 3: Learning environment of the UR3 model in
the RVIZ simulation tool.
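A schematic Gym environment for this reach task is sketched below. The class name, observation layout, and the simplified TCP kinematics are assumptions for illustration; in the actual setup, stepping the robot is handled through ROS communication with RVIZ and Gazebo.

```python
import gym
import numpy as np
from gym import spaces

class UR3ReachEnv(gym.Env):
    """Schematic reach-the-goal environment (illustrative only).

    Observation: current TCP position and target position (6 floats).
    Actions: 6 discrete moves (X+, X-, Y+, Y-, Z+, Z-) of 5 mm each.
    The real environment of this paper talks to RVIZ/Gazebo via ROS;
    here that is replaced by simple point kinematics of the TCP."""

    STEP = 0.005  # 5 mm TCP step
    MOVES = np.array([[+1, 0, 0], [-1, 0, 0],
                      [0, +1, 0], [0, -1, 0],
                      [0, 0, +1], [0, 0, -1]], dtype=float) * STEP

    def __init__(self, search_half_size=0.15):
        super().__init__()
        self.action_space = spaces.Discrete(6)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)
        self.search_half_size = search_half_size  # half size of A_search

    def reset(self):
        self.tcp = np.zeros(3)                               # initial position P_i
        self.target = np.random.uniform(-0.1, 0.1, size=3)   # sampled from A_target
        self.initial_dist = np.linalg.norm(self.tcp - self.target)
        return self._obs()

    def step(self, action):
        prev_dist = np.linalg.norm(self.tcp - self.target)
        self.tcp = self.tcp + self.MOVES[action]
        dist = np.linalg.norm(self.tcp - self.target)
        # Shaped reward: normalized progress towards the target
        # (cf. the reward definition in Section 4.3).
        reward = (prev_dist - dist) / self.initial_dist
        # The episode ends if the TCP leaves the safe search area A_search.
        done = bool(np.any(np.abs(self.tcp) > self.search_half_size))
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.concatenate([self.tcp, self.target]).astype(np.float32)
```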
4.2 Definition of experiment
In our problem, we choose a task aimed at achieving a goal in Cartesian space using a robotic arm. The task is to autonomously find the trajectory from the initial position to the target position using various learning techniques (Reinforcement Learning / Deep Reinforcement Learning) and to smooth the resulting trajectory using the Bézier spline method.
First, the robot position is initialized and the target position is randomly selected. After selecting a target position, we start the learning process for each learning technique. The distance between the start point $P_i$ and the target point $P_t$ is described by the Euclidean distance $d_t$ (Eq. 8):
$$d_t(P_i, P_t) = \sqrt{\sum_{i=1}^{n} (P_i - P_t)^2}, \qquad (8)$$
4.3 Learning process
We implement the learning environment within the OpenAI Gym library [6, 32], which provides an interface to train and test the learning process.
To evaluate our algorithms, we performed identical experiments with several RL/DRL techniques. In the first experiment, we use classical RL techniques (QL, SARSA), and in the next experiment, we use modern DRL techniques (DQN, DSARSA). The hyper-parameters of the individual RL/DRL techniques are given in Tab. 1 and Tab. 2.
Table 1: Hyper-parameters used for the Reinforcement Learning methods (Q-Learning, SARSA)

Hyper-parameter      | Symbol        | Value
---------------------|---------------|----------
Episodes of Training | M_min, M_max  | 500, 1000
Steps per Episode    | T             | 100
Discount Factor      | γ             | 0.75
Learning Rate        | α             | 0.3
Table 2: Hyper-parameters used for the Deep Reinforcement Learning methods (DQN, DSARSA)

Hyper-parameter      | Symbol        | Value
---------------------|---------------|----------
Episodes of Training | M_min, M_max  | 500, 1000
Steps per Episode    | T             | 100
Discount Factor      | γ             | 0.95
Learning Rate        | α             | 0.0003
Batch Size           | N             | 64
Replay Buffer Size   | B             | 2000
Optimizer            | -             | Adam [14]
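A schematic training loop with the hyper-parameters of Tab. 1 is shown below. It assumes the tabular helpers (Q, epsilon_greedy, q_learning_update) and the environment sketch from the previous sections; the discretization of the continuous TCP observation into a Q-table index is a placeholder, since the paper does not detail its state encoding.

```python
import numpy as np

# Hyper-parameters from Tab. 1 (classical RL case).
M_MAX, T_MAX, GAMMA, ALPHA = 1000, 100, 0.75, 0.3

def discretize(obs, q_table, grid=0.005):
    """Map a continuous observation onto a Q-table row index by rounding
    to the 5 mm action grid (illustrative placeholder encoding)."""
    return hash(tuple(np.round(np.asarray(obs) / grid).astype(int))) % q_table.shape[0]

# `env` and the tabular helpers come from the earlier sketches.
for episode in range(M_MAX):
    obs = env.reset()
    s = discretize(obs, Q)
    for step in range(T_MAX):
        a = epsilon_greedy(Q, s, n_actions=6)
        next_obs, r, done, _ = env.step(a)
        s_next = discretize(next_obs, Q)
        q_learning_update(Q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA)
        s = s_next
        if done:
            break
```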
The reward function of the goal-achievement experiment is, for each learning technique, defined as
$$R_t = \frac{d_t(P_p, P_t) - d_t(P_a, P_t)}{d_t(P_i, P_t)} + R_s, \qquad (9)$$
where $d_t(P_i, P_t)$ is the initial Euclidean distance between the start point $P_i$ and the target point $P_t$, and $d_t(P_p, P_t) - d_t(P_a, P_t)$ is the Euclidean distance difference in real time ($P_p$ is the previous position and $P_a$ the actual position). The parameter $R_s$ takes a value higher than 0 when the success condition is fulfilled, which is determined by the accuracy of the result (Eq. 10):
$$R_s = \begin{cases} 0.25, & \text{if } \delta_d \le 5\%, \\ 0, & \text{otherwise,} \end{cases} \qquad (10)$$
where $\delta_d$ is the Euclidean distance accuracy error, defined as
$$\delta_d = \frac{d_t(P_a, P_t)}{d_t(P_i, P_t)} \cdot 100. \qquad (11)$$

Figure 4: Agent-environment interaction in the Markov decision process (MDP) for our problem. The environment represents the communication between the RVIZ simulation tool [7, 26] and the Gazebo 3D modeling tool [1, 15] via ROS [25]. The agent represents the RL/DRL methods used in our problem.
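The reward of Eqs. (8)-(11) can be transcribed directly into a small helper function; the example positions at the end are made up for illustration.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance d_t between two Cartesian points, Eq. (8)."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def step_reward(p_prev, p_act, p_init, p_target):
    """Reward R_t of Eq. (9) with the success bonus R_s of Eq. (10),
    where delta_d is the Euclidean distance accuracy error of Eq. (11)."""
    d_init = euclidean(p_init, p_target)
    d_prev = euclidean(p_prev, p_target)
    d_act = euclidean(p_act, p_target)

    delta_d = d_act / d_init * 100.0          # Eq. (11), in percent
    r_s = 0.25 if delta_d <= 5.0 else 0.0     # Eq. (10)
    return (d_prev - d_act) / d_init + r_s    # Eq. (9)

# Example with made-up positions (in metres):
# d_init = 0.05, d_prev = 0.02, d_act = 0.01  ->  reward = 0.2
print(step_reward(p_prev=[0.02, 0.0, 0.0], p_act=[0.01, 0.0, 0.0],
                  p_init=[0.05, 0.0, 0.0], p_target=[0.0, 0.0, 0.0]))
```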
The learning process of the RL/DRL agent begins by exploring the environment, performing actions from the initial state towards the target state and collecting the corresponding rewards (Eq. 9). In our experiment, the agent can select one of six possible actions ($a_t \in \{0, \ldots, 5\}$) in each state $s_t$; the actions correspond to moves along the coordinate axes ($X^+$, $X^-$, $Y^+$, $Y^-$, $Z^+$, $Z^-$) with a fixed discrete step of 5 mm of the Tool Center Point (TCP).
If the agent moves out of $A_{search}$, the episode $M$ ends at the current step $T$; otherwise, the process continues until the maximum number of episodes is reached.
4.4 Experimental results
The training results of the proposed goal-achievement experiment using the UR3 robotic arm are shown in Fig. 5 ((a) Q-Learning, (b) SARSA, (c) Deep Q-Network, (d) Deep SARSA), including the mean cumulative reward over the number of iterations in each of the episodes.
The main goal of the optimization problem in the case of the learning process was to maximize the expected cumulative reward (Fig. 5) and to minimize the Euclidean distance accuracy error (Fig. 6).

Figure 5: Training results of the RL/DRL techniques in the goal-achievement environment with the UR3 robot: (a) Q-Learning, (b) SARSA, (c) Deep Q-Network, (d) Deep SARSA.