
MENDEL Soft Computing Journal, Volume 27, No. 1, June 2021, Brno, Czech Republic
https://doi.org/10.13164/mendel.2021.1.001
ISSN: 1803-3814 (Printed), 2571-3701 (Online)
Comparison of Multiple Reinforcement Learning and Deep Reinforcement Learning Methods for the Task Aimed at Achieving the Goal

Roman Parak, Radomil Matousek
Institute of Automation and Computer Science, Brno University of Technology, Czech Republic
Roman.Parak@vutbr.cz, RMatousek@vutbr.cz

Abstract
Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) methods are a promising approach to solving complex tasks in the real world with physical robots. In this paper, we compare several reinforcement learning (Q-Learning, SARSA) and deep reinforcement learning (Deep Q-Network, Deep SARSA) methods for a task aimed at achieving a goal using the UR3 robotic arm. The main optimization problem of this experiment is to find the best solution for each RL/DRL scenario, i.e., to minimize the Euclidean distance accuracy error and to smooth the resulting path with the Bézier spline method. The simulation and the real-world application are controlled by the Robot Operating System (ROS). The learning environment is implemented using the OpenAI Gym library, which uses the RVIZ simulation tool and the Gazebo 3D modeling tool for dynamics and kinematics.

Keywords: Reinforcement Learning, Deep neural network, Motion planning, Bézier spline, Robotics, UR3.

Received: 1 March 2021
Accepted: 9 May 2021
Published: 21 June 2021
1 Introduction
Robotics as a field of science has been evolving for the past several years, and modern robots operating in the real world should learn new tasks autonomously and flexibly and adapt smoothly to changes. These requirements create new challenges in the field of robot control systems. For this purpose, reinforcement learning (RL) methods such as Q-Learning, SARSA (State–action–reward–state–action), etc. are commonly used [27]. A limitation of these learning methods is the need for a large amount of memory.
In recent years, there has been an increase in deep neural network (DNN) methods in several areas of science, technology, medicine, and more, along with significant advances in Deep Reinforcement Learning (DRL) techniques [11]. DRL overcomes the limitations of simple RL methods by combining parallel computation with embedded deep neural networks.
Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) methods are a promising approach to solving complex tasks in the real world with physical robots. RL/DRL methods are also used in real-world applications, such as in the gaming industry for the game of Go [24], as well as in robotic applications for manipulation [21], goal achievement [5, 22], Human-Robot Collaboration [9], and more [17].
Planning the trajectory of a robotic arm is one of the most basic and challenging research topics in robotics and has attracted considerable interest from research institutes in recent decades. Traditional task and motion planning methods, such as RRT [18] and RRT* [31], can solve complex tasks but require full state observability, need a lot of time for problem solving, and are not adapted to dynamic scene changes. Advanced RL/DRL techniques can solve motion planning tasks for multiple-axis industrial robots [23].
Figure 1: Experimental task aimed at achieving the goal using the UR3 robot. The purple box ($A_{target}$) is approximately the restricted area from which the targets are sampled, and the yellow box ($A_{search}$) represents the area of safe movement. The distance between the start ($P_i$) and target ($P_t$) points is described by the Euclidean distance.
In this paper, we propose several RL/DRL methods for the task aimed at achieving the goal using the collaborative robotic arm UR3 from Universal Robots, more precisely a 6-axis robotic arm [28]. The basic scene of our experiment and the real robot in the initial position are shown in Fig. 1.
We summarize the related work in the areas of reinforcement learning, deep reinforcement learning, motion planning, etc. (Section 2 Related Work), and we also summarize the methods needed to create our work (Section 3 Methods).
In the main part of the work, we focus on solving the problem of achieving the goal using advanced motion planning methods (Section 4 Experiments and Results). Our approach compares different learning techniques (Reinforcement Learning / Deep Reinforcement Learning) to find the trajectory from the initial position to the target position, and smooths the resulting trajectory using a Bézier spline (B-spline) curve.
In the final part of the paper, we focus on the challenges we have encountered, the current limitations, and future extensions of our work (Section 5 Conclusion and Future Work).
2 Related Work
Our approach to finding a point in Cartesian space using multiple RL/DRL techniques is based on previous work in the areas of reinforcement learning, deep reinforcement learning, and motion planning. In the following section, we briefly discuss previous work on each of the relevant topics.
In research in the field of robotic motion planning, the concept of machine learning emerges. In particular, Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) techniques are an area of growing interest for the robotics research community.
Researchers at Erle Robotics have created a framework extending OpenAI Gym [6] for testing various RL/DRL algorithms on robots [32]. Various robot simulation tools are used as extensions of the OpenAI toolkit, with Gazebo [1, 15] and PyBullet [8] being the most commonly used today. The connection between the robotic tool with a robust physics core and the Gym toolkit is created using ROS (Robot Operating System) [25].
RL/DRL methods are used in several real-world robotic experiments. One approach focuses on unscrewing operations in the robotic disassembly of electronic waste using the Q-Learning method [16]; other approaches have used a robotic arm to achieve a goal using the Deep Reinforcement Learning methods DQN (Deep Q-Network) [5] and TRPO (Trust Region Policy Optimization) [22] and tested the result of the experiment in a real application. Some approaches use 2D/3D cameras and other sensors to observe the robotic environment [5, 10, 12], others use only dynamic simulation with a specified environment [13, 16], or use real-time robot learning techniques [19].
Motion planning is one of the most fundamental research topics in robotics. Some approaches use traditional planning methods, such as RRT [18] and RRT* [31], where structured tree methods are used to find a curve from point A to point B. Other approaches use modern techniques, such as RL/DRL, but both families of methods use Bézier curves to characterize complex trajectories and smooth motion planning [23].
3 Methods
This section provides a brief introduction to the theory of Reinforcement Learning and Deep Reinforcement Learning, as well as to path smoothing using the Bézier spline curve. In the first subsections, we present two RL methods (Q-Learning, SARSA) and two DRL methods (Deep Q-Network, Deep SARSA); the last subsection is focused on trajectory smoothing.
3.1 Markov Decision Process
A Markov Decision Process (MDP) is a classical formulation of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations or states, and through those future rewards [27]. MDPs involve delayed reward and the need to trade off immediate and delayed reward.
An MDP is described by four basic elements, $(s_t, a_t, P(s_{t+1}|s_t, a_t), R(s_{t+1}|s_t, a_t))$, where $s_t$ and $s_{t+1}$ represent the current and next state, $a_t$ represents the action, $P(s_{t+1}|s_t, a_t)$ is the probability of transitioning to state $s_{t+1}$ when taking action $a_t$ in state $s_t$, and $R(s_{t+1}|s_t, a_t)$ is the immediate reward received from the environment after the transition from $s_t$ to $s_{t+1}$. The agent and the environment interact in a sequence of discrete time steps $t = 0, 1, \ldots$ The transition probability in the MDP depends only on the current state $s_t$ and the chosen action $a_t$ [11, 27, 33].
Figure 2: Basic structure of the agent-environment interaction in a Markov decision process (MDP) [11, 27].
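As an illustration of the agent-environment loop in Fig. 2, the following minimal Python sketch runs one episode through the classic OpenAI Gym interface (env.reset() / env.step()), which is also the interface used later in this paper. The environment name and the random policy are placeholders for the example only, not the implementation used in this work.

```python
import gym

# Minimal sketch of the MDP agent-environment loop of Fig. 2,
# using the classic Gym API. "CartPole-v1" is only a stand-in
# environment chosen for illustration.
env = gym.make("CartPole-v1")

state = env.reset()           # initial state s_0
done = False
episode_return = 0.0

while not done:
    # The agent observes s_t and selects an action a_t
    # (here a random policy, purely for illustration).
    action = env.action_space.sample()

    # The environment returns s_{t+1} and the immediate reward
    # R(s_{t+1} | s_t, a_t), as in the MDP definition above.
    next_state, reward, done, info = env.step(action)

    episode_return += reward
    state = next_state

print("Episode return:", episode_return)
```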
3.2 Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning that deals with sequential decision making. The main task is to learn how agents ought to take sequences of actions in an environment in order to maximize cumulative rewards. Markov decision processes (MDPs) are an ideal mathematical formulation for RL problems, for which a direct learning methodology to achieve the goal is proposed. The agent takes into account not only the current reward, but also the cumulative reward in the subsequent learning states [11, 27, 33].
The agent and the MDP together form a sequence that contains a number of state-action pairs, represented as $\tau = ((s_0, a_0), (s_1, a_1), \ldots)$. The return is defined as the discounted return for the sequence $\tau$ at time step $t$:
$$R_t(\tau) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}, \qquad (1)$$
where $\gamma$ is a discount factor ($0 \le \gamma \le 1$) and $r_{t'}$ is the reward at time step $t'$.
The main optimization problem of the RL method is to find the optimal policy $\pi^*$, which is defined as the policy maximizing the expected return:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi} [R_t(\tau)]. \qquad (2)$$
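As a small numeric illustration of Eq. (1), the discounted return of a reward sequence can be computed directly (the reward values below are made up):

```python
def discounted_return(rewards, gamma=0.75, t=0):
    """Discounted return R_t(tau) of Eq. (1) for rewards r_t, r_{t+1}, ...
    collected along a trajectory tau (gamma = 0.75 as in Tab. 1)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Example: R_0 = 0.1 + 0.75 * 0.2 + 0.75^2 * 1.0 = 0.8125
print(discounted_return([0.1, 0.2, 1.0]))
```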
Q-Learning (QL):
is one of the most popular Reinforcement Learning methods. The QL method was developed as an off-policy TD (temporal difference) control algorithm, defined by
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right], \qquad (3)$$
where $\alpha$ is the learning rate ($0 < \alpha \le 1$), $\max_a Q(s_{t+1}, a)$ is the estimate of the optimal future value, and the other parameters are described in the previous section [11, 27].
State–action–reward–state–action (SARSA):
is an on-policy TD control method, very similar to the previous Q-Learning method. The SARSA method is characterized by using the next action actually taken [27]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]. \qquad (4)$$
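A minimal tabular sketch of the updates in Eqs. (3) and (4) is given below; the Q-table shape and the epsilon-greedy behaviour policy are illustrative assumptions (the paper does not publish its implementation), with the learning rate and discount factor taken from Tab. 1.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.75):
    """Off-policy TD update of Eq. (3): bootstrap with max_a Q(s', a)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.3, gamma=0.75):
    """On-policy TD update of Eq. (4): bootstrap with the action actually taken."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Behaviour policy typically used with both methods during training."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

# Example: a Q-table for a discretized environment with 1000 states
# and the 6 discrete actions used in the UR3 experiment (Section 4.3).
Q = np.zeros((1000, 6))
```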
3.3 Deep Reinforcement Learning
Deep reinforcement learning (DRL) combines artificial neural networks (ANN) with a reinforcement learning architecture that allows agents to learn the best possible actions in a given environment to achieve their goals. The main idea of this approach is to approximate the value function and optimize the objective by mapping state-action pairs to expected rewards [11, 33]. This area of research can solve a wide range of complex decision-making tasks that were previously out of reach for a machine.
Training a DRL agent with auxiliary tasks within a jointly learned representation can significantly increase the sample efficiency of learning. This is based on simultaneously maximizing a number of pseudo-reward functions, such as immediately predicting the reward ($\gamma = 0$), predicting changes in the next observation, or predicting the activation of some hidden unit of the agent's neural network [11].
Deep Q-Network (DQN):
is a combination of Q-learning with a deep convolutional ANN, i.e., a multi-layer, deep ANN specialized in processing spatial data fields. For a given input state $s$, the output of the neural network is a vector of action values $Q(s, a; \theta)$, where $\theta$ are the network parameters [11, 29]. The two most important components of the DQN algorithm are the use of a target network and the use of experience replay. The update used by DQN is then
$$Q(s_t, a_t; \theta_t) \leftarrow Q(s_t, a_t; \theta_t) + \alpha \left[ \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta_t^-) - Q(s_t, a_t; \theta_t) \right)^2 \right], \qquad (5)$$
where the target network with parameters $\theta_t^-$ is the same as the online network, except that its parameters are copied from the online network every $\tau$ steps, so that $\theta_t^- = \theta_t$, and are kept fixed on all other steps [29].
Deep SARSA (DSARSA):
is similar to the previous method. In the learning structure, the value function is approximated with a convolutional neural network (CNN) that, like DQN, uses a Q-network to obtain the Q-values. The function is represented by a CNN with weights $\theta$ and an output that represents the Q-value of each action [20]:
$$Q(s_t, a_t; \theta_t) \leftarrow Q(s_t, a_t; \theta_t) + \alpha \left[ \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta_t^-) - Q(s_t, a_t; \theta_t) \right)^2 \right]. \qquad (6)$$
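To make Eqs. (5) and (6) concrete, the following sketch computes the DQN and Deep SARSA bootstrap targets with a target network and performs one gradient step on a replay minibatch. The paper does not specify its deep learning framework or network architecture; PyTorch and the small fully connected network below are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative Q-network mapping a state to Q-values for the
    6 discrete actions (the paper does not detail its architecture)."""
    def __init__(self, state_dim=3, n_actions=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_targets(target_net, rewards, next_states, gamma=0.95):
    # Eq. (5): bootstrap with max_a Q(s_{t+1}, a; theta^-) from the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q

def deep_sarsa_targets(target_net, rewards, next_states, next_actions, gamma=0.95):
    # Eq. (6): bootstrap with Q(s_{t+1}, a_{t+1}; theta^-) for the action actually taken.
    with torch.no_grad():
        next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    return rewards + gamma * next_q

def dqn_train_step(online_net, target_net, optimizer, batch, gamma=0.95):
    """One gradient step on a replay-buffer minibatch (DQN case)."""
    states, actions, rewards, next_states = batch
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_pred, dqn_targets(target_net, rewards, next_states, gamma))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```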
3.4 Bézier-Spline Curves
Bézier curves use Bernstein polynomials, which were described in 1912 by the Russian mathematician Sergei Bernstein [30].
B-spline curves, like Bézier curves, use polynomials to generate a curve segment. The main difference between simple Bézier curves and a B-spline is that a B-spline uses a series of control points to determine the local geometry of the curve. This property ensures that only a small portion of the curve is changed when a control point is moved [30]. A Bézier spline curve of degree $p$ is defined by $n + 1$ control points $P_0, P_1, \ldots, P_n$ [2]:
$$B(t) = \sum_{i=0}^{n} N_{i,p}(t) P_i, \qquad (7)$$
where $N_{i,p}(t)$ is the normalized B-spline basis function of degree $p$ defined over the knots.
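In practice, fitting and resampling such a spline can be done with standard libraries. The sketch below uses scipy.interpolate, which is an assumption for illustration (the paper does not name the library used for smoothing).

```python
import numpy as np
from scipy import interpolate

def smooth_path_bspline(waypoints, degree=3, smoothing=0.0, n_samples=200):
    """Fit a B-spline of the given degree through a raw 3D TCP path and
    resample it densely; smoothing > 0 trades exactness for smoothness.
    `waypoints` is an (N, 3) array of x, y, z points (illustrative interface)."""
    waypoints = np.asarray(waypoints)
    # splprep expects a list of coordinate arrays: [x, y, z]
    tck, _ = interpolate.splprep(
        [waypoints[:, 0], waypoints[:, 1], waypoints[:, 2]],
        k=degree, s=smoothing)
    u_fine = np.linspace(0.0, 1.0, n_samples)
    x, y, z = interpolate.splev(u_fine, tck)
    return np.stack([x, y, z], axis=1)

# Example: smooth a short, jagged path made of discrete 5 mm steps.
raw_path = [[0.000, 0.000, 0.000], [0.005, 0.000, 0.000], [0.005, 0.005, 0.000],
            [0.010, 0.005, 0.005], [0.010, 0.010, 0.005], [0.015, 0.010, 0.010]]
smooth = smooth_path_bspline(raw_path, smoothing=1e-6)
```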
4 Experiments and Results
In this section, we present an application that uses several RL/DRL techniques for the problem of reaching a point in Cartesian space with a collaborative manipulator with 6 DOF (degrees of freedom). The experiment is demonstrated in simulation and in the real world.
4.1 Setting up the learning environment
For our experiment, we use the collaborative robotic arm UR3 from Universal Robots (Fig. 1), more precisely a 6-axis robotic arm with a working radius of 500 mm (19.7 in), a payload of 3 kg (6.6 lb), and a repeatability of ±0.1 mm [28].
The simulation is controlled through communication between the RVIZ simulation tool [7, 26] and the Gazebo 3D modeling tool [1, 15]. In our experiment, we use the UR3 URDF (Unified Robot Description Format) model without a gripper and the official ROS driver for Universal Robots [3], which is used to control both the real and the simulated robot.
The main structure of our experiment is shown in Fig. 1. Fig. 3 shows a simplified structure of the UR3 model in the RVIZ simulation tool. The blue sphere represents the initial position of the robot, and the target to be reached by the robot is represented by a red sphere. The purple box ($A_{target}$) is approximately the restricted area from which the targets are sampled, and the yellow box ($A_{search}$) represents the area of safe movement. The safe area is defined with respect to the individual robot model to avoid collisions with the target area (imagine that the target area is a bin in an object selection problem).
Figure 3: Learning environment of the UR3 model in
the RVIZ simulation tool.
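A schematic Gym environment for this reach task is sketched below. The class name, observation layout, and the simplified TCP kinematics are assumptions for illustration; in the actual setup, stepping the robot is handled through ROS communication with RVIZ and Gazebo.

```python
import gym
import numpy as np
from gym import spaces

class UR3ReachEnv(gym.Env):
    """Schematic reach-the-goal environment (illustrative only).

    Observation: current TCP position and target position (6 floats).
    Actions: 6 discrete moves (X+, X-, Y+, Y-, Z+, Z-) of 5 mm each.
    The real environment of this paper talks to RVIZ/Gazebo via ROS;
    here that is replaced by simple point kinematics of the TCP."""

    STEP = 0.005  # 5 mm TCP step
    MOVES = np.array([[+1, 0, 0], [-1, 0, 0],
                      [0, +1, 0], [0, -1, 0],
                      [0, 0, +1], [0, 0, -1]], dtype=float) * STEP

    def __init__(self, search_half_size=0.15):
        super().__init__()
        self.action_space = spaces.Discrete(6)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)
        self.search_half_size = search_half_size  # half size of A_search

    def reset(self):
        self.tcp = np.zeros(3)                               # initial position P_i
        self.target = np.random.uniform(-0.1, 0.1, size=3)   # sampled from A_target
        self.initial_dist = np.linalg.norm(self.tcp - self.target)
        return self._obs()

    def step(self, action):
        prev_dist = np.linalg.norm(self.tcp - self.target)
        self.tcp = self.tcp + self.MOVES[action]
        dist = np.linalg.norm(self.tcp - self.target)
        # Shaped reward: normalized progress towards the target
        # (cf. the reward definition in Section 4.3).
        reward = (prev_dist - dist) / self.initial_dist
        # The episode ends if the TCP leaves the safe search area A_search.
        done = bool(np.any(np.abs(self.tcp) > self.search_half_size))
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.concatenate([self.tcp, self.target]).astype(np.float32)
```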
4.2 Definition of experiment
In our problem, we choose a task aimed at achieving a goal in Cartesian space using a robotic arm. The task is to autonomously find the trajectory from the initial position to the target position using various learning techniques (Reinforcement Learning / Deep Reinforcement Learning) and to smooth the resulting trajectory using the Bézier spline method.
First, the robot position is initialized and the target position is randomly selected. After selecting a target position, we start the learning process for each learning technique. The distance between the start point $P_i$ and the target point $P_t$ is described by the Euclidean distance $d_t$ (Eq. 8):
$$d_t(P_i, P_t) = \sqrt{\sum_{i=1}^{n} (P_i - P_t)^2}, \qquad (8)$$
4.3 Learning process
We implement the learning environment within the OpenAI Gym library [6, 32], which provides an interface to train and test the learning process.
To evaluate our algorithms, we performed identical experiments with several RL/DRL techniques. In the first experiment, we use classical RL techniques (QL, SARSA), and in the next experiment, we use modern DRL techniques (DQN, DSARSA). The hyper-parameters of the individual RL/DRL techniques are given in Tab. 1 and Tab. 2.
Table 1: Hyper-parameters used for the Reinforcement Learning methods (Q-Learning, SARSA)

Hyper-parameter      | Symbol        | Value
---------------------|---------------|----------
Episodes of Training | M_min, M_max  | 500, 1000
Steps per Episode    | T             | 100
Discount Factor      | γ             | 0.75
Learning Rate        | α             | 0.3
Table 2: Hyper-parameters used for the Deep Reinforcement Learning methods (DQN, DSARSA)

Hyper-parameter      | Symbol        | Value
---------------------|---------------|----------
Episodes of Training | M_min, M_max  | 500, 1000
Steps per Episode    | T             | 100
Discount Factor      | γ             | 0.95
Learning Rate        | α             | 0.0003
Batch Size           | N             | 64
Replay Buffer Size   | B             | 2000
Optimizer            | -             | Adam [14]
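A schematic training loop with the hyper-parameters of Tab. 1 is shown below. It assumes the tabular helpers (Q, epsilon_greedy, q_learning_update) and the environment sketch from the previous sections; the discretization of the continuous TCP observation into a Q-table index is a placeholder, since the paper does not detail its state encoding.

```python
import numpy as np

# Hyper-parameters from Tab. 1 (classical RL case).
M_MAX, T_MAX, GAMMA, ALPHA = 1000, 100, 0.75, 0.3

def discretize(obs, q_table, grid=0.005):
    """Map a continuous observation onto a Q-table row index by rounding
    to the 5 mm action grid (illustrative placeholder encoding)."""
    return hash(tuple(np.round(np.asarray(obs) / grid).astype(int))) % q_table.shape[0]

# `env` and the tabular helpers come from the earlier sketches.
for episode in range(M_MAX):
    obs = env.reset()
    s = discretize(obs, Q)
    for step in range(T_MAX):
        a = epsilon_greedy(Q, s, n_actions=6)
        next_obs, r, done, _ = env.step(a)
        s_next = discretize(next_obs, Q)
        q_learning_update(Q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA)
        s = s_next
        if done:
            break
```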
The reward function of the goal-achievement experiment is, for each learning technique, defined as
$$R_t = \frac{d_t(P_p, P_t) - d_t(P_a, P_t)}{d_t(P_i, P_t)} + R_s, \qquad (9)$$
where $d_t(P_i, P_t)$ is the initial Euclidean distance between the start point $P_i$ and the target point $P_t$, and $d_t(P_p, P_t) - d_t(P_a, P_t)$ is the Euclidean distance difference in real time ($P_p$ is the previous position and $P_a$ the actual position). The parameter $R_s$ takes a value higher than 0 when the success condition is fulfilled, which is determined by the accuracy of the result (Eq. 10):
$$R_s = \begin{cases} 0.25, & \text{if } \delta_d \le 5\%, \\ 0, & \text{otherwise,} \end{cases} \qquad (10)$$
where $\delta_d$ is the Euclidean distance accuracy error, defined as
$$\delta_d = \frac{d_t(P_a, P_t)}{d_t(P_i, P_t)} \cdot 100. \qquad (11)$$

Figure 4: Agent-environment interaction in the Markov decision process (MDP) for our problem. The environment represents the communication between the RVIZ simulation tool [7, 26] and the Gazebo 3D modeling tool [1, 15] via ROS [25]. The agent represents the RL/DRL methods used in our problem.
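The reward of Eqs. (8)-(11) can be transcribed directly into a small helper function; the example positions at the end are made up for illustration.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance d_t between two Cartesian points, Eq. (8)."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def step_reward(p_prev, p_act, p_init, p_target):
    """Reward R_t of Eq. (9) with the success bonus R_s of Eq. (10),
    where delta_d is the Euclidean distance accuracy error of Eq. (11)."""
    d_init = euclidean(p_init, p_target)
    d_prev = euclidean(p_prev, p_target)
    d_act = euclidean(p_act, p_target)

    delta_d = d_act / d_init * 100.0          # Eq. (11), in percent
    r_s = 0.25 if delta_d <= 5.0 else 0.0     # Eq. (10)
    return (d_prev - d_act) / d_init + r_s    # Eq. (9)

# Example with made-up positions (in metres):
# d_init = 0.05, d_prev = 0.02, d_act = 0.01  ->  reward = 0.2
print(step_reward(p_prev=[0.02, 0.0, 0.0], p_act=[0.01, 0.0, 0.0],
                  p_init=[0.05, 0.0, 0.0], p_target=[0.0, 0.0, 0.0]))
```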
The learning process of the RL/DRL agent begins by exploring the environment, performing actions from the initial state towards the target state and collecting the corresponding rewards (Eq. 9). In our experiment, the agent can select one of six possible actions ($a_t \in \{0, \ldots, 5\}$) in each state $s_t$; the actions correspond to moves along the coordinate axes ($X^+$, $X^-$, $Y^+$, $Y^-$, $Z^+$, $Z^-$) with a fixed discrete step of 5 mm of the Tool Center Point (TCP).
If the agent moves out of $A_{search}$, the episode $M$ ends at the current step $T$; otherwise, the process continues until the maximum number of episodes is reached.
4.4 Experimental results
The training results of the proposed goal-achievement experiment using the UR3 robotic arm are shown in Fig. 5 ((a) Q-Learning, (b) SARSA, (c) Deep Q-Network, (d) Deep SARSA), including the mean cumulative reward over the number of iterations in each of the episodes.
The main goal of the optimization problem in the case of the learning process was to maximize the expected cumulative reward (Fig. 5) and to minimize the Euclidean distance accuracy error (Fig. 6).

Figure 5: Training results of the RL/DRL techniques in the goal-achievement environment with the UR3 robot: (a) Q-Learning, (b) SARSA, (c) Deep Q-Network, (d) Deep SARSA.