Fuzzy Approximation for Convergent
Model-Based Reinforcement Learning
Lucian Buşoniu, Damien Ernst, Bart De Schutter, and Robert Babuška
Abstract: Reinforcement learning (RL) is a learning control
paradigm that provides well-understood algorithms with good
convergence and consistency properties. Unfortunately, these
algorithms require that process states and control actions
take only discrete values. Approximate solutions using fuzzy
representations have been proposed in the literature for the
case when the states and possibly the actions are continuous.
However, the link between these mainly heuristic solutions
and the larger body of work on approximate RL, including
convergence results, has not been made explicit. In this paper,
we propose a fuzzy approximation structure for the Q-value
iteration algorithm, and show that the resulting algorithm is
convergent. The proof is based on an extension of previous
results in approximate RL. We then propose a modified, serial
version of the algorithm that is guaranteed to converge at least
as fast as the original algorithm. An illustrative simulation
example is also provided.
I. INTRODUCTION
Learning controllers can tackle problems where pre-
programmed solutions are difficult or impossible to design.
Reinforcement learning (RL) is a popular learning paradigm,
mainly because it requires only mild assumptions on the pro-
cess to be controlled, and is able to work without an explicit
model [1]–[3]. An RL controller directly measures the process
state, and receives feedback on the control performance in
the form of a scalar reward signal. The learning objective is
to maximize the cumulative reward signal. Well-understood
algorithms with good convergence and consistency properties
are available for solving the RL task, both when a model
of the controlled process is available and when it is not.
However, these algorithms require that the controller inputs
(process states) and outputs (control actions) take values in
a relatively small discrete set. When the state and / or action
spaces are continuous or contain a large number of elements,
approximate solutions must be used.
Approximation schemes have been proposed for model-
based RL [4]–[6], as well as for model-free or model-learning
RL [7]–[13].¹ Unfortunately, in general, approximate RL is
not guaranteed to converge [4], [14]. One type of approxi-
mators for which many RL algorithms converge are linear
Lucian Buşoniu, Bart De Schutter, and Robert Babuška are with the
Center for Systems and Control of the Delft University of Technology,
The Netherlands (email: i.l.busoniu@tudelft.nl, b@deschutter.info,
r.babuska@tudelft.nl). Bart De Schutter is also with the Marine and Transport
Technology Department of TU Delft. Damien Ernst is with Supélec,
Rennes, France (email: damien.ernst@supelec.fr).
¹ Some authors use the term 'model-based RL' when referring to algorithms
that build a model from interaction with the process. We use the term
'model-learning' for such techniques, and reserve the name 'model-based'
for algorithms that rely on an a priori model of the process.
basis functions, also known as kernel functions, averagers,
and interpolative representations [4], [5], [7], [8].
Fuzzy approximation for RL is also popular in the lit-
erature, mainly for model-free RL. Fuzzy approximators
are combined e.g., with Q-learning [15], yielding fuzzy
Q-learning [16]–[18], or with actor-critic algorithms [1],
yielding fuzzy actor-critic architectures [18]–[23]. For fuzzy
Q-learning, Takagi-Sugeno fuzzy rule-bases are typically
used. Actor-critic algorithms use fuzzy rule-bases for the
actor element, and either fuzzy or other approximators (e.g.,
neural networks) for the critic element. Typically, fuzzy RL
approaches are heuristic, and their convergence has not been
studied, with the exception of the actor-critic algorithms in
[20], [21]. These algorithms use special rule-base structures
and parameter update rules in order to guarantee conver-
gence. The results on convergence in the larger body of work
in approximate RL have not been employed for fuzzy RL.
In this work, we propose a fuzzy approximator similar
to others previously used for Q-learning [16], [18], but
we combine it with the model-based Q-iteration algorithm
(see e.g., [8]). This approximator works for continuous
states and discrete actions; however, continuous actions can
be handled by discretization. We show that the resulting
fuzzy Q-iteration algorithm converges. We then propose an
asynchronous, serial version of fuzzy Q-iteration, which
converges at least as fast as the original algorithm. The
modified algorithm has not, to the authors’ best knowledge,
been studied yet in approximate RL, although exact serial
value iteration is widely used [3].
The remainder of this paper is structured as follows. Sec-
tion II introduces the necessary RL elements, and Section III
describes approximate model-based RL. Section IV describes
the proposed fuzzy approximation structure. The properties
of approximate Q-iteration using this structure are analyzed
in Section V. Section VI illustrates the proposed algorithms
on a simulated example. Finally, Section VII outlines ideas
for future work and concludes the paper.
II. BACKGROUND: REINFORCEMENT LEARNING
In this section, we briefly introduce the RL task and
characterize its optimal solution, following [1]–[3].
Consider a deterministic Markov decision process with
the state space X, the action (control) space U , the state
transition function f : X × U → X, and the reward function
ρ : X × U → R.²

² A stochastic formulation is possible, where the state transitions are
probabilistic. In that case, expected returns under these probabilistic
transitions must be considered, and the results discussed still hold.

As a result of the control action u_k applied in state x_k, the
state changes to x_{k+1} = f(x_k, u_k). The controller receives
feedback on its performance in the form of the scalar reward
signal r_{k+1} = ρ(x_k, u_k). This reward evaluates the immediate
effect of action u_k, but says nothing directly about the long-term
effects of this action. The controller chooses actions given the
current state, according to its policy h : X → U: u_k = h(x_k).
The learning goal is the maximization, starting from the
current moment in time (k = 0), of the discounted return:
    Σ_{k=0}^∞ γ^k r_{k+1} = Σ_{k=0}^∞ γ^k ρ(x_k, u_k)        (1)

where γ ∈ [0, 1) is the discount factor. The discounted
return compactly represents the reward accumulated by the
controller in the long-run. The learning task is therefore
to maximize long-term performance, while only receiving
feedback about immediate, one-step performance. This can
be achieved by computing the optimal action-value function.
An action-value function (Q-function), Q^h : X × U → R,
gives the return of each state-action pair under a policy h:

    Q^h(x, u) = ρ(x, u) + Σ_{k=1}^∞ γ^k ρ(x_k, h(x_k))        (2)

where x_1 = f(x, u) and x_{k+1} = f(x_k, h(x_k)), ∀k. The
optimal action-value function is defined as Q*(x, u) =
max_h Q^h(x, u). Any policy that picks for every state the
action with the highest optimal Q-value:

    h*(x) = arg max_u Q*(x, u)        (3)
is then optimal (i.e., it maximizes the return (1)).
A central result in RL is the Bellman optimality equation:

    Q*(x, u) = ρ(x, u) + γ max_{u′∈U} Q*(f(x, u), u′),    ∀x, u        (4)

This equation states that the optimal value of action u taken
in state x is the expected immediate reward plus the expected
(discounted) optimal value attainable from the next state.
Let the set of all Q-functions be denoted by Q. The Q-iteration
mapping T : Q → Q is the right-hand side of the Bellman
equation for any Q-function:

    [T(Q)](x, u) = ρ(x, u) + γ max_{u′∈U} Q(f(x, u), u′)        (5)

Using this notation, the Bellman optimality equation (4)
states that Q* is a fixed point of T, i.e., Q* = T(Q*). The
following result is also well-known (see e.g., [24]).
Theorem 1: T is a contraction with factor γ in the
infinity norm, i.e., for any pair of functions Q, Q′,
‖T(Q) − T(Q′)‖_∞ ≤ γ ‖Q − Q′‖_∞.
The Q-value iteration (Q-iteration) algorithm starts from
an arbitrary Q-function Q_0 and at each iteration ℓ updates
the Q-function using the formula Q_{ℓ+1} = T(Q_ℓ). From
Theorem 1, it follows that T has a unique fixed point, and
from (4), this point is Q*. Therefore, Q-iteration converges
to Q* as ℓ → ∞.
The standard Q-iteration uses an a priori model of the
task (in the form of the transition and reward functions f, ρ).
There are also algorithms that learn a model from experience,
and others that do not use an explicit model at all [1], [2].
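For concreteness, the sketch below runs exact Q-iteration on a tiny deterministic MDP in Python; the three-state transition and reward tables are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of exact Q-iteration for a small deterministic MDP
# (toy example; the states, actions, f and rho below are assumptions).
n_states, n_actions = 3, 2
gamma = 0.9

# f[x, u] gives the next state, rho[x, u] the reward (toy values).
f = np.array([[1, 2],
              [2, 0],
              [2, 1]])
rho = np.array([[0.0, 1.0],
                [0.0, 0.0],
                [1.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Bellman backup (5): [T(Q)](x, u) = rho(x, u) + gamma * max_u' Q(f(x, u), u')
    Q_new = rho + gamma * Q[f].max(axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-8:   # stop when the update no longer changes Q
        Q = Q_new
        break
    Q = Q_new

h = Q.argmax(axis=1)   # greedy policy h(x) = arg max_u Q*(x, u), cf. (3)
print(Q, h)
```

Since T is a contraction with factor γ (Theorem 1), the iterates approach Q* geometrically and the stopping test is met after finitely many iterations.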
III. FUNCTION APPROXIMATION FOR Q-ITERATION
In general, the practical implementation of RL algorithms
requires that Q-values are stored and updated explicitly for
each state-action pair. This can only be realized when the
number of state and action values is small. When the state
and / or action spaces contain a large or infinite number of
elements (e.g., they are continuous), approximate solutions
must be used instead.
Parametric approximators use a parameter vector θ as
a finite representation of the Q-function Q̂. The following
mappings are defined in order to formalize parametric ap-
proximate Q-iteration (the notation follows [10]).
1) The Q-iteration mapping T, defined by equation (5).
2) The approximation mapping F : R^n → Q, which for
   a given value of the parameter vector θ ∈ R^n produces
   an approximate Q-function Q̂ = F(θ).
3) The projection mapping P : Q → R^n, which given a
   target Q-function Q computes the parameter vector θ
   such that F(θ) is as close as possible to Q (e.g., in a
   least-squares sense).
The notation [F(θ)](x, u) refers to the value of the Q-function
F(θ) for the state-action pair (x, u). The notation [P(Q)]_l
refers to the l-th parameter in the parameter vector P(Q).
Approximate Q-iteration starts with an arbitrary param-
eter vector θ_0 and at each iteration ℓ updates it using the
composition of the mappings P, T, and F:

    θ_{ℓ+1} = P T F(θ_ℓ)        (6)
Unfortunately, the approximate Q-iteration is not guar-
anteed to converge for an arbitrary approximator. Counter-
examples can be found e.g., in [4], [14] for the related value-
iteration algorithm, and those results apply directly to Q-
iteration as well. One particular case in which approximate
Q-iteration converges is when the composite mapping P T F
can be shown to be a contraction [4], [5]. This property will
be used below to show that fuzzy Q-iteration converges.
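Equation (6) can be read directly as a loop over the three mappings. The following schematic Python sketch assumes that F, T, and P are supplied as functions and that the parameter vector is a flat sequence of numbers; it is not tied to any particular approximator.

```python
# Schematic rendering of approximate Q-iteration (6): theta_{l+1} = P(T(F(theta_l))).
# F, T_map and P are assumed to be user-supplied callables (e.g., the fuzzy
# approximator of Section IV); this is a sketch, not a specific implementation.
def approximate_q_iteration(theta0, F, T_map, P, tol=1e-6, max_iter=1000):
    theta = theta0
    for _ in range(max_iter):
        theta_next = P(T_map(F(theta)))   # one application of the composite mapping
        if max(abs(a - b) for a, b in zip(theta_next, theta)) <= tol:
            return theta_next             # parameters have (numerically) converged
        theta = theta_next
    return theta
```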
IV. FUZZY Q-ITERATION
In this section, we propose a fuzzy approximation similar
to others previously used for Q-learning [16], [18], but we
combine it with the model-based Q-iteration algorithm. In the
sequel, it is assumed that the action space is discrete, denoted
by U_0 = {u_j | j = 1, . . . , M}. This discrete set can be
obtained from the discretization of an originally continuous
action space. The state space can be either continuous or
discrete. In the latter case, fuzzy approximation is useful
when the number of discrete states is large.
The proposed approximation architecture relies on a fuzzy
partition of the state space into N sets X_i, each described by
a membership function μ_i : X → [0, 1]. A state x belongs to
each set i with a degree of membership μ_i(x). In the sequel,
the following assumptions are made:

1) The fuzzy partition is normalized, i.e., Σ_{i=1}^N μ_i(x) = 1, ∀x ∈ X.
2) The fuzzy sets in the partition are normal, i.e., for
   every i there exists an x_i for which μ_i(x_i) = 1 (and
   consequently, μ_{i′}(x_i) = 0 for all i′ ≠ i by Assumption
   1). The state value x_i is called the core of set X_i.
This second assumption is made here for brevity in
the description and analysis of the algorithms; it can
be relaxed using results of [4].
For an example of a partition that satisfies the above
conditions, see Figure 2 of Section VI.
The Q-function is approximated using a Takagi-Sugeno
rule-base with singleton consequents. The rule-base has one
input, the state x, and M outputs q_1, . . . , q_M, the Q-values
corresponding to each of the discrete actions u_1, . . . , u_M.
The i-th rule in this rule-base has the form:
    R_i : if x is X_i then q_1 = θ_{i,1}; q_2 = θ_{i,2}; . . . ; q_M = θ_{i,M}
The parameters of this approximator are the singleton con-
sequent values appearing in the rule-base. They are arranged
in an N × M matrix θ, one row for each rule i and one
column for each output j.³ The logical expression "x is X_i"
holds true with degree μ_i(x), the membership degree of x
in X_i. The fuzzy rule-base outputs the weighted sum of the
consequent values θ_{i,j} in each rule, where the weight factor
of a particular rule corresponds to the degree of fulfillment of
its logical expression. Thus, the approximator takes as input
the state-action pair (x, u_j) and outputs the Q-value:
    Q̂(x, u_j) = [F(θ)](x, u_j) = Σ_{i=1}^N μ_i(x) θ_{i,j}        (7)
This is a basis-functions form, with the basis functions
only depending on the state. The approximator (7) can be
regarded as M distinct approximators, one for each of the
M discrete actions.
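A minimal Python sketch of the approximation mapping (7), assuming mu is a user-supplied function returning the N membership degrees μ_i(x):

```python
import numpy as np

def fuzzy_q(theta, mu, x, j):
    """Approximate Q-value (7): Q_hat(x, u_j) = sum_i mu_i(x) * theta[i, j].

    theta : (N, M) array of consequent parameters
    mu    : callable returning the N membership degrees mu_i(x) (assumed normalized)
    """
    return float(np.dot(mu(x), theta[:, j]))
```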
The projection mapping infers from a Q-function the val-
ues of the approximator parameters according to the relation:
    θ_{i,j} = [P(Q)]_{i,j} = Q(x_i, u_j)        (8)
This is a particular case of the least-squares solution:
    P(Q) = arg min_θ Σ_{(x,u) ∈ X_0 × U_0} [Q(x, u) − [F(θ)](x, u)]²

when the set of samples X_0 × U_0 is the Cartesian product
of the set of cores X_0 = {x_1, . . . , x_N} and the discrete
action space U_0. Note that when the set of samples is differ-
ent from this, least-squares projection no longer guarantees
convergence [4].
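The projection (8) can be sketched as follows (the argument names cores and U0 are assumptions used only for illustration):

```python
import numpy as np

def project(Q, cores, U0):
    """Projection (8): theta[i, j] = Q(x_i, u_j) for cores x_i and discrete actions u_j."""
    return np.array([[Q(x_i, u_j) for u_j in U0] for x_i in cores])
```

Because each core has membership 1 in its own set and 0 in all others (Assumptions 1 and 2), substituting these parameters back into (7) reproduces Q(x_i, u_j) exactly at the core states.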
The approximator specified in this way is a special case
of several types of approximators previously considered
for RL: interpolative representations [4], averagers [5], and
representative-state techniques as described in [13]. It also
shares similarities with barycentric interpolation [6]. The
analysis in Section V will rely on theoretical properties of
these approximators.

³ The matrix arrangement is adopted for convenience of notation only. For
the theoretical study of the algorithms, the collection of parameters is still
regarded as a vector, leading e.g., to ‖θ‖_∞ = max_{i,j} |θ_{i,j}|.
Algorithm 1 Parallel fuzzy Q-iteration
1: ℓ ← 0; θ_0 ← 0 (or arbitrary values)
2: repeat
3:   for i = 1, . . . , N, j = 1, . . . , M do
4:     θ_{ℓ+1,i,j} ← ρ(x_i, u_j) + γ max_{j′} Σ_{i′=1}^N μ_{i′}(f(x_i, u_j)) θ_{ℓ,i′,j′}
5:   end for
6:   ℓ ← ℓ + 1
7: until ‖θ_ℓ − θ_{ℓ−1}‖_∞ ≤ δ
Algorithm 2 Serial fuzzy Q-iteration
1: ℓ ← 0; θ_0 ← 0 (or arbitrary values)
2: repeat
3:   θ ← θ_ℓ
4:   for i = 1, . . . , N, j = 1, . . . , M do
5:     θ_{i,j} ← ρ(x_i, u_j) + γ max_{j′} Σ_{i′=1}^N μ_{i′}(f(x_i, u_j)) θ_{i′,j′}
6:   end for
7:   θ_{ℓ+1} ← θ; ℓ ← ℓ + 1
8: until ‖θ_ℓ − θ_{ℓ−1}‖_∞ ≤ δ
An explicit form of the approximate Q-value iteration
algorithm using the approximator (7) and projection (8) is
given in Algorithm 1. To establish the equivalence between
Algorithm 1 and the approximate Q-iteration in the form
(6), observe that the right-hand side in line 4 of Algorithm 1
corresponds to [T(Q̂_ℓ)](x_i, u_j), where Q̂_ℓ = F(θ_ℓ). Hence,
line 4 can be written θ_{ℓ+1,i,j} ← [P T F(θ_ℓ)]_{i,j}, and the entire
for loop described by lines 3–5 is equivalent to (6).
In Algorithm 1, only the parameters θ_ℓ at the end of the
previous iteration are used in the computation of the updated
values θ_{ℓ+1}. Algorithm 2 is an alternative version, which uses
the updated parameters as soon as they are available. Since
the parameters are updated in serial fashion, this version is
called serial Q-iteration. Although the exact counterpart of
this algorithm is widely used [1], [3], approximate serial Q-iteration
has not, to the authors' best knowledge, been studied
yet. To differentiate between the two versions, we hereafter
call Algorithm 1 parallel fuzzy Q-iteration.
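The two variants differ only in whether the backup on line 4 (respectively line 5) reads the parameter matrix of the previous iteration or the one being updated in place. A compact Python sketch of both, assuming the model f, ρ, the cores x_i, and the membership function mu are supplied by the user (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def fuzzy_q_iteration(cores, U0, mu, f, rho, gamma,
                      delta=1e-6, serial=False, max_iter=10000):
    """Sketch of parallel (Algorithm 1) and serial (Algorithm 2) fuzzy Q-iteration.

    cores : list of N core states x_i      mu : mu(x) -> N membership degrees
    U0    : list of M discrete actions     f, rho : a priori model of the task
    """
    N, M = len(cores), len(U0)
    # Memberships of the successor states f(x_i, u_j): shape (N, M, N).
    mu_next = np.array([[mu(f(x_i, u_j)) for u_j in U0] for x_i in cores])
    rewards = np.array([[rho(x_i, u_j) for u_j in U0] for x_i in cores])

    theta = np.zeros((N, M))
    for _ in range(max_iter):
        theta_old = theta.copy()
        # Serial reads the freshly updated entries; parallel reads the old copy.
        source = theta if serial else theta_old
        for i in range(N):
            for j in range(M):
                # rho(x_i,u_j) + gamma * max_j' sum_i' mu_i'(f(x_i,u_j)) * theta[i',j']
                theta[i, j] = rewards[i, j] + gamma * np.max(mu_next[i, j] @ source)
        if np.max(np.abs(theta - theta_old)) <= delta:   # ||theta_l - theta_{l-1}||_inf <= delta
            break
    return theta
```

Precomputing the memberships of the successor states f(x_i, u_j) is possible because an a priori model is available, which is what makes the approach model-based.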
V. ANALYSIS
In this section, the convergence of parallel and serial fuzzy
Q-iteration is established. It is shown that there exists a
parameter vector θ* such that, for both algorithms, θ_ℓ → θ*
as ℓ → ∞. The consistency of the algorithms, i.e., the
convergence to the optimal Q-function Q* as the maximum
distance between the cores of adjacent fuzzy sets goes to 0,
is not studied here and is a topic for future research. It can
be shown, however, that under certain conditions, F(θ*) is
within a given bound of Q* [4], [5].
Proposition 1: Fuzzy Q-iteration (Algorithm 1) con-
verges.

Proof: The proof follows from the convergence proof
of value iteration with averagers [5], or with interpolative
representations [4]. This is because fuzzy approximation is
an averager by the definition in [5], and an interpolative
representation by the definition in [4]. For these types of
approximator, P and F are nonexpansions, making P T F a
contraction with factor γ, i.e., ‖P T F(θ) − P T F(θ′)‖_∞ ≤
γ ‖θ − θ′‖_∞, for any θ, θ′.
Similarly to the convergence proof for exact serial value
iteration in [3], it is shown below that the approximate serial
Q-iteration Algorithm 2 converges.
Proposition 2: Serial fuzzy Q-iteration (Algorithm 2) con-
verges.
Proof: Denote n = N · M, and rearrange the matrix
θ into a vector in R^n, placing first the elements of the first
row, then the second, etc. The element at row i and column
j of the matrix is now the l-th element of the vector, with
l = (i − 1) · M + j.
Define for all l = 0, . . . , n recursively the mappings
S_l : R^n → R^n as:

    S_0(θ) = θ
    [S_l(θ)]_{l′} = [P T F(S_{l−1}(θ))]_{l′}    if l′ = l
    [S_l(θ)]_{l′} = [S_{l−1}(θ)]_{l′}           if l′ ∈ {1, . . . , n} \ {l}

In words, S_l corresponds to updating the first l parameters
using approximate serial Q-iteration, and S_n is a complete
iteration of the approximate serial algorithm. Now we
show that S_n is a contraction, i.e., ‖S_n(θ) − S_n(θ′)‖_∞ ≤
γ ‖θ − θ′‖_∞, for any θ, θ′. This can be done element-by-element.
By the definition of S_l, the first element is only
updated by S_1:

    |[S_n(θ)]_1 − [S_n(θ′)]_1| = |[S_1(θ)]_1 − [S_1(θ′)]_1|
                               = |[P T F(θ)]_1 − [P T F(θ′)]_1|
                               ≤ γ ‖θ − θ′‖_∞

The last step follows from the contraction mapping property
of P T F.
Similarly, the second element is only updated by S_2:

    |[S_n(θ)]_2 − [S_n(θ′)]_2| = |[S_2(θ)]_2 − [S_2(θ′)]_2|
                               = |[P T F(S_1(θ))]_2 − [P T F(S_1(θ′))]_2|
                               ≤ γ ‖S_1(θ) − S_1(θ′)‖_∞
                               = γ max{ |[P T F(θ)]_1 − [P T F(θ′)]_1|, |θ_2 − θ′_2|, . . . , |θ_n − θ′_n| }
                               ≤ γ ‖θ − θ′‖_∞

where ‖S_1(θ) − S_1(θ′)‖_∞ is expressed by direct maximization
over its elements, and the contraction mapping property
of P T F is used twice.
Continuing in this fashion, we obtain
|[S_n(θ)]_l − [S_n(θ′)]_l| ≤ γ ‖θ − θ′‖_∞ for all l, and
thus S_n is a contraction. Therefore, serial fuzzy Q-iteration
converges.
This proof is actually more general, showing that approx-
imate serial Q-iteration converges for any approximation F
and projection P for which P T F is a contraction.
In the same way as exact serial value iteration [3], serial
fuzzy Q-iteration can be shown to converge at least as quickly
as Algorithm 1.
The following bound on the suboptimality of the computed
Q-function follows from [5], but applies only when the
action space of the original problem is discrete (i.e., no
discretization is necessary prior to fuzzy Q-iteration).
Proposition 3: If the original action space is discrete and
min_Q ‖Q − Q*‖_∞ = ε, where Q is any fixed point of the
composite mapping F P : Q → Q, then fuzzy Q-iteration
converges to θ* such that:

    ‖Q* − F(θ*)‖_∞ ≤ 2ε / (1 − γ)        (9)
For example, any Q-function which satisfies Q(x, u_j) =
Σ_{i=1}^N μ_i(x) Q(x_i, u_j), ∀x, j, is a fixed point of F P. In
particular, if the optimal Q-function has this form, i.e., is
exactly representable by the chosen fuzzy approximator, the
algorithm will converge to it (since in this case ε = 0).
In this section, we have established the parallel and
serial fuzzy Q-iteration as theoretically sound algorithms for
approximate RL in continuous-state tasks. When the original
action space is discrete, bounds on the derived Q-function
and policy were also shown to hold.
VI. SIMULATION EXAMPLE
As an illustrative example, fuzzy Q-iteration is applied in
simulation to the minimum-time stabilization of a two-link
manipulator.
A. Two-link Manipulator Model
The two-link manipulator, depicted in Figure 1, is de-
scribed by the fourth-order nonlinear model:
    M(α)α̈ + C(α, α̇)α̇ + G(α) = τ        (10)
where α = [α_1, α_2]^T, τ = [τ_1, τ_2]^T. The system has two
control inputs, the torques in the two joints, τ_1 and τ_2, and
four measured outputs: the link angles, α_1, α_2, and their
angular speeds, α̇_1, α̇_2.
In the sequel, it is assumed that the manipulator operates
in a horizontal plane, leading to G(α) = 0. The mass matrix
M(α) and the Coriolis and centrifugal forces matrix C(α, α̇)
have the following form:
    M(α) = [ P_1 + P_2 + 2 P_3 cos α_2    P_2 + P_3 cos α_2 ]
           [ P_2 + P_3 cos α_2            P_2               ]        (11)

    C(α, α̇) = [ b_1 − P_3 α̇_2 sin α_2    −P_3 (α̇_1 + α̇_2) sin α_2 ]
               [ P_3 α̇_1 sin α_2          b_2                       ]        (12)
Fig. 1. Schematic drawing of the two-link rigid manipulator.

TABLE I
PHYSICAL PARAMETERS OF THE MANIPULATOR

Symbols and values                         | Meaning
l_1 = l_2 = 0.4 m                          | link lengths
m_1 = 1.25 kg, m_2 = 0.8 kg                | link masses
I_1 = 0.066 kg·m², I_2 = 0.043 kg·m²       | link inertias
c_1 = c_2 = 0.2 m                          | centers of mass for the links
b_1 = 0.08 kg/s, b_2 = 0.02 kg/s           | dampings in the joints
τ_{1,max} = 1.5 Nm, τ_{2,max} = 1 Nm       | maximum motor torques
α̇_{1,max} = α̇_{2,max} = 2π rad/s          | maximum angular velocities
The meaning and values of the physical parameters of the
system are given in Table I. Using these, the rest of the
parameters in (10) can be computed by:
    P_1 = m_1 c_1² + m_2 l_1² + I_1,   P_2 = m_2 c_2² + I_2,   P_3 = m_2 l_1 c_2        (13)
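As a numerical check, (13) with the values of Table I gives the following (a small Python sketch; the numbers are derived from the table above):

```python
# Parameters from Table I.
m1, m2 = 1.25, 0.8          # link masses [kg]
l1 = 0.4                    # link 1 length [m]
c1, c2 = 0.2, 0.2           # centers of mass [m]
I1, I2 = 0.066, 0.043       # link inertias [kg m^2]

# Equation (13).
P1 = m1 * c1**2 + m2 * l1**2 + I1   # = 0.244
P2 = m2 * c2**2 + I2                # = 0.075
P3 = m2 * l1 * c2                   # = 0.064
```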
B. Setup of the RL Algorithm
The input of the RL controller (the process state) is x =
[α^T, α̇^T]^T, and its output (the command signal) is u = τ.
The discrete time step is set to T_S = 0.05 s, and the discrete-time
dynamics f are obtained by numerical integration of
(10) between consecutive time steps.
The control goal is the stabilization of the system around
α = α̇ = 0 in minimum time, with a tolerance of ±5 · π/180
rad for the angles, and ±0.1 rad/s for the angular speeds.
The reward function chosen to express this goal is:
    ρ(x, u) = { 0    if |α_p| ≤ 5 · π/180 rad and |α̇_p| ≤ 0.1 rad/s, p = 1, 2
              { −1   otherwise        (14)

where [α_1, α_2, α̇_1, α̇_2]^T = f(x, u) (the next state).
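A direct Python transcription of (14) could look as follows, assuming f(x, u) returns the next state [α_1, α_2, α̇_1, α̇_2]:

```python
import numpy as np

def reward(x, u, f):
    """Reward (14): 0 inside the goal region of the next state, -1 otherwise."""
    a1, a2, ad1, ad2 = f(x, u)   # next state: angles [rad] and angular speeds [rad/s]
    angle_ok = abs(a1) <= 5 * np.pi / 180 and abs(a2) <= 5 * np.pi / 180
    speed_ok = abs(ad1) <= 0.1 and abs(ad2) <= 0.1
    return 0.0 if angle_ok and speed_ok else -1.0
```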
Each torque signal τ_p, p = 1, 2, takes continuous values in
the corresponding interval [−τ_{p,max}, τ_{p,max}]. To apply fuzzy
Q-iteration, three discrete values are chosen for each torque:
−τ_{p,max} (maximal torque clockwise), 0, and τ_{p,max} (maximal
torque counter-clockwise).
Separately for each state component, a normal, complete
triangular fuzzy partition is defined. Such a partition is
completely determined by the core coordinates of the fuzzy
sets. For α̇_1 and α̇_2, the interval is partitioned into 7 fuzzy
sets, with their cores at {−360, −180, −30, 0, 30, 180, 360} ·
π/180 rad/s. This partition is depicted as an example in
Figure 2. For α_1 and α_2, 12 sets are used, with their cores
at {−180, −130, −80, −30, −15, −5, 0, 5, 15, 30, 80, 130} ·
π/180 rad. There is no fuzzy set with core π, because this
is identical with the first set, having the core −π (the angles
evolve on a circle manifold [−π, π)).
The fuzzy partition of the state space is then defined as
follows. One fuzzy set is computed for each combination
(i_1, . . . , i_4) of individual sets for the four state components
α_1, α_2, α̇_1, α̇_2. Such a fuzzy set has the following membership
function:

    μ(x) = μ_{α_1,i_1}(α_1) · μ_{α_2,i_2}(α_2) · μ_{α̇_1,i_3}(α̇_1) · μ_{α̇_2,i_4}(α̇_2)        (15)
This way of building the state space partition can be thought
of as a conjunction of one-dimensional concepts correspond-
ing to the fuzzy partitions of the individual state variables.
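As an illustration, the partition described above could be implemented as follows: triangular membership functions determined by their cores for each state variable, combined by the product (15). This is a sketch under the stated core values; it ignores the circular wrap-around of the angles for simplicity, and the function names are assumptions.

```python
import numpy as np

def triangular_memberships(z, cores):
    """Normalized triangular fuzzy partition of one variable, given its sorted cores."""
    mu = np.zeros(len(cores))
    if z <= cores[0]:
        mu[0] = 1.0
    elif z >= cores[-1]:
        mu[-1] = 1.0
    else:
        k = np.searchsorted(cores, z) - 1          # cores[k] <= z < cores[k+1]
        w = (z - cores[k]) / (cores[k + 1] - cores[k])
        mu[k], mu[k + 1] = 1.0 - w, w              # two neighboring sets share membership 1
    return mu

# Cores from the text (angles in rad, angular speeds in rad/s).
ang_cores = np.array([-180, -130, -80, -30, -15, -5, 0, 5, 15, 30, 80, 130]) * np.pi / 180
vel_cores = np.array([-360, -180, -30, 0, 30, 180, 360]) * np.pi / 180

def state_memberships(x):
    """Product membership (15) over all combinations of the 1-D sets."""
    parts = [triangular_memberships(x[0], ang_cores),
             triangular_memberships(x[1], ang_cores),
             triangular_memberships(x[2], vel_cores),
             triangular_memberships(x[3], vel_cores)]
    mu = parts[0]
    for p in parts[1:]:
        mu = np.outer(mu, p).ravel()               # conjunction of one-dimensional concepts
    return mu
```

Each returned vector has 12 · 12 · 7 · 7 = 7056 entries and sums to 1, in line with Assumptions 1 and 2 of Section IV.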
Fig. 2. The triangular fuzzy partition for the state variable α̇_1 ∈ [−2π, 2π]
(identical to the partition for α̇_2). Core values are in
{−2π, −π, −π/6, 0, π/6, π, 2π}.
Fig. 3. State, command, and reward signals for RL control (thin black line:
link 1, thick gray line: link 2); panels show the link angles [rad], link
velocities [rad/s], commanded torques [Nm], and reward versus time t [s].
The initial state is x_0 = [−π, −π, 0, 0]^T.
The fuzzy partition computed in this way still satisfies
Assumptions 1 and 2. It contains (12 · 7)² = 7056 sets.
An approximate optimal action-value function is computed
with serial and parallel fuzzy Q-iteration. The discount factor
is set to γ = 0.98.
C. Results
Figure 3 presents a controlled trajectory starting from
the initial state x_0 = [−π, −π, 0, 0]^T, together with the
corresponding command and reward signals. In order to
obtain a continuous policy from the computed Q-function,
the following heuristic is used. For any state value, an
action is computed by interpolating between the best local
actions, using the membership degrees as weights: h(x) =
Σ_{i=1}^N μ_i(x) u_{j_i}, where j_i is the index of the best local
action for the core state x_i, j_i = arg max_j Q̂*(x_i, u_j) =
arg max_j θ_{i,j}.
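A sketch of this interpolation heuristic in Python, assuming theta is the converged parameter matrix, mu returns the membership degrees of x, and U0 holds the discrete torque vectors u_j (illustrative names, not the authors' code):

```python
import numpy as np

def interpolated_policy(theta, mu, U0):
    """Continuous policy h(x) = sum_i mu_i(x) * u_{j_i}, with j_i = arg max_j theta[i, j]."""
    U0 = np.asarray(U0, dtype=float)           # (M, 2) array of discrete torque vectors
    best_local = U0[np.argmax(theta, axis=1)]  # best local action u_{j_i} per core, shape (N, 2)
    def h(x):
        return mu(x) @ best_local              # membership-weighted blend of best local actions
    return h
```

In this example, the discrete action set consists of the combinations of the three torque levels per joint.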
The controller successfully stabilizes the system in about
2.7 s. Because the control actions were originally continuous
and had to be discretized prior to running the fuzzy Q-
iteration, the bound (9) does not apply.
