Fuzzy Approximation for Convergent
Model-Based Reinforcement Learning
Lucian Buşoniu, Damien Ernst, Bart De Schutter, and Robert Babuška
Abstract: Reinforcement learning (RL) is a learning control
paradigm that provides well-understood algorithms with good
convergence and consistency properties. Unfortunately, these
algorithms require that process states and control actions
take only discrete values. Approximate solutions using fuzzy
representations have been proposed in the literature for the
case when the states and possibly the actions are continuous.
However, the link between these mainly heuristic solutions
and the larger body of work on approximate RL, including
convergence results, has not been made explicit. In this paper,
we propose a fuzzy approximation structure for the Q-value
iteration algorithm, and show that the resulting algorithm is
convergent. The proof is based on an extension of previous
results in approximate RL. We then propose a modified, serial
version of the algorithm that is guaranteed to converge at least
as fast as the original algorithm. An illustrative simulation
example is also provided.
I. INTRODUCTION
Learning controllers can tackle problems where pre-
programmed solutions are difficult or impossible to design.
Reinforcement learning (RL) is a popular learning paradigm,
mainly because it requires only mild assumptions on the pro-
cess to be controlled, and is able to work without an explicit
model [1]–[3]. An RL controller directly measures the process
state, and receives feedback on the control performance in
the form of a scalar reward signal. The learning objective is
to maximize the cumulative reward signal. Well-understood
algorithms with good convergence and consistency properties
are available for solving the RL task, both when a model
of the controlled process is available and when it is not.
However, these algorithms require that the controller inputs
(process states) and outputs (control actions) take values in
a relatively small discrete set. When the state and / or action
spaces are continuous or contain a large number of elements,
approximate solutions must be used.
Approximation schemes have been proposed for model-
based RL [4]–[6], as well as for model-free or model-learning
RL [7]–[13].¹ Unfortunately, in general, approximate RL is
not guaranteed to converge [4], [14]. One type of approxi-
mators for which many RL algorithms converge are linear
Lucian Buşoniu, Bart De Schutter, and Robert Babuška are with the
Center for Systems and Control of the Delft University of Technology,
The Netherlands (email: i.l.busoniu@tudelft.nl, b@deschutter.info,
r.babuska@tudelft.nl). Bart De Schutter is also with the Marine and Transport
Technology Department of TU Delft. Damien Ernst is with Supélec,
Rennes, France (email: damien.ernst@supelec.fr).
¹ Some authors use the term 'model-based RL' when referring to algorithms
that build a model from interaction with the process. We use the term
'model-learning' for such techniques, and reserve the name 'model-based'
for algorithms that rely on an a priori model of the process.
basis functions, also known as kernel functions, averagers,
and interpolative representations [4], [5], [7], [8].
Fuzzy approximation for RL is also popular in the lit-
erature, mainly for model-free RL. Fuzzy approximators
are combined e.g., with Q-learning [15], yielding fuzzy
Q-learning [16]–[18], or with actor-critic algorithms [1],
yielding fuzzy actor-critic architectures [18]–[23]. For fuzzy
Q-learning, Takagi-Sugeno fuzzy rule-bases are typically
used. Actor-critic algorithms use fuzzy rule-bases for the
actor element, and either fuzzy or other approximators (e.g.,
neural networks) for the critic element. Typically, fuzzy RL
approaches are heuristic, and their convergence has not been
studied, with the exception of the actor-critic algorithms in
[20], [21]. These algorithms use special rule-base structures
and parameter update rules in order to guarantee conver-
gence. The results on convergence in the larger body of work
in approximate RL have not been employed for fuzzy RL.
In this work, we propose a fuzzy approximator similar
to others previously used for Q-learning [16], [18], but
we combine it with the model-based Q-iteration algorithm
(see e.g., [8]). This approximator works for continuous
states and discrete actions; however, continuous actions can
be handled by discretization. We show that the resulting
fuzzy Q-iteration algorithm converges. We then propose an
asynchronous, serial version of fuzzy Q-iteration, which
converges at least as fast as the original algorithm. The
modified algorithm has not, to the authors’ best knowledge,
been studied yet in approximate RL, although exact serial
value iteration is widely used [3].
The remainder of this paper is structured as follows. Sec-
tion II introduces the necessary RL elements, and Section III
describes approximate model-based RL. Section IV describes
the proposed fuzzy approximation structure. The properties
of approximate Q-iteration using this structure are analyzed
in Section V. Section VI illustrates the proposed algorithms
on a simulated example. Finally, Section VII outlines ideas
for future work and concludes the paper.
II. BACKGROUND: REINFORCEMENT LEARNING
In this section, we briefly introduce the RL task and
characterize its optimal solution, following [1]–[3].
Consider a deterministic Markov decision process with
the state space X, the action (control) space U , the state
transition function f : X × U → X, and the reward function
ρ : X × U → R.²

² A stochastic formulation is possible, where the state transitions are
probabilistic. In that case, expected returns under these probabilistic
transitions must be considered, and the results discussed still hold.

As a result of the control action u_k applied in state x_k, the
state changes to x_{k+1} = f(x_k, u_k). The controller receives
feedback on its performance in the form of the scalar reward
signal r_{k+1} = ρ(x_k, u_k). This reward evaluates the immediate
effect of action u_k, but says nothing directly about the long-term
effects of this action. The controller chooses actions given the
current state, according to its policy h : X → U: u_k = h(x_k).
The learning goal is the maximization, starting from the
current moment in time (k = 0), of the discounted return:
    Σ_{k=0}^∞ γ^k r_{k+1} = Σ_{k=0}^∞ γ^k ρ(x_k, u_k)        (1)

where γ ∈ [0, 1) is the discount factor. The discounted
return compactly represents the reward accumulated by the
controller in the long-run. The learning task is therefore
to maximize long-term performance, while only receiving
feedback about immediate, one-step performance. This can
be achieved by computing the optimal action-value function.
An action-value function (Q-function), Q^h : X × U → R,
gives the return of each state-action pair under a policy h:

    Q^h(x, u) = ρ(x, u) + Σ_{k=1}^∞ γ^k ρ(x_k, h(x_k))        (2)

where x_1 = f(x, u) and x_{k+1} = f(x_k, h(x_k)), ∀k. The
optimal action-value function is defined as Q*(x, u) =
max_h Q^h(x, u). Any policy that picks for every state the
action with the highest optimal Q-value:

    h*(x) = arg max_u Q*(x, u)        (3)
is then optimal (i.e., it maximizes the return (1)).
A central result in RL is the Bellman optimality equation:

    Q*(x, u) = ρ(x, u) + γ max_{u′∈U} Q*(f(x, u), u′),    ∀x, u        (4)

This equation states that the optimal value of action u taken
in state x is the expected immediate reward plus the expected
(discounted) optimal value attainable from the next state.
Let the set of all Q-functions be denoted by Q. The Q-iteration
mapping T : Q → Q is the right-hand side of the Bellman
equation for any Q-function:

    [T(Q)](x, u) = ρ(x, u) + γ max_{u′∈U} Q(f(x, u), u′)        (5)

Using this notation, the Bellman optimality equation (4)
states that Q* is a fixed point of T, i.e., Q* = T(Q*). The
following result is also well-known (see e.g., [24]).
Theorem 1: T is a contraction with factor γ in the
infinity norm, i.e., for any pair of functions Q, Q′,
‖T(Q) − T(Q′)‖_∞ ≤ γ ‖Q − Q′‖_∞.
The Q-value iteration (Q-iteration) algorithm starts from
an arbitrary Q-function Q_0 and at each iteration ℓ updates
the Q-function using the formula Q_{ℓ+1} = T(Q_ℓ). From
Theorem 1, it follows that T has a unique fixed point, and
from (4), this point is Q*. Therefore, Q-iteration converges
to Q* as ℓ → ∞.
The standard Q-iteration uses an a priori model of the
task (in the form of the transition and reward functions f, ρ).
There are also algorithms that learn a model from experience,
and others that do not use an explicit model at all [1], [2].
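For concreteness, the sketch below runs exact Q-iteration on a tiny deterministic MDP in Python; the three-state transition and reward tables are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of exact Q-iteration for a small deterministic MDP
# (toy example; the states, actions, f and rho below are assumptions).
n_states, n_actions = 3, 2
gamma = 0.9

# f[x, u] gives the next state, rho[x, u] the reward (toy values).
f = np.array([[1, 2],
              [2, 0],
              [2, 1]])
rho = np.array([[0.0, 1.0],
                [0.0, 0.0],
                [1.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Bellman backup (5): [T(Q)](x, u) = rho(x, u) + gamma * max_u' Q(f(x, u), u')
    Q_new = rho + gamma * Q[f].max(axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-8:   # stop when the update no longer changes Q
        Q = Q_new
        break
    Q = Q_new

h = Q.argmax(axis=1)   # greedy policy h(x) = arg max_u Q*(x, u), cf. (3)
print(Q, h)
```

Since T is a contraction with factor γ (Theorem 1), the iterates approach Q* geometrically and the stopping test is met after finitely many iterations.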
III. FUNCTION APPROXIMATION FOR Q-ITERATION
In general, the practical implementation of RL algorithms
requires that Q-values are stored and updated explicitly for
each state-action pair. This can only be realized when the
number of state and action values is small. When the state
and / or action spaces contain a large or infinite number of
elements (e.g., they are continuous), approximate solutions
must be used instead.
Parametric approximators use a parameter vector θ as
a finite representation of the Q-function Q̂. The following
mappings are defined in order to formalize parametric ap-
proximate Q-iteration (the notation follows [10]).
1) The Q-iteration mapping T, defined by equation (5).
2) The approximation mapping F : R^n → Q, which for
   a given value of the parameter vector θ ∈ R^n produces
   an approximate Q-function Q̂ = F(θ).
3) The projection mapping P : Q → R^n, which given a
   target Q-function Q computes the parameter vector θ
   such that F(θ) is as close as possible to Q (e.g., in a
   least-squares sense).
The notation [F(θ)](x, u) refers to the value of the Q-function
F(θ) for the state-action pair (x, u). The notation [P(Q)]_l
refers to the l-th parameter in the parameter vector P(Q).
Approximate Q-iteration starts with an arbitrary param-
eter vector θ_0 and at each iteration ℓ updates it using the
composition of the mappings P, T, and F:

    θ_{ℓ+1} = P T F(θ_ℓ)        (6)
Unfortunately, the approximate Q-iteration is not guar-
anteed to converge for an arbitrary approximator. Counter-
examples can be found e.g., in [4], [14] for the related value-
iteration algorithm, and those results apply directly to Q-
iteration as well. One particular case in which approximate
Q-iteration converges is when the composite mapping P T F
can be shown to be a contraction [4], [5]. This property will
be used below to show that fuzzy Q-iteration converges.
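Equation (6) can be read directly as a loop over the three mappings. The following schematic Python sketch assumes that F, T, and P are supplied as functions and that the parameter vector is a flat sequence of numbers; it is not tied to any particular approximator.

```python
# Schematic rendering of approximate Q-iteration (6): theta_{l+1} = P(T(F(theta_l))).
# F, T_map and P are assumed to be user-supplied callables (e.g., the fuzzy
# approximator of Section IV); this is a sketch, not a specific implementation.
def approximate_q_iteration(theta0, F, T_map, P, tol=1e-6, max_iter=1000):
    theta = theta0
    for _ in range(max_iter):
        theta_next = P(T_map(F(theta)))   # one application of the composite mapping
        if max(abs(a - b) for a, b in zip(theta_next, theta)) <= tol:
            return theta_next             # parameters have (numerically) converged
        theta = theta_next
    return theta
```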
IV. FUZZY Q-ITERATION
In this section, we propose a fuzzy approximation similar
to others previously used for Q-learning [16], [18], but we
combine it with the model-based Q-iteration algorithm. In the
sequel, it is assumed that the action space is discrete, denoted
by U_0 = {u_j | j = 1, . . . , M}. This discrete set can be
obtained from the discretization of an originally continuous
action space. The state space can be either continuous or
discrete. In the latter case, fuzzy approximation is useful
when the number of discrete states is large.
The proposed approximation architecture relies on a fuzzy
partition of the state space into N sets X_i, each described by
a membership function μ_i : X → [0, 1]. A state x belongs to
each set i with a degree of membership μ_i(x). In the sequel,
the following assumptions are made:

1) The fuzzy partition is normalized, i.e., Σ_{i=1}^N μ_i(x) = 1, ∀x ∈ X.
2) The fuzzy sets in the partition are normal, i.e., for
   every i there exists an x_i for which μ_i(x_i) = 1 (and
   consequently, μ_{i′}(x_i) = 0 for all i′ ≠ i by Assumption
   1). The state value x_i is called the core of set X_i.
This second assumption is made here for brevity in
the description and analysis of the algorithms; it can
be relaxed using results of [4].
For an example of a partition that satisfies the above
conditions, see Figure 2 of Section VI.
The Q-function is approximated using a Takagi-Sugeno
rule-base with singleton consequents. The rule-base has one
input, the state x, and M outputs q_1, . . . , q_M, the Q-values
corresponding to each of the discrete actions u_1, . . . , u_M.
The i-th rule in this rule-base has the form:
    R_i : if x is X_i then q_1 = θ_{i,1}; q_2 = θ_{i,2}; . . . ; q_M = θ_{i,M}
The parameters of this approximator are the singleton con-
sequent values appearing in the rule-base. They are arranged
in an N × M matrix θ, one row for each rule i and one
column for each output j.³ The logical expression "x is X_i"
holds true with degree μ_i(x), the membership degree of x
in X_i. The fuzzy rule-base outputs the weighted sum of the
consequent values θ_{i,j} in each rule, where the weight factor
of a particular rule corresponds to the degree of fulfillment of
its logical expression. Thus, the approximator takes as input
the state-action pair (x, u_j) and outputs the Q-value:
    Q̂(x, u_j) = [F(θ)](x, u_j) = Σ_{i=1}^N μ_i(x) θ_{i,j}        (7)
This is a basis-functions form, with the basis functions
only depending on the state. The approximator (7) can be
regarded as M distinct approximators, one for each of the
M discrete actions.
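A minimal Python sketch of the approximation mapping (7), assuming mu is a user-supplied function returning the N membership degrees μ_i(x):

```python
import numpy as np

def fuzzy_q(theta, mu, x, j):
    """Approximate Q-value (7): Q_hat(x, u_j) = sum_i mu_i(x) * theta[i, j].

    theta : (N, M) array of consequent parameters
    mu    : callable returning the N membership degrees mu_i(x) (assumed normalized)
    """
    return float(np.dot(mu(x), theta[:, j]))
```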
The projection mapping infers from a Q-function the val-
ues of the approximator parameters according to the relation:
    θ_{i,j} = [P(Q)]_{i,j} = Q(x_i, u_j)        (8)
This is a particular case of the least-squares solution:
    P(Q) = arg min_θ Σ_{(x,u) ∈ X_0 × U_0} [Q(x, u) − [F(θ)](x, u)]²

when the set of samples X_0 × U_0 is the Cartesian product
of the set of cores X_0 = {x_1, . . . , x_N} and the discrete
action space U_0. Note that when the set of samples is differ-
ent from this, least-squares projection no longer guarantees
convergence [4].
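The projection (8) can be sketched as follows (the argument names cores and U0 are assumptions used only for illustration):

```python
import numpy as np

def project(Q, cores, U0):
    """Projection (8): theta[i, j] = Q(x_i, u_j) for cores x_i and discrete actions u_j."""
    return np.array([[Q(x_i, u_j) for u_j in U0] for x_i in cores])
```

Because each core has membership 1 in its own set and 0 in all others (Assumptions 1 and 2), substituting these parameters back into (7) reproduces Q(x_i, u_j) exactly at the core states.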
The approximator specified in this way is a special case
of several types of approximators previously considered
for RL: interpolative representations [4], averagers [5], and
representative-state techniques as described in [13]. It also
shares similarities with barycentric interpolation [6]. The
analysis in Section V will rely on theoretical properties of
these approximators.

³ The matrix arrangement is adopted for convenience of notation only. For
the theoretical study of the algorithms, the collection of parameters is still
regarded as a vector, leading e.g., to ‖θ‖_∞ = max_{i,j} |θ_{i,j}|.
Algorithm 1 Parallel fuzzy Q-iteration
1: ℓ ← 0; θ_0 ← 0 (or arbitrary values)
2: repeat
3:   for i = 1, . . . , N, j = 1, . . . , M do
4:     θ_{ℓ+1,i,j} ← ρ(x_i, u_j) + γ max_{j′} Σ_{i′=1}^N μ_{i′}(f(x_i, u_j)) θ_{ℓ,i′,j′}
5:   end for
6:   ℓ ← ℓ + 1
7: until ‖θ_ℓ − θ_{ℓ−1}‖_∞ ≤ δ
Algorithm 2 Serial fuzzy Q-iteration
1: ℓ ← 0; θ_0 ← 0 (or arbitrary values)
2: repeat
3:   θ ← θ_ℓ
4:   for i = 1, . . . , N, j = 1, . . . , M do
5:     θ_{i,j} ← ρ(x_i, u_j) + γ max_{j′} Σ_{i′=1}^N μ_{i′}(f(x_i, u_j)) θ_{i′,j′}
6:   end for
7:   θ_{ℓ+1} ← θ; ℓ ← ℓ + 1
8: until ‖θ_ℓ − θ_{ℓ−1}‖_∞ ≤ δ
An explicit form of the approximate Q-value iteration
algorithm using the approximator (7) and projection (8) is
given in Algorithm 1. To establish the equivalence between
Algorithm 1 and the approximate Q-iteration in the form
(6), observe that the right-hand side in line 4 of Algorithm 1
corresponds to [T(Q̂_ℓ)](x_i, u_j), where Q̂_ℓ = F(θ_ℓ). Hence,
line 4 can be written θ_{ℓ+1,i,j} ← [P T F(θ_ℓ)]_{i,j}, and the entire
for loop described by lines 3–5 is equivalent to (6).
In Algorithm 1, only the parameters θ_ℓ at the end of the
previous iteration are used in the computation of the updated
values θ_{ℓ+1}. Algorithm 2 is an alternative version, which uses
the updated parameters as soon as they are available. Since
the parameters are updated in serial fashion, this version is
called serial Q-iteration. Although the exact counterpart of
this algorithm is widely used [1], [3], approximate serial Q-iteration
has not, to the authors' best knowledge, been studied
yet. To differentiate between the two versions, we hereafter
call Algorithm 1 parallel fuzzy Q-iteration.
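The two variants differ only in whether the backup on line 4 (respectively line 5) reads the parameter matrix of the previous iteration or the one being updated in place. A compact Python sketch of both, assuming the model f, ρ, the cores x_i, and the membership function mu are supplied by the user (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def fuzzy_q_iteration(cores, U0, mu, f, rho, gamma,
                      delta=1e-6, serial=False, max_iter=10000):
    """Sketch of parallel (Algorithm 1) and serial (Algorithm 2) fuzzy Q-iteration.

    cores : list of N core states x_i      mu : mu(x) -> N membership degrees
    U0    : list of M discrete actions     f, rho : a priori model of the task
    """
    N, M = len(cores), len(U0)
    # Memberships of the successor states f(x_i, u_j): shape (N, M, N).
    mu_next = np.array([[mu(f(x_i, u_j)) for u_j in U0] for x_i in cores])
    rewards = np.array([[rho(x_i, u_j) for u_j in U0] for x_i in cores])

    theta = np.zeros((N, M))
    for _ in range(max_iter):
        theta_old = theta.copy()
        # Serial reads the freshly updated entries; parallel reads the old copy.
        source = theta if serial else theta_old
        for i in range(N):
            for j in range(M):
                # rho(x_i,u_j) + gamma * max_j' sum_i' mu_i'(f(x_i,u_j)) * theta[i',j']
                theta[i, j] = rewards[i, j] + gamma * np.max(mu_next[i, j] @ source)
        if np.max(np.abs(theta - theta_old)) <= delta:   # ||theta_l - theta_{l-1}||_inf <= delta
            break
    return theta
```

Precomputing the memberships of the successor states f(x_i, u_j) is possible because an a priori model is available, which is what makes the approach model-based.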
V. ANALYSIS
In this section, the convergence of parallel and serial fuzzy
Q-iteration is established. It is shown that there exists a
parameter vector θ* such that, for both algorithms, θ_ℓ → θ*
as ℓ → ∞. The consistency of the algorithms, i.e., the
convergence to the optimal Q-function Q* as the maximum
distance between the cores of adjacent fuzzy sets goes to 0,
is not studied here and is a topic for future research. It can
be shown, however, that under certain conditions, F(θ*) is
within a given bound of Q* [4], [5].
Proposition 1: Fuzzy Q-iteration (Algorithm 1) con-
verges.

Proof: The proof follows from the convergence proof
of value iteration with averagers [5], or with interpolative
representations [4]. This is because fuzzy approximation is
an averager by the definition in [5], and an interpolative
representation by the definition in [4]. For these types of
approximator, P and F are nonexpansions, making P T F a
contraction with factor γ, i.e., ‖P T F(θ) − P T F(θ′)‖_∞ ≤
γ ‖θ − θ′‖_∞, for any θ, θ′.
Similarly to the convergence proof for exact serial value
iteration in [3], it is shown below that the approximate serial
Q-iteration Algorithm 2 converges.
Proposition 2: Serial fuzzy Q-iteration (Algorithm 2) con-
verges.
Proof: Denote n = N · M, and rearrange the matrix
θ into a vector in R^n, placing first the elements of the first
row, then the second, etc. The element at row i and column
j of the matrix is now the l-th element of the vector, with
l = (i − 1) · M + j.
Define for all l = 0, . . . , n recursively the mappings
S_l : R^n → R^n as:

    S_0(θ) = θ
    [S_l(θ)]_{l′} = [P T F(S_{l−1}(θ))]_{l′}    if l′ = l
    [S_l(θ)]_{l′} = [S_{l−1}(θ)]_{l′}           if l′ ∈ {1, . . . , n} \ {l}

In words, S_l corresponds to updating the first l parameters
using approximate serial Q-iteration, and S_n is a complete
iteration of the approximate serial algorithm. Now we
show that S_n is a contraction, i.e., ‖S_n(θ) − S_n(θ′)‖_∞ ≤
γ ‖θ − θ′‖_∞, for any θ, θ′. This can be done element-by-element.
By the definition of S_l, the first element is only
updated by S_1:

    |[S_n(θ)]_1 − [S_n(θ′)]_1| = |[S_1(θ)]_1 − [S_1(θ′)]_1|
                               = |[P T F(θ)]_1 − [P T F(θ′)]_1|
                               ≤ γ ‖θ − θ′‖_∞

The last step follows from the contraction mapping property
of P T F.
Similarly, the second element is only updated by S_2:

    |[S_n(θ)]_2 − [S_n(θ′)]_2| = |[S_2(θ)]_2 − [S_2(θ′)]_2|
                               = |[P T F(S_1(θ))]_2 − [P T F(S_1(θ′))]_2|
                               ≤ γ ‖S_1(θ) − S_1(θ′)‖_∞
                               = γ max{ |[P T F(θ)]_1 − [P T F(θ′)]_1|, |θ_2 − θ′_2|, . . . , |θ_n − θ′_n| }
                               ≤ γ ‖θ − θ′‖_∞

where ‖S_1(θ) − S_1(θ′)‖_∞ is expressed by direct maximization
over its elements, and the contraction mapping property
of P T F is used twice.
Continuing in this fashion, we obtain
|[S_n(θ)]_l − [S_n(θ′)]_l| ≤ γ ‖θ − θ′‖_∞ for all l, and
thus S_n is a contraction. Therefore, serial fuzzy Q-iteration
converges.
This proof is actually more general, showing that approx-
imate serial Q-iteration converges for any approximation F
and projection P for which P T F is a contraction.
In the same way as exact serial value iteration [3], serial
fuzzy Q-iteration can be shown to converge at least as quickly
as Algorithm 1.
The following bound on the suboptimality of the computed
Q-function follows from [5], but applies only when the
action space of the original problem is discrete (i.e., no
discretization is necessary prior to fuzzy Q-iteration).
Proposition 3: If the original action space is discrete and
min_Q ‖Q − Q*‖_∞ = ε, where Q is any fixed point of the
composite mapping F P : Q → Q, then fuzzy Q-iteration
converges to θ* such that:

    ‖Q* − F(θ*)‖_∞ ≤ 2ε / (1 − γ)        (9)
For example, any Q-function which satisfies Q(x, u_j) =
Σ_{i=1}^N μ_i(x) Q(x_i, u_j), ∀x, j, is a fixed point of F P. In
particular, if the optimal Q-function has this form, i.e., is
exactly representable by the chosen fuzzy approximator, the
algorithm will converge to it (since in this case ε = 0).
In this section, we have established the parallel and
serial fuzzy Q-iteration as theoretically sound algorithms for
approximate RL in continuous-state tasks. When the original
action space is discrete, bounds on the derived Q-function
and policy were also shown to hold.
VI. SIMULATION EXAMPLE
As an illustrative example, fuzzy Q-iteration is applied in
simulation to the minimum-time stabilization of a two-link
manipulator.
A. Two-link Manipulator Model
The two-link manipulator, depicted in Figure 1, is de-
scribed by the fourth-order nonlinear model:
    M(α)α̈ + C(α, α̇)α̇ + G(α) = τ        (10)
where α = [α_1, α_2]^T, τ = [τ_1, τ_2]^T. The system has two
control inputs, the torques in the two joints, τ_1 and τ_2, and
four measured outputs: the link angles, α_1, α_2, and their
angular speeds, α̇_1, α̇_2.
In the sequel, it is assumed that the manipulator operates
in a horizontal plane, leading to G(α) = 0. The mass matrix
M(α) and the Coriolis and centrifugal forces matrix C(α, α̇)
have the following form:
    M(α) = [ P_1 + P_2 + 2 P_3 cos α_2    P_2 + P_3 cos α_2 ]
           [ P_2 + P_3 cos α_2            P_2               ]        (11)

    C(α, α̇) = [ b_1 − P_3 α̇_2 sin α_2    −P_3 (α̇_1 + α̇_2) sin α_2 ]
               [ P_3 α̇_1 sin α_2          b_2                       ]        (12)
Fig. 1. Schematic drawing of the two-link rigid manipulator.

TABLE I
PHYSICAL PARAMETERS OF THE MANIPULATOR

Symbols and values                         | Meaning
l_1 = l_2 = 0.4 m                          | link lengths
m_1 = 1.25 kg, m_2 = 0.8 kg                | link masses
I_1 = 0.066 kg·m², I_2 = 0.043 kg·m²       | link inertias
c_1 = c_2 = 0.2 m                          | centers of mass for the links
b_1 = 0.08 kg/s, b_2 = 0.02 kg/s           | dampings in the joints
τ_{1,max} = 1.5 Nm, τ_{2,max} = 1 Nm       | maximum motor torques
α̇_{1,max} = α̇_{2,max} = 2π rad/s          | maximum angular velocities
The meaning and values of the physical parameters of the
system are given in Table I. Using these, the rest of the
parameters in (10) can be computed by:
    P_1 = m_1 c_1² + m_2 l_1² + I_1,   P_2 = m_2 c_2² + I_2,   P_3 = m_2 l_1 c_2        (13)
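As a numerical check, (13) with the values of Table I gives the following (a small Python sketch; the numbers are derived from the table above):

```python
# Parameters from Table I.
m1, m2 = 1.25, 0.8          # link masses [kg]
l1 = 0.4                    # link 1 length [m]
c1, c2 = 0.2, 0.2           # centers of mass [m]
I1, I2 = 0.066, 0.043       # link inertias [kg m^2]

# Equation (13).
P1 = m1 * c1**2 + m2 * l1**2 + I1   # = 0.244
P2 = m2 * c2**2 + I2                # = 0.075
P3 = m2 * l1 * c2                   # = 0.064
```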
B. Setup of the RL Algorithm
The input of the RL controller (the process state) is x =
[α^T, α̇^T]^T, and its output (the command signal) is u = τ.
The discrete time step is set to T_S = 0.05 s, and the discrete-time
dynamics f are obtained by numerical integration of
(10) between consecutive time steps.
The control goal is the stabilization of the system around
α = α̇ = 0 in minimum time, with a tolerance of ±5 · π/180
rad for the angles, and ±0.1 rad/s for the angular speeds.
The reward function chosen to express this goal is:
    ρ(x, u) = { 0    if |α_p| ≤ 5 · π/180 rad and |α̇_p| ≤ 0.1 rad/s, p = 1, 2
              { −1   otherwise        (14)

where [α_1, α_2, α̇_1, α̇_2]^T = f(x, u) (the next state).
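A direct Python transcription of (14) could look as follows, assuming f(x, u) returns the next state [α_1, α_2, α̇_1, α̇_2]:

```python
import numpy as np

def reward(x, u, f):
    """Reward (14): 0 inside the goal region of the next state, -1 otherwise."""
    a1, a2, ad1, ad2 = f(x, u)   # next state: angles [rad] and angular speeds [rad/s]
    angle_ok = abs(a1) <= 5 * np.pi / 180 and abs(a2) <= 5 * np.pi / 180
    speed_ok = abs(ad1) <= 0.1 and abs(ad2) <= 0.1
    return 0.0 if angle_ok and speed_ok else -1.0
```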
Each torque signal τ_p, p = 1, 2, takes continuous values in
the corresponding interval [−τ_{p,max}, τ_{p,max}]. To apply fuzzy
Q-iteration, three discrete values are chosen for each torque:
−τ_{p,max} (maximal torque clockwise), 0, and τ_{p,max} (maximal
torque counter-clockwise).
Separately for each state component, a normal, complete
triangular fuzzy partition is defined. Such a partition is
completely determined by the core coordinates of the fuzzy
sets. For α̇_1 and α̇_2, the interval is partitioned into 7 fuzzy
sets, with their cores at {−360, −180, −30, 0, 30, 180, 360} ·
π/180 rad/s. This partition is depicted as an example in
Figure 2. For α_1 and α_2, 12 sets are used, with their cores
at {−180, −130, −80, −30, −15, −5, 0, 5, 15, 30, 80, 130} ·
π/180 rad. There is no fuzzy set with core π, because this
is identical with the first set, having the core −π (the angles
evolve on a circle manifold [−π, π)).
The fuzzy partition of the state space is then defined as
follows. One fuzzy set is computed for each combination
(i_1, . . . , i_4) of individual sets for the four state components
α_1, α_2, α̇_1, α̇_2. Such a fuzzy set has the following membership
function:

    μ(x) = μ_{α_1,i_1}(α_1) · μ_{α_2,i_2}(α_2) · μ_{α̇_1,i_3}(α̇_1) · μ_{α̇_2,i_4}(α̇_2)        (15)
This way of building the state space partition can be thought
of as a conjunction of one-dimensional concepts correspond-
ing to the fuzzy partitions of the individual state variables.
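As an illustration, the partition described above could be implemented as follows: triangular membership functions determined by their cores for each state variable, combined by the product (15). This is a sketch under the stated core values; it ignores the circular wrap-around of the angles for simplicity, and the function names are assumptions.

```python
import numpy as np

def triangular_memberships(z, cores):
    """Normalized triangular fuzzy partition of one variable, given its sorted cores."""
    mu = np.zeros(len(cores))
    if z <= cores[0]:
        mu[0] = 1.0
    elif z >= cores[-1]:
        mu[-1] = 1.0
    else:
        k = np.searchsorted(cores, z) - 1          # cores[k] <= z < cores[k+1]
        w = (z - cores[k]) / (cores[k + 1] - cores[k])
        mu[k], mu[k + 1] = 1.0 - w, w              # two neighboring sets share membership 1
    return mu

# Cores from the text (angles in rad, angular speeds in rad/s).
ang_cores = np.array([-180, -130, -80, -30, -15, -5, 0, 5, 15, 30, 80, 130]) * np.pi / 180
vel_cores = np.array([-360, -180, -30, 0, 30, 180, 360]) * np.pi / 180

def state_memberships(x):
    """Product membership (15) over all combinations of the 1-D sets."""
    parts = [triangular_memberships(x[0], ang_cores),
             triangular_memberships(x[1], ang_cores),
             triangular_memberships(x[2], vel_cores),
             triangular_memberships(x[3], vel_cores)]
    mu = parts[0]
    for p in parts[1:]:
        mu = np.outer(mu, p).ravel()               # conjunction of one-dimensional concepts
    return mu
```

Each returned vector has 12 · 12 · 7 · 7 = 7056 entries and sums to 1, in line with Assumptions 1 and 2 of Section IV.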
Fig. 2. The triangular fuzzy partition for the state variable α̇_1 ∈ [−2π, 2π]
(identical to the partition for α̇_2). Core values are in
{−2π, −π, −π/6, 0, π/6, π, 2π}.
Fig. 3. State, command, and reward signals for RL control (thin black line:
link 1, thick gray line: link 2); panels show the link angles [rad], link
velocities [rad/s], commanded torques [Nm], and reward versus time t [s].
The initial state is x_0 = [−π, −π, 0, 0]^T.
The fuzzy partition computed in this way still satisfies
Assumptions 1 and 2. It contains (12 · 7)² = 7056 sets.
An approximate optimal action-value function is computed
with serial and parallel fuzzy Q-iteration. The discount factor
is set to γ = 0.98.
C. Results
Figure 3 presents a controlled trajectory starting from
the initial state x_0 = [−π, −π, 0, 0]^T, together with the
corresponding command and reward signals. In order to
obtain a continuous policy from the computed Q-function,
the following heuristic is used. For any state value, an
action is computed by interpolating between the best local
actions, using the membership degrees as weights: h(x) =
Σ_{i=1}^N μ_i(x) u_{j_i}, where j_i is the index of the best local
action for the core state x_i, j_i = arg max_j Q̂*(x_i, u_j) =
arg max_j θ_{i,j}.
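A sketch of this interpolation heuristic in Python, assuming theta is the converged parameter matrix, mu returns the membership degrees of x, and U0 holds the discrete torque vectors u_j (illustrative names, not the authors' code):

```python
import numpy as np

def interpolated_policy(theta, mu, U0):
    """Continuous policy h(x) = sum_i mu_i(x) * u_{j_i}, with j_i = arg max_j theta[i, j]."""
    U0 = np.asarray(U0, dtype=float)           # (M, 2) array of discrete torque vectors
    best_local = U0[np.argmax(theta, axis=1)]  # best local action u_{j_i} per core, shape (N, 2)
    def h(x):
        return mu(x) @ best_local              # membership-weighted blend of best local actions
    return h
```

In this example, the discrete action set consists of the combinations of the three torque levels per joint.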
The controller successfully stabilizes the system in about
2.7 s. Because the control actions were originally continuous
and had to be discretized prior to running the fuzzy Q-
iteration, the bound (9) does not apply.
