
Asymptotic Optimization of a Nonlinear Hybrid System Governed by a Markov Decision Process

TLDR
In this paper, a continuous time stochastic hybrid control system with finite time horizon is considered; the objective is to minimize a nonlinear function of the state trajectory, and the problem is shown to be approximated by a deterministic optimal control problem.



ASYMPTOTIC OPTIMIZATION OF A NONLINEAR HYBRID SYSTEM GOVERNED BY A MARKOV DECISION PROCESS

EITAN ALTMAN AND VLADIMIR GAITSGORY

SIAM J. CONTROL OPTIM., Vol. 35, No. 6, pp. 2070–2085, November 1997
© 1997 Society for Industrial and Applied Mathematics
Abstract. We consider in this paper a continuous time stochastic hybrid control system with finite time horizon. The objective is to minimize a nonlinear function of the state trajectory. The state evolves according to a nonlinear dynamics. The parameters of the dynamics of the system may change at discrete times $l\epsilon$, $l = 0, 1, \ldots$, according to a controlled Markov chain which has finite state and action spaces. Under the assumption that $\epsilon$ is a small parameter, we justify an averaging procedure allowing us to establish that our problem can be approximated by the solution of some deterministic optimal control problem.

Key words. hybrid stochastic systems, asymptotic optimality, nonlinear dynamics, Markov decision processes, averaging

AMS subject classifications. 49B10, 49B50

PII. S0363012995279985
1. Introduction and statement of the problem. Consider the following hybrid stochastic control system. The state $Z_t \in R^n$ evolves according to the following dynamics:
$$\frac{d}{dt} Z_t = f(Z_t, Y_t), \qquad t \in [0, 1], \quad Z_0 = z, \tag{1}$$
where $Y_t \in R^k$ is the "control" to be specified later and $z$ is the initial state. $f$ is assumed to be linear in the second argument (for each value of the first argument), i.e.,
$$f(z, y) = f_1(z) + f_2(z)\, y, \tag{2}$$
where $f_1$ is an $n$-dimensional vector and $f_2$ is an $n \times k$ matrix; $f_2(z) y$ is the multiplication between the matrix $f_2(z)$ and the vector $y$. The functions $f_1(z)$ and $f_2(z)$ are supposed to be bounded and to satisfy the Lipschitz condition
$$\|f_i(z) - f_i(z')\|_1 \le C_1 \|z - z'\|_1 \quad \forall z, z', \tag{3}$$
$$\|f_i(z)\|_1 \le C_2, \tag{4}$$
where $z, z'$ are from a sufficiently large domain which contains all possible trajectories of (1), $C_1$ and $C_2$ are constants, and $\|\cdot\|_1$ stands for the $L_1$ norm in the finite-dimensional space. That is, $\|q\|_1 = \max_{i=1,\ldots,k} |q_i|$ for the vector $q = \{q_i\}$, $i = 1, \ldots, k$, and $\|A\|_1 = \max_{\|q\|_1 = 1} \|Aq\|_1$ for the matrix $A$ ($n \times k$).
It is assumed in what follows that there exists a bounded domain containing all the trajectories of (1), and, thus, (4), in fact, is implied by (3).
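As a simple illustration of the structure (2) (an invented example, not taken from the paper), consider a one-dimensional fluid buffer whose content $Z_t$ is fed at a controlled rate $Y^{(1)}_t$ and drained at a controlled rate $Y^{(2)}_t$:
$$\frac{d}{dt} Z_t = \underbrace{0}_{f_1(Z_t)} + \underbrace{\begin{pmatrix} 1 & -1 \end{pmatrix}}_{f_2(Z_t)} \begin{pmatrix} Y^{(1)}_t \\ Y^{(2)}_t \end{pmatrix} = Y^{(1)}_t - Y^{(2)}_t,$$
so that $n = 1$, $k = 2$, and $f_1$, $f_2$ are constant, hence trivially bounded and Lipschitz.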
Received by the editors January 13, 1995; accepted for publication (in revised form) September
11, 1996. The research undertaken in this paper was supported by the Australian Research Council
(ARC).
http://www.siam.org/journals/sicon/35-6/27998.html
INRIA, BP93, 2004 Route des Lucioles, 06902 Sophia Antipolis Cedex, France (altman@martingale.inria.fr).
School of Mathematics, University of South Australia, The Levels, Pooraka, South Australia
5095, Australia (mavg@lux.levels.unisa.edu.au).

$Y_t$ is not chosen directly by the controller but is obtained as a result of controlling the following underlying stochastic discrete event system. Let $\epsilon$ be the basic time unit. Time is discretized; i.e., transitions occur at times $t = n\epsilon$, $n = 0, 1, 2, \ldots, \lfloor \epsilon^{-1} \rfloor$, where $\lfloor x \rfloor$ stands for the greatest integer which is smaller than or equal to $x$. There is a finite state space $X = \{1, \ldots, N\}$ and a finite action space $A$. If a state is $v$ and an action $a$ is chosen, then the next state is $w$ with the probability $P_{vaw}$. A policy $u = \{u_0, u_1, \ldots\}$ in the set of policies $U$ is a sequence of probability measures on $A$; at each time $t = n\epsilon$ the controller chooses $u_n$ based on the history of all previous states and actions, as well as the present state. Thus, $u_n$ is a function that maps histories of the form $h_n = (x_0, a_0, x_1, a_1, \ldots, x_{n-1}, a_{n-1}, x_n)$ to probability measures on $A$.
We shall be especially interested in the following classes of policies:
• the Markov policies, denoted by $M$, i.e., policies for which $u_t$ depends only on the current state and does not depend on previous states and actions;
• the stationary policies, denoted by $S$, i.e., policies for which $u_t$ depends only on the current state and does not depend on previous states and actions nor on the time.
The stochastic process $\{X_n, A_n\}$ is known as a controlled Markov chain, or Markov decision process (MDP); see Derman [11, pp. 2–4]. We assume throughout the paper that under any stationary policy, the state space forms an aperiodic Markov chain such that all states communicate (regular Markov chain). The results of the paper hold, in fact, under weaker ergodicity assumptions; however, the restricted assumption makes the presentation clearer.
Denote by $H$ the set of all possible states and actions histories which can be observed until time $\lfloor \epsilon^{-1} \rfloor$:
$$H = \bigcup \{h\}, \qquad h = \left( (x_n, a_n),\ n = 0, 1, \ldots, \lfloor \epsilon^{-1} \rfloor \right).$$
Let $F$ be the $\sigma$-algebra of all subsets of $H$. Each policy $u$ and initial state $x$ determines a probability measure on $F$, on which the stochastic state and action process $H = \{X_n, A_n,\ n = 0, 1, \ldots, \lfloor \epsilon^{-1} \rfloor\}$ is defined. Denote by $P^u_x$ and $E^u_x$ the probability measure and mathematical expectation that correspond to an initial state $X_0 = x$ and a policy $u$. Sometimes we shall assume an initial distribution $\xi$ on $X_0$, instead of a fixed initial state. In that case $P^u_\xi$, $E^u_\xi$ denote the corresponding probability measure and mathematical expectation.
Let $y = \{y_j\} : X \times A \to R^k$, $j = 1, \ldots, k$, be some given vector-valued function. Then $Y_t$ in (1) is given by
$$Y_t = y(X_{\lfloor t/\epsilon \rfloor}, A_{\lfloor t/\epsilon \rfloor}). \tag{5}$$
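To make the two time scales concrete, the following sketch simulates one trajectory of (1), (5) for a toy example: a two-state, two-action MDP whose transitions occur every $\epsilon$ time units under a fixed stationary policy, driving a one-dimensional $Z_t$ with dynamics of the form (2). All numerical data (transition probabilities, policy, $y(v,a)$, $f_1$, $f_2$) are invented for illustration and are not taken from the paper.

```python
import numpy as np

eps = 0.01                                   # the small time unit epsilon
X, A = [0, 1], [0, 1]                        # toy state and action spaces
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # transition probabilities P[v, a, w]
              [[0.3, 0.7], [0.6, 0.4]]])
u = np.array([[0.5, 0.5],                    # a stationary randomized policy u(a|v)
              [0.8, 0.2]])
y = {(0, 0): np.array([1.0, 0.0]), (0, 1): np.array([1.0, 1.0]),
     (1, 0): np.array([0.0, 1.0]), (1, 1): np.array([0.5, 0.5])}   # y(v, a) in R^2

# linear-in-control dynamics (2): f(z, y) = f1(z) + f2(z) y, with n = 1, k = 2
f1 = lambda z: np.array([-0.2 * z[0]])
f2 = lambda z: np.array([[1.0, -1.0]])

rng = np.random.default_rng(0)
z, x = np.array([0.5]), 0                    # Z_0 = z and the initial MDP state
for n in range(int(1.0 / eps)):              # discrete times t = n * eps
    a = rng.choice(A, p=u[x])                # action drawn from the stationary policy
    Y = y[(x, a)]                            # Y_t is frozen on [n*eps, (n+1)*eps), cf. (5)
    z = z + eps * (f1(z) + f2(z) @ Y)        # Euler step of dZ/dt = f(Z, Y)
    x = rng.choice(X, p=P[x, a])             # MDP jump to the next state
print("Z_1 ~", z)
```

As $\epsilon$ decreases, the jumps of $Y_t$ become faster relative to the motion of $Z_t$, which is the regime in which the averaging result below applies.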
The system (1) with thus-defined $Y_t$ is called hybrid, first, because $Y_t$ changes its values via some random jumps whereas $Z_t$ is a smooth (differentiable) function of time and, second, because, as follows from the consideration below, $Y_t$, being controlled "statistically" through controlling the transition probabilities, plays by itself the role of a "direct" control with respect to $Z_t$.
Let $g : R^n \to R$ be some operating cost related to the process $Z_t$. We assume that it is Lipschitz continuous; i.e.,
$$\|g(z) - g(z')\|_1 \le C_1 \|z - z'\|_1.$$
We consider the following control problem with $\epsilon$ and $x$ fixed.

$Q_\epsilon$: find a policy $u$ that achieves $F_\epsilon(z, x) = \inf_{u \in U} E^u_x\, g(Z_1)$, where $Z_1$ is obtained through (1).
Our model is characterized by the fact that $\epsilon$ is supposed to be a small parameter, and our objective is to construct a policy (depending, in general, on $\epsilon$) which is asymptotically optimal for $Q_\epsilon$. That is, the difference between the cost under this policy and $F_\epsilon(z, x)$ converges to zero as $\epsilon \to 0$.
The type of model which we introduce is natural in the control of inventories or of production, where we deal with material whose quantity may "slowly" change in a continuous (linear) way. Breakdowns, repairs, and other control decisions yield the underlying MDP. Our model may also be used in the control of highly loaded queueing networks for which the fluid approximation holds (see Kleinrock [20, p. 56]). The slow variables $Z_t$ may then represent the number of customers in the different queues, whereas the underlying MDP may correspond to routing, or flow control of, say, some long on/off traffic.
The fact that $\epsilon$ is chosen to be small means that the variables $Y_t$, along with the MDP $X_t$, can be considered to be fast with respect to the time scale $t$ in which $Z_t$ evolves. Indeed, $Y_t$ and $X_t$ may have large jumps between $t = m\epsilon$ and $t = (m+1)\epsilon$, whereas the corresponding change in $Z_t$ in that period is of order $\epsilon$. The problem is, thus, close in nature to singularly perturbed stochastic control problems intensively studied in the literature (see, for example, [1], [5], [6], [7], [9], [10], [21], [23], [24], [25] and references therein). A common approach to this kind of problem is an application of singular perturbations or averaging techniques to the Hamilton–Jacobi–Bellman (HJB) equation for problems in continuous time (as in [5], [6], [21]) or to the dynamic programming equation for singularly perturbed MDPs [1], [7], [9], [10], [24], [25]. In contrast to this approach, we, as in [23], apply an averaging method directly to the "slow" stochastic equation. Our model differs, however, from the ones in [23] in many respects—mainly in the type of fast motions involved, which implies differences in both the technique used and the results obtained.
In our previous paper [2], we considered a problem similar to $Q_\epsilon$ for the case of linear dynamics $f$ and cost $g$ and showed that an asymptotically optimal policy can be constructed via maximization of the Hamiltonian of some linear deterministic system. The technique we used was, however, strongly related to the linearity of the model, and it is not applicable to the case when the dynamics and/or the cost are nonlinear. As opposed to the linear case, the treatment of the nonlinear case is much more involved and is based on an ergodicity-type result for MDPs obtained in this paper (see Theorem 4.1 below). Using this result we establish that the trajectories of the stochastic hybrid system (1) are approximated by the trajectories of some nonlinear deterministic control system, and the problem $Q_\epsilon$ is approximated by the corresponding deterministic optimal control problem, allowing us, in particular, to construct an asymptotically optimal policy for $Q_\epsilon$. Notice that this result can be viewed as an extension of the averaging technique for deterministic singularly perturbed control systems (see, e.g., [15]) to the stochastic case under consideration. On the other hand, it can be viewed as an extension of results on uncontrolled motions, establishing that the solution of the original stochastic system is approximated by the solution of some deterministic system obtained via averaging over the fast random dynamics [16], [19], [22], to the case when this random dynamics is defined by the controlled Markov chain.
The paper consists of four sections. Section 1 is this introduction; section 2
describes the main results about the approximation of the problem of optimal control
of the hybrid system by a deterministic optimal control problem. In section 3 we
discuss ways that the solution of the deterministic optimal control problem can be
characterized and how it can be used to obtain an asymptotically optimal policy.
Section 4 contains the above-mentioned Theorem 4.1, as well as the proofs of some
basic lemmas used in section 2.
2. Description of main results. Let
$$\mathcal{Y}(m, x) \stackrel{\rm def}{=} \bigcup_{u \in U} \left\{ (m+1)^{-1} \sum_{t=0}^{m} E^u_x\, Y_{t\epsilon} \right\},$$
where the union is taken over all policies. As follows from Theorem 3 in [2], the set $\mathcal{Y}(m, x)$ converges in the Hausdorff metric to a set $\mathcal{Y}$ defined below:
$$\lim_{m \to \infty} \mathcal{Y}(m, x) = \mathcal{Y} \stackrel{\rm def}{=} \bigcup_{u \in S} \left\{ \sum_{v,a} \eta(u; v, a)\, y(v, a) \right\}, \tag{6}$$
where the union is taken over all stationary policies, and $\eta(u) = \{\eta(u; v, a)\}$ is the vector of steady state probabilities of state-action pairs obtained when using a stationary policy $u$. That is,
$$\eta(u; v, a) = \lim_{n \to \infty} P^u_x (X_n = v,\ A_n = a). \tag{7}$$
Notice that due to the ergodicity assumption on our model, $\eta(u; v, a)$ does not depend on the initial distribution. Notice also that, since the set
$$W \stackrel{\rm def}{=} \bigcup_{u \in S} \{\eta(u)\} \tag{8}$$
is a polyhedron (see, for example, [11, pp. 93–95]), the set $\mathcal{Y}$ is a polyhedron as well.
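For a fixed stationary policy $u$, the vector $\eta(u)$ in (7)–(8) can be computed numerically: $u$ induces a Markov chain on $X$ with kernel $P_u(v, w) = \sum_a u(a|v) P_{vaw}$, whose (unique, under the regularity assumption) stationary distribution $\pi$ gives $\eta(u; v, a) = \pi(v)\, u(a|v)$. A minimal sketch, reusing the hypothetical toy data from the simulation above:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # toy transition probabilities P[v, a, w]
              [[0.3, 0.7], [0.6, 0.4]]])
u = np.array([[0.5, 0.5],                    # a stationary policy u(a|v)
              [0.8, 0.2]])
N, nA = u.shape

P_u = np.einsum('va,vaw->vw', u, P)          # induced chain: P_u[v, w] = sum_a u(a|v) P[v, a, w]

# stationary distribution: solve pi^T (P_u - I) = 0 together with sum(pi) = 1
A_mat = np.vstack([(P_u - np.eye(N)).T, np.ones(N)])
b = np.concatenate([np.zeros(N), [1.0]])
pi, *_ = np.linalg.lstsq(A_mat, b, rcond=None)

eta = pi[:, None] * u                        # eta(u; v, a) = pi(v) u(a|v), cf. (7)
print(eta, eta.sum())                        # the entries sum to 1
```

Running this over the finitely many deterministic stationary policies of the toy example and taking convex hulls gives its polyhedra $W$ and $\mathcal{Y}$.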
Define now the averaged deterministic control system as
$$\frac{d}{dt} z_t = f(z_t, y_t), \qquad z_0 = z, \tag{9}$$
where $y_t$ is a measurable function of $t$ taking values in $\mathcal{Y}$. The set of such functions
$$y : [0, 1] \to \mathcal{Y}$$
will be called the set of admissible controls.
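A minimal sketch of integrating the averaged system (9) with a piecewise constant admissible control follows. The two control values used below, roughly $(0.75, 0.25)$ and $(0.71, 0.71)$, are the steady-state averages $\sum_{v,a} \eta(v,a)\, y(v,a)$ of the two constant-action stationary policies of the toy MDP above, so they lie in $\mathcal{Y}$ by (6); everything here is illustrative, not from the paper.

```python
import numpy as np

# averaged dynamics (9) with the toy f1, f2 used earlier
f1 = lambda z: np.array([-0.2 * z[0]])
f2 = lambda z: np.array([[1.0, -1.0]])

# two points of the set Y (steady-state averages of two stationary policies)
y_bar = {0: np.array([0.75, 0.25]), 1: np.array([0.7143, 0.7143])}

def y_ctrl(t):
    """A piecewise constant admissible control y : [0, 1] -> Y."""
    return y_bar[0] if t < 0.5 else y_bar[1]

dt = 1e-3
z = np.array([0.5])                          # z_0 = z, same initial condition as (1)
for step in range(int(1.0 / dt)):
    z = z + dt * (f1(z) + f2(z) @ y_ctrl(step * dt))   # Euler step of (9)
print("z_1 ~", z)
```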
Our claim is that the set of all random trajectories of (1) is approximated by the set of solutions of (9) obtained with all admissible controls. More specifically, we establish that there exists a function $\gamma(\epsilon)$ satisfying
$$\lim_{\epsilon \to 0} \gamma(\epsilon) = 0$$
such that the following holds.
LEMMA 2.1. Corresponding to any admissible control $y = \{y_t,\ t \in [0, 1]\}$, there exists a Markov policy $u_\epsilon(y)$ such that the random trajectory $Z_t$ of (1), obtained with this policy $u_\epsilon(y)$, and the deterministic solution $z^y_t$ of (9), obtained with $y$, satisfy the inequality
$$\max_{t \in [0,1]} E^{u_\epsilon(y)}_x \|Z_t - z^y_t\|_1 \le \gamma(\epsilon). \tag{10}$$

LEMMA 2.2. There exists a function $\tilde y_t(h)$,
$$\tilde y : [0, 1] \times H \to \mathcal{Y},$$
such that (a) for each $h \in H$, $\tilde y_t(h)$ is a piecewise constant function of $t$ and (b) for any policy $u$,
$$\max_{t \in [0,1]} E^u_x \|Z_t - \tilde z_t(H)\|_1 \le \gamma(\epsilon), \tag{11}$$
where $Z_t$ is the solution of (1), $\tilde z_t(H)$ is the solution of (9) obtained with $y_t = \tilde y_t(H)$, and $H$ is the random realization of the state-action trajectories.
Notice that the quantity under the expectation sign in (11) is a random variable for any policy $u$ since $H$ is a finite set and $F$ is the $\sigma$-algebra of all subsets of $H$. Notice also that a construction of a policy $u_\epsilon(y)$ which allows the estimate (10) in Lemma 2.1 is described below in section 3. This is just a stationary policy when the deterministic control $y$ is a constant function of time, and it consists of a finite number of stationary policies (and thus is not stationary itself) when $y$ is piecewise constant.
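The following sketch illustrates the constant-control case of that construction on the toy data used above: given a target value $\bar y \in \mathcal{Y}$, it finds some $\eta \in W$ with $\sum_{v,a} \eta(v,a)\, y(v,a) = \bar y$ (using the standard linear description of $W$ by balance equations) and reads off a stationary policy $u(a|v) = \eta(v,a) / \sum_a \eta(v,a)$ whose long-run average of $Y_t$ is $\bar y$. The target value and all data are hypothetical; section 3 of the paper should be consulted for the actual construction and its error estimate.

```python
import numpy as np
from scipy.optimize import linprog

P = np.array([[[0.9, 0.1], [0.2, 0.8]],           # toy transition probabilities P[v, a, w]
              [[0.3, 0.7], [0.6, 0.4]]])
yva = np.array([[[1.0, 0.0], [1.0, 1.0]],         # toy y(v, a) in R^2
                [[0.0, 1.0], [0.5, 0.5]]])
N, nA, k = P.shape[0], P.shape[1], yva.shape[2]
y_bar = np.array([0.65, 0.5])                     # target point, assumed to lie in Y

# equality constraints on the flattened eta(v, a)
A_eq, b_eq = [], []
for w in range(N):                                # balance: sum_a eta(w,a) = sum_{v,a} eta(v,a) P[v,a,w]
    row = np.zeros((N, nA)); row[w, :] += 1.0; row -= P[:, :, w]
    A_eq.append(row.ravel()); b_eq.append(0.0)
A_eq.append(np.ones(N * nA)); b_eq.append(1.0)    # eta is a probability vector
for j in range(k):                                # match the target average
    A_eq.append(yva[:, :, j].ravel()); b_eq.append(y_bar[j])

res = linprog(c=np.zeros(N * nA), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (N * nA))      # pure feasibility problem
eta = res.x.reshape(N, nA)
u = eta / eta.sum(axis=1, keepdims=True)          # stationary policy realizing y_bar on average
print(u)
```

For a piecewise constant $y$, the same computation would be repeated piece by piece, and the resulting stationary policies applied one after another on the corresponding time intervals.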
Define the "deterministic" optimal control problem $Q_0$ as follows.
$Q_0$: Find an admissible control $y$ which minimizes the cost function
$$F_0(z) \stackrel{\rm def}{=} \inf_y g(z_1)$$
over the trajectories $z$ of system (9). The following theorem about approximation of $Q_\epsilon$ by $Q_0$ is then easily established on the basis of Lemmas 2.1 and 2.2.
THEOREM 2.1. The values $F_\epsilon(z, x)$ of the original problem $Q_\epsilon$ converge to the value $F_0(z)$ of the problem $Q_0$ as $\epsilon \to 0$. More precisely,
$$\left| F_\epsilon(z, x) - F_0(z) \right| \le C_1 \gamma(\epsilon).$$
If $y^*$ is an optimal control for $Q_0$, then the Markov policy $u_\epsilon(y^*)$ allowing estimate (10) with $y = y^*$ satisfies the inequality
$$E^{u_\epsilon(y^*)}_x g(Z_1) - F_\epsilon(z, x) \le C_1 \gamma(\epsilon).$$
That is, $u_\epsilon(y^*)$ is asymptotically optimal for $Q_\epsilon$.
Remark 2.1. In the linear case studied in [2], $\gamma$ can be chosen such that
$$\lim_{\epsilon \to 0} \epsilon^{-(1/2)} \gamma(\epsilon) = 0.$$
Hence, for the linear case, simple bounds on the rate of convergence are available for Lemmas 2.1 and 2.2 as well as for Theorem 2.1.
Proof of Theorem 2.1. Let $u$ be an arbitrary policy and $\tilde y_t(h) \in \mathcal{Y}$ be the function defined in Lemma 2.2. Then
$$\left| E^u_x\, g(Z_1) - E^u_x\, g(\tilde z_1(H)) \right| \le C_1 E^u_x \|Z_1 - \tilde z_1(H)\|_1 \le C_1 \gamma(\epsilon), \tag{12}$$
where $C_1$ is defined in (3). Being piecewise constant, the function $\tilde y$ is measurable in $t$. Hence,
$$g(\tilde z_1(h)) \ge F_0(z) \quad \forall h \in H,$$

Citations
Book ChapterDOI

Towards a Theory of Stochastic Hybrid Systems

TL;DR: The invariant distribution and the exit probability from an interval of the Markov chain are studied, and it is shown that they converge to their counterparts for the solution process of the original SDE as the discretization step goes to zero, providing a useful tool for studying various sample path properties of the SDE.
Journal ArticleDOI

A compositional modelling and analysis framework for stochastic hybrid systems

TL;DR: HModest is presented, an extension to the Modest modelling language—which is originally designed for stochastic timed systems without complex continuous aspects—that adds differential equations and inclusions as an expressive way to describe the continuous system evolution.
Proceedings ArticleDOI

Measurability and safety verification for stochastic hybrid systems

TL;DR: Stochastic hybrid systems are considered where the continuous-time behaviour is given by differential equations, as for usual hybrid systems, but the targets of discrete jumps are chosen by probability distributions; it is shown that measurability of the complete system follows from the measurability of its constituent parts.
Book ChapterDOI

Safety verification for probabilistic hybrid systems

TL;DR: This paper considers probabilistic hybrid systems and develops a general abstraction technique that can formally verify safety properties of non-trivial continuous-time stochastic hybrid systems—without resorting to point-wise discretisation.
Journal ArticleDOI

Optimal and Hierarchical Controls in Dynamic Stochastic Manufacturing Systems: A Survey

TL;DR: In this paper, the authors present a survey of the research devoted to proving that a hierarchy based on the frequencies of occurrence of different types of events results in decisions that are asymptotically optimal as the rates of some events become large compared to those of others.
References
Book

Optimization and nonsmooth analysis

TL;DR: Develops the theory of generalized gradients and nonsmooth analysis, with applications to optimization, the calculus of variations, and optimal control.
Book

Controlled Markov processes and viscosity solutions

TL;DR: In this paper, an introduction to optimal stochastic control for continuous time Markov processes and to the theory of viscosity solutions is given, as well as a concise introduction to two-controller, zero-sum differential games.
Book

Finite Markov chains

TL;DR: A classic treatment of the theory of finite Markov chains, including absorbing and ergodic chains and the fundamental matrix.
Book

Deterministic and stochastic optimal control

TL;DR: A standard text on deterministic optimal control via the calculus of variations and the Pontryagin principle, together with stochastic optimal control of Markov diffusion processes.
Frequently Asked Questions (9)
Q1. What have the authors contributed in "Asymptotic Optimization of a Nonlinear Hybrid System Governed by a Markov Decision Process"?

The authors consider in this paper a continuous time stochastic hybrid control system with finite time horizon. Under the assumption that $\epsilon$ is a small parameter, the authors justify an averaging procedure allowing them to establish that their problem can be approximated by the solution of some deterministic optimal control problem.

The class of stationary policies is compact; i.e., for any sequence $u(i) \in S$, there exists a subsequence $u(i_j)$ such that the policy $u^* = \lim_{j \to \infty} u(i_j)$ (i.e., the policy for which $u^*(a|x) = \lim_{j \to \infty} u(i_j)(a|x)$ for all $a$ and $x$) is stationary.

In the linear case studied in [2], $\gamma$ can be chosen such that $\lim_{\epsilon \to 0} \epsilon^{-(1/2)} \gamma(\epsilon) = 0$. Hence, for the linear case, simple bounds on the rate of convergence are available for Lemmas 2.1 and 2.2 as well as for Theorem 2.1.

LEMMA 4.1. Let $y^i_t(h)$, $i = 1, 2$, be functions of time $t$ and state-action histories $h$. Let $z^i_t(h)$ be the solution of (9) obtained with $y^i_t(h)$ ($h$ is fixed), $i = 1, 2$.

Substituting the last inequality in (42), one obtains
$$\max_{t \in [0,1]} E^{u_\epsilon(y)}_x \|Z_t - z^y_t\|_1 \le L \left[ \Delta(\epsilon) + (L_1 + L_2) \Delta(\epsilon) + L_3 \mu(K(\epsilon)) \right],$$
which, by (43), completes the proof of the lemma.

Since for any initial distribution $\xi$ and for any stationary policy $s(i)$, the authors have $\psi_0 = \eta(s(i))$, $P^{s(i)}_\xi$-a.s. (38), it follows, by choosing the sequence of times $t(i)$ so that the intervals $t(i+1) - t(i)$ are sufficiently large, that (36) implies that $\lim_{i \to \infty}$

As in this theorem, one can also establish that $\lim_{\epsilon \to 0} B_\epsilon(z, x, s) = B_0(z, s)$, with the convergence being uniform with respect to $s \in [0, 1]$, $x \in X$, and $z \in Z$, where $Z$ is a compact subset of $R^n$. Notice that the described approach has a decomposition structure.

It follows by arguments as in the first part of the proof that there exist sequences of times $t(i)$ and of stationary policies $s(i)$, and a constant $\alpha_4 > 0$, such that for all $i$, $E^{s(i)}_\xi d^2_{t(i)} \ge \alpha_4$ (36) for any initial distribution $\xi$.

The optimal value of the above problem does not depend on the initial distribution $\xi$, and it is equal to the optimal value of the following linear programming problem:
$$J_\xi(z, \lambda) = J(z, \lambda) \stackrel{\rm def}{=} \min_\eta \left\{ \sum_{v,a} r(z, \lambda; v, a)\, \eta(v, a) \;\Big|\; \eta = \{\eta(v, a)\} \in W \right\} \tag{19}$$
$$= \lambda^T f_1(z) + \min_\eta \left\{ \lambda^T f_2(z) \sum_{v,a} y(v, a)\, \eta(v, a) \;\Big|\; \eta = \{\eta(v, a)\} \in W \right\}.$$
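A sketch of solving the linear program (19) numerically, using the same linear description of $W$ and the toy data from the earlier sketches; the cost coefficients are taken as $r(z, \lambda; v, a) = \lambda^T \bigl( f_1(z) + f_2(z)\, y(v, a) \bigr)$, which is consistent with the second line of (19) but is an assumption of this sketch rather than a quotation of the paper.

```python
import numpy as np
from scipy.optimize import linprog

# toy MDP and dynamics data (hypothetical, as in the earlier sketches)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])          # P[v, a, w]
yva = np.array([[[1.0, 0.0], [1.0, 1.0]],
                [[0.0, 1.0], [0.5, 0.5]]])        # y(v, a) in R^2
f1 = lambda z: np.array([-0.2 * z[0]])
f2 = lambda z: np.array([[1.0, -1.0]])

def J(z, lam, P=P, yva=yva):
    """Value of the LP (19): min over eta in W of sum_{v,a} r(z, lam; v, a) eta(v, a),
    with r(z, lam; v, a) = lam^T (f1(z) + f2(z) y(v, a))."""
    N, nA, _ = P.shape
    r = np.array([[lam @ (f1(z) + f2(z) @ yva[v, a]) for a in range(nA)]
                  for v in range(N)])             # cost coefficients r(z, lam; v, a)
    A_eq, b_eq = [], []
    for w in range(N):                            # balance: sum_a eta(w,a) = sum_{v,a} eta(v,a) P[v,a,w]
        row = np.zeros((N, nA)); row[w, :] += 1.0; row -= P[:, :, w]
        A_eq.append(row.ravel()); b_eq.append(0.0)
    A_eq.append(np.ones(N * nA)); b_eq.append(1.0)  # eta is a probability vector
    res = linprog(c=r.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (N * nA))
    return res.fun

print(J(np.array([0.5]), np.array([1.0])))
```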