
Asymptotic Optimization of a Nonlinear Hybrid System Governed by a Markov Decision Process

TLDR
In this paper, a continuous time stochastic hybrid control system with finite time horizon is considered; the objective is to minimize a nonlinear function of the state trajectory, and the problem is shown to be approximated by a deterministic optimal control problem.



ASYMPTOTIC OPTIMIZATION OF A NONLINEAR HYBRID SYSTEM GOVERNED BY A MARKOV DECISION PROCESS

EITAN ALTMAN AND VLADIMIR GAITSGORY

SIAM J. CONTROL OPTIM., Vol. 35, No. 6, pp. 2070–2085, November 1997
© 1997 Society for Industrial and Applied Mathematics
Abstract. We consider in this paper a continuous time stochastic hybrid control system with finite time horizon. The objective is to minimize a nonlinear function of the state trajectory. The state evolves according to a nonlinear dynamics. The parameters of the dynamics of the system may change at discrete times $l\epsilon$, $l = 0, 1, \ldots$, according to a controlled Markov chain which has finite state and action spaces. Under the assumption that $\epsilon$ is a small parameter, we justify an averaging procedure allowing us to establish that our problem can be approximated by the solution of some deterministic optimal control problem.

Key words. hybrid stochastic systems, asymptotic optimality, nonlinear dynamics, Markov decision processes, averaging

AMS subject classifications. 49B10, 49B50

PII. S0363012995279985
1. Introduction and statement of the problem. Consider the following hybrid stochastic control system. The state $Z_t \in R^n$ evolves according to the following dynamics:
$$\frac{d}{dt} Z_t = f(Z_t, Y_t), \qquad t \in [0, 1], \quad Z_0 = z, \tag{1}$$
where $Y_t \in R^k$ is the "control" to be specified later and $z$ is the initial state. $f$ is assumed to be linear in the second argument (for each value of the first argument), i.e.,
$$f(z, y) = f_1(z) + f_2(z)\, y, \tag{2}$$
where $f_1$ is an $n$-dimensional vector and $f_2$ is an $n \times k$ matrix; $f_2(z) y$ is the multiplication between the matrix $f_2(z)$ and the vector $y$. The functions $f_1(z)$ and $f_2(z)$ are supposed to be bounded and to satisfy the Lipschitz condition
$$\|f_i(z) - f_i(z')\|_1 \le C_1 \|z - z'\|_1 \quad \forall z, z', \tag{3}$$
$$\|f_i(z)\|_1 \le C_2, \tag{4}$$
where $z, z'$ are from a sufficiently large domain which contains all possible trajectories of (1), $C_1$ and $C_2$ are constants, and $\|\cdot\|_1$ stands for the $L_1$ norm in the finite-dimensional space. That is, $\|q\|_1 = \max_{i=1,\ldots,k} |q_i|$ for the vector $q = \{q_i\}$, $i = 1, \ldots, k$, and $\|A\|_1 = \max_{\|q\|_1 = 1} \|Aq\|_1$ for the matrix $A$ ($n \times k$).
It is assumed in what follows that there exists a bounded domain containing all the trajectories of (1), and, thus, (4), in fact, is implied by (3).
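As a simple illustration of the structure (2) (an invented example, not taken from the paper), consider a one-dimensional fluid buffer whose content $Z_t$ is fed at a controlled rate $Y^{(1)}_t$ and drained at a controlled rate $Y^{(2)}_t$:
$$\frac{d}{dt} Z_t = \underbrace{0}_{f_1(Z_t)} + \underbrace{\begin{pmatrix} 1 & -1 \end{pmatrix}}_{f_2(Z_t)} \begin{pmatrix} Y^{(1)}_t \\ Y^{(2)}_t \end{pmatrix} = Y^{(1)}_t - Y^{(2)}_t,$$
so that $n = 1$, $k = 2$, and $f_1$, $f_2$ are constant, hence trivially bounded and Lipschitz.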
Received by the editors January 13, 1995; accepted for publication (in revised form) September
11, 1996. The research undertaken in this paper was supported by the Australian Research Council
(ARC).
http://www.siam.org/journals/sicon/35-6/27998.html
INRIA, BP93, 2004 Route des Lucioles, 06902 Sophia Antipolis Cedex, France (altman@martingale.inria.fr).
School of Mathematics, University of South Australia, The Levels, Pooraka, South Australia
5095, Australia (mavg@lux.levels.unisa.edu.au).

$Y_t$ is not chosen directly by the controller but is obtained as a result of controlling the following underlying stochastic discrete event system. Let $\epsilon$ be the basic time unit. Time is discretized; i.e., transitions occur at times $t = n\epsilon$, $n = 0, 1, 2, \ldots, \lfloor \epsilon^{-1} \rfloor$, where $\lfloor x \rfloor$ stands for the greatest integer which is smaller than or equal to $x$. There is a finite state space $X = \{1, \ldots, N\}$ and a finite action space $A$. If a state is $v$ and an action $a$ is chosen, then the next state is $w$ with the probability $P_{vaw}$. A policy $u = \{u_0, u_1, \ldots\}$ in the set of policies $U$ is a sequence of probability measures on $A$; at each time $t = n\epsilon$ the controller chooses $u_n$ based on the history of all previous states and actions, as well as the present state. Thus, $u_n$ is a function that maps histories of the form $h_n = (x_0, a_0, x_1, a_1, \ldots, x_{n-1}, a_{n-1}, x_n)$ to probability measures on $A$.
We shall be especially interested in the following classes of policies:
• the Markov policies, denoted by $M$, i.e., policies for which $u_t$ depends only on the current state and does not depend on previous states and actions;
• the stationary policies, denoted by $S$, i.e., policies for which $u_t$ depends only on the current state and does not depend on previous states and actions nor on the time.
The stochastic process $\{X_n, A_n\}$ is known as a controlled Markov chain, or Markov decision process (MDP); see Derman [11, pp. 2–4]. We assume throughout the paper that under any stationary policy, the state space forms an aperiodic Markov chain such that all states communicate (regular Markov chain). The results of the paper hold, in fact, under weaker ergodicity assumptions; however, the restricted assumption makes the presentation clearer.
Denote by $H$ the set of all possible states and actions histories which can be observed until time $\lfloor \epsilon^{-1} \rfloor$:
$$H = \bigcup \{h\}, \qquad h = \left( (x_n, a_n),\ n = 0, 1, \ldots, \lfloor \epsilon^{-1} \rfloor \right).$$
Let $F$ be the $\sigma$-algebra of all subsets of $H$. Each policy $u$ and initial state $x$ determines a probability measure on $F$, on which the stochastic state and action process $H = \{X_n, A_n,\ n = 0, 1, \ldots, \lfloor \epsilon^{-1} \rfloor\}$ is defined. Denote by $P^u_x$ and $E^u_x$ the probability measure and mathematical expectation that correspond to an initial state $X_0 = x$ and a policy $u$. Sometimes we shall assume an initial distribution $\xi$ on $X_0$, instead of a fixed initial state. In that case $P^u_\xi$, $E^u_\xi$ denote the corresponding probability measure and mathematical expectation.
Let $y = \{y_j\} : X \times A \to R^k$, $j = 1, \ldots, k$, be some given vector-valued function. Then $Y_t$ in (1) is given by
$$Y_t = y(X_{\lfloor t/\epsilon \rfloor}, A_{\lfloor t/\epsilon \rfloor}). \tag{5}$$
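To make the two time scales concrete, the following sketch simulates one trajectory of (1), (5) for a toy example: a two-state, two-action MDP whose transitions occur every $\epsilon$ time units under a fixed stationary policy, driving a one-dimensional $Z_t$ with dynamics of the form (2). All numerical data (transition probabilities, policy, $y(v,a)$, $f_1$, $f_2$) are invented for illustration and are not taken from the paper.

```python
import numpy as np

eps = 0.01                                   # the small time unit epsilon
X, A = [0, 1], [0, 1]                        # toy state and action spaces
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # transition probabilities P[v, a, w]
              [[0.3, 0.7], [0.6, 0.4]]])
u = np.array([[0.5, 0.5],                    # a stationary randomized policy u(a|v)
              [0.8, 0.2]])
y = {(0, 0): np.array([1.0, 0.0]), (0, 1): np.array([1.0, 1.0]),
     (1, 0): np.array([0.0, 1.0]), (1, 1): np.array([0.5, 0.5])}   # y(v, a) in R^2

# linear-in-control dynamics (2): f(z, y) = f1(z) + f2(z) y, with n = 1, k = 2
f1 = lambda z: np.array([-0.2 * z[0]])
f2 = lambda z: np.array([[1.0, -1.0]])

rng = np.random.default_rng(0)
z, x = np.array([0.5]), 0                    # Z_0 = z and the initial MDP state
for n in range(int(1.0 / eps)):              # discrete times t = n * eps
    a = rng.choice(A, p=u[x])                # action drawn from the stationary policy
    Y = y[(x, a)]                            # Y_t is frozen on [n*eps, (n+1)*eps), cf. (5)
    z = z + eps * (f1(z) + f2(z) @ Y)        # Euler step of dZ/dt = f(Z, Y)
    x = rng.choice(X, p=P[x, a])             # MDP jump to the next state
print("Z_1 ~", z)
```

As $\epsilon$ decreases, the jumps of $Y_t$ become faster relative to the motion of $Z_t$, which is the regime in which the averaging result below applies.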
The system (1) with thus-defined $Y_t$ is called hybrid, first, because $Y_t$ changes its values via some random jumps whereas $Z_t$ is a smooth (differentiable) function of time and, second, because, as follows from the consideration below, $Y_t$, being controlled "statistically" through controlling the transition probabilities, plays by itself the role of a "direct" control with respect to $Z_t$.
Let $g : R^n \to R$ be some operating cost related to the process $Z_t$. We assume that it is Lipschitz continuous; i.e.,
$$\|g(z) - g(z')\|_1 \le C_1 \|z - z'\|_1.$$
We consider the following control problem with $\epsilon$ and $x$ fixed.

$Q_\epsilon$: find a policy $u$ that achieves $F_\epsilon(z, x) = \inf_{u \in U} E^u_x\, g(Z_1)$, where $Z_1$ is obtained through (1).
Our model is characterized by the fact that $\epsilon$ is supposed to be a small parameter, and our objective is to construct a policy (depending, in general, on $\epsilon$) which is asymptotically optimal for $Q_\epsilon$. That is, the difference between the cost under this policy and $F_\epsilon(z, x)$ converges to zero as $\epsilon \to 0$.
The type of model which we introduce is natural in the control of inventories or of production, where we deal with material whose quantity may "slowly" change in a continuous (linear) way. Breakdowns, repairs, and other control decisions yield the underlying MDP. Our model may also be used in the control of highly loaded queueing networks for which the fluid approximation holds (see Kleinrock [20, p. 56]). The slow variables $Z_t$ may then represent the number of customers in the different queues, whereas the underlying MDP may correspond to routing, or flow control of, say, some long on/off traffic.
The fact that $\epsilon$ is chosen to be small means that the variables $Y_t$, along with the MDP $X_t$, can be considered to be fast with respect to the time scale $t$ in which $Z_t$ evolves. Indeed, $Y_t$ and $X_t$ may have large jumps between $t = m\epsilon$ and $t = (m+1)\epsilon$, whereas the corresponding change in $Z_t$ in that period is of order $\epsilon$. The problem is, thus, close in nature to singularly perturbed stochastic control problems intensively studied in the literature (see, for example, [1], [5], [6], [7], [9], [10], [21], [23], [24], [25] and references therein). A common approach to this kind of problem is an application of singular perturbations or averaging techniques to the Hamilton–Jacobi–Bellman (HJB) equation for problems in continuous time (as in [5], [6], [21]) or to the dynamic programming equation for singularly perturbed MDPs [1], [7], [9], [10], [24], [25]. In contrast to this approach, we, as in [23], apply an averaging method directly to the "slow" stochastic equation. Our model differs, however, from the ones in [23] in many respects—mainly in the type of fast motions involved, which implies differences in both the technique used and the results obtained.
In our previous paper [2], we considered a problem similar to $Q_\epsilon$ for the case of linear dynamics $f$ and cost $g$ and showed that an asymptotically optimal policy can be constructed via maximization of the Hamiltonian of some linear deterministic system. The technique we used was, however, strongly related to the linearity of the model, and it is not applicable to the case when the dynamics and/or the cost are nonlinear. As opposed to the linear case, the treatment of the nonlinear case is much more involved and is based on an ergodicity-type result for MDPs obtained in this paper (see Theorem 4.1 below). Using this result we establish that the trajectories of the stochastic hybrid system (1) are approximated by the trajectories of some nonlinear deterministic control system, and the problem $Q_\epsilon$ is approximated by the corresponding deterministic optimal control problem, allowing us, in particular, to construct an asymptotically optimal policy for $Q_\epsilon$. Notice that this result can be viewed as an extension of the averaging technique for deterministic singularly perturbed control systems (see, e.g., [15]) to the stochastic case under consideration. On the other hand, it can be viewed as an extension of results on uncontrolled motions, establishing that the solution of the original stochastic system is approximated by the solution of some deterministic system obtained via averaging over the fast random dynamics [16], [19], [22], to the case when this random dynamics is defined by the controlled Markov chain.
The paper consists of four sections. Section 1 is this introduction; section 2
describes the main results about the approximation of the problem of optimal control
of the hybrid system by a deterministic optimal control problem. In section 3 we
discuss ways that the solution of the deterministic optimal control problem can be
characterized and how it can be used to obtain an asymptotically optimal policy.
Section 4 contains the above-mentioned Theorem 4.1, as well as the proofs of some
basic lemmas used in section 2.
2. Description of main results. Let
$$\mathcal{Y}(m, x) \stackrel{\rm def}{=} \bigcup_{u \in U} \left\{ (m+1)^{-1} \sum_{t=0}^{m} E^u_x\, Y_{t\epsilon} \right\},$$
where the union is taken over all policies. As follows from Theorem 3 in [2], the set $\mathcal{Y}(m, x)$ converges in the Hausdorff metric to a set $\mathcal{Y}$ defined below:
$$\lim_{m \to \infty} \mathcal{Y}(m, x) = \mathcal{Y} \stackrel{\rm def}{=} \bigcup_{u \in S} \left\{ \sum_{v,a} \eta(u; v, a)\, y(v, a) \right\}, \tag{6}$$
where the union is taken over all stationary policies, and $\eta(u) = \{\eta(u; v, a)\}$ is the vector of steady state probabilities of state-action pairs obtained when using a stationary policy $u$. That is,
$$\eta(u; v, a) = \lim_{n \to \infty} P^u_x (X_n = v,\ A_n = a). \tag{7}$$
Notice that due to the ergodicity assumption on our model, $\eta(u; v, a)$ does not depend on the initial distribution. Notice also that, since the set
$$W \stackrel{\rm def}{=} \bigcup_{u \in S} \{\eta(u)\} \tag{8}$$
is a polyhedron (see, for example, [11, pp. 93–95]), the set $\mathcal{Y}$ is a polyhedron as well.
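For a fixed stationary policy $u$, the vector $\eta(u)$ in (7)–(8) can be computed numerically: $u$ induces a Markov chain on $X$ with kernel $P_u(v, w) = \sum_a u(a|v) P_{vaw}$, whose (unique, under the regularity assumption) stationary distribution $\pi$ gives $\eta(u; v, a) = \pi(v)\, u(a|v)$. A minimal sketch, reusing the hypothetical toy data from the simulation above:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # toy transition probabilities P[v, a, w]
              [[0.3, 0.7], [0.6, 0.4]]])
u = np.array([[0.5, 0.5],                    # a stationary policy u(a|v)
              [0.8, 0.2]])
N, nA = u.shape

P_u = np.einsum('va,vaw->vw', u, P)          # induced chain: P_u[v, w] = sum_a u(a|v) P[v, a, w]

# stationary distribution: solve pi^T (P_u - I) = 0 together with sum(pi) = 1
A_mat = np.vstack([(P_u - np.eye(N)).T, np.ones(N)])
b = np.concatenate([np.zeros(N), [1.0]])
pi, *_ = np.linalg.lstsq(A_mat, b, rcond=None)

eta = pi[:, None] * u                        # eta(u; v, a) = pi(v) u(a|v), cf. (7)
print(eta, eta.sum())                        # the entries sum to 1
```

Running this over the finitely many deterministic stationary policies of the toy example and taking convex hulls gives its polyhedra $W$ and $\mathcal{Y}$.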
Define now the averaged deterministic control system as
$$\frac{d}{dt} z_t = f(z_t, y_t), \qquad z_0 = z, \tag{9}$$
where $y_t$ is a measurable function of $t$ taking values in $\mathcal{Y}$. The set of such functions
$$y : [0, 1] \to \mathcal{Y}$$
will be called the set of admissible controls.
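A minimal sketch of integrating the averaged system (9) with a piecewise constant admissible control follows. The two control values used below, roughly $(0.75, 0.25)$ and $(0.71, 0.71)$, are the steady-state averages $\sum_{v,a} \eta(v,a)\, y(v,a)$ of the two constant-action stationary policies of the toy MDP above, so they lie in $\mathcal{Y}$ by (6); everything here is illustrative, not from the paper.

```python
import numpy as np

# averaged dynamics (9) with the toy f1, f2 used earlier
f1 = lambda z: np.array([-0.2 * z[0]])
f2 = lambda z: np.array([[1.0, -1.0]])

# two points of the set Y (steady-state averages of two stationary policies)
y_bar = {0: np.array([0.75, 0.25]), 1: np.array([0.7143, 0.7143])}

def y_ctrl(t):
    """A piecewise constant admissible control y : [0, 1] -> Y."""
    return y_bar[0] if t < 0.5 else y_bar[1]

dt = 1e-3
z = np.array([0.5])                          # z_0 = z, same initial condition as (1)
for step in range(int(1.0 / dt)):
    z = z + dt * (f1(z) + f2(z) @ y_ctrl(step * dt))   # Euler step of (9)
print("z_1 ~", z)
```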
Our claim is that the set of all random trajectories of (1) is approximated by the set of solutions of (9) obtained with all admissible controls. More specifically, we establish that there exists a function $\gamma(\epsilon)$ satisfying
$$\lim_{\epsilon \to 0} \gamma(\epsilon) = 0$$
such that the following holds.
LEMMA 2.1. Corresponding to any admissible control $y = \{y_t,\ t \in [0, 1]\}$, there exists a Markov policy $u_\epsilon(y)$ such that the random trajectory $Z_t$ of (1), obtained with this policy $u_\epsilon(y)$, and the deterministic solution $z^y_t$ of (9), obtained with $y$, satisfy the inequality
$$\max_{t \in [0,1]} E^{u_\epsilon(y)}_x \|Z_t - z^y_t\|_1 \le \gamma(\epsilon). \tag{10}$$

LEMMA 2.2. There exists a function $\tilde y_t(h)$,
$$\tilde y : [0, 1] \times H \to \mathcal{Y},$$
such that (a) for each $h \in H$, $\tilde y_t(h)$ is a piecewise constant function of $t$ and (b) for any policy $u$,
$$\max_{t \in [0,1]} E^u_x \|Z_t - \tilde z_t(H)\|_1 \le \gamma(\epsilon), \tag{11}$$
where $Z_t$ is the solution of (1), $\tilde z_t(H)$ is the solution of (9) obtained with $y_t = \tilde y_t(H)$, and $H$ is the random realization of the state-action trajectories.
Notice that the quantity under the expectation sign in (11) is a random variable for any policy $u$ since $H$ is a finite set and $F$ is the $\sigma$-algebra of all subsets of $H$. Notice also that a construction of a policy $u_\epsilon(y)$ which allows the estimate (10) in Lemma 2.1 is described below in section 3. This is just a stationary policy when the deterministic control $y$ is a constant function of time, and it consists of a finite number of stationary policies (and thus is not stationary itself) when $y$ is piecewise constant.
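The following sketch illustrates the constant-control case of that construction on the toy data used above: given a target value $\bar y \in \mathcal{Y}$, it finds some $\eta \in W$ with $\sum_{v,a} \eta(v,a)\, y(v,a) = \bar y$ (using the standard linear description of $W$ by balance equations) and reads off a stationary policy $u(a|v) = \eta(v,a) / \sum_a \eta(v,a)$ whose long-run average of $Y_t$ is $\bar y$. The target value and all data are hypothetical; section 3 of the paper should be consulted for the actual construction and its error estimate.

```python
import numpy as np
from scipy.optimize import linprog

P = np.array([[[0.9, 0.1], [0.2, 0.8]],           # toy transition probabilities P[v, a, w]
              [[0.3, 0.7], [0.6, 0.4]]])
yva = np.array([[[1.0, 0.0], [1.0, 1.0]],         # toy y(v, a) in R^2
                [[0.0, 1.0], [0.5, 0.5]]])
N, nA, k = P.shape[0], P.shape[1], yva.shape[2]
y_bar = np.array([0.65, 0.5])                     # target point, assumed to lie in Y

# equality constraints on the flattened eta(v, a)
A_eq, b_eq = [], []
for w in range(N):                                # balance: sum_a eta(w,a) = sum_{v,a} eta(v,a) P[v,a,w]
    row = np.zeros((N, nA)); row[w, :] += 1.0; row -= P[:, :, w]
    A_eq.append(row.ravel()); b_eq.append(0.0)
A_eq.append(np.ones(N * nA)); b_eq.append(1.0)    # eta is a probability vector
for j in range(k):                                # match the target average
    A_eq.append(yva[:, :, j].ravel()); b_eq.append(y_bar[j])

res = linprog(c=np.zeros(N * nA), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (N * nA))      # pure feasibility problem
eta = res.x.reshape(N, nA)
u = eta / eta.sum(axis=1, keepdims=True)          # stationary policy realizing y_bar on average
print(u)
```

For a piecewise constant $y$, the same computation would be repeated piece by piece, and the resulting stationary policies applied one after another on the corresponding time intervals.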
Define the "deterministic" optimal control problem $Q_0$ as follows.
$Q_0$: Find an admissible control $y$ which minimizes the cost function
$$F_0(z) \stackrel{\rm def}{=} \inf_y g(z_1)$$
over the trajectories $z$ of system (9). The following theorem about approximation of $Q_\epsilon$ by $Q_0$ is then easily established on the basis of Lemmas 2.1 and 2.2.
THEOREM 2.1. The values $F_\epsilon(z, x)$ of the original problem $Q_\epsilon$ converge to the value $F_0(z)$ of the problem $Q_0$ as $\epsilon \to 0$. More precisely,
$$\left| F_\epsilon(z, x) - F_0(z) \right| \le C_1 \gamma(\epsilon).$$
If $y^*$ is an optimal control for $Q_0$, then the Markov policy $u_\epsilon(y^*)$ allowing estimate (10) with $y = y^*$ satisfies the inequality
$$E^{u_\epsilon(y^*)}_x g(Z_1) - F_\epsilon(z, x) \le C_1 \gamma(\epsilon).$$
That is, $u_\epsilon(y^*)$ is asymptotically optimal for $Q_\epsilon$.
Remark 2.1. In the linear case studied in [2], $\gamma$ can be chosen such that
$$\lim_{\epsilon \to 0} \epsilon^{-(1/2)} \gamma(\epsilon) = 0.$$
Hence, for the linear case, simple bounds on the rate of convergence are available for Lemmas 2.1 and 2.2 as well as for Theorem 2.1.
Proof of Theorem 2.1. Let $u$ be an arbitrary policy and $\tilde y_t(h) \in \mathcal{Y}$ be the function defined in Lemma 2.2. Then
$$\left| E^u_x\, g(Z_1) - E^u_x\, g(\tilde z_1(H)) \right| \le C_1 E^u_x \|Z_1 - \tilde z_1(H)\|_1 \le C_1 \gamma(\epsilon), \tag{12}$$
where $C_1$ is defined in (3). Being piecewise constant, the function $\tilde y$ is measurable in $t$. Hence,
$$g(\tilde z_1(h)) \ge F_0(z) \quad \forall h \in H,$$

Citations
Book ChapterDOI

Towards a Theory of Stochastic Hybrid Systems

TL;DR: The invariant distribution and the exit probability from an interval of the Markov chain are studied, and it is shown that they converge to their counterparts for the solution process of the original SDE as the discretization step goes to zero, providing a useful tool for studying various sample path properties of the SDE.
Journal ArticleDOI

A compositional modelling and analysis framework for stochastic hybrid systems

TL;DR: HModest is presented, an extension to the Modest modelling language—which is originally designed for stochastic timed systems without complex continuous aspects—that adds differential equations and inclusions as an expressive way to describe the continuous system evolution.
Proceedings ArticleDOI

Measurability and safety verification for stochastic hybrid systems

TL;DR: Stochastic hybrid systems are considered where the continuous-time behaviour is given by differential equations, as for usual hybrid systems, but the targets of discrete jumps are chosen by probability distributions; it is shown that measurability of the complete system follows from the measurability of its constituent parts.
Book ChapterDOI

Safety verification for probabilistic hybrid systems

TL;DR: This paper considers probabilistic hybrid systems and develops a general abstraction technique that can formally verify safety properties of non-trivial continuous-time stochastic hybrid systems—without resorting to point-wise discretisation.
Journal ArticleDOI

Optimal and Hierarchical Controls in Dynamic Stochastic Manufacturing Systems: A Survey

TL;DR: In this paper, the authors present a survey of the research devoted to proving that a hierarchy based on the frequencies of occurrence of different types of events results in decisions that are asymptotically optimal as the rates of some events become large compared to those of others.
References
Book

Optimization and nonsmooth analysis

TL;DR: Develops the theory of generalized gradients and nonsmooth analysis, with applications to optimization, the calculus of variations, and optimal control.
Book

Controlled Markov processes and viscosity solutions

TL;DR: In this paper, an introduction to optimal stochastic control for continuous time Markov processes and to the theory of viscosity solutions is given, as well as a concise introduction to two-controller, zero-sum differential games.
Book

Finite Markov chains

TL;DR: A classic treatment of the theory of finite Markov chains, including absorbing and ergodic chains and the fundamental matrix.
Book

Deterministic and stochastic optimal control

TL;DR: A standard text on deterministic optimal control via the calculus of variations and the Pontryagin principle, together with stochastic optimal control of Markov diffusion processes.
Frequently Asked Questions (9)
Q1. What have the authors contributed in "Asymptotic Optimization of a Nonlinear Hybrid System Governed by a Markov Decision Process"?

The authors consider in this paper a continuous time stochastic hybrid control system with finite time horizon. Under the assumption that $\epsilon$ is a small parameter, the authors justify an averaging procedure allowing them to establish that their problem can be approximated by the solution of some deterministic optimal control problem.

The class of stationary policies is compact; i.e., for any sequence $u(i) \in S$, there exists a subsequence $u(i_j)$ such that the policy $u^* = \lim_{j \to \infty} u(i_j)$ (i.e., the policy for which $u^*(a|x) = \lim_{j \to \infty} u(i_j)(a|x)$ for all $a$ and $x$) is stationary.

In the linear case studied in [2], $\gamma$ can be chosen such that $\lim_{\epsilon \to 0} \epsilon^{-(1/2)} \gamma(\epsilon) = 0$. Hence, for the linear case, simple bounds on the rate of convergence are available for Lemmas 2.1 and 2.2 as well as for Theorem 2.1.

LEMMA 4.1. Let $y^i_t(h)$, $i = 1, 2$, be functions of time $t$ and state-action histories $h$. Let $z^i_t(h)$ be the solution of (9) obtained with $y^i_t(h)$ ($h$ is fixed), $i = 1, 2$.

Substituting the last inequality in (42), one obtains
$$\max_{t \in [0,1]} E^{u_\epsilon(y)}_x \|Z_t - z^y_t\|_1 \le L \left[ \Delta(\epsilon) + (L_1 + L_2) \Delta(\epsilon) + L_3 \mu(K(\epsilon)) \right],$$
which, by (43), completes the proof of the lemma.

Since for any initial distribution $\xi$ and for any stationary policy $s(i)$, the authors have $\psi_0 = \eta(s(i))$, $P^{s(i)}_\xi$-a.s. (38), it follows, by choosing the sequence of times $t(i)$ so that the intervals $t(i+1) - t(i)$ are sufficiently large, that (36) implies that $\lim_{i \to \infty}$

As in this theorem, one can also establish that $\lim_{\epsilon \to 0} B_\epsilon(z, x, s) = B_0(z, s)$, with the convergence being uniform with respect to $s \in [0, 1]$, $x \in X$, and $z \in Z$, where $Z$ is a compact subset of $R^n$. Notice that the described approach has a decomposition structure.

It follows by arguments as in the first part of the proof that there exist sequences of times $t(i)$ and of stationary policies $s(i)$, and a constant $\alpha_4 > 0$, such that for all $i$, $E^{s(i)}_\xi d^2_{t(i)} \ge \alpha_4$ (36) for any initial distribution $\xi$.

The optimal value of the above problem does not depend on the initial distribution $\xi$, and it is equal to the optimal value of the following linear programming problem:
$$J_\xi(z, \lambda) = J(z, \lambda) \stackrel{\rm def}{=} \min_\eta \left\{ \sum_{v,a} r(z, \lambda; v, a)\, \eta(v, a) \;\Big|\; \eta = \{\eta(v, a)\} \in W \right\} \tag{19}$$
$$= \lambda^T f_1(z) + \min_\eta \left\{ \lambda^T f_2(z) \sum_{v,a} y(v, a)\, \eta(v, a) \;\Big|\; \eta = \{\eta(v, a)\} \in W \right\}.$$
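A sketch of solving the linear program (19) numerically, using the same linear description of $W$ and the toy data from the earlier sketches; the cost coefficients are taken as $r(z, \lambda; v, a) = \lambda^T \bigl( f_1(z) + f_2(z)\, y(v, a) \bigr)$, which is consistent with the second line of (19) but is an assumption of this sketch rather than a quotation of the paper.

```python
import numpy as np
from scipy.optimize import linprog

# toy MDP and dynamics data (hypothetical, as in the earlier sketches)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])          # P[v, a, w]
yva = np.array([[[1.0, 0.0], [1.0, 1.0]],
                [[0.0, 1.0], [0.5, 0.5]]])        # y(v, a) in R^2
f1 = lambda z: np.array([-0.2 * z[0]])
f2 = lambda z: np.array([[1.0, -1.0]])

def J(z, lam, P=P, yva=yva):
    """Value of the LP (19): min over eta in W of sum_{v,a} r(z, lam; v, a) eta(v, a),
    with r(z, lam; v, a) = lam^T (f1(z) + f2(z) y(v, a))."""
    N, nA, _ = P.shape
    r = np.array([[lam @ (f1(z) + f2(z) @ yva[v, a]) for a in range(nA)]
                  for v in range(N)])             # cost coefficients r(z, lam; v, a)
    A_eq, b_eq = [], []
    for w in range(N):                            # balance: sum_a eta(w,a) = sum_{v,a} eta(v,a) P[v,a,w]
        row = np.zeros((N, nA)); row[w, :] += 1.0; row -= P[:, :, w]
        A_eq.append(row.ravel()); b_eq.append(0.0)
    A_eq.append(np.ones(N * nA)); b_eq.append(1.0)  # eta is a probability vector
    res = linprog(c=r.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (N * nA))
    return res.fun

print(J(np.array([0.5]), np.array([1.0])))
```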