
Robust Markov Decision Processes
Wolfram Wiesemann, Daniel Kuhn and Berç Rustem
February 9, 2012
Abstract
Markov decision processes (MDPs) are powerful tools for decision making in uncertain dynamic
environments. However, the solutions of MDPs are of limited practical use due to their sensitivity
to distributional model parameters, which are typically unknown and have to be estimated by the
decision maker. To counter the detrimental effects of estimation errors, we consider robust MDPs
that offer probabilistic guarantees in view of the unknown parameters. To this end, we assume that
an observation history of the MDP is available. Based on this history, we derive a confidence re-
gion that contains the unknown parameters with a pre-specified probability 1 − β. Afterwards, we
determine a policy that attains the highest worst-case performance over this confidence region. By
construction, this policy achieves or exceeds its worst-case performance with a confidence of at least
1 − β. Our method involves the solution of tractable conic programs of moderate size.
Keywords Robust Optimization; Markov Decision Processes; Semidefinite Programming.
Notation For a finite set $X = \{1, \ldots, X\}$, $M(X)$ denotes the probability simplex in $\mathbb{R}^X$. An $X$-valued random variable $\chi$ has distribution $m \in M(X)$, denoted by $\chi \sim m$, if $\mathbb{P}(\chi = x) = m_x$ for all $x \in X$. By default, all vectors are column vectors. We denote by $e_k$ the $k$th canonical basis vector, while $e$ denotes the vector whose components are all ones. In both cases, the dimension will usually be clear from the context. For square matrices $A$ and $B$, the relation $A \succeq B$ indicates that the matrix $A - B$ is positive semidefinite. We denote the space of symmetric $n \times n$ matrices by $\mathbb{S}^n$. The declaration $f : X \stackrel{c}{\mapsto} Y$ ($f : X \stackrel{a}{\mapsto} Y$) implies that $f$ is a continuous (affine) function from $X$ to $Y$. For a matrix $A$, we denote its $i$th row by $A_{i\cdot}^{\top}$ (a row vector) and its $j$th column by $A_{\cdot j}$.
1 Introduction
Markov decision processes (MDPs) provide a versatile model for sequential decision making under un-
certainty, which accounts for both the immediate effects and the future ramifications of decisions. In
the past sixty years, MDPs have been successfully applied to numerous areas, ranging from inventory
control and investment planning to studies in economics and behavioral ecology [5, 20].
In this paper, we study MDPs with a finite state space $S = \{1, \ldots, S\}$, a finite action space $A = \{1, \ldots, A\}$, and a discrete but infinite planning horizon $T = \{0, 1, 2, \ldots\}$. Without loss of generality (w.l.o.g.), we assume that every action is admissible in every state. The initial state is random and follows the probability distribution $p_0 \in M(S)$. If action $a \in A$ is chosen in state $s \in S$, then the subsequent state is determined by the conditional probability distribution $p(\cdot|s, a) \in M(S)$. We condense these conditional distributions to the transition kernel $P \in [M(S)]^{S \times A}$, where $P_{sa} := p(\cdot|s, a)$ for $(s, a) \in S \times A$. The decision maker receives an expected reward of $r(s, a, s') \in \mathbb{R}_+$ if action $a \in A$ is chosen in state $s \in S$ and the subsequent state is $s' \in S$. W.l.o.g., we assume that all rewards are non-negative. The MDP is controlled through a policy $\pi = (\pi_t)_{t \in T}$, where $\pi_t : (S \times A)^t \times S \mapsto M(A)$. $\pi_t(\cdot|s_0, a_0, \ldots, s_{t-1}, a_{t-1}; s_t)$ represents the probability distribution over $A$ according to which the next action is chosen if the current state is $s_t$ and the state-action history is given by $(s_0, a_0, \ldots, s_{t-1}, a_{t-1})$. Together with the transition kernel $P$, $\pi$ induces a stochastic process $(s_t, a_t)_{t \in T}$ on the space $(S \times A)^\infty$ of sample paths. We use the notation $\mathbb{E}^{P,\pi}$ to denote expectations with respect to this process. Throughout this paper, we evaluate policies in view of their expected total reward under the discount factor $\lambda \in (0, 1)$:

$$\mathbb{E}^{P,\pi}\Big[\sum_{t=0}^{\infty} \lambda^t\, r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 \sim p_0\Big] \qquad (1)$$
For a fixed policy π, the policy evaluation problem asks for the value of expression (1). The policy
improvement problem, on the other hand, asks for a policy π that maximizes (1).
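As a point of reference, for a stationary randomized policy the nominal policy evaluation problem reduces to a linear system. The sketch below (Python with NumPy; not part of the paper, and the array layout and function name are our own assumptions) solves $v = r_\pi + \lambda P_\pi v$ and returns the expected total reward (1):

```python
import numpy as np

def evaluate_policy(P, r, pi, p0, lam):
    """Nominal policy evaluation for a stationary randomized policy.

    P   : (S, A, S) array, P[s, a, s'] = transition probability
    r   : (S, A, S) array, r[s, a, s'] = expected reward
    pi  : (S, A) array, pi[s, a] = probability of action a in state s
    p0  : (S,) array, initial state distribution
    lam : discount factor in (0, 1)
    """
    S = P.shape[0]
    # Policy-induced transition matrix and expected one-step reward.
    P_pi = np.einsum("sa,sat->st", pi, P)        # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum("sa,sat,sat->s", pi, P, r)  # r_pi[s] = E[r(s, a, s')]
    # Solve the Bellman equation v = r_pi + lam * P_pi v for the fixed policy.
    v = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
    return p0 @ v  # expected total discounted reward (1)
```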
Most of the literature on MDPs assumes that the expected rewards r and the transition kernel P
are known, with a tacit understanding that they have to be estimated in practice. However, it is well-
known that the expected total reward (1) can be very sensitive to small changes in r and P [16]. Thus,
decision makers are confronted with two different sources of uncertainty. On one hand, they face internal
variation due to the stochastic nature of MDPs. On the other hand, they need to cope with external
variation because the estimates for r and P deviate from their true values. In this paper, we assume
that the decision maker is risk-neutral to internal variation but risk-averse to external variation. This
is justified if the MDP runs for a long time, or if many instances of the same MDP run in parallel [16].
We focus on external variation in P and assume r to be known. Indeed, the expected total reward (1)
is typically more sensitive to P , and the inclusion of reward variation is straightforward [8, 16].
Let $P^0$ be the unknown true transition kernel of the MDP. Since the expected total reward of a policy depends on $P^0$, we cannot evaluate expression (1) under external variation. Iyengar [12] and Nilim and El Ghaoui [18] therefore suggest finding a policy that guarantees the highest expected total reward at a given confidence level. To this end, they determine a policy $\pi$ that maximizes the worst-case objective

$$z^* = \inf_{P \in \mathcal{P}} \mathbb{E}^{P,\pi}\Big[\sum_{t=0}^{\infty} \lambda^t\, r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 \sim p_0\Big], \qquad (2)$$

where the ambiguity set $\mathcal{P}$ is the Cartesian product of independent marginal sets $\mathcal{P}_{sa} \subseteq M(S)$ for each $(s, a) \in S \times A$. In the following, we call such ambiguity sets rectangular. Problem (2) determines the worst-case expected total reward of $\pi$ if the transition kernel can vary freely within $\mathcal{P}$. In analogy to our earlier definitions, the robust policy evaluation problem evaluates expression (2) for a fixed policy $\pi$, while the robust policy improvement problem asks for a policy that maximizes (2). The optimal value $z^*$ in (2) provides a lower bound on the expected total reward of $\pi$ if the true transition kernel $P^0$ is contained in the ambiguity set $\mathcal{P}$. Hence, if $\mathcal{P}$ is a confidence region that contains $P^0$ with probability $1 - \beta$, then the policy $\pi$ guarantees an expected total reward of at least $z^*$ at a confidence level $1 - \beta$. To construct an ambiguity set $\mathcal{P}$ with this property, [12] and [18] assume that independent transition samples are available for each state-action pair $(s, a) \in S \times A$. Under this assumption, one can employ standard results on the asymptotic properties of the maximum likelihood estimator to derive a confidence region for $P^0$. If we project this confidence region onto the marginal sets $\mathcal{P}_{sa}$, then $z^*$ provides the desired probabilistic lower bound on the expected total reward of $\pi$.
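For intuition, under a rectangular ambiguity set the infimum in (2) decomposes over state-action pairs, and a robust variant of value iteration applies. The sketch below (Python/NumPy; not from the paper) evaluates a fixed stationary policy in the simplified case where each marginal set $\mathcal{P}_{sa}$ is a finite list of candidate distributions, whereas [12], [18] and this paper work with richer, conic-representable sets for which the inner minimization becomes a small convex program:

```python
import numpy as np

def robust_policy_evaluation(cand, r, pi, p0, lam, tol=1e-10):
    """Robust evaluation of a fixed stationary policy under an
    (s, a)-rectangular ambiguity set given by finite candidate lists.

    cand[s][a] : (m, S) array of candidate distributions for the pair (s, a)
    r          : (S, A, S) array of expected rewards r(s, a, s')
    pi         : (S, A) array, pi[s, a] = probability of action a in state s
    p0         : (S,) initial state distribution
    lam        : discount factor in (0, 1)
    """
    S, A, _ = r.shape
    v = np.zeros(S)
    while True:
        v_new = np.zeros(S)
        for s in range(S):
            for a in range(A):
                # Worst case over the marginal set P_sa of the one-step value.
                values = cand[s][a] @ (r[s, a] + lam * v)
                v_new[s] += pi[s, a] * values.min()
        if np.max(np.abs(v_new - v)) < tol:
            return p0 @ v_new
        v = v_new
```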
In this paper, we alter two key assumptions of the outlined procedure. Firstly, we assume that the
decision maker cannot obtain independent transition samples for the state-action pairs. Instead, she merely has access to an observation history $(s_1, a_1, \ldots, s_n, a_n) \in (S \times A)^n$ generated by the MDP under
some known policy. Secondly, we relax the assumption of rectangular ambiguity sets. In the following,
we briefly motivate these changes and give an outlook on their consequences.
Although transition sampling has theoretical appeal, it is often prohibitively costly or even infeasible
in practice. To obtain independent samples for each state-action pair, one needs to repeatedly direct
the MDP into any of its states and record the transitions resulting from different actions. In particular,
one cannot use the transition frequencies of an observation history because those frequencies violate the
independence assumption stated above. The availability of an observation history, on the other hand,
seems much more realistic in practice. Observation histories introduce a number of theoretical challenges,
such as the lack of observations for some transitions and stochastic dependencies between the transition
frequencies. We will apply results from statistical inference on Markov chains to address these issues. It
turns out that many of the results derived for transition sampling in [12] and [18] remain valid in the
new setting where the transition probabilities are estimated from observation histories.
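For concreteness, the maximum likelihood point estimate of the transition kernel from an observation history is simply the matrix of empirical transition frequencies. The sketch below (Python/NumPy; our own illustration, not the paper's construction, which builds a confidence region around such counts and treats unvisited state-action pairs more carefully) computes these frequencies:

```python
import numpy as np

def empirical_transition_kernel(history, S, A):
    """Maximum likelihood estimate of the transition kernel from an observation
    history (s_1, a_1, ..., s_n, a_n); a minimal sketch only.

    history : list of (state, action) pairs, states in 0..S-1, actions in 0..A-1
    """
    counts = np.zeros((S, A, S))
    for (s, a), (s_next, _) in zip(history[:-1], history[1:]):
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited state-action pairs carry no data; default to uniform here.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat, counts
```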
The restriction to rectangular ambiguity sets has been introduced in [12] and [18] to facilitate compu-
tational tractability. Under the assumption of rectangularity, the robust policy evaluation and improve-
ment problems can be solved efficiently with a modified value or policy iteration. This implies, however,
that non-rectangular ambiguity sets have to be projected onto the marginal sets $\mathcal{P}_{sa}$. Not only does this
‘rectangularization’ unduly increase the level of conservatism, but it also creates a number of undesirable
side-effects that we discuss in Section 2. In this paper, we show that the robust policy evaluation and
improvement problems remain tractable for ambiguity sets that exhibit a milder form of rectangularity,
and we develop a polynomial time solution method. On the other hand, we prove that the robust policy
evaluation and improvement problems are intractable for non-rectangular ambiguity sets. For this set-
ting, we formulate conservative approximations of the policy evaluation and improvement problems. We
bound the optimality gap incurred from solving those approximations, and we outline how our approach
can be generalized to a hierarchy of increasingly accurate approximations.
The contributions of this paper can be summarized as follows.
1. We analyze a new class of ambiguity sets, which contains the above-defined rectangular ambiguity
sets as a special case. We show that the optimal policies for this class are randomized but memo-
ryless. We develop algorithms that solve the robust policy evaluation and improvement problems
over these ambiguity sets in polynomial time.
2. It is stated in [18] that the robust policy evaluation and improvement problems “seem to be hard to
solve” for non-rectangular ambiguity sets. We prove that these problems cannot be approximated to
any constant factor in polynomial time unless P = NP. We develop a hierarchy of increasingly ac-
curate conservative approximations, together with ex post bounds on the incurred optimality gap.
3. We present a method to construct ambiguity sets from observation histories. Our approach allows us
to account for different types of a priori information about the transition kernel, which helps to
reduce the size of the ambiguity set. We also investigate the convergence behavior of our ambiguity
set when the length of the observation history increases.
The study of robust MDPs with rectangular ambiguity sets dates back to the seventies, see [3, 10, 22,
26] and the surveys in [12, 18]. However, most of the early contributions do not address the construction
of suitable ambiguity sets. In [16], Mannor et al. approximate the bias and variance of the expected total
reward (1) if the unknown model parameters are replaced with estimates. Delage and Mannor [8] use
these approximations to solve a chance-constrained policy improvement problem in a Bayesian setting.
Recently, alternative performance criteria have been suggested to address external variation, such as the
worst-case expected utility and regret measures. We refer to [19, 27] and the references cited therein.
Note that external variation could be addressed by encoding the unknown model parameters into the
states of a partially observable MDP (POMDP) [17]. However, the optimization of POMDPs becomes
challenging even for small state spaces. In our case, the augmented state space would become very large,
which renders optimization of the resulting POMDPs prohibitively expensive.
The remainder of the paper is organized as follows. Section 2 defines and analyzes the classes of
robust MDPs that we consider. Sections 3 and 4 study the robust policy evaluation and improvement
problems, respectively. Section 5 constructs ambiguity sets from observation histories. We illustrate our
method in Section 6, where we apply it to the machine replacement problem. We conclude in Section 7.
Remark 1.1 (Finite Horizon MDPs) Throughout the paper, we outline how our results extend to
finite horizon MDPs. In this case, we assume that $T = \{0, 1, 2, \ldots, T\}$ with $T < \infty$ and that $S$ can be partitioned into nonempty disjoint sets $\{S_t\}_{t \in T}$ such that at period $t$ the system is in one of the states in $S_t$. We do not discount rewards in finite horizon MDPs. In addition to the transition rewards $r(s, a, s')$, an expected reward of $r_s \in \mathbb{R}_+$ is received if the MDP reaches the terminal state $s \in S_T$. We assume that $p_0(s) = 0$ for $s \notin S_0$.
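To fix ideas for this finite horizon variant, the nominal problem is solved by standard backward induction over the period partition. The sketch below (Python/NumPy; an illustration under assumed array shapes, not the paper's robust method) computes the undiscounted value and a greedy policy, using the terminal rewards $r_s$ at the final period:

```python
import numpy as np

def finite_horizon_backward_induction(P, r, r_term, states_by_period):
    """Nominal backward induction for the finite horizon setting of Remark 1.1
    (no discounting, terminal rewards).  The robust counterpart would replace
    the expectation over P[s, a] by a worst case over the marginal ambiguity set.

    P                : (S, A, S) transition kernel
    r                : (S, A, S) transition rewards r(s, a, s')
    r_term           : (S,) terminal rewards; only entries for states in S_T are used
    states_by_period : list of lists, states_by_period[t] = states in S_t
    """
    S, A, _ = P.shape
    v = np.array(r_term, dtype=float)                # values at the terminal period
    policy = np.zeros(S, dtype=int)
    for states in reversed(states_by_period[:-1]):   # periods T-1, ..., 0
        for s in states:
            # Q(s, a) = E[r(s, a, s') + v(s')], successors lie in the next period's states.
            q = np.einsum("at,at->a", P[s], r[s]) + P[s] @ v
            policy[s] = int(np.argmax(q))
            v[s] = q.max()
    return v, policy
```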
2 Robust Markov Decision Processes
This section studies properties of the robust policy evaluation and improvement problems. Both problems
are concerned with robust MDPs, for which the transition kernel is only known to be an element of an
ambiguity set $\mathcal{P} \subseteq [M(S)]^{S \times A}$. We assume that the initial state distribution $p_0$ is known.
We start with the robust policy evaluation problem. We define the structure of the ambiguity sets that
we consider, as well as different types of rectangularity that can be imposed to facilitate computational
tractability. Afterwards, we discuss the robust policy improvement problem. We define several policy
classes that are commonly used in MDPs, and we investigate the structure of optimal policies for different
types of rectangularity. We close with a complexity result for the robust policy evaluation problem. Since
the remainder of this paper almost exclusively deals with the robust versions of the policy evaluation
and improvement problems, we may suppress the attribute ‘robust’ in the following.
2.1 The Robust Policy Evaluation Problem
In this paper, we consider ambiguity sets $\mathcal{P}$ of the following type.

$$\mathcal{P} := \Big\{ P \in [M(S)]^{S \times A} \,:\, \exists\, \xi \in \Xi \text{ such that } P_{sa} = p^{\xi}(\cdot|s, a) \;\; \forall (s, a) \in S \times A \Big\}. \qquad (3a)$$

Here, we assume that $\Xi$ is a subset of $\mathbb{R}^q$ and that $p^{\xi}(\cdot|s, a)$, $(s, a) \in S \times A$, is an affine function from $\Xi$ to $M(S)$ that satisfies $p^{\xi}(\cdot|s, a) := k_{sa} + K_{sa}\,\xi$ for some $k_{sa} \in \mathbb{R}^S$ and $K_{sa} \in \mathbb{R}^{S \times q}$. The distinction between the sets $\mathcal{P}$ and $\Xi$ allows us to condense all ambiguous parameters in the set $\Xi$. This will enable
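As a small illustration of this parametric structure (Python/NumPy; array shapes and names are our own assumptions, and in the paper $\Xi$ is a conic-representable set handled by a solver rather than checked pointwise), the transition kernel induced by a fixed parameter $\xi$ can be assembled as follows:

```python
import numpy as np

def kernel_from_parameter(k, K, xi):
    """Assemble the kernel P(xi) with rows p^xi(.|s, a) = k_sa + K_sa xi, as in (3a).

    k  : (S, A, S) array, k[s, a] = k_sa
    K  : (S, A, S, q) array, K[s, a] = K_sa
    xi : (q,) parameter vector, assumed to lie in Xi
    """
    P = k + K @ xi  # affine map applied to every (s, a) pair at once
    # Sanity check: every p^xi(.|s, a) must be a probability distribution.
    assert np.all(P >= -1e-9) and np.allclose(P.sum(axis=-1), 1.0)
    return P
```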
References

Garey, M. R., and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., San Francisco, 1979.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming.

Bertsekas, D. P. Dynamic Programming and Optimal Control.

Billingsley, P. Probability and Measure.