Robust Markov Decision Processes
Wolfram Wiesemann, Daniel Kuhn and Berç Rustem
February 9, 2012
Abstract
Markov decision processes (MDPs) are powerful tools for decision making in uncertain dynamic
environments. However, the solutions of MDPs are of limited practical use due to their sensitivity
to distributional model parameters, which are typically unknown and have to be estimated by the
decision maker. To counter the detrimental effects of estimation errors, we consider robust MDPs
that offer probabilistic guarantees in view of the unknown parameters. To this end, we assume that
an observation history of the MDP is available. Based on this history, we derive a confidence region that contains the unknown parameters with a pre-specified probability 1 − β. Afterwards, we
determine a policy that attains the highest worst-case performance over this confidence region. By
construction, this policy achieves or exceeds its worst-case performance with a confidence of at least
1 − β. Our method involves the solution of tractable conic programs of moderate size.
Keywords Robust Optimization; Markov Decision Processes; Semidefinite Programming.
Notation For a finite set $X = \{1, \dots, X\}$, $\mathcal{M}(X)$ denotes the probability simplex in $\mathbb{R}^X$. An $X$-valued random variable $\chi$ has distribution $m \in \mathcal{M}(X)$, denoted by $\chi \sim m$, if $\mathbb{P}(\chi = x) = m_x$ for all $x \in X$. By default, all vectors are column vectors. We denote by $e_k$ the $k$th canonical basis vector, while $e$ denotes the vector whose components are all ones. In both cases, the dimension will usually be clear from the context. For square matrices $A$ and $B$, the relation $A \succeq B$ indicates that the matrix $A - B$ is positive semidefinite. We denote the space of symmetric $n \times n$ matrices by $\mathbb{S}^n$. The declaration $f : X \overset{c}{\mapsto} Y$ ($f : X \overset{a}{\mapsto} Y$) implies that $f$ is a continuous (affine) function from $X$ to $Y$. For a matrix $A$, we denote its $i$th row by $A_{i\cdot}^\top$ (a row vector) and its $j$th column by $A_{\cdot j}$.
1 Introduction
Markov decision processes (MDPs) provide a versatile model for sequential decision making under uncertainty, which accounts for both the immediate effects and the future ramifications of decisions. In the past sixty years, MDPs have been successfully applied to numerous areas, ranging from inventory control and investment planning to studies in economics and behavioral ecology [5, 20].
In this paper, we study MDPs with a finite state space $S = \{1, \dots, S\}$, a finite action space $A = \{1, \dots, A\}$, and a discrete but infinite planning horizon $T = \{0, 1, 2, \dots\}$. Without loss of generality (w.l.o.g.), we assume that every action is admissible in every state. The initial state is random and follows the probability distribution $p^0 \in \mathcal{M}(S)$. If action $a \in A$ is chosen in state $s \in S$, then the subsequent state is determined by the conditional probability distribution $p(\cdot \mid s, a) \in \mathcal{M}(S)$. We condense these conditional distributions to the transition kernel $P \in [\mathcal{M}(S)]^{S \times A}$, where $P_{sa} := p(\cdot \mid s, a)$ for $(s, a) \in S \times A$. The decision maker receives an expected reward of $r(s, a, s') \in \mathbb{R}_+$ if action $a \in A$ is chosen in state $s \in S$ and the subsequent state is $s' \in S$. W.l.o.g., we assume that all rewards are non-negative. The MDP is
controlled through a policy $\pi = (\pi_t)_{t \in T}$, where $\pi_t : (S \times A)^t \times S \mapsto \mathcal{M}(A)$. $\pi_t(\cdot \mid s_0, a_0, \dots, s_{t-1}, a_{t-1}; s_t)$ represents the probability distribution over $A$ according to which the next action is chosen if the current state is $s_t$ and the state-action history is given by $(s_0, a_0, \dots, s_{t-1}, a_{t-1})$. Together with the transition kernel $P$, $\pi$ induces a stochastic process $(s_t, a_t)_{t \in T}$ on the space $(S \times A)^\infty$ of sample paths. We use the notation $\mathbb{E}^{P, \pi}$ to denote expectations with respect to this process. Throughout this paper, we evaluate
policies in view of their expected total reward under the discount factor $\lambda \in (0, 1)$:
$$\mathbb{E}^{P, \pi} \left[ \left. \sum_{t=0}^{\infty} \lambda^t \, r(s_t, a_t, s_{t+1}) \;\right|\; s_0 \sim p^0 \right] \tag{1}$$
For a fixed policy π, the policy evaluation problem asks for the value of expression (1). The policy
improvement problem, on the other hand, asks for a policy π that maximizes (1).
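For a fixed memoryless policy, the (non-robust) policy evaluation problem reduces to a linear system: the value function satisfies $v = r_\pi + \lambda P_\pi v$. The following sketch illustrates this; all data, the dimensions, and the uniform policy are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical MDP: S = 3 states, A = 2 actions, discount factor 0.9.
S, A, lam = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is p(.|s, a)
r = rng.uniform(0.0, 1.0, size=(S, A, S))    # expected rewards r(s, a, s')
pi = np.full((S, A), 1.0 / A)                # memoryless randomized policy

# Policy-induced transition matrix and one-step expected reward.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sat,sat->s', pi, P, r)

# Expected total discounted reward (1) solves v = r_pi + lam * P_pi v.
v = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
p0 = np.full(S, 1.0 / S)                     # initial state distribution
value = p0 @ v
```

The solve is exact here; for large state spaces one would instead iterate the Bellman operator, which contracts with modulus $\lambda$.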
Most of the literature on MDPs assumes that the expected rewards r and the transition kernel P
are known, with a tacit understanding that they have to be estimated in practice. However, it is well-
known that the expected total reward (1) can be very sensitive to small changes in r and P [16]. Thus,
decision makers are confronted with two different sources of uncertainty. On one hand, they face internal
variation due to the stochastic nature of MDPs. On the other hand, they need to cope with external
variation because the estimates for r and P deviate from their true values. In this paper, we assume
that the decision maker is risk-neutral to internal variation but risk-averse to external variation. This
is justified if the MDP runs for a long time, or if many instances of the same MDP run in parallel [16].
We focus on external variation in P and assume r to be known. Indeed, the expected total reward (1)
is typically more sensitive to P , and the inclusion of reward variation is straightforward [8, 16].
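The sensitivity of (1) to the transition kernel can be illustrated numerically: evaluating a fixed policy under a nominal kernel and under a slightly perturbed one may produce markedly different values, especially for discount factors close to one. The data below are hypothetical, and the policy is assumed to be already folded into the induced Markov chain.

```python
import numpy as np

S, lam = 3, 0.95
rng = np.random.default_rng(4)
# Policy-induced chain P_pi and one-step expected rewards r_pi (hypothetical).
P_pi = rng.dirichlet(np.ones(S), size=S)
r_pi = rng.uniform(0.0, 1.0, size=S)

def total_reward(P):
    """Expected total discounted reward under a uniform initial distribution."""
    v = np.linalg.solve(np.eye(S) - lam * P, r_pi)
    return np.full(S, 1.0 / S) @ v

# Shift a small amount of probability mass within one row and re-evaluate.
P_pert = P_pi.copy()
i, j = np.argmax(P_pert[0]), np.argmin(P_pert[0])
eps = 0.05 * P_pert[0, i]
P_pert[0, i] -= eps
P_pert[0, j] += eps
gap = abs(total_reward(P_pert) - total_reward(P_pi))
```

Because the resolvent $(I - \lambda P)^{-1}$ has norm of order $1/(1-\lambda)$, a perturbation of size $\varepsilon$ in $P$ can move the value by roughly $\varepsilon/(1-\lambda)$.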
Let $P^0$ be the unknown true transition kernel of the MDP. Since the expected total reward of a policy depends on $P^0$, we cannot evaluate expression (1) under external variation. Iyengar [12] and Nilim and El Ghaoui [18] therefore suggest finding a policy that guarantees the highest expected total reward at a given confidence level. To this end, they determine a policy $\pi$ that maximizes the worst-case objective
$$z^* = \inf_{P \in \mathcal{P}} \mathbb{E}^{P, \pi} \left[ \left. \sum_{t=0}^{\infty} \lambda^t \, r(s_t, a_t, s_{t+1}) \;\right|\; s_0 \sim p^0 \right], \tag{2}$$
where the ambiguity set $\mathcal{P}$ is the Cartesian product of independent marginal sets $\mathcal{P}_{sa} \subseteq \mathcal{M}(S)$ for each $(s, a) \in S \times A$. In the following, we call such ambiguity sets rectangular. Problem (2) determines the worst-case expected total reward of $\pi$ if the transition kernel can vary freely within $\mathcal{P}$. In analogy to our earlier definitions, the robust policy evaluation problem evaluates expression (2) for a fixed policy $\pi$, while the robust policy improvement problem asks for a policy that maximizes (2). The optimal value $z^*$ in (2) provides a lower bound on the expected total reward of $\pi$ if the true transition kernel $P^0$ is contained in the ambiguity set $\mathcal{P}$. Hence, if $\mathcal{P}$ is a confidence region that contains $P^0$ with probability $1 - \beta$, then the policy $\pi$ guarantees an expected total reward of at least $z^*$ at a confidence level $1 - \beta$. To construct an ambiguity set $\mathcal{P}$ with this property, [12] and [18] assume that independent transition samples are available for each state-action pair $(s, a) \in S \times A$. Under this assumption, one can employ standard results on the asymptotic properties of the maximum likelihood estimator to derive a confidence region for $P^0$. If we project this confidence region onto the marginal sets $\mathcal{P}_{sa}$, then $z^*$ provides the desired probabilistic lower bound on the expected total reward of $\pi$.
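Under rectangularity, the infimum in (2) decomposes across state-action pairs, which is what makes a robust Bellman recursion applicable. The sketch below assumes, purely for illustration, interval marginal sets $\mathcal{P}_{sa} = \{p \in \mathcal{M}(S) : \ell_{sa} \le p \le u_{sa}\}$; the inner minimization over such a set is a small linear program that can be solved by sorting. All data are hypothetical.

```python
import numpy as np

def worst_case_expectation(v, lo, hi):
    """min_p p @ v  s.t.  lo <= p <= hi, sum(p) = 1 (greedy by sorting)."""
    p = lo.copy()
    budget = 1.0 - lo.sum()                  # mass left to distribute
    for s in np.argsort(v):                  # pour mass onto the cheapest states
        add = min(hi[s] - lo[s], budget)
        p[s] += add
        budget -= add
        if budget <= 0:
            break
    return p @ v

# Hypothetical robust MDP with interval ambiguity sets around a nominal kernel.
S, A, lam = 4, 2, 0.9
rng = np.random.default_rng(1)
P_nom = rng.dirichlet(np.ones(S), size=(S, A))
lo = np.clip(P_nom - 0.05, 0.0, 1.0)
hi = np.clip(P_nom + 0.05, 0.0, 1.0)
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards r(s, a), for simplicity

# Robust value iteration: v(s) = max_a [ r(s,a) + lam * min_{p in P_sa} p @ v ].
v = np.zeros(S)
for _ in range(500):
    v_new = np.array([max(r[s, a]
                          + lam * worst_case_expectation(v, lo[s, a], hi[s, a])
                          for a in range(A)) for s in range(S)])
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
```

The outer maximum over actions makes this a robust policy improvement step; dropping it and averaging over a fixed policy gives robust policy evaluation.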
In this paper, we alter two key assumptions of the outlined procedure. Firstly, we assume that the decision maker cannot obtain independent transition samples for the state-action pairs. Instead, she merely has access to an observation history $(s_1, a_1, \dots, s_n, a_n) \in (S \times A)^n$ generated by the MDP under some known policy. Secondly, we relax the assumption of rectangular ambiguity sets. In the following, we briefly motivate these changes and give an outlook on their consequences.
Although transition sampling has theoretical appeal, it is often prohibitively costly or even infeasible
in practice. To obtain independent samples for each state-action pair, one needs to repeatedly direct
the MDP into any of its states and record the transitions resulting from different actions. In particular,
one cannot use the transition frequencies of an observation history because those frequencies violate the
independence assumption stated above. The availability of an observation history, on the other hand,
seems much more realistic in practice. Observation histories introduce a number of theoretical challenges,
such as the lack of observations for some transitions and stochastic dependencies between the transition
frequencies. We will apply results from statistical inference on Markov chains to address these issues. It
turns out that many of the results derived for transition sampling in [12] and [18] remain valid in the
new setting where the transition probabilities are estimated from observation histories.
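As a concrete illustration of estimation from an observation history (rather than from independent transition samples), the sketch below computes empirical transition frequencies from a single simulated trajectory; the trajectory, policy, and dimensions are hypothetical. Note that some state-action pairs may remain unvisited, and the resulting counts are stochastically dependent, as discussed above.

```python
import numpy as np

S, A = 3, 2
rng = np.random.default_rng(2)
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical true kernel

# Generate an observation history under a uniformly randomizing policy.
n = 10_000
s = 0
history = []
for _ in range(n):
    a = rng.integers(A)
    s_next = rng.choice(S, p=P_true[s, a])
    history.append((s, a, s_next))
    s = s_next

# Empirical transition frequencies (the MLE wherever counts are positive);
# unvisited state-action pairs are left as NaN.
counts = np.zeros((S, A, S))
for (s, a, s_next) in history:
    counts[s, a, s_next] += 1
visits = counts.sum(axis=2, keepdims=True)
P_hat = np.divide(counts, visits, out=np.full_like(counts, np.nan),
                  where=visits > 0)
```

The NaN rows make the lack of observations explicit: any downstream confidence region must either exclude those pairs or treat them with maximal ambiguity.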
The restriction to rectangular ambiguity sets has been introduced in [12] and [18] to facilitate computational tractability. Under the assumption of rectangularity, the robust policy evaluation and improvement problems can be solved efficiently with a modified value or policy iteration. This implies, however,
that non-rectangular ambiguity sets have to be projected onto the marginal sets $\mathcal{P}_{sa}$. Not only does this
. Not only does this
‘rectangularization’ unduly increase the level of conservatism, but it also creates a number of undesirable
side-effects that we discuss in Section 2. In this paper, we show that the robust policy evaluation and
improvement problems remain tractable for ambiguity sets that exhibit a milder form of rectangularity,
and we develop a polynomial time solution method. On the other hand, we prove that the robust policy
evaluation and improvement problems are intractable for non-rectangular ambiguity sets. For this set-
ting, we formulate conservative approximations of the policy evaluation and improvement problems. We
bound the optimality gap incurred from solving those approximations, and we outline how our approach
can be generalized to a hierarchy of increasingly accurate approximations.
The contributions of this paper can be summarized as follows.
1. We analyze a new class of ambiguity sets, which contains the above defined rectangular ambiguity
sets as a special case. We show that the optimal policies for this class are randomized but memo-
ryless. We develop algorithms that solve the robust policy evaluation and improvement problems
over these ambiguity sets in polynomial time.
2. It is stated in [18] that the robust policy evaluation and improvement problems “seem to be hard to
solve” for non-rectangular ambiguity sets. We prove that these problems cannot be approximated to
any constant factor in polynomial time unless P = NP. We develop a hierarchy of increasingly ac-
curate conservative approximations, together with ex post bounds on the incurred optimality gap.
3. We present a method to construct ambiguity sets from observation histories. Our approach allows us to account for different types of a priori information about the transition kernel, which helps to
reduce the size of the ambiguity set. We also investigate the convergence behavior of our ambiguity
set when the length of the observation history increases.
The study of robust MDPs with rectangular ambiguity sets dates back to the seventies, see [3, 10, 22,
26] and the surveys in [12, 18]. However, most of the early contributions do not address the construction
of suitable ambiguity sets. In [16], Mannor et al. approximate the bias and variance of the expected total
reward (1) if the unknown model parameters are replaced with estimates. Delage and Mannor [8] use
these approximations to solve a chance-constrained policy improvement problem in a Bayesian setting.
Recently, alternative performance criteria have been suggested to address external variation, such as the
worst-case expected utility and regret measures. We refer to [19, 27] and the references cited therein.
Note that external variation could be addressed by encoding the unknown model parameters into the
states of a partially observable MDP (POMDP) [17]. However, the optimization of POMDPs becomes
challenging even for small state spaces. In our case, the augmented state space would become very large,
which renders optimization of the resulting POMDPs prohibitively expensive.
The remainder of the paper is organized as follows. Section 2 defines and analyzes the classes of
robust MDPs that we consider. Sections 3 and 4 study the robust policy evaluation and improvement
problems, respectively. Section 5 constructs ambiguity sets from observation histories. We illustrate our
method in Section 6, where we apply it to the machine replacement problem. We conclude in Section 7.
Remark 1.1 (Finite Horizon MDPs) Throughout the paper, we outline how our results extend to finite horizon MDPs. In this case, we assume that $T = \{0, 1, 2, \dots, T\}$ with $T < \infty$ and that $S$ can be partitioned into nonempty disjoint sets $\{S_t\}_{t \in T}$ such that at period $t$ the system is in one of the states in $S_t$. We do not discount rewards in finite horizon MDPs. In addition to the transition rewards $r(s, a, s')$, an expected reward of $r_s \in \mathbb{R}_+$ is received if the MDP reaches the terminal state $s \in S_T$. We assume that $p^0(s) = 0$ for $s \notin S_0$.
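For the finite horizon setting of Remark 1.1, (non-robust) policy improvement reduces to undiscounted backward induction over the partition $\{S_t\}_{t \in T}$. The sketch below represents each $S_t$ as a layer of $m$ hypothetical states; all data are illustrative assumptions.

```python
import numpy as np

# Hypothetical layered finite-horizon MDP: T transition periods, m states per layer.
T, m, A = 3, 4, 2
rng = np.random.default_rng(3)
# p[t][s, a] is the distribution over layer-(t+1) states from state s in layer t.
p = [rng.dirichlet(np.ones(m), size=(m, A)) for t in range(T)]
r = [rng.uniform(0.0, 1.0, size=(m, A, m)) for t in range(T)]  # r(s, a, s')
r_term = rng.uniform(0.0, 1.0, size=m)       # terminal rewards for s in S_T

# Undiscounted backward induction, as in the finite horizon remark.
v = r_term.copy()
for t in reversed(range(T)):
    # q[s, a] = sum_{s'} p_t(s'|s,a) * (r_t(s,a,s') + v(s'))
    q = np.einsum('sat,sat->sa', p[t], r[t]) + p[t] @ v
    v = q.max(axis=1)

p0 = np.full(m, 1.0 / m)                     # initial distribution on S_0
value = p0 @ v
```

Replacing the inner expectation with a worst case over a marginal ambiguity set turns each backward step into a robust Bellman update, mirroring the infinite horizon case.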
2 Robust Markov Decision Processes
This section studies properties of the robust policy evaluation and improvement problems. Both problems
are concerned with robust MDPs, for which the transition kernel is only known to be an element of an
ambiguity set $\mathcal{P} \subseteq [\mathcal{M}(S)]^{S \times A}$. We assume that the initial state distribution $p^0$ is known.
We start with the robust policy evaluation problem. We define the structure of the ambiguity sets that
we consider, as well as different types of rectangularity that can be imposed to facilitate computational
tractability. Afterwards, we discuss the robust policy improvement problem. We define several policy
classes that are commonly used in MDPs, and we investigate the structure of optimal policies for different
types of rectangularity. We close with a complexity result for the robust policy evaluation problem. Since
the remainder of this paper almost exclusively deals with the robust versions of the policy evaluation
and improvement problems, we may suppress the attribute ‘robust’ in the following.
2.1 The Robust Policy Evaluation Problem
In this paper, we consider ambiguity sets P of the following type.
$$\mathcal{P} := \left\{ P \in [\mathcal{M}(S)]^{S \times A} : \exists \xi \in \Xi \text{ such that } P_{sa} = p^\xi(\cdot \mid s, a) \;\; \forall (s, a) \in S \times A \right\}. \tag{3a}$$
Here, we assume that $\Xi$ is a subset of $\mathbb{R}^q$ and that $p^\xi(\cdot \mid s, a)$, $(s, a) \in S \times A$, is an affine function from $\Xi$ to $\mathcal{M}(S)$ that satisfies $p^\xi(\cdot \mid s, a) := k_{sa} + K_{sa} \xi$ for some $k_{sa} \in \mathbb{R}^S$ and $K_{sa} \in \mathbb{R}^{S \times q}$. The distinction between the sets $\mathcal{P}$ and $\Xi$ allows us to condense all ambiguous parameters in the set $\Xi$. This will enable