Robust Markov Decision Processes
Wolfram Wiesemann, Daniel Kuhn and Berç Rustem
February 9, 2012
Abstract
Markov decision processes (MDPs) are powerful tools for decision making in uncertain dynamic
environments. However, the solutions of MDPs are of limited practical use due to their sensitivity
to distributional model parameters, which are typically unknown and have to be estimated by the
decision maker. To counter the detrimental effects of estimation errors, we consider robust MDPs
that offer probabilistic guarantees in view of the unknown parameters. To this end, we assume that
an observation history of the MDP is available. Based on this history, we derive a confidence region that contains the unknown parameters with a pre-specified probability 1 − β. Afterwards, we
determine a policy that attains the highest worst-case performance over this confidence region. By
construction, this policy achieves or exceeds its worst-case performance with a confidence of at least
1 − β. Our method involves the solution of tractable conic programs of moderate size.
Keywords Robust Optimization; Markov Decision Processes; Semidefinite Programming.
Notation For a finite set $X = \{1, \dots, X\}$, $\mathcal{M}(X)$ denotes the probability simplex in $\mathbb{R}^X$. An $X$-valued random variable $\chi$ has distribution $m \in \mathcal{M}(X)$, denoted by $\chi \sim m$, if $\mathbb{P}(\chi = x) = m_x$ for all $x \in X$. By default, all vectors are column vectors. We denote by $e_k$ the $k$th canonical basis vector, while $e$ denotes the vector whose components are all ones. In both cases, the dimension will usually be clear from the context. For square matrices $A$ and $B$, the relation $A \succeq B$ indicates that the matrix $A - B$ is positive semidefinite. We denote the space of symmetric $n \times n$ matrices by $\mathbb{S}^n$. The declaration $f : X \overset{c}{\mapsto} Y$ ($f : X \overset{a}{\mapsto} Y$) implies that $f$ is a continuous (affine) function from $X$ to $Y$. For a matrix $A$, we denote its $i$th row by $A_{i\cdot}^\top$ (a row vector) and its $j$th column by $A_{\cdot j}$.
1 Introduction
Markov decision processes (MDPs) provide a versatile model for sequential decision making under uncertainty, which accounts for both the immediate effects and the future ramifications of decisions. In the past sixty years, MDPs have been successfully applied to numerous areas, ranging from inventory control and investment planning to studies in economics and behavioral ecology [5, 20].
In this paper, we study MDPs with a finite state space $S = \{1, \dots, S\}$, a finite action space $A = \{1, \dots, A\}$, and a discrete but infinite planning horizon $T = \{0, 1, 2, \dots\}$. Without loss of generality (w.l.o.g.), we assume that every action is admissible in every state. The initial state is random and follows the probability distribution $p^0 \in \mathcal{M}(S)$. If action $a \in A$ is chosen in state $s \in S$, then the subsequent state is determined by the conditional probability distribution $p(\cdot \mid s, a) \in \mathcal{M}(S)$. We condense these conditional distributions to the transition kernel $P \in [\mathcal{M}(S)]^{S \times A}$, where $P_{sa} := p(\cdot \mid s, a)$ for $(s, a) \in S \times A$. The decision maker receives an expected reward of $r(s, a, s') \in \mathbb{R}_+$ if action $a \in A$ is chosen in state $s \in S$ and the subsequent state is $s' \in S$. W.l.o.g., we assume that all rewards are non-negative. The MDP is
controlled through a policy $\pi = (\pi_t)_{t \in T}$, where $\pi_t : (S \times A)^t \times S \mapsto \mathcal{M}(A)$. $\pi_t(\cdot \mid s_0, a_0, \dots, s_{t-1}, a_{t-1}; s_t)$ represents the probability distribution over $A$ according to which the next action is chosen if the current state is $s_t$ and the state-action history is given by $(s_0, a_0, \dots, s_{t-1}, a_{t-1})$. Together with the transition kernel $P$, $\pi$ induces a stochastic process $(s_t, a_t)_{t \in T}$ on the space $(S \times A)^\infty$ of sample paths. We use the notation $\mathbb{E}^{P, \pi}$ to denote expectations with respect to this process. Throughout this paper, we evaluate
policies in view of their expected total reward under the discount factor $\lambda \in (0, 1)$:
$$\mathbb{E}^{P, \pi} \left[ \left. \sum_{t=0}^{\infty} \lambda^t \, r(s_t, a_t, s_{t+1}) \;\right|\; s_0 \sim p^0 \right] \tag{1}$$
For a fixed policy π, the policy evaluation problem asks for the value of expression (1). The policy
improvement problem, on the other hand, asks for a policy π that maximizes (1).
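For a fixed memoryless policy, the (non-robust) policy evaluation problem reduces to a linear system: the value function satisfies $v = r_\pi + \lambda P_\pi v$. The following sketch illustrates this; all data, the dimensions, and the uniform policy are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical MDP: S = 3 states, A = 2 actions, discount factor 0.9.
S, A, lam = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is p(.|s, a)
r = rng.uniform(0.0, 1.0, size=(S, A, S))    # expected rewards r(s, a, s')
pi = np.full((S, A), 1.0 / A)                # memoryless randomized policy

# Policy-induced transition matrix and one-step expected reward.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sat,sat->s', pi, P, r)

# Expected total discounted reward (1) solves v = r_pi + lam * P_pi v.
v = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
p0 = np.full(S, 1.0 / S)                     # initial state distribution
value = p0 @ v
```

The solve is exact here; for large state spaces one would instead iterate the Bellman operator, which contracts with modulus $\lambda$.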
Most of the literature on MDPs assumes that the expected rewards r and the transition kernel P
are known, with a tacit understanding that they have to be estimated in practice. However, it is well-
known that the expected total reward (1) can be very sensitive to small changes in r and P [16]. Thus,
decision makers are confronted with two different sources of uncertainty. On one hand, they face internal
variation due to the stochastic nature of MDPs. On the other hand, they need to cope with external
variation because the estimates for r and P deviate from their true values. In this paper, we assume
that the decision maker is risk-neutral to internal variation but risk-averse to external variation. This
is justified if the MDP runs for a long time, or if many instances of the same MDP run in parallel [16].
We focus on external variation in P and assume r to be known. Indeed, the expected total reward (1)
is typically more sensitive to P , and the inclusion of reward variation is straightforward [8, 16].
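The sensitivity of (1) to the transition kernel can be illustrated numerically: evaluating a fixed policy under a nominal kernel and under a slightly perturbed one may produce markedly different values, especially for discount factors close to one. The data below are hypothetical, and the policy is assumed to be already folded into the induced Markov chain.

```python
import numpy as np

S, lam = 3, 0.95
rng = np.random.default_rng(4)
# Policy-induced chain P_pi and one-step expected rewards r_pi (hypothetical).
P_pi = rng.dirichlet(np.ones(S), size=S)
r_pi = rng.uniform(0.0, 1.0, size=S)

def total_reward(P):
    """Expected total discounted reward under a uniform initial distribution."""
    v = np.linalg.solve(np.eye(S) - lam * P, r_pi)
    return np.full(S, 1.0 / S) @ v

# Shift a small amount of probability mass within one row and re-evaluate.
P_pert = P_pi.copy()
i, j = np.argmax(P_pert[0]), np.argmin(P_pert[0])
eps = 0.05 * P_pert[0, i]
P_pert[0, i] -= eps
P_pert[0, j] += eps
gap = abs(total_reward(P_pert) - total_reward(P_pi))
```

Because the resolvent $(I - \lambda P)^{-1}$ has norm of order $1/(1-\lambda)$, a perturbation of size $\varepsilon$ in $P$ can move the value by roughly $\varepsilon/(1-\lambda)$.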
Let $P^0$ be the unknown true transition kernel of the MDP. Since the expected total reward of a policy depends on $P^0$, we cannot evaluate expression (1) under external variation. Iyengar [12] and Nilim and El Ghaoui [18] therefore suggest finding a policy that guarantees the highest expected total reward at a given confidence level. To this end, they determine a policy $\pi$ that maximizes the worst-case objective
$$z^* = \inf_{P \in \mathcal{P}} \mathbb{E}^{P, \pi} \left[ \left. \sum_{t=0}^{\infty} \lambda^t \, r(s_t, a_t, s_{t+1}) \;\right|\; s_0 \sim p^0 \right], \tag{2}$$
where the ambiguity set $\mathcal{P}$ is the Cartesian product of independent marginal sets $\mathcal{P}_{sa} \subseteq \mathcal{M}(S)$ for each $(s, a) \in S \times A$. In the following, we call such ambiguity sets rectangular. Problem (2) determines the worst-case expected total reward of $\pi$ if the transition kernel can vary freely within $\mathcal{P}$. In analogy to our earlier definitions, the robust policy evaluation problem evaluates expression (2) for a fixed policy $\pi$, while the robust policy improvement problem asks for a policy that maximizes (2). The optimal value $z^*$ in (2) provides a lower bound on the expected total reward of $\pi$ if the true transition kernel $P^0$ is contained in the ambiguity set $\mathcal{P}$. Hence, if $\mathcal{P}$ is a confidence region that contains $P^0$ with probability $1 - \beta$, then the policy $\pi$ guarantees an expected total reward of at least $z^*$ at a confidence level $1 - \beta$. To construct an ambiguity set $\mathcal{P}$ with this property, [12] and [18] assume that independent transition samples are available for each state-action pair $(s, a) \in S \times A$. Under this assumption, one can employ standard results on the asymptotic properties of the maximum likelihood estimator to derive a confidence region for $P^0$. If we project this confidence region onto the marginal sets $\mathcal{P}_{sa}$, then $z^*$ provides the desired probabilistic lower bound on the expected total reward of $\pi$.
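Under rectangularity, the infimum in (2) decomposes across state-action pairs, which is what makes a robust Bellman recursion applicable. The sketch below assumes, purely for illustration, interval marginal sets $\mathcal{P}_{sa} = \{p \in \mathcal{M}(S) : \ell_{sa} \le p \le u_{sa}\}$; the inner minimization over such a set is a small linear program that can be solved by sorting. All data are hypothetical.

```python
import numpy as np

def worst_case_expectation(v, lo, hi):
    """min_p p @ v  s.t.  lo <= p <= hi, sum(p) = 1 (greedy by sorting)."""
    p = lo.copy()
    budget = 1.0 - lo.sum()                  # mass left to distribute
    for s in np.argsort(v):                  # pour mass onto the cheapest states
        add = min(hi[s] - lo[s], budget)
        p[s] += add
        budget -= add
        if budget <= 0:
            break
    return p @ v

# Hypothetical robust MDP with interval ambiguity sets around a nominal kernel.
S, A, lam = 4, 2, 0.9
rng = np.random.default_rng(1)
P_nom = rng.dirichlet(np.ones(S), size=(S, A))
lo = np.clip(P_nom - 0.05, 0.0, 1.0)
hi = np.clip(P_nom + 0.05, 0.0, 1.0)
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards r(s, a), for simplicity

# Robust value iteration: v(s) = max_a [ r(s,a) + lam * min_{p in P_sa} p @ v ].
v = np.zeros(S)
for _ in range(500):
    v_new = np.array([max(r[s, a]
                          + lam * worst_case_expectation(v, lo[s, a], hi[s, a])
                          for a in range(A)) for s in range(S)])
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
```

The outer maximum over actions makes this a robust policy improvement step; dropping it and averaging over a fixed policy gives robust policy evaluation.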
In this paper, we alter two key assumptions of the outlined procedure. Firstly, we assume that the decision maker cannot obtain independent transition samples for the state-action pairs. Instead, she merely has access to an observation history $(s_1, a_1, \dots, s_n, a_n) \in (S \times A)^n$ generated by the MDP under some known policy. Secondly, we relax the assumption of rectangular ambiguity sets. In the following, we briefly motivate these changes and give an outlook on their consequences.
Although transition sampling has theoretical appeal, it is often prohibitively costly or even infeasible
in practice. To obtain independent samples for each state-action pair, one needs to repeatedly direct
the MDP into any of its states and record the transitions resulting from different actions. In particular,
one cannot use the transition frequencies of an observation history because those frequencies violate the
independence assumption stated above. The availability of an observation history, on the other hand,
seems much more realistic in practice. Observation histories introduce a number of theoretical challenges,
such as the lack of observations for some transitions and stochastic dependencies between the transition
frequencies. We will apply results from statistical inference on Markov chains to address these issues. It
turns out that many of the results derived for transition sampling in [12] and [18] remain valid in the
new setting where the transition probabilities are estimated from observation histories.
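As a concrete illustration of estimation from an observation history (rather than from independent transition samples), the sketch below computes empirical transition frequencies from a single simulated trajectory; the trajectory, policy, and dimensions are hypothetical. Note that some state-action pairs may remain unvisited, and the resulting counts are stochastically dependent, as discussed above.

```python
import numpy as np

S, A = 3, 2
rng = np.random.default_rng(2)
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical true kernel

# Generate an observation history under a uniformly randomizing policy.
n = 10_000
s = 0
history = []
for _ in range(n):
    a = rng.integers(A)
    s_next = rng.choice(S, p=P_true[s, a])
    history.append((s, a, s_next))
    s = s_next

# Empirical transition frequencies (the MLE wherever counts are positive);
# unvisited state-action pairs are left as NaN.
counts = np.zeros((S, A, S))
for (s, a, s_next) in history:
    counts[s, a, s_next] += 1
visits = counts.sum(axis=2, keepdims=True)
P_hat = np.divide(counts, visits, out=np.full_like(counts, np.nan),
                  where=visits > 0)
```

The NaN rows make the lack of observations explicit: any downstream confidence region must either exclude those pairs or treat them with maximal ambiguity.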
The restriction to rectangular ambiguity sets has been introduced in [12] and [18] to facilitate computational tractability. Under the assumption of rectangularity, the robust policy evaluation and improvement problems can be solved efficiently with a modified value or policy iteration. This implies, however,
that non-rectangular ambiguity sets have to be projected onto the marginal sets $\mathcal{P}_{sa}$. Not only does this
. Not only does this
‘rectangularization’ unduly increase the level of conservatism, but it also creates a number of undesirable
side-effects that we discuss in Section 2. In this paper, we show that the robust policy evaluation and
improvement problems remain tractable for ambiguity sets that exhibit a milder form of rectangularity,
and we develop a polynomial time solution method. On the other hand, we prove that the robust policy
evaluation and improvement problems are intractable for non-rectangular ambiguity sets. For this set-
ting, we formulate conservative approximations of the policy evaluation and improvement problems. We
bound the optimality gap incurred from solving those approximations, and we outline how our approach
can be generalized to a hierarchy of increasingly accurate approximations.
The contributions of this paper can be summarized as follows.
1. We analyze a new class of ambiguity sets, which contains the above defined rectangular ambiguity
sets as a special case. We show that the optimal policies for this class are randomized but memo-
ryless. We develop algorithms that solve the robust policy evaluation and improvement problems
over these ambiguity sets in polynomial time.
2. It is stated in [18] that the robust policy evaluation and improvement problems “seem to be hard to
solve” for non-rectangular ambiguity sets. We prove that these problems cannot be approximated to
any constant factor in polynomial time unless P = NP. We develop a hierarchy of increasingly ac-
curate conservative approximations, together with ex post bounds on the incurred optimality gap.
3. We present a method to construct ambiguity sets from observation histories. Our approach allows us to account for different types of a priori information about the transition kernel, which helps to
reduce the size of the ambiguity set. We also investigate the convergence behavior of our ambiguity
set when the length of the observation history increases.
The study of robust MDPs with rectangular ambiguity sets dates back to the seventies, see [3, 10, 22,
26] and the surveys in [12, 18]. However, most of the early contributions do not address the construction
of suitable ambiguity sets. In [16], Mannor et al. approximate the bias and variance of the expected total
reward (1) if the unknown model parameters are replaced with estimates. Delage and Mannor [8] use
these approximations to solve a chance-constrained policy improvement problem in a Bayesian setting.
Recently, alternative performance criteria have been suggested to address external variation, such as the
worst-case expected utility and regret measures. We refer to [19, 27] and the references cited therein.
Note that external variation could be addressed by encoding the unknown model parameters into the
states of a partially observable MDP (POMDP) [17]. However, the optimization of POMDPs becomes
challenging even for small state spaces. In our case, the augmented state space would become very large,
which renders optimization of the resulting POMDPs prohibitively expensive.
The remainder of the paper is organized as follows. Section 2 defines and analyzes the classes of
robust MDPs that we consider. Sections 3 and 4 study the robust policy evaluation and improvement
problems, respectively. Section 5 constructs ambiguity sets from observation histories. We illustrate our
method in Section 6, where we apply it to the machine replacement problem. We conclude in Section 7.
Remark 1.1 (Finite Horizon MDPs) Throughout the paper, we outline how our results extend to finite horizon MDPs. In this case, we assume that $T = \{0, 1, 2, \dots, T\}$ with $T < \infty$ and that $S$ can be partitioned into nonempty disjoint sets $\{S_t\}_{t \in T}$ such that at period $t$ the system is in one of the states in $S_t$. We do not discount rewards in finite horizon MDPs. In addition to the transition rewards $r(s, a, s')$, an expected reward of $r_s \in \mathbb{R}_+$ is received if the MDP reaches the terminal state $s \in S_T$. We assume that $p^0(s) = 0$ for $s \notin S_0$.
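For the finite horizon setting of Remark 1.1, (non-robust) policy improvement reduces to undiscounted backward induction over the partition $\{S_t\}_{t \in T}$. The sketch below represents each $S_t$ as a layer of $m$ hypothetical states; all data are illustrative assumptions.

```python
import numpy as np

# Hypothetical layered finite-horizon MDP: T transition periods, m states per layer.
T, m, A = 3, 4, 2
rng = np.random.default_rng(3)
# p[t][s, a] is the distribution over layer-(t+1) states from state s in layer t.
p = [rng.dirichlet(np.ones(m), size=(m, A)) for t in range(T)]
r = [rng.uniform(0.0, 1.0, size=(m, A, m)) for t in range(T)]  # r(s, a, s')
r_term = rng.uniform(0.0, 1.0, size=m)       # terminal rewards for s in S_T

# Undiscounted backward induction, as in the finite horizon remark.
v = r_term.copy()
for t in reversed(range(T)):
    # q[s, a] = sum_{s'} p_t(s'|s,a) * (r_t(s,a,s') + v(s'))
    q = np.einsum('sat,sat->sa', p[t], r[t]) + p[t] @ v
    v = q.max(axis=1)

p0 = np.full(m, 1.0 / m)                     # initial distribution on S_0
value = p0 @ v
```

Replacing the inner expectation with a worst case over a marginal ambiguity set turns each backward step into a robust Bellman update, mirroring the infinite horizon case.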
2 Robust Markov Decision Processes
This section studies properties of the robust policy evaluation and improvement problems. Both problems
are concerned with robust MDPs, for which the transition kernel is only known to be an element of an
ambiguity set $\mathcal{P} \subseteq [\mathcal{M}(S)]^{S \times A}$. We assume that the initial state distribution $p^0$ is known.
We start with the robust policy evaluation problem. We define the structure of the ambiguity sets that
we consider, as well as different types of rectangularity that can be imposed to facilitate computational
tractability. Afterwards, we discuss the robust policy improvement problem. We define several policy
classes that are commonly used in MDPs, and we investigate the structure of optimal policies for different
types of rectangularity. We close with a complexity result for the robust policy evaluation problem. Since
the remainder of this paper almost exclusively deals with the robust versions of the policy evaluation
and improvement problems, we may suppress the attribute ‘robust’ in the following.
2.1 The Robust Policy Evaluation Problem
In this paper, we consider ambiguity sets P of the following type.
$$\mathcal{P} := \left\{ P \in [\mathcal{M}(S)]^{S \times A} : \exists \xi \in \Xi \text{ such that } P_{sa} = p^\xi(\cdot \mid s, a) \;\; \forall (s, a) \in S \times A \right\}. \tag{3a}$$
Here, we assume that $\Xi$ is a subset of $\mathbb{R}^q$ and that $p^\xi(\cdot \mid s, a)$, $(s, a) \in S \times A$, is an affine function from $\Xi$ to $\mathcal{M}(S)$ that satisfies $p^\xi(\cdot \mid s, a) := k_{sa} + K_{sa} \xi$ for some $k_{sa} \in \mathbb{R}^S$ and $K_{sa} \in \mathbb{R}^{S \times q}$. The distinction between the sets $\mathcal{P}$ and $\Xi$ allows us to condense all ambiguous parameters in the set $\Xi$. This will enable