
CORC Tech Report TR-2002-07
Robust dynamic programming
G. Iyengar
Submitted Dec. 3rd, 2002. Revised May 4, 2004.
Abstract
In this paper we propose a robust formulation for discrete time dynamic programming (DP). The
objective of the robust formulation is to systematically mitigate the sensitivity of the DP optimal policy
to ambiguity in the underlying transition probabilities. The ambiguity is modeled by associating a
set of conditional measures with each state-action pair. Consequently, in the robust formulation each
policy has a set of measures associated with it. We prove that when this set of measures has a certain
“Rectangularity” property, all the main results for finite and infinite horizon DP extend to natural robust
counterparts. We identify families of sets of conditional measures for which the computational complexity
of solving the robust DP is only modestly larger than that of solving the DP, typically logarithmic in the size
of the state space. These families of sets are constructed from the confidence regions associated with
density estimation, and therefore, can be chosen to guarantee any desired level of confidence in the robust
optimal policy. Moreover, the sets can be easily parameterized from historical data. We contrast the
performance of robust and non-robust DP on small numerical examples.
1 Introduction
This paper is concerned with sequential decision making in uncertain environments. Decisions are made
in stages and each decision, in addition to providing an immediate reward, changes the context of future
decisions; thereby affecting the future rewards. Due to the uncertain nature of the environment, there is
limited information about both the immediate reward from each decision and the resulting future state. In
order to achieve a good performance over all the stages the decision maker has to trade-off the immediate
payoff with future payoffs. Dynamic programming (DP) is the mathematical framework that allows the
decision maker to efficiently compute a good overall strategy by succinctly encoding the evolving information
state. In the DP formalism the uncertainty in the environment is modeled by a Markov process whose
transition probability depends both on the information state and the action taken by the decision maker. It
is assumed that the transition probability corresponding to each state-action pair is known to the decision
maker, and the goal is to choose a policy, i.e. a rule that maps states to actions, that maximizes some
performance measure. Puterman (1994) provides an excellent introduction to the DP formalism and its
various applications. In this paper, we assume that the reader has some prior knowledge of DP.
Submitted to Math. Oper. Res. Do not distribute.
IEOR Department, Columbia University, Email: garud@ieor.columbia.edu. Research partially supported by NSF grants
CCR-00-09972 and DMS-01-04282.

The DP formalism encodes information in the form of a “reward-to-go” function (see Puterman, 1994, for
details) and chooses an action that maximizes the sum of the immediate reward and the expected “reward-
to-go”. Thus, to compute the optimal action in any given state the “reward-to-go” function for all the future
states must be known. In many applications of DP, the number of states and actions available in each state
are large; consequently, the computational effort required to compute the optimal policy for a DP can be
overwhelming: this is Bellman’s “curse of dimensionality”. For this reason, considerable recent research effort has
focused on developing algorithms that compute an approximately optimal policy efficiently (Bertsekas and
Tsitsiklis, 1996; de Farias and Van Roy, 2002).
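To make the recursion concrete, the following minimal Python sketch performs the standard (non-robust) backward induction that computes the “reward-to-go” function on a toy finite-horizon MDP with known transition probabilities; the states, rewards, and probabilities are invented purely for illustration and are not taken from the paper.

```python
# Minimal non-robust finite-horizon DP (backward induction) on toy data:
#   V_N(s) = r_N(s),
#   V_t(s) = max_a sum_{s'} p_t(s'|s,a) [ r_t(s,a,s') + V_{t+1}(s') ].

N = 2                                   # decision epochs t = 0, ..., N-1
states = [0, 1]
actions = {s: [0, 1] for s in states}   # A_t(s), taken time-invariant here

def p(t, s, a):
    """Known transition probability p_t(. | s, a) as a dict s' -> prob (toy numbers)."""
    return {0: 0.7, 1: 0.3} if a == 0 else {0: 0.4, 1: 0.6}

def r(t, s, a, s_next):
    """Immediate reward r_t(s, a, s')."""
    return 1.0 if s_next == 1 else 0.0

r_terminal = {0: 0.0, 1: 2.0}           # terminal reward r_N(s)

V = {s: r_terminal[s] for s in states}  # V_N
policy = {}
for t in reversed(range(N)):            # t = N-1, ..., 0
    V_new, d_t = {}, {}
    for s in states:
        q_values = {a: sum(prob * (r(t, s, a, s2) + V[s2])
                           for s2, prob in p(t, s, a).items())
                    for a in actions[s]}
        d_t[s] = max(q_values, key=q_values.get)   # optimal action at (t, s)
        V_new[s] = q_values[d_t[s]]
    V, policy[t] = V_new, d_t

print(V)       # V_0(s): optimal reward-to-go from each initial state
print(policy)  # optimal deterministic Markov policy
```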
Fortunately, for many applications the DP optimal policy can be computed with a modest computational
effort. In this paper we restrict attention to this class of DPs. Typically, the transition probability of the
underlying Markov process is estimated from historical data and is, therefore, subject to statistical errors. In
current practice, these errors are ignored and the optimal policy is computed assuming that the estimate is,
indeed, the true transition probability. The DP optimal policy is quite sensitive to perturbations in the tran-
sition probability and ignoring the estimation errors can lead to serious degradation in performance (Nilim
and El Ghaoui, 2002; Tsitsiklis et al., 2002). Degradation in performance due to estimation errors in param-
eters has also been observed in other contexts (Ben-Tal and Nemirovski, 1997; Goldfarb and Iyengar, 2003).
Therefore, there is a need to develop DP models that explicitly account for the effect of errors.
In order to mitigate the effect of estimation errors we assume that the transition probability corresponding
to a state-action pair is not exactly known. The ambiguity in the transition probability is modeled by
associating a set P(s, a) of conditional measures with each state-action pair (s, a). (We adopt the convention
of the decision analysis literature wherein uncertainty refers to random quantities with known probability
measures and ambiguity refers to unknown probability measures (see, e.g. Epstein and Schneider, 2001)).
Consequently, in our formulation each policy has a set of measures associated with it. The value of a
policy is the minimum expected reward over the set of associated measures, and the goal of the decision
maker is to choose a policy with maximum value, i.e. we adopt a maximin approach. We will refer to this
formulation as robust DP. We prove that, when the set of measures associated with a policy satisfy a certain
“Rectangularity” property (Epstein and Schneider, 2001), the following results extend to natural robust
counterparts: the Bellman recursion, the optimality of deterministic policies, the contraction property of
the value iteration operator, and the policy iteration algorithm. “Rectangularity” is a sort of independence
assumption and is a minimal requirement for these results to hold. However, this assumption is not always
appropriate, and is particularly troublesome in the infinite horizon setting (see Appendix A for details).
We show that if the decision maker is restricted to stationary policies the effects of the “Rectangularity”
assumption are not serious.
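As a deliberately simplified illustration of this maximin idea (assuming, only for the sketch, that each set P(s, a) is a finite list of candidate measures), a single robust backup at one state looks as follows in Python; all names and numbers are hypothetical.

```python
# One-stage robust Bellman backup at a single state s (toy illustration):
#   V(s) = max_a  min_{p in P(s,a)}  sum_{s'} p(s') [ r(s, a, s') + W(s') ],
# where W is the next-stage reward-to-go and P(s,a) is a *set* of measures.

W = {0: 0.0, 1: 2.0}                     # hypothetical next-stage reward-to-go
r = lambda s, a, s2: 1.0 if s2 == 1 else 0.0

# A finite set of candidate conditional measures for each action (invented numbers).
P = {
    0: [{0: 0.7, 1: 0.3}, {0: 0.8, 1: 0.2}],   # P(s, a=0)
    1: [{0: 0.4, 1: 0.6}, {0: 0.6, 1: 0.4}],   # P(s, a=1)
}

s = 0
robust_q = {
    a: min(sum(prob * (r(s, a, s2) + W[s2]) for s2, prob in m.items())
           for m in measures)
    for a, measures in P.items()
}
best_action = max(robust_q, key=robust_q.get)
print(robust_q, best_action)   # worst-case Q-values and the robust action at s
```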
There is some previous work on modeling ambiguity in the transition probability and mitigating its effect
on the optimal policy. Satia and Lave (1973); White and Eldeib (1994); Bagnell et al. (2001) investigate
ambiguity in the context of infinite horizon DP with finite state and action spaces. They model ambiguity
by constraining the transition probability matrix to lie in a pre-specified polytope. They do not discuss how
one constructs this polytope. Moreover, the complexity of the resulting robust DP is at least an order of
magnitude higher than DP. Shapiro and Kleywegt (2002) investigate ambiguity in the context of stochastic
programming and propose a sampling based method for solving the maximin problem. However, they do not
discuss how to choose and calibrate the set of ambiguous priors. None of this work discusses the dynamic
structure of the ambiguity; in particular, there is no discussion of the central role of “Rectangularity”. Our
theoretical contributions are based on recent work on uncertain priors in the economics literature (Gilboa
and Schmeidler, 1989; Epstein and Schneider, 2001, 2002; Hansen and Sargent, 2001). The focus of this
body of work is on the axiomatic justification for uncertain priors in the context of multi-period utility
maximization. It does not provide any means of selecting the set of uncertain priors nor does it focus on
efficiently solving the resulting robust DP.
In this paper we identify families of sets of conditional measures that have the following desirable proper-
ties. These families of sets provide a means for setting any desired level of confidence in the robust optimal
policy. For a given confidence level, the corresponding set from each family is easily parameterizable from
data. The complexity of solving the robust DP corresponding to these families of sets is only modestly
larger than that of the non-robust counterpart. These families of sets are constructed from the confidence regions
associated with density estimation.
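To indicate why confidence-region sets remain computationally benign, consider, as one illustrative choice rather than necessarily the exact construction used later in the paper, a relative-entropy ball P = {p : D(p ‖ q̂) ≤ β} around an empirical estimate q̂. The inner problem min over p ∈ P of Σ_s p(s) v(s) has a one-dimensional concave dual, so a scalar search suffices; the Python sketch below is a rough, hypothetical implementation of that idea.

```python
import math

def worst_case_expectation_kl(v, q, beta, iters=200):
    """Rough sketch of  min_p { sum_s p[s]*v[s] : KL(p || q) <= beta, p a prob. vector }.

    Uses the one-dimensional concave dual
        g(lam) = -lam*beta - lam*log( sum_s q[s]*exp(-v[s]/lam) ),  lam >= 0,
    whose maximum equals the worst-case expectation (g(0) equals the minimum of v
    on the support of q). Hypothetical helper, not the paper's exact algorithm.
    """
    support = [(vi, qi) for vi, qi in zip(v, q) if qi > 0.0]
    v_min = min(vi for vi, _ in support)
    v_max = max(vi for vi, _ in support)

    def g(lam):
        if lam <= 0.0:
            return v_min
        # shift by v_min inside the log-sum-exp for numerical stability
        z = sum(qi * math.exp(-(vi - v_min) / lam) for vi, qi in support)
        return v_min - lam * math.log(z) - lam * beta

    lo, hi = 0.0, 10.0 * (v_max - v_min + 1.0) / max(beta, 1e-9)
    for _ in range(iters):               # ternary search: g is concave in lam
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return max(g(0.0), g(0.5 * (lo + hi)))

# Toy check: as beta grows, the worst-case expectation falls from the nominal
# value sum_s q_hat[s]*v[s] toward min(v).
v, q_hat = [0.0, 1.0, 3.0], [0.2, 0.5, 0.3]
for beta in (0.0, 0.05, 0.5, 5.0):
    print(beta, worst_case_expectation_kl(v, q_hat, beta))
```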
While this paper was being prepared for publication we became aware of a technical report by Nilim
and El Ghaoui (2002) where they formulate finite horizon robust DP in the context of an aircraft routing
problem. A “robust counterpart” for the Bellman equation appears in their paper but they do not justify
that this “robust counterpart”, indeed, characterizes the robust value function. Like all the previous work
on robust DP, Nilim and El Ghaoui also do not recognize the importance of Rectangularity. However, they
do introduce sets based on confidence regions and show that the finite horizon robust DP corresponding to
these sets can be solved efficiently.
The paper has two distinct and fairly independent parts. The first part, comprising Section 2 and
Section 3, presents the robust DP theory. In Section 2 we formulate finite horizon robust DP and the
“Rectangularity” property that leads to the robust counterpart of the Bellman recursion; and Section 3
formulates the robust extension of discounted infinite horizon DP. The focus of the second part, comprising
Section 4 and Section 5, is on computation. In Section 4 we describe three families of sets of conditional
measures that are based on the confidence regions, and show that the computational effort required to solve
the robust DP corresponding to these sets is only modestly higher than that required to solve the non-
robust counterpart. The results in this section, although independently obtained, are not new and were first
obtained by Nilim and El Ghaoui (2002). In Section 5 we provide basic examples and computational results.
Section 6 includes some concluding remarks.
2 Finite horizon robust dynamic programming
Decisions are made at discrete points in time t ∈ T = {0, 1, . . .} referred to as decision epochs. In this
section we assume that T is finite, i.e. T = {0, . . . , N − 1} for some N ≥ 1. At each epoch t ∈ T the system
occupies a state s ∈ S_t, where S_t is assumed to be discrete (finite or countably infinite). In a state s ∈ S_t the
decision maker is allowed to choose an action a ∈ A_t(s), where A_t(s) is assumed to be discrete. Although
many results in this paper extend to non-discrete state and action sets, we avoid this generality because the
associated measurability issues would detract from the ideas that we want to present in this work.

For any discrete set B, we will denote the set of probability measures on B by M(B). Decision makers
can choose actions either randomly or deterministically. A random action in a state s ∈ S_t corresponds
to an element q_s ∈ M(A(s)) with the interpretation that an action a ∈ A(s) is selected with probability
q_s(a). Degenerate probability measures that assign all the probability mass to a single action correspond to
deterministic actions.

Associated with each epoch t ∈ T and state-action pair (s, a), a ∈ A(s), s ∈ S_t, is a set of conditional
measures P_t(s, a) ⊆ M(S_{t+1}) with the interpretation that if at epoch t action a is chosen in state s, the
state s_{t+1} at the next epoch t + 1 is determined by some conditional measure p_{sa} ∈ P_t(s, a). Thus, the state
transition is ambiguous. (We adopt the convention of the decision analysis literature wherein uncertainty
refers to random quantities with known probability measures and ambiguity refers to unknown probability
measures (see, e.g. Epstein and Schneider, 2001)).
The decision maker receives a reward r_t(s_t, a_t, s_{t+1}) when the action a_t ∈ A(s_t) is chosen in state s_t ∈ S_t
at the decision epoch t, and the state at the next epoch is s_{t+1} ∈ S_{t+1}. Since s_{t+1} is ambiguous, we allow the
reward at time t to depend on s_{t+1} as well. Note that one can assume, without loss of generality, that the
reward r_t(·, ·, ·) is certain. The reward r_N(s) at the epoch N is only a function of the state s ∈ S_N.
We will refer to the collection of objects {T, {S_t, A_t, P_t, r_t(·, ·, ·) : t ∈ T}} as a finite horizon ambiguous
Markov decision process (AMDP). The notation above is a modification of that in Puterman (1994) and the
structure of ambiguity is motivated by Epstein and Schneider (2001).
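A plain way to hold a finite AMDP instance in code is to store, for every epoch and state-action pair, a list of candidate conditional measures P_t(s, a); the container below is a hypothetical sketch with invented data, assuming finite sets of measures.

```python
# A toy container for a finite-horizon ambiguous MDP (AMDP); names are illustrative.
# P[(t, s, a)] is a *list* of candidate conditional measures over S_{t+1},
# each measure stored as a dict s_next -> probability.

N = 2
states = [0, 1]                        # S_t, taken time-invariant here for brevity
actions = {0: [0, 1], 1: [0]}          # A_t(s)

P = {
    (t, s, a): [{0: 0.7, 1: 0.3}, {0: 0.6, 1: 0.4}]   # invented candidate measures
    for t in range(N) for s in states for a in actions[s]
}

def r(t, s, a, s_next):                # r_t(s, a, s'), allowed to depend on s_next
    return float(s_next)

r_terminal = {s: 0.0 for s in states}  # r_N(s)

amdp = {"T": range(N), "S": states, "A": actions, "P": P,
        "r": r, "r_N": r_terminal}
print(len(amdp["P"]), "ambiguous state-action pairs")
```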
A decision rule d_t is a procedure for selecting actions in each state at a specified decision epoch t ∈ T. We
will call a decision rule history dependent if it depends on the entire past history of the system as represented
by the sequence of past states and actions, i.e. d_t is a function of the history h_t = (s_0, a_0, . . . , s_{t−1}, a_{t−1}, s_t).
Let H_t denote the set of all histories h_t. Then a randomized decision rule d_t is a map d_t : H_t → M(A(s_t)).
A decision rule d_t is called deterministic if it puts all the probability mass on a single action a ∈ A(s_t), and
Markovian if it is a function of the current state s_t alone.
The set of all conditional measures consistent with a deterministic Markov decision rule d_t is given by

\[
T^{d_t} = \Big\{ p : S_t \to M(S_{t+1}) \;:\; \forall s \in S_t,\ p_s \in P_t(s, d_t(s)) \Big\}, \tag{1}
\]

i.e. for every state s ∈ S_t, the next state can be determined by any p ∈ P_t(s, d_t(s)). The set of all conditional
measures consistent with a history dependent decision rule d_t is given by

\[
T^{d_t} = \left\{ p : H_t \to M(A(s_t) \times S_{t+1}) \;:\;
\begin{array}{l}
\forall h \in H_t,\ p_h(a, s) = q_{d_t(h)}(a)\, p_{s_t a}(s), \\
p_{s_t a} \in P_t(s_t, a),\ a \in A(s_t),\ s \in S_{t+1}
\end{array}
\right\} \tag{2}
\]
A policy prescribes the decision rule to be used at all decision epochs. Thus, a policy π is a sequence of
decision rules, i.e. π = (d_t : t ∈ T). Given the ambiguity in the conditional measures, a policy π induces a
collection of measures on the history space H_N. We assume that the set T^π of measures consistent with a
policy π has the following structure.

Assumption 1 (Rectangularity) The set T^π of measures consistent with a policy π is given by

\[
\begin{aligned}
T^{\pi} &= \Big\{ P \;:\; \forall h_N \in H_N,\ P(h_N) = \prod_{t \in T} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in T^{d_t},\ t \in T \Big\} \\
        &= T^{d_0} \times T^{d_1} \times \cdots \times T^{d_{N-1}},
\end{aligned} \tag{3}
\]

where the notation in (3) simply denotes that each p ∈ T^π is a product of p_t ∈ T^{d_t}, and vice versa.

The Rectangularity assumption is motivated by the structure of the recursive multiple priors in Epstein and
Schneider (2001). We will defer discussing the implications of this assumption until after we define the
objective of the decision maker.
The reward V_0^π(s) generated by a policy π starting from the initial state s_0 = s is defined as follows.

\[
V_0^{\pi}(s) = \inf_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big], \tag{4}
\]

where E^P denotes the expectation with respect to the fixed measure P ∈ T^π. Equation (4) defines the
reward of a policy π to be the minimum expected reward over all measures consistent with the policy π.
Thus, we take a worst-case approach in defining the reward. In the optimization literature this approach is
known as the robust approach (Ben-Tal and Nemirovski, 1998). Let Π denote the set of all history dependent
policies. Then the goal of robust DP is to characterize the robust value function

\[
V_0(s) = \sup_{\pi \in \Pi} \Big\{ V_0^{\pi}(s) \Big\}
       = \sup_{\pi \in \Pi} \bigg\{ \inf_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big] \bigg\}, \tag{5}
\]

and an optimal policy π* if the supremum is achieved.
In order to appreciate the implications of the Rectangularity assumption the objective (5) has to be in-
terpreted in an adversarial setting: the decision maker chooses π; an adversary observes π, and chooses a
measure P ∈ T^π that minimizes the reward. In this context, Rectangularity is a form of an independence
assumption: the choice of a particular distribution p̄ ∈ P_t(s_t, a_t) in a state-action pair (s_t, a_t) at time t does
not limit the choices of the adversary in the future. This, in turn, leads to a separability property that is
crucial for establishing the robust counterpart of the Bellman recursion (see Theorem 1). Such a model for
an adversary is not always appropriate. See Appendix A for an example of such a situation. We will return
to this issue in the context of infinite horizon models in Section 3.
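The separability afforded by Rectangularity can be checked numerically on a tiny instance: for a fixed policy, the worst case over the whole product set T^π coincides with the value produced by minimizing stage by stage in a backward recursion. The sketch below performs this brute-force comparison for a two-epoch problem with a single (fixed) action; all data are invented.

```python
# Numerical check of the separability that Rectangularity buys (toy, fixed policy).
# The adversary picks one measure per (t, s) from P[(t, s)]; Rectangularity says
# these picks are unrestricted across (t, s), so the worst case over the product
# set equals the value of the stagewise (backward) minimization.
from itertools import product

states, N = [0, 1], 2
P = {                                   # candidate conditional measures (invented)
    (t, s): [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}]
    for t in range(N) for s in states
}
r = lambda t, s, s2: float(s2)          # stage reward of the fixed policy
r_N = {0: 0.0, 1: 1.0}
s0 = 0

def expected_total(choice):             # choice maps (t, s) -> a chosen measure
    def go(t, s):
        if t == N:
            return r_N[s]
        return sum(p * (r(t, s, s2) + go(t + 1, s2))
                   for s2, p in choice[(t, s)].items())
    return go(0, s0)

# (a) brute force over the rectangular product set
keys = list(P)
worst_product = min(expected_total(dict(zip(keys, pick)))
                    for pick in product(*(P[k] for k in keys)))

# (b) stagewise backward recursion
V = dict(r_N)
for t in reversed(range(N)):
    V = {s: min(sum(p * (r(t, s, s2) + V[s2]) for s2, p in m.items())
                for m in P[(t, s)])
         for s in states}

print(worst_product, V[s0])             # the two values coincide
```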
The optimistic value V̄_0^π(s_0) of a policy π starting from the initial state s_0 = s is defined as

\[
\bar{V}_0^{\pi}(s) = \sup_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big]. \tag{6}
\]

Let V_0^π(s_0; P) denote the non-robust value of a policy π corresponding to a particular choice P ∈ T^π. Then
V̄_0^π(s_0) ≥ V_0^π(s_0; P) ≥ V_0^π(s_0). Analogous to the robust value function V_0(s), the optimistic value function
V̄_0(s) is defined as

\[
\bar{V}_0(s) = \sup_{\pi \in \Pi} \Big\{ \bar{V}_0^{\pi}(s) \Big\}
            = \sup_{\pi \in \Pi} \bigg\{ \sup_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big] \bigg\}. \tag{7}
\]
Remark 1 Since our interest is in computing the robust optimal policy π*, we will restrict attention to the
robust value function V_0. However, all the results in this paper imply a corresponding result for the optimistic
value function V̄_0 with the infimum over P ∈ T^π replaced by the supremum over P ∈ T^π.
Let V_n^π(h_n) denote the reward obtained by using policy π over epochs n, n + 1, . . . , N − 1, starting from
the history h_n, i.e.

\[
V_n^{\pi}(h_n) = \inf_{P \in T_n^{\pi}} E^{P}\Big[ \sum_{t=n}^{N-1} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big], \tag{8}
\]

where Rectangularity implies that the set of conditional measures T_n^π consistent with the policy π and the
history h_n is given by

\[
\begin{aligned}
T_n^{\pi} &= \Big\{ P_n : H_n \to M\Big(\prod_{t=n}^{N-1} (A_t \times S_{t+1})\Big) \;:\;
\forall h_n \in H_n,\ P_{h_n}(a_n, s_{n+1}, \ldots, a_{N-1}, s_N) = \prod_{t=n}^{N-1} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in T^{d_t},\ t = n, \ldots, N-1 \Big\} \\
          &= T^{d_n} \times T^{d_{n+1}} \times \cdots \times T^{d_{N-1}} \\
          &= T^{d_n} \times T_{n+1}^{\pi}.
\end{aligned} \tag{9}
\]
Let V_n(h_n) denote the optimal reward starting from the history h_n at the epoch n, i.e.

\[
V_n(h_n) = \sup_{\pi \in \Pi_n} \Big\{ V_n^{\pi}(h_n) \Big\}
         = \sup_{\pi \in \Pi_n} \bigg\{ \inf_{P \in T_n^{\pi}} E^{P}\Big[ \sum_{t=n}^{N-1} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big] \bigg\}, \tag{10}
\]

where Π_n is the set of all history dependent randomized policies for epochs t ≥ n.
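Anticipating the robust counterpart of the Bellman recursion that Theorem 1 (referenced above) establishes, a finite-state, finite-action instance with finite sets P_t(s, a) can be solved by the backward induction sketched below. The data and helper names are hypothetical; for the confidence-region sets of Section 4 the inner enumeration would be replaced by the corresponding optimization over P_t(s, a).

```python
# Sketch of robust backward induction for a toy finite-horizon AMDP:
#   V_N(s) = r_N(s),
#   V_t(s) = max_{a in A(s)}  min_{p in P_t(s,a)}  sum_{s'} p(s') [ r_t(s,a,s') + V_{t+1}(s') ].

N = 3
states = [0, 1]
actions = {s: [0, 1] for s in states}

def P(t, s, a):
    """Finite set of candidate conditional measures P_t(s, a) (invented numbers)."""
    base = {0: 0.7, 1: 0.3} if a == 0 else {0: 0.4, 1: 0.6}
    shifted = {0: min(base[0] + 0.1, 1.0), 1: max(base[1] - 0.1, 0.0)}
    return [base, shifted]

r = lambda t, s, a, s2: 1.0 if s2 == 1 else 0.0   # r_t(s, a, s')
r_terminal = {0: 0.0, 1: 2.0}                     # r_N(s)

V = dict(r_terminal)                              # V_N
robust_policy = {}
for t in reversed(range(N)):
    V_new, d_t = {}, {}
    for s in states:
        worst_q = {a: min(sum(p * (r(t, s, a, s2) + V[s2]) for s2, p in m.items())
                          for m in P(t, s, a))
                   for a in actions[s]}
        d_t[s] = max(worst_q, key=worst_q.get)    # robust action at (t, s)
        V_new[s] = worst_q[d_t[s]]
    V, robust_policy[t] = V_new, d_t

print(V)              # robust value V_0(s) for each initial state
print(robust_policy)  # deterministic Markov robust policy (d_0, ..., d_{N-1})
```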