
CORC Tech Report TR-2002-07
Robust dynamic programming
G. Iyengar
Submitted Dec. 3rd, 2002. Revised May 4, 2004.
Abstract
In this paper we propose a robust formulation for discrete time dynamic programming (DP). The
objective of the robust formulation is to systematically mitigate the sensitivity of the DP optimal policy
to ambiguity in the underlying transition probabilities. The ambiguity is modeled by associating a
set of conditional measures with each state-action pair. Consequently, in the robust formulation each
policy has a set of measures associated with it. We prove that when this set of measures has a certain
“Rectangularity” property, all the main results for finite and infinite horizon DP extend to natural robust
counterparts. We identify families of sets of conditional measures for which the computational complexity
of solving the robust DP is only modestly larger than that of solving the DP, typically logarithmic in the size
of the state space. These families of sets are constructed from the confidence regions associated with
density estimation, and therefore, can be chosen to guarantee any desired level of confidence in the robust
optimal policy. Moreover, the sets can be easily parameterized from historical data. We contrast the
performance of robust and non-robust DP on small numerical examples.
1 Introduction
This paper is concerned with sequential decision making in uncertain environments. Decisions are made
in stages and each decision, in addition to providing an immediate reward, changes the context of future
decisions; thereby affecting the future rewards. Due to the uncertain nature of the environment, there is
limited information about both the immediate reward from each decision and the resulting future state. In
order to achieve a good performance over all the stages the decision maker has to trade-off the immediate
payoff with future payoffs. Dynamic programming (DP) is the mathematical framework that allows the
decision maker to efficiently compute a good overall strategy by succinctly encoding the evolving information
state. In the DP formalism the uncertainty in the environment is modeled by a Markov process whose
transition probability depends both on the information state and the action taken by the decision maker. It
is assumed that the transition probability corresponding to each state-action pair is known to the decision
maker, and the goal is to choose a policy, i.e. a rule that maps states to actions, that maximizes some
performance measure. Puterman (1994) provides an excellent introduction to the DP formalism and its
various applications. In this paper, we assume that the reader has some prior knowledge of DP.
Submitted to Math. Oper. Res. Do not distribute.
IEOR Department, Columbia University, Email: garud@ieor.columbia.edu. Research partially supported by NSF grants
CCR-00-09972 and DMS-01-04282.

The DP formalism encodes information in the form of a “reward-to-go” function (see Puterman, 1994, for
details) and chooses an action that maximizes the sum of the immediate reward and the expected “reward-
to-go”. Thus, to compute the optimal action in any given state the “reward-to-go” function for all the future
states must be known. In many applications of DP, the number of states and actions available in each state
are large; consequently, the computational effort required to compute the optimal policy for a DP can be
overwhelming: this is Bellman’s “curse of dimensionality”. For this reason, considerable recent research effort has
focused on developing algorithms that compute an approximately optimal policy efficiently (Bertsekas and
Tsitsiklis, 1996; de Farias and Van Roy, 2002).
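To make the recursion concrete, the following minimal Python sketch performs the standard (non-robust) backward induction that computes the “reward-to-go” function on a toy finite-horizon MDP with known transition probabilities; the states, rewards, and probabilities are invented purely for illustration and are not taken from the paper.

```python
# Minimal non-robust finite-horizon DP (backward induction) on toy data:
#   V_N(s) = r_N(s),
#   V_t(s) = max_a sum_{s'} p_t(s'|s,a) [ r_t(s,a,s') + V_{t+1}(s') ].

N = 2                                   # decision epochs t = 0, ..., N-1
states = [0, 1]
actions = {s: [0, 1] for s in states}   # A_t(s), taken time-invariant here

def p(t, s, a):
    """Known transition probability p_t(. | s, a) as a dict s' -> prob (toy numbers)."""
    return {0: 0.7, 1: 0.3} if a == 0 else {0: 0.4, 1: 0.6}

def r(t, s, a, s_next):
    """Immediate reward r_t(s, a, s')."""
    return 1.0 if s_next == 1 else 0.0

r_terminal = {0: 0.0, 1: 2.0}           # terminal reward r_N(s)

V = {s: r_terminal[s] for s in states}  # V_N
policy = {}
for t in reversed(range(N)):            # t = N-1, ..., 0
    V_new, d_t = {}, {}
    for s in states:
        q_values = {a: sum(prob * (r(t, s, a, s2) + V[s2])
                           for s2, prob in p(t, s, a).items())
                    for a in actions[s]}
        d_t[s] = max(q_values, key=q_values.get)   # optimal action at (t, s)
        V_new[s] = q_values[d_t[s]]
    V, policy[t] = V_new, d_t

print(V)       # V_0(s): optimal reward-to-go from each initial state
print(policy)  # optimal deterministic Markov policy
```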
Fortunately, for many applications the DP optimal policy can be computed with a modest computational
effort. In this paper we restrict attention to this class of DPs. Typically, the transition probability of the
underlying Markov process is estimated from historical data and is, therefore, subject to statistical errors. In
current practice, these errors are ignored and the optimal policy is computed assuming that the estimate is,
indeed, the true transition probability. The DP optimal policy is quite sensitive to perturbations in the tran-
sition probability and ignoring the estimation errors can lead to serious degradation in performance (Nilim
and El Ghaoui, 2002; Tsitsiklis et al., 2002). Degradation in performance due to estimation errors in param-
eters has also been observed in other contexts (Ben-Tal and Nemirovski, 1997; Goldfarb and Iyengar, 2003).
Therefore, there is a need to develop DP models that explicitly account for the effect of errors.
In order to mitigate the effect of estimation errors we assume that the transition probability corresponding
to a state-action pair is not exactly known. The ambiguity in the transition probability is modeled by
associating a set P(s, a) of conditional measures with each state-action pair (s, a). (We adopt the convention
of the decision analysis literature wherein uncertainty refers to random quantities with known probability
measures and ambiguity refers to unknown probability measures (see, e.g. Epstein and Schneider, 2001)).
Consequently, in our formulation each policy has a set of measures associated with it. The value of a
policy is the minimum expected reward over the set of associated measures, and the goal of the decision
maker is to choose a policy with maximum value, i.e. we adopt a maximin approach. We will refer to this
formulation as robust DP. We prove that, when the set of measures associated with a policy satisfy a certain
“Rectangularity” property (Epstein and Schneider, 2001), the following results extend to natural robust
counterparts: the Bellman recursion, the optimality of deterministic policies, the contraction property of
the value iteration operator, and the policy iteration algorithm. “Rectangularity” is a sort of independence
assumption and is a minimal requirement for these results to hold. However, this assumption is not always
appropriate, and is particularly troublesome in the infinite horizon setting (see Appendix A for details).
We show that if the decision maker is restricted to stationary policies the effects of the “Rectangularity”
assumption are not serious.
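As a deliberately simplified illustration of this maximin idea (assuming, only for the sketch, that each set P(s, a) is a finite list of candidate measures), a single robust backup at one state looks as follows in Python; all names and numbers are hypothetical.

```python
# One-stage robust Bellman backup at a single state s (toy illustration):
#   V(s) = max_a  min_{p in P(s,a)}  sum_{s'} p(s') [ r(s, a, s') + W(s') ],
# where W is the next-stage reward-to-go and P(s,a) is a *set* of measures.

W = {0: 0.0, 1: 2.0}                     # hypothetical next-stage reward-to-go
r = lambda s, a, s2: 1.0 if s2 == 1 else 0.0

# A finite set of candidate conditional measures for each action (invented numbers).
P = {
    0: [{0: 0.7, 1: 0.3}, {0: 0.8, 1: 0.2}],   # P(s, a=0)
    1: [{0: 0.4, 1: 0.6}, {0: 0.6, 1: 0.4}],   # P(s, a=1)
}

s = 0
robust_q = {
    a: min(sum(prob * (r(s, a, s2) + W[s2]) for s2, prob in m.items())
           for m in measures)
    for a, measures in P.items()
}
best_action = max(robust_q, key=robust_q.get)
print(robust_q, best_action)   # worst-case Q-values and the robust action at s
```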
There is some previous work on modeling ambiguity in the transition probability and mitigating its effect
on the optimal policy. Satia and Lave (1973); White and Eldeib (1994); Bagnell et al. (2001) investigate
ambiguity in the context of infinite horizon DP with finite state and action spaces. They model ambiguity
by constraining the transition probability matrix to lie in a pre-specified polytope. They do not discuss how
one constructs this polytope. Moreover, the complexity of the resulting robust DP is at least an order of
magnitude higher than DP. Shapiro and Kleywegt (2002) investigate ambiguity in the context of stochastic
programming and propose a sampling based method for solving the maximin problem. However, they do not
discuss how to choose and calibrate the set of ambiguous priors. None of this work discusses the dynamic
structure of the ambiguity; in particular, there is no discussion of the central role of “Rectangularity”. Our
theoretical contributions are based on recent work on uncertain priors in the economics literature (Gilboa
and Schmeidler, 1989; Epstein and Schneider, 2001, 2002; Hansen and Sargent, 2001). The focus of this
body of work is on the axiomatic justification for uncertain priors in the context of multi-period utility
maximization. It does not provide any means of selecting the set of uncertain priors nor does it focus on
efficiently solving the resulting robust DP.
In this paper we identify families of sets of conditional measures that have the following desirable proper-
ties. These families of sets provide a means for setting any desired level of confidence in the robust optimal
policy. For a given confidence level, the corresponding set from each family is easily parameterizable from
data. The complexity of solving the robust DP corresponding to these families of sets is only modestly
larger than that of the non-robust counterpart. These families of sets are constructed from the confidence regions
associated with density estimation.
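To indicate why confidence-region sets remain computationally benign, consider, as one illustrative choice rather than necessarily the exact construction used later in the paper, a relative-entropy ball P = {p : D(p ‖ q̂) ≤ β} around an empirical estimate q̂. The inner problem min over p ∈ P of Σ_s p(s) v(s) has a one-dimensional concave dual, so a scalar search suffices; the Python sketch below is a rough, hypothetical implementation of that idea.

```python
import math

def worst_case_expectation_kl(v, q, beta, iters=200):
    """Rough sketch of  min_p { sum_s p[s]*v[s] : KL(p || q) <= beta, p a prob. vector }.

    Uses the one-dimensional concave dual
        g(lam) = -lam*beta - lam*log( sum_s q[s]*exp(-v[s]/lam) ),  lam >= 0,
    whose maximum equals the worst-case expectation (g(0) equals the minimum of v
    on the support of q). Hypothetical helper, not the paper's exact algorithm.
    """
    support = [(vi, qi) for vi, qi in zip(v, q) if qi > 0.0]
    v_min = min(vi for vi, _ in support)
    v_max = max(vi for vi, _ in support)

    def g(lam):
        if lam <= 0.0:
            return v_min
        # shift by v_min inside the log-sum-exp for numerical stability
        z = sum(qi * math.exp(-(vi - v_min) / lam) for vi, qi in support)
        return v_min - lam * math.log(z) - lam * beta

    lo, hi = 0.0, 10.0 * (v_max - v_min + 1.0) / max(beta, 1e-9)
    for _ in range(iters):               # ternary search: g is concave in lam
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return max(g(0.0), g(0.5 * (lo + hi)))

# Toy check: as beta grows, the worst-case expectation falls from the nominal
# value sum_s q_hat[s]*v[s] toward min(v).
v, q_hat = [0.0, 1.0, 3.0], [0.2, 0.5, 0.3]
for beta in (0.0, 0.05, 0.5, 5.0):
    print(beta, worst_case_expectation_kl(v, q_hat, beta))
```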
While this paper was being prepared for publication we became aware of a technical report by Nilim
and El Ghaoui (2002) where they formulate finite horizon robust DP in the context of an aircraft routing
problem. A “robust counterpart” for the Bellman equation appears in their paper but they do not justify
that this “robust counterpart”, indeed, characterizes the robust value function. Like all the previous work
on robust DP, Nilim and El Ghaoui also do not recognize the importance of Rectangularity. However, they
do introduce sets based on confidence regions and show that the finite horizon robust DP corresponding to
these sets can be solved efficiently.
The paper has two distinct and fairly independent parts. The first part, comprising Section 2 and
Section 3, presents the robust DP theory. In Section 2 we formulate finite horizon robust DP and the
“Rectangularity” property that leads to the robust counterpart of the Bellman recursion; and Section 3
formulates the robust extension of discounted infinite horizon DP. The focus of the second part, comprising
Section 4 and Section 5, is on computation. In Section 4 we describe three families of sets of conditional
measures that are based on the confidence regions, and show that the computational effort required to solve
the robust DP corresponding to these sets is only modestly higher than that required to solve the non-
robust counterpart. The results in this section, although independently obtained, are not new and were first
obtained by Nilim and El Ghaoui (2002). In Section 5 we provide basic examples and computational results.
Section 6 includes some concluding remarks.
2 Finite horizon robust dynamic programming
Decisions are made at discrete points in time t ∈ T = {0, 1, . . .} referred to as decision epochs. In this
section we assume that T is finite, i.e. T = {0, . . . , N − 1} for some N ≥ 1. At each epoch t ∈ T the system
occupies a state s ∈ S_t, where S_t is assumed to be discrete (finite or countably infinite). In a state s ∈ S_t the
decision maker is allowed to choose an action a ∈ A_t(s), where A_t(s) is assumed to be discrete. Although
many results in this paper extend to non-discrete state and action sets, we avoid this generality because the
associated measurability issues would detract from the ideas that we want to present in this work.

For any discrete set B, we will denote the set of probability measures on B by M(B). Decision makers
can choose actions either randomly or deterministically. A random action in a state s ∈ S_t corresponds
to an element q_s ∈ M(A(s)) with the interpretation that an action a ∈ A(s) is selected with probability
q_s(a). Degenerate probability measures that assign all the probability mass to a single action correspond to
deterministic actions.

Associated with each epoch t ∈ T and state-action pair (s, a), a ∈ A(s), s ∈ S_t, is a set of conditional
measures P_t(s, a) ⊆ M(S_{t+1}) with the interpretation that if at epoch t action a is chosen in state s, the
state s_{t+1} at the next epoch t + 1 is determined by some conditional measure p_{sa} ∈ P_t(s, a). Thus, the state
transition is ambiguous. (We adopt the convention of the decision analysis literature wherein uncertainty
refers to random quantities with known probability measures and ambiguity refers to unknown probability
measures (see, e.g. Epstein and Schneider, 2001)).
The decision maker receives a reward r_t(s_t, a_t, s_{t+1}) when the action a_t ∈ A(s_t) is chosen in state s_t ∈ S_t
at the decision epoch t, and the state at the next epoch is s_{t+1} ∈ S_{t+1}. Since s_{t+1} is ambiguous, we allow the
reward at time t to depend on s_{t+1} as well. Note that one can assume, without loss of generality, that the
reward r_t(·, ·, ·) is certain. The reward r_N(s) at the epoch N is only a function of the state s ∈ S_N.
We will refer to the collection of objects {T, {S_t, A_t, P_t, r_t(·, ·, ·) : t ∈ T}} as a finite horizon ambiguous
Markov decision process (AMDP). The notation above is a modification of that in Puterman (1994) and the
structure of ambiguity is motivated by Epstein and Schneider (2001).
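A plain way to hold a finite AMDP instance in code is to store, for every epoch and state-action pair, a list of candidate conditional measures P_t(s, a); the container below is a hypothetical sketch with invented data, assuming finite sets of measures.

```python
# A toy container for a finite-horizon ambiguous MDP (AMDP); names are illustrative.
# P[(t, s, a)] is a *list* of candidate conditional measures over S_{t+1},
# each measure stored as a dict s_next -> probability.

N = 2
states = [0, 1]                        # S_t, taken time-invariant here for brevity
actions = {0: [0, 1], 1: [0]}          # A_t(s)

P = {
    (t, s, a): [{0: 0.7, 1: 0.3}, {0: 0.6, 1: 0.4}]   # invented candidate measures
    for t in range(N) for s in states for a in actions[s]
}

def r(t, s, a, s_next):                # r_t(s, a, s'), allowed to depend on s_next
    return float(s_next)

r_terminal = {s: 0.0 for s in states}  # r_N(s)

amdp = {"T": range(N), "S": states, "A": actions, "P": P,
        "r": r, "r_N": r_terminal}
print(len(amdp["P"]), "ambiguous state-action pairs")
```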
A decision rule d_t is a procedure for selecting actions in each state at a specified decision epoch t ∈ T. We
will call a decision rule history dependent if it depends on the entire past history of the system as represented
by the sequence of past states and actions, i.e. d_t is a function of the history h_t = (s_0, a_0, . . . , s_{t−1}, a_{t−1}, s_t).
Let H_t denote the set of all histories h_t. Then a randomized decision rule d_t is a map d_t : H_t → M(A(s_t)).
A decision rule d_t is called deterministic if it puts all the probability mass on a single action a ∈ A(s_t), and
Markovian if it is a function of the current state s_t alone.
The set of all conditional measures consistent with a deterministic Markov decision rule d_t is given by

\[
T^{d_t} = \Big\{ p : S_t \to M(S_{t+1}) \;:\; \forall s \in S_t,\ p_s \in P_t(s, d_t(s)) \Big\}, \tag{1}
\]

i.e. for every state s ∈ S_t, the next state can be determined by any p ∈ P_t(s, d_t(s)). The set of all conditional
measures consistent with a history dependent decision rule d_t is given by

\[
T^{d_t} = \left\{ p : H_t \to M(A(s_t) \times S_{t+1}) \;:\;
\begin{array}{l}
\forall h \in H_t,\ p_h(a, s) = q_{d_t(h)}(a)\, p_{s_t a}(s), \\
p_{s_t a} \in P_t(s_t, a),\ a \in A(s_t),\ s \in S_{t+1}
\end{array}
\right\} \tag{2}
\]
A policy prescribes the decision rule to be used at all decision epochs. Thus, a policy π is a sequence of
decision rules, i.e. π = (d_t : t ∈ T). Given the ambiguity in the conditional measures, a policy π induces a
collection of measures on the history space H_N. We assume that the set T^π of measures consistent with a
policy π has the following structure.

Assumption 1 (Rectangularity) The set T^π of measures consistent with a policy π is given by

\[
\begin{aligned}
T^{\pi} &= \Big\{ P \;:\; \forall h_N \in H_N,\ P(h_N) = \prod_{t \in T} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in T^{d_t},\ t \in T \Big\} \\
        &= T^{d_0} \times T^{d_1} \times \cdots \times T^{d_{N-1}},
\end{aligned} \tag{3}
\]

where the notation in (3) simply denotes that each p ∈ T^π is a product of p_t ∈ T^{d_t}, and vice versa.

The Rectangularity assumption is motivated by the structure of the recursive multiple priors in Epstein and
Schneider (2001). We will defer discussing the implications of this assumption until after we define the
objective of the decision maker.
The reward V_0^π(s) generated by a policy π starting from the initial state s_0 = s is defined as follows.

\[
V_0^{\pi}(s) = \inf_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big], \tag{4}
\]

where E^P denotes the expectation with respect to the fixed measure P ∈ T^π. Equation (4) defines the
reward of a policy π to be the minimum expected reward over all measures consistent with the policy π.
Thus, we take a worst-case approach in defining the reward. In the optimization literature this approach is
known as the robust approach (Ben-Tal and Nemirovski, 1998). Let Π denote the set of all history dependent
policies. Then the goal of robust DP is to characterize the robust value function

\[
V_0(s) = \sup_{\pi \in \Pi} \Big\{ V_0^{\pi}(s) \Big\}
       = \sup_{\pi \in \Pi} \bigg\{ \inf_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big] \bigg\}, \tag{5}
\]

and an optimal policy π* if the supremum is achieved.
In order to appreciate the implications of the Rectangularity assumption the objective (5) has to be in-
terpreted in an adversarial setting: the decision maker chooses π; an adversary observes π, and chooses a
measure P ∈ T^π that minimizes the reward. In this context, Rectangularity is a form of an independence
assumption: the choice of a particular distribution p̄ ∈ P_t(s_t, a_t) in a state-action pair (s_t, a_t) at time t does
not limit the choices of the adversary in the future. This, in turn, leads to a separability property that is
crucial for establishing the robust counterpart of the Bellman recursion (see Theorem 1). Such a model for
an adversary is not always appropriate. See Appendix A for an example of such a situation. We will return
to this issue in the context of infinite horizon models in Section 3.
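The separability afforded by Rectangularity can be checked numerically on a tiny instance: for a fixed policy, the worst case over the whole product set T^π coincides with the value produced by minimizing stage by stage in a backward recursion. The sketch below performs this brute-force comparison for a two-epoch problem with a single (fixed) action; all data are invented.

```python
# Numerical check of the separability that Rectangularity buys (toy, fixed policy).
# The adversary picks one measure per (t, s) from P[(t, s)]; Rectangularity says
# these picks are unrestricted across (t, s), so the worst case over the product
# set equals the value of the stagewise (backward) minimization.
from itertools import product

states, N = [0, 1], 2
P = {                                   # candidate conditional measures (invented)
    (t, s): [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}]
    for t in range(N) for s in states
}
r = lambda t, s, s2: float(s2)          # stage reward of the fixed policy
r_N = {0: 0.0, 1: 1.0}
s0 = 0

def expected_total(choice):             # choice maps (t, s) -> a chosen measure
    def go(t, s):
        if t == N:
            return r_N[s]
        return sum(p * (r(t, s, s2) + go(t + 1, s2))
                   for s2, p in choice[(t, s)].items())
    return go(0, s0)

# (a) brute force over the rectangular product set
keys = list(P)
worst_product = min(expected_total(dict(zip(keys, pick)))
                    for pick in product(*(P[k] for k in keys)))

# (b) stagewise backward recursion
V = dict(r_N)
for t in reversed(range(N)):
    V = {s: min(sum(p * (r(t, s, s2) + V[s2]) for s2, p in m.items())
                for m in P[(t, s)])
         for s in states}

print(worst_product, V[s0])             # the two values coincide
```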
The optimistic value V̄_0^π(s_0) of a policy π starting from the initial state s_0 = s is defined as

\[
\bar{V}_0^{\pi}(s) = \sup_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big]. \tag{6}
\]

Let V_0^π(s_0; P) denote the non-robust value of a policy π corresponding to a particular choice P ∈ T^π. Then
V̄_0^π(s_0) ≥ V_0^π(s_0; P) ≥ V_0^π(s_0). Analogous to the robust value function V_0(s), the optimistic value function
V̄_0(s) is defined as

\[
\bar{V}_0(s) = \sup_{\pi \in \Pi} \Big\{ \bar{V}_0^{\pi}(s) \Big\}
            = \sup_{\pi \in \Pi} \bigg\{ \sup_{P \in T^{\pi}} E^{P}\Big[ \sum_{t \in T} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big] \bigg\}. \tag{7}
\]
Remark 1 Since our interest is in computing the robust optimal policy π*, we will restrict attention to the
robust value function V_0. However, all the results in this paper imply a corresponding result for the optimistic
value function V̄_0 with the infimum over P ∈ T^π replaced by the supremum over P ∈ T^π.
Let V_n^π(h_n) denote the reward obtained by using policy π over epochs n, n + 1, . . . , N − 1, starting from
the history h_n, i.e.

\[
V_n^{\pi}(h_n) = \inf_{P \in T_n^{\pi}} E^{P}\Big[ \sum_{t=n}^{N-1} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big], \tag{8}
\]

where Rectangularity implies that the set of conditional measures T_n^π consistent with the policy π and the
history h_n is given by

\[
\begin{aligned}
T_n^{\pi} &= \Big\{ P_n : H_n \to M\Big(\prod_{t=n}^{N-1} (A_t \times S_{t+1})\Big) \;:\;
\forall h_n \in H_n,\ P_{h_n}(a_n, s_{n+1}, \ldots, a_{N-1}, s_N) = \prod_{t=n}^{N-1} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in T^{d_t},\ t = n, \ldots, N-1 \Big\} \\
          &= T^{d_n} \times T^{d_{n+1}} \times \cdots \times T^{d_{N-1}} \\
          &= T^{d_n} \times T_{n+1}^{\pi}.
\end{aligned} \tag{9}
\]
Let V_n(h_n) denote the optimal reward starting from the history h_n at the epoch n, i.e.

\[
V_n(h_n) = \sup_{\pi \in \Pi_n} \Big\{ V_n^{\pi}(h_n) \Big\}
         = \sup_{\pi \in \Pi_n} \bigg\{ \inf_{P \in T_n^{\pi}} E^{P}\Big[ \sum_{t=n}^{N-1} r_t\big(s_t, d_t(h_t), s_{t+1}\big) + r_N(s_N) \Big] \bigg\}, \tag{10}
\]

where Π_n is the set of all history dependent randomized policies for epochs t ≥ n.
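Anticipating the robust counterpart of the Bellman recursion that Theorem 1 (referenced above) establishes, a finite-state, finite-action instance with finite sets P_t(s, a) can be solved by the backward induction sketched below. The data and helper names are hypothetical; for the confidence-region sets of Section 4 the inner enumeration would be replaced by the corresponding optimization over P_t(s, a).

```python
# Sketch of robust backward induction for a toy finite-horizon AMDP:
#   V_N(s) = r_N(s),
#   V_t(s) = max_{a in A(s)}  min_{p in P_t(s,a)}  sum_{s'} p(s') [ r_t(s,a,s') + V_{t+1}(s') ].

N = 3
states = [0, 1]
actions = {s: [0, 1] for s in states}

def P(t, s, a):
    """Finite set of candidate conditional measures P_t(s, a) (invented numbers)."""
    base = {0: 0.7, 1: 0.3} if a == 0 else {0: 0.4, 1: 0.6}
    shifted = {0: min(base[0] + 0.1, 1.0), 1: max(base[1] - 0.1, 0.0)}
    return [base, shifted]

r = lambda t, s, a, s2: 1.0 if s2 == 1 else 0.0   # r_t(s, a, s')
r_terminal = {0: 0.0, 1: 2.0}                     # r_N(s)

V = dict(r_terminal)                              # V_N
robust_policy = {}
for t in reversed(range(N)):
    V_new, d_t = {}, {}
    for s in states:
        worst_q = {a: min(sum(p * (r(t, s, a, s2) + V[s2]) for s2, p in m.items())
                          for m in P(t, s, a))
                   for a in actions[s]}
        d_t[s] = max(worst_q, key=worst_q.get)    # robust action at (t, s)
        V_new[s] = worst_q[d_t[s]]
    V, robust_policy[t] = V_new, d_t

print(V)              # robust value V_0(s) for each initial state
print(robust_policy)  # deterministic Markov robust policy (d_0, ..., d_{N-1})
```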