Abstract:
In this paper we propose a robust formulation for discrete time dynamic programming (DP). The objective of the robust formulation is to systematically mitigate the sensitivity of the DP optimal policy to ambiguity in the underlying transition probabilities. The ambiguity is modeled by associating a set of conditional measures with each state-action pair. Consequently, in the robust formulation each policy has a set of measures associated with it. We prove that when this set of measures has a certain "rectangularity" property, all of the main results for finite and infinite horizon DP extend to natural robust counterparts. We discuss techniques from Nilim and El Ghaoui [17] for constructing suitable sets of conditional measures that allow one to efficiently solve for the optimal robust policy. We also show that robust DP is equivalent to stochastic zero-sum games with perfect information.
TL;DR: This paper surveys the primary research, both theoretical and applied, in the area of robust optimization (RO), focusing on the computational attractiveness of RO approaches and the modeling power and broad applicability of the methodology, and highlights applications of RO across a wide spectrum of domains, including finance, statistics, learning, and various areas of engineering.
TL;DR: A list of five practical research problems related to accident risk is presented, categorized according to whether the problem originates from having the wrong objective function, an objective function that is too expensive to evaluate frequently, or undesirable behavior during the learning process.
TL;DR: This work categorizes and analyzes two approaches to Safe Reinforcement Learning: the modification of the optimality criterion (the classic discounted finite/infinite horizon) with a safety factor, and the incorporation of external knowledge or the guidance of a risk metric.
TL;DR: The author examines the role of entropy, inequality, and randomness in the design and construction of codes in a rapidly changing environment.
TL;DR: Puterman, as discussed by the authors, provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models, focusing primarily on infinite horizon discrete time models and models with discrete state spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous time discrete state models.
TL;DR: The notion of "degrees of belief" was introduced by Knight, as mentioned in this paper; he argued that people tend to behave "as though" they assigned numerical probabilities, or degrees of belief, to the events impinging on their actions.
TL;DR: Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.
TL;DR: In this article, the authors present the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of neural networks and dynamic programming to complex problems of planning, optimal decision making, and intelligent control.
In this paper the authors propose a robust formulation for discrete time dynamic programming (DP). The authors prove that when the set of measures associated with a policy has a certain "rectangularity" property, all of the main results for finite and infinite horizon DP extend to natural robust counterparts. The authors contrast the performance of robust and non-robust DP on small numerical examples.
Q2. What was the motivation for the robust methodology?
As mentioned in the introduction, the motivation for the robust methodology was to systematically correct for the statistical errors associated with estimating the transition probabilities using historical data.
Q3. What is the set of measures consistent with a policy?
The set $T^\pi$ of measures consistent with a policy $\pi$ is given by
$$
T^\pi = \Big\{\, P : \forall\, h_N \in H_N,\ P(h_N) = \prod_{t \in T} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in T^{d_t},\ t \in T \,\Big\} = T^{d_0} \times T^{d_1} \times \cdots \times T^{d_{N-1}}, \tag{3}
$$
where the notation in (3) simply denotes that each $p \in T^\pi$ is a product of $p_t \in T^{d_t}$, and vice versa.
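To make the product structure in (3) concrete, here is a minimal Python sketch. All identifiers (`history`, `chosen`, the dict layout) are hypothetical illustrations, not the paper's notation made executable: under rectangularity, a measure in $T^\pi$ is assembled by choosing one conditional measure per epoch independently, and the probability of a history factors accordingly.

```python
# Minimal sketch of the product structure in (3), under assumed finite spaces.
# A history is a list of (s_t, a_t, s_{t+1}) transitions, and chosen[t] is the
# conditional measure picked from T^{d_t} at epoch t, represented as a dict
# mapping (s, a) -> {s_next: probability}.

def history_probability(history, chosen):
    """P(h_N) = prod_t p_{h_t}(a_t, s_{t+1}) for one admissible choice."""
    prob = 1.0
    for t, (s, a, s_next) in enumerate(history):
        prob *= chosen[t][(s, a)][s_next]
    return prob

# Rectangularity says every such independent epoch-by-epoch choice yields a
# measure in T^pi, and every measure in T^pi arises this way.
```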
Q4. What is the simplest way to solve the robust optimization problem?
They compute the value of a policy π = (d, d, . . .), i.e., they solve the robust optimization problem (31), via the following iterative procedure: (a) for every s ∈ S, fix p_s ∈ P(s, d(s)).
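A minimal sketch of such a fixed-policy evaluation, assuming a finite state space indexed 0..n-1, finite candidate sets P(s, d(s)), and a discount factor — the discounting and every identifier here are assumptions, since the quoted answer shows only step (a) of the paper's procedure:

```python
import numpy as np

def robust_policy_value(states, d, P_sets, r, gamma=0.95, tol=1e-8, max_sweeps=100_000):
    """Iteratively evaluate the stationary policy pi = (d, d, ...).

    states         : range(n); value vector V is indexed by these states
    P_sets[(s, a)] : list of candidate transition vectors p(. | s, a)
    r[(s, a)]      : np.array of rewards r(s, a, s') over next states s'
    Each sweep picks, for every state, the least favorable measure in
    P(s, d(s)) given the current value estimate, then updates the value.
    """
    V = np.zeros(len(states))
    for _ in range(max_sweeps):
        V_new = np.array([
            min(p @ (r[(s, d(s))] + gamma * V) for p in P_sets[(s, d(s))])
            for s in states
        ])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```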
Q5. What is the optimal decision rule d∗n at epoch n?
Then the optimal decision rule $d^*_n$ at epoch $n$ is given by
$$
d^*_n(s) = \operatorname*{argmax}_{a \in A(s)} \Big\{ \inf_{p \in P_n(s,a)} \mathbb{E}^{p}\big[\, r_n(s, a, s') + V_{n+1}(s') \,\big] \Big\}.
$$
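Read as pseudocode, this is one backward-induction step of robust DP: maximize over actions the worst-case expected one-step reward plus continuation value. A short sketch under the same finite-ambiguity-set assumption as above (all identifiers hypothetical):

```python
def robust_backup(states, actions, P_n, r_n, V_next):
    """One epoch of robust backward induction.

    Returns (V_n, d_n) with
    d_n(s) = argmax_{a in actions(s)} min_{p in P_n(s,a)} E_p[r_n(s,a,s') + V_next(s')],
    where each candidate p is a dict mapping next states to probabilities.
    """
    V_n, d_n = {}, {}
    for s in states:
        best_a, best_val = None, float("-inf")
        for a in actions(s):
            worst = min(
                sum(p[s2] * (r_n(s, a, s2) + V_next[s2]) for s2 in p)
                for p in P_n(s, a)  # inner adversary picks the measure
            )
            if worst > best_val:
                best_a, best_val = a, worst
        V_n[s], d_n[s] = best_val, best_a
    return V_n, d_n
```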
Q6. What does the ratio M(ω) measure for the robust policy?
The ratio M(ω) measures the loss associated with using a robust policy designed for a confidence level ω. Clearly M(ω) ≤ 1, and the authors expect the ratio to decrease as ω increases.
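One plausible reading, offered here as an assumption since the quoted answer does not define M(ω) explicitly: if
$$
M(\omega) = \frac{R(\pi_\omega)}{R(\pi^*)},
$$
where $R(\cdot)$ denotes expected reward under the point-estimate (non-robust) model, $\pi_\omega$ is the robust policy at confidence level $\omega$, and $\pi^*$ is the non-robust optimal policy, then $M(\omega) \le 1$ follows immediately from the optimality of $\pi^*$ under that model, and enlarging the uncertainty set (larger $\omega$) makes $\pi_\omega$ more conservative, driving the ratio down.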