
Showing papers on "Bellman equation" published in 2011


BookDOI
04 Aug 2011
TL;DR: This book discusses the challenges of dynamic programming and the three curses of dimensionality, and develops approximate dynamic programming (ADP) methods, from stepsize formulas and value function approximation to ADP for finite and infinite horizon problems, including direct ADP for online applications.
Abstract: Preface. Acknowledgments. 1. The challenges of dynamic programming. 1.1 A dynamic programming example: a shortest path problem. 1.2 The three curses of dimensionality. 1.3 Some real applications. 1.4 Problem classes. 1.5 The many dialects of dynamic programming. 1.6 What is new in this book? 1.7 Bibliographic notes. 2. Some illustrative models. 2.1 Deterministic problems. 2.2 Stochastic problems. 2.3 Information acquisition problems. 2.4 A simple modeling framework for dynamic programs. 2.5 Bibliographic notes. Problems. 3. Introduction to Markov decision processes. 3.1 The optimality equations. 3.2 Finite horizon problems. 3.3 Infinite horizon problems. 3.4 Value iteration. 3.5 Policy iteration. 3.6 Hybrid value-policy iteration. 3.7 The linear programming method for dynamic programs. 3.8 Monotone policies. 3.9 Why does it work? 3.10 Bibliographic notes. Problems. 4. Introduction to approximate dynamic programming. 4.1 The three curses of dimensionality (revisited). 4.2 The basic idea. 4.3 Sampling random variables. 4.4 ADP using the post-decision state variable. 4.5 Low-dimensional representations of value functions. 4.6 So just what is approximate dynamic programming? 4.7 Experimental issues. 4.8 Dynamic programming with missing or incomplete models. 4.9 Relationship to reinforcement learning. 4.10 But does it work? 4.11 Bibliographic notes. Problems. 5. Modeling dynamic programs. 5.1 Notational style. 5.2 Modeling time. 5.3 Modeling resources. 5.4 The states of our system. 5.5 Modeling decisions. 5.6 The exogenous information process. 5.7 The transition function. 5.8 The contribution function. 5.9 The objective function. 5.10 A measure-theoretic view of information. 5.11 Bibliographic notes. Problems. 6. Stochastic approximation methods. 6.1 A stochastic gradient algorithm. 6.2 Some stepsize recipes. 6.3 Stochastic stepsizes. 6.4 Computing bias and variance. 6.5 Optimal stepsizes. 6.6 Some experimental comparisons of stepsize formulas. 6.7 Convergence. 6.8 Why does it work? 6.9 Bibliographic notes. Problems. 7. Approximating value functions. 7.1 Approximation using aggregation. 7.2 Approximation methods using regression models. 7.3 Recursive methods for regression models. 7.4 Neural networks. 7.5 Batch processes. 7.6 Why does it work? 7.7 Bibliographic notes. Problems. 8. ADP for finite horizon problems. 8.1 Strategies for finite horizon problems. 8.2 Q-learning. 8.3 Temporal difference learning. 8.4 Policy iteration. 8.5 Monte Carlo value and policy iteration. 8.6 The actor-critic paradigm. 8.7 Bias in value function estimation. 8.8 State sampling strategies. 8.9 Starting and stopping. 8.10 A taxonomy of approximate dynamic programming strategies. 8.11 Why does it work? 8.12 Bibliographic notes. Problems. 9. Infinite horizon problems. 9.1 From finite to infinite horizon. 9.2 Algorithmic strategies. 9.3 Stepsizes for infinite horizon problems. 9.4 Error measures. 9.5 Direct ADP for online applications. 9.6 Finite horizon models for steady state applications. 9.7 Why does it work? 9.8 Bibliographic notes. Problems. 10. Exploration vs. exploitation. 10.1 A learning exercise: the nomadic trucker. 10.2 Learning strategies. 10.3 A simple information acquisition problem. 10.4 Gittins indices and the information acquisition problem. 10.5 Variations. 10.6 The knowledge gradient algorithm. 10.7 Information acquisition in dynamic programming. 10.8 Bibliographic notes. Problems. 11. Value function approximations for special functions. 11.1 Value functions versus gradients.
11.2 Linear approximations. 11.3 Piecewise linear approximations. 11.4 The SHAPE algorithm. 11.5 Regression methods. 11.6 Cutting planes. 11.7 Why does it work? 11.8 Bibliographic notes. Problems. 12. Dynamic resource allocation. 12.1 An asset acquisition problem. 12.2 The blood management problem. 12.3 A portfolio optimization problem. 12.4 A general resource allocation problem. 12.5 A fleet management problem. 12.6 A driver management problem. 12.7 Bibliographic references. Problems. 13. Implementation challenges. 13.1 Will ADP work for your problem? 13.2 Designing an ADP algorithm for complex problems. 13.3 Debugging an ADP algorithm. 13.4 Convergence issues. 13.5 Modeling your problem. 13.6 Online vs. offline models. 13.7 If it works, patent it!
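For readers new to the area, the flavor of the algorithms covered in the stepsize and finite-horizon chapters can be conveyed in a few lines of code. The sketch below is a plain tabular Q-learning loop with a harmonic stepsize of the kind compared in Chapter 6; the toy MDP, the constants, and the stepsize rule are invented for illustration and are not taken from the book.

```python
# A minimal tabular Q-learning sketch on a toy random-walk MDP, using a
# harmonic stepsize of the kind compared in the book's stepsize chapter.
# The MDP (5 states, goal state 4), the constants, and the stepsize rule
# are invented for illustration and are not taken from the book.
import random

N_STATES, ACTIONS, GOAL = 5, (0, 1), 4      # action 0 = left, 1 = right
GAMMA = 0.95
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Move left or right with some exogenous noise; reaching the goal pays 1."""
    move = 1 if a == 1 else -1
    if random.random() < 0.2:               # noise flips the move
        move = -move
    s2 = min(max(s + move, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

s = 0
for n in range(1, 50001):
    a = random.choice(ACTIONS) if random.random() < 0.1 else \
        max(ACTIONS, key=lambda b: Q[(s, b)])            # epsilon-greedy
    s2, r = step(s, a)
    target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    alpha = 10.0 / (10.0 + n)                            # harmonic stepsize a/(a+n)
    Q[(s, a)] += alpha * (target - Q[(s, a)])            # stochastic approximation update
    s = 0 if s2 == GOAL else s2                          # restart after reaching the goal

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```

Approximate dynamic programming in the book's sense replaces the lookup table with a low-dimensional value function approximation and often works with the post-decision state (Sections 4.4-4.5), but the basic update structure is the same.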

2,300 citations


Journal ArticleDOI
TL;DR: It is illustrated that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not, which implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability.
Abstract: We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD (λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
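To make the "matrix inversion" class concrete, here is a schematic LSTD(0) policy-evaluation step with linear features; the feature map, the sampled transitions, and the small ridge term (standing in for the regression/regularization idea mentioned above) are placeholder choices, not the paper's.

```python
# Schematic LSTD(0): estimate weights w with Phi w ≈ V_mu from transitions
# (s, r, s') sampled under the policy mu being evaluated. The feature map,
# the data, and the ridge term are placeholder choices, not the paper's.
import numpy as np

def lstd0(features, transitions, gamma=0.95, ridge=1e-6):
    """features: s -> feature vector phi(s); transitions: list of (s, r, s2)."""
    k = len(features(transitions[0][0]))
    A, b = np.zeros((k, k)), np.zeros(k)
    for s, r, s2 in transitions:
        phi, phi2 = features(s), features(s2)
        A += np.outer(phi, phi - gamma * phi2)   # accumulates A ≈ Phi' D (I - gamma P) Phi
        b += r * phi                             # accumulates b ≈ Phi' D r
    # A small ridge term guards against the near-singularity discussed above.
    return np.linalg.solve(A + ridge * np.eye(k), b)

# Toy usage: a 3-state cycle with one-hot features and a single rewarded transition.
feats = lambda s: np.eye(3)[s]
data = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 0)] * 200
print(lstd0(feats, data))       # approximate values of states 0, 1, 2 under mu
```

The iterative LSPE-style methods surveyed in the paper replace the single solve with repeated incremental updates toward the same fixed point.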

265 citations


Journal ArticleDOI
TL;DR: A weak version of the dynamic programming principle is proved for standard stochastic control problems and mixed control-stopping problems, which avoids the technical difficulties related to the measurable selection argument.
Abstract: We prove a weak version of the dynamic programming principle for standard stochastic control problems and mixed control-stopping problems, which avoids the technical difficulties related to the measurable selection argument. In the Markov case, our result is tailor-made for the derivation of the dynamic programming equation in the sense of viscosity solutions.
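For orientation, the classical (strong) form of the dynamic programming principle that this paper weakens can be written, in generic notation chosen here for illustration, as

$$ V(t,x) \;=\; \sup_{\nu}\ \mathbb{E}\Big[\int_t^{\theta} f\big(s, X^{t,x,\nu}_s, \nu_s\big)\,ds + V\big(\theta, X^{t,x,\nu}_{\theta}\big)\Big] $$

for stopping times θ with values in [t, T]; the weak version replaces V inside the expectation by smooth test functions dominating it from above or below, which is enough to derive the Hamilton–Jacobi–Bellman equation in the viscosity sense without a measurable selection argument.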

242 citations


Journal ArticleDOI
TL;DR: The main advantage of the approach proposed is that it can be applied to a general class of target-hitting continuous dynamic games with nonlinear dynamics, and has very good properties in terms of its numerical solution, since the value function and the Hamiltonian of the system are both continuous.
Abstract: A new framework for formulating reachability problems with competing inputs, nonlinear dynamics, and state constraints as optimal control problems is developed. Such reach-avoid problems arise in, among others, the study of safety problems in hybrid systems. Earlier approaches to reach-avoid computations are either restricted to linear systems, or face numerical difficulties due to possible discontinuities in the Hamiltonian of the optimal control problem. The main advantage of the approach proposed in this paper is that it can be applied to a general class of target-hitting continuous dynamic games with nonlinear dynamics, and has very good properties in terms of its numerical solution, since the value function and the Hamiltonian of the system are both continuous. The performance of the proposed method is demonstrated by applying it to a case study, which involves the target-hitting problem of an underactuated underwater vehicle in the presence of obstacles.
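Schematically (generic notation, not the paper's exact construction), the value function of a terminal-cost differential game of this type solves a Hamilton–Jacobi equation of the form

$$ \frac{\partial V}{\partial t}(x,t) + \min_{u \in U}\,\max_{d \in D}\ \nabla_x V(x,t)\cdot f(x,u,d) = 0, \qquad V(x,T) = l(x), $$

where u and d are the competing inputs and l encodes the distance to the target set. The reach-avoid setting adds state constraints (the set to be avoided), which in earlier formulations forced a discontinuous Hamiltonian; the paper's reformulation keeps both the value function and the Hamiltonian continuous.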

193 citations


Book
01 Jan 2011
TL;DR: This book gives an outline of dynamic programming: preliminary analysis, the Markovian decomposition scheme, the optimality equation, dynamic programming problems, the final state model, the principle of optimality, and solution methods.
Abstract: Introduction Welcome to Dynamic Programming! How to Read This Book SCIENCE Fundamentals Introduction Meta-Recipe Revisited Problem Formulation Decomposition of the Solution Set Principle of Conditional Optimization Conditional Problems Optimality Equation Solution Procedure Time Out: Direct Enumeration! Equivalent Conditional Problems Modified Problems The Role of a Decomposition Scheme Dynamic Programming Problem - Revisited Trivial Decomposition Scheme Summary and a Look Ahead Multistage Decision Model Introduction A Prototype Multistage Decision Model Problem vs Problem Formulation Policies Markovian Policies Remarks on the Notation Summary Bibliographic Notes Dynamic Programming - An Outline Introduction Preliminary Analysis Markovian Decomposition Scheme Optimality Equation Dynamic Programming Problems The Final State Model Principle of Optimality Summary Solution Methods Introduction Additive Functional Equations Truncated Functional Equations Nontruncated Functional Equations Summary Successive Approximation Methods Introduction Motivation Preliminaries Functional Equations of Type One Functional Equations of Type Two Truncation Method Stationary Models Truncation and Successive Approximation Summary Bibliographic Notes Optimal Policies Introduction Preliminary Analysis Truncated Functional Equations Nontruncated Functional Equations Successive Approximation in the Policy Space Summary Bibliographic Notes The Curse of Dimensionality Introduction Motivation Discrete Problems Special Cases Complete Enumeration Conclusions The Rest Is Mathematics and Experience Introduction Choice of Model Dynamic Programming Models Forward Decomposition Models Practice What You Preach! Computational Schemes Applications Dynamic Programming Software Summary ART Refinements Introduction Weak-Markovian Condition Markovian Formulations Decomposition Schemes Sequential Decision Models Example Shortest Path Model The Art of Dynamic Programming Modeling Summary Bibliographic Notes The State Introduction Preliminary Analysis Mathematically Speaking Decomposition Revisited Infeasible States and Decisions State Aggregation Nodes as States Multistage vs Sequential Models Models vs Functional Equations Easy Problems Modeling Tips Concluding Remarks Summary Parametric Schemes Introduction Background and Motivation Fractional Programming Scheme C-Programming Scheme Lagrange Multiplier Scheme Summary Bibliographic Notes The Principle of Optimality Introduction Bellman's Principle of Optimality Prevailing Interpretation Variations on a Theme Criticism So What Is Amiss? The Final State Model Revisited Bellman's Treatment of Dynamic Programming Summary Post Script: Pontryagin's Maximum Principle Forward Decomposition Introduction Function Decomposition Initial Problem Separable Objective Functions Revisited Modified Problems Revisited Backward Conditional Problems Revisited Markovian Condition Revisited Forward Functional Equation Impact on the State Space Anomaly Pathologic Cases Summary and Conclusions Bibliographic Notes Push! Introduction The Pull Method The Push Method Monotone Accumulated Return Processes Dijkstra's Algorithm Summary Bibliographic Notes EPILOGUE What Then Is Dynamic Programming? 
Review Non-Optimization Problems An Abstract Dynamic Programming Model Examples The Towers of Hanoi Problem Optimization-Free Dynamic Programming Concluding Remarks Appendix A: Contraction Mapping Appendix B: Fractional Programming Appendix C: Composite Concave Programming Appendix D: The Principle of Optimality in Stochastic Processes Appendix E: The Corridor Method Bibliography Index
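The "optimality equation" and "pull method" discussed throughout the book boil down, in the simplest shortest-path setting, to a functional equation that can be solved by memoized recursion; the tiny acyclic graph below is invented for illustration and is not an example from the book.

```python
# A minimal sketch of the shortest-path functional equation
#   f(s) = min over arcs (s, s') of [ c(s, s') + f(s') ],   f(goal) = 0,
# solved by memoized recursion ("pulling" values from successors) on a tiny
# acyclic graph invented for illustration; it is not an example from the book.
from functools import lru_cache

ARCS = {                      # node -> {successor: arc cost}
    "A": {"B": 2, "C": 5},
    "B": {"C": 1, "D": 4},
    "C": {"D": 1},
    "D": {},
}
GOAL = "D"

@lru_cache(maxsize=None)
def f(s):
    """Optimal cost-to-go from s via the optimality (functional) equation."""
    if s == GOAL:
        return 0.0
    return min(c + f(t) for t, c in ARCS[s].items())

print(f("A"))                 # 4.0, along A -> B -> C -> D
```

Roughly speaking, Dijkstra's algorithm, treated in the "Push!" chapter, solves the same equation on cyclic graphs with nonnegative arc costs by pushing updates outward from settled nodes instead of pulling them recursively.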

142 citations


Journal ArticleDOI
TL;DR: In this article, the optimal investment and proportional reinsurance strategy when an insurance company wishes to maximize the expected exponential utility of the terminal wealth was studied, assuming that the instantaneous rate of investment return follows an Ornstein-Uhlenbeck process.
Abstract: In this paper, we study the optimal investment and proportional reinsurance strategy when an insurance company wishes to maximize the expected exponential utility of the terminal wealth. It is assumed that the instantaneous rate of investment return follows an Ornstein–Uhlenbeck process. Using stochastic control theory and Hamilton–Jacobi–Bellman equations, explicit expressions for the optimal strategy and value function are derived not only for the compound Poisson risk model but also for the Brownian motion risk model. Further, we investigate the partially observable optimization problem, and also obtain explicit expressions for the optimal results.
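In outline (generic notation, not the paper's), problems of this kind are attacked by writing the Hamilton–Jacobi–Bellman equation for the value function and verifying an exponential ansatz:

$$ V_t + \sup_{(a,\pi)} \Big\{ \mu(t,x,a,\pi)\,V_x + \tfrac{1}{2}\sigma^2(t,x,a,\pi)\,V_{xx} \Big\} = 0, \qquad V(T,x) = -e^{-\gamma x}, $$

where x is the insurer's wealth, a the retained proportion of claims, π the amount invested in the risky asset, and γ the risk-aversion parameter; an ansatz of the form V(t,x) = -exp{-γ x g(t) + h(t)} typically reduces the PDE to ordinary differential equations. In the compound Poisson case an additional jump (integral) term appears, and the stochastic investment return in this paper adds the Ornstein–Uhlenbeck rate as a further state variable.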

91 citations


Journal ArticleDOI
TL;DR: The equivalence between the one-dimensional delay problem and the associated infinite-dimensional problem without delay is shown and it is proved that the value function is continuous in this infinite-dimensional setting.
Abstract: This paper deals with the optimal control of a stochastic delay differential equation arising in the management of a pension fund with surplus. The problem is approached by the tool of a representation in infinite dimension. We show the equivalence between the one-dimensional delay problem and the associated infinite-dimensional problem without delay. Then we prove that the value function is continuous in this infinite-dimensional setting. These results represent a starting point for the investigation of the associated infinite-dimensional Hamilton–Jacobi–Bellman equation in the viscosity sense and for approaching the problem by numerical algorithms. Also an example with complete solution of a simpler but similar problem is provided.

89 citations


Proceedings Article
11 Jun 2011
TL;DR: A new heuristic-search-based family of algorithms, FRET (Find, Revise, Eliminate Traps), is presented and a preliminary empirical evaluation shows that FRET solves GSSPs much more efficiently than Value Iteration.
Abstract: Research in efficient methods for solving infinite-horizon MDPs has so far concentrated primarily on discounted MDPs and the more general stochastic shortest path problems (SSPs). These are MDPs with 1) an optimal value function V* that is the unique solution of the Bellman equation and 2) optimal policies that are the greedy policies w.r.t. V*. This paper's main contribution is the description of a new class of MDPs that have well-defined optimal solutions that do not comply with either 1 or 2 above. We call our new class Generalized Stochastic Shortest Path (GSSP) problems. GSSP allows a more general reward structure than SSP and subsumes several established MDP types including SSP, positive-bounded, negative, and discounted-reward models. While existing efficient heuristic search algorithms like LAO* and LRTDP are not guaranteed to converge to the optimal value function for GSSPs, we present a new heuristic-search-based family of algorithms, FRET (Find, Revise, Eliminate Traps). A preliminary empirical evaluation shows that FRET solves GSSPs much more efficiently than Value Iteration.
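For contrast with the heuristic-search approach, the baseline against which FRET is compared is plain value iteration; here is what it looks like on a toy stochastic shortest path problem. The example MDP is invented for illustration.

```python
# Baseline value iteration on a toy stochastic shortest path problem
# (minimize expected cost to reach the absorbing goal state). The MDP below
# is invented for illustration; FRET-style heuristic search would instead
# expand only states reachable under greedy policies from the start state.
GOAL = 3
# transitions[s][a] = list of (probability, next_state, cost)
transitions = {
    0: {"safe": [(1.0, 1, 2.0)], "risky": [(0.5, 2, 1.0), (0.5, 0, 1.0)]},
    1: {"go":   [(1.0, 3, 1.0)]},
    2: {"go":   [(0.9, 3, 1.0), (0.1, 0, 1.0)]},
}

V = {s: 0.0 for s in transitions}
V[GOAL] = 0.0
for _ in range(1000):
    delta = 0.0
    for s, acts in transitions.items():
        best = min(sum(p * (c + V[s2]) for p, s2, c in outcomes)
                   for outcomes in acts.values())        # Bellman backup
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-9:
        break

print(V)   # expected cost-to-go from each non-goal state (and 0 at the goal)
```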

79 citations


Journal ArticleDOI
TL;DR: In this paper, an optimal reinsurance-investment problem is considered for an insurer whose surplus process follows a jump-diffusion model, where the insurer transfers part of the risk due to insurance claims via a proportional reinsurance and invests the surplus in a simplified financial market consisting of a risk-free asset and a risky asset.
Abstract: We consider an optimal reinsurance-investment problem of an insurer whose surplus process follows a jump-diffusion model. In our model the insurer transfers part of the risk due to insurance claims via a proportional reinsurance and invests the surplus in a “simplified” financial market consisting of a risk-free asset and a risky asset. The dynamics of the risky asset are governed by a constant elasticity of variance model to incorporate conditional heteroscedasticity. The objective of the insurer is to choose an optimal reinsurance-investment strategy so as to maximize the expected exponential utility of terminal wealth. We investigate the problem using the Hamilton-Jacobi-Bellman dynamic programming approach. Explicit forms for the optimal reinsurance-investment strategy and the corresponding value function are obtained. Numerical examples are provided to illustrate how the optimal investment-reinsurance policy changes when the model parameters vary.

75 citations


Journal ArticleDOI
TL;DR: This paper deals with denumerable continuous-time Markov decision processes (MDP) with constraints, reminds the reader about the Bellman equation, introduces and studies occupation measures, and provides the form of optimal policies for the constrained optimization problem.
Abstract: This paper deals with denumerable continuous-time Markov decision processes (MDP) with constraints. The optimality criterion to be minimized is expected discounted loss, while several constraints of the same type are imposed. The transition rates may be unbounded, the loss rates are allowed to be unbounded as well (from above and from below), and the policies may be history-dependent and randomized. Based on Kolmogorov's forward equation and Dynkin's formula, we remind the reader about the Bellman equation, introduce and study occupation measures, reformulate the optimization problem as a (primal) linear program, provide the form of optimal policies for the constrained optimization problem, and establish the duality between the convex analytic approach and dynamic programming. Finally, a series of examples is given to illustrate all of our results.
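The paper works with denumerable continuous-time MDPs; as a finite, discrete-time analogue of the linear-programming reformulation it describes, the occupation-measure LP for a discounted MDP with one additional cost constraint can be written in a few lines (all data below are toy numbers, and scipy is used only as a convenient LP solver).

```python
# Finite, discrete-time analogue of the occupation-measure LP: minimize the
# expected discounted loss subject to a second discounted-cost constraint.
# All data below are toy numbers chosen only for illustration.
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition kernel
c = rng.uniform(0.0, 1.0, size=(nS, nA))        # loss to be minimized
d = rng.uniform(0.0, 1.0, size=(nS, nA))        # constrained cost
mu0 = np.full(nS, 1.0 / nS)                     # initial distribution
budget = 6.0         # bound on the discounted d-cost (loose enough to be feasible here)

# Variables: occupation measure x[s, a] >= 0, flattened to length nS * nA.
# Flow constraints: sum_a x[s, a] - gamma * sum_{s', a'} P[s', a', s] x[s', a'] = mu0[s].
A_eq = np.zeros((nS, nS * nA))
for s in range(nS):
    for s2 in range(nS):
        for a in range(nA):
            A_eq[s, s2 * nA + a] = float(s == s2) - gamma * P[s2, a, s]

res = linprog(c.ravel(), A_ub=[d.ravel()], b_ub=[budget],
              A_eq=A_eq, b_eq=mu0, bounds=(0, None), method="highs")
assert res.status == 0, res.message

x = res.x.reshape(nS, nA)
policy = x / x.sum(axis=1, keepdims=True)       # a (possibly randomized) optimal policy
print(res.fun, policy)
```

The optimal occupation measure directly yields a (generally randomized) policy, which is one reason the convex-analytic approach pairs naturally with constrained problems.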

63 citations


Journal ArticleDOI
TL;DR: In this paper, the authors characterize the highest return relative to the market that can be achieved using non-anticipative investment rules over a given time horizon, and under any admissible configuration of model parameters that might materialize.
Abstract: In an equity market model with “Knightian” uncertainty regarding the relative risk and covariance structure of its assets, we characterize in several ways the highest return relative to the market that can be achieved using nonanticipative investment rules over a given time horizon, and under any admissible configuration of model parameters that might materialize. One characterization is in terms of the smallest positive supersolution to a fully nonlinear parabolic partial differential equation of the Hamilton–Jacobi–Bellman type. Under appropriate conditions, this smallest supersolution is the value function of an associated stochastic control problem, namely, the maximal probability with which an auxiliary multidimensional diffusion process, controlled in a manner which affects both its drift and covariance structures, stays in the interior of the positive orthant through the end of the time-horizon. This value function is also characterized in terms of a stochastic game, and can be used to generate an investment rule that realizes such best possible outperformance of the market.

Journal ArticleDOI
TL;DR: This work considers a nonlinear nonseparable functional approximation to the value function of a dynamic programming formulation for the network revenue management (RM) problem with customer choice and shows that it leads to a tighter upper bound on optimal expected revenue than some known bounds in the literature.
Abstract: We consider a nonlinear nonseparable functional approximation to the value function of a dynamic programming formulation for the network revenue management (RM) problem with customer choice. We propose a simultaneous dynamic programming approach to solve the resulting problem, which is a nonlinear optimization problem with nonlinear constraints. We show that our approximation leads to a tighter upper bound on optimal expected revenue than some known bounds in the literature. Our approach can be viewed as a variant of the classical dynamic programming decomposition widely used in the research and practice of network RM. The computational cost of this new decomposition approach is only slightly higher than the classical version. A numerical study shows that heuristic control policies from the decomposition consistently outperform policies from the classical decomposition.

Journal ArticleDOI
TL;DR: The relation of the relaxed constant rank regularity condition to the error bound property, the directional differentiability of the optimal value function, and necessary and sufficient second order optimality conditions is studied.
Abstract: The paper deals with perturbed nonlinear programming problems under the relaxed constant rank regularity condition. We study the relation of the relaxed constant rank regularity condition with the error bound property, the directional differentiability of the optimal value function, and necessary and sufficient second order optimality conditions.

Journal ArticleDOI
TL;DR: A dynamic programming equation is derived for problems with general time-inconsistent preferences and random duration, in particular for differential games with random duration, and the approach is illustrated by solving the cake-eating problem describing the classical model of management of a nonrenewable resource.

Posted Content
TL;DR: In this paper, the authors consider an agent who invests in a stock and a money market account with the goal of maximizing the utility of his investment at the final time T in the presence of a proportional transaction cost.
Abstract: We consider an agent who invests in a stock and a money market account with the goal of maximizing the utility of his investment at the final time T in the presence of a proportional transaction cost. The utility function considered is power utility. We provide a heuristic and a rigorous derivation of the asymptotic expansion of the value function in powers of transaction cost parameter. We also obtain a "nearly optimal" strategy, whose utility asymptotically matches the leading terms in the value function.

Journal ArticleDOI
TL;DR: In this article, the authors present an optimal investment theorem for a currency exchange model with random and possibly discontinuous proportional transaction costs, where the investor's preferences are represented by a multivariate utility function, allowing for simultaneous consumption of any prescribed selection of the currencies at a given terminal date.
Abstract: We present an optimal investment theorem for a currency exchange model with random and possibly discontinuous proportional transaction costs. The investor’s preferences are represented by a multivariate utility function, allowing for simultaneous consumption of any prescribed selection of the currencies at a given terminal date. We prove the existence of an optimal portfolio process under the assumption of asymptotic satiability of the value function. Sufficient conditions for this include reasonable asymptotic elasticity of the utility function, or a growth condition on its dual function. We show that the portfolio optimization problem can be reformulated in terms of maximization of a terminal liquidation utility function, and that both problems have a common optimizer.

Journal ArticleDOI
TL;DR: The multiobjective bilevel program is a sequence of two optimization problems, with the upper-level problem being multiobjective and the constraint region of the upper-level problem being determined implicitly by the solution set to the lower-level problem.
Abstract: The multiobjective bilevel program is a sequence of two optimization problems, with the upper-level problem being multiobjective and the constraint region of the upper level problem being determined implicitly by the solution set to the lower-level problem. In the case where the Karush-Kuhn-Tucker (KKT) condition is necessary and sufficient for global optimality of all lower-level problems near the optimal solution, we present various optimality conditions by replacing the lower-level problem with its KKT conditions. For the general multiobjective bilevel problem, we derive necessary optimality conditions by considering a combined problem, with both the value function and the KKT condition of the lower-level problem involved in the constraints. Most results of this paper are new, even for the case of a single-objective bilevel program, the case of a mathematical program with complementarity constraints, and the case of a multiobjective optimization problem.

Book ChapterDOI
01 Aug 2011
TL;DR: Optimizing a sequence of actions to attain some future goal is the general topic of control theory Stengel (1993); Fleming and Soner (1992).
Abstract: Optimizing a sequence of actions to attain some future goal is the general topic of control theory (Stengel 1993; Fleming and Soner 1992). It views an agent as an automaton that seeks to maximize expected reward (or minimize cost) over some future time period. Two typical examples that illustrate this are motor control and foraging for food. As an example of a motor control task, consider a human throwing a spear to kill an animal. Throwing a spear requires the execution of a motor program that is such that at the moment the spear leaves the hand, it has the correct speed and direction such that it will hit the desired target. A motor program is a sequence of actions, and this sequence can be assigned a cost that consists generally of two terms: a path cost, which specifies the energy consumption to contract the muscles in order to execute the motor program; and an end cost, which specifies whether the spear will kill the animal, just hurt it, or miss it altogether. The optimal control solution is a sequence of motor commands that results in killing the animal by throwing the spear with minimal physical effort. If x denotes the state (the positions and velocities of the muscles), the optimal control solution is a function u(x, t) that depends both on the actual state of the system at each time and also depends explicitly on time. When an animal forages for food, it explores the environment with the objective of finding as much food as possible in a short time window. At each time t, the animal considers the food it expects to encounter in the period [t, t + T]. Unlike the motor control example, the time horizon recedes into the future with the current time and the cost now consists only of a path contribution and no end cost. Therefore, at each time the animal faces the same task, but possibly from a different location in the environment. The optimal control solution u(x) is now time-independent and specifies for each location in the environment x the direction u in which the animal should move. The general stochastic control problem is intractable to solve and requires an exponential amount of memory and computation time. The reason is that the state space needs to be discretized and thus becomes exponentially large in the number of dimensions. Computing the expectation values means that all states need to be visited and requires the summation of exponentially large sums. The same intractabilities are encountered in reinforcement learning.
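In symbols (a generic finite-horizon formulation, consistent with but not identical to the chapter's notation), the cost of a control and the resulting Hamilton–Jacobi–Bellman equation read

$$ C\big(x,t,u(t\to T)\big) = \mathbb{E}\Big[\phi\big(x(T)\big) + \int_t^T R\big(x(s),u(s),s\big)\,ds\Big], $$

$$ -\partial_t J(x,t) = \min_u \Big( R(x,u,t) + f(x,u,t)^\top \nabla_x J(x,t) + \tfrac12 \operatorname{Tr}\big(\nu\, \nabla_x^2 J(x,t)\big) \Big), \qquad J(x,T) = \phi(x), $$

with φ the end cost, R the path cost rate, f the controlled drift, and ν the noise covariance; the optimal control u(x, t) is the minimizing argument. In the receding-horizon foraging example the fixed horizon T is replaced by t + T, which removes the explicit time dependence of the solution.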

Proceedings ArticleDOI
16 Jul 2011
TL;DR: It is shown that the optimal policies in CPOMDPs can be randomized, and exact and approximate dynamic programming methods for computing randomized optimal policies are presented.
Abstract: Constrained partially observable Markov decision processes (CPOMDPs) extend the standard POMDPs by allowing the specification of constraints on some aspects of the policy in addition to the optimality objective for the value function. CPOMDPs have many practical advantages over standard POMDPs since they naturally model problems involving limited resource or multiple objectives. In this paper, we show that the optimal policies in CPOMDPs can be randomized, and present exact and approximate dynamic programming methods for computing randomized optimal policies. While the exact method requires solving a minimax quadratically constrained program (QCP) in each dynamic programming update, the approximate method utilizes the point-based value update with a linear program (LP). We show that the randomized policies are significantly better than the deterministic ones. We also demonstrate that the approximate point-based method is scalable to solve large problems.

Journal ArticleDOI
TL;DR: In this article, a class of risk-sensitive mean-field stochastic differential games is studied, and the authors show that the mean field value of the exponentiated cost function coincides with the value function of a Hamilton-Jacobi-Bellman-Fleming (HJBF) equation with an additional quadratic term.

Journal ArticleDOI
TL;DR: In this article, the authors compare two different calmness conditions which are widely used in the literature on bilevel programming and on mathematical programs with equilibrium constraints, and they seem to suggest that partial calmness is considerably more restrictive than calmness of the perturbed generalized equation.
Abstract: In this article, we compare two different calmness conditions which are widely used in the literature on bilevel programming and on mathematical programs with equilibrium constraints. In order to do so, we consider convex bilevel programming as a kind of intersection between both research areas. The so-called partial calmness concept is based on the function value approach for describing the lower level solution set. Alternatively, calmness in the sense of multifunctions may be considered for perturbations of the generalized equation representing the same lower level solution set. Both concepts allow to derive first-order necessary optimality conditions via tools of generalized differentiation introduced by Mordukhovich. They are very different, however, concerning their range of applicability and the form of optimality conditions obtained. The results of this article seem to suggest that partial calmness is considerably more restrictive than calmness of the perturbed generalized equation. This fact is al...

Journal ArticleDOI
TL;DR: In this paper, a characterization of the value function as the maximal subsolution of a backward stochastic differential equation (BSDE) and an optimality criterion is provided.
Abstract: In this paper, we study the exponential utility maximization problem in an incomplete market with a default time inducing a discontinuity in the price of the stock. We consider the case of strategies valued in a closed set. Using dynamic programming and BSDE techniques, we provide a characterization of the value function as the maximal subsolution of a backward stochastic differential equation (BSDE) and an optimality criterion. Moreover, in the case of bounded coefficients, the value function is shown to be the maximal solution of a BSDE. Furthermore, the value function can be written as the limit of a sequence of processes which can be characterized as the solutions of Lipschitz BSDEs in the case of bounded coefficients. In the case of convex constraints and under some exponential integrability assumptions on the coefficients, some complementary properties are provided. These results can be generalized to the case of several default times or a Poisson process.

Journal ArticleDOI
TL;DR: Some recent research by the authors on approximate policy iteration algorithms that offer convergence guarantees for both parametric and nonparametric architectures for the value function is described.
Abstract: We review the literature on approximate dynamic programming, with the goal of better understanding the theory behind practical algorithms for solving dynamic programs with continuous and vector-valued states and actions and complex information processes. We build on the literature that has addressed the well-known problem of multidimensional (and possibly continuous) states, and the extensive literature on model-free dynamic programming, which also assumes that the expectation in Bellman’s equation cannot be computed. However, we point out complications that arise when the actions/controls are vector-valued and possibly continuous. We then describe some recent research by the authors on approximate policy iteration algorithms that offer convergence guarantees (with technical assumptions) for both parametric and nonparametric architectures for the value function.

Proceedings ArticleDOI
24 Sep 2011
TL;DR: This work derives from its approach a refinement of the curse-of-dimensionality-free method introduced previously by McEneaney, with a higher accuracy for a comparable computational cost.
Abstract: Max-plus based methods have recently been developed to approximate the value function of possibly high-dimensional optimal control problems. A critical step of these methods consists in approximating a function by a supremum of a small number of functions (max-plus “basis functions”) taken from a prescribed dictionary. We study several variants of this approximation problem, which we show to be continuous versions of the facility location and k-center combinatorial optimization problems, in which the connection costs arise from a Bregman distance. We give theoretical error estimates, quantifying the number of basis functions needed to reach a prescribed accuracy. We derive from our approach a refinement of the curse-of-dimensionality-free method introduced previously by McEneaney, with a higher accuracy for a comparable computational cost.
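To make the approximation step concrete, here is a naive greedy selection of affine max-plus "basis functions" (tangent lines of a convex target on a one-dimensional grid) that minimizes the sup-norm error of their pointwise maximum; it is only a toy stand-in for the facility-location and k-center formulations analyzed in the paper.

```python
# Toy illustration of max-plus approximation: pick k affine functions from a
# dictionary of tangent lines so that their pointwise maximum approximates a
# convex target in sup norm. The greedy selection below is only a naive
# stand-in for the facility-location / k-center formulations of the paper.
import numpy as np

xs = np.linspace(-2.0, 2.0, 401)
target = 0.5 * xs**2                            # convex target V(x) = x^2 / 2

# Dictionary: tangent lines of V at grid points x0:  l(x) = V(x0) + V'(x0) * (x - x0).
dictionary = [0.5 * x0**2 + x0 * (xs - x0) for x0 in np.linspace(-2.0, 2.0, 41)]

selected = []
for k in range(1, 6):                           # budget of 5 "basis functions"
    best = min(dictionary,
               key=lambda cand: np.max(np.abs(target - np.max(selected + [cand], axis=0))))
    selected.append(best)
    err = np.max(np.abs(target - np.max(selected, axis=0)))
    print(f"k={k}  sup-norm error = {err:.4f}")
```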

Posted Content
TL;DR: In this article, the authors consider the case where the dynamic and running cost can be completely different in two (or more) complementary domains of the space and present discontinuities at the boundary of these domains.
Abstract: This article is the starting point of a series of works whose aim is the study of deterministic control problems where the dynamic and the running cost can be completely different in two (or more) complementary domains of the space $\mathbb{R}^N$. As a consequence, the dynamic and running cost present discontinuities at the boundary of these domains and this is the main difficulty of this type of problem. We address these questions by using a Bellman approach: our aim is to investigate how to define properly the value function(s), to deduce what is (are) the right Bellman Equation(s) associated to this problem (in particular what are the conditions on the set where the dynamic and running cost are discontinuous) and to study the uniqueness properties for this Bellman equation. In this work, we provide rather complete answers to these questions in the case of a simple geometry, namely when we only consider two different domains which are half spaces: we properly define the control problem, identify the different conditions on the hyperplane where the dynamic and the running cost are discontinuous and discuss the uniqueness properties of the Bellman problem by either providing explicitly the minimal and maximal solution or by giving explicit conditions to have uniqueness.

Book ChapterDOI
01 Jan 2011
TL;DR: In this article, the authors give an introduction to nonlinear infinite horizon optimal control and show that the optimal value function is a Lyapunov function for the closed-loop system.
Abstract: In this chapter we give an introduction to nonlinear infinite horizon optimal control. The dynamic programming principle as well as several consequences of this principle are proved. One of the main results of this chapter is that the infinite horizon optimal feedback law asymptotically stabilizes the system and that the infinite horizon optimal value function is a Lyapunov function for the closed-loop system. Motivated by this property we formulate a relaxed version of the dynamic programming principle, which makes it possible to prove stability and suboptimality results for non-optimal feedback laws without using the optimal value function. A practical version of this principle is provided, too. These results will be central in the following chapters for the stability and performance analysis of NMPC algorithms. For the special case of sampled-data systems we finally show that for suitable integral costs asymptotic stability of the continuous-time sampled-data closed-loop system follows from the asymptotic stability of the associated discrete-time system.
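The relaxed principle referred to here can be stated roughly as follows (generic discrete-time notation, with ℓ the stage cost): if a function $\widetilde V$ and a feedback law μ satisfy, for some α ∈ (0, 1],

$$ \widetilde V(x) \;\ge\; \alpha\,\ell\big(x,\mu(x)\big) + \widetilde V\big(f(x,\mu(x))\big) \quad \text{for all } x, $$

then the closed-loop cost under μ is at most $\widetilde V(x)/\alpha$; if in addition $\widetilde V$ does not exceed the optimal value function, μ is suboptimal by a factor of at most 1/α, and under suitable bounds on ℓ the function $\widetilde V$ serves as a Lyapunov function for the closed-loop system $x^{+} = f(x,\mu(x))$. This inequality, rather than the exact optimal value function, is what the NMPC analysis in later chapters verifies.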

Journal ArticleDOI
TL;DR: The implicit learning capabilities of the RISE control structure are used to learn the dynamics asymptotically, and it is shown that the system converges to a state-space system that has a quadratic performance index which has been optimized by an additional control element.

Journal ArticleDOI
TL;DR: In this paper, the authors considered the problem of finding good deterministic policies whose risk is smaller than some user-specified threshold, and formalized it as a constrained MDP with two criteria.
Abstract: In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
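A schematic variant of the weighting idea (not the authors' exact algorithm, and with no convergence guarantee): learn one Q-table for the return and one for the probability of entering the error state, act greedily on Q_val − w·Q_risk, and adapt the weight w so that the estimated risk of the greedy policy stays below the threshold ω. The 3-step episodic toy MDP is invented for illustration.

```python
# Schematic sketch of the weighting idea (not the paper's algorithm and with
# no convergence guarantee): learn Q_val for the return and Q_risk for the
# probability of entering the error state, act greedily on Q_val - w * Q_risk,
# and adapt w so that the estimated risk of the greedy policy stays below the
# threshold OMEGA. The 3-step episodic toy MDP is invented for illustration.
import random

ACTIONS, GOAL, ERROR = ("safe", "risky"), 3, -1
OMEGA, ALPHA, START = 0.2, 0.05, 0
Q_val  = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
Q_risk = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
w = 0.0

def step(s, a):
    """'risky' pays 2 but hits the error state with probability 0.15; 'safe' pays 1."""
    if a == "risky":
        return (ERROR, 2.0) if random.random() < 0.15 else (s + 1, 2.0)
    return s + 1, 1.0

def greedy(s):
    return max(ACTIONS, key=lambda a: Q_val[(s, a)] - w * Q_risk[(s, a)])

for episode in range(20000):
    s = START
    while s not in (GOAL, ERROR):
        a = random.choice(ACTIONS) if random.random() < 0.1 else greedy(s)
        s2, r = step(s, a)
        terminal = s2 in (GOAL, ERROR)
        v2 = 0.0 if terminal else Q_val[(s2, greedy(s2))]
        p2 = (1.0 if s2 == ERROR else 0.0) if terminal else Q_risk[(s2, greedy(s2))]
        Q_val[(s, a)]  += ALPHA * (r + v2 - Q_val[(s, a)])
        Q_risk[(s, a)] += ALPHA * (p2 - Q_risk[(s, a)])
        s = s2
    # dual-ascent-like weight update: tighten w when the greedy policy looks too risky
    w = max(0.0, w + 0.01 * (Q_risk[(START, greedy(START))] - OMEGA))

print([greedy(s) for s in range(3)], "w =", round(w, 2))
```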

DOI
01 Jan 2011
TL;DR: Regularization-based Approximate Value/Policy Iteration algorithms are introduced and analyzed, and it is proved that the accompanying complexity regularization-based model selection algorithm enjoys an oracle-like property and may be used to achieve adaptivity: the performance is almost as good as the performance of the unknown best parameters.
Abstract: This thesis studies the reinforcement learning and planning problems that are modeled by a discounted Markov Decision Process (MDP) with a large state space and finite action space. We follow the value-based approach in which a function approximator is used to estimate the optimal value function. The choice of function approximator, however, is nontrivial, as it depends on both the number of data samples and the MDP itself. The goal of this work is to introduce flexible and statistically-efficient algorithms that find close to optimal policies for these problems without much prior information about them. The recurring theme of this thesis is the application of the regularization technique to design value function estimators that choose their estimates from rich function spaces. We introduce regularization-based Approximate Value/Policy Iteration algorithms, analyze their statistical properties, and provide upper bounds on the performance loss of the resulted policy compared to the optimal one. The error bounds show the dependence of the performance loss on the number of samples, the capacity of the function space to which the estimated value function belongs, and some intrinsic properties of the MDP itself. Remarkably, the dependence on the number of samples in the task of policy evaluation is minimax optimal. We also address the problem of automatic parameter-tuning of reinforcement learning/planning algorithms and introduce a complexity regularization-based model selection algorithm. We prove that the algorithm enjoys an oracle-like property and it may be used to achieve adaptivity: the performance is almost as good as the performance of the unknown best parameters. Our two other contributions are used to analyze the aforementioned algorithms. First, we analyze the rate of convergence of the estimation error in regularized least-squares regression when the data is exponentially β-mixing. We prove that up to a logarithmic factor, the convergence rate is the same as the optimal minimax rate available for the i.i.d. case. Second, we attend to the question of how the errors at each iteration of the approximate policy/value iteration influence the quality of the resulting policy. We provide results that highlight some new aspects of these algorithms.

Journal ArticleDOI
TL;DR: In this paper, the authors apply the idea of k-local contraction of Rincon-Zapatero and Rodriguez-Palmero (Econometrica 71:1519-1555, 2003; Econ Theory 33:381-391, 2007) to study discounted stochastic dynamic programming models with unbounded returns.
Abstract: In this paper, we apply the idea of k-local contraction of Rincon-Zapatero and Rodriguez-Palmero (Econometrica 71:1519–1555, 2003; Econ Theory 33:381–391, 2007) to study discounted stochastic dynamic programming models with unbounded returns. Our main results concern the existence of a unique solution to the Bellman equation and are applied to the theory of stochastic optimal growth. Also a discussion of some subtle issues concerning k-local and global contractions is included.