
Showing papers on "Bellman equation" published in 1999


Proceedings Article
29 Nov 1999
TL;DR: This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Abstract: Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
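For reference, the paper's main result (the policy gradient theorem) is commonly written as follows; the notation below is a standard rendering rather than a quotation from the paper:

```latex
% Policy gradient theorem (standard statement):
% d^{\pi} is the discounted (or stationary) state distribution under policy \pi,
% Q^{\pi} the action-value function, \rho the performance measure.
\frac{\partial \rho}{\partial \theta}
  \;=\; \sum_{s} d^{\pi}(s) \sum_{a}
        \frac{\partial \pi(s,a;\theta)}{\partial \theta}\; Q^{\pi}(s,a)
```

The convergence result then rests on replacing Q^π with a compatible value-function or advantage approximator, which the paper shows leaves the gradient estimate unbiased.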

5,492 citations


Journal ArticleDOI
TL;DR: The authors propose a stochastic approximation algorithm that tunes weights of a linear combination of basis functions in order to approximate a value function and prove that this algorithm converges and that the limit of convergence has some desirable properties.
Abstract: The authors develop a theory characterizing optimal stopping times for discrete-time ergodic Markov processes with discounted rewards. The theory differs from prior work by its view of per-stage and terminal reward functions as elements of a certain Hilbert space. In addition to a streamlined analysis establishing existence and uniqueness of a solution to Bellman's equation, this approach provides an elegant framework for the study of approximate solutions. In particular, the authors propose a stochastic approximation algorithm that tunes weights of a linear combination of basis functions in order to approximate a value function. They prove that this algorithm converges (almost surely) and that the limit of convergence has some desirable properties. The utility of the approximation method is illustrated via a computational case study involving the pricing of a path dependent financial derivative security that gives rise to an optimal stopping problem with a 100-dimensional state space.
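A minimal Python sketch of the kind of stochastic approximation the abstract describes; the feature map phi, per-stage reward g, stopping reward G and discount alpha are assumptions for illustration, and the exact update in the paper may differ in details:

```python
import numpy as np

def fit_stopping_value(path, phi, g, G, alpha, n_features):
    """Tune weights w so that phi(x) @ w approximates the optimal-stopping value function.

    path : list of states visited along one simulated trajectory of the Markov process
    phi  : feature map, state -> np.ndarray of shape (n_features,)
    g, G : per-stage reward and terminal (stopping) reward functions
    alpha: discount factor in (0, 1)
    """
    w = np.zeros(n_features)
    for t in range(len(path) - 1):
        x, x_next = path[t], path[t + 1]
        # One-step target: collect g(x), then either stop at x_next (reward G)
        # or continue with the current approximate value phi(x_next) @ w.
        target = g(x) + alpha * max(G(x_next), phi(x_next) @ w)
        w += (1.0 / (t + 1)) * (target - phi(x) @ w) * phi(x)  # stochastic approximation step
    return w
```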

370 citations


Proceedings ArticleDOI
07 Dec 1999
TL;DR: This paper presents a method for optimal control of hybrid systems using an inequality of Bellman type, and an approximation of the optimal feedback control law is given and tried on some examples.
Abstract: This paper presents a method for optimal control of hybrid systems. An inequality of Bellman type is considered and every solution to this inequality gives a lower bound on the optimal value function. A discretization of this "hybrid Bellman inequality" leads to a convex optimization problem in terms of finite-dimensional linear programming. From the solution of the discretized problem, a value function that preserves the lower bound property can be constructed. An approximation of the optimal feedback control law is given and tried on some examples.
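A hedged sketch of the underlying construction (notation mine, not the paper's): for dynamics \dot{x} = f(x,u) and running cost l(x,u), any smooth V satisfying the Bellman-type inequality below underestimates the optimal cost-to-go; enforcing the inequality only at finitely many grid points, together with the corresponding conditions at the discrete transitions of the hybrid system, turns the search for the best such lower bound into a finite-dimensional linear program.

```latex
% Bellman-type inequality: every feasible V is a lower bound on the value function.
\nabla V(x) \cdot f(x,u) + l(x,u) \;\ge\; 0
  \qquad \text{for all admissible } (x,u),
\qquad
\text{maximize } \int V(x)\,dx \ \text{subject to the inequality on a grid}
```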

299 citations


Book ChapterDOI
01 Jan 1999
TL;DR: In this article, theoretical and numerical results for solving qualitative and quantitative control and differential game problems are treated in the framework of set-valued analysis and viability theory, which is well adapted to treating these several problems from a unified point of view.
Abstract: This chapter deals with theoretical and numerical results for solving qualitative and quantitative control and differential game problems. These questions are treated in the framework of set-valued analysis and viability theory, an approach that is well adapted to looking at these several problems from a unified point of view. The idea is to characterize the value function as a viability kernel instead of solving a Hamilton-Jacobi-Bellman equation. This allows us to easily take into account state constraints without any controllability assumptions on the dynamics, either at the boundary of the targets or at the boundary of the constraint set. In the case of two-player differential games, the value function is characterized as a discriminating kernel. This allows dealing with a large class of systems with minimal regularity and convexity assumptions. Rigorous proofs of convergence, including irregular cases, and completely explicit algorithms are provided.

229 citations


Journal ArticleDOI
TL;DR: A powerful new theorem is presented that can provide a unified analysis of value-function-based reinforcement-learning algorithms and allows the convergence of a complex asynchronous reinforcement- learning algorithm to be proved by verifying that a simpler synchronous algorithm converges.
Abstract: Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
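As a concrete instance of the value-function-based algorithms such a theorem covers, here is a standard tabular, asynchronous Q-learning update (generic textbook form, not code from the paper):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Asynchronous update: only the visited (s, a) entry of the Q table changes.

    Q : np.ndarray of shape (n_states, n_actions)
    """
    td_target = r + gamma * np.max(Q[s_next])   # one-step lookahead estimate
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
    return Q
```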

183 citations


Proceedings Article
31 Jul 1999
TL;DR: A convergent, approximate value determination algorithm for structured MDPs that maintains an additive value function, alternating dynamic programming steps with steps that project the result back into the restricted space of additive functions.
Abstract: Many large Markov decision processes (MDPs) can be represented compactly using a structured representation such as a dynamic Bayesian network. Unfortunately, the compact representation does not help standard MDP algorithms, because the value function for the MDP does not retain the structure of the process description. We argue that in many such MDPs, structure is approximately retained. That is, the value functions are nearly additive: closely approximated by a linear function over factors associated with small subsets of problem features. Based on this idea, we present a convergent, approximate value determination algorithm for structured MDPs. The algorithm maintains an additive value function, alternating dynamic programming steps with steps that project the result back into the restricted space of additive functions. We show that both the dynamic programming and the projection steps can be computed efficiently, despite the fact that the number of states is exponential in the number of state variables.
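A schematic sketch of the alternation the abstract describes, assuming an explicit enumeration of states purely for illustration (the point of the paper is to carry out both steps without enumerating states); the basis matrix Phi, transition matrix P and reward vector R are hypothetical inputs:

```python
import numpy as np

def approximate_value_determination(P, R, Phi, gamma=0.95, n_iter=200):
    """Alternate a dynamic-programming backup with projection onto span(Phi).

    P   : (S, S) transition matrix of the policy being evaluated
    R   : (S,) expected one-step rewards
    Phi : (S, k) basis matrix, one column per additive factor
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        v = Phi @ w
        backup = R + gamma * (P @ v)                         # dynamic programming step
        w, *_ = np.linalg.lstsq(Phi, backup, rcond=None)     # project back into the additive space
    return Phi @ w
```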

176 citations


Journal ArticleDOI
TL;DR: For particular classes of systems, the asymptotics for a vanishing risk factor are investigated, showing that in the limit the optimal value of the average cost per unit time is obtained.
Abstract: In this paper we study the existence of solutions to the Bellman equation corresponding to risk-sensitive ergodic control of discrete-time Markov processes using three different approaches. Also, for particular classes of systems, the asymptotics for a vanishing risk factor are investigated, showing that in the limit the optimal value of the average cost per unit time is obtained.
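One common form of the Bellman equation for risk-sensitive ergodic control, given here for orientation (the paper's exact assumptions and notation may differ): with risk factor γ > 0, one-step cost c, relative value function w, and optimal risk-sensitive average cost λ,

```latex
e^{\gamma\,(\lambda + w(x))}
  \;=\; \inf_{a}\; \mathbb{E}\!\left[ e^{\gamma\,(c(x,a) + w(X_{1}))} \,\middle|\, X_{0}=x,\ A_{0}=a \right]
```

Formally letting γ → 0 recovers the ordinary average-cost equation λ + w(x) = inf_a E[c(x,a) + w(X_1)], which is the vanishing-risk-factor limit the abstract refers to.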

152 citations


Proceedings Article
31 Jul 1999
TL;DR: This paper describes variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-tree and derives a splitting criterion that allows one cell to efficiently take into account its impact on other cells when deciding whether to split.
Abstract: State abstraction is of central importance in reinforcement learning and Markov Decision Processes. This paper studies the case of variable resolution state abstraction for continuous-state, deterministic dynamic control problems in which near-optimal policies are required. We describe variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-tree. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. We begin with local approaches based on value function properties and policy properties that use only features of individual cells in making splitting choices. Later, by introducing two new non-local measures, influence and variance, we derive a splitting criterion that allows one cell to efficiently take into account its impact on other cells when deciding whether to split. We evaluate the performance of a variety of splitting criteria on many benchmark problems (published on the web), paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.

116 citations


Book ChapterDOI
01 Jan 1999
TL;DR: In this paper, a class of numerical schemes for the Isaacs equation of pursuit-evasion games is presented, where the solution is interpreted in the viscosity sense, as well as discontinuous value functions, and the convergence of the approximation scheme to the value function of the game is proved.
Abstract: We present a class of numerical schemes for the Isaacs equation of pursuit-evasion games. We consider continuous value functions, where the solution is interpreted in the viscosity sense, as well as discontinuous value functions, where the notion of viscosity envelope-solution is needed. The convergence of the approximation scheme to the value function of the game is proved in both cases. A priori estimates of the convergence in L∞ are established when the value function is Hölder continuous. We also treat problems with state constraints and discuss several issues concerning the implementation of the approximation scheme, the synthesis of approximate feedback controls, and the approximation of optimal trajectories. The efficiency of the algorithm is illustrated by a number of numerical tests, both in the one-player case (i.e., the minimum time problem) and for some two-player games.

104 citations


Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of minimizing risk in Markov decision processes with countable state space and reward set, where the objective is to find a policy which minimizes the probability that the total discounted rewards do not exceed a specified value (target).

93 citations


Book
31 Jan 1999
TL;DR: Theoretical foundations for the Laws of Conservation and Bellman's Principle of Optimality and its generalizations are discussed in this book, with a focus on set theory and axiomatic set theory.
Abstract: Contents: Naive Set Theory; Axiomatic Set Theory; Centralizability and Tests of Applications; A Theoretical Foundation for the Laws of Conservation; A Mathematics of Computability that Speaks the Language of Levels; Bellman's Principle of Optimality and its Generalizations; Unreasonable Effectiveness of Mathematics: A New Tour; General Systems: A Multirelation Approach; Systems of Single Relations; Calculus of Generalized Numbers; Some Unsolved Problems in General Systems Theory.

Journal ArticleDOI
TL;DR: First-order necessary optimality conditions for generalized semi-infinite optimization problems where the index set of the corresponding inequality constraints depends on the decision variables and the involved functions are assumed to be continuously differentiable are derived.
Abstract: In this paper, we consider a generalized semi-infinite optimization problem where the index set of the corresponding inequality constraints depends on the decision variables and the involved functions are assumed to be continuously differentiable. We derive first-order necessary optimality conditions for such problems by using bounds for the upper and lower directional derivatives of the corresponding optimal value function. In the case where the optimal value function is directionally differentiable, we present first-order conditions based on the linearization of the given problem. Finally, we investigate necessary and sufficient first-order conditions by using the calculus of quasidifferentiable functions.

Journal ArticleDOI
TL;DR: In this article, the authors considered the scheduling problem for multiclass queueing networks and optimization of Markov decision processes, and showed that the value iteration algorithm may perform poorly when the algorithm is not initialized properly.
Abstract: This paper considers in parallel the scheduling problem for multiclass queueing networks and optimization of Markov decision processes. It is shown that the value iteration algorithm may perform poorly when the algorithm is not initialized properly. The most common choice, taking the initial value function to be zero, may be a particularly bad one. In contrast, if the value iteration algorithm is initialized with a stochastic Lyapunov function, then the following hold: (i) a stochastic Lyapunov function exists for each intermediate policy, and hence each policy is regular (a strong stability condition), (ii) intermediate costs converge to the optimal cost, and (iii) any limiting policy is average cost optimal. It is argued that a natural choice for the initial value function is the value function for the associated deterministic control problem based upon a fluid model, or the approximate solution to Poisson's equation obtained from the LP of Kumar and Meyn. Numerical studies show that either choice may lead to fast convergence to an optimal policy.
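A minimal sketch of value iteration started from a supplied initial value function, the only point being that V0 is provided (e.g. a fluid-model value function or an approximate solution of Poisson's equation, as the abstract suggests) instead of being set to zero; the finite-MDP encoding below is an assumption for illustration:

```python
import numpy as np

def value_iteration(P, c, V0, n_iter=500):
    """Relative value iteration for a finite average-cost MDP, started from V0.

    P  : (A, S, S) transition matrices, one per action
    c  : (A, S) one-step costs
    V0 : (S,) initial value function (fluid value function, LP-based approximation, ...)
    """
    V = V0.astype(float).copy()
    for _ in range(n_iter):
        Q = c + np.einsum('ast,t->as', P, V)   # one-step cost plus expected future value
        V = Q.min(axis=0)
        V = V - V[0]                           # normalization keeps the iterates bounded
    return V, Q.argmin(axis=0)                 # value function and greedy policy
```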

Book ChapterDOI
01 Jan 1999
TL;DR: In this paper, a systematic account of some results obtained in viscosity solutions, optimal control, and the variational theory of minimizing the maximum is presented, including a derivation of the Euler equation for minimax problems, and an application of minimax control to the minimal time function.
Abstract: A systematic account of some results obtained in viscosity solutions, optimal control, and the variational theory of minimizing the maximum is presented. The major highlights include the theory of lower semicontinuous viscosity solutions, optimal control of the supremum, and its applications to explicit formulas for nonlinear partial differential equations. Several new results are presented, including a new definition of Morrey convexity and Morrey quasiconvexity for vector-valued problems, a result on the existence of minimizers for nonquasiconvex supremands, a derivation of the Euler equation for minimax problems, and an application of minimax control to the minimal time function.

Journal Article
TL;DR: In this paper, general optimality principles for semicontinuous viscosity solutions of Hamilton-Jacobi equations were proved for a class of equations without uniqueness, including the degenerate eikonal equation and the Bellman equation of the linear quadratic control problem.
Abstract: We prove general optimality principles for semicontinuous viscosity solutions of Hamilton-Jacobi equations. We also characterize the minimal nonnegative supersolution and the maximal subsolution null on a closed given set for a class of equations without uniqueness, including the degenerate eikonal equation and the Bellman equation of the linear quadratic control problem.

Journal ArticleDOI
TL;DR: A new transform of set functions over a finite set is introduced, which, like the well-known Möbius transform in combinatorics, is linear and invertible, and which leads to the interaction index, a central concept in multicriteria decision making.
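For context, the classical Möbius transform that the summary compares against is (for a set function v on a finite set N; this is the known combinatorial formula, not the paper's new transform):

```latex
m(S) \;=\; \sum_{T \subseteq S} (-1)^{|S \setminus T|}\, v(T),
\qquad
v(S) \;=\; \sum_{T \subseteq S} m(T), \qquad S \subseteq N
```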

Journal ArticleDOI
TL;DR: In this paper, the existence of optimal stationary policies for infinite-horizon risk-sensitive Markov control processes with denumerable state space, unbounded cost function, and long-run average cost is studied.
Abstract: In this paper we are concerned with the existence of optimal stationary policies for infinite-horizon risk-sensitive Markov control processes with denumerable state space, unbounded cost function, and long-run average cost. Introducing a discounted cost dynamic game, we prove that its value function satisfies an Isaacs equation, and its relationship with the risk-sensitive control problem is studied. Using the vanishing discount approach, we prove that the risk-sensitive dynamic programming inequality holds, and derive an optimal stationary policy.

Book ChapterDOI
01 Jan 1999
TL;DR: In this paper, the authors study invariance and viability properties of a closed set for the trajectories of either a controlled diffusion process or a controlled deterministic system with disturbances, using the value functions associated to suitable optimal control problems or differential games and analyze the related Dynamic Programming equation within the theory of viscosity solutions.
Abstract: We study invariance and viability properties of a closed set for the trajectories of either a controlled diffusion process or a controlled deterministic system with disturbances. We use the value functions associated to suitable optimal control problems or differential games and analyze the related Dynamic Programming equation within the theory of viscosity solutions.

Proceedings ArticleDOI
10 Jul 1999
TL;DR: This work uses neural networks to approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation which is a first-order, nonlinear, partial differential equation, and derives the gradient descent rule for integrating this equation inside the domain, given the conditions on the boundary.
Abstract: We investigate new approaches to dynamic-programming-based optimal control of continuous time-and-space systems. We use neural networks to approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, which is a first-order, nonlinear partial differential equation. We derive the gradient descent rule for integrating this equation inside the domain, given the conditions on the boundary. We apply this approach to the "car-on-the-hill", a two-dimensional, highly nonlinear control problem. We discuss the results obtained and point out the low quality of the approximation of the value function and of the derived control. We attribute this poor approximation to the fact that the HJB equation has many generalized solutions other than the value function, and our gradient descent method converges to one of these functions, thus possibly failing to find the correct value function. We illustrate this limitation on a simple 1D control problem.
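A hedged sketch of the kind of formulation involved (my notation, not the paper's): for dynamics \dot{x} = f(x,u), reward r(x,u) and discount rate ρ, the HJB equation and the squared-residual loss that a gradient descent method would drive toward zero inside the domain are

```latex
% HJB equation for a discounted continuous-time problem, and a residual loss
% minimized over the parameters \theta of a neural approximation V_\theta.
\rho\, V(x) \;=\; \max_{u}\Big[\, r(x,u) + \nabla V(x)\cdot f(x,u) \,\Big],
\qquad
L(\theta) \;=\; \sum_{x_i}\Big( \max_{u}\big[ r(x_i,u) + \nabla V_{\theta}(x_i)\cdot f(x_i,u) \big]
                 - \rho\, V_{\theta}(x_i) \Big)^{2}
```

plus terms enforcing the boundary conditions; as the abstract notes, a small residual does not guarantee that the generalized solution found is the value function.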

Journal ArticleDOI
TL;DR: Discrete-time portfolio selection with maximization of the risk-sensitized growth rate, with and without transaction costs, is considered.

Abstract: In this paper, discrete-time portfolio selection with maximization of the risk-sensitized growth rate, with and without transaction costs, is considered.

Book ChapterDOI
01 Jan 1999
TL;DR: In this paper, it was proved that the value function of piecewise deterministic process optimal control is the unique viscosity solution of its associated Hamilton-Jacobi-Bellman equation.
Abstract: This article concerns the optimal control of piecewise deterministic processes in the viscosity solutions context. Boundary conditions given in Vermes (1985) are weakened and replaced by boundary conditions in exit-time optimal control problems as given in Barles (1994). It is proved that the value function of piecewise-deterministic process optimal control is the unique viscosity solution of its associated Hamilton-Jacobi-Bellman equation.

Journal ArticleDOI
TL;DR: Requirements for the efficient visualization of both the optimal value functions and the optimal trajectories are discussed and graphic routines that in particular support adaptive, hierarchical grid structures, interactivity and animation are developed.
Abstract: We present methods for the visualization of the numerical solution of optimal control problems. The solution is based on dynamic programming techniques where the corresponding optimal value function is approximated on an adaptively refined grid. This approximation is then used in order to compute approximately optimal solution trajectories. We discuss requirements for the efficient visualization of both the optimal value functions and the optimal trajectories and develop graphic routines that in particular support adaptive, hierarchical grid structures, interactivity and animation. Several implementational aspects using the Graphics Programming Environment ‘GRAPE’ are discussed.

Proceedings ArticleDOI
07 Dec 1999
TL;DR: In this article, influence and variance of a Markov chain are used to decide where to refine the resolution of adaptive discretizations for solving continuous time-and-space deterministic optimal control problems.
Abstract: This paper addresses the difficult problem of deciding where to refine the resolution of adaptive discretizations for solving continuous time-and-space deterministic optimal control problems. We introduce two measures, the influence and variance of a Markov chain. Influence measures the extent to which changes at some state affect the value function at other states. Variance measures the heterogeneity of the future cumulated rewards (whose mean is the value function). We combine these two measures to derive an efficient nonlocal splitting criterion that takes into account the impact of a state on other states when deciding whether to split. We illustrate this method on the nonlinear, two-dimensional "Car on the Hill" and the 4D "space-shuttle" and "airplane-meeting" control problems.
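One way to formalize the two measures, paraphrased rather than quoted from the paper: for the Markov chain induced by the current policy with discounted rewards,

```latex
% Influence of a state y on a state x: sensitivity of the value at x to the reward at y,
% i.e. an expected discounted number of visits to y starting from x.
I(y \mid x) \;\approx\; \frac{\partial V(x)}{\partial r(y)}
  \;=\; \mathbb{E}\Big[ \sum_{t \ge 0} \gamma^{t}\, \mathbf{1}\{X_{t} = y\} \,\Big|\, X_{0} = x \Big],
\qquad
% Variance of the future cumulated discounted rewards, whose mean is V(x).
\sigma^{2}(x) \;=\; \mathbb{E}\Big[ \big( \textstyle\sum_{t \ge 0} \gamma^{t} r_{t} - V(x) \big)^{2} \,\Big|\, X_{0} = x \Big]
```

The splitting criterion then favours refining cells where both the variance and the influence on the rest of the state space are large.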

Journal ArticleDOI
TL;DR: In this article, the authors derived necessary and sufficient conditions for a pair of functions to be the optimal policy function and the optimal value function of a dynamic maximization problem with convex constraints and concave objective functional.
Abstract: We derive necessary and sufficient conditions for a pair of functions to be the optimal policy function and the optimal value function of a dynamic maximization problem with convex constraints and concave objective functional. It is shown that every Lipschitz continuous function can be the solution of such a problem. If the maintained assumptions include free disposal and monotonicity, then we obtain a complete characterization of all optimal policy and optimal value functions. This is the case, e.g., in the standard aggregative optimal growth model.

Journal ArticleDOI
TL;DR: In this article, the existence of solutions to the Bellman equation corresponding to risk sensitive control of partially observed discrete time Markov processes is shown; this in turn leads to optimal strategies, and the method used in the paper is based on discounted risk sensitive approximation.
Abstract: In this paper, the existence of solutions to the Bellman equation corresponding to risk-sensitive control of partially observed discrete-time Markov processes is shown; this in turn leads to the existence of optimal strategies. The method used in the paper is based on a discounted risk-sensitive approximation.

Journal ArticleDOI
TL;DR: In this paper, the effect that fold and simple cusp singularities in the flow of a parametrized family of extremal trajectories of an optimal control problem have on the corresponding cost or value function is analyzed.
Abstract: We analyze the effect which a fold and simple cusp singularity in the flow of a parametrized family of extremal trajectories of an optimal control problem has on the corresponding parametrized cost or value function. A fold singularity in the flow of extremals generates an edge of regression of the value implying the well-known results that trajectories stay strongly locally optimal until the fold-locus is reached, but lose optimality beyond. Thus fold points correspond to conjugate points. A simple cusp point in the parametrized flow of extremals generates a swallow-tail in the parametrized value. More specifically, there exists a region in the state space which is covered 3:1 with both locally minimizing and maximizing branches. The changes from the locally minimizing to the maximizing branch occur at the fold-loci and there trajectories lose strong local optimality. However, the branches intersect and generate a cut-locus which limits the optimality of close-by trajectories and eliminates these trajectories from optimality near the cusp point prior to the conjugate point. In the language of partial differential equations, a simple cusp point generates a shock in the solutions to the Hamilton--Jacobi--Bellman equation while fold points will not be part of the synthesis of optimal controls near the simple cusp point.

Journal ArticleDOI
TL;DR: In this paper, a general multidimensional diffusion-type stochastic control problem is studied and the value function of the problem is a viscosity solution of certain Hamilton-Jacobi-Bellman (HJB) quasivariational inequalities.
Abstract: In this paper we study a general multidimensional diffusion-type stochastic control problem. Our model contains the usual regular control problem, singular control problem, and impulse control problem as special cases. Using a unified treatment of dynamic programming, we show that the value function of the problem is a viscosity solution of a certain Hamilton-Jacobi-Bellman (HJB) quasi-variational inequality. The uniqueness of the solution to such a quasi-variational inequality is proved.

Journal ArticleDOI
TL;DR: In this article, the expected total cost (ETC) criterion for discrete-time Markov control processes on Borel spaces, and possibly unbounded cost-per-stage functions, is studied.
Abstract: This paper studies the expected total cost (ETC) criterion for discrete-time Markov control processes on Borel spaces, with possibly unbounded cost-per-stage functions. It presents optimality results which include conditions for a control policy to be ETC-optimal and for the ETC-value function to be a solution of the dynamic programming equation. Conditions are also given for the ETC-value function to be the limit of the α-discounted cost value function as α ↑ 1, and for the Markov control process to be "stable" in the sense of Lagrange and almost surely. In addition, transient control models are fully analyzed. The paper thus provides a fairly complete, updated, survey-like presentation of the ETC criterion for Markov control processes on Borel spaces.

Journal ArticleDOI
TL;DR: The proposed grid schemes for solving optimal guaranteed control problems can be applied to models arising in mechanics, mathematical economics, differential and evolutionary games.

Abstract: Grid approximation schemes for constructing value functions and optimal feedbacks in problems of guaranteed control are proposed. Value functions in optimal control problems are usually nondifferentiable and the corresponding feedbacks have a discontinuous switching character. Constructions of generalized gradients for local (convex, concave, linear) hulls are adapted to finite difference operators which approximate value functions. Optimal feedbacks are synthesized by extremal shift in the direction of generalized gradients. Both problems of constructing the value function and control synthesis are solved simultaneously in the single grid scheme. The interpolation problem is analyzed for grid values of optimal feedbacks. Questions of correlation between spatial and temporal meshes are examined. The significance of quasiconvex properties is clarified for linear dependence of space-time grids.

Journal ArticleDOI
TL;DR: In this paper, the authors extend known existence results for Hamilton-Jacobi-Bellman equations to the more general case where the elliptic operators involved have principal eigenvalues bounded below by a positive constant, with an application to the Maximum Principle in infinite cylinders.
Abstract: The purpose of this paper is to extend known existence results for Hamilton-Jacobi-Bellman equations. The classical results give existence, uniqueness, and Hölder regularity when all elliptic operators involved have a nonpositive zero-order term. We handle here the more general case where they have principal eigenvalues bounded below by a positive constant. As a motivation for this work, we give an application to the study of the Maximum Principle in infinite cylinders, following a work by Berestycki, Caffarelli and Nirenberg [2]. This is used to extend the cylindrical symmetry result in [2] to a more general class of operators.