Showing papers on "Bellman equation published in 2000"


Journal ArticleDOI
TL;DR: This paper introduces interval value functions as a natural extension of traditional value functions and describes an iterative dynamic programming algorithm that computes an interval value function for a given bounded-parameter MDP and specified policy.

314 citations


Journal ArticleDOI
TL;DR: In this paper, a risk process modelled as a compound Poisson process is considered and the ruin probability of this risk process is minimized by the choice of a suitable investment strategy for a capital market index.
Abstract: We consider a risk process modelled as a compound Poisson process. The ruin probability of this risk process is minimized by the choice of a suitable investment strategy for a capital market index. The optimal strategy is computed using the Bellman equation. We prove the existence of a smooth solution and a verification theorem, and give explicit solutions in some cases with exponential claim size distribution, as well as numerical results in a case with Pareto claim size. For this last case, the optimal amount invested will not be bounded.
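For orientation, the Bellman (HJB) equation in a model of this type is often written for the survival probability; the notation below (premium rate c, amount A invested in an index with drift mu and volatility sigma, claim intensity lambda, claim-size distribution F) is an illustrative assumption rather than the paper's own:

\[
\sup_{A}\Big\{ \tfrac{1}{2}\sigma^{2}A^{2}\,\delta''(x) \;+\; (c+\mu A)\,\delta'(x) \;+\; \lambda\Big(\int_{0}^{x}\delta(x-y)\,dF(y)-\delta(x)\Big)\Big\} \;=\; 0,
\]

where \(\delta(x)=1-\psi(x)\) denotes the survival probability for initial surplus \(x\) and \(\psi\) the ruin probability being minimized.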

284 citations


Journal ArticleDOI
TL;DR: In this paper, estimates for parabolic Bellman equations with variable coefficients are obtained, extending those established earlier for constant coefficients.
Abstract: The estimates presented here for parabolic Bellman's equations with variable coefficients extend the ones earlier obtained for constant coefficients.

200 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider an optimal investment model in which the goal is to maximize the long-term growth rate of expected utility of wealth, and reformulate the problem as an infinite time horizon risk sensitive control problem.
Abstract: We consider an optimal investment model in which the goal is to maximize the long-term growth rate of expected utility of wealth. In the model, the mean returns of the securities are explicitly affected by the underlying economic factors. The utility function is HARA. The problem is reformulated as an infinite time horizon risk-sensitive control problem. We study the dynamic programming equation associated with this control problem and derive some consequences of the investment problem.
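As a point of reference, the long-term growth-rate criterion in models of this kind is commonly written as follows for a HARA utility \(U(w)=w^{\gamma}/\gamma\); the notation is assumed for illustration and is not quoted from the paper:

\[
J \;=\; \liminf_{T\to\infty}\,\frac{1}{\gamma T}\,\log \mathbb{E}\big[W_T^{\gamma}\big],
\]

which is the objective of an infinite-horizon risk-sensitive control problem and leads to an ergodic-type dynamic programming equation.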

182 citations


Journal ArticleDOI
TL;DR: A verification theorem of variational inequality type is proved and is applied to solve explicitly some classes of optimal harvesting delay problems.
Abstract: We consider optimal harvesting of systems described by stochastic differential equations with delay. We focus on those situations where the value function of the harvesting problem depends on the initial path of the process in a simple way, namely through its value at 0 and through some weighted averages. A verification theorem of variational inequality type is proved. This is applied to solve explicitly some classes of optimal harvesting delay problems.

159 citations


Proceedings Article
30 Jun 2000
TL;DR: This work presents a new approach to value determination that uses a simple closed-form computation to obtain a least-squares decomposed approximation to the value function for any weights directly, and uses this value-determination algorithm as a subroutine in a policy iteration process.
Abstract: Many large MDPs can be represented compactly using a dynamic Bayesian network. Although the structure of the value function does not retain the structure of the process, recent work has suggested that value functions in factored MDPs can often be approximated well using a factored value function: a linear combination of restricted basis functions, each of which refers only to a small subset of variables. An approximate factored value function for a particular policy can be computed using approximate dynamic programming, but this approach (and others) can only produce an approximation relative to a distance metric which is weighted by the stationary distribution of the current policy. This type of weighted projection is ill-suited to policy improvement. We present a new approach to value determination that uses a simple closed-form computation to obtain a least-squares decomposed approximation to the value function for any weights directly. We then use this value-determination algorithm as a subroutine in a policy iteration process. We show that, under reasonable restrictions, the policies induced by a factored value function can be compactly represented as a decision list, and can be manipulated efficiently in a policy iteration process. We also present a method for computing error bounds for decomposed value functions using a variable-elimination algorithm for function optimization. The complexity of all of our algorithms depends on the factorization of the system dynamics and of the approximate value function.
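A dense-matrix sketch of weighted least-squares value determination for a fixed policy may help fix ideas; the factored (decomposed) structure that makes this computation efficient in the paper is not reproduced here, and all matrices below are illustrative assumptions.

import numpy as np

def ls_value_determination(Phi, P, r, gamma, weights):
    # Fit w so that Phi @ w approximately satisfies the fixed-policy Bellman
    # equation V = r + gamma * P @ V, in a least-squares sense weighted by
    # arbitrary, user-chosen state weights.
    D_half = np.diag(np.sqrt(weights))
    A = Phi - gamma * P @ Phi              # Bellman-residual design matrix
    w, *_ = np.linalg.lstsq(D_half @ A, D_half @ r, rcond=None)
    return w

# Tiny dense example; a factored MDP would exploit structure instead of full matrices.
rng = np.random.default_rng(0)
n_states, n_basis = 6, 3
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
Phi = rng.random((n_states, n_basis))
r = rng.random(n_states)
print(ls_value_determination(Phi, P, r, gamma=0.95, weights=np.ones(n_states)))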

157 citations


Journal ArticleDOI
TL;DR: In this paper, the principal seeks an optimal payment scheme, striving to induce the actions that will maximize her expected discounted profits over a finite planning horizon, and a set of assumptions is introduced that enables a systematic analysis.
Abstract: The principal-agent paradigm, in which a principal has a primary stake in the performance of some system but delegates operational control of that system to an agent, has many natural applications in operations management (OM). However, existing principal-agent models are of limited use to OM researchers because they cannot represent the rich dynamic structure required of OM models. This paper formulates a novel dynamic model that overcomes these limitations by combining the principal-agent framework with the physical structure of a Markov decision process. In this model one has a system moving from state to state as time passes, with transition probabilities depending on actions chosen by an agent, and a principal who pays the agent based on state transitions observed. The principal seeks an optimal payment scheme, striving to induce the actions that will maximize her expected discounted profits over a finite planning horizon. Although dynamic principal-agent models similar to the one proposed here are considered intractable, a set of assumptions is introduced that enables a systematic analysis. These assumptions involve the "economic structure" of the model but not its "physical structure." Under these assumptions, the paper establishes that one can use a dynamic-programming recursion to derive an optimal payment scheme. This scheme is memoryless and satisfies a generalization of Bellman's principle of optimality. Important managerial insights are highlighted in the context of a two-state example called "the maintenance problem".

108 citations


Journal ArticleDOI
TL;DR: It is demonstrated that, given these conditions, increased stochastic fluctuations decrease the value and increase the optimal threshold, thus postponing the exercise of the irreversible policy.
Abstract: We consider a class of singular stochastic control problems arising frequently in applications of stochastic control. We state a set of conditions under which the optimal policy and its value can be derived in terms of the minimal r-excessive functions of the controlled diffusion, and demonstrate that the optimal policy is of the standard local time type. We then state a set of weak smoothness conditions under which the value function is increasing and concave, and demonstrate that given these conditions increased stochastic fluctuations decrease the value and increase the optimal threshold, thus postponing the exercise of the irreversible policy. In line with previous studies of singular stochastic control, we also establish a connection between singular control and optimal stopping, and show that the marginal value of the singular control problem coincides with the value of the associated stopping problem whenever 0 is not a regular boundary for the controlled diffusion.
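The connection stated at the end of the abstract can be written schematically as follows; the notation (discount rate r, controlled diffusion X, stopping payoff g) is assumed for illustration only:

\[
V'(x) \;=\; \sup_{\tau}\,\mathbb{E}_x\big[e^{-r\tau}\,g(X_\tau)\big],
\]

i.e. the derivative (marginal value) of the singular control value function coincides with the value of an associated optimal stopping problem, provided 0 is not a regular boundary for the controlled diffusion.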

100 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the modified dynamic programming operator need not possess a fixed point, and therefore approximate value iteration should not be expected to converge to a fixed point.
Abstract: Approximate value iteration is a simple algorithm that combats the curse of dimensionality in dynamic programs by approximating iterates of the classical value iteration algorithm in a spirit reminiscent of statistical regression. Each iteration of this algorithm can be viewed as an application of a modified dynamic programming operator to the current iterate. The hope is that the iterates converge to a fixed point of this operator, which will then serve as a useful approximation of the optimal value function. In this paper, we show that, in general, the modified dynamic programming operator need not possess a fixed point; therefore, approximate value iteration should not be expected to converge. We then propose a variant of approximate value iteration for which the associated operator is guaranteed to possess at least one fixed point. This variant is motivated by studies of temporal-difference (TD) learning, and existence of fixed points implies here existence of stationary points for the ordinary differential equation approximated by a version of TD that incorporates exploration.
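A small tabular illustration of the scheme analysed here: each iterate applies the exact Bellman operator and then projects the result back onto the span of the basis functions by least squares. The MDP, basis matrix, and projection below are illustrative assumptions; as the paper shows, the composed operator need not have a fixed point, so the iterates may oscillate rather than converge.

import numpy as np

def bellman_operator(V, P_list, r_list, gamma):
    # Exact Bellman optimality operator for a small tabular MDP:
    # one transition matrix P and one reward vector r per action.
    return np.max([r + gamma * P @ V for P, r in zip(P_list, r_list)], axis=0)

def approx_value_iteration(Phi, P_list, r_list, gamma, iters=50):
    # Alternate the exact Bellman backup with a least-squares projection onto
    # span(Phi). This composed ("modified") operator need not have a fixed
    # point, so the iterates may fail to settle down.
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        target = bellman_operator(Phi @ w, P_list, r_list, gamma)
        w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    return w

rng = np.random.default_rng(1)
P_list = [np.full((3, 3), 1.0 / 3) for _ in range(2)]
r_list = [rng.random(3) for _ in range(2)]
Phi = np.array([[1.0], [2.0], [3.0]])      # a single, arbitrary basis function
print(approx_value_iteration(Phi, P_list, r_list, gamma=0.9))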

86 citations


Journal ArticleDOI
TL;DR: A general convergence theorem is derived for RL algorithms when one uses only “approximations” of the initial data, which can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics, and based on FE or FD discretization methods.
Abstract: This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP) which introduces the value function (VF), expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear first (or second) order (depending on the deterministic or stochastic aspect of the process) differential equation called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that there exists an infinity of generalized solutions (differentiable almost everywhere) to this equation, other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control. In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximations schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into a DP equation of a Markov Decision Process (MDP), which can be solved by DP methods (thanks to a “strong” contraction property) if all the initial data (the state dynamics and the reinforcement function) were perfectly known. However, in the RL approach, as we consider a system in interaction with some a priori (at least partially) unknown environment, which learns “from experience”, the initial data are not perfectly known but have to be approximated during learning. The main contribution of this work is to derive a general convergence theorem for RL algorithms when one uses only “approximations” (in a sense of satisfying some “weak” contraction property) of the initial data. This result can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though this latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and one numerical simulation for the “Car on the Hill” problem.
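For reference, in the deterministic case with state dynamics dx/dt = f(x, u), reinforcement r(x, u) and discount factor gamma in (0, 1), the HJB equation referred to above takes the first-order form below (the stochastic case adds a second-order diffusion term); the notation is standard but assumed here:

\[
V(x)\,\ln\frac{1}{\gamma} \;=\; \sup_{u\in U}\Big\{\, r(x,u) \;+\; \nabla V(x)\cdot f(x,u) \,\Big\}.
\]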

80 citations


Proceedings ArticleDOI
12 Dec 2000
TL;DR: In this paper, the value function for an optimal control problem with endpoint and state constraints is characterized as the unique lower semi-continuous generalized solution of the Hamilton-Jacobi equation under a constraint qualification (CQ) concerning the interaction of the state and dynamic constraints.
Abstract: In this paper, the value function for an optimal control problem with endpoint and state constraints is characterized as the unique lower semi-continuous generalized solution of the Hamilton-Jacobi equation. This is achieved under a constraint qualification (CQ) concerning the interaction of the state and dynamic constraints. The novelty of the results reported here is partly the nature of (CQ) and partly the proof techniques employed, which are based on new estimates of the distance of the set of state trajectories satisfying a state constraint from a given trajectory which violates the constraint.

Journal ArticleDOI
TL;DR: In this paper, the limit of the value function of a singularly perturbed optimal control problem is characterized under general conditions and it is shown that limit value functions exist and solve in a viscosity sense a Hamilton-Jacobi equation.
Abstract: The limit as ε → 0 of the value function of a singularly perturbed optimal control problem is characterized. Under general conditions it is shown that limit value functions exist and solve, in a viscosity sense, a Hamilton-Jacobi equation. The Hamiltonian of this equation is generated by an infinite-horizon optimization on the fast time scale. In particular, the limit Hamiltonian and the limit Hamilton-Jacobi equation are applicable in cases where the reduction of order, namely setting ε = 0, does not yield an optimal behavior.

Journal ArticleDOI
TL;DR: Value functions propagated from initial or terminal costs and constraints by way of a differential inclusion, or more broadly through a Lagrangian that may take on $\infty$, are studied in the case where convexity persists in the state argument, and an extended "method of characteristics" is developed.
Abstract: Value functions propagated from initial or terminal costs and constraints by way of a differential inclusion, or more broadly through a Lagrangian that may take on $\infty$, are studied in the case where convexity persists in the state argument. Such value functions, themselves taking on $\infty$, are shown to satisfy a subgradient form of the Hamilton--Jacobi equation which strongly supports properties of local Lipschitz continuity, semidifferentiability and Clarke regularity. An extended "method of characteristics" is developed which determines them from the Hamiltonian dynamics underlying the given Lagrangian. Close relations with a dual value function are revealed.

Journal ArticleDOI
TL;DR: It is proved that the value function of this problem is the unique viscosity solution of the associated Hamilton--Jacobi--Bellman equation.
Abstract: The paper is concerned with fully nonlinear second order Hamilton--Jacobi--Bellman--Isaacs equations of elliptic type in separable Hilbert spaces which have unbounded first and second order terms. The viscosity solution approach is adapted to the equations under consideration and the existence and uniqueness of viscosity solutions are proved. A stochastic optimal control problem driven by a parabolic stochastic PDE with control of Dirichlet type on the boundary is considered. It is proved that the value function of this problem is the unique viscosity solution of the associated Hamilton--Jacobi--Bellman equation.

Journal ArticleDOI
TL;DR: In this article, the authors derived the value of the optimal singular stochastic control for maximizing the expected cumulative revenue flows in the presence of a state-dependent marginal yield measuring the instantaneous returns accrued from irreversibly exerting the singular policy.

Journal ArticleDOI
TL;DR: In this paper, the authors study the stochastic version of a problem they considered in a previous paper, in which hybrid control for deterministic systems was treated, and show how the dynamic programming approach leads to an involved quasi-variational inequality.


Journal ArticleDOI
TL;DR: The value function for the average cost control of a class of partially observed Markov chains is derived as the "vanishing discount limit," in a suitable sense, of the value functions for the corresponding discounted cost problems.
Abstract: The value function for the average cost control of a class of partially observed Markov chains is derived as the "vanishing discount limit," in a suitable sense, of the value functions for the corresponding discounted cost problems. The limiting procedure is justified by bounds derived using a simple coupling argument.
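Schematically, the vanishing discount argument identifies the average-cost value function as a limit of normalized discounted value functions V_beta; the notation below (reference state x_0, optimal average cost rho, relative value function h) is an illustrative assumption:

\[
\rho \;=\; \lim_{\beta\uparrow 1}\,(1-\beta)\,V_\beta(x_0),
\qquad
h(x) \;=\; \lim_{\beta\uparrow 1}\,\big(V_\beta(x)-V_\beta(x_0)\big),
\]

with the pair \((\rho, h)\) then satisfying the average-cost Bellman equation for the (partially observed, hence filter-state) chain.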

Journal ArticleDOI
TL;DR: In this article, the value function of a Bolza optimal control problem with state constraints is characterized as the unique lower semicontinuous solution of a Hamilton-Jacobi equation.


Journal ArticleDOI
TL;DR: In this paper, a control problem with a risk-sensitive ergodic performance criterion for a discrete-time Feller process is considered, and the existence and uniqueness of the solution to the Bellman equation are proved.

01 Jan 2000
TL;DR: Alternative numerical methods for approximating solutions to continuous-state dynamic programming (DP) problems are compared, including comparisons of DPI and PPI with parametric methods applied to the Euler equation for several test problems with closed-form solutions.
Abstract: We compare alternative numerical methods for approximating solutions to continuous-state dynamic programming (DP) problems. We distinguish two approaches: discrete approximation and parametric approximation. In the former, the continuous state space is discretized into a finite number of points N, and the resulting finite-state DP problem is solved numerically. In the latter, a function associated with the DP problem, such as the value function, the policy function, or some other related function, is approximated by a smooth function of K unknown parameters. Values of the parameters are chosen so that the parametric function approximates the true function as closely as possible. We focus on approximations that are linear in parameters, i.e. where the parametric approximation is a linear combination of K basis functions. We also focus on methods that approximate the value function V as the solution to the Bellman equation associated with the DP problem. In finite-state DP problems the method of policy iteration is an effective iterative method for solving the Bellman equation that converges to V in a finite number of steps. Each iteration involves a policy valuation step that computes the value function Vα corresponding to a trial policy α. We show how policy iteration can be extended to continuous-state DP problems. For discrete approximation, we refer to the resulting algorithm as discrete policy iteration (DPI). Each policy valuation step requires the solution of a system of linear equations with N variables. For parametric approximation, we refer to the resulting algorithm as parametric policy iteration (PPI). Each policy valuation step requires the solution of a linear regression with K unknown parameters. The advantage of PPI is that it is generally much faster than DPI, particularly when V can be well-approximated with small K. The disadvantage is that the PPI algorithm may either fail to converge or may converge to an incorrect solution. We compare DPI and PPI to parametric methods applied to the Euler equation for several test problems with closed-form solutions. We also compare the performance of these methods in several "real" applications, including a life-cycle consumption problem, an inventory investment problem, and a problem of optimal pricing, advertising, and exit decisions for newly introduced products.
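A minimal sketch of the PPI policy-valuation step described above: fit the K parameters by regressing the fixed-policy Bellman equation at sampled states. The basis, transition rule, and payoff below are assumptions chosen for illustration, not the paper's test problems.

import numpy as np

def ppi_valuation_step(basis, states, next_states, rewards, gamma):
    # Fit theta so that basis(s) @ theta ~= rewards + gamma * basis(s') @ theta
    # at the sampled states, i.e. solve the induced linear least-squares problem.
    Phi, Phi_next = basis(states), basis(next_states)
    theta, *_ = np.linalg.lstsq(Phi - gamma * Phi_next, rewards, rcond=None)
    return theta

# Illustrative example: cubic polynomial basis and a fixed trial policy.
basis = lambda s: np.vander(s, 4, increasing=True)   # columns 1, s, s^2, s^3
states = np.linspace(0.1, 1.0, 25)
next_states = 0.9 * states            # next state under the trial policy (assumed)
rewards = np.log(states)              # per-period payoff (assumed)
print(ppi_valuation_step(basis, states, next_states, rewards, gamma=0.95))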

Journal ArticleDOI
TL;DR: In this paper, a class of Hamilton-Jacobi-Bellman (HJB) equations associated with stochastic optimal control of the Duncan-Mortensen-Zakai equation is investigated in weighted L^2 spaces.

Journal ArticleDOI
TL;DR: In this article, the problem of determining efficient allocations in a multi-agent economy was characterized in terms of a single value function (that of a social planner), rather than multiple functions (one for each investor).

Journal ArticleDOI
TL;DR: In this paper, the existence, uniqueness, and regularity properties for a class of H-J-B equations arising in non-linear control problems with unbounded controls are investigated.
Abstract: We investigate existence, uniqueness, and regularity properties for a class of H-J-B equations arising in non-linear control problems with unbounded controls. These equations involve Hamiltonians which are superlinear in the adjoint variable, and they have already been studied in the case when the growth in the adjoint variable is, in a sense, uniform with respect to the state variable. For instance, this is the case of the linear-quadratic problem. On the contrary, our results concern Hamiltonians that are superlinear in the adjoint variable, possibly not uniformly with respect to the state variable. Actually, this is the general situation one has to deal with when considering optimal control problems with nonlinear dynamics (e.g. by slightly perturbing the linear-quadratic problem). We also investigate situations where the fast growth of the Hamiltonian in the adjoint variable degenerates into an actual discontinuity. Such Hamiltonians arise quite naturally in those optimal control problems where, roughly speaking, the dynamics and the cost display the same growth in the control variable.

Proceedings ArticleDOI
12 Dec 2000
TL;DR: In this article, the problem of optimal robust sensor scheduling is formulated, and a solution to this problem is given in terms of the existence of suitable solutions to a Riccati differential equation of the game type and a dynamic programming equation.
Abstract: This paper considers the sensor scheduling problem, which consists of estimating the state of an uncertain process based on measurements obtained by switching a given set of noisy sensors. The noise and uncertainty models considered in this paper are assumed to be unknown deterministic functions which satisfy an energy-type constraint known as an integral quadratic constraint. The problem of optimal robust sensor scheduling is formulated, and a solution to this problem is given in terms of the existence of suitable solutions to a Riccati differential equation of the game type and a dynamic programming equation. Furthermore, a real-time implementable method for sensor scheduling is also presented.

Proceedings ArticleDOI
12 Dec 2000
TL;DR: In this article, a method for the numerical solution of the Hamilton-Jacobi-Bellman PDE that arises in an infinite-time optimal control problem is presented; the method can be of higher order, to reduce "the curse of dimensionality".
Abstract: We present a method for the numerical solution of the Hamilton-Jacobi-Bellman PDE that arises in an infinite-time optimal control problem. The method can be of higher order to reduce "the curse of dimensionality". It proceeds in two stages. First the HJB PDE is solved in a neighborhood of the origin using the power series method of Al'brecht (1961). From a boundary point of this neighborhood, an extremal trajectory is computed backward in time using the Pontryagin maximum principle. Then ordinary differential equations are developed for the higher partial derivatives of the solution along the extremal. These are solved, yielding a power series for the approximate solution in a neighborhood of the extremal. This is repeated for other extremals, and these approximate solutions are fitted together by transferring them to a rectangular grid using splines.

Journal ArticleDOI
TL;DR: In this paper, an optimal cost problem for a stochastic Navier-Stokes equation in space dimension 2 was solved by proving existence and uniqueness of a smooth solution of the corresponding Hamilton-Jacobi-Bellman equation.
Abstract: We solve an optimal cost problem for a stochastic Navier-Stokes equation in space dimension 2 by proving existence and uniqueness of a smooth solution of the corresponding Hamilton-Jacobi-Bellman equation.

Journal ArticleDOI
TL;DR: In this article, the authors consider a financial market model, where the dynamics of asset prices are given by an R^m-valued continuous semimartingale, and obtain an explicit description of the variance-optimal martingale measure in terms of the value process of a suitable problem of an optimal equivalent change of measure.
Abstract: We consider a financial market model, where the dynamics of asset prices are given by an R^m-valued continuous semimartingale. Using the dynamic programming approach we obtain an explicit description of the variance-optimal martingale measure in terms of the value process of a suitable problem of an optimal equivalent change of measure and show that this value process uniquely solves the corresponding semimartingale backward equation. This result is applied to prove the existence of a unique generalized solution of Bellman's equation for stochastic volatility models, which is used to determine the variance-optimal martingale measure.

Journal ArticleDOI
TL;DR: This paper provides new results for estimation of the convergence rate of numerical schemes and discusses conditions for the convergence of discrete optimal controls to the optimal control for the initial problem.
Abstract: In this paper we explain that various (possibly discontinuous) value functions for optimal control problems under state constraints can be approached by a sequence of value functions for suitably discretized systems. The key point of this approach is the characterization of epigraphs of the value functions as suitable viability kernels. We provide new results for estimating the convergence rate of numerical schemes and discuss conditions for the convergence of discrete optimal controls to the optimal control for the initial problem.