Showing papers on "Bellman equation" published in 2009


Journal ArticleDOI
TL;DR: It is shown that the time-dependent problem is decomposable with respect to arrival times and therefore can be solved as easily as its static counterpart.
Abstract: This paper studies the problem of finding a priori shortest paths to guarantee a given likelihood of arriving on-time in a stochastic network. Such “reliable” paths help travelers better plan their trips to prepare for the risk of running late in the face of stochastic travel times. Optimal solutions to the problem can be obtained from local-reliable paths, which are a set of non-dominated paths under first-order stochastic dominance. We show that Bellman’s principle of optimality can be applied to construct local-reliable paths. Acyclicity of local-reliable paths is established and used for proving finite convergence of solution procedures. The connection between the a priori path problem and the corresponding adaptive routing problem is also revealed. A label-correcting algorithm is proposed and its complexity is analyzed. A pseudo-polynomial approximation is proposed based on extreme-dominance. An extension that allows travel time distribution functions to vary over time is also discussed. We show that the time-dependent problem is decomposable with respect to arrival times and therefore can be solved as easily as its static counterpart. Numerical results are provided using typical transportation networks.

305 citations
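
The label-correcting idea above can be sketched concretely. The following is a minimal, illustrative Python sketch that keeps only non-dominated arrival-time distributions (labels) at each node, where domination is first-order stochastic dominance; the time discretization, network representation, and helper names are invented for illustration and are not taken from the paper, which additionally covers extreme-dominance approximation and time-dependent distributions.

    import numpy as np
    from collections import defaultdict, deque

    T = 40  # number of discrete time bins (illustrative horizon)

    def push_forward(arrival_pmf, link_pmf):
        # Convolve a node arrival-time pmf with a link travel-time pmf, truncated to T bins.
        return np.convolve(arrival_pmf, link_pmf)[:T]

    def dominates(a, b):
        # First-order stochastic dominance: a's CDF is pointwise at least b's (earlier arrival).
        return np.all(np.cumsum(a) >= np.cumsum(b) - 1e-12)

    def local_reliable_labels(links, origin):
        # links: dict mapping (i, j) -> length-T pmf of the travel time on link (i, j).
        labels = defaultdict(list)               # node -> list of non-dominated arrival pmfs
        start = np.zeros(T); start[0] = 1.0      # depart the origin at time 0 with certainty
        labels[origin].append(start)
        queue = deque([origin])
        while queue:
            i = queue.popleft()
            for (u, j), pmf in links.items():
                if u != i:
                    continue
                for lab in list(labels[i]):
                    cand = push_forward(lab, pmf)
                    if any(dominates(old, cand) for old in labels[j]):
                        continue                 # candidate is dominated: discard it
                    labels[j] = [old for old in labels[j] if not dominates(cand, old)]
                    labels[j].append(cand)
                    queue.append(j)
        return labels

A practical implementation would also store predecessor links with each label so that the most reliable path for a given on-time threshold can be read off from the CDFs at the destination.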


Proceedings Article
07 Dec 2009
TL;DR: This work presents a Bellman error objective function and two gradient-descent TD algorithms that optimize it, and proves the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution.
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa, have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations of previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms' effectiveness.

249 citations
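
For readers unfamiliar with the gradient-TD family this work builds on, the following is a minimal sketch of the linear TDC update of Sutton et al. (2009); the nonlinear algorithms introduced in the paper add further correction terms involving the gradient and curvature of the approximator, which are omitted here. The function signature and step sizes are illustrative assumptions.

    import numpy as np

    def tdc_step(theta, w, phi, phi_next, reward, gamma=0.99, alpha=0.01, beta=0.05):
        # theta: value-function weights; w: auxiliary weights; phi, phi_next: feature vectors.
        delta = reward + gamma * phi_next @ theta - phi @ theta          # TD error
        theta = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
        w = w + beta * (delta - phi @ w) * phi                           # auxiliary estimator update
        return theta, w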


Journal ArticleDOI
TL;DR: This work develops a column generation algorithm to solve the problem for a multinomial logit choice model with disjoint consideration sets (MNLD), and derives a bound as a by-product of a decomposition heuristic.
Abstract: We consider a network revenue management problem where customers choose among open fare products according to some prespecified choice model. Starting with a Markov decision process (MDP) formulation, we approximate the value function with an affine function of the state vector. We show that the resulting problem provides a tighter bound for the MDP value than the choice-based linear program. We develop a column generation algorithm to solve the problem for a multinomial logit choice model with disjoint consideration sets (MNLD). We also derive a bound as a by-product of a decomposition heuristic. Our numerical study shows the policies from our solution approach can significantly outperform heuristics from the choice-based linear program.

223 citations


Journal ArticleDOI
TL;DR: This article introduces Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm, and proposes to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly.

222 citations


Journal ArticleDOI
TL;DR: In this article, a neural network is tuned online using novel tuning laws to learn the complete plant dynamics so that local asymptotic stability of the identification error can be shown.

176 citations


Proceedings Article
01 Jan 2009
TL;DR: The need for partial knowledge of the nonlinear system dynamics is relaxed in the development of a novel approach to ADP using a two-part process: online system identification and offline optimal control training.
Abstract: The optimal control of linear systems accompanied by quadratic cost functions can be achieved by solving the well-known Riccati equation. However, the optimal control of nonlinear discrete-time systems is a much more challenging task that often requires solving the nonlinear Hamilton-Jacobi-Bellman (HJB) equation. In the recent literature, discrete-time approximate dynamic programming (ADP) techniques have been widely used to determine the optimal or near-optimal control policies for affine nonlinear discrete-time systems. However, an inherent assumption of ADP requires the value of the controlled system one step ahead and at least partial knowledge of the system dynamics to be known. In this work, the need for partial knowledge of the nonlinear system dynamics is relaxed in the development of a novel approach to ADP using a two-part process: online system identification and offline optimal control training. First, in the system identification process, a neural network (NN) is tuned online using novel tuning laws to learn the complete plant dynamics so that local asymptotic stability of the identification error can be shown. Then, using only the learned NN system model, offline ADP is attempted, resulting in a novel optimal control law. The proposed scheme does not require explicit knowledge of the system dynamics, as only the learned NN model is needed. A proof of convergence is provided. Simulation results verify the theoretical conjecture.

131 citations
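
The two-part structure (identify the dynamics from data, then train the controller offline against the learned model only) can be conveyed with the toy sketch below. It substitutes least-squares regression on fixed features and fitted value iteration on a grid for the paper's neural-network tuning laws and ADP scheme; the plant, features, and cost are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def plant(x, u):                       # hypothetical 1-D affine nonlinear plant (data source only)
        return 0.8 * np.sin(x) + 0.5 * u

    def phi(x, u):                         # fixed feature basis standing in for a neural network
        return np.stack([np.sin(x), np.cos(x), x, u, u * x, np.ones_like(x)], axis=-1)

    # Part 1: system identification, emulated here by least squares on recorded (x, u, x_next) data.
    X = rng.uniform(-2, 2, 400); U = rng.uniform(-1, 1, 400); Y = plant(X, U)
    W, *_ = np.linalg.lstsq(phi(X, U), Y, rcond=None)
    model = lambda x, u: phi(x, u) @ W     # learned one-step model, used exclusively below

    # Part 2: offline approximate dynamic programming against the learned model only.
    xs = np.linspace(-2, 2, 81); us = np.linspace(-1, 1, 41)
    Xg, Ug = np.meshgrid(xs, us, indexing="ij")
    V, gamma = np.zeros_like(xs), 0.95
    for _ in range(200):
        Vn = np.interp(np.clip(model(Xg, Ug), -2, 2), xs, V)   # V at predicted next states
        Q = Xg**2 + 0.1 * Ug**2 + gamma * Vn                   # illustrative quadratic stage cost
        V = Q.min(axis=1)
    policy = us[Q.argmin(axis=1)]                               # greedy control on the state grid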


Journal ArticleDOI
TL;DR: A new derivation of the dynamic programming equation for general stochastic target problems with unbounded controls is provided, together with the appropriate boundary conditions; these results are applied to the problem of quantile hedging in financial mathematics.
Abstract: We consider the problem of finding the minimal initial data of a controlled process which guarantees to reach a controlled target with a given probability of success or, more generally, with a given level of expected loss. By suitably increasing the state space and the controls, we show that this problem can be converted into a stochastic target problem, i.e., finding the minimal initial data of a controlled process which guarantees to reach a controlled target with probability one. Unlike in the existing literature on stochastic target problems, our increased controls are valued in an unbounded set. In this paper, we provide a new derivation of the dynamic programming equation for general stochastic target problems with unbounded controls, together with the appropriate boundary conditions. These results are applied to the problem of quantile hedging in financial mathematics and are shown to recover the explicit solution of Follmer and Leukert [Finance Stoch., 3 (1999), pp. 251-273].

127 citations


Posted Content
TL;DR: In this article, a new framework for formulating reachability problems with competing inputs, nonlinear dynamics and state constraints as optimal control problems is developed, which can be applied to a general class of target hitting continuous dynamic games with non-linear dynamics, and has very good properties in terms of its numerical solution.
Abstract: A new framework for formulating reachability problems with competing inputs, nonlinear dynamics and state constraints as optimal control problems is developed. Such reach-avoid problems arise in, among others, the study of safety problems in hybrid systems. Earlier approaches to reach-avoid computations are either restricted to linear systems, or face numerical difficulties due to possible discontinuities in the Hamiltonian of the optimal control problem. The main advantage of the approach proposed in this paper is that it can be applied to a general class of target hitting continuous dynamic games with nonlinear dynamics, and has very good properties in terms of its numerical solution, since the value function and the Hamiltonian of the system are both continuous. The performance of the proposed method is demonstrated by applying it to a two-aircraft collision avoidance scenario under target window constraints and in the presence of wind disturbance. Target windows are a novel concept in air traffic management and represent spatial and temporal constraints that the aircraft have to respect to meet their schedule.

114 citations


Proceedings Article
07 Dec 2009
TL;DR: A theory of compositionality in stochastic optimal control is presented, showing how task-optimal controllers can be constructed from certain primitives, and illustrating the theory in the context of human arm movements.
Abstract: We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas. They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties - one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of LQG controllers) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to Green's function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they have been difficult to reconcile. Our work makes this possible.

113 citations
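
The linearity that enables this compositionality can be seen in a small discrete (first-exit) example: with passive dynamics P and state costs q, the exponentiated value function, the "desirability" z, satisfies a linear equation on the interior states, and desirabilities of primitive tasks mix linearly into the desirability of a composite task. The sizes, costs, and dynamics below are made-up toy data, not the paper's arm-movement setting.

    import numpy as np

    rng = np.random.default_rng(1)
    n_int, n_term = 5, 2                                # interior and terminal (absorbing) states
    P = rng.random((n_int, n_int + n_term))
    P /= P.sum(axis=1, keepdims=True)                   # passive dynamics from interior states
    q = 0.5 + rng.random(n_int)                         # positive interior state costs

    def desirability(q_term):
        # Solve z_i = exp(-q_i) * sum_j P[i, j] * z_j on the interior by fixed-point iteration.
        z = np.ones(n_int + n_term)
        z[n_int:] = np.exp(-q_term)                     # boundary condition at terminal states
        for _ in range(1000):
            z[:n_int] = np.exp(-q) * (P @ z)
        return z

    z1 = desirability(np.array([0.0, 5.0]))             # primitive task 1: prefer terminal state A
    z2 = desirability(np.array([5.0, 0.0]))             # primitive task 2: prefer terminal state B

    # Compositionality: if exp(-q_term) = w1*exp(-q_term1) + w2*exp(-q_term2) for the composite
    # task, then its desirability is exactly w1*z1 + w2*z2, with no need to re-solve the equation.
    w1, w2 = 0.3, 0.7
    V_comp = -np.log(w1 * z1 + w2 * z2)                 # composite value function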


Journal ArticleDOI
TL;DR: It is proved that any finite-horizon value function of the DSLQR problem is the pointwise minimum of a finite number of quadratic functions that can be obtained recursively using the so-called switched Riccati mapping.
Abstract: In this paper, we derive some important properties for the finite-horizon and the infinite-horizon value functions associated with the discrete-time switched LQR (DSLQR) problem. It is proved that any finite-horizon value function of the DSLQR problem is the pointwise minimum of a finite number of quadratic functions that can be obtained recursively using the so-called switched Riccati mapping. It is also shown that under some mild conditions, the family of the finite-horizon value functions is homogeneous (of degree 2), is uniformly bounded over the unit ball, and converges exponentially fast to the infinite-horizon value function. The exponential convergence rate of the value iterations is characterized analytically in terms of the subsystem matrices.

101 citations
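
The switched Riccati mapping can be sketched directly: starting from the terminal cost, apply each subsystem's Riccati difference map to every matrix in the current set, and the horizon-N value function is the pointwise minimum of the resulting quadratics. The two subsystems below are arbitrary illustrative data, and no pruning of the exponentially growing set is attempted; the paper's analysis is what justifies truncation and characterizes convergence.

    import numpy as np

    A = [np.array([[1.0, 0.5], [0.0, 1.0]]), np.array([[0.9, 0.0], [0.2, 1.1]])]
    B = [np.array([[0.0], [1.0]]),           np.array([[1.0], [0.0]])]
    Q = [np.eye(2), np.eye(2)]
    R = [np.eye(1), np.eye(1)]

    def riccati_map(P, i):
        # One step of the Riccati difference equation for subsystem i.
        K = np.linalg.solve(R[i] + B[i].T @ P @ B[i], B[i].T @ P @ A[i])
        return Q[i] + A[i].T @ P @ A[i] - A[i].T @ P @ B[i] @ K

    H = [np.zeros((2, 2))]                   # H_0: zero terminal cost
    for _ in range(4):                       # four steps of value iteration
        H = [riccati_map(P, i) for P in H for i in range(2)]

    def value(x):
        # V_4(x): pointwise minimum over the 2**4 quadratics generated above.
        return min(float(x @ P @ x) for P in H)

    print(value(np.array([1.0, -1.0])))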


Journal ArticleDOI
TL;DR: In this paper, the authors show that the problem is equivalent to a parabolic double obstacle problem involving two free boundaries that correspond to the optimal buying and selling policies, and the C^{2,1} regularity of the value function is proven.

Journal ArticleDOI
TL;DR: The effective state space of the corresponding optimal wealth and standard of living processes is described, the associated value function is identified as a generalized utility function, and the interplay between dynamic programming and Feynman-Kac results is exploited via the theory of random fields and stochastic partial differential equations.
Abstract: This paper studies the habit-forming preference problem of maximizing total expected utility from consumption net of the standard of living, a weighted average of past consumption. We describe the effective state space of the corresponding optimal wealth and standard of living processes, identify the associated value function as a generalized utility function, and exploit the interplay between dynamic programming and Feynman-Kac results via the theory of random fields and stochastic partial differential equations (SPDEs). The resulting value random field of the optimization problem satisfies a nonlinear, backward SPDE of parabolic type, widely referred to as the stochastic Hamilton-Jacobi-Bellman equation. The dual value random field is characterized further in terms of a backward parabolic SPDE which is linear. Progressively measurable versions of stochastic feedback formulae for the optimal portfolio and consumption choices are obtained as well.

Book ChapterDOI
01 Jan 2009
TL;DR: It is proved that a non-dominated path should contain no cycles if random link travel times are consistent with the stochastic first-in-first-out principle, and it is shown that the optimal solution is a set of non-dominated paths under first-order stochastic dominance.
Abstract: This paper studies the problem of finding most reliable a priori shortest paths (RASP) in a stochastic and time-dependent network. Correlations are modeled by assuming the probability density functions of link traversal times to be conditional on both the time of day and link states. Such correlations are spatially limited by the Markovian property of the link states, which may be defined to reflect congestion levels or the intensity of random disruptions. We formulate the RASP problem with the above correlation structure as a general dynamic programming problem, and show that the optimal solution is a set of non-dominated paths under first-order stochastic dominance. Conditions are proposed to regulate the transition probabilities of link states such that Bellman’s principle of optimality can be utilized. We prove that a non-dominated path should contain no cycles if random link travel times are consistent with the stochastic first-in-first-out principle. The RASP problem is solved using a non-deterministic polynomial label correcting algorithm. Approximation algorithms with polynomial complexity may be achieved when further assumptions are made on the correlation structure and on the applicability of dynamic programming. Numerical results are provided.

Journal ArticleDOI
TL;DR: This work provides a rare proof of convergence for an approximate dynamic programming algorithm using pure exploitation, where the states the authors visit depend on the decisions produced by solving the approximate problem.
Abstract: We consider a multistage asset acquisition problem where assets are purchased now, at a price that varies randomly over time, to be used to satisfy a random demand at a particular point in time in the future. We provide a rare proof of convergence for an approximate dynamic programming algorithm using pure exploitation, where the states we visit depend on the decisions produced by solving the approximate problem. The resulting algorithm does not require knowing the probability distribution of prices or demands, nor does it require any assumptions about its functional form. The algorithm and its proof rely on the fact that the true value function is a family of piecewise linear concave functions.

Journal ArticleDOI
Marie-Amelie Morlais
TL;DR: In this paper, the authors consider the classical problem of utility maximization in a financial market allowing jumps and prove existence and uniqueness results for the introduced BSDE, which allows them to give the expression of the value function and characterize optimal strategies for the problem.
Abstract: In this paper, we consider the classical problem of utility maximization in a financial market allowing jumps. Assuming that the constraint set of all trading strategies is a compact set, rather than a convex one, we use a dynamic method from which we derive a specific BSDE. To solve the financial problem, we first prove existence and uniqueness results for the introduced BSDE. This allows us to give the expression of the value function and to characterize optimal strategies for the problem.

Journal ArticleDOI
TL;DR: An iterative algorithm to solve Hamilton-Jacobi-Bellman-Isaacs (HJBI) equations for a broad class of nonlinear control systems is proposed; it works by constructing two series of nonnegative functions that can be approximated recursively by existing methods.

Journal ArticleDOI
TL;DR: This work considers the optimal control of a multidimensional cash management system in which the cash balances fluctuate as a homogeneous diffusion process, and computes the solution in two dimensions with linear and distance cost functions.

Journal ArticleDOI
TL;DR: In this paper, the fundamental equations are expressed in the form of a vector-matrix differential equation using the Laplace transformation, which is then solved by an eigenvalue approach; the inversion of the transformed solution is carried out by applying a method of Bellman et al.

Journal ArticleDOI
TL;DR: A class of dynamic advertising problems under uncertainty in the presence of carryover and distributed forgetting effects, generalizing the classical model of Nerlove and Arrow, is considered; the dynamics of the product goodwill are allowed to depend on its past values as well as on previous advertising levels.
Abstract: We consider a class of dynamic advertising problems under uncertainty in the presence of carryover and distributed forgetting effects, generalizing the classical model of Nerlove and Arrow (Economica 29:129–142, 1962). In particular, we allow the dynamics of the product goodwill to depend on its past values, as well as previous advertising levels. Building on previous work (Gozzi and Marinelli in Lect. Notes Pure Appl. Math., vol. 245, pp. 133–148, 2006), the optimal advertising model is formulated as an infinite-dimensional stochastic control problem. We obtain (partial) regularity as well as approximation results for the corresponding value function. Under specific structural assumptions, we study the effects of delays on the value function and optimal strategy. In the absence of carryover effects, since the value function and the optimal advertising policy can be characterized in terms of the solution of the associated HJB equation, we obtain sharper characterizations of the optimal policy.

Journal ArticleDOI
TL;DR: In this article, the authors study stochastic optimal control problems with jumps with the help of the theory of Backward Stochastic Differential Equations (BSDEs) with jumps, and prove that the value functions are the viscosity solutions of the associated generalized Hamilton-Jacobi-Bellman equations with integral-differential operators.
Abstract: In this paper we study stochastic optimal control problems with jumps with the help of the theory of Backward Stochastic Differential Equations (BSDEs) with jumps. We generalize the results of Peng [S. Peng, BSDE and stochastic optimizations, in: J. Yan, S. Peng, S. Fang, L. Wu, Topics in Stochastic Analysis, Science Press, Beijing, 1997 (Chapter 2) (in Chinese)] by considering cost functionals defined by controlled BSDEs with jumps. The application of BSDE methods, in particular, the use of the notion of stochastic backward semigroups introduced by Peng in the above-mentioned work, allows a straightforward proof of a dynamic programming principle for value functions associated with stochastic optimal control problems with jumps. We prove that the value functions are the viscosity solutions of the associated generalized Hamilton–Jacobi–Bellman equations with integral-differential operators. For this proof, we adapt Peng’s BSDE approach, given in the above-mentioned reference and developed in the framework of stochastic control problems driven by Brownian motion, to stochastic control problems driven by Brownian motion and a Poisson random measure.

Journal ArticleDOI
TL;DR: The general framework deals with the important case in which several consecutive orders may be decided before the effective execution of the first one; it is motivated by financial applications in the trading of illiquid assets such as hedge funds.

Journal ArticleDOI
TL;DR: In this article, the authors consider the case where the risky asset is a stock whose price process is a geometric Brownian motion and find a dynamic choice of the investment policy which minimizes the ruin probability of the company.
Abstract: We consider that the surplus of an insurance company follows a Cramer–Lundberg process. The management has the possibility of investing part of the surplus in a risky asset. We consider that the risky asset is a stock whose price process is a geometric Brownian motion. Our aim is to find a dynamic choice of the investment policy which minimizes the ruin probability of the company. We impose that the ratio between the amount invested in the risky asset and the surplus should be smaller than a given positive bound a. For instance, the case a = 1 means that the management cannot borrow money to buy stocks. [Hipp, C., Plum, M., 2000. Optimal investment for insurers. Insurance: Mathematics and Economics 27, 215–228] and [Schmidli, H., 2002. On minimizing the ruin probability by investment and reinsurance. Ann. Appl. Probab. 12, 890–907] solved this problem without borrowing constraints. They found that the ratio between the amount invested in the risky asset and the surplus goes to infinity as the surplus approaches zero, so the optimal strategies of the constrained and unconstrained problems never coincide. We characterize the optimal value function as the classical solution of the associated Hamilton–Jacobi–Bellman equation. This equation is a second-order non-linear integro-differential equation. We obtain numerical solutions for some claim-size distributions and compare our results with those of the unconstrained case.
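
Schematically, and up to the paper's exact conventions, the Hamilton-Jacobi-Bellman equation for the survival probability δ(x) (one minus the ruin probability) in such a Cramér-Lundberg model with premium rate c, claim arrival rate λ, claim-size distribution F, and an amount A invested in a stock with drift μ and volatility σ reads

    \max_{0 \le A \le a x} \left\{ \frac{\sigma^2 A^2}{2}\,\delta''(x)
        + (c + \mu A)\,\delta'(x)
        + \lambda \left( \int_0^{x} \delta(x - y)\, \mathrm{d}F(y) - \delta(x) \right) \right\} = 0,

where the borrowing constraint enters only through the restricted range of the investment amount A.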

Journal ArticleDOI
TL;DR: The dynamic programming equation for a certain class of problems, called second order stochastic target problems, is derived; the main technical tool, the geometric dynamic programming principle, is proved by using the framework developed in H. M. Soner and N. Touzi (2002).
Abstract: Motivated by applications in mathematical finance [U. Cetin, H. M. Soner, and N. Touzi, “Options hedging for small investors under liquidity costs,” Finance Stoch., to appear] we continue our study of second order backward stochastic equations. In this paper, we derive the dynamic programming equation for a certain class of problems which we call the second order stochastic target problems. In contrast with previous formulations of similar problems, we restrict control processes to be continuous. This new framework enables us to apply our results to a larger class of models. Also the resulting derivation is more transparent. The main technical tool is the geometric dynamic programming principle in this context, and it is proved by using the framework developed in [H. M. Soner and N. Touzi, J. Eur. Math. Soc. (JEMS), 8 (2002), pp. 201-236].

Proceedings ArticleDOI
15 May 2009
TL;DR: iLDP can be considered a generalization of Differential Dynamic Programming, inasmuch as it uses general basis functions rather than quadratics to approximate the optimal value function and introduces a collocation method that dispenses with explicit differentiation of the cost and dynamics.
Abstract: We develop an iterative local dynamic programming method (iLDP) applicable to stochastic optimal control problems in continuous high-dimensional state and action spaces. Such problems are common in the control of biological movement, but cannot be handled by existing methods. iLDP can be considered a generalization of Differential Dynamic Programming, inasmuch as: (a) we use general basis functions rather than quadratics to approximate the optimal value function; (b) we introduce a collocation method that dispenses with explicit differentiation of the cost and dynamics and ties iLDP to the Unscented Kalman filter; (c) we adapt the local function approximator to the propagated state covariance, thus increasing accuracy at more likely states. Convergence is similar to quasi-Newton methods. We illustrate iLDP on several problems including the “swimmer” dynamical system which has 14 state and 4 control variables.

Journal ArticleDOI
TL;DR: In this paper, the continuous-time optimal investment and consumption decision of a constant relative risk aversion (CRRA) investor who faces proportional transaction costs and a finite time horizon is studied.
Abstract: This paper concerns continuous-time optimal investment and the consumption decision of a constant relative risk aversion (CRRA) investor who faces proportional transaction costs and a finite time horizon. In the no-consumption case, it has been studied by Liu and Loewenstein [Review of Financial Studies, 15 (2002), pp. 805-835] and Dai and Yi [J. Differential Equations, 246 (2009), pp. 1445-1469]. Mathematically, it is a singular stochastic control problem whose value function satisfies a parabolic variational inequality with gradient constraints. The problem gives rise to two free boundaries which stand for the optimal buying and selling strategies, respectively. We present an analytical approach to analyze the behaviors of free boundaries. The regularity of the value function is studied as well. Our approach is essentially based on the connection between singular control and optimal stopping, which is first revealed in the present problem.

Journal ArticleDOI
TL;DR: The classical epidemic model is adapted to model malware propagation in this multi-network framework and the trade-off between the infection spread and the patching costs is captured in a cost function, leading to an optimal control problem.

Journal ArticleDOI
TL;DR: In this article, an optimal control algorithm based on the Hamilton-Jacobi-Bellman (HJB) equation is proposed for the bounded robust controller design for finite-time-horizon nonlinear systems.
Abstract: In this study, an optimal control algorithm based on the Hamilton-Jacobi-Bellman (HJB) equation is proposed for the bounded robust controller design for finite-time-horizon nonlinear systems. The HJB equation is formulated using a suitable nonquadratic term in the performance functional to take care of magnitude constraints on the control input. Utilising the direct method of Lyapunov stability, we have proved the optimality of the controller with respect to a cost functional that includes a penalty on the control effort and the maximum bound on system uncertainty. The bounded controller requires knowledge of the upper bound of the system uncertainty. In the proposed algorithm, a neural network is used to approximate the time-varying solution of the HJB equation using the least squares method. The proposed algorithm has been applied to nonlinear systems with matched and unmatched system uncertainties. Necessary theoretical and simulation results are presented to validate the proposed algorithm.
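
For context, a nonquadratic penalty commonly used in the constrained-input HJB literature (for example in the form due to Lyshevski, also used by Abu-Khalaf and Lewis) to encode a magnitude bound |u_i| ≤ λ on the input of an affine system ẋ = f(x) + g(x)u is

    W(u) = 2 \int_{0}^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^{\top} R \,\mathrm{d}v ,
    \qquad
    J = \int_{0}^{t_f} \left( x^{\top} Q\, x + W(u) \right) \mathrm{d}t ,

for which the minimization inside the HJB equation yields a control that respects the bound by construction,

    u^{*}(x,t) = -\lambda \tanh\!\left( \frac{1}{2\lambda}\, R^{-1} g^{\top}(x)\, \frac{\partial V}{\partial x}(x,t) \right).

The paper's specific choice of nonquadratic term and its treatment of the uncertainty bound may differ from this standard form.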

Journal ArticleDOI
TL;DR: It is shown that value iteration as well as Howard’s policy improvement algorithm works, and error bounds are given when the utility function is approximated and when the state space is discretized.
Abstract: We consider the problem of maximizing the expected utility of the terminal wealth of a portfolio in a continuous-time pure jump market with general utility function. This leads to an optimal control problem for piecewise deterministic Markov processes. Using an embedding procedure we solve the problem by looking at a discrete-time contracting Markov decision process. Our aim is to show that this point of view has a number of advantages, in particular as far as computational aspects are concerned. We characterize the value function as the unique fixed point of the dynamic programming operator and prove the existence of optimal portfolios. Moreover, we show that value iteration as well as Howard’s policy improvement algorithm works. Finally, we give error bounds when the utility function is approximated and when we discretize the state space. A numerical example is presented and our approach is compared to the approximating Markov chain method.
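
The computational point about value iteration and Howard's policy improvement can be illustrated on a generic finite, discounted Markov decision process; the transition tensor and rewards below are random stand-ins, not the discretized embedded model of the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, beta = 6, 3, 0.9                            # states, actions, discount factor
    P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
    Rwd = rng.random((nA, nS))

    def bellman(V):
        return Rwd + beta * (P @ V)                     # shape (nA, nS): one row per action

    # Value iteration: iterate the contracting dynamic programming operator to its fixed point.
    V = np.zeros(nS)
    for _ in range(500):
        V = bellman(V).max(axis=0)

    # Howard's policy improvement: alternate exact policy evaluation and greedy improvement.
    policy = np.zeros(nS, dtype=int)
    while True:
        Ppi = P[policy, np.arange(nS)]                  # transition matrix under the current policy
        Rpi = Rwd[policy, np.arange(nS)]
        Vpi = np.linalg.solve(np.eye(nS) - beta * Ppi, Rpi)
        new_policy = bellman(Vpi).argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break                                       # the policy is stable, hence optimal
        policy = new_policy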

Journal ArticleDOI
TL;DR: In this paper, various mathematical tools are applied to the dynamic optimization of power-maximizing paths, with special attention paid to nonlinear systems; convergence of discrete algorithms to viscosity solutions of HJB equations, discrete approximations, and the role of the Lagrange multiplier λ associated with the duration constraint are considered.

Journal ArticleDOI
TL;DR: An HJB formalism is used, the explicit form of the Krotov-Bellman function is obtained for different growth stages, and the optimal control problem related to the seasonal benefit of the grower is described.