
Showing papers on "Bellman equation" published in 2008


Journal ArticleDOI
01 Aug 2008
TL;DR: It is shown that HDP converges to the optimal control and the optimal value function that solves the Hamilton-Jacobi-Bellman equation appearing in infinite-horizon discrete-time (DT) nonlinear optimal control.
Abstract: Convergence of the value-iteration-based heuristic dynamic programming (HDP) algorithm is proven in the case of general nonlinear systems. That is, it is shown that HDP converges to the optimal control and the optimal value function that solves the Hamilton-Jacobi-Bellman equation appearing in infinite-horizon discrete-time (DT) nonlinear optimal control. It is assumed that, at each iteration, the value and action update equations can be exactly solved. Two standard neural networks (NNs) are used: a critic NN approximates the value function, and an action NN approximates the optimal control policy. It is stressed that this approach allows the implementation of HDP without knowing the internal dynamics of the system. The exact solution assumption holds for some classes of nonlinear systems and, in particular, for the DT linear quadratic regulator (LQR), where the action is linear, the value is quadratic in the states, and the NNs have zero approximation error. It is stressed that, for the LQR, HDP may be implemented without knowing the system A matrix by using two NNs. This fact is not generally appreciated in the folklore of HDP for the DT LQR, where only one critic NN is generally used.
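
As a concrete illustration of the LQR special case discussed above, here is a minimal Python sketch (not taken from the paper) of value-iteration HDP specialized to the DT LQR, where the critic is exactly quadratic (V_k(x) = x'P_k x) and the action exactly linear. The system matrices A, B and weights Q, R are invented for illustration; the general algorithm would use critic and action neural networks in place of these closed-form updates.

```python
# Hedged sketch: value-iteration HDP specialized to the DT LQR.
# A, B, Q, R below are illustrative assumptions, not from the paper.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

P = np.zeros((2, 2))          # V_0(x) = 0, the usual HDP initialization
for k in range(500):
    # Action update: u = -K x minimizes x'Qx + u'Ru + (Ax+Bu)'P(Ax+Bu)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Value update: V_{k+1}(x) = x'Qx + (Kx)'R(Kx) + V_k((A-BK)x)
    Acl = A - B @ K
    P_next = Q + K.T @ R @ K + Acl.T @ P @ Acl
    if np.max(np.abs(P_next - P)) < 1e-10:
        P = P_next
        break
    P = P_next

print("HDP value matrix P:\n", P)   # should match the DARE solution
```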

919 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This work obtains a more natural form of LQG duality by replacing the Kalman-Bucy filter with the information filter and generalizes this result to non-linear stochastic systems, discrete stochastic systems, and deterministic systems.
Abstract: Optimal control and estimation are dual in the LQG setting, as Kalman discovered, however this duality has proven difficult to extend beyond LQG. Here we obtain a more natural form of LQG duality by replacing the Kalman-Bucy filter with the information filter. We then generalize this result to non-linear stochastic systems, discrete stochastic systems, and deterministic systems. All forms of duality are established by relating exponentiated costs to probabilities. Unlike the LQG setting where control and estimation are in one-to-one correspondence, in the general case control turns out to be a larger problem class than estimation and only a sub-class of control problems have estimation duals. These are problems where the Bellman equation is intrinsically linear. Apart from their theoretical significance, our results make it possible to apply estimation algorithms to control problems and vice versa.
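
For readers unfamiliar with the "intrinsically linear" Bellman equations mentioned above, the following toy Python sketch (not from the paper) shows one well-known instance: the linearly solvable MDP, where the desirability z = exp(-V) satisfies a linear equation. All quantities below are made up.

```python
# Illustrative sketch: with desirability z = exp(-V), state cost q, and
# passive dynamics P, the (average-cost) linearly solvable Bellman equation
# reduces to an eigenvector problem for the linear map z -> exp(-q) * (P z).
import numpy as np

n = 5
rng = np.random.default_rng(0)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # passive dynamics
q = rng.uniform(0.1, 1.0, n)                                # state costs

z = np.ones(n)
for _ in range(500):                  # power iteration on diag(exp(-q)) @ P
    z = np.exp(-q) * (P @ z)
    z /= np.linalg.norm(z)

V = -np.log(z)                        # value function up to an additive constant
print("desirability z:", np.round(z, 3))
print("relative values:", np.round(V - V.min(), 3))
```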

312 citations


Journal ArticleDOI
TL;DR: Peng's BSDE method is extended from the framework of stochastic control theory to that of stochastic differential games and is used to prove a dynamic programming principle for both the upper and the lower value functions of the game in a straightforward way.
Abstract: In this paper we study zero-sum two-player stochastic differential games with the help of the theory of backward stochastic differential equations (BSDEs). More precisely, we generalize the results of the pioneering work of Fleming and Souganidis [Indiana Univ. Math. J., 38 (1989), pp. 293-314] by considering cost functionals defined by controlled BSDEs and by allowing the admissible control processes to depend on events occurring before the beginning of the game. This extension of the class of admissible control processes has the consequence that the cost functionals become random variables. However, by making use of a Girsanov transformation argument, which is new in this context, we prove that the upper and the lower value functions of the game remain deterministic. Apart from the fact that this extension of the class of admissible control processes is quite natural and reflects the behavior of the players who always use the maximum of available information, its combination with BSDE methods, in particular that of the notion of stochastic “backward semigroups" introduced by Peng [BSDE and stochastic optimizations, in Topics in Stochastic Analysis, Science Press, Beijing, 1997], allows us then to prove a dynamic programming principle for both the upper and the lower value functions of the game in a straightforward way. The upper and the lower value functions are then shown to be the unique viscosity solutions of the upper and the lower Hamilton-Jacobi-Bellman-Isaacs equations, respectively. For this Peng's BSDE method is extended from the framework of stochastic control theory into that of stochastic differential games.

268 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input.
Abstract: In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.
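
The following rough Python sketch illustrates the flavor of Bellman-residual minimization with linear features on a single trajectory of a fixed policy. It is a naive single-sample version with an invented feature map and dynamics, not the paper's algorithm (which, among other things, addresses the bias of this plain residual and wraps the fit in a policy-iteration loop).

```python
# Naive sketch of empirical Bellman-residual minimization for policy
# evaluation with linear features; data and features are invented.
import numpy as np

gamma = 0.95
rng = np.random.default_rng(1)

def phi(s):                        # hypothetical feature map on a 1-D state
    return np.array([1.0, s, s * s])

# Fake single trajectory of a fixed policy: s' = 0.9 s + noise, r = -s^2
S = [rng.normal()]
for _ in range(200):
    S.append(0.9 * S[-1] + 0.1 * rng.normal())
S = np.array(S)
R = -S[:-1] ** 2

Phi = np.stack([phi(s) for s in S[:-1]])
PhiN = np.stack([phi(s) for s in S[1:]])
D = Phi - gamma * PhiN             # residual "design" matrix
# Least-squares fit of w minimizing sum_t (phi(s_t)'w - r_t - gamma*phi(s_{t+1})'w)^2
w, *_ = np.linalg.lstsq(D, R, rcond=None)
print("fitted weights:", w)
```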

231 citations


Proceedings ArticleDOI
05 Jul 2008
TL;DR: It is shown that linear value-function approximation is equivalent to a form of linear model approximation, and a relationship between the model-approximation error and the Bellman error is derived, which can guide feature selection for model improvement and/or value- function improvement.
Abstract: We show that linear value-function approximation is equivalent to a form of linear model approximation. We then derive a relationship between the model-approximation error and the Bellman error, and show how this relationship can guide feature selection for model improvement and/or value-function improvement. We also show how these results give insight into the behavior of existing feature-selection algorithms.
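
A small numerical check of the stated equivalence can be written in a few lines; the features, transition matrix, and rewards below are random placeholders. Fitting a linear model in feature space and solving it exactly yields the same value function as the linear fixed-point (LSTD) solution.

```python
# Sketch (invented data): the value of the fitted linear feature-space model,
# Phi (I - gamma F)^{-1} r, coincides with the linear fixed-point solution.
import numpy as np

rng = np.random.default_rng(2)
gamma, n, k = 0.9, 50, 3
Phi = rng.normal(size=(n, k))                   # feature matrix
P = rng.random((n, n)); P /= P.sum(1, keepdims=True)
R = rng.normal(size=n)

proj = np.linalg.pinv(Phi)                      # least-squares projection
F = proj @ P @ Phi                              # approximate feature dynamics
r = proj @ R                                    # approximate feature reward
w_model = np.linalg.solve(np.eye(k) - gamma * F, r)

# LSTD / linear fixed-point solution for comparison
Aw = Phi.T @ (Phi - gamma * P @ Phi)
bw = Phi.T @ R
w_lstd = np.linalg.solve(Aw, bw)

print(np.allclose(Phi @ w_model, Phi @ w_lstd))  # True: same value function
```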

198 citations


Journal ArticleDOI
TL;DR: A broad class of stochastic dynamic programming problems that are amenable to relaxation via decomposition is considered, and an additively separable value function approximation is fitted using two techniques, namely, Lagrangian relaxation and the linear programming (LP) approach to approximate dynamic programming.
Abstract: We consider a broad class of stochastic dynamic programming problems that are amenable to relaxation via decomposition. These problems comprise multiple subproblems that are independent of each other except for a collection of coupling constraints on the action space. We fit an additively separable value function approximation using two techniques, namely, Lagrangian relaxation and the linear programming (LP) approach to approximate dynamic programming. We prove various results comparing the relaxations to each other and to the optimal problem value. We also provide a column generation algorithm for solving the LP-based relaxation to any desired optimality tolerance, and we report on numerical experiments on bandit-like problems. Our results provide insight into the complexity versus quality trade-off when choosing which of these relaxations to implement.
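
The toy sketch below (invented data) illustrates the Lagrangian-relaxation half of the comparison: two small subproblems coupled only through a per-period action budget are solved independently once the coupling constraint is dualized, and the sum of their values plus the multiplier term gives a bound on the optimal value. The subproblem data, the multiplier value, and the budget are all assumptions for illustration.

```python
# Rough sketch of Lagrangian relaxation for a weakly coupled problem: each
# subproblem is solved by its own value iteration after dualizing the shared
# per-period budget; the result is an upper bound on the true optimal value.
import numpy as np

gamma, budget = 0.9, 1.0       # discount factor, per-period budget of "work" actions

# Each subproblem: 2 states, actions 0 (idle) / 1 (work); invented data.
P = [np.array([[[0.9, 0.1], [0.3, 0.7]],      # subproblem 0: P[a][s, s']
               [[0.5, 0.5], [0.1, 0.9]]]),
     np.array([[[0.8, 0.2], [0.4, 0.6]],
               [[0.6, 0.4], [0.2, 0.8]]])]
R = [np.array([[0.0, 1.0], [0.5, 1.5]]),      # R[s, a] for subproblem 0
     np.array([[0.0, 0.8], [0.3, 1.2]])]
cost = np.array([0.0, 1.0])                   # resource use of idle / work

def sub_value(Pi, Ri, lam):
    V = np.zeros(2)
    for _ in range(2000):
        Q = Ri - lam * cost + gamma * np.stack([Pi[a] @ V for a in range(2)], axis=1)
        V = Q.max(axis=1)
    return V

lam = 0.5                                     # one (arbitrary) multiplier value
bound = budget * lam / (1 - gamma) + sum(sub_value(P[i], R[i], lam)[0] for i in range(2))
print("Lagrangian upper bound at lam=0.5:", round(bound, 3))
```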

187 citations


Journal ArticleDOI
TL;DR: In this paper, neural networks are used along with two-player policy iterations to solve for the feedback strategies of a continuous-time zero-sum game that appears in the L2-gain optimal control (suboptimal H∞ control) of nonlinear systems affine in input, with the control policy having saturation constraints.
Abstract: In this paper, neural networks are used along with two-player policy iterations to solve for the feedback strategies of a continuous-time zero-sum game that appears in the L2-gain optimal control (suboptimal H∞ control) of nonlinear systems affine in input, with the control policy having saturation constraints. The result is a closed-form representation, on a prescribed compact set chosen a priori, of the feedback strategies and the value function that solves the associated Hamilton-Jacobi-Isaacs (HJI) equation. The closed-loop stability, L2-gain disturbance attenuation of the neural network saturated control feedback strategy, and uniform convergence results are proven. Finally, this approach is applied to the rotational/translational actuator (RTAC) nonlinear benchmark problem under actuator saturation, offering guaranteed stability and disturbance attenuation.

173 citations


Journal ArticleDOI
TL;DR: In this paper, a portfolio problem of a pension fund manager who wants to maximize the expected utility of the terminal wealth in a complete financial market with a stochastic interest rate is studied.
Abstract: In this paper, we study the portfolio problem of a pension fund manager who wants to maximize the expected utility of the terminal wealth in a complete financial market with a stochastic interest rate. Using the method of stochastic optimal control, we derive a non-linear second-order partial differential equation for the value function. As it is difficult to find a closed-form solution, we transform the primary problem into a dual one by applying a Legendre transform and dual theory, and try to find an explicit solution for the optimal investment strategy under the logarithmic utility function. Finally, a numerical simulation is presented to characterize the dynamic behavior of the optimal portfolio strategy.

116 citations


Journal ArticleDOI
TL;DR: From the tangential condition characterizing capture basins, it is proved that this solution is the unique “upper semicontinuous” solution to the Hamilton-Jacobi-Bellman partial differential equation in the Barron-Jensen/Frankowska sense.
Abstract: We use viability techniques for solving Dirichlet problems with inequality constraints (obstacles) for a class of Hamilton-Jacobi equations. The hypograph of the “solution” is defined as the “capture basin” under an auxiliary control system of a target associated with the initial and boundary conditions, viable in an environment associated with the inequality constraint. From the tangential condition characterizing capture basins, we prove that this solution is the unique “upper semicontinuous” solution to the Hamilton-Jacobi-Bellman partial differential equation in the Barron-Jensen/Frankowska sense. We show how this framework allows us to translate properties of capture basins into corresponding properties of the solutions to this problem. For instance, this approach provides a representation formula of the solution which boils down to the Lax-Hopf formula in the absence of constraints.

94 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider constrained finite-time optimal control problems for discrete-time linear time-invariant systems with constraints on inputs and outputs based on linear and quadratic performance indices.
Abstract: We consider constrained finite-time optimal control problems for discrete-time linear time-invariant systems with constraints on inputs and outputs, based on linear and quadratic performance indices. The solution to such problems is a time-varying piecewise affine (PWA) state-feedback law and can be computed by means of multiparametric programming. By exploiting the properties of the value function and the piecewise affine optimal control law of the constrained finite-time optimal control (CFTOC) problem, we propose two new algorithms that avoid storing the polyhedral regions. The new algorithms significantly reduce the on-line storage demands and computational complexity during evaluation of the PWA feedback control law resulting from the CFTOC problem.
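
For context, the following toy Python sketch shows the baseline evaluation scheme the new algorithms avoid: storing the polyhedral regions and performing point location at run time. The regions and affine gains are invented; the paper's contribution is precisely to evaluate the PWA law without storing these regions.

```python
# Toy sketch of evaluating a stored explicit PWA feedback law by point location.
# Region i is {x : H_i x <= h_i}; the law there is u = F_i x + g_i (made-up data).
import numpy as np

regions = [
    (np.array([[1.0], [-1.0]]), np.array([1.0, 0.0]),  np.array([[-0.5]]), np.array([0.0])),
    (np.array([[1.0], [-1.0]]), np.array([2.0, -1.0]), np.array([[-0.2]]), np.array([-0.3])),
]

def pwa_control(x):
    for H, h, F, g in regions:
        if np.all(H @ x <= h + 1e-9):
            return F @ x + g
    raise ValueError("state outside the feasible set")

print(pwa_control(np.array([0.5])))    # region 1: u = -0.25
print(pwa_control(np.array([1.5])))    # region 2: u = -0.6
```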

91 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: In this article, the core backward induction algorithm of dynamic programming is extended from its traditional discrete case to all isolated time scales and the Hamilton-Jacobi-Bellman equations are motivated and proven on time scales.
Abstract: The time scales calculus is a key emerging area of mathematics due to its potential use in a wide variety of multidisciplinary applications. We extend this calculus to approximate dynamic programming (ADP). The core backward induction algorithm of dynamic programming is extended from its traditional discrete case to all isolated time scales. Hamilton-Jacobi-Bellman equations, the solution of which is the fundamental problem in the field of dynamic programming, are motivated and proven on time scales. By drawing together the calculus of time scales and the applied area of stochastic control via ADP, we have connected two major fields of research.
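
The backward-induction idea on an isolated time scale can be sketched as follows: a toy deterministic example with an invented grid, dynamics, and stage cost, where the step length mu plays the role of the graininess of the time scale.

```python
# Hedged sketch: backward induction on an isolated time scale, i.e. a finite
# increasing set of time points with nonuniform gaps (graininess mu).
# Dynamics, costs, and grids are invented purely for illustration.
import numpy as np

ts = np.array([0.0, 0.5, 0.7, 1.5, 2.0, 3.0])   # an isolated time scale
mu = np.diff(ts)                                 # graininess at each point
xs = np.linspace(-2, 2, 41)                      # state grid
us = np.linspace(-1, 1, 21)                      # control grid

V = 0.5 * xs ** 2                                # terminal cost
policy = []
for k in range(len(mu) - 1, -1, -1):
    Vk = np.empty_like(V)
    Pk = np.empty_like(V)
    for i, x in enumerate(xs):
        x_next = x + mu[k] * (-0.5 * x + us)     # Euler-type step of length mu
        cost = mu[k] * (x ** 2 + us ** 2) + np.interp(x_next, xs, V)
        j = int(np.argmin(cost))
        Vk[i], Pk[i] = cost[j], us[j]
    V, policy = Vk, [Pk] + policy

print("V(0, x=1) ≈", np.interp(1.0, xs, V))
```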

Journal ArticleDOI
TL;DR: In this article, the authors consider a stochastic control problem which is a natural extension of the Monge-Kantorovich problem and provide a probabilistic proof of two fundamental results in mass transportation: the Kantorovich duality and the graph property.
Abstract: We address an optimal mass transportation problem by means of optimal stochastic control. We consider a stochastic control problem which is a natural extension of the Monge-Kantorovich problem. Using a vanishing viscosity argument we provide a probabilistic proof of two fundamental results in mass transportation: the Kantorovich duality and the graph property for the support of an optimal measure for the Monge-Kantorovich problem. Our key tool is a stochastic duality result involving solutions of the Hamilton-Jacobi-Bellman PDE.

Posted Content
TL;DR: This work explores the way of solving the Monge-Ampère equation by a sort of method of characteristics to find the Bellman function of certain classical Harmonic Analysis problems, and, therefore, of finding the full structure of sharp constants and extremal sequences for those problems.
Abstract: The Monge-Ampère equation plays an important part in Analysis. For example, it is instrumental in mass transport problems. On the other hand, the Bellman function technique appeared recently as a way to consider certain Harmonic Analysis problems as problems of Stochastic Optimal Control. This brings us to the Bellman PDE, which in the stochastic setting is often a Monge-Ampère equation or a close relative. We explore the way of solving the Monge-Ampère equation by a sort of method of characteristics to find the Bellman function of certain classical Harmonic Analysis problems, and, therefore, of finding the full structure of sharp constants and extremal sequences for those problems.

Posted Content
TL;DR: In a newsvendor problem with partially observed Markovian demand, the optimal order is set to exceed the myopic optimal order, and a near-optimal solution is characterized by establishing that the value function is piecewise linear.
Abstract: We consider a newsvendor problem with partially observed Markovian demand. Demand is observed if it is less than the inventory. Otherwise, only the event that it is larger than or equal to the inventory is observed. These observations are used to update the demand distribution from one period to the next. The state of the resulting dynamic programming equation is the current demand distribution, which is generally infinite dimensional. We use unnormalized probabilities to convert the nonlinear state transition equation to a linear one. This helps in proving the existence of an optimal feedback ordering policy. So as to learn more about the demand, the optimal order is set to exceed the myopic optimal order. The optimal cost decreases as the demand distribution decreases in the hazard rate order. In a special case with finitely many demand values, we characterize a near-optimal solution by establishing that the value function is piecewise linear.
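
A small sketch of the censored-observation belief update described above, with an invented demand chain; working with unnormalized probabilities keeps the update linear, which is the device the abstract refers to.

```python
# Sketch (made-up numbers) of the censored-demand belief update: demand
# follows a Markov chain; if demand d is below the inventory y it is observed
# exactly, otherwise only the event {d >= y} is seen.  Skipping normalization
# keeps the state-transition update linear in the belief.
import numpy as np

demand_vals = np.array([0, 1, 2, 3])
T = np.array([[0.6, 0.3, 0.1, 0.0],           # demand transition matrix (invented)
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.3, 0.4, 0.2],
              [0.0, 0.2, 0.3, 0.5]])

def update(pi, y, observed_d=None):
    """Unnormalized belief update after stocking y units."""
    if observed_d is not None:                # demand observed exactly (d < y)
        return pi[observed_d] * T[observed_d]
    censored = demand_vals >= y               # only {demand >= y} was observed
    return (pi * censored) @ T                # linear in the unnormalized state

pi0 = np.array([0.25, 0.25, 0.25, 0.25])
print(update(pi0, y=2, observed_d=1))         # exact observation of demand 1
print(update(pi0, y=2))                       # censored observation
```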

Journal ArticleDOI
TL;DR: In this article, the authors give an introduction to generalized semi-infinite programming (GSIP) models and present necessary and sufficient first- and second-order optimality conditions in which directional differentiability properties of the optimal value function of the lower level problem are used.

Journal ArticleDOI
TL;DR: It is proved that, under generic assumptions, singular trajectories of control-affine systems share nice properties related to computational aspects; for systems satisfying the Lie algebra rank condition (LARC), singular trajectories are strictly abnormal, generically with respect to the cost, and it is shown how these results can be used to derive regularity results for the value function and in the theory of Hamilton-Jacobi equations.
Abstract: When applying methods of optimal control to motion planning or stabilization problems, we see that some theoretical or numerical difficulties may arise, due to the presence of specific trajectories, namely, minimizing singular trajectories of the underlying optimal control problem. In this article, we provide characterizations for singular trajectories of control-affine systems. We prove that, under generic assumptions, such trajectories share nice properties, related to computational aspects; more precisely, we show that, for a generic system—with respect to the Whitney topology—all nontrivial singular trajectories are of minimal order and of corank one. These results, established both for driftless and for control-affine systems, extend results of [Y. Chitour, F. Jean, and E. Trelat, Comptes Rendus Math., 337 (2003), pp. 49-52 (in French); Y. Chitour, F. Jean, and E. Trelat, J. Differential Geom., 73 (2006), pp. 45-73]. As a consequence, for generic control-affine systems (with or without drift) defined by more than two vector fields, and for a fixed cost, there do not exist minimizing singular trajectories. Besides, we prove that, given a control-affine system satisfying the Lie algebra rank condition (LARC), singular trajectories are strictly abnormal, generically with respect to the cost. We then show how these results can be used to derive regularity results for the value function and in the theory of Hamilton-Jacobi equations, which in turn have applications for stabilization and motion planning, from both theoretical and implementational points of view.

Proceedings Article
08 Dec 2008
TL;DR: A metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account, is defined and it is proved that the difference in the optimal value function of different states can be upper-bounded by the value of this metric.
Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
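
The following heavily simplified Python sketch conveys the idea for a tiny deterministic MDP: a metric is iterated in which each action of one state is matched against the best action of the other, and the value-difference bound is checked numerically. The MDP, the weights (1-gamma and gamma), and the deterministic shortcut are all illustrative assumptions; the paper's metric handles stochastic transitions via Kantorovich distances and proves tighter bounds.

```python
# Simplified lax-bisimulation-style metric on a deterministic toy MDP,
# plus a numerical check that (1-gamma)|V*(s)-V*(t)| <= d(s,t).
import numpy as np

n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(3)
R = rng.uniform(0, 1, (n_s, n_a))                 # rewards (invented)
T = rng.integers(0, n_s, (n_s, n_a))              # deterministic next states

d = np.zeros((n_s, n_s))
for _ in range(200):
    m = np.zeros_like(d)
    for s in range(n_s):
        for t in range(n_s):
            m[s, t] = max(
                min((1 - gamma) * abs(R[s, a] - R[t, b]) + gamma * d[T[s, a], T[t, b]]
                    for b in range(n_a))
                for a in range(n_a))
    d = np.maximum(m, m.T)      # symmetrize: match actions in both directions

# Optimal values via value iteration, to check the bound
V = np.zeros(n_s)
for _ in range(2000):
    V = np.max(R + gamma * V[T], axis=1)
gap = (1 - gamma) * np.abs(V[:, None] - V[None, :])
print("bound holds:", bool(np.all(gap <= d + 1e-8)))
```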

Journal ArticleDOI
TL;DR: In this article, the authors deal with an endogenous growth model with vintage capital and more precisely with the AK model proposed in [R. Boucekkine, O. Licandro, L. Puch, F. del Rio, and L.A.

Journal ArticleDOI
TL;DR: The dynamic programming principle is given for this kind of optimal control problem and it is shown that the value function is the unique viscosity solution of the obstacle problem for the corresponding Hamilton-Jacobi-Bellman equation.
Abstract: In this paper, we study one kind of stochastic recursive optimal control problem with the obstacle constraint for the cost functional described by the solution of a reflected backward stochastic differential equation. We give the dynamic programming principle for this kind of optimal control problem and show that the value function is the unique viscosity solution of the obstacle problem for the corresponding Hamilton-Jacobi-Bellman equation.

Journal ArticleDOI
TL;DR: In this article, the authors prove the semiconcavity of the value function of an optimal control problem with end-point constraints for which all minimizing controls are supposed to be nonsingular.
Abstract: Semiconcavity results have generally been obtained for optimal control problems in absence of state constraints. In this paper, we prove the semiconcavity of the value function of an optimal control problem with end-point constraints for which all minimizing controls are supposed to be nonsingular.

Proceedings Article
13 Jul 2008
TL;DR: An exact dynamic programming update for constrained partially observable Markov decision processes (CPOMDPs) is described; it relies on implicit enumeration of the vectors in the piecewise linear value function and on pruning operations to obtain a minimal representation of the updated value function.
Abstract: We describe an exact dynamic programming update for constrained partially observable Markov decision processes (CPOMDPs). State-of-the-art exact solution of unconstrained POMDPs relies on implicit enumeration of the vectors in the piecewise linear value function, and pruning operations to obtain a minimal representation of the updated value function. In dynamic programming for CPOMDPs, each vector takes two valuations, one with respect to the objective function and another with respect to the constraint function. The dynamic programming update consists of finding, for each belief state, the vector that has the best objective function valuation while still satisfying the constraint function. Whereas the pruning operation in an unconstrained POMDP requires solution of a linear program, the pruning operation for CPOMDPs requires solution of a mixed integer linear program.
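
The per-belief selection step described above can be illustrated with a toy example (invented vectors and budget): each vector carries an objective valuation and a constraint valuation, and at a given belief we keep the feasible vector with the best objective value. The paper's pruning step additionally requires a mixed integer linear program, which is not shown here.

```python
# Toy sketch of the per-belief selection for a CPOMDP value function:
# alpha vectors value the objective, beta vectors value the constraint,
# and we pick the best feasible vector at a belief b.  All numbers invented.
import numpy as np

alphas = np.array([[1.0, 0.0], [0.2, 0.9], [0.6, 0.6]])   # objective vectors
betas  = np.array([[0.8, 0.1], [0.1, 0.2], [0.4, 0.5]])   # constraint vectors
budget = 0.35

def best_vector(b):
    obj = alphas @ b
    cons = betas @ b
    feasible = np.where(cons <= budget)[0]
    return None if feasible.size == 0 else int(feasible[np.argmax(obj[feasible])])

print(best_vector(np.array([0.5, 0.5])))   # index of the chosen vector
```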

Journal ArticleDOI
TL;DR: The possibility for the immediate one-impulse strategy to be nonoptimal while both growth functions are monotonic is a surprising result and is illustrated with the help of numerical simulations.
Abstract: We consider the optimal control problem of feeding in minimal time a tank where several species compete for a single resource, with the objective being to reach a given level of the resource. We allow controls to be bounded measurable functions of time plus possible impulses. For the one-species case, we show that the immediate one-impulse strategy (filling the whole reactor with one single impulse at the initial time) is optimal when the growth function is monotonic. For nonmonotonic growth functions with one maximum, we show that a particular singular arc strategy (precisely defined in section 3) is optimal. These results extend and improve former ones obtained for the class of measurable controls only. For the two-species case with monotonic growth functions, we give conditions under which the immediate one-impulse strategy is optimal. We also give optimality conditions for the singular arc strategy (at a level that depends on the initial condition) to be optimal. The possibility for the immediate one-impulse strategy to be nonoptimal while both growth functions are monotonic is a surprising result and is illustrated with the help of numerical simulations.

Posted Content
Marie-Amelie Morlais
TL;DR: To solve the problem of utility maximization in a financial market allowing jumps, this paper first proves existence and uniqueness results for the introduced BSDE, which allows an explicit expression of the value function and a characterization of optimal strategies for the problem.
Abstract: In this paper, we consider the classical problem of utility maximization in a financial market allowing jumps. Assuming that the constraint set is a compact set, rather than a convex one, we use a dynamic method from which we derive a specific BSDE. We then aim at showing existence and uniqueness results for the introduced BSDE. This allows us to give an explicit expression of the value function and characterize optimal strategies for our problem.

Journal ArticleDOI
01 Aug 2008
TL;DR: This paper combines three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function.
Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady-state policies for deterministic time-invariant discrete time control problems with continuous states and actions often found in robotics. In this paper, we describe our approach and provide initial results on several simulated robotics problems.

Journal ArticleDOI
TL;DR: In this paper, the authors studied two-period nonlinear optimization problems whose parameters are uncertain and showed that quasiconvexity of the optimal value function of certain subproblems is sufficient for reducibility of the resulting robust optimization problem to a single-level deterministic problem.
Abstract: We study two-period nonlinear optimization problems whose parameters are uncertain. We assume that uncertain parameters are revealed in stages and model them using the adjustable robust optimization approach. For problems with polytopic uncertainty, we show that quasiconvexity of the optimal value function of certain subproblems is sufficient for the reducibility of the resulting robust optimization problem to a single-level deterministic problem. We relate this sufficient condition to the cone-quasiconvexity of the feasible set mapping for adjustable variables and present several examples and applications satisfying these conditions.

Journal ArticleDOI
TL;DR: This paper defines the production-path property of an optimal solution for the stochastic uncapacitated lot-sizing model and uses this property to develop a backward dynamic programming recursion, which allows a full characterization of the optimal value function to be obtained by a dynamic programming algorithm in polynomial time.
Abstract: In 1958, Wagner and Whitin published a seminal paper on the deterministic uncapacitated lot-sizing problem, a fundamental model that is embedded in many practical production planning problems. In this paper, we consider a basic version of this model in which problem parameters are stochastic: the stochastic uncapacitated lot-sizing problem. We define the production-path property of an optimal solution for our model and use this property to develop a backward dynamic programming recursion. This approach allows us to show that the value function is piecewise linear and right continuous. We then use these results to show that a full characterization of the optimal value function can be obtained by a dynamic programming algorithm in polynomial time for the case that each nonleaf node contains at least two children. Moreover, we show that our approach leads to a polynomial-time algorithm to obtain an optimal solution to any instance of the stochastic uncapacitated lot-sizing problem, regardless of the structur...

Proceedings ArticleDOI
18 Aug 2008
TL;DR: This paper builds upon existing optimization strategies to present an alternative hybrid variant of differential dynamic programming for robust low-thrust optimization that uses first- and second-order state transition matrices to take advantage of an efficient discretization scheme and obtain the partial derivatives needed to perform the minimization.
Abstract: Low-thrust propulsion is becoming increasingly considered for future space missions, but optimization of the resulting trajectories is very challenging. To solve such complex problems, differential dynamic programming is a proven technique based on Bellman’s Principle of Optimality and successive minimization of quadratic approximations. In this paper, we build upon previous and existing optimization strategies to present an alternative hybrid variant of differential dynamic programming for robust low-thrust optimization. It uses first- and second-order state transition matrices to take advantage of an efficient discretization scheme and obtain the partial derivatives needed to perform the minimization. Unlike the traditional formulation, the state transition approach provides valuable constraint sensitivities and furthermore is naturally amenable to parallel computation. The method includes also a smoothing strategy to improve robustness of convergence when starting far from the optimum, as well as the capability to handle efficiently both soft and hard constraints. Procedures to drastically reduce the computation cost are mentioned. Preliminary numerical results are presented and compared to existing algorithms to illustrate the performance and the accuracy of our approach.
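
As background, a single differential-dynamic-programming sweep (backward pass with quadratic value expansions, then a forward rollout) can be sketched for a toy scalar problem as below; the dynamics, costs, and horizon are invented, and the paper's hybrid variant differs by using state transition matrices, smoothing, and constraint handling.

```python
# Minimal sketch of one DDP sweep on a toy scalar problem (invented data).
import numpy as np

dt, N = 0.1, 50
def f(x, u):   return x + dt * (u - 0.1 * x)        # toy dynamics
fx, fu = 1.0 - 0.1 * dt, dt                         # its derivatives
def ell(x, u): return dt * (x * x + u * u)          # running cost
def ellf(x):   return 10.0 * x * x                  # terminal cost

x0 = 1.0
xs = np.empty(N + 1); us = np.zeros(N)
xs[0] = x0
for k in range(N):                                   # nominal rollout
    xs[k + 1] = f(xs[k], us[k])

# Backward pass: expand Q(x,u) to second order around the nominal trajectory
Vx, Vxx = 20.0 * xs[N], 20.0
ks, Ks = np.zeros(N), np.zeros(N)
for k in reversed(range(N)):
    Qx  = 2 * dt * xs[k] + fx * Vx
    Qu  = 2 * dt * us[k] + fu * Vx
    Qxx = 2 * dt + fx * Vxx * fx
    Quu = 2 * dt + fu * Vxx * fu
    Qux = fu * Vxx * fx
    ks[k], Ks[k] = -Qu / Quu, -Qux / Quu             # control corrections
    Vx  = Qx + Ks[k] * Quu * ks[k] + Ks[k] * Qu + Qux * ks[k]
    Vxx = Qxx + Ks[k] * Quu * Ks[k] + 2 * Ks[k] * Qux

# Forward rollout with the updated feedback policy
xn = np.empty(N + 1); un = np.empty(N); xn[0] = x0
for k in range(N):
    un[k] = us[k] + ks[k] + Ks[k] * (xn[k] - xs[k])
    xn[k + 1] = f(xn[k], un[k])
print("cost after one sweep:", sum(ell(x, u) for x, u in zip(xn[:-1], un)) + ellf(xn[-1]))
```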

Journal ArticleDOI
TL;DR: A game-theory-based approach to multi-target searching with a multi-robot system in a dynamic environment is proposed; its main advantage lies in its real-time capabilities while remaining efficient and robust in dynamic environments.
Abstract: This paper proposes a game-theory-based approach to multi-target searching using a multi-robot system in a dynamic environment. It is assumed that a rough a priori probability map of the targets' distribution within the environment is given. To consider the interaction between the robots, a dynamic-programming equation is proposed to estimate the utility function for each robot. Based on this utility function, a cooperative nonzero-sum game is generated, where both pure Nash equilibrium and mixed-strategy equilibrium solutions are presented to achieve optimal overall robot behavior. Special consideration has been given to improving the real-time performance of the game-theory-based approach. Several mechanisms, such as event-driven discretization, one-step dynamic programming, and a decision buffer, have been proposed to reduce the computational complexity. The main advantage of the algorithm lies in its real-time capabilities while being efficient and robust to dynamic environments.
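
The game-theoretic decision step can be illustrated with a toy bimatrix game (made-up payoffs standing in for the utility estimates): each robot picks a search region and pure-strategy Nash equilibria are found by checking best responses. The paper additionally computes mixed-strategy equilibria and embeds this step in the dynamic-programming utility estimation.

```python
# Toy sketch: pure-strategy Nash equilibria of a nonzero-sum bimatrix game.
# Rows = robot 1's choice of region, columns = robot 2's choice (invented payoffs).
import numpy as np

U1 = np.array([[1.0, 3.0], [2.0, 1.0]])
U2 = np.array([[1.0, 2.0], [3.0, 1.0]])

def pure_nash(U1, U2):
    eqs = []
    for i in range(U1.shape[0]):
        for j in range(U1.shape[1]):
            # (i, j) is an equilibrium if neither robot can gain by deviating
            if U1[i, j] >= U1[:, j].max() and U2[i, j] >= U2[i, :].max():
                eqs.append((i, j))
    return eqs

print(pure_nash(U1, U2))   # [(0, 1), (1, 0)]: the robots split the regions
```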

Proceedings ArticleDOI
09 Dec 2008
TL;DR: In this paper, occupation measures are used to approximate pointwise the optimal value function of a given OCP, using a hierarchy of linear matrix inequality (LMI) relaxations, and an almost optimal control law is derived.
Abstract: We consider nonlinear optimal control problems (OCPs) for which all problem data are polynomial. In the first part of the paper, we review how occupation measures can be used to approximate pointwise the optimal value function of a given OCP, using a hierarchy of linear matrix inequality (LMI) relaxations. In the second part, we extend the methodology to approximate the optimal value function on a given set and we use such a function to constructively and computationally derive an almost optimal control law. Numerical examples show the effectiveness of the approach.

Journal ArticleDOI
TL;DR: An Approximate Dynamic Programming scheme that efficiently solves the optimal power split between the internal combustion engine and the electric machine in parallel hybrid powertrains is presented.