
Showing papers on "Bellman equation published in 2012"


Book
27 Sep 2012
TL;DR: In this book, the authors develop stochastic control and optimal stopping via dynamic programming, covering the solution of control problems by verification, the dynamic programming equation in the viscosity sense, stochastic target problems, backward SDEs, and probabilistic numerical methods for nonlinear PDEs.
Abstract: Preface.- 1. Conditional Expectation and Linear Parabolic PDEs.- 2. Stochastic Control and Dynamic Programming.- 3. Optimal Stopping and Dynamic Programming.- 4. Solving Control Problems by Verification.- 5. Introduction to Viscosity Solutions.- 6. Dynamic Programming Equation in the Viscosity Sense.- 7. Stochastic Target Problems.- 8. Second Order Stochastic Target Problems.- 9. Backward SDEs and Stochastic Control.- 10. Quadratic Backward SDEs.- 11. Probabilistic Numerical Methods for Nonlinear PDEs.- 12. Introduction to Finite Differences Methods.- References.

244 citations


Journal ArticleDOI
TL;DR: In this article, the authors derive numerical theory results characterizing the properties of the nested fixed point algorithm used to evaluate the objective function of BLP's estimator, and recast estimation as a mathematical program with equilibrium constraints.
Abstract: The widely used estimator of Berry, Levinsohn, and Pakes (1995) produces estimates of consumer preferences from a discrete-choice demand model with random coefficients, market-level demand shocks, and endogenous prices. We derive numerical theory results characterizing the properties of the nested fixed point algorithm used to evaluate the objective function of BLP’s estimator. We discuss problems with typical implementations, including cases that can lead to incorrect parameter estimates. As a solution, we recast estimation as a mathematical program with equilibrium constraints, which can be faster and which avoids the numerical issues associated with nested inner loops. The advantages are even more pronounced for forward-looking demand models where the Bellman equation must also be solved repeatedly. Several Monte Carlo and real-data experiments support our numerical concerns about the nested fixed point approach and the advantages of constrained optimization. For static BLP, the constrained optimization approach can be as much as ten to forty times faster for large-dimensional problems with many markets.
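
For intuition, here is a minimal sketch (Python/NumPy; the share-prediction routine, tolerance and iteration cap are illustrative assumptions) of the inner-loop contraction that the nested fixed point approach iterates and that the MPEC reformulation avoids:

    import numpy as np

    def blp_inner_loop(delta0, shares_obs, predict_shares, tol=1e-12, max_iter=10000):
        """BLP-style contraction: delta <- delta + log(s_obs) - log(s_pred).
        predict_shares(delta) is assumed to return the model-implied market
        shares for mean utilities delta (the random-coefficients integral
        is abstracted away here)."""
        delta = np.asarray(delta0, dtype=float).copy()
        for _ in range(max_iter):
            delta_new = delta + np.log(shares_obs) - np.log(predict_shares(delta))
            if np.max(np.abs(delta_new - delta)) < tol:
                return delta_new
            delta = delta_new
        return delta  # may not have met the requested tolerance

A loose inner-loop tolerance here is exactly the kind of implementation detail the paper argues can corrupt the outer parameter estimates; the MPEC formulation instead imposes the share equations as constraints of a single optimization problem.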

212 citations


Posted Content
TL;DR: In this paper, the authors present metrics for measuring the similarity of states in a finite Markov decision process (MDP) based on the notion of bisimulation, with an aim towards solving discounted infinite horizon reinforcement learning tasks.
Abstract: We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
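
For a finite MDP, a metric of this kind can be computed by fixed-point iteration. The sketch below (Python/NumPy with SciPy's linear-programming solver) uses one common form with a reward term and a Kantorovich term; the weight c and the scaling of rewards to [0, 1] are illustrative assumptions rather than the paper's exact construction.

    import numpy as np
    from scipy.optimize import linprog

    def kantorovich(p, q, d):
        """1-Wasserstein distance between finite distributions p and q with
        ground metric d, posed as a small linear program over couplings."""
        n = len(p)
        A_eq = np.zeros((2 * n, n * n))
        for i in range(n):
            A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals: sum_j g[i, j] = p[i]
        for j in range(n):
            A_eq[n + j, j::n] = 1.0            # column marginals: sum_i g[i, j] = q[j]
        res = linprog(d.reshape(-1), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                      bounds=(0, None), method="highs")
        return res.fun

    def bisimulation_metric(R, P, c=0.5, iters=50):
        """Iterate d(s,t) = max_a [ c*|R(s,a)-R(t,a)|
        + (1-c)*Kantorovich_d(P(.|s,a), P(.|t,a)) ] toward its fixed point.
        R[a, s] holds rewards scaled to [0, 1]; P[a, s, :] transition rows."""
        nA, nS = R.shape
        d = np.zeros((nS, nS))
        for _ in range(iters):
            d_new = np.zeros_like(d)
            for s in range(nS):
                for t in range(s + 1, nS):
                    val = max(c * abs(R[a, s] - R[a, t])
                              + (1 - c) * kantorovich(P[a, s], P[a, t], d)
                              for a in range(nA))
                    d_new[s, t] = d_new[t, s] = val
            d = d_new
        return d

States with small metric distance can then be aggregated, which is one of the uses listed in the abstract.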

201 citations


Book
14 Dec 2012
TL;DR: Adaptive Dynamic Programming in Discrete Time, as discussed by the authors, applies adaptive dynamic programming (ADP) to optimal control of discrete-time nonlinear systems, covering stabilization, tracking and games.
Abstract: There are many methods of stable controller design for nonlinear systems. In seeking to go beyond the minimum requirement of stability, Adaptive Dynamic Programming in Discrete Time approaches the challenging topic of optimal control for nonlinear systems using the tools of adaptive dynamic programming (ADP). The range of systems treated is extensive; affine, switched, singularly perturbed and time-delay nonlinear systems are discussed as are the uses of neural networks and techniques of value and policy iteration. The text features three main aspects of ADP in which the methods proposed for stabilization and for tracking and games benefit from the incorporation of optimal control methods: infinite-horizon control for which the difficulty of solving partial differential Hamilton-Jacobi-Bellman equations directly is overcome, and proof provided that the iterative value function updating sequence converges to the infimum of all the value functions obtained by admissible control law sequences; finite-horizon control, implemented in discrete-time nonlinear systems showing the reader how to obtain suboptimal control solutions within a fixed number of control steps and with results more easily applied in real systems than those usually gained from infinite-horizon control; nonlinear games for which a pair of mixed optimal policies are derived for solving games both when the saddle point does not exist, and, when it does, avoiding the existence conditions of the saddle point. Non-zero-sum games are studied in the context of a single network scheme in which policies are obtained guaranteeing system stability and minimizing the individual performance function yielding a Nash equilibrium. In order to make the coverage suitable for the student as well as for the expert reader, Adaptive Dynamic Programming in Discrete Time: establishes the fundamental theory involved clearly with each chapter devoted to a clearly identifiable control paradigm; demonstrates convergence proofs of the ADP algorithms to deepen understanding of the derivation of stability and convergence with the iterative computational methods used; and shows how ADP methods can be put to use both in simulation and in real applications. This text will be of considerable interest to researchers interested in optimal control and its applications in operations research, applied mathematics, computational intelligence and engineering. Graduate students working in control and operations research will also find the ideas presented here to be a source of powerful methods for furthering their study.

200 citations


Journal ArticleDOI
TL;DR: An online gaming algorithm based on policy iteration to solve the continuous-time (CT) two-player zero-sum game with infinite horizon cost for nonlinear systems with known dynamics is presented.
Abstract: SUMMARY The two-player zero-sum (ZS) game problem provides the solution to the bounded L2-gain problem and so is important for robust control. However, its solution depends on solving a design Hamilton–Jacobi–Isaacs (HJI) equation, which is generally intractable for nonlinear systems. In this paper, we present an online adaptive learning algorithm based on policy iteration to solve the continuous-time two-player ZS game with infinite horizon cost for nonlinear systems with known dynamics. That is, the algorithm learns online in real time an approximate local solution to the game HJI equation. This method finds, in real time, suitable approximations of the optimal value and the saddle point feedback control policy and disturbance policy, while also guaranteeing closed-loop stability. The adaptive algorithm is implemented as an actor/critic/disturbance structure that involves simultaneous continuous-time adaptation of critic, actor, and disturbance neural networks. We call this online gaming algorithm ‘synchronous’ ZS game policy iteration. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for critic, actor, and disturbance networks. The convergence to the optimal saddle point solution is proven, and stability of the system is also guaranteed. Simulation examples show the effectiveness of the new algorithm in solving the HJI equation online for a linear system and a complex nonlinear system. Copyright © 2011 John Wiley & Sons, Ltd.
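
The algorithm above learns the HJI solution online for nonlinear systems; purely as a reference point, here is an offline sketch of policy iteration for the linear-quadratic special case, where the HJI equation reduces to a game algebraic Riccati equation. The simultaneous update of both players, the zero initial policies and the choice of gamma are simplifying assumptions, not the paper's actor/critic/disturbance method.

    import numpy as np
    from scipy.linalg import solve_lyapunov

    def zs_game_policy_iteration(A, B1, B2, Q, R, gamma, iters=50):
        """LQ zero-sum game  dx/dt = A x + B1 u + B2 w  with cost rate
        x'Qx + u'Ru - gamma^2 w'w.  Returns the value matrix P together
        with the control gain K (u = -K x) and disturbance gain L (w = L x).
        Assumes A is Hurwitz so the zero initial policies are admissible,
        and gamma large enough for a stabilizing solution to exist."""
        n = A.shape[0]
        K = np.zeros((B1.shape[1], n))
        L = np.zeros((B2.shape[1], n))
        for _ in range(iters):
            Acl = A - B1 @ K + B2 @ L
            M = Q + K.T @ R @ K - gamma**2 * L.T @ L
            P = solve_lyapunov(Acl.T, -M)        # policy evaluation: Acl'P + P Acl = -M
            K = np.linalg.solve(R, B1.T @ P)     # policy improvement (control)
            L = (B2.T @ P) / gamma**2            # policy improvement (disturbance)
        return P, K, L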

145 citations


Journal ArticleDOI
TL;DR: In this paper, a general time-inconsistent optimal control problem is considered for stochastic differential equations with deterministic coefficients, and a Hamilton-Jacobi-Bellman (HJB) type equation is derived for the equilibrium value function of the problem.
Abstract: A general time-inconsistent optimal control problem is considered for stochastic differential equations with deterministic coefficients. Under suitable conditions, a Hamilton-Jacobi-Bellman type equation is derived for the equilibrium value function of the problem. Well-posedness of such an equation is studied, and time-consistent equilibrium strategies are constructed. As special cases, the linear-quadratic problem and a generalized Merton's portfolio problem are investigated.

145 citations


Journal ArticleDOI
TL;DR: The proposed Q-learning scheme evaluates the current value function and the improved control policy at the same time, and is proven stable and convergent to the LQ optimal solution, provided that the initial policy is stabilizing.
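
As a model-based reference point for what such LQ Q-learning schemes approximate from data, here is a sketch of classical policy iteration for discrete-time LQR; the requirement that K0 be stabilizing mirrors the TL;DR's assumption, while the scheme described above performs the corresponding updates without knowledge of (A, B).

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def lqr_policy_iteration(A, B, Q, R, K0, iters=30):
        """Policy iteration for  x_{k+1} = A x_k + B u_k  with cost
        sum_k x'Qx + u'Ru  and policy  u = -K x.  K0 must be stabilizing."""
        K = K0
        for _ in range(iters):
            Acl = A - B @ K
            # Policy evaluation:  P = Acl' P Acl + Q + K'RK
            P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
            # Policy improvement
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        return P, K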

144 citations


Posted Content
TL;DR: It is shown that under appropriate regularity conditions, the mean-field value of the stochastic differential game with exponentiated integral cost functional coincides with the value function satisfying a Hamilton-Jacobi-Bellman (HJB) equation with an additional quadratic term.
Abstract: In this paper, we study a class of risk-sensitive mean-field stochastic differential games. We show that under appropriate regularity conditions, the mean-field value of the stochastic differential game with exponentiated integral cost functional coincides with the value function described by a Hamilton-Jacobi-Bellman (HJB) equation with an additional quadratic term. We provide an explicit solution of the mean-field best response when the instantaneous cost functions are log-quadratic and the state dynamics are affine in the control. An equivalent mean-field risk-neutral problem is formulated and the corresponding mean-field equilibria are characterized in terms of backward-forward macroscopic McKean-Vlasov equations, Fokker-Planck-Kolmogorov equations, and HJB equations. We provide numerical examples on the mean field behavior to illustrate both linear and McKean-Vlasov dynamics.
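
Schematically, and with notation that is ours rather than the paper's (risk-sensitivity index \delta, diffusion coefficient \sigma, mean-field term m_t inside the Hamiltonian), the additional quadratic term mentioned above enters an HJB equation of roughly the form

    \partial_t v(t,x) + H\big(x, m_t, \nabla v(t,x)\big)
      + \tfrac{1}{2}\,\mathrm{tr}\!\big(\sigma\sigma^{\top}\nabla^2 v(t,x)\big)
      + \tfrac{1}{2\delta}\,\big|\sigma^{\top}\nabla v(t,x)\big|^{2} = 0,
    \qquad v(T,x) = g(x, m_T),

where the gradient-quadratic term is what distinguishes the risk-sensitive equation from its risk-neutral counterpart.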

126 citations


Journal Article
TL;DR: A performance bound is reported for the widely used least-squares policy iteration (LSPI) algorithm, built on a bound for the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function.
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
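
A minimal sketch of the LSTD estimator analyzed above (Python/NumPy; the ridge term and the feature map are illustrative assumptions):

    import numpy as np

    def lstd(transitions, phi, gamma, reg=1e-6):
        """Least-squares temporal-difference policy evaluation.
        transitions : list of (s, r, s_next) tuples generated by following
                      the fixed policy being evaluated
        phi         : feature map, phi(s) -> 1-D array of length d
        Returns theta with V(s) ~= phi(s) @ theta."""
        d = len(phi(transitions[0][0]))
        A = reg * np.eye(d)          # small ridge term for numerical stability
        b = np.zeros(d)
        for s, r, s_next in transitions:
            f, f_next = phi(s), phi(s_next)
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return np.linalg.solve(A, b)

LSPI repeats this evaluation step on state-action features (LSTD-Q) and then acts greedily with respect to the resulting Q estimate; the bound in the paper tracks how the evaluation error at each such step propagates through the policy-iteration loop.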

118 citations


Journal ArticleDOI
TL;DR: In this article, a novel adaptive dynamic programming scheme based on general value iteration (VI) was proposed to obtain near optimal control for discrete-time affine non-linear systems with continuous state and control spaces.
Abstract: In this study, the authors propose a novel adaptive dynamic programming scheme based on general value iteration (VI) to obtain near optimal control for discrete-time affine non-linear systems with continuous state and control spaces. First, the selection of initial value function is different from the traditional VI, and a new method is introduced to demonstrate the convergence property and convergence speed of value function. Then, the control law obtained at each iteration can stabilise the system under some conditions. At last, an error-bound-based condition is derived considering the approximation errors of neural networks, and then the error between the optimal and approximated value functions can also be estimated. To facilitate the implementation of the iterative scheme, three neural networks with Levenberg-Marquardt training algorithm are used to approximate the unknown system, the value function and the control law. Two simulation examples are presented to demonstrate the effectiveness of the proposed scheme.
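
A minimal sketch of the iterative scheme described above, assuming a sampled set of training states, a finite control grid in place of the continuous minimisation, and a generic regression routine standing in for the paper's Levenberg-Marquardt-trained neural networks:

    import numpy as np

    def general_value_iteration(f, g, cost, states, controls, V0, fit, iters=50):
        """Value iteration for  x_{k+1} = f(x) + g(x) u.
        V0  : callable giving the initial value function (the 'general' VI
              of the paper allows a nonzero positive-semidefinite choice)
        fit : regression routine, fit(X, y) -> callable approximator."""
        V = V0
        for _ in range(iters):
            targets = []
            for x in states:
                q_values = [cost(x, u) + V(f(x) + g(x) @ u) for u in controls]
                targets.append(min(q_values))
            V = fit(states, np.array(targets))   # fit V_{i+1} on the backed-up targets
        return V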

109 citations


Journal ArticleDOI
TL;DR: In this article, the authors compare different solution methods for computing the equilibrium of dynamic stochastic general equilibrium (DSGE) models with recursive preferences, such as those in Epstein and Zin (1989, 1991), and stochastic volatility, and conclude that perturbation methods are an attractive approach for computing this class of problems.

Journal ArticleDOI
TL;DR: In this article, the authors considered the homogenization of Hamilton-Jacobi equations and degenerate Bellman equations in stationary, ergodic, unbounded environments, and showed that, as the microscopic scale tends to zero, the equation averages to a deterministic Hamilton-Jacobi equation, and studied some properties of the effective Hamiltonian.

Journal ArticleDOI
TL;DR: The pathwise optimization (PO) method is introduced, a new convex optimization procedure to produce upper and lower bounds on the optimal value (the “price”) of a high-dimensional optimal stopping problem and an approximation theory relevant to martingale duality approaches in general and the PO method in particular is developed.
Abstract: We introduce the pathwise optimization (PO) method, a new convex optimization procedure to produce upper and lower bounds on the optimal value (the “price”) of a high-dimensional optimal stopping problem. The PO method builds on a dual characterization of optimal stopping problems as optimization problems over the space of martingales, which we dub the martingale duality approach. We demonstrate via numerical experiments that the PO method produces upper bounds of a quality comparable with state-of-the-art approaches, but in a fraction of the time required for those approaches. As a by-product, it yields lower bounds (and suboptimal exercise policies) that are substantially superior to those produced by state-of-the-art methods. The PO method thus constitutes a practical and desirable approach to high-dimensional pricing problems. Furthermore, we develop an approximation theory relevant to martingale duality approaches in general and the PO method in particular. Our analysis provides a guarantee on the quality of upper bounds resulting from these approaches and identifies three key determinants of their performance: the quality of an input value function approximation, the square root of the effective time horizon of the problem, and a certain spectral measure of “predictability” of the underlying Markov chain. As a corollary to this analysis we develop approximation guarantees specific to the PO method. Finally, we view the PO method and several approximate dynamic programming methods for high-dimensional pricing problems through a common lens and in doing so show that the PO method dominates those alternatives. This paper was accepted by Wei Xiong, stochastic models and simulation.
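
The martingale duality underlying the PO method can be written compactly; with g_t the discounted exercise payoff and \mathcal{M}_0 the set of martingales started at zero (notation is ours, not the paper's):

    V_0 \;=\; \sup_{\tau}\,\mathbb{E}\big[g_\tau\big]
        \;=\; \inf_{M \in \mathcal{M}_0}\,\mathbb{E}\Big[\max_{0 \le t \le T}\big(g_t - M_t\big)\Big].

Any feasible martingale M yields an upper bound on the price; the PO method searches over martingales spanned by a parametrized family, which is what turns the computation of the bound into a convex optimization problem.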

Journal ArticleDOI
TL;DR: A delay-aware distributed solution with the BS-DTX control at the BS controller (BSC) and the user scheduling at each cluster manager (CM) using approximate dynamic programming and distributed stochastic learning is obtained and the proposed distributed two-timescale algorithm converges almost surely.
Abstract: In this paper, we propose a two-timescale delay-optimal base station discontinuous transmission (BS-DTX) control and user scheduling for downlink coordinated MIMO systems with energy harvesting capability. To reduce the complexity and signaling overhead in practical systems, the BS-DTX control is adaptive to both the energy state information (ESI) and the data queue state information (QSI) over a longer timescale. The user scheduling is adaptive to the ESI, the QSI and the channel state information (CSI) over a shorter timescale. We show that the two-timescale delay-optimal control problem can be modeled as an infinite horizon average cost partially observed Markov decision problem (POMDP), which is well known to be a difficult problem in general. By using sample-path analysis and exploiting specific problem structure, we first obtain some structural results on the optimal control policy and derive an equivalent Bellman equation with reduced state space. To reduce the complexity and facilitate distributed implementation, we obtain a delay-aware distributed solution with the BS-DTX control at the BS controller (BSC) and the user scheduling at each cluster manager (CM) using approximate dynamic programming and distributed stochastic learning. We show that the proposed distributed two-timescale algorithm converges almost surely. Furthermore, using queueing theory, stochastic geometry, and optimization techniques, we derive sufficient conditions for the data queues to be stable in the coordinated MIMO network and discuss various design insights.

Journal ArticleDOI
TL;DR: It is shown that the optimal reward of such a Markov decision process, which satisfies a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov decision process.
Abstract: We study the convergence of Markov decision processes, composed of a large number of objects, to optimization problems on ordinary differential equations. We show that the optimal reward of such a Markov decision process, which satisfies a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov decision process. We give bounds on the difference of the rewards and an algorithm for deriving an approximating solution to the Markov decision process from a solution of the HJB equations. We illustrate the method on three examples pertaining, respectively, to investment strategies, population dynamics control and scheduling in queues. They are used to illustrate and justify the construction of the controlled ODE and to show the advantage of solving a continuous HJB equation rather than a large discrete Bellman equation.

Posted Content
TL;DR: In this article, the state space is dynamically partitioned into regions where the value function is the same throughout the region, where the state variables can be expressed by piecewise constant representations.
Abstract: We describe an approach for exploiting structure in Markov Decision Processes with continuous state variables. At each step of the dynamic programming, the state space is dynamically partitioned into regions where the value function is the same throughout the region. We first describe the algorithm for piecewise constant representations. We then extend it to piecewise linear representations, using techniques from POMDPs to represent and reason about linear surfaces efficiently. We show that for complex, structured problems, our approach exploits the natural structure so that optimal solutions can be computed efficiently.

Posted Content
TL;DR: In this paper, a nonparametric approach is proposed to learn and represent transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation.
Abstract: We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as \emph{embeddings} in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
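
A minimal sketch of the conditional-mean-embedding idea for policy evaluation (Python/NumPy; the Gaussian kernel, the regularisation lam, the single-action setting and the lack of weight normalisation are simplifying assumptions, and the paper gives the conditions under which such iterations converge):

    import numpy as np

    def rbf(x, y, sigma=1.0):
        return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

    def embedding_weights(S, s_query, lam=1e-3, sigma=1.0):
        """Weights alpha(s) such that E[h(S') | S = s] ~= alpha(s) @ h(S'_samples),
        from the empirical RKHS conditional-mean embedding of transition
        samples S -> S'."""
        n = len(S)
        K = np.array([[rbf(si, sj, sigma) for sj in S] for si in S])
        k = np.array([rbf(si, s_query, sigma) for si in S])
        return np.linalg.solve(K + n * lam * np.eye(n), k)

    def kernel_policy_evaluation(S, S_next, reward, gamma=0.95, iters=200, lam=1e-3):
        """Value estimates at the sampled next states for a fixed policy:
        V <- r + gamma * W V, where row i of W holds the embedding weights
        queried at S_next[i]."""
        W = np.stack([embedding_weights(S, s, lam) for s in S_next])
        r = np.array([reward(s) for s in S_next])
        V = np.zeros(len(S))
        for _ in range(iters):
            V = r + gamma * W @ V
        return V

Expectations over next states are thus reduced to inner products with sample-based weights, which is the linear-complexity property highlighted in the abstract.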

Journal ArticleDOI
TL;DR: A methodology is introduced to design a dynamic controller that achieves L2-disturbance attenuation or approximate optimality, together with asymptotic stability.
Abstract: The solution of most nonlinear control problems hinges upon the solvability of partial differential equations or inequalities. In particular, disturbance attenuation and optimal control problems for nonlinear systems are generally solved exploiting the solution of the so-called Hamilton-Jacobi (HJ) inequality and the Hamilton-Jacobi-Bellman (HJB) equation, respectively. An explicit closed-form solution of this inequality, or equation, may however be hard or impossible to find in practical situations. Herein we introduce a methodology to circumvent this issue for input-affine nonlinear systems proposing a dynamic, i.e., time-varying, approximate solution of the HJ inequality and of the HJB equation the construction of which does not require solving any partial differential equation or inequality. This is achieved considering the immersion of the underlying nonlinear system into an augmented system defined on an extended state-space in which a (locally) positive definite storage function, or value function, can be explicitly constructed. The result is a methodology to design a dynamic controller to achieve L2-disturbance attenuation or approximate optimality, with asymptotic stability.

Journal ArticleDOI
TL;DR: This work considers differentiability of the value function of an optimal control problem for a non-autonomous switched system with respect to the switch times, and provides a method to compute the derivative of the cost function given a nominal input.

Journal ArticleDOI
TL;DR: A new algorithm for the solution of Hamilton-Jacobi-Bellman equations related to optimal control problems is presented that has the advantage that every subdomain is invariant with respect to the optimal dynamics, and then the solution can be computed independently in each subdomain.
Abstract: In this paper we present a new algorithm for the solution of Hamilton-Jacobi-Bellman equations related to optimal control problems. The key idea is to divide the domain of computation into subdomains which are shaped by the optimal dynamics of the underlying control problem. This can result in a rather complex geometrical subdivision, but it has the advantage that every subdomain is invariant with respect to the optimal dynamics, and then the solution can be computed independently in each subdomain. The features of this dynamics-dependent domain decomposition can be exploited to speed up the computation and for an efficient parallelization, since the classical transmission conditions at the boundaries of the subdomains can be avoided. For their properties, the subdomains are patches in the sense introduced by Ancona and Bressan [ESAIM Control Optim. Calc. Var., 4 (1999), pp. 445-471]. Several examples in two and three dimensions illustrate the properties of the new method.

Journal ArticleDOI
TL;DR: A state and action discretization procedure for approximating the optimal value function and an optimal policy of the original control model is proposed and explicit bounds on the approximation errors are provided.

Proceedings Article
26 Jun 2012
TL;DR: A new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation, makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS).
Abstract: We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.

Posted Content
TL;DR: In this article, the authors investigate value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case.
Abstract: This paper investigates value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case. We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games. We present a generalization of the optimal stopping problem to a two-player simultaneous move Markov game. For this special problem, we provide stronger bounds and can guarantee convergence for LSTD and temporal difference learning with linear value function approximation. We demonstrate the viability of value function approximation for Markov games by using the Least squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem.

Proceedings ArticleDOI
27 Jun 2012
TL;DR: This paper presents explicit solutions for a class of decentralized LQG problems in which players communicate their states with delays using a method for decomposing the Bellman equation into a hierarchy of independent subproblems.
Abstract: This paper presents explicit solutions for a class of decentralized LQG problems in which players communicate their states with delays. A method for decomposing the Bellman equation into a hierarchy of independent subproblems is introduced. Using this decomposition, all of the gains for the optimal controller are computed from the solution of a single algebraic Riccati equation.
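
The centralized building block referred to above is a standard algebraic Riccati solve; a minimal discrete-time sketch is below (the hierarchical decomposition of the Bellman equation into delayed-information subproblems is the paper's contribution and is not reproduced here).

    import numpy as np
    from scipy.linalg import solve_discrete_are

    def lqr_gain(A, B, Q, R):
        """State-feedback gain K (u = -K x) obtained from one discrete
        algebraic Riccati equation; per the abstract, all gains of the
        optimal delayed-sharing controller come from a single such solve."""
        P = solve_discrete_are(A, B, Q, R)
        return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)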

Proceedings Article
22 Jul 2012
TL;DR: This work introduces a more general and richer dual optimization criterion, which minimizes the average (undiscounted) cost of only paths leading to the goal among all policies that maximize the probability to reach the goal.
Abstract: Optimal solutions to Stochastic Shortest Path Problems (SSPs) usually require that there exists at least one policy that reaches the goal with probability 1 from the initial state. This condition is very strong and prevents solving many interesting problems, for instance where all possible policies reach some dead-end states with a positive probability. We introduce a more general and richer dual optimization criterion, which minimizes the average (undiscounted) cost of only paths leading to the goal among all policies that maximize the probability to reach the goal. We present policy update equations in the form of dynamic programming for this new dual criterion, which are different from the standard Bellman equations. We demonstrate that our equations converge in infinite horizon without any condition on the structure of the problem or on its policies, which actually extends the class of SSPs that can be solved. We experimentally show that our dual criterion provides well-founded solutions to SSPs that cannot be solved by the standard criterion, and that using a discount factor with the latter certainly provides solution policies, but these are not optimal considering our well-founded criterion.
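
A tabular sketch of the dual criterion described above, assuming strictly positive costs and absorbing goal states; the reweighting of transitions by goal-reaching probabilities is what makes the second step differ from a standard Bellman backup. This is a simplification of the paper's update equations, which handle convergence and tie-breaking more carefully.

    import numpy as np

    def dual_criterion_dp(P, cost, goal, iters=2000, tol=1e-9):
        """P[a, s, :] transition probabilities, cost[a, s] > 0, goal is a
        boolean mask of absorbing goal states.
        Step 1 maximises the probability of reaching the goal; step 2,
        among probability-maximising actions, minimises the expected cost
        averaged over goal-reaching paths only."""
        nA, nS, _ = P.shape
        # Step 1: maximum goal-reaching probability p*.
        p = goal.astype(float)
        for _ in range(iters):
            p = np.where(goal, 1.0, np.max(P @ p, axis=0))
        # Step 2: conditional expected cost under the dual criterion.
        J = np.zeros(nS)
        for _ in range(iters):
            J_new = np.zeros(nS)
            for s in range(nS):
                if goal[s] or p[s] == 0.0:      # goals and dead ends contribute no cost
                    continue
                best = np.inf
                for a in range(nA):
                    if P[a, s] @ p < p[s] - tol:
                        continue                # action does not maximise goal probability
                    w = P[a, s] * p / p[s]      # transitions reweighted by p*
                    best = min(best, cost[a, s] + w @ J)
                J_new[s] = best
            J = J_new
        return p, J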

Journal ArticleDOI
TL;DR: This paper presents a stochastic dynamic programming formulation of the Dynamic Integrated Model of Climate and the Economy (DICE), and the application of approximate dynamic programming techniques to numerically solve for the optimal policy under uncertain and decision-dependent technological change in a multi-stage setting.
Abstract: Analyses of global climate policy as a sequential decision under uncertainty have been severely restricted by dimensionality and computational burdens. Therefore, they have limited the number of decision stages, discrete actions, or number and type of uncertainties considered. In particular, two common simplifications are the use of two-stage models to approximate a multi-stage problem and exogenous formulations for inherently endogenous or decision-dependent uncertainties (in which the shock at time t+1 depends on the decision made at time t). In this paper, we present a stochastic dynamic programming formulation of the Dynamic Integrated Model of Climate and the Economy (DICE), and the application of approximate dynamic programming techniques to numerically solve for the optimal policy under uncertain and decision-dependent technological change in a multi-stage setting. We compare numerical results using two alternative value function approximation approaches, one parametric and one non-parametric. We show that increasing the variance of a symmetric mean-preserving uncertainty in abatement costs leads to higher optimal first-stage emission controls, but the effect is negligible when the uncertainty is exogenous. In contrast, the impact of decision-dependent cost uncertainty, a crude approximation of technology R&D, on optimal control is much larger, leading to higher control rates (lower emissions). Further, we demonstrate that the magnitude of this effect grows with the number of decision stages represented, suggesting that for decision-dependent phenomena, the conventional two-stage approximation will lead to an underestimate of the effect of uncertainty.

Proceedings ArticleDOI
14 May 2012
TL;DR: The proposed incremental Markov Decision Process (iMDP) provides an anytime approach to the computation of optimal control policies of the continuous problem and is demonstrated on motion planning and control problems in cluttered environments in the presence of process noise.
Abstract: In this paper, we consider a class of continuous-time, continuous-space stochastic optimal control problems. Building upon recent advances in Markov chain approximation methods and sampling-based algorithms for deterministic path planning, we propose a novel algorithm called the incremental Markov Decision Process (iMDP) to compute incrementally control policies that approximate arbitrarily well an optimal policy in terms of the expected cost. The main idea behind the algorithm is to generate a sequence of finite discretizations of the original problem through random sampling of the state space. At each iteration, the discretized problem is a Markov Decision Process that serves as an incrementally refined model of the original problem. We show that with probability one, (i) the sequence of the optimal value functions for each of the discretized problems converges uniformly to the optimal value function of the original stochastic optimal control problem, and (ii) the original optimal value function can be computed efficiently in an incremental manner using asynchronous value iterations. Thus, the proposed algorithm provides an anytime approach to the computation of optimal control policies of the continuous problem. The effectiveness of the proposed approach is demonstrated on motion planning and control problems in cluttered environments in the presence of process noise.

Journal ArticleDOI
TL;DR: A new method to approximate Markov perfect equilibrium in large-scale Ericson and Pakes (1995)-style dynamic oligopoly models that are not amenable to exact solution due to the curse of dimensionality is introduced.
Abstract: In this article, we introduce a new method to approximate Markov perfect equilibrium in large-scale Ericson and Pakes (1995)-style dynamic oligopoly models that are not amenable to exact solution due to the curse of dimensionality. The method is based on an algorithm that iterates an approximate best response operator using an approximate dynamic programming approach. The method, based on mathematical programming, approximates the value function with a linear combination of basis functions. We provide results that lend theoretical support to our approach. We introduce a rich yet tractable set of basis functions, and test our method on important classes of models. Our results suggest that the approach we propose significantly expands the set of dynamic oligopoly models that can be analyzed computationally.

Journal ArticleDOI
TL;DR: This work proposes three families of iterative strategies for solving the linearized discrete MFG systems, most of which involve suitable multigrid solvers or preconditioners.
Abstract: Mean field games (MFG) describe the asymptotic behavior of stochastic differential games in which the number of players tends to $+\infty$. Under suitable assumptions, they lead to a new kind of system of two partial differential equations: a forward Bellman equation coupled with a backward Fokker-Planck equation. In earlier articles, finite difference schemes preserving the structure of the system have been proposed and studied. They lead to large systems of nonlinear equations in finite dimension. A possible way of numerically solving the latter is to use inexact Newton methods: a Newton step consists of solving a linearized discrete MFG system. The forward-backward character of the MFG system makes it impossible to use time marching methods. In the present work, we propose three families of iterative strategies for solving the linearized discrete MFG systems, most of which involve suitable multigrid solvers or preconditioners.

Journal ArticleDOI
TL;DR: In this paper, a deterministic time-inconsistent optimal control problem is formulated for ordinary differential equations and a non-cooperative N-person differential game (but essentially cooperative in some sense) is introduced.
Abstract: A general deterministic time-inconsistent optimal control problem is formulated for ordinary differential equations. To find a time-consistent equilibrium value function and the corresponding time-consistent equilibrium control, a non-cooperative N-person differential game (but essentially cooperative in some sense) is introduced. Under certain conditions, it is proved that the open-loop Nash equilibrium value function of the N-person differential game converges to a time-consistent equilibrium value function of the original problem, which is the value function of a time-consistent optimal control problem. Moreover, it is proved that any optimal control of the time-consistent limit problem is a time-consistent equilibrium control of the original problem.