
Showing papers on "Bellman equation published in 2012"


Book
27 Sep 2012
TL;DR: In this book, the authors develop stochastic control and optimal stopping via dynamic programming, covering the solution of control problems by verification, the dynamic programming equation in the viscosity sense, stochastic target problems, backward SDEs, and probabilistic numerical methods for nonlinear PDEs.
Abstract: Preface.- 1. Conditional Expectation and Linear Parabolic PDEs.- 2. Stochastic Control and Dynamic Programming.- 3. Optimal Stopping and Dynamic Programming.- 4. Solving Control Problems by Verification.- 5. Introduction to Viscosity Solutions.- 6. Dynamic Programming Equation in the Viscosity Sense.- 7. Stochastic Target Problems.- 8. Second Order Stochastic Target Problems.- 9. Backward SDEs and Stochastic Control.- 10. Quadratic Backward SDEs.- 11. Probabilistic Numerical Methods for Nonlinear PDEs.- 12. Introduction to Finite Differences Methods.- References.

244 citations


Journal ArticleDOI
TL;DR: In this article, the authors derive numerical theory results characterizing the properties of the nested fixed point algorithm used to evaluate the objective function of BLP's estimator, and recast estimation as a mathematical program with equilibrium constraints.
Abstract: The widely used estimator of Berry, Levinsohn, and Pakes (1995) produces estimates of consumer preferences from a discrete-choice demand model with random coefficients, market-level demand shocks, and endogenous prices. We derive numerical theory results characterizing the properties of the nested fixed point algorithm used to evaluate the objective function of BLP’s estimator. We discuss problems with typical implementations, including cases that can lead to incorrect parameter estimates. As a solution, we recast estimation as a mathematical program with equilibrium constraints, which can be faster and which avoids the numerical issues associated with nested inner loops. The advantages are even more pronounced for forward-looking demand models where the Bellman equation must also be solved repeatedly. Several Monte Carlo and real-data experiments support our numerical concerns about the nested fixed point approach and the advantages of constrained optimization. For static BLP, the constrained optimization approach can be as much as ten to forty times faster for large-dimensional problems with many markets.
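
For intuition, here is a minimal sketch (Python/NumPy; the share-prediction routine, tolerance and iteration cap are illustrative assumptions) of the inner-loop contraction that the nested fixed point approach iterates and that the MPEC reformulation avoids:

    import numpy as np

    def blp_inner_loop(delta0, shares_obs, predict_shares, tol=1e-12, max_iter=10000):
        """BLP-style contraction: delta <- delta + log(s_obs) - log(s_pred).
        predict_shares(delta) is assumed to return the model-implied market
        shares for mean utilities delta (the random-coefficients integral
        is abstracted away here)."""
        delta = np.asarray(delta0, dtype=float).copy()
        for _ in range(max_iter):
            delta_new = delta + np.log(shares_obs) - np.log(predict_shares(delta))
            if np.max(np.abs(delta_new - delta)) < tol:
                return delta_new
            delta = delta_new
        return delta  # may not have met the requested tolerance

A loose inner-loop tolerance here is exactly the kind of implementation detail the paper argues can corrupt the outer parameter estimates; the MPEC formulation instead imposes the share equations as constraints of a single optimization problem.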

212 citations


Posted Content
TL;DR: In this paper, the authors present metrics for measuring the similarity of states in a finite Markov decision process (MDP) based on the notion of bisimulation, with an aim towards solving discounted infinite horizon reinforcement learning tasks.
Abstract: We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
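
For a finite MDP, a metric of this kind can be computed by fixed-point iteration. The sketch below (Python/NumPy with SciPy's linear-programming solver) uses one common form with a reward term and a Kantorovich term; the weight c and the scaling of rewards to [0, 1] are illustrative assumptions rather than the paper's exact construction.

    import numpy as np
    from scipy.optimize import linprog

    def kantorovich(p, q, d):
        """1-Wasserstein distance between finite distributions p and q with
        ground metric d, posed as a small linear program over couplings."""
        n = len(p)
        A_eq = np.zeros((2 * n, n * n))
        for i in range(n):
            A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals: sum_j g[i, j] = p[i]
        for j in range(n):
            A_eq[n + j, j::n] = 1.0            # column marginals: sum_i g[i, j] = q[j]
        res = linprog(d.reshape(-1), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                      bounds=(0, None), method="highs")
        return res.fun

    def bisimulation_metric(R, P, c=0.5, iters=50):
        """Iterate d(s,t) = max_a [ c*|R(s,a)-R(t,a)|
        + (1-c)*Kantorovich_d(P(.|s,a), P(.|t,a)) ] toward its fixed point.
        R[a, s] holds rewards scaled to [0, 1]; P[a, s, :] transition rows."""
        nA, nS = R.shape
        d = np.zeros((nS, nS))
        for _ in range(iters):
            d_new = np.zeros_like(d)
            for s in range(nS):
                for t in range(s + 1, nS):
                    val = max(c * abs(R[a, s] - R[a, t])
                              + (1 - c) * kantorovich(P[a, s], P[a, t], d)
                              for a in range(nA))
                    d_new[s, t] = d_new[t, s] = val
            d = d_new
        return d

States with small metric distance can then be aggregated, which is one of the uses listed in the abstract.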

201 citations


Book
14 Dec 2012
TL;DR: Adaptive Dynamic Programming in Discrete Time, as discussed by the authors, applies adaptive dynamic programming (ADP) to optimal control of discrete-time nonlinear systems, covering stabilization, tracking and games.
Abstract: There are many methods of stable controller design for nonlinear systems. In seeking to go beyond the minimum requirement of stability, Adaptive Dynamic Programming in Discrete Time approaches the challenging topic of optimal control for nonlinear systems using the tools of adaptive dynamic programming (ADP). The range of systems treated is extensive; affine, switched, singularly perturbed and time-delay nonlinear systems are discussed as are the uses of neural networks and techniques of value and policy iteration. The text features three main aspects of ADP in which the methods proposed for stabilization and for tracking and games benefit from the incorporation of optimal control methods: infinite-horizon control for which the difficulty of solving partial differential Hamilton-Jacobi-Bellman equations directly is overcome, and proof provided that the iterative value function updating sequence converges to the infimum of all the value functions obtained by admissible control law sequences; finite-horizon control, implemented in discrete-time nonlinear systems showing the reader how to obtain suboptimal control solutions within a fixed number of control steps and with results more easily applied in real systems than those usually gained from infinite-horizon control; nonlinear games for which a pair of mixed optimal policies are derived for solving games both when the saddle point does not exist, and, when it does, avoiding the existence conditions of the saddle point. Non-zero-sum games are studied in the context of a single network scheme in which policies are obtained guaranteeing system stability and minimizing the individual performance function yielding a Nash equilibrium. In order to make the coverage suitable for the student as well as for the expert reader, Adaptive Dynamic Programming in Discrete Time: establishes the fundamental theory involved clearly with each chapter devoted to a clearly identifiable control paradigm; demonstrates convergence proofs of the ADP algorithms to deepen understanding of the derivation of stability and convergence with the iterative computational methods used; and shows how ADP methods can be put to use both in simulation and in real applications. This text will be of considerable interest to researchers interested in optimal control and its applications in operations research, applied mathematics, computational intelligence and engineering. Graduate students working in control and operations research will also find the ideas presented here to be a source of powerful methods for furthering their study.

200 citations


Journal ArticleDOI
TL;DR: An online gaming algorithm based on policy iteration to solve the continuous-time (CT) two-player zero-sum game with infinite horizon cost for nonlinear systems with known dynamics is presented.
Abstract: SUMMARY The two-player zero-sum (ZS) game problem provides the solution to the bounded L2-gain problem and so is important for robust control. However, its solution depends on solving a design Hamilton–Jacobi–Isaacs (HJI) equation, which is generally intractable for nonlinear systems. In this paper, we present an online adaptive learning algorithm based on policy iteration to solve the continuous-time two-player ZS game with infinite horizon cost for nonlinear systems with known dynamics. That is, the algorithm learns online in real time an approximate local solution to the game HJI equation. This method finds, in real time, suitable approximations of the optimal value and the saddle point feedback control policy and disturbance policy, while also guaranteeing closed-loop stability. The adaptive algorithm is implemented as an actor/critic/disturbance structure that involves simultaneous continuous-time adaptation of critic, actor, and disturbance neural networks. We call this online gaming algorithm ‘synchronous’ ZS game policy iteration. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for critic, actor, and disturbance networks. The convergence to the optimal saddle point solution is proven, and stability of the system is also guaranteed. Simulation examples show the effectiveness of the new algorithm in solving the HJI equation online for a linear system and a complex nonlinear system. Copyright © 2011 John Wiley & Sons, Ltd.
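
The algorithm above learns the HJI solution online for nonlinear systems; purely as a reference point, here is an offline sketch of policy iteration for the linear-quadratic special case, where the HJI equation reduces to a game algebraic Riccati equation. The simultaneous update of both players, the zero initial policies and the choice of gamma are simplifying assumptions, not the paper's actor/critic/disturbance method.

    import numpy as np
    from scipy.linalg import solve_lyapunov

    def zs_game_policy_iteration(A, B1, B2, Q, R, gamma, iters=50):
        """LQ zero-sum game  dx/dt = A x + B1 u + B2 w  with cost rate
        x'Qx + u'Ru - gamma^2 w'w.  Returns the value matrix P together
        with the control gain K (u = -K x) and disturbance gain L (w = L x).
        Assumes A is Hurwitz so the zero initial policies are admissible,
        and gamma large enough for a stabilizing solution to exist."""
        n = A.shape[0]
        K = np.zeros((B1.shape[1], n))
        L = np.zeros((B2.shape[1], n))
        for _ in range(iters):
            Acl = A - B1 @ K + B2 @ L
            M = Q + K.T @ R @ K - gamma**2 * L.T @ L
            P = solve_lyapunov(Acl.T, -M)        # policy evaluation: Acl'P + P Acl = -M
            K = np.linalg.solve(R, B1.T @ P)     # policy improvement (control)
            L = (B2.T @ P) / gamma**2            # policy improvement (disturbance)
        return P, K, L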

145 citations


Journal ArticleDOI
TL;DR: In this paper, a general time-inconsistent optimal control problem is considered for stochastic differential equations with deterministic coefficients, and a Hamilton-Jacobi-Bellman (HJB) type equation is derived for the equilibrium value function of the problem.
Abstract: A general time-inconsistent optimal control problem is considered for stochastic differential equations with deterministic coefficients. Under suitable conditions, a Hamilton-Jacobi-Bellman type equation is derived for the equilibrium value function of the problem. Well-posedness of such an equation is studied, and time-consistent equilibrium strategies are constructed. As special cases, the linear-quadratic problem and a generalized Merton's portfolio problem are investigated.

145 citations


Journal ArticleDOI
TL;DR: The proposed Q-learning scheme evaluates the current value function and the improved control policy at the same time, and is proven stable and convergent to the LQ optimal solution, provided that the initial policy is stabilizing.
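
As a model-based reference point for what such LQ Q-learning schemes approximate from data, here is a sketch of classical policy iteration for discrete-time LQR; the requirement that K0 be stabilizing mirrors the TL;DR's assumption, while the scheme described above performs the corresponding updates without knowledge of (A, B).

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def lqr_policy_iteration(A, B, Q, R, K0, iters=30):
        """Policy iteration for  x_{k+1} = A x_k + B u_k  with cost
        sum_k x'Qx + u'Ru  and policy  u = -K x.  K0 must be stabilizing."""
        K = K0
        for _ in range(iters):
            Acl = A - B @ K
            # Policy evaluation:  P = Acl' P Acl + Q + K'RK
            P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
            # Policy improvement
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        return P, K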

144 citations


Posted Content
TL;DR: It is shown that under appropriate regularity conditions, the mean-field value of the stochastic differential game with exponentiated integral cost functional coincides with the value function satisfying a Hamilton-Jacobi-Bellman (HJB) equation with an additional quadratic term.
Abstract: In this paper, we study a class of risk-sensitive mean-field stochastic differential games. We show that under appropriate regularity conditions, the mean-field value of the stochastic differential game with exponentiated integral cost functional coincides with the value function described by a Hamilton-Jacobi-Bellman (HJB) equation with an additional quadratic term. We provide an explicit solution of the mean-field best response when the instantaneous cost functions are log-quadratic and the state dynamics are affine in the control. An equivalent mean-field risk-neutral problem is formulated and the corresponding mean-field equilibria are characterized in terms of backward-forward macroscopic McKean-Vlasov equations, Fokker-Planck-Kolmogorov equations, and HJB equations. We provide numerical examples on the mean field behavior to illustrate both linear and McKean-Vlasov dynamics.
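
Schematically, and with notation that is ours rather than the paper's (risk-sensitivity index \delta, diffusion coefficient \sigma, mean-field term m_t inside the Hamiltonian), the additional quadratic term mentioned above enters an HJB equation of roughly the form

    \partial_t v(t,x) + H\big(x, m_t, \nabla v(t,x)\big)
      + \tfrac{1}{2}\,\mathrm{tr}\!\big(\sigma\sigma^{\top}\nabla^2 v(t,x)\big)
      + \tfrac{1}{2\delta}\,\big|\sigma^{\top}\nabla v(t,x)\big|^{2} = 0,
    \qquad v(T,x) = g(x, m_T),

where the gradient-quadratic term is what distinguishes the risk-sensitive equation from its risk-neutral counterpart.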

126 citations


Journal Article
TL;DR: A performance bound is reported for the widely used least-squares policy iteration (LSPI) algorithm, built on a bound for the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function.
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
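
A minimal sketch of the LSTD estimator analyzed above (Python/NumPy; the ridge term and the feature map are illustrative assumptions):

    import numpy as np

    def lstd(transitions, phi, gamma, reg=1e-6):
        """Least-squares temporal-difference policy evaluation.
        transitions : list of (s, r, s_next) tuples generated by following
                      the fixed policy being evaluated
        phi         : feature map, phi(s) -> 1-D array of length d
        Returns theta with V(s) ~= phi(s) @ theta."""
        d = len(phi(transitions[0][0]))
        A = reg * np.eye(d)          # small ridge term for numerical stability
        b = np.zeros(d)
        for s, r, s_next in transitions:
            f, f_next = phi(s), phi(s_next)
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return np.linalg.solve(A, b)

LSPI repeats this evaluation step on state-action features (LSTD-Q) and then acts greedily with respect to the resulting Q estimate; the bound in the paper tracks how the evaluation error at each such step propagates through the policy-iteration loop.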

118 citations


Journal ArticleDOI
TL;DR: In this article, a novel adaptive dynamic programming scheme based on general value iteration (VI) was proposed to obtain near optimal control for discrete-time affine non-linear systems with continuous state and control spaces.
Abstract: In this study, the authors propose a novel adaptive dynamic programming scheme based on general value iteration (VI) to obtain near optimal control for discrete-time affine non-linear systems with continuous state and control spaces. First, the selection of initial value function is different from the traditional VI, and a new method is introduced to demonstrate the convergence property and convergence speed of value function. Then, the control law obtained at each iteration can stabilise the system under some conditions. At last, an error-bound-based condition is derived considering the approximation errors of neural networks, and then the error between the optimal and approximated value functions can also be estimated. To facilitate the implementation of the iterative scheme, three neural networks with Levenberg-Marquardt training algorithm are used to approximate the unknown system, the value function and the control law. Two simulation examples are presented to demonstrate the effectiveness of the proposed scheme.
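
A minimal sketch of the iterative scheme described above, assuming a sampled set of training states, a finite control grid in place of the continuous minimisation, and a generic regression routine standing in for the paper's Levenberg-Marquardt-trained neural networks:

    import numpy as np

    def general_value_iteration(f, g, cost, states, controls, V0, fit, iters=50):
        """Value iteration for  x_{k+1} = f(x) + g(x) u.
        V0  : callable giving the initial value function (the 'general' VI
              of the paper allows a nonzero positive-semidefinite choice)
        fit : regression routine, fit(X, y) -> callable approximator."""
        V = V0
        for _ in range(iters):
            targets = []
            for x in states:
                q_values = [cost(x, u) + V(f(x) + g(x) @ u) for u in controls]
                targets.append(min(q_values))
            V = fit(states, np.array(targets))   # fit V_{i+1} on the backed-up targets
        return V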

109 citations


Journal ArticleDOI
TL;DR: In this article, the authors compare different solution methods for computing the equilibrium of dynamic stochastic general equilibrium (DSGE) models with recursive preferences, such as those in Epstein and Zin (1989, 1991), and stochastic volatility, and conclude that perturbation methods are an attractive approach for computing this class of problems.

Journal ArticleDOI
TL;DR: In this article, the authors considered the homogenization of Hamilton-Jacobi equations and degenerate Bellman equations in stationary, ergodic, unbounded environments, and showed that, as the microscopic scale tends to zero, the equation averages to a deterministic Hamilton-Jacobi equation, and studied some properties of the effective Hamiltonian.

Journal ArticleDOI
TL;DR: The pathwise optimization (PO) method is introduced, a new convex optimization procedure to produce upper and lower bounds on the optimal value (the “price”) of a high-dimensional optimal stopping problem and an approximation theory relevant to martingale duality approaches in general and the PO method in particular is developed.
Abstract: We introduce the pathwise optimization (PO) method, a new convex optimization procedure to produce upper and lower bounds on the optimal value (the “price”) of a high-dimensional optimal stopping problem. The PO method builds on a dual characterization of optimal stopping problems as optimization problems over the space of martingales, which we dub the martingale duality approach. We demonstrate via numerical experiments that the PO method produces upper bounds of a quality comparable with state-of-the-art approaches, but in a fraction of the time required for those approaches. As a by-product, it yields lower bounds (and suboptimal exercise policies) that are substantially superior to those produced by state-of-the-art methods. The PO method thus constitutes a practical and desirable approach to high-dimensional pricing problems. Furthermore, we develop an approximation theory relevant to martingale duality approaches in general and the PO method in particular. Our analysis provides a guarantee on the quality of upper bounds resulting from these approaches and identifies three key determinants of their performance: the quality of an input value function approximation, the square root of the effective time horizon of the problem, and a certain spectral measure of “predictability” of the underlying Markov chain. As a corollary to this analysis we develop approximation guarantees specific to the PO method. Finally, we view the PO method and several approximate dynamic programming methods for high-dimensional pricing problems through a common lens and in doing so show that the PO method dominates those alternatives. This paper was accepted by Wei Xiong, stochastic models and simulation.
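
The martingale duality underlying the PO method can be written compactly; with g_t the discounted exercise payoff and \mathcal{M}_0 the set of martingales started at zero (notation is ours, not the paper's):

    V_0 \;=\; \sup_{\tau}\,\mathbb{E}\big[g_\tau\big]
        \;=\; \inf_{M \in \mathcal{M}_0}\,\mathbb{E}\Big[\max_{0 \le t \le T}\big(g_t - M_t\big)\Big].

Any feasible martingale M yields an upper bound on the price; the PO method searches over martingales spanned by a parametrized family, which is what turns the computation of the bound into a convex optimization problem.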

Journal ArticleDOI
TL;DR: A delay-aware distributed solution with the BS-DTX control at the BS controller (BSC) and the user scheduling at each cluster manager (CM) using approximate dynamic programming and distributed stochastic learning is obtained and the proposed distributed two-timescale algorithm converges almost surely.
Abstract: In this paper, we propose a two-timescale delay-optimal base station discontinuous transmission (BS-DTX) control and user scheduling for downlink coordinated MIMO systems with energy harvesting capability. To reduce the complexity and signaling overhead in practical systems, the BS-DTX control is adaptive to both the energy state information (ESI) and the data queue state information (QSI) over a longer timescale. The user scheduling is adaptive to the ESI, the QSI and the channel state information (CSI) over a shorter timescale. We show that the two-timescale delay-optimal control problem can be modeled as an infinite horizon average cost partially observed Markov decision problem (POMDP), which is well known to be a difficult problem in general. By using sample-path analysis and exploiting specific problem structure, we first obtain some structural results on the optimal control policy and derive an equivalent Bellman equation with reduced state space. To reduce the complexity and facilitate distributed implementation, we obtain a delay-aware distributed solution with the BS-DTX control at the BS controller (BSC) and the user scheduling at each cluster manager (CM) using approximate dynamic programming and distributed stochastic learning. We show that the proposed distributed two-timescale algorithm converges almost surely. Furthermore, using queueing theory, stochastic geometry, and optimization techniques, we derive sufficient conditions for the data queues to be stable in the coordinated MIMO network and discuss various design insights.

Journal ArticleDOI
TL;DR: It is shown that the optimal reward of such a Markov decision process, which satisfies a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov decision process.
Abstract: We study the convergence of Markov decision processes, composed of a large number of objects, to optimization problems on ordinary differential equations. We show that the optimal reward of such a Markov decision process, which satisfies a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov decision process. We give bounds on the difference of the rewards and an algorithm for deriving an approximating solution to the Markov decision process from a solution of the HJB equations. We illustrate the method on three examples pertaining, respectively, to investment strategies, population dynamics control and scheduling in queues. They are used to illustrate and justify the construction of the controlled ODE and to show the advantage of solving a continuous HJB equation rather than a large discrete Bellman equation.

Posted Content
TL;DR: In this article, the state space is dynamically partitioned into regions where the value function is the same throughout the region, where the state variables can be expressed by piecewise constant representations.
Abstract: We describe an approach for exploiting structure in Markov Decision Processes with continuous state variables. At each step of the dynamic programming, the state space is dynamically partitioned into regions where the value function is the same throughout the region. We first describe the algorithm for piecewise constant representations. We then extend it to piecewise linear representations, using techniques from POMDPs to represent and reason about linear surfaces efficiently. We show that for complex, structured problems, our approach exploits the natural structure so that optimal solutions can be computed efficiently.

Posted Content
TL;DR: In this paper, a nonparametric approach is proposed to learn and represent transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation.
Abstract: We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as \emph{embeddings} in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
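
A minimal sketch of the conditional-mean-embedding idea for policy evaluation (Python/NumPy; the Gaussian kernel, the regularisation lam, the single-action setting and the lack of weight normalisation are simplifying assumptions, and the paper gives the conditions under which such iterations converge):

    import numpy as np

    def rbf(x, y, sigma=1.0):
        return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

    def embedding_weights(S, s_query, lam=1e-3, sigma=1.0):
        """Weights alpha(s) such that E[h(S') | S = s] ~= alpha(s) @ h(S'_samples),
        from the empirical RKHS conditional-mean embedding of transition
        samples S -> S'."""
        n = len(S)
        K = np.array([[rbf(si, sj, sigma) for sj in S] for si in S])
        k = np.array([rbf(si, s_query, sigma) for si in S])
        return np.linalg.solve(K + n * lam * np.eye(n), k)

    def kernel_policy_evaluation(S, S_next, reward, gamma=0.95, iters=200, lam=1e-3):
        """Value estimates at the sampled next states for a fixed policy:
        V <- r + gamma * W V, where row i of W holds the embedding weights
        queried at S_next[i]."""
        W = np.stack([embedding_weights(S, s, lam) for s in S_next])
        r = np.array([reward(s) for s in S_next])
        V = np.zeros(len(S))
        for _ in range(iters):
            V = r + gamma * W @ V
        return V

Expectations over next states are thus reduced to inner products with sample-based weights, which is the linear-complexity property highlighted in the abstract.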

Journal ArticleDOI
TL;DR: A methodology is introduced to design a dynamic controller that achieves L2-disturbance attenuation or approximate optimality, together with asymptotic stability.
Abstract: The solution of most nonlinear control problems hinges upon the solvability of partial differential equations or inequalities. In particular, disturbance attenuation and optimal control problems for nonlinear systems are generally solved exploiting the solution of the so-called Hamilton-Jacobi (HJ) inequality and the Hamilton-Jacobi-Bellman (HJB) equation, respectively. An explicit closed-form solution of this inequality, or equation, may however be hard or impossible to find in practical situations. Herein we introduce a methodology to circumvent this issue for input-affine nonlinear systems proposing a dynamic, i.e., time-varying, approximate solution of the HJ inequality and of the HJB equation the construction of which does not require solving any partial differential equation or inequality. This is achieved considering the immersion of the underlying nonlinear system into an augmented system defined on an extended state-space in which a (locally) positive definite storage function, or value function, can be explicitly constructed. The result is a methodology to design a dynamic controller to achieve L2-disturbance attenuation or approximate optimality, with asymptotic stability.

Journal ArticleDOI
TL;DR: This work considers differentiability of the value function of an optimal control problem for a non-autonomous switched system with respect to the switch times, and provides a method to compute the derivative of the cost function given a nominal input.

Journal ArticleDOI
TL;DR: A new algorithm for the solution of Hamilton-Jacobi-Bellman equations related to optimal control problems is presented that has the advantage that every subdomain is invariant with respect to the optimal dynamics, and then the solution can be computed independently in each subdomain.
Abstract: In this paper we present a new algorithm for the solution of Hamilton-Jacobi-Bellman equations related to optimal control problems. The key idea is to divide the domain of computation into subdomains which are shaped by the optimal dynamics of the underlying control problem. This can result in a rather complex geometrical subdivision, but it has the advantage that every subdomain is invariant with respect to the optimal dynamics, and then the solution can be computed independently in each subdomain. The features of this dynamics-dependent domain decomposition can be exploited to speed up the computation and for an efficient parallelization, since the classical transmission conditions at the boundaries of the subdomains can be avoided. For their properties, the subdomains are patches in the sense introduced by Ancona and Bressan [ESAIM Control Optim. Calc. Var., 4 (1999), pp. 445-471]. Several examples in two and three dimensions illustrate the properties of the new method.

Journal ArticleDOI
TL;DR: A state and action discretization procedure for approximating the optimal value function and an optimal policy of the original control model is proposed and explicit bounds on the approximation errors are provided.

Proceedings Article
26 Jun 2012
TL;DR: A new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation, makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS).
Abstract: We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.

Posted Content
TL;DR: In this article, the authors investigate value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case.
Abstract: This paper investigates value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case. We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games. We present a generalization of the optimal stopping problem to a two-player simultaneous move Markov game. For this special problem, we provide stronger bounds and can guarantee convergence for LSTD and temporal difference learning with linear value function approximation. We demonstrate the viability of value function approximation for Markov games by using the Least squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem.

Proceedings ArticleDOI
27 Jun 2012
TL;DR: This paper presents explicit solutions for a class of decentralized LQG problems in which players communicate their states with delays using a method for decomposing the Bellman equation into a hierarchy of independent subproblems.
Abstract: This paper presents explicit solutions for a class of decentralized LQG problems in which players communicate their states with delays. A method for decomposing the Bellman equation into a hierarchy of independent subproblems is introduced. Using this decomposition, all of the gains for the optimal controller are computed from the solution of a single algebraic Riccati equation.
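
The centralized building block referred to above is a standard algebraic Riccati solve; a minimal discrete-time sketch is below (the hierarchical decomposition of the Bellman equation into delayed-information subproblems is the paper's contribution and is not reproduced here).

    import numpy as np
    from scipy.linalg import solve_discrete_are

    def lqr_gain(A, B, Q, R):
        """State-feedback gain K (u = -K x) obtained from one discrete
        algebraic Riccati equation; per the abstract, all gains of the
        optimal delayed-sharing controller come from a single such solve."""
        P = solve_discrete_are(A, B, Q, R)
        return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)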

Proceedings Article
22 Jul 2012
TL;DR: This work introduces a more general and richer dual optimization criterion, which minimizes the average (undiscounted) cost of only paths leading to the goal among all policies that maximize the probability to reach the goal.
Abstract: Optimal solutions to Stochastic Shortest Path Problems (SSPs) usually require that there exists at least one policy that reaches the goal with probability 1 from the initial state. This condition is very strong and prevents solving many interesting problems, for instance where all possible policies reach some dead-end states with a positive probability. We introduce a more general and richer dual optimization criterion, which minimizes the average (undiscounted) cost of only paths leading to the goal among all policies that maximize the probability to reach the goal. We present policy update equations in the form of dynamic programming for this new dual criterion, which are different from the standard Bellman equations. We demonstrate that our equations converge in infinite horizon without any condition on the structure of the problem or on its policies, which actually extends the class of SSPs that can be solved. We experimentally show that our dual criterion provides well-founded solutions to SSPs that cannot be solved by the standard criterion, and that using a discount factor with the latter certainly provides solution policies, but these are not optimal considering our well-founded criterion.
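
A tabular sketch of the dual criterion described above, assuming strictly positive costs and absorbing goal states; the reweighting of transitions by goal-reaching probabilities is what makes the second step differ from a standard Bellman backup. This is a simplification of the paper's update equations, which handle convergence and tie-breaking more carefully.

    import numpy as np

    def dual_criterion_dp(P, cost, goal, iters=2000, tol=1e-9):
        """P[a, s, :] transition probabilities, cost[a, s] > 0, goal is a
        boolean mask of absorbing goal states.
        Step 1 maximises the probability of reaching the goal; step 2,
        among probability-maximising actions, minimises the expected cost
        averaged over goal-reaching paths only."""
        nA, nS, _ = P.shape
        # Step 1: maximum goal-reaching probability p*.
        p = goal.astype(float)
        for _ in range(iters):
            p = np.where(goal, 1.0, np.max(P @ p, axis=0))
        # Step 2: conditional expected cost under the dual criterion.
        J = np.zeros(nS)
        for _ in range(iters):
            J_new = np.zeros(nS)
            for s in range(nS):
                if goal[s] or p[s] == 0.0:      # goals and dead ends contribute no cost
                    continue
                best = np.inf
                for a in range(nA):
                    if P[a, s] @ p < p[s] - tol:
                        continue                # action does not maximise goal probability
                    w = P[a, s] * p / p[s]      # transitions reweighted by p*
                    best = min(best, cost[a, s] + w @ J)
                J_new[s] = best
            J = J_new
        return p, J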

Journal ArticleDOI
TL;DR: This paper presents a stochastic dynamic programming formulation of the Dynamic Integrated Model of Climate and the Economy (DICE), and the application of approximate dynamic programming techniques to numerically solve for the optimal policy under uncertain and decision-dependent technological change in a multi-stage setting.
Abstract: Analyses of global climate policy as a sequential decision under uncertainty have been severely restricted by dimensionality and computational burdens. Therefore, they have limited the number of decision stages, discrete actions, or number and type of uncertainties considered. In particular, two common simplifications are the use of two-stage models to approximate a multi-stage problem and exogenous formulations for inherently endogenous or decision-dependent uncertainties (in which the shock at time t+1 depends on the decision made at time t). In this paper, we present a stochastic dynamic programming formulation of the Dynamic Integrated Model of Climate and the Economy (DICE), and the application of approximate dynamic programming techniques to numerically solve for the optimal policy under uncertain and decision-dependent technological change in a multi-stage setting. We compare numerical results using two alternative value function approximation approaches, one parametric and one non-parametric. We show that increasing the variance of a symmetric mean-preserving uncertainty in abatement costs leads to higher optimal first-stage emission controls, but the effect is negligible when the uncertainty is exogenous. In contrast, the impact of decision-dependent cost uncertainty, a crude approximation of technology R&D, on optimal control is much larger, leading to higher control rates (lower emissions). Further, we demonstrate that the magnitude of this effect grows with the number of decision stages represented, suggesting that for decision-dependent phenomena, the conventional two-stage approximation will lead to an underestimate of the effect of uncertainty.

Proceedings ArticleDOI
14 May 2012
TL;DR: The proposed incremental Markov Decision Process (iMDP) provides an anytime approach to the computation of optimal control policies of the continuous problem and is demonstrated on motion planning and control problems in cluttered environments in the presence of process noise.
Abstract: In this paper, we consider a class of continuous-time, continuous-space stochastic optimal control problems. Building upon recent advances in Markov chain approximation methods and sampling-based algorithms for deterministic path planning, we propose a novel algorithm called the incremental Markov Decision Process (iMDP) to compute incrementally control policies that approximate arbitrarily well an optimal policy in terms of the expected cost. The main idea behind the algorithm is to generate a sequence of finite discretizations of the original problem through random sampling of the state space. At each iteration, the discretized problem is a Markov Decision Process that serves as an incrementally refined model of the original problem. We show that with probability one, (i) the sequence of the optimal value functions for each of the discretized problems converges uniformly to the optimal value function of the original stochastic optimal control problem, and (ii) the original optimal value function can be computed efficiently in an incremental manner using asynchronous value iterations. Thus, the proposed algorithm provides an anytime approach to the computation of optimal control policies of the continuous problem. The effectiveness of the proposed approach is demonstrated on motion planning and control problems in cluttered environments in the presence of process noise.

Journal ArticleDOI
TL;DR: A new method to approximate Markov perfect equilibrium in large-scale Ericson and Pakes (1995)-style dynamic oligopoly models that are not amenable to exact solution due to the curse of dimensionality is introduced.
Abstract: In this article, we introduce a new method to approximate Markov perfect equilibrium in large-scale Ericson and Pakes (1995)-style dynamic oligopoly models that are not amenable to exact solution due to the curse of dimensionality. The method is based on an algorithm that iterates an approximate best response operator using an approximate dynamic programming approach. The method, based on mathematical programming, approximates the value function with a linear combination of basis functions. We provide results that lend theoretical support to our approach. We introduce a rich yet tractable set of basis functions, and test our method on important classes of models. Our results suggest that the approach we propose significantly expands the set of dynamic oligopoly models that can be analyzed computationally.

Journal ArticleDOI
TL;DR: This work proposes three families of iterative strategies for solving the linearized discrete MFG systems, most of which involve suitable multigrid solvers or preconditioners.
Abstract: Mean field games (MFG) describe the asymptotic behavior of stochastic differential games in which the number of players tends to $+\infty$. Under suitable assumptions, they lead to a new kind of system of two partial differential equations: a forward Bellman equation coupled with a backward Fokker-Planck equation. In earlier articles, finite difference schemes preserving the structure of the system have been proposed and studied. They lead to large systems of nonlinear equations in finite dimension. A possible way of numerically solving the latter is to use inexact Newton methods: a Newton step consists of solving a linearized discrete MFG system. The forward-backward character of the MFG system makes it impossible to use time marching methods. In the present work, we propose three families of iterative strategies for solving the linearized discrete MFG systems, most of which involve suitable multigrid solvers or preconditioners.

Journal ArticleDOI
TL;DR: In this paper, a deterministic time-inconsistent optimal control problem is formulated for ordinary differential equations and a non-cooperative N-person differential game (but essentially cooperative in some sense) is introduced.
Abstract: A general deterministic time-inconsistent optimal control problem is formulated for ordinary differential equations. To find a time-consistent equilibrium value function and the corresponding time-consistent equilibrium control, a non-cooperative N-person differential game (but essentially cooperative in some sense) is introduced. Under certain conditions, it is proved that the open-loop Nash equilibrium value function of the N-person differential game converges to a time-consistent equilibrium value function of the original problem, which is the value function of a time-consistent optimal control problem. Moreover, it is proved that any optimal control of the time-consistent limit problem is a time-consistent equilibrium control of the original problem.