
Showing papers on "Bellman equation published in 2019"


Posted Content
TL;DR: This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps; the regret is independent of the number of states and actions.
Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
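A minimal sketch of the optimistic least-squares backup described above, assuming linear features phi(s, a) in R^d; the helper names and toy data layout are illustrative assumptions, not the authors' code.

```python
# Sketch of one backward step of optimistic least-squares value iteration with
# linear features: ridge regression on Bellman targets plus a UCB-style bonus.
import numpy as np

def optimistic_lsvi_step(transitions, phi, next_values, d, beta, lam=1.0):
    """transitions: list of (s, a, r, s_next); next_values: callable s -> max_a Q_{h+1}(s, a)."""
    Lambda = lam * np.eye(d)                    # regularized Gram matrix
    target_sum = np.zeros(d)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        Lambda += np.outer(f, f)
        target_sum += f * (r + next_values(s_next))
    w = np.linalg.solve(Lambda, target_sum)     # least-squares weights
    Lambda_inv = np.linalg.inv(Lambda)

    def q_value(s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lambda_inv @ f)   # optimism: exploration bonus
        return f @ w + bonus
    return q_value
```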

337 citations


Posted Content
01 Aug 2019
TL;DR: In this article, the authors provide provable characterizations of computational, approximation, and sample size issues with policy gradient methods in the context of discounted Markov Decision Processes (MDPs).
Abstract: Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regard to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. One insight of this work is in formalizing the importance of how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place policy gradient methods on a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.
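For readers who want the analyzed object in concrete form, here is a toy sketch of one exact policy-gradient ascent step for a softmax tabular parameterization on a small, fully known MDP; array names and shapes are illustrative assumptions.

```python
# Exact policy gradient for a softmax tabular policy on a known MDP:
# grad J[s, a] = d_mu(s) / (1 - gamma) * pi(a|s) * A_pi(s, a).
import numpy as np

def softmax_policy(theta):                      # theta: (S, A) logits
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_gradient_step(theta, P, R, mu, gamma=0.9, lr=0.1):
    """P: (S, A, S) transition tensor, R: (S, A) rewards, mu: (S,) start distribution."""
    S, A = R.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)       # state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)          # exact policy evaluation
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    Adv = Q - V[:, None]
    d_mu = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)  # discounted visitation
    grad = (d_mu[:, None] / (1 - gamma)) * pi * Adv
    return theta + lr * grad                    # one ascent step
```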

94 citations


Journal ArticleDOI
TL;DR: A novel off-policy interleaved Q-learning algorithm is presented for solving the optimal control problem of affine nonlinear discrete-time (DT) systems using only the measured data along the system trajectories, and its convergence is rigorously proven.
Abstract: In this paper, a novel off-policy interleaved Q-learning algorithm is presented for solving the optimal control problem of affine nonlinear discrete-time (DT) systems, using only the measured data along the system trajectories. The affine nonlinear feature of the systems, unknown dynamics, and the off-policy learning approach pose tremendous challenges for approximating optimal controllers. To this end, the on-policy Q-learning method for optimal control of affine nonlinear DT systems is reviewed first, and its convergence is rigorously proven. The bias in the solution to the Q-function-based Bellman equation, caused by adding probing noise to the systems to satisfy persistent excitation, is also analyzed for the on-policy Q-learning approach. Then, a behavior control policy is introduced, followed by the proposal of an off-policy Q-learning algorithm. Meanwhile, the convergence of the algorithm and the absence of bias in the solution to the optimal control problem when probing noise is added to the systems are investigated. Third, three neural networks are run by the interleaved Q-learning approach in an actor-critic framework. Thus, a novel off-policy interleaved Q-learning algorithm is derived, and its convergence is proven. Simulation results are given to verify the effectiveness of the proposed method.

86 citations


Journal ArticleDOI
TL;DR: In this paper, a functional central limit theorem (CLT) was established for mean field games, which characterizes the limiting fluctuations around the LLN limit as the unique solution of a linear stochastic PDE.
Abstract: Mean field games (MFGs) describe the limit, as $n$ tends to infinity, of stochastic differential games with $n$ players interacting with one another through their common empirical distribution. Under suitable smoothness assumptions that guarantee uniqueness of the MFG equilibrium, a form of the law of large numbers (LLN), also known as propagation of chaos, has been established to show that the MFG equilibrium arises as the limit of the sequence of empirical measures of the $n$-player game Nash equilibria, including the case when player dynamics are driven by both idiosyncratic and common sources of noise. The proof of convergence relies on the so-called master equation for the value function of the MFG, a partial differential equation on the space of probability measures. In this work, under additional assumptions, we establish a functional central limit theorem (CLT) that characterizes the limiting fluctuations around the LLN limit as the unique solution of a linear stochastic PDE. The key idea is to use the solution to the master equation to construct an associated McKean-Vlasov interacting $n$-particle system that is sufficiently close to the Nash equilibrium dynamics of the $n$-player game for large $n$. We then derive the CLT for the latter from the CLT for the former. Along the way, we obtain a new multidimensional CLT for McKean-Vlasov systems. We also illustrate the broader applicability of our methodology by applying it to establish a CLT for a specific linear-quadratic example that does not satisfy our main assumptions, and we explicitly solve the resulting stochastic PDE in this case.

84 citations


Journal ArticleDOI
TL;DR: The optimal tracking problem is reformulated as finding a Nash-equilibrium solution to multiplayer games, which can be done by solving associated coupled Hamilton–Jacobi equations, and a data-based error estimator is designed to obtain the data-based control.
Abstract: This paper studies an optimal consensus tracking problem of heterogeneous linear multiagent systems. By introducing tracking error dynamics, the optimal tracking problem is reformulated as finding a Nash-equilibrium solution to multiplayer games, which can be done by solving associated coupled Hamilton–Jacobi equations. A data-based error estimator is designed to obtain the data-based control for the multiagent systems. Using the quadratic functional to approximate every agent’s value function, we can obtain the optimal cooperative control by the input–output (I/O) $Q$-learning algorithm with a value iteration technique in the least-square sense. The control law solves the optimal consensus problem for multiagent systems with measured I/O information, and does not rely on the model of multiagent systems. A numerical example is provided to illustrate the effectiveness of the proposed algorithm.

81 citations


Posted Content
TL;DR: A generalized version of the Bellman equation is proposed to learn a single parametric representation for optimal policies over the space of all possible preferences in MORL, with the goal of enabling few-shot adaptation to new tasks.
Abstract: We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
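A heavily hedged sketch of a preference-conditioned Bellman backup of the kind the abstract describes: the target maximizes the scalarized value over next actions and a sampled set of preferences, then backs up the full vector of returns. The finite preference set W, the function names, and the signatures are illustrative; the paper's exact operator may differ.

```python
# Vector-valued Bellman target for MORL with linear preferences (illustrative sketch).
import numpy as np

def morl_bellman_target(r_vec, q_next, w, W, num_actions, gamma=0.99):
    """r_vec: (K,) vector reward; q_next: callable (a, w') -> (K,) vector Q at the next state;
    w: (K,) current preference; W: iterable of candidate preference vectors."""
    best, best_scalar = None, -np.inf
    for w_prime in W:
        for a in range(num_actions):
            q = q_next(a, w_prime)
            scalar = float(w @ q)            # scalarize with the *current* preference
            if scalar > best_scalar:
                best_scalar, best = scalar, q
    return r_vec + gamma * np.asarray(best)  # back up the full vector of returns
```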

80 citations


Proceedings Article
01 Jan 2019
TL;DR: This paper proves for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation and establishes the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
Abstract: Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

73 citations


Posted Content
William Fedus1, Carles Gelada1, Yoshua Bengio, Marc G. Bellemare1, Hugo Larochelle1 
TL;DR: It is demonstrated that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL, and a surprising discovery is made that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
Abstract: Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
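The link between the two discounting schemes can be seen from the identity \int_0^1 \gamma^{kt} d\gamma = 1/(1 + kt): a hyperbolic discount is an average of exponential discounts over gamma, which is why maintaining value estimates at several gammas lets an agent approximate hyperbolic time preferences. A quick numerical check of the identity (not the authors' agent; the paper's particular weighting scheme may differ):

```python
# Riemann-sum check that averaging exponential discount weights over gamma in [0, 1]
# reproduces the hyperbolic discount 1 / (1 + k*t).
import numpy as np

k, t = 0.1, 25.0                                # hyperbolic discount 1 / (1 + k*t)
dg = 1e-4
gammas = np.arange(0.0, 1.0, dg) + dg / 2.0     # midpoint grid on [0, 1]
approx = np.sum(gammas ** (k * t)) * dg         # ~ integral_0^1 gamma^(k*t) dgamma
print(approx, 1.0 / (1.0 + k * t))              # both approximately 0.2857
```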

70 citations


Journal ArticleDOI
TL;DR: A data-driven method is proposed to approximate semi-global solutions to HJB equations for general high-dimensional nonlinear systems and to compute optimal feedback controls in real time with neural networks trained on data generated independently of any state space discretization.
Abstract: Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton-Jacobi-Bellman (HJB) equations, which, in high dimensions, are notoriously difficult. Existing strategies for high dimensional problems generally rely on specific, restrictive problem structures, or are valid only locally around some nominal trajectory. In this paper, we propose a data-driven method to approximate semi-global solutions to HJB equations for general high dimensional nonlinear systems and compute optimal feedback controls in real-time. To accomplish this, we model solutions to HJB equations with neural networks (NNs) trained on data generated independently of any state space discretization. Training is made more effective and efficient by leveraging the known physics of the problem and using the partially trained NN to aid in adaptive data generation. We demonstrate the effectiveness of our method by learning the approximate solution to the HJB equation corresponding to the stabilization of a six-dimensional nonlinear rigid body, and controlling the system with the trained NN.
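For orientation, the following shows the stationary HJB equation and its closed-form feedback law in the illustrative special case of control-affine dynamics with quadratic control cost (an assumption; the paper treats a more general setting). Once a network approximates V, its gradient yields the control directly.

```latex
% Control-affine dynamics \dot{x} = f(x) + g(x)u with running cost q(x) + (1/2) u^T R u:
0 = \min_{u}\Big\{\, q(x) + \tfrac{1}{2}\,u^{\top} R\, u
      + \nabla V(x)^{\top}\big(f(x) + g(x)\,u\big) \Big\},
\qquad
u^{*}(x) = -R^{-1} g(x)^{\top} \nabla V(x).
```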

67 citations


Proceedings ArticleDOI
20 May 2019
TL;DR: This work shows how a time-discounted modification of the problem of maximizing the minimum payoff over time, which is central to safety analysis, yields a modified dynamic programming equation that induces a contraction mapping, rendering reinforcement learning techniques amenable to quantitative safety analysis as tools to approximate the safe set and the optimal safety policy.
Abstract: Safety analysis is a necessary component in the design and deployment of autonomous robotic systems. Techniques from robust optimal control theory, such as Hamilton-Jacobi reachability analysis, allow a rigorous formalization of safety as guaranteed constraint satisfaction. Unfortunately, the computational complexity of these tools for general dynamical systems scales poorly with state dimension, making existing tools impractical beyond small problems. Modern reinforcement learning methods have shown promising ability to find approximate yet proficient solutions to optimal control problems in complex and high-dimensional systems, however their application has in practice been restricted to problems with an additive payoff over time, unsuitable for reasoning about safety. In recent work, we introduced a time-discounted modification of the problem of maximizing the minimum payoff over time, central to safety analysis, through a modified dynamic programming equation that induces a contraction mapping. Here, we show how a similar contraction mapping can render reinforcement learning techniques amenable to quantitative safety analysis as tools to approximate the safe set and optimal safety policy. This opens a new avenue of research connecting control-theoretic safety analysis and the reinforcement learning domain. We validate the correctness of our formulation by comparing safety results computed through Q-learning to analytic and numerical solutions, and demonstrate its scalability by learning safe sets and control policies for simulated systems of up to 18 state dimensions using value learning and policy gradient techniques.
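A hedged tabular sketch of a Q-learning update with a discounted minimum-payoff backup of the kind described above; the exact operator and constants in the paper may differ, and l(s) (e.g., a signed distance to the failure set) and all names are illustrative.

```python
# Tabular Q-learning with a "keep track of the worst payoff" backup instead of a sum of rewards.
import numpy as np

def safety_q_update(Q, s, a, s_next, l, gamma=0.99, alpha=0.1):
    """Q: (S, A) table; l: callable state -> payoff (positive means safe margin)."""
    target = (1 - gamma) * l(s) + gamma * min(l(s), np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```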

62 citations


Posted Content
TL;DR: This work introduces generic model-free algorithms based on the state-action value function at the mean field level and proves convergence for a prototypical Q-learning method for mean field control problems.
Abstract: We develop a general reinforcement learning framework for mean field control (MFC) problems. Such problems arise for instance as the limit of collaborative multi-agent control problems when the number of agents is very large. The asymptotic problem can be phrased as the optimal control of a non-linear dynamics. This can also be viewed as a Markov decision process (MDP) but the key difference with the usual RL setup is that the dynamics and the reward now depend on the state's probability distribution itself. Alternatively, it can be recast as a MDP on the Wasserstein space of measures. In this work, we introduce generic model-free algorithms based on the state-action value function at the mean field level and we prove convergence for a prototypical Q-learning method. We then implement an actor-critic method and report numerical results on two archetypal problems: a finite space model motivated by a cyber security application and a continuous space model motivated by an application to swarm motion.

Journal ArticleDOI
TL;DR: The convergence of the MsHDP algorithm is proved by demonstrating that it converges to the solution of the Bellman equation.
Abstract: In this paper, the optimal output tracking control problem of discrete-time nonlinear systems is considered. First, the augmented system is derived and the tracking control problem is converted to a regulation problem with a discounted performance index, which relies on the solution of the Bellman equation. It is known that policy iteration and value iteration are two classical algorithms for solving the Bellman equation. Through analysis of the two algorithms, it is found that policy iteration converges fast but requires an initial admissible control policy, whereas value iteration avoids the requirement of an initial admissible control policy but converges slowly. To achieve a tradeoff between policy iteration and value iteration, the multistep heuristic dynamic programming (MsHDP) is proposed by using a multistep policy evaluation scheme. The convergence of the MsHDP algorithm is proved by demonstrating that it converges to the solution of the Bellman equation. Subsequently, a neural network-based actor-critic structure is developed to implement the MsHDP algorithm. The effectiveness and advantages of the developed MsHDP method are validated through comparative simulation studies.
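An illustrative tabular sketch of the multistep evaluation idea (not the authors' neural-network implementation): between a single backup per improvement step (value iteration) and full evaluation (policy iteration), the fixed-policy Bellman operator is applied N times.

```python
# Multistep heuristic dynamic programming on a small known MDP (illustrative sketch).
import numpy as np

def multistep_hdp(P, R, gamma=0.95, n_steps=5, iters=200):
    """P: (S, A, S) transitions, R: (S, A) rewards; returns a value estimate and greedy policy."""
    S, A = R.shape
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        Q = R + gamma * P @ V                    # (S, A) action values
        pi = Q.argmax(axis=1)                    # policy improvement (greedy)
        for _ in range(n_steps):                 # partial policy evaluation: N backups
            Q_pi = R + gamma * P @ V
            V = Q_pi[np.arange(S), pi]
    return V, pi
```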

Posted Content
TL;DR: This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations, providing insight into the interplay between optimization and generalization in reinforcement learning.
Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic is known, obtained by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of the stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.

Proceedings Article
01 Jan 2019
TL;DR: A novel loss function is proposed, which can be optimized using standard gradient-based methods with guaranteed convergence, and can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient.
Abstract: Value function learning plays a central role in many state-of-the-art reinforcement learning algorithms. Many popular algorithms like Q-learning do not optimize any objective function, but are fixed-point iterations of some variants of Bellman operator that are not necessarily a contraction. As a result, they may easily lose convergence guarantees, as can be observed in practice. In this paper, we propose a novel loss function, which can be optimized using standard gradient-based methods with guaranteed convergence. The key advantage is that its gradient can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient. Our approach may be combined with general function classes such as neural networks, using either on- or off-policy data, and is shown to work reliably and effectively in several benchmarks, including classic problems where standard algorithms are known to diverge.
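The double-sample issue mentioned above can be stated in one line: for a stochastic environment, the single-sample squared Bellman residual is biased upward by the variance of the bootstrap target,

```latex
% Bias of the naive single-sample objective for a transition (s, a, r, s'):
\mathbb{E}_{s'}\big[\big(r + \gamma V(s') - V(s)\big)^{2}\big]
  = \big(\mathbb{E}_{s'}[r + \gamma V(s')] - V(s)\big)^{2}
  + \operatorname{Var}_{s'}\!\big[r + \gamma V(s')\big],
```

so minimizing it also penalizes environment stochasticity unless two independent draws of s' are available, which is the double-sample requirement the proposed loss is designed to avoid.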

Proceedings Article
01 Jan 2019
TL;DR: This article proposes a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences, which can learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent.
Abstract: We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.

Journal ArticleDOI
TL;DR: A Q-learning scheme for the optimal consensus control of discrete-time multiagent systems is investigated; the algorithm is driven by reinforcement learning using system data instead of system dynamics information, and a least-squares method is employed to facilitate its implementation.
Abstract: This paper investigates a Q-learning scheme for the optimal consensus control of discrete-time multiagent systems. The Q-learning algorithm is conducted by reinforcement learning (RL) using system data instead of system dynamics information. In the multiagent systems, the agents interact with each other and at least one agent can communicate with the leader directly, which is described by an algebraic graph structure. The objective is to make all the agents achieve synchronization with the leader and make the performance indices reach a Nash equilibrium. On one hand, the solutions of the optimal consensus control for multiagent systems are acquired by solving the coupled Hamilton–Jacobi–Bellman (HJB) equation. However, it is difficult to obtain analytical solutions of the discrete-time HJB equation directly. On the other hand, accurate mathematical models of most real-world systems are hard to obtain. To overcome these difficulties, the Q-learning algorithm is developed using system data rather than an accurate system model. We formulate the performance index and the corresponding Bellman equation of each agent i. Then, the Q-function Bellman equation is obtained on the basis of the Q-function. Policy iteration is adopted to calculate the optimal control iteratively, and a least-squares (LS) method is employed to facilitate the implementation process. A stability analysis of the proposed Q-learning algorithm for multiagent systems under policy iteration is given. Two simulation examples are provided to verify the effectiveness of the proposed scheme.

Journal ArticleDOI
TL;DR: A new approach for finite horizon optimal control problems where the value function is computed using a DP algorithm on a tree structure (TSA) constructed by the time-discrete dynamics, allowing for the solution of very high-dimensional problems.
Abstract: The classical dynamic programming (DP) approach to optimal control problems is based on the characterization of the value function as the unique viscosity solution of a Hamilton–Jacobi–Bellman equation...

Journal ArticleDOI
TL;DR: In this article, a stochastic optimal control problem for a partially observed diffusion is studied and a corresponding randomized dynamic programming principle for the value function is obtained from a flow property of an associated filter process.

Journal ArticleDOI
TL;DR: In this article, a new optimization formulation of the linear quadratic regulator (LQR) problem via Lagrangian duality theory is proposed to lay theoretical foundations for potentially effective RL algorithms.
Abstract: Recently, reinforcement learning (RL) has been receiving more and more attention due to its successful demonstrations outperforming human performance in certain challenging tasks. The goal of this paper is to study a new optimization formulation of the linear quadratic regulator (LQR) problem via Lagrangian duality theory in order to lay theoretical foundations for potentially effective RL algorithms. The new optimization problem includes the Q-function parameters so that it can be directly used to develop Q-learning algorithms, known to be one of the most popular RL algorithms. We prove relations between saddle points of the Lagrangian function and the optimal solutions of the Bellman equation. As an example of its applications, we propose a model-free primal-dual Q-learning algorithm to solve the LQR problem and demonstrate its validity through examples.
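As background for the Q-function parameters mentioned above (a standard LQR fact, not the paper's primal-dual method): the optimal Q-function is quadratic, Q(x, u) = [x; u]^T H [x; u], and the greedy policy is linear, u = -H_uu^{-1} H_ux x. A small sketch computing H by iterating the Riccati recursion; names are illustrative.

```python
# Compute the LQR Q-function matrix H and the greedy gain K by Riccati iteration.
import numpy as np

def lqr_q_matrix(A, B, Q, R, iters=500):
    """Dynamics x' = A x + B u, stage cost x^T Q x + u^T R u."""
    n, _ = B.shape
    P = np.zeros((n, n))
    for _ in range(iters):                       # value iteration on the Riccati equation
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
    H_xx, H_uu, H_ux = Q + A.T @ P @ A, R + B.T @ P @ B, B.T @ P @ A
    H = np.block([[H_xx, H_ux.T], [H_ux, H_uu]])
    K = np.linalg.solve(H_uu, H_ux)              # greedy policy: u = -K x
    return H, K
```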

Proceedings ArticleDOI
01 May 2019
TL;DR: LVIS is introduced, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy, and is applied to a fundamentally hard problem in feedback control: control through contact.
Abstract: Guided policy search is a popular approach for training controllers for high-dimensional systems, but it has a number of pitfalls. Non-convex trajectory optimization has local minima, and non-uniqueness in the optimal policy itself can mean that independently-optimized samples do not describe a coherent policy from which to train. We introduce LVIS, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy. To avoid the expense of solving the mixed-integer programs to full global optimality, we instead solve them only partially, extracting intervals containing the true cost-to-go from early termination of the branch-and-bound algorithm. These interval samples are used to weakly supervise the training of a neural net which approximates the true cost-to-go. Online, we use that learned cost-to-go as the terminal cost of a one-step model-predictive controller, which we solve via a small mixed-integer optimization. We demonstrate LVIS on piecewise affine models of a cart-pole system with walls and a planar humanoid robot and show that it can be applied to a fundamentally hard problem in feedback control: control through contact.

Posted Content
TL;DR: An explicit upper bound is obtained on the rate of convergence of this algorithm as a function of the network topology and the discount factor when the communication network between the agents is time-varying in general.
Abstract: We study the policy evaluation problem in multi-agent reinforcement learning. In this problem, a group of agents works cooperatively to evaluate the value function for the global discounted accumulative reward problem, which is composed of local rewards observed by the agents. Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors. The local update at each agent can be interpreted as a distributed consensus-based variant of the popular temporal difference learning algorithm TD(0). While distributed reinforcement learning algorithms have been presented in the literature, almost nothing is known about their convergence rate. Our main contribution is providing a finite-time analysis for the convergence of the distributed TD(0) algorithm. We do this when the communication network between the agents is time-varying in general. We obtain an explicit upper bound on the rate of convergence of this algorithm as a function of the network topology and the discount factor. Our results mirror what we would expect from using distributed stochastic gradient descent for solving convex optimization problems.
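A minimal sketch of a consensus-based distributed TD(0) step of the kind analyzed above; the doubly stochastic mixing matrix W, the mix-after-update ordering, and all names are illustrative assumptions rather than the paper's exact recursion.

```python
# Each agent takes a local TD(0) step with its own reward on the shared transition,
# then averages parameters with its neighbors via a mixing matrix.
import numpy as np

def distributed_td0_step(Theta, W, phi_s, phi_s_next, local_rewards, gamma=0.95, alpha=0.05):
    """Theta: (N, d) one linear-value parameter vector per agent; W: (N, N) doubly stochastic
    mixing matrix; phi_s, phi_s_next: (d,) features of the shared transition; local_rewards: (N,)."""
    new_Theta = np.empty_like(Theta)
    for i in range(Theta.shape[0]):
        td_error = local_rewards[i] + gamma * phi_s_next @ Theta[i] - phi_s @ Theta[i]
        new_Theta[i] = Theta[i] + alpha * td_error * phi_s   # local TD(0) step
    return W @ new_Theta                                     # consensus with neighbors
```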

Journal ArticleDOI
TL;DR: Novel results on the solution of a class of leavable, undiscounted optimal control problems in the minimax sense for nonlinear, continuous-state, discrete-time plants are presented.
Abstract: We present novel results on the solution of a class of leavable, undiscounted optimal control problems in the minimax sense for nonlinear, continuous-state, discrete-time plants. The problem class includes entry-(exit-)time problems as well as minimum-time, pursuit-evasion, and reach-avoid games as special cases. We utilize auxiliary optimal control problems (“abstractions”) to compute both upper bounds of the value function, i.e., of the achievable closed-loop performance, and symbolic feedback controllers realizing those bounds. The abstractions are obtained from discretizing the problem data, and we prove that the computed bounds and the performance of the symbolic controllers converge to the value function as the discretization parameters approach zero. In particular, if the optimal control problem is solvable on some compact subset of the state space, and if the discretization parameters are sufficiently small, then we obtain a symbolic feedback controller solving the problem on that subset. These results do not assume the continuity of the value function or any problem data, and they fully apply in the presence of hard state and control constraints.

Journal ArticleDOI
TL;DR: In this paper, the authors develop a dynamic model of rational behavior under uncertainty, in which the agent maximizes the stream of future τ-quantile utilities, for τ ∈ (0, 1).
Abstract: This paper develops a dynamic model of rational behavior under uncertainty, in which the agent maximizes the stream of future τ-quantile utilities, for τ ∈ (0,1). That is, the agent has a quantile utility preference instead of the standard expected utility. Quantile preferences have useful advantages, including the ability to capture heterogeneity and allowing the separation between risk aversion and elasticity of intertemporal substitution. Although quantiles do not share some of the helpful properties of expectations, such as linearity and the law of iterated expectations, we are able to establish all the standard results in dynamic models. Namely, we show that the quantile preferences are dynamically consistent, the corresponding dynamic problem yields a value function, via a fixed point argument, this value function is concave and differentiable, and the principle of optimality holds. Additionally, we derive the corresponding Euler equation, which is well suited for using well-known quantile regression methods for estimating and testing the economic model. In this way, the parameters of the model can be interpreted as structural objects. Therefore, the proposed methods provide microeconomic foundations for quantile regression methods. To illustrate the developments, we construct an intertemporal consumption model and estimate the discount factor and elasticity of intertemporal substitution parameters across the quantiles.  The results provide evidence of heterogeneity in these parameters.
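In the notation suggested by the abstract, the recursion replaces the conditional expectation of the continuation value with its conditional τ-quantile; a schematic (and purely illustrative) form of the resulting Bellman equation is

```latex
% Schematic quantile-utility Bellman equation (notation illustrative):
V(x) \;=\; \max_{c \in \Gamma(x)} \Big\{\, u(c) \;+\; \beta\, \mathcal{Q}_{\tau}\!\big[\, V(x') \,\big|\, x, c \,\big] \Big\},
```

where Q_τ[· | x, c] denotes the conditional τ-quantile of the next-period value, in contrast to the conditional expectation used under expected utility.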

Journal ArticleDOI
TL;DR: In this article, a Wasserstein distance on the set of the probability distributions of strong solutions to stochastic differential equations is defined by restricting a set of possible coupling measures.
Abstract: In this paper we introduce a Wasserstein-type distance on the set of the probability distributions of strong solutions to stochastic differential equations. This new distance is defined by restricting the set of possible coupling measures. We prove that it may also be defined by means of the value function of a stochastic control problem whose Hamilton–Jacobi–Bellman equation has a smooth solution, which allows one to deduce a priori estimates or to obtain numerical evaluations. We exhibit an optimal coupling measure and characterize it as a weak solution to an explicit stochastic differential equation, and we finally describe procedures to approximate this optimal coupling measure. A notable application concerns the following modeling issue: given an exact diffusion model, how to select a simplified diffusion model within a class of admissible models under the constraint that the probability distribution of the exact model is preserved as much as possible?

Posted Content
26 Nov 2019
TL;DR: This paper presents a constrained deep adaptive dynamic programming algorithm to solve general nonlinear optimal control problems with known dynamics and proposes a series of recovery rules to update the policy in case the primal problem is infeasible.
Abstract: This paper presents a constrained deep adaptive dynamic programming (CDADP) algorithm to solve general nonlinear optimal control problems with known dynamics. Unlike previous ADP algorithms, it can directly deal with problems with state constraints. Both the policy and value function are approximated by deep neural networks (NNs), which directly map the system state to the action and the value function, respectively, without needing to use hand-crafted basis functions. The proposed algorithm considers the state constraints by transforming the policy improvement process into a constrained optimization problem. Meanwhile, a trust region constraint is added to prevent excessive policy updates. We first linearize this constrained optimization problem locally into a quadratically-constrained quadratic programming problem, and then obtain the optimal update of the policy network parameters by solving its dual problem. We also propose a series of recovery rules to update the policy in case the primal problem is infeasible. In addition, parallel learners are employed to explore different state spaces, which stabilizes and accelerates learning. A vehicle control problem in a path-tracking task is used to demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: In this article, the authors considered two-player risk-sensitive zero-sum differential games (RSZSDGs), where both the drift term and the diffusion term in the controlled stochastic differential equation are dependent on the state and controls of both players, and the objective functional is of the risk-sensitive type.
Abstract: We consider two-player risk-sensitive zero-sum differential games (RSZSDGs). In our problem setup, both the drift term and the diffusion term in the controlled stochastic differential equation are dependent on the state and controls of both players, and the objective functional is of the risk-sensitive type. First, a stochastic maximum principle type necessary condition for an open-loop saddle point of the RSZSDG is established via nonlinear transformations of the adjoint processes of the equivalent risk-neutral stochastic zero-sum differential game. In particular, we obtain two variational inequalities, namely, the pair of saddle-point inequalities of the RSZSDG. Next, we obtain the Hamilton–Jacobi–Isaacs partial differential equation for the RSZSDG, which provides a sufficient condition for a feedback saddle point of the RSZSDG, using a logarithmic transformation of the associated value function. Finally, we study the extended linear-quadratic RSZSDG (LQ-RSZSDG). We show intractability of the extended LQ-RSZSDG with the state and/or controls of both players appearing in the diffusion term. This unexpected intractability could lead to nonlinear open-loop and feedback saddle points even if the problem itself is essentially LQ and the Isaacs condition holds.

Posted Content
TL;DR: This article proves that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation, a result enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD.
Abstract: Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

Proceedings Article
01 Jan 2019
TL;DR: It is proved that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and the method achieves a polynomial sample complexity bound in the horizon and the number of anchor points.
Abstract: We study linear approximate value iteration (LAVI) with a generative model. While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of anchor states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations in a number of simple problems where a traditional least-square LAVI method diverges.
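A hedged sketch of the interpolation mechanism described above, assuming access to a generative model and a single sampled next state per anchor-action pair; names and shapes are illustrative. The key point is that a convex-combination interpolation is a non-expansion, which is what keeps the error propagation linear rather than exponential.

```python
# Approximate value iteration with Q-values stored only at K anchor states.
import numpy as np

def interpolated_q(lambdas, anchor_q):
    """lambdas: (K,) convex weights expressing phi(s) as a combination of anchor features
    (nonnegative, summing to one); anchor_q: (K, A) Q-estimates at the anchors."""
    assert np.all(lambdas >= 0) and abs(lambdas.sum() - 1.0) < 1e-6
    return lambdas @ anchor_q                     # (A,) interpolated Q-values at state s

def anchor_backup(anchor_q, rewards, next_lambdas, gamma=0.95):
    """One backup at the anchors: rewards[k, a] + gamma * max_a' of the interpolated Q
    at the sampled next state; next_lambdas: (K, A, K) convex weights of next states."""
    K, A = rewards.shape
    new_q = np.empty((K, A))
    for k in range(K):
        for a in range(A):
            new_q[k, a] = rewards[k, a] + gamma * interpolated_q(next_lambdas[k, a], anchor_q).max()
    return new_q
```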

Journal ArticleDOI
TL;DR: A novel event-triggered single-network adaptive dynamic programming (ADP) method is proposed to obtain the solution of the constrained OTCP, and the convergence of the critic NN weights and the stability of the closed-loop system are demonstrated.

Posted Content
TL;DR: This work extends classical differential game theory to simultaneously address weapon assignments and multi-player pursuit-evasion scenarios, and devises saddle-point strategies that provide guaranteed performance for each team regardless of the actual strategies implemented by the opponent.
Abstract: In this paper an N-pursuer vs. M-evader team conflict is studied. The differential game of border defense is addressed and we focus on the game of degree in the region of the state space where the pursuers are able to win. This work extends classical differential game theory to simultaneously address weapon assignments and multi-player pursuit-evasion scenarios. Saddle-point strategies that provide guaranteed performance for each team regardless of the actual strategies implemented by the opponent are devised. The players' optimal strategies require the co-design of cooperative optimal assignments and optimal guidance laws. A representative measure of performance is proposed and the Value function of the game is obtained. It is shown that the Value function is continuous, continuously differentiable, and that it satisfies the Hamilton-Jacobi-Isaacs equation - the curse of dimensionality is overcome and the optimal strategies are obtained. The cases of N=M and N>M are considered. In the latter case, cooperative guidance strategies are also developed in order for the pursuers to exploit their numerical advantage. This work provides a foundation to formally analyze complex and high-dimensional conflicts between teams of N pursuers and M evaders by means of differential game theory.