
Showing papers on "Bellman equation" published in 2021


Journal ArticleDOI
TL;DR: A novel distributed policy iteration algorithm is established for infinite-horizon optimal control problems of continuous-time nonlinear systems; it improves the iterative control laws one at a time, instead of updating all control laws in each iteration as traditional policy iteration algorithms do, which relieves the computational burden of each iteration.
Abstract: In this article, a novel distributed policy iteration algorithm is established for infinite horizon optimal control problems of continuous-time nonlinear systems. In each iteration of the developed distributed policy iteration algorithm, only one controller's control law is updated and the other controllers' control laws remain unchanged. The main contribution of the present algorithm is to improve the iterative control laws one by one, instead of updating all the control laws in each iteration as traditional policy iteration algorithms do, which effectively relieves the computational burden of each iteration. The properties of the distributed policy iteration algorithm for continuous-time nonlinear systems are analyzed, and the admissibility of the present method is also established. Monotonicity, convergence, and optimality are discussed, showing that the iterative value function converges nonincreasingly to the solution of the Hamilton–Jacobi–Bellman equation. Finally, numerical simulations are conducted to illustrate the effectiveness of the proposed method.
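Illustrative sketch (not from the paper): the paper works in continuous time through the HJB equation, but the one-controller-at-a-time idea can be shown on a small discrete-time, tabular problem in which two controllers share a randomly generated MDP and only one policy is improved per iteration while the other is held fixed. Every name, dimension, and the random MDP below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9                 # states, actions per controller, discount

P = rng.dirichlet(np.ones(nS), size=(nS, nA, nA))   # P[s, a1, a2, s']
R = rng.uniform(0.0, 1.0, size=(nS, nA, nA))        # shared stage reward

def evaluate(pi1, pi2):
    """Exact evaluation of the joint deterministic policy (pi1, pi2)."""
    P_pi = P[np.arange(nS), pi1, pi2]
    r_pi = R[np.arange(nS), pi1, pi2]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi1 = np.zeros(nS, dtype=int)
pi2 = np.zeros(nS, dtype=int)
stable = 0

for k in range(200):
    V = evaluate(pi1, pi2)                # policy evaluation of the joint policy
    Q = R + gamma * P @ V                 # Q[s, a1, a2]
    if k % 2 == 0:                        # improve controller 1, hold controller 2
        new = np.argmax(Q[np.arange(nS), :, pi2], axis=1)
        changed = not np.array_equal(new, pi1)
        pi1 = new
    else:                                 # improve controller 2, hold controller 1
        new = np.argmax(Q[np.arange(nS), pi1, :], axis=1)
        changed = not np.array_equal(new, pi2)
        pi2 = new
    stable = 0 if changed else stable + 1
    if stable >= 2:                       # neither controller can improve any more
        break

print("iterations:", k + 1, "value:", evaluate(pi1, pi2))
```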

62 citations


Posted Content
TL;DR: A method is proposed to construct confidence intervals (CIs) for a policy's value in infinite-horizon settings where the number of decision points diverges to infinity; applied to a dataset from mobile health studies, it suggests that reinforcement learning algorithms could help improve patients' health status.
Abstract: Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity. We propose to model the state-action value function (Q-function) associated with a policy via a series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status. A Python implementation of the proposed procedure is available at https://github.com/shengzhang37/SAVE.
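Illustrative sketch (not the SAVE procedure): a linear series/sieve model of the Q-function can already produce a point estimate of a policy's value together with a Wald-type confidence interval, via an LSTD-style estimating equation and a plug-in sandwich variance. The simulated transitions, the feature map phi, the target policy pi, and the initial state below are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy batch of transitions (s, a, r, s') collected under some behaviour policy.
n, d_s, gamma = 2000, 3, 0.9
S  = rng.normal(size=(n, d_s))
A  = rng.integers(0, 2, size=n)
S2 = 0.8 * S + rng.normal(scale=0.3, size=(n, d_s))
R  = S.sum(axis=1) + 0.5 * A + rng.normal(scale=0.1, size=n)

def pi(s):                       # deterministic target policy (assumed)
    return (s.sum(axis=1) > 0).astype(int)

def phi(s, a):                   # series/sieve basis for the Q-function (assumed)
    return np.hstack([s, s**2, a[:, None] * s, np.ones((len(s), 1))])

X, Xp = phi(S, A), phi(S2, pi(S2))

# LSTD-type estimating equation: (1/n) sum phi (phi - gamma*phi')' beta = (1/n) sum phi r
A_hat = X.T @ (X - gamma * Xp) / n
b_hat = X.T @ R / n
beta  = np.linalg.solve(A_hat, b_hat)

# Point estimate of the policy's value at an initial state s0.
s0 = np.zeros((1, d_s))
c  = phi(s0, pi(s0))[0]
v_hat = c @ beta

# Plug-in sandwich variance and a 95% Wald-type confidence interval.
eps   = R + gamma * Xp @ beta - X @ beta          # temporal-difference residuals
Omega = (X * (eps**2)[:, None]).T @ X / n
Ainv  = np.linalg.inv(A_hat)
se    = np.sqrt(c @ (Ainv @ Omega @ Ainv.T / n) @ c)
print(f"value estimate {v_hat:.3f}, 95% CI [{v_hat - 1.96*se:.3f}, {v_hat + 1.96*se:.3f}]")
```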

50 citations


Journal ArticleDOI
TL;DR: A tensor decomposition approach for the solution of high-dimensional, fully nonlinear Hamilton-Jacobi-Bellman equations arising in optimal feedback control of nonlinear dynamics is presented in this article.
Abstract: A tensor decomposition approach for the solution of high-dimensional, fully nonlinear Hamilton--Jacobi--Bellman equations arising in optimal feedback control of nonlinear dynamics is presented. The...

50 citations


Posted Content
TL;DR: The correspondence between CMKV-MDP and a general lifted MDP on the space of probability measures is proved, and the dynamic programming Bellman fixed point equation satisfied by the value function is established.
Abstract: We develop an exhaustive study of Markov decision processes (MDPs) under mean field interaction both on states and actions in the presence of common noise, and when optimization is performed over open-loop controls on infinite horizon. Such a model, called CMKV-MDP for conditional McKean-Vlasov MDP, arises and is obtained here rigorously with a rate of convergence as the asymptotic problem of N cooperative agents controlled by a social planner/influencer that observes the environment noises but not necessarily the individual states of the agents. We highlight the crucial role of relaxed controls and the randomization hypothesis for this class of models with respect to classical MDP theory. We prove the correspondence between the CMKV-MDP and a general lifted MDP on the space of probability measures, and establish the dynamic programming Bellman fixed point equation satisfied by the value function, as well as the existence of ε-optimal randomized feedback controls. The arguments of proof involve an original measurable optimal coupling for the Wasserstein distance. This provides a procedure for learning strategies in a large population of interacting collaborative agents. MSC Classification: 90C40, 49L20.

43 citations


BookDOI
TL;DR: In this paper, a unified framework for the study of multilevel mixed integer linear optimization problems and multistage stochastic MILO problems with recourse is introduced, which highlights the common mathematical structure of the two problems and allows for the development of a common algorithmic framework.
Abstract: We introduce a unified framework for the study of multilevel mixed integer linear optimization problems and multistage stochastic mixed integer linear optimization problems with recourse. The framework highlights the common mathematical structure of the two problems and allows for the development of a common algorithmic framework. Focusing on the two-stage case, we investigate, in particular, the nature of the value function of the second-stage problem, highlighting its connection to dual functions and the theory of duality for mixed integer linear optimization problems, and summarize different reformulations. We then present two main solution techniques, one based on a Benders-like decomposition to approximate either the risk function or the value function, and the other one based on cutting plane generation.
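For orientation, the second-stage value function discussed above is, in standard two-stage notation (assumed here, not taken from the paper), the object that the Benders-like decomposition approximates from below with cutting planes:

```latex
\min_{x \in X} \; c^{\top} x \;+\; \mathbb{E}_{\omega}\!\left[\, Q(x,\omega) \,\right],
\qquad
Q(x,\omega) \;=\; \min_{y \in Y} \left\{\, q(\omega)^{\top} y \;:\; W(\omega)\, y \,\ge\, h(\omega) - T(\omega)\, x \,\right\}.
```

When the second stage contains integer variables, Q(·, ω) is in general nonconvex, which is why the duality theory for mixed integer linear optimization enters the reformulations summarized in the paper.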

41 citations


Journal ArticleDOI
TL;DR: A sliding-mode surface (SMS)-based approximate optimal control scheme is developed for a large class of nonlinear systems affected by unknown mismatched perturbations, and its stability is proved using Lyapunov's direct method.
Abstract: This article develops a novel sliding-mode surface (SMS)-based approximate optimal control scheme for a large class of nonlinear systems affected by unknown mismatched perturbations. An observer-based perturbation estimation procedure is employed to establish the online updated value function. The solution to the Hamilton–Jacobi–Bellman equation is approximated by an SMS-based critic neural network whose weight error dynamics is designed to be asymptotically stable by nested update laws. The sliding-mode control strategy is combined with the approximate optimal control design procedure to obtain a faster control action. Stability is proved based on Lyapunov's direct method. Simulation results show the effectiveness of the developed control scheme.

40 citations


Journal ArticleDOI
TL;DR: An off-policy reinforcement learning (RL) algorithm is established to solve discrete-time N-player nonzero-sum (NZS) games with completely unknown dynamics, and the existence of the Nash equilibrium is proved.
Abstract: In this article, an off-policy reinforcement learning (RL) algorithm is established to solve discrete-time $N$-player nonzero-sum (NZS) games with completely unknown dynamics. The $N$-coupled generalized algebraic Riccati equations (GARE) are derived, and the policy iteration (PI) algorithm is then used to obtain the $N$-tuple of iterative control laws and iterative value functions. Since the system dynamics is required in the PI algorithm, an off-policy RL method is developed for discrete-time $N$-player NZS games. The off-policy $N$-coupled Hamilton-Jacobi (HJ) equation is derived based on quadratic value functions. Using the Kronecker product, the $N$-coupled HJ equation is decomposed into an unknown-parameter part and a system operation data part, which allows the $N$-coupled HJ equation to be solved independently of the system dynamics. Least squares is used to calculate the iterative value function and the $N$-tuple of iterative controls. The existence of the Nash equilibrium is proved. The performance of the proposed method for discrete-time NZS games with unknown dynamics is demonstrated by simulation examples.
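Illustrative sketch (not the paper's N-coupled solver): the Kronecker-product step can be seen on a single quadratic value function, where the identity x'Px = (x ⊗ x)' vec(P) turns the Bellman equation along measured data into an ordinary least-squares problem that never uses the system matrices. The toy system, gain, and discount below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear system x+ = A x + B u under a fixed feedback u = -K x; A and B are
# used only to generate data, never by the estimator itself.
n, m, gamma = 3, 1, 0.95
A = 0.8 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
K = 0.05 * rng.standard_normal((m, n))
Qc, Rc = np.eye(n), np.eye(m)

T  = 500
Xs = rng.standard_normal((T, n))          # sampled states
Xn = Xs @ (A - B @ K).T                   # closed-loop successor states

# Bellman equation for V(x) = x' P x under u = -K x:
#   x' P x = x'(Qc + K' Rc K) x + gamma * x+' P x+
# With x' P x = kron(x, x)' vec(P), this is linear in vec(P).
Phi  = np.array([np.kron(x, x) for x in Xs])
Phin = np.array([np.kron(x, x) for x in Xn])
cost = np.einsum('ti,ij,tj->t', Xs, Qc + K.T @ Rc @ K, Xs)

vecP, *_ = np.linalg.lstsq(Phi - gamma * Phin, cost, rcond=None)
P = vecP.reshape(n, n)
P = 0.5 * (P + P.T)                       # keep the symmetric part
print(P)
```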

38 citations


Journal ArticleDOI
TL;DR: In this article, an event-triggered adaptive dynamic programming (ADP) algorithm is developed to solve the tracking control problem for partially unknown constrained uncertain systems, where the learning of neural network weights not only relaxes the initial admissible control but also executes only when the predefined execution rule is violated.
Abstract: An event-triggered adaptive dynamic programming (ADP) algorithm is developed in this article to solve the tracking control problem for partially unknown constrained uncertain systems. First, an augmented system is constructed, and the solution of the optimal tracking control problem of the uncertain system is transformed into an optimal regulation of the nominal augmented system with a discounted value function. Integral reinforcement learning is employed to avoid the requirement of the augmented drift dynamics. Second, the event-triggered ADP is adopted for its implementation, where the learning of neural network weights not only relaxes the initial admissible control but also executes only when the predefined execution rule is violated. Third, the tracking error and the weight estimation error are proved to be uniformly ultimately bounded, and the existence of a lower bound for the interexecution times is analyzed. Finally, simulation results demonstrate the effectiveness of the present event-triggered ADP method.
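Illustrative sketch (not the paper's design): the event-triggering mechanism itself, stripped of the ADP weight learning, amounts to holding the control between events and recomputing it only when the gap between the current state and the last sampled state violates a rule. The linear system, gain, and trigger rule below are assumptions.

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])            # discretised double integrator (assumed)
B = np.array([[0.005],
              [0.1]])
K = np.array([[5.0, 3.5]])            # some stabilising state-feedback gain (assumed)

x = np.array([1.0, 0.0])
x_event = x.copy()                    # state at the last triggering instant
u = -K @ x_event
events = 0

for k in range(300):
    gap = np.linalg.norm(x - x_event)
    if gap > 0.05 * np.linalg.norm(x) + 1e-3:   # simple relative trigger rule (assumed)
        x_event = x.copy()                      # sample the state and
        u = -K @ x_event                        # update the control only now
        events += 1
    x = A @ x + (B @ u).ravel()                 # control is held between events

print(f"{events} control updates over 300 steps, final state {x}")
```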

35 citations


Journal ArticleDOI
TL;DR: A unified deep learning method is introduced that solves dynamic economic models by casting them as nonlinear regression equations for three fundamental objects of economic dynamics: lifetime reward functions, Bellman equations, and Euler equations.

34 citations


Journal ArticleDOI
TL;DR: This work analyzes both the standard plug-in approach to this problem and a more robust variant, and establishes non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observations of state-transitions and rewards.
Abstract: Markov reward processes (MRPs) are used to model stochastic phenomena arising in operations research, control engineering, robotics, and artificial intelligence, as well as communication and transportation networks. In many of these cases, such as in the policy evaluation problem encountered in reinforcement learning, the goal is to estimate the long-term value function of such a process without access to the underlying population transition and reward functions. Working with samples generated under the synchronous model, we study the problem of estimating the value function of an infinite-horizon discounted MRP with finite state space in the $\ell_\infty$-norm. We analyze both the standard plug-in approach to this problem and a more robust variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observations of state transitions and rewards. We show that these approaches are minimax-optimal up to constant factors over natural sub-classes of MRPs. Our analysis makes use of a leave-one-out decoupling argument tailored to the policy evaluation problem, one which may be of independent interest.
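Illustrative sketch: the standard plug-in approach analyzed above simply solves the empirical Bellman equation built from synchronous samples. The toy MRP, sample size, and reward noise below are assumptions; the robust variant and the instance-dependent bounds are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small "true" MRP, unknown to the estimator.
nS, gamma, N = 6, 0.9, 200               # states, discount, samples per state
P = rng.dirichlet(np.ones(nS), size=nS)  # transition matrix
r = rng.uniform(0.0, 1.0, size=nS)       # mean rewards
V_true = np.linalg.solve(np.eye(nS) - gamma * P, r)

# Synchronous (generative) model: for every state, draw N next-states and rewards.
P_hat = np.zeros((nS, nS))
r_hat = np.zeros(nS)
for s in range(nS):
    nxt = rng.choice(nS, size=N, p=P[s])
    P_hat[s] = np.bincount(nxt, minlength=nS) / N
    r_hat[s] = np.mean(r[s] + rng.normal(scale=0.1, size=N))

# Plug-in estimate: solve V = r_hat + gamma * P_hat V.
V_hat = np.linalg.solve(np.eye(nS) - gamma * P_hat, r_hat)
print("ell_infinity error:", np.max(np.abs(V_hat - V_true)))
```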

33 citations


Journal ArticleDOI
TL;DR: In this article, the adaptive control problem for continuous-time nonlinear systems described by differential equations is studied and a learning-based control algorithm is proposed to learn robust optimal controllers directly from real-time data.
Abstract: This article studies the adaptive optimal control problem for continuous-time nonlinear systems described by differential equations. A key strategy is to exploit the value iteration (VI) method proposed initially by Bellman in 1957 as a fundamental tool to solve dynamic programming problems. However, previous VI methods are all exclusively devoted to the Markov decision processes and discrete-time dynamical systems. In this article, we aim to fill up the gap by developing a new continuous-time VI method that will be applied to address the adaptive or nonadaptive optimal control problems for continuous-time systems described by differential equations. Like the traditional VI, the continuous-time VI algorithm retains the nice feature that there is no need to assume the knowledge of an initial admissible control policy. As a direct application of the proposed VI method, a new class of adaptive optimal controllers is obtained for nonlinear systems with totally unknown dynamics. A learning-based control algorithm is proposed to show how to learn robust optimal controllers directly from real-time data. Finally, two examples are given to illustrate the efficacy of the proposed methodology.
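For reference, the object the continuous-time value iteration converges to is the HJB equation; in the standard input-affine, quadratic-control-cost setting (notation assumed here, not taken from the article) it reads:

```latex
0 \;=\; \min_{u} \Big[\, q(x) + u^{\top} R\, u + \nabla V(x)^{\top}\big( f(x) + g(x)\,u \big) \Big],
\qquad
u^{*}(x) \;=\; -\tfrac{1}{2}\, R^{-1} g(x)^{\top} \nabla V(x).
```

The appeal of the VI route described above is that the iteration can be started from any value function, with no initial admissible control policy required.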

Journal ArticleDOI
TL;DR: Generalized value iteration with a discount factor is developed for optimal control of discrete-time nonlinear systems; it is initialized with a positive definite value function rather than zero, and a convergence analysis of the discounted value function sequence is provided.

Journal ArticleDOI
TL;DR: In this article, a novel formulation of the value function is presented for the optimal tracking problem (TP) of nonlinear discrete-time systems, and the optimal control policy can be deduced without considering the reference control input.

Journal Article
TL;DR: This work proposes to reweight experiences based on their likelihood under the stationary distribution of the current policy, using a likelihood-free density ratio estimator over the replay buffer to assign the prioritization weights.
Abstract: The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning. In this work, we propose to reweight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias and variance in practice, we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the ratios as the prioritization weights. We apply the proposed approach empirically on three competitive methods, Soft Actor Critic (SAC), Twin Delayed Deep Deterministic policy gradient (TD3) and Data-regularized Q (DrQ), over 11 tasks from OpenAI gym and DeepMind control suite. We achieve superior sample complexity on 35 out of 45 method-task combinations compared to the best baseline and similar sample complexity on the remaining 10.
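Illustrative sketch (not the paper's training loop): the likelihood-free density ratio between on-policy and replay-buffer experiences can be obtained from a probabilistic classifier, and the resulting ratios used as prioritization weights in the TD loss. The synthetic feature distributions and the logistic-regression estimator below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Stand-ins for (state, action) features: samples from the current policy's
# stationary distribution vs. samples stored in the replay buffer.
d = 8
onpolicy = rng.normal(loc=0.5, size=(1000, d))
buffer   = rng.normal(loc=0.0, size=(5000, d))

# Likelihood-free density-ratio estimation via a classifier:
#   w(x) = d_pi(x) / d_buffer(x) ~ p(x) / (1 - p(x)) * (n_buffer / n_onpolicy)
X = np.vstack([onpolicy, buffer])
y = np.concatenate([np.ones(len(onpolicy)), np.zeros(len(buffer))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

p = clf.predict_proba(buffer)[:, 1]
w = p / (1.0 - p) * (len(buffer) / len(onpolicy))
w /= w.mean()                     # normalise before weighting the TD loss

# e.g. weighted TD objective:  mean_i  w_i * (Q(s_i, a_i) - td_target_i)**2
print("prioritisation weights, min/mean/max:", w.min(), w.mean(), w.max())
```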

Journal ArticleDOI
TL;DR: In this article, the authors developed algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming (DP) and provided a theoretical justification of these algorithms.
Abstract: This paper develops algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming (DP). Differently from the classical approximate DP approach, we first approximate the optimal policy by means of neural networks in the spirit of deep reinforcement learning, and then the value function by Monte Carlo regression. This is achieved in the DP recursion by performance or hybrid iteration, and regress now or later/quantization methods from numerical probabilities. We provide a theoretical justification of these algorithms. Consistency and rate of convergence for the control and value function estimates are analyzed and expressed in terms of the universal approximation error of the neural networks. Numerical results on various applications are presented in a companion paper [2] and illustrate the performance of our algorithms.

Journal ArticleDOI
TL;DR: In the present GACL optimal control method, three iteration processes (global iteration, local iteration, and interior iteration) are established for the first time to obtain the optimal energy control law.
Abstract: This article is concerned with a new generalized actor-critic learning (GACL) optimal control method. It aims at optimal energy control and management for smart home systems, which is expected to minimize the consumption cost for home users. In the present GACL optimal control method, three iteration processes (global iteration, local iteration, and interior iteration) are established for the first time to obtain the optimal energy control law. The main contribution of the developed method is to establish a common iteration structure for both value and policy iterations in adaptive dynamic programming, based on a control law sequence in each iteration for periodic time-varying systems instead of a single control law, which simultaneously accelerates the convergence rate. The monotonicity, convergence, and optimality of the iterative value function for the GACL optimal control method are proven. Finally, numerical results and comparisons are displayed to show the superiority of the developed method.

Journal ArticleDOI
TL;DR: A novel method, event-triggered heuristic dynamic programming (ETHDP), is applied to derive the optimal control policy and two neural networks are utilized to approximate the value function and control law, respectively.
Abstract: This article considers the problem of event-triggered optimal control for discrete-time switched nonlinear systems with constrained control input. First, an event-triggered condition is given to make the closed-loop switched system asymptotically stable. Second, a novel method, event-triggered heuristic dynamic programming (ETHDP), is applied to derive the optimal control policy. Two neural networks (NNs) are utilized to approximate the value function and control law, respectively. The weights of the two NNs are updated only when the event-triggered condition is violated, which notably decreases the computation and transmission load of the networks. A proof of the convergence of the ETHDP is also carried out. Finally, the effectiveness of the proposed method is verified by an example.

Journal ArticleDOI
TL;DR: Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton–Jacobi–Bellman (HJB) equations, which is notoriously difficult when the state dimension is large.
Abstract: Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton--Jacobi--Bellman (HJB) equations, which are notoriously difficult when the state dimension is large. Ex...

Journal ArticleDOI
Jun Moon
TL;DR: The generalized risk-sensitive dynamic programming principle for the value function is obtained via the backward semigroup associated with the BSDE, and the corresponding value function is shown to be a viscosity solution of the Hamilton–Jacobi–Bellman equation.
Abstract: In this article, we consider the generalized risk-sensitive optimal control problem, where the objective functional is defined by the controlled backward stochastic differential equation (BSDE) with quadratic growth coefficient. We extend the earlier results of the risk-sensitive optimal control problem to the case of the objective functional given by the controlled BSDE. Note that the risk-neutral stochastic optimal control problem corresponds to the BSDE objective functional with linear growth coefficient, which can be viewed as a special case of the article. We obtain the generalized risk-sensitive dynamic programming principle for the value function via the backward semigroup associated with the BSDE. Then we show that the corresponding value function is a viscosity solution to the Hamilton–Jacobi–Bellman equation. Under an additional parameter condition, the viscosity solution is unique, which implies that the solution characterizes the value function. We apply the theoretical results to the risk-sensitive European option pricing problem.

Journal ArticleDOI
TL;DR: It is proved that viscosity solutions of Hamilton--Jacobi--Bellman (HJB) equations, corresponding either to deterministic optimal control problems for systems of $n$ particles or to stochastic optimal ...
Abstract: We prove that viscosity solutions of Hamilton--Jacobi--Bellman (HJB) equations, corresponding either to deterministic optimal control problems for systems of $n$ particles or to stochastic optimal ...

Journal ArticleDOI
01 Jan 2021
TL;DR: A novel model-free Q-learning based approach is developed to solve the H∞ tracking problem for linear discrete-time systems, and it is proved that the probing noises used to maintain the persistence of excitation (PE) condition do not result in any bias.
Abstract: In this letter, a novel model-free Q-learning based approach is developed to solve the H∞ tracking problem for linear discrete-time systems. A new exponential discounted value function is introduced that includes the cost of the whole control input and tracking error. The tracking Bellman equation and the game algebraic Riccati equation (GARE) are derived. The solution to the GARE leads to the feedback and feedforward parts of the control input. A Q-learning algorithm is then developed to learn the solution of the GARE online without requiring any knowledge of the system dynamics. Convergence of the algorithm is analyzed, and it is also proved that probing noises in maintaining the persistence of excitation (PE) condition do not result in any bias. An example of the F-16 aircraft short period dynamics is developed to validate the proposed algorithm.

Proceedings Article
18 Jul 2021
TL;DR: This work proposes a novel tensorised formulation of the Bellman equation, which gives rise to the method Tesseract; it views the Q-function as a tensor whose modes correspond to the action spaces of the different agents.
Abstract: Reinforcement Learning in large action spaces is a challenging problem. This is especially true for cooperative multi-agent reinforcement learning (MARL), which often requires tractable learning while respecting various constraints like communication budget and information about other agents. In this work, we focus on the fundamental hurdle affecting both value-based and policy-gradient approaches: an exponential blowup of the action space with the number of agents. For value-based methods, it poses challenges in accurately representing the optimal value function, thus inducing suboptimality. For policy gradient methods, it renders the critic ineffective and exacerbates the problem of the lagging critic. We show that from a learning theory perspective, both problems can be addressed by accurately representing the associated action-value function with a low-complexity hypothesis class. This requires accurately modelling the agent interactions in a sample efficient way. To this end, we propose a novel tensorised formulation of the Bellman equation. This gives rise to our method Tesseract, which views the Q-function as a tensor whose modes correspond to the action spaces of the different agents. Algorithms derived from Tesseract decompose the Q-tensor across the agents and utilise low-rank tensor approximations to model the agent interactions relevant to the task. We provide PAC analysis for Tesseract-based algorithms and highlight their relevance to the class of rich observation MDPs. Empirical results in different domains confirm the gains in sample efficiency using Tesseract as supported by the theory.
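Illustrative sketch (representation only, not the learning algorithm or its PAC analysis): for a fixed state, the joint-action Q-values of several agents form a tensor with one mode per agent, and a rank-r CP factorisation stores far fewer parameters than the full tensor. Agent and action counts, the rank, and the random factors below are assumptions; in the actual method the per-agent factors would be produced by networks conditioned on the state.

```python
import numpy as np

rng = np.random.default_rng(5)

# 3 agents with 6 actions each: the joint-action Q "slice" for one state is a
# 6 x 6 x 6 tensor (216 entries); a rank-2 CP model keeps only 3 * 6 * 2 numbers.
n_agents, n_actions, rank = 3, 6, 2
factors = [rng.standard_normal((n_actions, rank)) for _ in range(n_agents)]

def q_joint(a1, a2, a3):
    """Q(a1, a2, a3) = sum_r u1[a1, r] * u2[a2, r] * u3[a3, r]."""
    return float(np.sum(factors[0][a1] * factors[1][a2] * factors[2][a3]))

# The full tensor, if ever materialised, is an einsum over the per-agent factors.
Q_full = np.einsum('ar,br,cr->abc', *factors)
assert np.isclose(Q_full[1, 2, 3], q_joint(1, 2, 3))

print("full tensor entries:", Q_full.size,
      "| CP parameters:", sum(f.size for f in factors))
```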

Journal ArticleDOI
TL;DR: In this paper, a family of approximate equilibrium strategies is constructed associated with partitions of the time intervals, and an equilibrium Hamilton-Jacobi-Bellman (HJB) equation is derived, through which the equilibrium value function and equilibrium strategy are obtained.
Abstract: An optimal control problem is considered for a stochastic differential equation with the cost functional determined by a backward stochastic Volterra integral equation (BSVIE, for short). This kind of cost functional can cover the general discounting (including exponential and non-exponential) situations with a recursive feature. It is known that such a problem is time-inconsistent in general. Therefore, instead of finding a global optimal control, we look for a time-consistent locally near optimal equilibrium strategy. With the idea of multi-person differential games, a family of approximate equilibrium strategies is constructed associated with partitions of the time intervals. By sending the mesh size of the time interval partition to zero, an equilibrium Hamilton–Jacobi–Bellman (HJB, for short) equation is derived, through which the equilibrium value function and an equilibrium strategy are obtained. Under certain conditions, a verification theorem is proved and the well-posedness of the equilibrium HJB is established. As a sort of Feynman–Kac formula for the equilibrium HJB equation, a new class of BSVIEs (containing the diagonal value Z(r,r) of Z(⋅,⋅)) is naturally introduced and the well-posedness of such kind of equations is briefly presented.

Journal ArticleDOI
TL;DR: An adaptive learning algorithm based on the policy iteration technique is developed to approximately obtain the Nash equilibrium from real-time data, solving the optimal control problem for nonlinear nonzero-sum differential games without requiring initial admissible policies.
Abstract: This article investigates the optimal control problem for nonlinear nonzero-sum differential games without initial admissible policies, while considering control constraints. An adaptive learning algorithm is thus developed based on the policy iteration technique to approximately obtain the Nash equilibrium using real-time data. A two-player continuous-time system is used to present this approximate mechanism, which is implemented as a critic–actor architecture for every player. The constraint is incorporated into the optimization by introducing a nonquadratic value function, and the associated constrained Hamilton–Jacobi equation is derived. The critic neural network (NN) and actor NN are utilized to learn the value function and the optimal control policy, respectively, in light of novel weight tuning laws. To guarantee stability during the learning phase, two stable operators are designed for the two actors. The proposed algorithm is proved to be convergent as a Newton iteration, and the stability of the closed-loop system is ensured by Lyapunov analysis. Finally, two simulation examples demonstrate the effectiveness of the proposed learning scheme by considering different constraint scenarios.
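For context, the nonquadratic value function mentioned above is commonly chosen in constrained ADP so that the resulting control law saturates at the bound; a standard choice (with input bound λ and diagonal R, notation assumed and possibly different from this paper's exact construction) is:

```latex
W(u) \;=\; 2 \int_{0}^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^{\!\top} R \,\mathrm{d}v,
\qquad
u^{*}(x) \;=\; -\lambda \tanh\!\Big( \tfrac{1}{2\lambda}\, R^{-1} g(x)^{\top} \nabla V(x) \Big).
```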

Journal ArticleDOI
TL;DR: The study indicates that individual human movements can be predicted with low error using an infinite-horizon optimal control problem with constraints on the shoulder movement.
Abstract: This brief presents an inverse optimal control methodology and its application to training a predictive model of human motor control from a manipulation task. It introduces a convex formulation for learning both objective function and constraints of an infinite-horizon constrained optimal control problem with nonlinear system dynamics. The inverse approach utilizes Bellman’s principle of optimality to formulate the infinite-horizon optimal control problem as a shortest path problem and the Lagrange multipliers to identify constraints. We highlight the key benefit of using the shortest path formulation, i.e., the possibility of training the predictive model with short and selected trajectory segments. The method is applied to training a predictive model of movements of a human subject from a manipulation task. The study indicates that individual human movements can be predicted with low error using an infinite-horizon optimal control problem with constraints on the shoulder movement.

Journal ArticleDOI
TL;DR: It is proved that the ADP-based ETOC ensures that the CNN weight errors and the system states are semi-globally uniformly ultimately bounded in probability.
Abstract: For nonlinear Ito-type stochastic systems, the problem of event-triggered optimal control (ETOC) is studied in this paper, and the adaptive dynamic programming (ADP) approach is explored to implement it. The value function of the Hamilton–Jacobi–Bellman (HJB) equation is approximated by a critic neural network (CNN). Moreover, a new event-triggering scheme is proposed, which can be used to design the ETOC directly via the solution of the HJB equation. By utilizing the Lyapunov direct method, it is proved that the ADP-based ETOC ensures that the CNN weight errors and the system states are semi-globally uniformly ultimately bounded in probability. Furthermore, an upper bound is given on the predetermined cost function. Specifically, there has been no published literature on the ETOC for nonlinear Ito-type stochastic systems via the ADP method; this work is the first attempt to fill that gap. Finally, the effectiveness of the proposed method is illustrated through two numerical examples.

Journal ArticleDOI
TL;DR: This paper models the condition-based maintenance problem as a discrete-time continuous-state MDP without discretizing the deterioration condition of the system, and proposes an RL algorithm to minimize the long-run average cost.

Journal ArticleDOI
TL;DR: These estimates show the quasi-optimality of the method, and provide one with an adaptive finite element method that only assumes that the solution of the Hamilton-Jacobi-Bellman equation belongs to $H^2$.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed two policy iteration methods, called differential PI (DPI) and integral PI (IPI), for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ODEs.

Journal ArticleDOI
TL;DR: It is shown that a POMDP policy inherently leverages the notion of the value of information (VoI) to guide observational actions in an optimal way at every decision step, and that the permanent or intermittent information provided by structural health monitoring (SHM) or inspection visits, respectively, can only improve the cost of this policy in the long term.