
Showing papers on "Bellman equation" published in 2021


Journal ArticleDOI
TL;DR: A novel distributed policy iteration algorithm is established for infinite-horizon optimal control problems of continuous-time nonlinear systems; it improves the iterative control laws one at a time, instead of updating all control laws in each iteration as traditional policy iteration algorithms do, which relieves the computational burden of each iteration.
Abstract: In this article, a novel distributed policy iteration algorithm is established for infinite horizon optimal control problems of continuous-time nonlinear systems. In each iteration of the developed distributed policy iteration algorithm, only one controller's control law is updated and the other controllers' control laws remain unchanged. The main contribution of the present algorithm is to improve the iterative control laws one by one, instead of updating all the control laws in each iteration as traditional policy iteration algorithms do, which effectively relieves the computational burden of each iteration. The properties of the distributed policy iteration algorithm for continuous-time nonlinear systems are analyzed, and the admissibility of the present method is also established. Monotonicity, convergence, and optimality are discussed, showing that the iterative value function converges nonincreasingly to the solution of the Hamilton–Jacobi–Bellman equation. Finally, numerical simulations are conducted to illustrate the effectiveness of the proposed method.
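Illustrative sketch (not from the paper): the paper works in continuous time through the HJB equation, but the one-controller-at-a-time idea can be shown on a small discrete-time, tabular problem in which two controllers share a randomly generated MDP and only one policy is improved per iteration while the other is held fixed. Every name, dimension, and the random MDP below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9                 # states, actions per controller, discount

P = rng.dirichlet(np.ones(nS), size=(nS, nA, nA))   # P[s, a1, a2, s']
R = rng.uniform(0.0, 1.0, size=(nS, nA, nA))        # shared stage reward

def evaluate(pi1, pi2):
    """Exact evaluation of the joint deterministic policy (pi1, pi2)."""
    P_pi = P[np.arange(nS), pi1, pi2]
    r_pi = R[np.arange(nS), pi1, pi2]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi1 = np.zeros(nS, dtype=int)
pi2 = np.zeros(nS, dtype=int)
stable = 0

for k in range(200):
    V = evaluate(pi1, pi2)                # policy evaluation of the joint policy
    Q = R + gamma * P @ V                 # Q[s, a1, a2]
    if k % 2 == 0:                        # improve controller 1, hold controller 2
        new = np.argmax(Q[np.arange(nS), :, pi2], axis=1)
        changed = not np.array_equal(new, pi1)
        pi1 = new
    else:                                 # improve controller 2, hold controller 1
        new = np.argmax(Q[np.arange(nS), pi1, :], axis=1)
        changed = not np.array_equal(new, pi2)
        pi2 = new
    stable = 0 if changed else stable + 1
    if stable >= 2:                       # neither controller can improve any more
        break

print("iterations:", k + 1, "value:", evaluate(pi1, pi2))
```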

62 citations


Posted Content
TL;DR: A method is proposed to construct confidence intervals (CIs) for a policy's value in infinite-horizon settings where the number of decision points diverges to infinity; applied to a dataset from mobile health studies, it suggests that reinforcement learning algorithms could help improve patients' health status.
Abstract: Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity. We propose to model the state-action value function (Q-function) associated with a policy via a series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status. A Python implementation of the proposed procedure is available at https://github.com/shengzhang37/SAVE.
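Illustrative sketch (not the SAVE procedure): a linear series/sieve model of the Q-function can already produce a point estimate of a policy's value together with a Wald-type confidence interval, via an LSTD-style estimating equation and a plug-in sandwich variance. The simulated transitions, the feature map phi, the target policy pi, and the initial state below are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy batch of transitions (s, a, r, s') collected under some behaviour policy.
n, d_s, gamma = 2000, 3, 0.9
S  = rng.normal(size=(n, d_s))
A  = rng.integers(0, 2, size=n)
S2 = 0.8 * S + rng.normal(scale=0.3, size=(n, d_s))
R  = S.sum(axis=1) + 0.5 * A + rng.normal(scale=0.1, size=n)

def pi(s):                       # deterministic target policy (assumed)
    return (s.sum(axis=1) > 0).astype(int)

def phi(s, a):                   # series/sieve basis for the Q-function (assumed)
    return np.hstack([s, s**2, a[:, None] * s, np.ones((len(s), 1))])

X, Xp = phi(S, A), phi(S2, pi(S2))

# LSTD-type estimating equation: (1/n) sum phi (phi - gamma*phi')' beta = (1/n) sum phi r
A_hat = X.T @ (X - gamma * Xp) / n
b_hat = X.T @ R / n
beta  = np.linalg.solve(A_hat, b_hat)

# Point estimate of the policy's value at an initial state s0.
s0 = np.zeros((1, d_s))
c  = phi(s0, pi(s0))[0]
v_hat = c @ beta

# Plug-in sandwich variance and a 95% Wald-type confidence interval.
eps   = R + gamma * Xp @ beta - X @ beta          # temporal-difference residuals
Omega = (X * (eps**2)[:, None]).T @ X / n
Ainv  = np.linalg.inv(A_hat)
se    = np.sqrt(c @ (Ainv @ Omega @ Ainv.T / n) @ c)
print(f"value estimate {v_hat:.3f}, 95% CI [{v_hat - 1.96*se:.3f}, {v_hat + 1.96*se:.3f}]")
```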

50 citations


Journal ArticleDOI
TL;DR: A tensor decomposition approach for the solution of high-dimensional, fully nonlinear Hamilton-Jacobi-Bellman equations arising in optimal feedback control of nonlinear dynamics is presented in this article.
Abstract: A tensor decomposition approach for the solution of high-dimensional, fully nonlinear Hamilton--Jacobi--Bellman equations arising in optimal feedback control of nonlinear dynamics is presented. The...

50 citations


Posted Content
TL;DR: The correspondence between CMKV-MDP and a general lifted MDP on the space of probability measures is proved, and the dynamic programming Bellman fixed point equation satisfied by the value function is established.
Abstract: We develop an exhaustive study of Markov decision processes (MDPs) under mean field interaction both on states and actions in the presence of common noise, and when optimization is performed over open-loop controls on infinite horizon. Such a model, called CMKV-MDP for conditional McKean-Vlasov MDP, arises and is obtained here rigorously with a rate of convergence as the asymptotic problem of N cooperative agents controlled by a social planner/influencer that observes the environment noises but not necessarily the individual states of the agents. We highlight the crucial role of relaxed controls and the randomization hypothesis for this class of models with respect to classical MDP theory. We prove the correspondence between the CMKV-MDP and a general lifted MDP on the space of probability measures, and establish the dynamic programming Bellman fixed point equation satisfied by the value function, as well as the existence of ε-optimal randomized feedback controls. The arguments of proof involve an original measurable optimal coupling for the Wasserstein distance. This provides a procedure for learning strategies in a large population of interacting collaborative agents. MSC Classification: 90C40, 49L20.

43 citations


BookDOI
TL;DR: In this paper, a unified framework for the study of multilevel mixed integer linear optimization problems and multistage stochastic MILO problems with recourse is introduced, which highlights the common mathematical structure of the two problems and allows for the development of a common algorithmic framework.
Abstract: We introduce a unified framework for the study of multilevel mixed integer linear optimization problems and multistage stochastic mixed integer linear optimization problems with recourse. The framework highlights the common mathematical structure of the two problems and allows for the development of a common algorithmic framework. Focusing on the two-stage case, we investigate, in particular, the nature of the value function of the second-stage problem, highlighting its connection to dual functions and the theory of duality for mixed integer linear optimization problems, and summarize different reformulations. We then present two main solution techniques, one based on a Benders-like decomposition to approximate either the risk function or the value function, and the other one based on cutting plane generation.
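For orientation, the second-stage value function discussed above is, in standard two-stage notation (assumed here, not taken from the paper), the object that the Benders-like decomposition approximates from below with cutting planes:

```latex
\min_{x \in X} \; c^{\top} x \;+\; \mathbb{E}_{\omega}\!\left[\, Q(x,\omega) \,\right],
\qquad
Q(x,\omega) \;=\; \min_{y \in Y} \left\{\, q(\omega)^{\top} y \;:\; W(\omega)\, y \,\ge\, h(\omega) - T(\omega)\, x \,\right\}.
```

When the second stage contains integer variables, Q(·, ω) is in general nonconvex, which is why the duality theory for mixed integer linear optimization enters the reformulations summarized in the paper.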

41 citations


Journal ArticleDOI
TL;DR: A sliding-mode surface (SMS)-based approximate optimal control scheme is developed for a large class of nonlinear systems affected by unknown mismatched perturbations, and its stability is proved using Lyapunov's direct method.
Abstract: This article develops a novel sliding-mode surface (SMS)-based approximate optimal control scheme for a large class of nonlinear systems affected by unknown mismatched perturbations. An observer-based perturbation estimation procedure is employed to establish the online updated value function. The solution to the Hamilton–Jacobi–Bellman equation is approximated by an SMS-based critic neural network whose weight error dynamics is designed to be asymptotically stable by nested update laws. The sliding-mode control strategy is combined with the approximate optimal control design procedure to obtain a faster control action. Stability is proved based on Lyapunov's direct method. Simulation results show the effectiveness of the developed control scheme.

40 citations


Journal ArticleDOI
TL;DR: An off-policy reinforcement learning (RL) algorithm is established to solve discrete-time N-player nonzero-sum (NZS) games with completely unknown dynamics, and the existence of the Nash equilibrium is proved.
Abstract: In this article, an off-policy reinforcement learning (RL) algorithm is established to solve discrete-time $N$-player nonzero-sum (NZS) games with completely unknown dynamics. The $N$-coupled generalized algebraic Riccati equations (GARE) are derived, and the policy iteration (PI) algorithm is then used to obtain the $N$-tuple of iterative control laws and iterative value functions. Since the system dynamics is required in the PI algorithm, an off-policy RL method is developed for discrete-time $N$-player NZS games. The off-policy $N$-coupled Hamilton-Jacobi (HJ) equation is derived based on quadratic value functions. Using the Kronecker product, the $N$-coupled HJ equation is decomposed into an unknown-parameter part and a system operation data part, which allows the $N$-coupled HJ equation to be solved independently of the system dynamics. Least squares is used to calculate the iterative value function and the $N$-tuple of iterative controls. The existence of the Nash equilibrium is proved. The performance of the proposed method for discrete-time NZS games with unknown dynamics is demonstrated by simulation examples.
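Illustrative sketch (not the paper's N-coupled solver): the Kronecker-product step can be seen on a single quadratic value function, where the identity x'Px = (x ⊗ x)' vec(P) turns the Bellman equation along measured data into an ordinary least-squares problem that never uses the system matrices. The toy system, gain, and discount below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear system x+ = A x + B u under a fixed feedback u = -K x; A and B are
# used only to generate data, never by the estimator itself.
n, m, gamma = 3, 1, 0.95
A = 0.8 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
K = 0.05 * rng.standard_normal((m, n))
Qc, Rc = np.eye(n), np.eye(m)

T  = 500
Xs = rng.standard_normal((T, n))          # sampled states
Xn = Xs @ (A - B @ K).T                   # closed-loop successor states

# Bellman equation for V(x) = x' P x under u = -K x:
#   x' P x = x'(Qc + K' Rc K) x + gamma * x+' P x+
# With x' P x = kron(x, x)' vec(P), this is linear in vec(P).
Phi  = np.array([np.kron(x, x) for x in Xs])
Phin = np.array([np.kron(x, x) for x in Xn])
cost = np.einsum('ti,ij,tj->t', Xs, Qc + K.T @ Rc @ K, Xs)

vecP, *_ = np.linalg.lstsq(Phi - gamma * Phin, cost, rcond=None)
P = vecP.reshape(n, n)
P = 0.5 * (P + P.T)                       # keep the symmetric part
print(P)
```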

38 citations


Journal ArticleDOI
TL;DR: In this article, an event-triggered adaptive dynamic programming (ADP) algorithm is developed to solve the tracking control problem for partially unknown constrained uncertain systems, where the learning of neural network weights not only relaxes the initial admissible control but also executes only when the predefined execution rule is violated.
Abstract: An event-triggered adaptive dynamic programming (ADP) algorithm is developed in this article to solve the tracking control problem for partially unknown constrained uncertain systems. First, an augmented system is constructed, and the solution of the optimal tracking control problem of the uncertain system is transformed into an optimal regulation of the nominal augmented system with a discounted value function. Integral reinforcement learning is employed to avoid the requirement of the augmented drift dynamics. Second, the event-triggered ADP is adopted for its implementation, where the learning of neural network weights not only relaxes the initial admissible control but also executes only when the predefined execution rule is violated. Third, the tracking error and the weight estimation error are proved to be uniformly ultimately bounded, and the existence of a lower bound for the interexecution times is analyzed. Finally, simulation results demonstrate the effectiveness of the present event-triggered ADP method.
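Illustrative sketch (not the paper's design): the event-triggering mechanism itself, stripped of the ADP weight learning, amounts to holding the control between events and recomputing it only when the gap between the current state and the last sampled state violates a rule. The linear system, gain, and trigger rule below are assumptions.

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])            # discretised double integrator (assumed)
B = np.array([[0.005],
              [0.1]])
K = np.array([[5.0, 3.5]])            # some stabilising state-feedback gain (assumed)

x = np.array([1.0, 0.0])
x_event = x.copy()                    # state at the last triggering instant
u = -K @ x_event
events = 0

for k in range(300):
    gap = np.linalg.norm(x - x_event)
    if gap > 0.05 * np.linalg.norm(x) + 1e-3:   # simple relative trigger rule (assumed)
        x_event = x.copy()                      # sample the state and
        u = -K @ x_event                        # update the control only now
        events += 1
    x = A @ x + (B @ u).ravel()                 # control is held between events

print(f"{events} control updates over 300 steps, final state {x}")
```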

35 citations


Journal ArticleDOI
TL;DR: A unified deep learning method is introduced that solves dynamic economic models by casting them as nonlinear regression equations for three fundamental objects of economic dynamics: lifetime reward functions, Bellman equations, and Euler equations.

34 citations


Journal ArticleDOI
TL;DR: This work analyzes both the standard plug-in approach to this problem and a more robust variant, and establishes non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observations of state-transitions and rewards.
Abstract: Markov reward processes (MRPs) are used to model stochastic phenomena arising in operations research, control engineering, robotics, and artificial intelligence, as well as communication and transportation networks. In many of these cases, such as in the policy evaluation problem encountered in reinforcement learning, the goal is to estimate the long-term value function of such a process without access to the underlying population transition and reward functions. Working with samples generated under the synchronous model, we study the problem of estimating the value function of an infinite-horizon discounted MRP with finite state space in the $\ell_\infty$-norm. We analyze both the standard plug-in approach to this problem and a more robust variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observations of state transitions and rewards. We show that these approaches are minimax-optimal up to constant factors over natural sub-classes of MRPs. Our analysis makes use of a leave-one-out decoupling argument tailored to the policy evaluation problem, one which may be of independent interest.
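Illustrative sketch: the standard plug-in approach analyzed above simply solves the empirical Bellman equation built from synchronous samples. The toy MRP, sample size, and reward noise below are assumptions; the robust variant and the instance-dependent bounds are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small "true" MRP, unknown to the estimator.
nS, gamma, N = 6, 0.9, 200               # states, discount, samples per state
P = rng.dirichlet(np.ones(nS), size=nS)  # transition matrix
r = rng.uniform(0.0, 1.0, size=nS)       # mean rewards
V_true = np.linalg.solve(np.eye(nS) - gamma * P, r)

# Synchronous (generative) model: for every state, draw N next-states and rewards.
P_hat = np.zeros((nS, nS))
r_hat = np.zeros(nS)
for s in range(nS):
    nxt = rng.choice(nS, size=N, p=P[s])
    P_hat[s] = np.bincount(nxt, minlength=nS) / N
    r_hat[s] = np.mean(r[s] + rng.normal(scale=0.1, size=N))

# Plug-in estimate: solve V = r_hat + gamma * P_hat V.
V_hat = np.linalg.solve(np.eye(nS) - gamma * P_hat, r_hat)
print("ell_infinity error:", np.max(np.abs(V_hat - V_true)))
```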

33 citations


Journal ArticleDOI
TL;DR: In this article, the adaptive control problem for continuous-time nonlinear systems described by differential equations is studied and a learning-based control algorithm is proposed to learn robust optimal controllers directly from real-time data.
Abstract: This article studies the adaptive optimal control problem for continuous-time nonlinear systems described by differential equations. A key strategy is to exploit the value iteration (VI) method proposed initially by Bellman in 1957 as a fundamental tool to solve dynamic programming problems. However, previous VI methods are all exclusively devoted to the Markov decision processes and discrete-time dynamical systems. In this article, we aim to fill up the gap by developing a new continuous-time VI method that will be applied to address the adaptive or nonadaptive optimal control problems for continuous-time systems described by differential equations. Like the traditional VI, the continuous-time VI algorithm retains the nice feature that there is no need to assume the knowledge of an initial admissible control policy. As a direct application of the proposed VI method, a new class of adaptive optimal controllers is obtained for nonlinear systems with totally unknown dynamics. A learning-based control algorithm is proposed to show how to learn robust optimal controllers directly from real-time data. Finally, two examples are given to illustrate the efficacy of the proposed methodology.
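For reference, the object the continuous-time value iteration converges to is the HJB equation; in the standard input-affine, quadratic-control-cost setting (notation assumed here, not taken from the article) it reads:

```latex
0 \;=\; \min_{u} \Big[\, q(x) + u^{\top} R\, u + \nabla V(x)^{\top}\big( f(x) + g(x)\,u \big) \Big],
\qquad
u^{*}(x) \;=\; -\tfrac{1}{2}\, R^{-1} g(x)^{\top} \nabla V(x).
```

The appeal of the VI route described above is that the iteration can be started from any value function, with no initial admissible control policy required.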

Journal ArticleDOI
TL;DR: Generalized value iteration with a discount factor is developed for optimal control of discrete-time nonlinear systems; it is initialized with a positive definite value function rather than zero, and a convergence analysis of the discounted value function sequence is provided.

Journal ArticleDOI
TL;DR: In this article, a novel formulation of the value function is presented for the optimal tracking problem (TP) of nonlinear discrete-time systems, and the optimal control policy can be deduced without considering the reference control input.

Journal Article
TL;DR: This work proposes to reweight experiences based on their likelihood under the stationary distribution of the current policy, using a likelihood-free density ratio estimator over the replay buffer to assign the prioritization weights.
Abstract: The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning. In this work, we propose to reweight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias and variance in practice, we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the ratios as the prioritization weights. We apply the proposed approach empirically on three competitive methods, Soft Actor Critic (SAC), Twin Delayed Deep Deterministic policy gradient (TD3) and Data-regularized Q (DrQ), over 11 tasks from OpenAI gym and DeepMind control suite. We achieve superior sample complexity on 35 out of 45 method-task combinations compared to the best baseline and similar sample complexity on the remaining 10.
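Illustrative sketch (not the paper's training loop): the likelihood-free density ratio between on-policy and replay-buffer experiences can be obtained from a probabilistic classifier, and the resulting ratios used as prioritization weights in the TD loss. The synthetic feature distributions and the logistic-regression estimator below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Stand-ins for (state, action) features: samples from the current policy's
# stationary distribution vs. samples stored in the replay buffer.
d = 8
onpolicy = rng.normal(loc=0.5, size=(1000, d))
buffer   = rng.normal(loc=0.0, size=(5000, d))

# Likelihood-free density-ratio estimation via a classifier:
#   w(x) = d_pi(x) / d_buffer(x) ~ p(x) / (1 - p(x)) * (n_buffer / n_onpolicy)
X = np.vstack([onpolicy, buffer])
y = np.concatenate([np.ones(len(onpolicy)), np.zeros(len(buffer))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

p = clf.predict_proba(buffer)[:, 1]
w = p / (1.0 - p) * (len(buffer) / len(onpolicy))
w /= w.mean()                     # normalise before weighting the TD loss

# e.g. weighted TD objective:  mean_i  w_i * (Q(s_i, a_i) - td_target_i)**2
print("prioritisation weights, min/mean/max:", w.min(), w.mean(), w.max())
```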

Journal ArticleDOI
TL;DR: In this article, the authors developed algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming (DP) and provided a theoretical justification of these algorithms.
Abstract: This paper develops algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming (DP). Differently from the classical approximate DP approach, we first approximate the optimal policy by means of neural networks in the spirit of deep reinforcement learning, and then the value function by Monte Carlo regression. This is achieved in the DP recursion by performance or hybrid iteration, and regress now or later/quantization methods from numerical probabilities. We provide a theoretical justification of these algorithms. Consistency and rate of convergence for the control and value function estimates are analyzed and expressed in terms of the universal approximation error of the neural networks. Numerical results on various applications are presented in a companion paper [2] and illustrate the performance of our algorithms.

Journal ArticleDOI
TL;DR: In the present GACL optimal control method, three iteration processes (global iteration, local iteration, and interior iteration) are established for the first time to obtain the optimal energy control law.
Abstract: This article is concerned with a new generalized actor-critic learning (GACL) optimal control method. It aims at optimal energy control and management for smart home systems, which is expected to minimize the consumption cost for home users. In the present GACL optimal control method, three iteration processes (global iteration, local iteration, and interior iteration) are established for the first time to obtain the optimal energy control law. The main contribution of the developed method is to establish a common iteration structure for both value and policy iterations in adaptive dynamic programming, based on a control law sequence in each iteration for periodic time-varying systems instead of a single control law, which simultaneously accelerates the convergence rate. The monotonicity, convergence, and optimality of the iterative value function for the GACL optimal control method are proven. Finally, numerical results and comparisons are displayed to show the superiority of the developed method.

Journal ArticleDOI
TL;DR: A novel method, event-triggered heuristic dynamic programming (ETHDP), is applied to derive the optimal control policy and two neural networks are utilized to approximate the value function and control law, respectively.
Abstract: This article considers the problem of event-triggered optimal control for discrete-time switched nonlinear systems with constrained control input. First, an event-triggered condition is given to make the closed-loop switched system asymptotically stable. Second, a novel method, event-triggered heuristic dynamic programming (ETHDP), is applied to derive the optimal control policy. Two neural networks (NNs) are utilized to approximate the value function and control law, respectively. The weights of the two NNs are updated only when the event-triggered condition is violated, which notably decreases the computation and transmission load of the networks. A proof of the convergence of the ETHDP is also carried out. Finally, the effectiveness of the proposed method is verified by an example.

Journal ArticleDOI
TL;DR: Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton–Jacobi–Bellman (HJB) equations, which is notoriously difficult when the state dimension is large.
Abstract: Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton--Jacobi--Bellman (HJB) equations, which are notoriously difficult when the state dimension is large. Ex...

Journal ArticleDOI
Jun Moon
TL;DR: The generalized risk-sensitive dynamic programming principle for the value function is obtained via the backward semigroup associated with the BSDE, and the corresponding value function is shown to be a viscosity solution of the Hamilton–Jacobi–Bellman equation.
Abstract: In this article, we consider the generalized risk-sensitive optimal control problem, where the objective functional is defined by the controlled backward stochastic differential equation (BSDE) with quadratic growth coefficient. We extend the earlier results of the risk-sensitive optimal control problem to the case of the objective functional given by the controlled BSDE. Note that the risk-neutral stochastic optimal control problem corresponds to the BSDE objective functional with linear growth coefficient, which can be viewed as a special case of the article. We obtain the generalized risk-sensitive dynamic programming principle for the value function via the backward semigroup associated with the BSDE. Then we show that the corresponding value function is a viscosity solution to the Hamilton–Jacobi–Bellman equation. Under an additional parameter condition, the viscosity solution is unique, which implies that the solution characterizes the value function. We apply the theoretical results to the risk-sensitive European option pricing problem.

Journal ArticleDOI
TL;DR: It is proved that viscosity solutions of Hamilton--Jacobi--Bellman (HJB) equations, corresponding either to deterministic optimal control problems for systems of $n$ particles or to stochastic optimal ...
Abstract: We prove that viscosity solutions of Hamilton--Jacobi--Bellman (HJB) equations, corresponding either to deterministic optimal control problems for systems of $n$ particles or to stochastic optimal ...

Journal ArticleDOI
01 Jan 2021
TL;DR: A novel model-free Q-learning based approach is developed to solve the H∞ tracking problem for linear discrete-time systems, and it is proved that the probing noises used to maintain the persistence of excitation (PE) condition do not result in any bias.
Abstract: In this letter, a novel model-free Q-learning based approach is developed to solve the H∞ tracking problem for linear discrete-time systems. A new exponential discounted value function is introduced that includes the cost of the whole control input and tracking error. The tracking Bellman equation and the game algebraic Riccati equation (GARE) are derived. The solution to the GARE leads to the feedback and feedforward parts of the control input. A Q-learning algorithm is then developed to learn the solution of the GARE online without requiring any knowledge of the system dynamics. Convergence of the algorithm is analyzed, and it is also proved that probing noises in maintaining the persistence of excitation (PE) condition do not result in any bias. An example of the F-16 aircraft short period dynamics is developed to validate the proposed algorithm.

Proceedings Article
18 Jul 2021
TL;DR: This work proposes a novel tensorised formulation of the Bellman equation, which gives rise to the method Tesseract; it views the Q-function as a tensor whose modes correspond to the action spaces of the different agents.
Abstract: Reinforcement Learning in large action spaces is a challenging problem. This is especially true for cooperative multi-agent reinforcement learning (MARL), which often requires tractable learning while respecting various constraints like communication budget and information about other agents. In this work, we focus on the fundamental hurdle affecting both value-based and policy-gradient approaches: an exponential blowup of the action space with the number of agents. For value-based methods, it poses challenges in accurately representing the optimal value function, thus inducing suboptimality. For policy gradient methods, it renders the critic ineffective and exacerbates the problem of the lagging critic. We show that from a learning theory perspective, both problems can be addressed by accurately representing the associated action-value function with a low-complexity hypothesis class. This requires accurately modelling the agent interactions in a sample efficient way. To this end, we propose a novel tensorised formulation of the Bellman equation. This gives rise to our method Tesseract, which views the Q-function as a tensor whose modes correspond to the action spaces of the different agents. Algorithms derived from Tesseract decompose the Q-tensor across the agents and utilise low-rank tensor approximations to model the agent interactions relevant to the task. We provide PAC analysis for Tesseract-based algorithms and highlight their relevance to the class of rich observation MDPs. Empirical results in different domains confirm the gains in sample efficiency using Tesseract as supported by the theory.
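Illustrative sketch (representation only, not the learning algorithm or its PAC analysis): for a fixed state, the joint-action Q-values of several agents form a tensor with one mode per agent, and a rank-r CP factorisation stores far fewer parameters than the full tensor. Agent and action counts, the rank, and the random factors below are assumptions; in the actual method the per-agent factors would be produced by networks conditioned on the state.

```python
import numpy as np

rng = np.random.default_rng(5)

# 3 agents with 6 actions each: the joint-action Q "slice" for one state is a
# 6 x 6 x 6 tensor (216 entries); a rank-2 CP model keeps only 3 * 6 * 2 numbers.
n_agents, n_actions, rank = 3, 6, 2
factors = [rng.standard_normal((n_actions, rank)) for _ in range(n_agents)]

def q_joint(a1, a2, a3):
    """Q(a1, a2, a3) = sum_r u1[a1, r] * u2[a2, r] * u3[a3, r]."""
    return float(np.sum(factors[0][a1] * factors[1][a2] * factors[2][a3]))

# The full tensor, if ever materialised, is an einsum over the per-agent factors.
Q_full = np.einsum('ar,br,cr->abc', *factors)
assert np.isclose(Q_full[1, 2, 3], q_joint(1, 2, 3))

print("full tensor entries:", Q_full.size,
      "| CP parameters:", sum(f.size for f in factors))
```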

Journal ArticleDOI
TL;DR: In this paper, a family of approximate equilibrium strategies is constructed associated with partitions of the time intervals, and an equilibrium Hamilton-Jacobi-Bellman (HJB) equation is derived, through which the equilibrium value function and equilibrium strategy are obtained.
Abstract: An optimal control problem is considered for a stochastic differential equation with the cost functional determined by a backward stochastic Volterra integral equation (BSVIE, for short). This kind of cost functional can cover the general discounting (including exponential and non-exponential) situations with a recursive feature. It is known that such a problem is time-inconsistent in general. Therefore, instead of finding a global optimal control, we look for a time-consistent locally near optimal equilibrium strategy. With the idea of multi-person differential games, a family of approximate equilibrium strategies is constructed associated with partitions of the time intervals. By sending the mesh size of the time interval partition to zero, an equilibrium Hamilton–Jacobi–Bellman (HJB, for short) equation is derived, through which the equilibrium value function and an equilibrium strategy are obtained. Under certain conditions, a verification theorem is proved and the well-posedness of the equilibrium HJB is established. As a sort of Feynman–Kac formula for the equilibrium HJB equation, a new class of BSVIEs (containing the diagonal value Z(r,r) of Z(⋅,⋅)) is naturally introduced and the well-posedness of such kind of equations is briefly presented.

Journal ArticleDOI
TL;DR: An adaptive learning algorithm based on the policy iteration technique is developed to approximately obtain the Nash equilibrium from real-time data, solving the optimal control problem for nonlinear nonzero-sum differential games without requiring initial admissible policies.
Abstract: This article investigates the optimal control problem for nonlinear nonzero-sum differential games without initial admissible policies, while considering control constraints. An adaptive learning algorithm is thus developed based on the policy iteration technique to approximately obtain the Nash equilibrium using real-time data. A two-player continuous-time system is used to present this approximate mechanism, which is implemented as a critic–actor architecture for every player. The constraint is incorporated into the optimization by introducing a nonquadratic value function, and the associated constrained Hamilton–Jacobi equation is derived. The critic neural network (NN) and actor NN are utilized to learn the value function and the optimal control policy, respectively, in light of novel weight tuning laws. To guarantee stability during the learning phase, two stable operators are designed for the two actors. The proposed algorithm is proved to be convergent as a Newton iteration, and the stability of the closed-loop system is ensured by Lyapunov analysis. Finally, two simulation examples demonstrate the effectiveness of the proposed learning scheme by considering different constraint scenarios.
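For context, the nonquadratic value function mentioned above is commonly chosen in constrained ADP so that the resulting control law saturates at the bound; a standard choice (with input bound λ and diagonal R, notation assumed and possibly different from this paper's exact construction) is:

```latex
W(u) \;=\; 2 \int_{0}^{u} \big( \lambda \tanh^{-1}(v/\lambda) \big)^{\!\top} R \,\mathrm{d}v,
\qquad
u^{*}(x) \;=\; -\lambda \tanh\!\Big( \tfrac{1}{2\lambda}\, R^{-1} g(x)^{\top} \nabla V(x) \Big).
```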

Journal ArticleDOI
TL;DR: The study indicates that individual human movements can be predicted with low error using an infinite-horizon optimal control problem with constraints on the shoulder movement.
Abstract: This brief presents an inverse optimal control methodology and its application to training a predictive model of human motor control from a manipulation task. It introduces a convex formulation for learning both objective function and constraints of an infinite-horizon constrained optimal control problem with nonlinear system dynamics. The inverse approach utilizes Bellman’s principle of optimality to formulate the infinite-horizon optimal control problem as a shortest path problem and the Lagrange multipliers to identify constraints. We highlight the key benefit of using the shortest path formulation, i.e., the possibility of training the predictive model with short and selected trajectory segments. The method is applied to training a predictive model of movements of a human subject from a manipulation task. The study indicates that individual human movements can be predicted with low error using an infinite-horizon optimal control problem with constraints on the shoulder movement.

Journal ArticleDOI
TL;DR: It is proved that the ADP-based ETOC ensures that the CNN weight errors and the system states are semi-globally uniformly ultimately bounded in probability.
Abstract: For nonlinear Ito-type stochastic systems, the problem of event-triggered optimal control (ETOC) is studied in this paper, and the adaptive dynamic programming (ADP) approach is explored to implement it. The value function of the Hamilton–Jacobi–Bellman (HJB) equation is approximated by a critic neural network (CNN). Moreover, a new event-triggering scheme is proposed, which can be used to design the ETOC directly via the solution of the HJB equation. By utilizing the Lyapunov direct method, it is proved that the ADP-based ETOC ensures that the CNN weight errors and the system states are semi-globally uniformly ultimately bounded in probability. Furthermore, an upper bound is given on the predetermined cost function. Specifically, there has been no published literature on the ETOC for nonlinear Ito-type stochastic systems via the ADP method; this work is the first attempt to fill that gap. Finally, the effectiveness of the proposed method is illustrated through two numerical examples.

Journal ArticleDOI
TL;DR: This paper models the condition-based maintenance problem as a discrete-time continuous-state MDP without discretizing the deterioration condition of the system, and proposes an RL algorithm to minimize the long-run average cost.

Journal ArticleDOI
TL;DR: These estimates show the quasi-optimality of the method, and provide one with an adaptive finite element method that only assumes that the solution of the Hamilton-Jacobi-Bellman equation belongs to $H^2$.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed two policy iteration methods, called differential PI (DPI) and integral PI (IPI), for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ODEs.

Journal ArticleDOI
TL;DR: It is shown that a POMDP policy inherently leverages the notion of the value of information (VoI) to guide observational actions in an optimal way at every decision step, and that the permanent or intermittent information provided by structural health monitoring (SHM) or inspection visits, respectively, can only improve the cost of this policy in the long term.