
Showing papers on "Bellman equation published in 2019"


Posted Content
TL;DR: This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps; the regret is independent of the number of states and actions.
Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
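A minimal sketch of the optimistic least-squares backup described above, assuming linear features phi(s, a) in R^d; the helper names and toy data layout are illustrative assumptions, not the authors' code.

```python
# Sketch of one backward step of optimistic least-squares value iteration with
# linear features: ridge regression on Bellman targets plus a UCB-style bonus.
import numpy as np

def optimistic_lsvi_step(transitions, phi, next_values, d, beta, lam=1.0):
    """transitions: list of (s, a, r, s_next); next_values: callable s -> max_a Q_{h+1}(s, a)."""
    Lambda = lam * np.eye(d)                    # regularized Gram matrix
    target_sum = np.zeros(d)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        Lambda += np.outer(f, f)
        target_sum += f * (r + next_values(s_next))
    w = np.linalg.solve(Lambda, target_sum)     # least-squares weights
    Lambda_inv = np.linalg.inv(Lambda)

    def q_value(s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lambda_inv @ f)   # optimism: exploration bonus
        return f @ w + bonus
    return q_value
```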

337 citations


Posted Content
01 Aug 2019
TL;DR: In this article, the authors provide provable characterizations of computational, approximation, and sample size issues with policy gradient methods in the context of discounted Markov Decision Processes (MDPs).
Abstract: Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regard to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. One insight of this work is in formalizing the importance of how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place policy gradient methods on a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.
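For readers who want the analyzed object in concrete form, here is a toy sketch of one exact policy-gradient ascent step for a softmax tabular parameterization on a small, fully known MDP; array names and shapes are illustrative assumptions.

```python
# Exact policy gradient for a softmax tabular policy on a known MDP:
# grad J[s, a] = d_mu(s) / (1 - gamma) * pi(a|s) * A_pi(s, a).
import numpy as np

def softmax_policy(theta):                      # theta: (S, A) logits
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_gradient_step(theta, P, R, mu, gamma=0.9, lr=0.1):
    """P: (S, A, S) transition tensor, R: (S, A) rewards, mu: (S,) start distribution."""
    S, A = R.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)       # state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)          # exact policy evaluation
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    Adv = Q - V[:, None]
    d_mu = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)  # discounted visitation
    grad = (d_mu[:, None] / (1 - gamma)) * pi * Adv
    return theta + lr * grad                    # one ascent step
```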

94 citations


Journal ArticleDOI
TL;DR: A novel off-policy interleaved Q-learning algorithm is presented for solving the optimal control problem of affine nonlinear discrete-time (DT) systems using only the measured data along the system trajectories, and its convergence is rigorously proven.
Abstract: In this paper, a novel off-policy interleaved Q-learning algorithm is presented for solving the optimal control problem of affine nonlinear discrete-time (DT) systems, using only the measured data along the system trajectories. The affine nonlinear feature of the systems, unknown dynamics, and the off-policy learning approach pose tremendous challenges for approximating optimal controllers. To this end, the on-policy Q-learning method for optimal control of affine nonlinear DT systems is reviewed first, and its convergence is rigorously proven. The bias in the solution to the Q-function-based Bellman equation, caused by adding probing noise to the systems to satisfy persistent excitation, is also analyzed for the on-policy Q-learning approach. Then, a behavior control policy is introduced, followed by the proposal of an off-policy Q-learning algorithm. Meanwhile, the convergence of the algorithm and the absence of bias in the solution to the optimal control problem when probing noise is added to the systems are investigated. Third, three neural networks are run by the interleaved Q-learning approach in an actor-critic framework. Thus, a novel off-policy interleaved Q-learning algorithm is derived, and its convergence is proven. Simulation results are given to verify the effectiveness of the proposed method.

86 citations


Journal ArticleDOI
TL;DR: In this paper, a functional central limit theorem (CLT) was established for mean field games, which characterizes the limiting fluctuations around the LLN limit as the unique solution of a linear stochastic PDE.
Abstract: Mean field games (MFGs) describe the limit, as $n$ tends to infinity, of stochastic differential games with $n$ players interacting with one another through their common empirical distribution. Under suitable smoothness assumptions that guarantee uniqueness of the MFG equilibrium, a form of the law of large numbers (LLN), also known as propagation of chaos, has been established to show that the MFG equilibrium arises as the limit of the sequence of empirical measures of the $n$-player game Nash equilibria, including the case when player dynamics are driven by both idiosyncratic and common sources of noise. The proof of convergence relies on the so-called master equation for the value function of the MFG, a partial differential equation on the space of probability measures. In this work, under additional assumptions, we establish a functional central limit theorem (CLT) that characterizes the limiting fluctuations around the LLN limit as the unique solution of a linear stochastic PDE. The key idea is to use the solution to the master equation to construct an associated McKean-Vlasov interacting $n$-particle system that is sufficiently close to the Nash equilibrium dynamics of the $n$-player game for large $n$. We then derive the CLT for the latter from the CLT for the former. Along the way, we obtain a new multidimensional CLT for McKean-Vlasov systems. We also illustrate the broader applicability of our methodology by applying it to establish a CLT for a specific linear-quadratic example that does not satisfy our main assumptions, and we explicitly solve the resulting stochastic PDE in this case.

84 citations


Journal ArticleDOI
TL;DR: The optimal tracking problem is reformulated as finding a Nash-equilibrium solution to multiplayer games, which can be done by solving associated coupled Hamilton–Jacobi equations, and a data-based error estimator is designed to obtain the data-based control.
Abstract: This paper studies an optimal consensus tracking problem of heterogeneous linear multiagent systems. By introducing tracking error dynamics, the optimal tracking problem is reformulated as finding a Nash-equilibrium solution to multiplayer games, which can be done by solving associated coupled Hamilton–Jacobi equations. A data-based error estimator is designed to obtain the data-based control for the multiagent systems. Using the quadratic functional to approximate every agent’s value function, we can obtain the optimal cooperative control by the input–output (I/O) $Q$-learning algorithm with a value iteration technique in the least-square sense. The control law solves the optimal consensus problem for multiagent systems with measured I/O information, and does not rely on the model of multiagent systems. A numerical example is provided to illustrate the effectiveness of the proposed algorithm.

81 citations


Posted Content
TL;DR: A generalized version of the Bellman equation is proposed to learn a single parametric representation for optimal policies over the space of all possible preferences in MORL, with the goal of enabling few-shot adaptation to new tasks.
Abstract: We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
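A heavily hedged sketch of a preference-conditioned Bellman backup of the kind the abstract describes: the target maximizes the scalarized value over next actions and a sampled set of preferences, then backs up the full vector of returns. The finite preference set W, the function names, and the signatures are illustrative; the paper's exact operator may differ.

```python
# Vector-valued Bellman target for MORL with linear preferences (illustrative sketch).
import numpy as np

def morl_bellman_target(r_vec, q_next, w, W, num_actions, gamma=0.99):
    """r_vec: (K,) vector reward; q_next: callable (a, w') -> (K,) vector Q at the next state;
    w: (K,) current preference; W: iterable of candidate preference vectors."""
    best, best_scalar = None, -np.inf
    for w_prime in W:
        for a in range(num_actions):
            q = q_next(a, w_prime)
            scalar = float(w @ q)            # scalarize with the *current* preference
            if scalar > best_scalar:
                best_scalar, best = scalar, q
    return r_vec + gamma * np.asarray(best)  # back up the full vector of returns
```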

80 citations


Proceedings Article
01 Jan 2019
TL;DR: This paper proves for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation and establishes the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
Abstract: Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

73 citations


Posted Content
William Fedus1, Carles Gelada1, Yoshua Bengio, Marc G. Bellemare1, Hugo Larochelle1 
TL;DR: It is demonstrated that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL, and a surprising discovery is made that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
Abstract: Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
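The link between the two discounting schemes can be seen from the identity \int_0^1 \gamma^{kt} d\gamma = 1/(1 + kt): a hyperbolic discount is an average of exponential discounts over gamma, which is why maintaining value estimates at several gammas lets an agent approximate hyperbolic time preferences. A quick numerical check of the identity (not the authors' agent; the paper's particular weighting scheme may differ):

```python
# Riemann-sum check that averaging exponential discount weights over gamma in [0, 1]
# reproduces the hyperbolic discount 1 / (1 + k*t).
import numpy as np

k, t = 0.1, 25.0                                # hyperbolic discount 1 / (1 + k*t)
dg = 1e-4
gammas = np.arange(0.0, 1.0, dg) + dg / 2.0     # midpoint grid on [0, 1]
approx = np.sum(gammas ** (k * t)) * dg         # ~ integral_0^1 gamma^(k*t) dgamma
print(approx, 1.0 / (1.0 + k * t))              # both approximately 0.2857
```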

70 citations


Journal ArticleDOI
TL;DR: A data-driven method is proposed to approximate semi-global solutions to HJB equations for general high-dimensional nonlinear systems and to compute optimal feedback controls in real time with neural networks trained on data generated independently of any state space discretization.
Abstract: Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton-Jacobi-Bellman (HJB) equations, which, in high dimensions, are notoriously difficult. Existing strategies for high dimensional problems generally rely on specific, restrictive problem structures, or are valid only locally around some nominal trajectory. In this paper, we propose a data-driven method to approximate semi-global solutions to HJB equations for general high dimensional nonlinear systems and compute optimal feedback controls in real-time. To accomplish this, we model solutions to HJB equations with neural networks (NNs) trained on data generated independently of any state space discretization. Training is made more effective and efficient by leveraging the known physics of the problem and using the partially trained NN to aid in adaptive data generation. We demonstrate the effectiveness of our method by learning the approximate solution to the HJB equation corresponding to the stabilization of a six-dimensional nonlinear rigid body, and controlling the system with the trained NN.
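For orientation, the following shows the stationary HJB equation and its closed-form feedback law in the illustrative special case of control-affine dynamics with quadratic control cost (an assumption; the paper treats a more general setting). Once a network approximates V, its gradient yields the control directly.

```latex
% Control-affine dynamics \dot{x} = f(x) + g(x)u with running cost q(x) + (1/2) u^T R u:
0 = \min_{u}\Big\{\, q(x) + \tfrac{1}{2}\,u^{\top} R\, u
      + \nabla V(x)^{\top}\big(f(x) + g(x)\,u\big) \Big\},
\qquad
u^{*}(x) = -R^{-1} g(x)^{\top} \nabla V(x).
```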

67 citations


Proceedings ArticleDOI
20 May 2019
TL;DR: This work shows how a time-discounted modification of the problem of maximizing the minimum payoff over time, which is central to safety analysis, yields a modified dynamic programming equation that induces a contraction mapping, rendering reinforcement learning techniques amenable to quantitative safety analysis as tools to approximate the safe set and the optimal safety policy.
Abstract: Safety analysis is a necessary component in the design and deployment of autonomous robotic systems. Techniques from robust optimal control theory, such as Hamilton-Jacobi reachability analysis, allow a rigorous formalization of safety as guaranteed constraint satisfaction. Unfortunately, the computational complexity of these tools for general dynamical systems scales poorly with state dimension, making existing tools impractical beyond small problems. Modern reinforcement learning methods have shown promising ability to find approximate yet proficient solutions to optimal control problems in complex and high-dimensional systems, however their application has in practice been restricted to problems with an additive payoff over time, unsuitable for reasoning about safety. In recent work, we introduced a time-discounted modification of the problem of maximizing the minimum payoff over time, central to safety analysis, through a modified dynamic programming equation that induces a contraction mapping. Here, we show how a similar contraction mapping can render reinforcement learning techniques amenable to quantitative safety analysis as tools to approximate the safe set and optimal safety policy. This opens a new avenue of research connecting control-theoretic safety analysis and the reinforcement learning domain. We validate the correctness of our formulation by comparing safety results computed through Q-learning to analytic and numerical solutions, and demonstrate its scalability by learning safe sets and control policies for simulated systems of up to 18 state dimensions using value learning and policy gradient techniques.
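A hedged tabular sketch of a Q-learning update with a discounted minimum-payoff backup of the kind described above; the exact operator and constants in the paper may differ, and l(s) (e.g., a signed distance to the failure set) and all names are illustrative.

```python
# Tabular Q-learning with a "keep track of the worst payoff" backup instead of a sum of rewards.
import numpy as np

def safety_q_update(Q, s, a, s_next, l, gamma=0.99, alpha=0.1):
    """Q: (S, A) table; l: callable state -> payoff (positive means safe margin)."""
    target = (1 - gamma) * l(s) + gamma * min(l(s), np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```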

62 citations


Posted Content
TL;DR: This work introduces generic model-free algorithms based on the state-action value function at the mean field level and proves convergence for a prototypical Q-learning method for mean field control problems.
Abstract: We develop a general reinforcement learning framework for mean field control (MFC) problems. Such problems arise for instance as the limit of collaborative multi-agent control problems when the number of agents is very large. The asymptotic problem can be phrased as the optimal control of a non-linear dynamics. This can also be viewed as a Markov decision process (MDP) but the key difference with the usual RL setup is that the dynamics and the reward now depend on the state's probability distribution itself. Alternatively, it can be recast as a MDP on the Wasserstein space of measures. In this work, we introduce generic model-free algorithms based on the state-action value function at the mean field level and we prove convergence for a prototypical Q-learning method. We then implement an actor-critic method and report numerical results on two archetypal problems: a finite space model motivated by a cyber security application and a continuous space model motivated by an application to swarm motion.

Journal ArticleDOI
TL;DR: The convergence of the MsHDP algorithm is proved by demonstrating that it converges to the solution of the Bellman equation.
Abstract: In this paper, the optimal output tracking control problem of discrete-time nonlinear systems is considered. First, the augmented system is derived and the tracking control problem is converted to a regulation problem with a discounted performance index, which relies on the solution of the Bellman equation. It is known that policy iteration and value iteration are two classical algorithms for solving the Bellman equation. Through analysis of the two algorithms, it is found that policy iteration converges fast but requires an initial admissible control policy, whereas value iteration avoids the requirement of an initial admissible control policy but converges slowly. To achieve a tradeoff between policy iteration and value iteration, the multistep heuristic dynamic programming (MsHDP) is proposed by using a multistep policy evaluation scheme. The convergence of the MsHDP algorithm is proved by demonstrating that it converges to the solution of the Bellman equation. Subsequently, a neural network-based actor-critic structure is developed to implement the MsHDP algorithm. The effectiveness and advantages of the developed MsHDP method are validated through comparative simulation studies.
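An illustrative tabular sketch of the multistep evaluation idea (not the authors' neural-network implementation): between a single backup per improvement step (value iteration) and full evaluation (policy iteration), the fixed-policy Bellman operator is applied N times.

```python
# Multistep heuristic dynamic programming on a small known MDP (illustrative sketch).
import numpy as np

def multistep_hdp(P, R, gamma=0.95, n_steps=5, iters=200):
    """P: (S, A, S) transitions, R: (S, A) rewards; returns a value estimate and greedy policy."""
    S, A = R.shape
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        Q = R + gamma * P @ V                    # (S, A) action values
        pi = Q.argmax(axis=1)                    # policy improvement (greedy)
        for _ in range(n_steps):                 # partial policy evaluation: N backups
            Q_pi = R + gamma * P @ V
            V = Q_pi[np.arange(S), pi]
    return V, pi
```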

Posted Content
TL;DR: This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations, providing insight into the interplay between optimization and generalization in reinforcement learning.
Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic is known, obtained by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of the stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.

Proceedings Article
01 Jan 2019
TL;DR: A novel loss function is proposed, which can be optimized using standard gradient-based methods with guaranteed convergence, and can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient.
Abstract: Value function learning plays a central role in many state-of-the-art reinforcement learning algorithms. Many popular algorithms like Q-learning do not optimize any objective function, but are fixed-point iterations of some variants of Bellman operator that are not necessarily a contraction. As a result, they may easily lose convergence guarantees, as can be observed in practice. In this paper, we propose a novel loss function, which can be optimized using standard gradient-based methods with guaranteed convergence. The key advantage is that its gradient can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient. Our approach may be combined with general function classes such as neural networks, using either on- or off-policy data, and is shown to work reliably and effectively in several benchmarks, including classic problems where standard algorithms are known to diverge.
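The double-sample issue mentioned above can be stated in one line: for a stochastic environment, the single-sample squared Bellman residual is biased upward by the variance of the bootstrap target,

```latex
% Bias of the naive single-sample objective for a transition (s, a, r, s'):
\mathbb{E}_{s'}\big[\big(r + \gamma V(s') - V(s)\big)^{2}\big]
  = \big(\mathbb{E}_{s'}[r + \gamma V(s')] - V(s)\big)^{2}
  + \operatorname{Var}_{s'}\!\big[r + \gamma V(s')\big],
```

so minimizing it also penalizes environment stochasticity unless two independent draws of s' are available, which is the double-sample requirement the proposed loss is designed to avoid.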

Proceedings Article
01 Jan 2019
TL;DR: This article proposes a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences, which can learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent.
Abstract: We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.

Journal ArticleDOI
TL;DR: A Q-learning scheme for the optimal consensus control of discrete-time multiagent systems is investigated; the algorithm is driven by reinforcement learning using system data instead of system dynamics information, and a least-squares method is employed to facilitate its implementation.
Abstract: This paper investigates a Q-learning scheme for the optimal consensus control of discrete-time multiagent systems. The Q-learning algorithm is conducted by reinforcement learning (RL) using system data instead of system dynamics information. In the multiagent systems, the agents interact with each other and at least one agent can communicate with the leader directly, which is described by an algebraic graph structure. The objective is to make all the agents achieve synchronization with the leader and make the performance indices reach a Nash equilibrium. On one hand, the solutions of the optimal consensus control for multiagent systems are acquired by solving the coupled Hamilton–Jacobi–Bellman (HJB) equation. However, it is difficult to obtain analytical solutions of the discrete-time HJB equation directly. On the other hand, accurate mathematical models of most real-world systems are hard to obtain. To overcome these difficulties, the Q-learning algorithm is developed using system data rather than an accurate system model. We formulate the performance index and the corresponding Bellman equation of each agent i. Then, the Q-function Bellman equation is obtained on the basis of the Q-function. Policy iteration is adopted to calculate the optimal control iteratively, and a least-squares (LS) method is employed to facilitate the implementation process. A stability analysis of the proposed Q-learning algorithm for multiagent systems under policy iteration is given. Two simulation examples are provided to verify the effectiveness of the proposed scheme.

Journal ArticleDOI
TL;DR: A new approach for finite horizon optimal control problems where the value function is computed using a DP algorithm on a tree structure (TSA) constructed by the time-discrete dynamics, allowing for the solution of very high-dimensional problems.
Abstract: The classical dynamic programming (DP) approach to optimal control problems is based on the characterization of the value function as the unique viscosity solution of a Hamilton–Jacobi–Bellman equation...

Journal ArticleDOI
TL;DR: In this article, a stochastic optimal control problem for a partially observed diffusion is studied and a corresponding randomized dynamic programming principle for the value function is obtained from a flow property of an associated filter process.

Journal ArticleDOI
TL;DR: In this article, a new optimization formulation of the linear quadratic regulator (LQR) problem via Lagrangian duality theory is proposed to lay theoretical foundations for potentially effective RL algorithms.
Abstract: Recently, reinforcement learning (RL) has been receiving more and more attention due to its successful demonstrations outperforming human performance in certain challenging tasks. The goal of this paper is to study a new optimization formulation of the linear quadratic regulator (LQR) problem via Lagrangian duality theory in order to lay theoretical foundations for potentially effective RL algorithms. The new optimization problem includes the Q-function parameters so that it can be directly used to develop Q-learning algorithms, known to be one of the most popular RL algorithms. We prove relations between saddle points of the Lagrangian function and the optimal solutions of the Bellman equation. As an example of its applications, we propose a model-free primal-dual Q-learning algorithm to solve the LQR problem and demonstrate its validity through examples.
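As background for the Q-function parameters mentioned above (a standard LQR fact, not the paper's primal-dual method): the optimal Q-function is quadratic, Q(x, u) = [x; u]^T H [x; u], and the greedy policy is linear, u = -H_uu^{-1} H_ux x. A small sketch computing H by iterating the Riccati recursion; names are illustrative.

```python
# Compute the LQR Q-function matrix H and the greedy gain K by Riccati iteration.
import numpy as np

def lqr_q_matrix(A, B, Q, R, iters=500):
    """Dynamics x' = A x + B u, stage cost x^T Q x + u^T R u."""
    n, _ = B.shape
    P = np.zeros((n, n))
    for _ in range(iters):                       # value iteration on the Riccati equation
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
    H_xx, H_uu, H_ux = Q + A.T @ P @ A, R + B.T @ P @ B, B.T @ P @ A
    H = np.block([[H_xx, H_ux.T], [H_ux, H_uu]])
    K = np.linalg.solve(H_uu, H_ux)              # greedy policy: u = -K x
    return H, K
```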

Proceedings ArticleDOI
01 May 2019
TL;DR: LVIS is introduced, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy, and is applied to a fundamentally hard problem in feedback control: control through contact.
Abstract: Guided policy search is a popular approach for training controllers for high-dimensional systems, but it has a number of pitfalls. Non-convex trajectory optimization has local minima, and non-uniqueness in the optimal policy itself can mean that independently-optimized samples do not describe a coherent policy from which to train. We introduce LVIS, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy. To avoid the expense of solving the mixed-integer programs to full global optimality, we instead solve them only partially, extracting intervals containing the true cost-to-go from early termination of the branch-and-bound algorithm. These interval samples are used to weakly supervise the training of a neural net which approximates the true cost-to-go. Online, we use that learned cost-to-go as the terminal cost of a one-step model-predictive controller, which we solve via a small mixed-integer optimization. We demonstrate LVIS on piecewise affine models of a cart-pole system with walls and a planar humanoid robot and show that it can be applied to a fundamentally hard problem in feedback control: control through contact.

Posted Content
TL;DR: An explicit upper bound is obtained on the rate of convergence of this algorithm as a function of the network topology and the discount factor when the communication network between the agents is time-varying in general.
Abstract: We study the policy evaluation problem in multi-agent reinforcement learning. In this problem, a group of agents works cooperatively to evaluate the value function for the global discounted accumulative reward problem, which is composed of local rewards observed by the agents. Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors. The local update at each agent can be interpreted as a distributed consensus-based variant of the popular temporal difference learning algorithm TD(0). While distributed reinforcement learning algorithms have been presented in the literature, almost nothing is known about their convergence rate. Our main contribution is providing a finite-time analysis for the convergence of the distributed TD(0) algorithm. We do this when the communication network between the agents is time-varying in general. We obtain an explicit upper bound on the rate of convergence of this algorithm as a function of the network topology and the discount factor. Our results mirror what we would expect from using distributed stochastic gradient descent for solving convex optimization problems.
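A minimal sketch of a consensus-based distributed TD(0) step of the kind analyzed above; the doubly stochastic mixing matrix W, the mix-after-update ordering, and all names are illustrative assumptions rather than the paper's exact recursion.

```python
# Each agent takes a local TD(0) step with its own reward on the shared transition,
# then averages parameters with its neighbors via a mixing matrix.
import numpy as np

def distributed_td0_step(Theta, W, phi_s, phi_s_next, local_rewards, gamma=0.95, alpha=0.05):
    """Theta: (N, d) one linear-value parameter vector per agent; W: (N, N) doubly stochastic
    mixing matrix; phi_s, phi_s_next: (d,) features of the shared transition; local_rewards: (N,)."""
    new_Theta = np.empty_like(Theta)
    for i in range(Theta.shape[0]):
        td_error = local_rewards[i] + gamma * phi_s_next @ Theta[i] - phi_s @ Theta[i]
        new_Theta[i] = Theta[i] + alpha * td_error * phi_s   # local TD(0) step
    return W @ new_Theta                                     # consensus with neighbors
```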

Journal ArticleDOI
TL;DR: Novel results on the solution of a class of leavable, undiscounted optimal control problems in the minimax sense for nonlinear, continuous-state, discrete-time plants are presented.
Abstract: We present novel results on the solution of a class of leavable, undiscounted optimal control problems in the minimax sense for nonlinear, continuous-state, discrete-time plants. The problem class includes entry-(exit-)time problems as well as minimum-time, pursuit-evasion, and reach-avoid games as special cases. We utilize auxiliary optimal control problems (“abstractions”) to compute both upper bounds of the value function, i.e., of the achievable closed-loop performance, and symbolic feedback controllers realizing those bounds. The abstractions are obtained from discretizing the problem data, and we prove that the computed bounds and the performance of the symbolic controllers converge to the value function as the discretization parameters approach zero. In particular, if the optimal control problem is solvable on some compact subset of the state space, and if the discretization parameters are sufficiently small, then we obtain a symbolic feedback controller solving the problem on that subset. These results do not assume the continuity of the value function or any problem data, and they fully apply in the presence of hard state and control constraints.

Journal ArticleDOI
TL;DR: In this paper, the authors develop a dynamic model of rational behavior under uncertainty, in which the agent maximizes the stream of future τ-quantile utilities, for τ ∈ (0, 1).
Abstract: This paper develops a dynamic model of rational behavior under uncertainty, in which the agent maximizes the stream of future τ-quantile utilities, for τ ∈ (0,1). That is, the agent has a quantile utility preference instead of the standard expected utility. Quantile preferences have useful advantages, including the ability to capture heterogeneity and allowing the separation between risk aversion and elasticity of intertemporal substitution. Although quantiles do not share some of the helpful properties of expectations, such as linearity and the law of iterated expectations, we are able to establish all the standard results in dynamic models. Namely, we show that the quantile preferences are dynamically consistent, the corresponding dynamic problem yields a value function, via a fixed point argument, this value function is concave and differentiable, and the principle of optimality holds. Additionally, we derive the corresponding Euler equation, which is well suited for using well-known quantile regression methods for estimating and testing the economic model. In this way, the parameters of the model can be interpreted as structural objects. Therefore, the proposed methods provide microeconomic foundations for quantile regression methods. To illustrate the developments, we construct an intertemporal consumption model and estimate the discount factor and elasticity of intertemporal substitution parameters across the quantiles.  The results provide evidence of heterogeneity in these parameters.
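In the notation suggested by the abstract, the recursion replaces the conditional expectation of the continuation value with its conditional τ-quantile; a schematic (and purely illustrative) form of the resulting Bellman equation is

```latex
% Schematic quantile-utility Bellman equation (notation illustrative):
V(x) \;=\; \max_{c \in \Gamma(x)} \Big\{\, u(c) \;+\; \beta\, \mathcal{Q}_{\tau}\!\big[\, V(x') \,\big|\, x, c \,\big] \Big\},
```

where Q_τ[· | x, c] denotes the conditional τ-quantile of the next-period value, in contrast to the conditional expectation used under expected utility.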

Journal ArticleDOI
TL;DR: In this article, a Wasserstein distance on the set of the probability distributions of strong solutions to stochastic differential equations is defined by restricting a set of possible coupling measures.
Abstract: In this paper we introduce a Wasserstein-type distance on the set of the probability distributions of strong solutions to stochastic differential equations. This new distance is defined by restricting the set of possible coupling measures. We prove that it may also be defined by means of the value function of a stochastic control problem whose Hamilton–Jacobi–Bellman equation has a smooth solution, which allows one to deduce a priori estimates or to obtain numerical evaluations. We exhibit an optimal coupling measure and characterize it as a weak solution to an explicit stochastic differential equation, and we finally describe procedures to approximate this optimal coupling measure. A notable application concerns the following modeling issue: given an exact diffusion model, how to select a simplified diffusion model within a class of admissible models under the constraint that the probability distribution of the exact model is preserved as much as possible?

Posted Content
26 Nov 2019
TL;DR: This paper presents a constrained deep adaptive dynamic programming algorithm to solve general nonlinear optimal control problems with known dynamics and proposes a series of recovery rules to update the policy in case the primal problem is infeasible.
Abstract: This paper presents a constrained deep adaptive dynamic programming (CDADP) algorithm to solve general nonlinear optimal control problems with known dynamics. Unlike previous ADP algorithms, it can directly deal with problems with state constraints. Both the policy and value function are approximated by deep neural networks (NNs), which directly map the system state to the action and the value function, respectively, without needing to use hand-crafted basis functions. The proposed algorithm considers the state constraints by transforming the policy improvement process into a constrained optimization problem. Meanwhile, a trust region constraint is added to prevent excessive policy updates. We first linearize this constrained optimization problem locally into a quadratically-constrained quadratic programming problem, and then obtain the optimal update of the policy network parameters by solving its dual problem. We also propose a series of recovery rules to update the policy in case the primal problem is infeasible. In addition, parallel learners are employed to explore different state spaces, which stabilizes and accelerates learning. A vehicle control problem in a path-tracking task is used to demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: In this article, the authors considered two-player risk-sensitive zero-sum differential games (RSZSDGs), where both the drift term and the diffusion term in the controlled stochastic differential equation are dependent on the state and controls of both players, and the objective functional is of the risk-sensitive type.
Abstract: We consider two-player risk-sensitive zero-sum differential games (RSZSDGs). In our problem setup, both the drift term and the diffusion term in the controlled stochastic differential equation are dependent on the state and controls of both players, and the objective functional is of the risk-sensitive type. First, a stochastic maximum principle type necessary condition for an open-loop saddle point of the RSZSDG is established via nonlinear transformations of the adjoint processes of the equivalent risk-neutral stochastic zero-sum differential game. In particular, we obtain two variational inequalities, namely, the pair of saddle-point inequalities of the RSZSDG. Next, we obtain the Hamilton–Jacobi–Isaacs partial differential equation for the RSZSDG, which provides a sufficient condition for a feedback saddle point of the RSZSDG, using a logarithmic transformation of the associated value function. Finally, we study the extended linear-quadratic RSZSDG (LQ-RSZSDG). We show intractability of the extended LQ-RSZSDG with the state and/or controls of both players appearing in the diffusion term. This unexpected intractability could lead to nonlinear open-loop and feedback saddle points even if the problem itself is essentially LQ and the Isaacs condition holds.

Posted Content
TL;DR: This article proves that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation, a result enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD.
Abstract: Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

Proceedings Article
01 Jan 2019
TL;DR: It is proved that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and the method achieves a polynomial sample complexity bound in the horizon and the number of anchor points.
Abstract: We study linear approximate value iteration (LAVI) with a generative model. While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of anchor states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations in a number of simple problems where a traditional least-square LAVI method diverges.
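A hedged sketch of the interpolation mechanism described above, assuming access to a generative model and a single sampled next state per anchor-action pair; names and shapes are illustrative. The key point is that a convex-combination interpolation is a non-expansion, which is what keeps the error propagation linear rather than exponential.

```python
# Approximate value iteration with Q-values stored only at K anchor states.
import numpy as np

def interpolated_q(lambdas, anchor_q):
    """lambdas: (K,) convex weights expressing phi(s) as a combination of anchor features
    (nonnegative, summing to one); anchor_q: (K, A) Q-estimates at the anchors."""
    assert np.all(lambdas >= 0) and abs(lambdas.sum() - 1.0) < 1e-6
    return lambdas @ anchor_q                     # (A,) interpolated Q-values at state s

def anchor_backup(anchor_q, rewards, next_lambdas, gamma=0.95):
    """One backup at the anchors: rewards[k, a] + gamma * max_a' of the interpolated Q
    at the sampled next state; next_lambdas: (K, A, K) convex weights of next states."""
    K, A = rewards.shape
    new_q = np.empty((K, A))
    for k in range(K):
        for a in range(A):
            new_q[k, a] = rewards[k, a] + gamma * interpolated_q(next_lambdas[k, a], anchor_q).max()
    return new_q
```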

Journal ArticleDOI
TL;DR: A novel event-triggered single-network adaptive dynamic programming (ADP) method is proposed to obtain the solution of the constrained OTCP, and the convergence of the critic NN weights and the stability of the closed-loop system are demonstrated.

Posted Content
TL;DR: This work extends classical differential game theory to simultaneously address weapon assignments and multi-player pursuit-evasion scenarios, and devises saddle-point strategies that provide guaranteed performance for each team regardless of the actual strategies implemented by the opponent.
Abstract: In this paper an N-pursuer vs. M-evader team conflict is studied. The differential game of border defense is addressed and we focus on the game of degree in the region of the state space where the pursuers are able to win. This work extends classical differential game theory to simultaneously address weapon assignments and multi-player pursuit-evasion scenarios. Saddle-point strategies that provide guaranteed performance for each team regardless of the actual strategies implemented by the opponent are devised. The players' optimal strategies require the co-design of cooperative optimal assignments and optimal guidance laws. A representative measure of performance is proposed and the Value function of the game is obtained. It is shown that the Value function is continuous, continuously differentiable, and that it satisfies the Hamilton-Jacobi-Isaacs equation - the curse of dimensionality is overcome and the optimal strategies are obtained. The cases of N=M and N>M are considered. In the latter case, cooperative guidance strategies are also developed in order for the pursuers to exploit their numerical advantage. This work provides a foundation to formally analyze complex and high-dimensional conflicts between teams of N pursuers and M evaders by means of differential game theory.