
Showing papers on "Bellman equation published in 2018"


Proceedings Article
03 Jul 2018
TL;DR: The authors reformulate the Bellman optimality equation into a primal-dual optimization problem using Nesterov's smoothing technique and the Legendre-Fenchel transformation, and develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem.
Abstract: When function approximation is used, solving the Bellman optimality equation with stability guarantees has remained a major open problem in reinforcement learning for decades. The fundamental difficulty is that the Bellman operator may become an expansion in general, resulting in oscillating and even divergent behavior of popular algorithms like Q-learning. In this paper, we revisit the Bellman equation, and reformulate it into a novel primal-dual optimization problem using Nesterov’s smoothing technique and the Legendre-Fenchel transformation. We then develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem where any differentiable function class may be used. We provide what we believe to be the first convergence guarantee for general nonlinear function approximation, and analyze the algorithm’s sample complexity. Empirically, our algorithm compares favorably to state-of-the-art baselines in several benchmark control problems.
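As a hedged illustration (our notation; the paper's actual objective also involves dual variables and a squared smoothed Bellman error), the smoothing step replaces the hard max in the Bellman optimality equation with an entropy-regularized maximization:

$$V_\lambda(s) \;=\; \max_{\pi(\cdot\mid s)} \sum_{a} \pi(a\mid s)\Big(R(s,a) + \gamma\,\mathbb{E}_{s'\mid s,a}\big[V_\lambda(s')\big]\Big) \;+\; \lambda\, H\big(\pi(\cdot\mid s)\big).$$

Applying the Legendre-Fenchel (convex conjugate) transformation to the resulting smoothed Bellman error then yields the saddle-point problem over the value function, the policy, and a dual function that the Smoothed Bellman Error Embedding algorithm minimizes with any differentiable function class.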

224 citations


Journal ArticleDOI
TL;DR: It is theoretically proved that the iterative value function sequence strictly converges to the solution of the coupled Hamilton–Jacobi–Bellman equation and a novel online iterative scheme is proposed, which runs based on the data sampled from the augmented system and the gradient of the value function.
Abstract: This paper focuses on the distributed optimal cooperative control for continuous-time nonlinear multiagent systems (MASs) with completely unknown dynamics via adaptive dynamic programming (ADP) technology. By introducing predesigned extra compensators, the augmented neighborhood error systems are derived, which successfully circumvents the system-knowledge requirement for ADP. It is revealed that the optimal consensus protocols actually work as the solutions of the MAS differential game. A policy iteration algorithm is adopted, and it is theoretically proved that the iterative value function sequence strictly converges to the solution of the coupled Hamilton–Jacobi–Bellman equation. Based on this point, a novel online iterative scheme is proposed, which runs based on the data sampled from the augmented system and the gradient of the value function. Neural networks are employed to implement the algorithm, and the weights are updated, in the least-squares sense, to the ideal value, which yields approximated optimal consensus protocols. Finally, a numerical example is given to illustrate the effectiveness of the proposed scheme.

142 citations


Proceedings Article
03 Jul 2018
TL;DR: In this paper, the authors consider the exploration/exploitation problem in reinforcement learning and show that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy.
Abstract: We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
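Schematically, in our notation (the paper states this per time-step with a local uncertainty term $\nu$), the uncertainty Bellman equation mirrors the ordinary Bellman equation with uncertainty propagated in place of reward:

$$u(s,a) \;=\; \nu(s,a) \;+\; \gamma^{2} \sum_{s'} P(s'\mid s,a) \sum_{a'} \pi(a'\mid s')\, u(s',a'),$$

and its unique fixed point upper-bounds the posterior variance of $Q^{\pi}(s,a)$, so $\sqrt{u}$ can serve as an exploration bonus in place of count-based terms.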

131 citations


Journal ArticleDOI
TL;DR: Monotonicity of the local value iteration ADP algorithm is presented, which shows that under some special conditions of the initial value function and the learning rate function, the iterative value function can monotonically converge to the optimum.
Abstract: In this paper, convergence properties are established for the newly developed discrete-time local value iteration adaptive dynamic programming (ADP) algorithm. The present local iterative ADP algorithm permits an arbitrary positive semidefinite function to initialize the algorithm. Employing a state-dependent learning rate function, for the first time, the iterative value function and iterative control law can be updated in a subset of the state space instead of the whole state space, which effectively relaxes the computational burden. A new analysis method for the convergence property is developed to prove that the iterative value functions will converge to the optimum under some mild constraints. Monotonicity of the local value iteration ADP algorithm is presented, which shows that under some special conditions of the initial value function and the learning rate function, the iterative value function can monotonically converge to the optimum. Finally, three simulation examples and comparisons are given to illustrate the performance of the developed algorithm.
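A minimal tabular sketch of the local-update idea (our toy construction; the paper treats general nonlinear systems with neural-network approximators and an undiscounted cost): each iteration sweeps only a subset of the state space and blends the old value with the one-step backup through a state-dependent learning rate.

import numpy as np

# Toy deterministic system on a finite state set; dynamics and utility are assumed.
n_states, n_actions = 50, 4
rng = np.random.default_rng(0)
next_state = rng.integers(0, n_states, size=(n_states, n_actions))  # f(x, u)
utility = rng.random((n_states, n_actions))                         # U(x, u)
gamma = 0.95            # discount added only to keep this random toy bounded

V = np.zeros(n_states)                                   # positive semidefinite initialization
alpha = lambda x: 0.5 + 0.5 * x / n_states               # state-dependent learning rate in (0, 1]

for it in range(200):
    subset = rng.choice(n_states, size=n_states // 5, replace=False)  # local sweep only
    for x in subset:
        backup = np.min(utility[x] + gamma * V[next_state[x]])        # one-step Bellman backup
        V[x] = (1 - alpha(x)) * V[x] + alpha(x) * backup               # relaxed local update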

128 citations


Journal ArticleDOI
TL;DR: This paper develops algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming, first approximating the optimal policy by means of neural networks in the spirit of deep reinforcement learning, and then the value function by Monte Carlo regression.
Abstract: This paper develops algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming. Unlike classical approximate dynamic programming approaches, we first approximate the optimal policy by means of neural networks in the spirit of deep reinforcement learning, and then the value function by Monte Carlo regression. This is achieved in the dynamic programming recursion by performance or hybrid iteration, and regress-now methods from numerical probability. We provide a theoretical justification of these algorithms. Consistency and rate of convergence for the control and value function estimates are analyzed and expressed in terms of the universal approximation error of the neural networks, and of the statistical error when estimating the network functions, leaving aside the optimization error. Numerical results on various applications are presented in a companion paper (arxiv.org/abs/1812.05916) and illustrate the performance of the proposed algorithms.
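A rough sketch of the backward recursion under toy assumptions (one-dimensional controlled diffusion, quadratic costs), with the paper's neural networks replaced by a grid search over actions and a polynomial regress-now value estimate, purely to make the two-stage structure (policy step, then Monte Carlo value regression) concrete:

import numpy as np

# Toy controlled dynamics X_{t+1} = X_t + a*dt + sigma*dW with quadratic costs (all assumed).
T_steps, dt, sigma, n_paths = 10, 0.1, 0.3, 2000
actions = np.linspace(-1.0, 1.0, 21)
running_cost = lambda x, a: 0.5 * (x**2 + a**2) * dt
terminal_cost = lambda x: x**2

rng = np.random.default_rng(1)
coefs = [None] * (T_steps + 1)
grid = np.linspace(-3, 3, 50)
coefs[T_steps] = np.polyfit(grid, terminal_cost(grid), 4)

for n in reversed(range(T_steps)):
    x = rng.normal(0.0, 1.0, n_paths)                       # training states at time n
    dW = rng.normal(0.0, np.sqrt(dt), (n_paths, actions.size))
    x_next = x[:, None] + actions[None, :] * dt + sigma * dW
    # "Policy" step: per state, pick the action minimizing cost + estimated continuation value.
    q = running_cost(x[:, None], actions[None, :]) + np.polyval(coefs[n + 1], x_next)
    realized = q.min(axis=1)
    # "Value" step: regress-now Monte Carlo regression of the realized cost-to-go.
    coefs[n] = np.polyfit(x, realized, 4)

print("approximate value at x = 0, t = 0:", np.polyval(coefs[0], 0.0))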

83 citations


Journal ArticleDOI
TL;DR: An adaptive optimal control approach is developed by using the value iteration-based Q-learning (VIQL) with the critic-only structure to design the adaptive constrained optimal controller based on the gradient descent scheme.
Abstract: Reinforcement learning has proved to be a powerful tool for solving optimal control problems over the past few years. However, the data-based constrained optimal control problem of nonaffine nonlinear discrete-time systems has rarely been studied. To solve this problem, an adaptive optimal control approach is developed by using value iteration-based Q-learning (VIQL) with a critic-only structure. Most existing constrained control methods require a particular performance index and only suit linear or affine nonlinear systems, which is restrictive in practice. To overcome this problem, a system transformation is first introduced with a general performance index, and the constrained optimal control problem is converted to an unconstrained optimal control problem. By introducing the action-state value function, i.e., the Q-function, the VIQL algorithm is proposed to learn the optimal Q-function of the data-based unconstrained optimal control problem. The convergence results of the VIQL algorithm are established with an easy-to-realize initial condition $Q^{(0)}(x,a)\geqslant 0$. To implement the VIQL algorithm, a critic-only structure is developed, where only one neural network is required to approximate the Q-function. The converged Q-function obtained from the critic-only VIQL method is employed to design the adaptive constrained optimal controller based on a gradient descent scheme. Finally, the effectiveness of the developed adaptive control method is tested on three examples with computer simulation.
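A tabular stand-in for the VIQL recursion, just to make the iteration visible (the paper works with continuous states and a critic neural network; the sampled data set, costs, and the discount used to keep this random toy bounded are all our assumptions):

import numpy as np

n_states, n_actions = 30, 5
rng = np.random.default_rng(2)
sampled_next = rng.integers(0, n_states, (n_states, n_actions))  # sampled x' for each (x, a)
utility = rng.random((n_states, n_actions))                      # performance index U(x, a)

Q = np.zeros((n_states, n_actions))         # easy-to-realize initialization Q0(x, a) >= 0
for k in range(300):
    # Value-iteration-style Q-update evaluated only on the sampled data set.
    Q = utility + 0.95 * Q[sampled_next].min(axis=-1)

controller = Q.argmin(axis=1)               # greedy (cost-minimizing) control from the converged Q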

83 citations


Journal ArticleDOI
TL;DR: A distributed control scheme for an interconnected system composed of uncertain input affine nonlinear subsystems with event triggered state feedback is presented by using a novel hybrid learning scheme-based approximate dynamic programming with online exploration.
Abstract: In this paper, a distributed control scheme for an interconnected system composed of uncertain input-affine nonlinear subsystems with event-triggered state feedback is presented by using a novel hybrid learning scheme-based approximate dynamic programming with online exploration. First, an approximate solution to the Hamilton–Jacobi–Bellman equation is generated with event-sampled neural network (NN) approximation, and subsequently, a near-optimal control policy for each subsystem is derived. Artificial NNs are utilized as function approximators to develop a suite of identifiers and to learn the dynamics of each subsystem. The NN weight tuning rules for the identifier and the event-triggering condition are derived using Lyapunov stability theory. Taking into account the effects of NN approximation of the system dynamics and bootstrapping, a novel NN weight update is presented to approximate the optimal value function. Finally, a novel strategy to incorporate exploration into the online control framework, using the identifiers, is introduced to reduce the overall cost at the expense of additional computation during the initial online learning phase. The system states and the NN weight estimation errors are regulated, and locally uniformly ultimately bounded results are achieved. The analytical results are substantiated using simulation studies.

80 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the stochastic optimal control problem of McKean-Vlasov stochastically differential equation where the coefficients may depend upon the joint law of the state and control.
Abstract: We consider the stochastic optimal control problem of McKean-Vlasov stochastic differential equations where the coefficients may depend upon the joint law of the state and control. By using feedback controls, we reformulate the problem into a deterministic control problem with only the marginal distribution of the process as controlled state variable, and prove that the dynamic programming principle holds in its general form. Then, by relying on the notion of differentiability with respect to probability measures recently introduced by P.L. Lions in [32], and a special Itô formula for flows of probability measures, we derive the (dynamic programming) Bellman equation for the mean-field stochastic control problem, and prove a verification theorem in our McKean-Vlasov framework. We give explicit solutions to the Bellman equation for the linear quadratic mean-field control problem, with applications to mean-variance portfolio selection and a systemic risk model. We also consider a notion of lifted viscosity solutions for the Bellman equation, and show the viscosity property and uniqueness of the value function to the McKean-Vlasov control problem. Finally, we consider the case of McKean-Vlasov control problems with open-loop controls and discuss the associated dynamic programming equation, which we compare with the case of closed-loop controls.

79 citations


Journal ArticleDOI
TL;DR: A simultaneous policy iteration (SPI) algorithm is developed to solve the optimal regulation problem within the framework of adaptive dynamic programming, and actor and critic networks are employed to approximate the optimal control and the optimal value function.
Abstract: This paper presents a novel robust regulation method for a class of continuous-time nonlinear systems subject to unmatched perturbations. To begin with, the robust regulation problem is transformed into an optimal regulation problem by constructing a value function for the auxiliary system. Then, a simultaneous policy iteration (SPI) algorithm is developed to solve the optimal regulation problem within the framework of adaptive dynamic programming. To implement the SPI algorithm, actor and critic networks are employed to approximate the optimal control and the optimal value function, respectively, and the Monte Carlo integration method is applied to obtain the unknown weight parameters. Finally, two examples, including a power system, are provided to demonstrate the applicability of the developed approach.

68 citations


Posted Content
TL;DR: In this paper, the authors propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world.
Abstract: We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.
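A schematic of the core loop under toy assumptions (point-mass dynamics, quadratic reward, a fixed stand-in value function, naive random-shooting trajectory optimization; none of this is the paper's actual setup): a short-horizon planner is run at every step, with the learned value function supplying the terminal reward so the plan can look beyond its horizon.

import numpy as np

rng = np.random.default_rng(3)
H, n_candidates = 10, 256                       # planning horizon and sampled action sequences

def dynamics(x, u):                             # toy point-mass model (assumed)
    return x + 0.1 * u

def reward(x, u):                               # toy quadratic reward (assumed)
    return -np.sum(x**2, axis=-1) - 0.01 * np.sum(u**2, axis=-1)

def value_fn(x):                                # stand-in for the learned global value function
    return -2.0 * np.sum(x**2, axis=-1)

def plan(x0):                                   # local trajectory optimization by random shooting
    u_seq = rng.uniform(-1, 1, (n_candidates, H, x0.size))
    x = np.repeat(x0[None, :], n_candidates, axis=0)
    ret = np.zeros(n_candidates)
    for t in range(H):
        ret += reward(x, u_seq[:, t])
        x = dynamics(x, u_seq[:, t])
    ret += value_fn(x)                          # terminal value closes the short horizon
    return u_seq[np.argmax(ret), 0]             # MPC: execute only the first action

x = np.array([1.0, -0.5])
for step in range(50):
    x = dynamics(x, plan(x))                    # act online; value learning would happen offline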

67 citations


Journal ArticleDOI
TL;DR: A dynamic discount factor embedded in the iterative Bellman equation is proposed to prevent biased estimation of the action-value function caused by inconstant time-step intervals, and the trained agent is shown to outperform a fixed timing plan in all testing cases, reducing total system delay by 20%.
Abstract: By improving the efficiency of road networks through advanced traffic signal control, intelligent transportation systems help characterize a smart city. Recently, owing to significant progress in artificial intelligence, machine learning-based frameworks for adaptive traffic signal control have attracted considerable attention. In particular, the deep Q-learning neural network is a model-free technique that can be applied to optimal action selection problems. However, setting a variable green time is a key mechanism for reflecting traffic fluctuations, so time steps need not be fixed intervals in the reinforcement learning framework. In this study, the authors propose a dynamic discount factor embedded in the iterative Bellman equation to prevent biased estimation of the action-value function caused by the effects of inconstant time-step intervals. Moreover, the action is added to the input layer of the neural network in the training process, and the output layer is the estimated action-value for the denoted action. The trained neural network can then be used to generate the action that leads to an optimal estimated value within a finite set as the agent's policy. Preliminary results show that the trained agent outperforms a fixed timing plan in all testing cases, reducing total system delay by 20%.
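The abstract does not give the exact functional form of the dynamic discount; one natural reading (ours, not necessarily the authors' formula) is that a transition spanning a variable interval $\Delta t$ is discounted as $\gamma^{\Delta t}$ in the Q-update,

$$Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\Big(r + \gamma^{\Delta t}\max_{a'}Q(s',a') - Q(s,a)\Big),$$

so that longer green phases between decision points are discounted more heavily than a fixed per-step $\gamma$ would imply.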

Journal ArticleDOI
TL;DR: This work formulates the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem under a Bayesian framework, and derives an approximately optimal allocation policy that possesses both one-step-ahead and asymptotic optimality for independent normal sampling distributions.
Abstract: Under a Bayesian framework, we formulate the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem, and derive the associated Bellman equation. Using a value function approximation, we derive an approximately optimal allocation policy. We show that this policy is not only computationally efficient but also possesses both one-step-ahead and asymptotic optimality for independent normal sampling distributions. Moreover, the proposed allocation policy is easily generalizable in the approximate dynamic programming paradigm.

Journal ArticleDOI
TL;DR: In this paper, a pseudospectral collocation approximation of the PDE dynamics and an iterative method for the nonlinear Hamilton-Jacobi-Bellman (HJB) equation associated to the feedback synthesis are proposed.
Abstract: A procedure for the numerical approximation of high-dimensional Hamilton–Jacobi–Bellman (HJB) equations associated to optimal feedback control problems for semilinear parabolic equations is proposed. Its main ingredients are a pseudospectral collocation approximation of the PDE dynamics and an iterative method for the nonlinear HJB equation associated to the feedback synthesis. The latter is known as the successive Galerkin approximation. It can also be interpreted as Newton iteration for the HJB equation. At every step, the associated linear generalized HJB equation is approximated via a separable polynomial approximation ansatz. Stabilizing feedback controls are obtained from solutions to the HJB equations for systems of dimension up to fourteen.

Journal ArticleDOI
TL;DR: In this article, a stochastic control problem for a class of nonlinear kernels is considered and a dynamic programming principle for this control problem in an abstract setting is presented, which is then used to provide a semimartingale characterization of the value function.
Abstract: We consider a stochastic control problem for a class of nonlinear kernels. More precisely, our problem of interest consists in the optimization, over a set of possibly nondominated probability measures, of solutions of backward stochastic differential equations (BSDEs). Since BSDEs are nonlinear generalizations of the traditional (linear) expectations, this problem can be understood as stochastic control of a family of nonlinear expectations, or equivalently of nonlinear kernels. Our first main contribution is to prove a dynamic programming principle for this control problem in an abstract setting, which we then use to provide a semimartingale characterization of the value function. We next explore several applications of our results. We first obtain a wellposedness result for second order BSDEs (as introduced in Soner, Touzi and Zhang [Probab. Theory Related Fields 153 (2012) 149–190]) which does not require any regularity assumption on the terminal condition and the generator. Then we prove a nonlinear optional decomposition in a robust setting, extending recent results of Nutz [Stochastic Process. Appl. 125 (2015) 4543–4555], which we then use to obtain a super-hedging duality in uncertain, incomplete and nonlinear financial markets. Finally, we relate, under additional regularity assumptions, the value function to a viscosity solution of an appropriate path–dependent partial differential equation (PPDE).

Journal ArticleDOI
TL;DR: Effective ways for verifying the Slater type conditions are investigated and other conditions which are based on lower semicontinuity of the optimal value function of the inner maximization problem of the DRO are introduced.
Abstract: A key step in solving minimax distributionally robust optimization (DRO) problems is to reformulate the inner maximization w.r.t. the probability measure as a semi-infinite programming problem through Lagrange duality. Slater type conditions have been widely used for strong duality (zero dual gap) when the ambiguity set is defined through moments. In this paper, we investigate effective ways for verifying the Slater type conditions and introduce other conditions which are based on lower semicontinuity of the optimal value function of the inner maximization problem. Moreover, we propose two discretization schemes for solving the DRO, one for the dualized DRO and the other directly through the ambiguity set of the DRO. In the absence of strong duality, the discretization scheme via Lagrange duality may provide an upper bound for the optimal value of the DRO whereas the direct discretization approach provides a lower bound. Two cutting plane schemes are consequently proposed: one for the discretized dualized DRO and the other for the minimax DRO with discretized ambiguity set. Convergence analysis is presented for the approximation schemes in terms of the optimal value, optimal solutions and stationary points. Comparative numerical results are reported for the resulting algorithms.

Journal ArticleDOI
TL;DR: This paper presents a novel adaptive dynamic programming (ADP)-based self-learning robust optimal control scheme for input-affine continuous-time nonlinear systems with mismatched disturbances, designed by modifying the optimal control law of the auxiliary system.

Journal ArticleDOI
TL;DR: A model-free off-policy integral reinforcement learning algorithm is proposed to solve the optimal robust output containment problem of heterogeneous MAS, in real time, without requiring any knowledge of the system dynamics.
Abstract: This paper investigates the optimal robust output containment problem of general linear heterogeneous multiagent systems (MAS) with completely unknown dynamics. A model-based algorithm using offline policy iteration (PI) is first developed, where the ${p}$-copy internal model principle is utilized to address the system parameter variations. This offline PI algorithm requires the nominal model of each agent, which may not be available in most real-world applications. To address this issue, a discounted performance function is introduced to express the optimal robust output containment problem as an optimal output-feedback design problem with bounded ${\mathcal {L}_{2}}$-gain. To solve this problem online in real time, a Bellman equation is first developed to evaluate a certain control policy and find the updated control policies, simultaneously, using only the state/output information measured online. Then, using this Bellman equation, a model-free off-policy integral reinforcement learning algorithm is proposed to solve the optimal robust output containment problem of heterogeneous MAS, in real time, without requiring any knowledge of the system dynamics. Simulation results are provided to verify the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: A duality-based reformulation method that converts the infinite-dimensional minimax problem into a semi-infinite program that can be solved using existing convergent algorithms and it is proved that there is no duality gap, and that this approach thus preserves optimality.

Journal ArticleDOI
TL;DR: An output feedback Q-learning algorithm is developed for finding the optimal strategies of the discrete-time linear quadratic zero-sum game, which encompasses the H-infinity optimal control problem, and is shown to converge to the nominal GARE solution.

Journal ArticleDOI
TL;DR: This paper presents an approximate optimal distributed control scheme for a known interconnected system composed of input affine nonlinear subsystems using event-triggered state and output feedback via a novel hybrid learning scheme to reduce the convergence time for the learning algorithm.
Abstract: This paper presents an approximate optimal distributed control scheme for a known interconnected system composed of input affine nonlinear subsystems using event-triggered state and output feedback via a novel hybrid learning scheme. First, the cost function for the overall system is redefined as the sum of cost functions of individual subsystems. A distributed optimal control policy for the interconnected system is developed using the optimal value function of each subsystem. To generate the optimal control policy forward in time, neural networks are employed to reconstruct the unknown optimal value function at each subsystem online. In order to retain the advantages of event-triggered feedback for an adaptive optimal controller, a novel hybrid learning scheme is proposed to reduce the convergence time for the learning algorithm. The development is based on the observation that, in event-triggered feedback, the sampling instants are dynamic and result in variable interevent times. To relax the requirement of entire state measurements, an extended nonlinear observer is designed at each subsystem to recover the system internal states from the measurable feedback. Using a Lyapunov-based analysis, it is demonstrated that the system states and the observer errors remain locally uniformly ultimately bounded and that the control policy converges to a neighborhood of the optimal policy. Simulation results are presented to demonstrate the performance of the developed controller.

Posted Content
TL;DR: Structural results about the value function, the optimal policy, and the worst-case optimal transport adversarial model are obtained, exposing a rich structure embedded in the DRO problem and enabling efficient optimization procedures that have the same sample and iteration complexity as a natural non-DRO benchmark algorithm, such as stochastic gradient descent.
Abstract: We consider optimal transport based distributionally robust optimization (DRO) problems with locally strongly convex transport cost functions and affine decision rules. Under conventional convexity assumptions on the underlying loss function, we obtain structural results about the value function, the optimal policy, and the worst-case optimal transport adversarial model. These results expose a rich structure embedded in the DRO problem (e.g., strong convexity even if the non-DRO problem was not strongly convex, a suitable scaling of the Lagrangian for the DRO constraint, etc.), which are crucial for the design of efficient algorithms. As a consequence of these results, one can develop efficient optimization procedures which have the same sample and iteration complexity as a natural non-DRO benchmark algorithm such as stochastic gradient descent.

Journal ArticleDOI
TL;DR: A generalized policy iteration (GPI) algorithm with approximation errors is developed for solving infinite horizon optimal control problems for nonlinear systems and provides a general structure of discrete-time iterative adaptive dynamic programming algorithms by which most of the discrete-time reinforcement learning algorithms can be described using the GPI structure.
Abstract: In this paper, a generalized policy iteration (GPI) algorithm with approximation errors is developed for solving infinite horizon optimal control problems for nonlinear systems. The developed stable GPI algorithm provides a general structure of discrete-time iterative adaptive dynamic programming algorithms, by which most of the discrete-time reinforcement learning algorithms can be described using the GPI structure. It is for the first time that approximation errors are explicitly considered in the GPI algorithm. The properties of the stable GPI algorithm with approximation errors are analyzed. The admissibility of the approximate iterative control law can be guaranteed if the approximation errors satisfy the admissibility criteria. The convergence of the developed algorithm is established, which shows that the iterative value function is convergent to a finite neighborhood of the optimal performance index function, if the approximate errors satisfy the convergence criterion. Finally, numerical examples and comparisons are presented.
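A tabular sketch of the GPI structure (our toy MDP and discounting; the paper treats undiscounted nonlinear optimal control and explicitly tracks the approximation errors): each round performs a finite number of policy evaluation sweeps followed by one improvement step, so a single sweep recovers value iteration and an unbounded number of sweeps recovers policy iteration.

import numpy as np

n_states, n_actions, gamma = 20, 3, 0.9
rng = np.random.default_rng(4)
trans = rng.integers(0, n_states, (n_states, n_actions))   # deterministic transitions (assumed)
util = rng.random((n_states, n_actions))                   # utility (cost) function (assumed)

V = np.zeros(n_states)
policy = np.zeros(n_states, dtype=int)
idx = np.arange(n_states)
for k in range(100):
    N_k = 3                                                # evaluation sweeps this round
    for _ in range(N_k):                                   # partial policy evaluation
        V = util[idx, policy] + gamma * V[trans[idx, policy]]
    policy = (util + gamma * V[trans]).argmin(axis=1)      # policy improvement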

Journal ArticleDOI
TL;DR: This work proposes novel dynamic programming algorithms that alleviate the curse of dimensionality in problems that exhibit certain low-rank structure, and demonstrates the algorithms running in real time on board a quadcopter during a flight experiment under motion capture.
Abstract: Motion planning and control problems are embedded and essential in almost all robotics applications. These problems are often formulated as stochastic optimal control problems and solved using dynamic programming algorithms. Unfortunately, most existing algorithms that guarantee convergence to optimal solutions suffer from the curse of dimensionality: the run time of the algorithm grows exponentially with the dimension of the state space of the system. We propose novel dynamic programming algorithms that alleviate the curse of dimensionality in problems that exhibit certain low-rank structure. The proposed algorithms are based on continuous tensor decompositions recently developed by the authors. Essentially, the algorithms represent high-dimensional functions (e.g., the value function) in a compressed format, and directly perform dynamic programming computations (e.g., value iteration, policy iteration) in this format. Under certain technical assumptions, the new algorithms guarantee convergence towards optimal solutions with arbitrary precision. Furthermore, the run times of the new algorithms scale polynomially with the state dimension and polynomially with the ranks of the value function. This approach realizes substantial computational savings in "compressible" problem instances, where value functions admit low-rank approximations. We demonstrate the new algorithms in a wide range of problems, including a simulated six-dimensional agile quadcopter maneuvering example and a seven-dimensional aircraft perching example. In some of these examples, we estimate computational savings of up to 10 orders of magnitude over standard value iteration algorithms. We further demonstrate the algorithms running in real time on board a quadcopter during a flight experiment under motion capture.
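The simplest matrix analogue of the compressed-format idea (our illustration only; the paper uses continuous tensor decompositions in much higher dimensions and genuine controlled dynamics): run value iteration on a 2-D grid, but re-compress the value function by truncated SVD after every Bellman sweep so all computation stays low-rank.

import numpy as np

n, rank, gamma = 64, 5, 0.95
xs = np.linspace(-1, 1, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
cost = X**2 + Y**2                         # assumed running cost on the grid
V = np.zeros((n, n))

def compress(M, r):                        # truncated SVD keeps only the leading r ranks
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

for it in range(200):
    # Crude Bellman backup: the best "action" moves one grid cell in one of four
    # directions (a stand-in for real controlled dynamics).
    shifted = np.stack([np.roll(V, 1, 0), np.roll(V, -1, 0),
                        np.roll(V, 1, 1), np.roll(V, -1, 1)])
    V = compress(cost + gamma * shifted.min(axis=0), rank)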

Proceedings ArticleDOI
01 Dec 2018
TL;DR: A primal-dual distributed GTD algorithm is proposed and it is proved that it almost surely converges to a set of stationary points of the optimization problem.
Abstract: The goal of this paper is to study a distributed version of the gradient temporal-difference (GTD) learning algorithm for multi-agent Markov decision processes (MDPs). The temporal-difference (TD) learning is a reinforcement learning (RL) algorithm that learns an infinite horizon discounted cost function (or value function) for a given fixed policy without the model knowledge. In the distributed RL case each agent receives local reward through local processing. Information exchange over sparse communication network allows the agents to learn the global value function corresponding to a global reward, which is a sum of local rewards. In this paper, the problem is converted into a constrained convex optimization problem with a consensus constraint. We then propose a primal-dual distributed GTD algorithm and prove that it almost surely converges to a set of stationary points of the optimization problem.
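In schematic form (our notation, not the paper's exact statement), the construction casts distributed GTD as a consensus-constrained convex problem of the type

$$\min_{\theta_1,\dots,\theta_N}\ \sum_{i=1}^{N} J_i(\theta_i) \quad \text{subject to} \quad \theta_i = \theta_j \ \text{ for every edge } (i,j) \text{ of the communication graph},$$

where $J_i$ is agent $i$'s local objective built from its local reward; the Lagrangian of this problem is then driven to a saddle point by stochastic primal-dual updates, so each agent learns the global value function using only local rewards and neighbour communication.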

Journal ArticleDOI
TL;DR: In this paper, a stochastic optimization model for hydropower generation reservoirs was proposed, in which the transition probability matrix was calculated based on copula functions, and the value function of the last period was calculated by stepwise iteration.

Journal ArticleDOI
TL;DR: In this paper, the fully nonlinear stochastic Hamilton-Jacobi-Bellman (HJB) equation is studied for the optimal stochastic control problem of stochastic differential equations with random coefficients.
Abstract: In this paper we study the fully nonlinear stochastic Hamilton–Jacobi–Bellman (HJB) equation for the optimal stochastic control problem of stochastic differential equations with random coefficients...

Posted Content
TL;DR: This work finds that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models, suggesting that Dyna may be a viable approach to model-based reinforcement learning in the Arcade Learning Environment and other high-dimensional problems.
Abstract: Dyna is a fundamental approach to model-based reinforcement learning (MBRL) that interleaves planning, acting, and learning in an online setting. In the most typical application of Dyna, the dynamics model is used to generate one-step transitions from selected start states from the agent's history, which are used to update the agent's value function or policy as if they were real experiences. In this work, one-step Dyna was applied to several games from the Arcade Learning Environment (ALE). We found that the model-based updates offered surprisingly little benefit over simply performing more updates with the agent's existing experience, even when using a perfect model. We hypothesize that to get the most from planning, the model must be used to generate unfamiliar experience. To test this, we experimented with the "shape" of planning in multiple different concrete instantiations of Dyna, performing fewer, longer rollouts, rather than many short rollouts. We found that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models. In addition to these findings regarding Dyna in general, our results represent, to our knowledge, the first time that a learned dynamics model has been successfully used for planning in the ALE, suggesting that Dyna may be a viable approach to MBRL in the ALE and other high-dimensional problems.
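A tabular Dyna-Q toy that exposes the "planning shape" knob discussed above (the paper's experiments use deep networks on ALE; the MDP, budget split, and hyperparameters here are all assumptions): a fixed planning budget can be spent on many one-step model updates or on fewer, longer model rollouts.

import numpy as np

n_states, n_actions, gamma, alpha, eps = 25, 4, 0.95, 0.1, 0.1
rng = np.random.default_rng(5)
true_next = rng.integers(0, n_states, (n_states, n_actions))   # hidden environment (assumed)
true_rew = rng.random((n_states, n_actions))
Q = np.zeros((n_states, n_actions))
model = {}                                       # learned one-step model: (s, a) -> (r, s')

def q_update(s, a, r, s2):
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

rollout_len, n_rollouts = 5, 4                   # planning shape: budget = rollout_len * n_rollouts
s = 0
for step in range(2000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    r, s2 = true_rew[s, a], int(true_next[s, a])
    q_update(s, a, r, s2)                        # learn from real experience
    model[(s, a)] = (r, s2)
    for _ in range(n_rollouts):                  # planning with the learned model
        ps, pa = list(model)[rng.integers(len(model))]
        for _ in range(rollout_len):             # longer rollouts reach less familiar states
            pr, ps2 = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
            ps, pa = ps2, int(Q[ps2].argmax())
            if (ps, pa) not in model:
                break
    s = s2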

Journal ArticleDOI
TL;DR: In this article, an optimal control problem in the Mayer's form in the space of probability measures on R n endowed with the Wasserstein distance is studied, where the knowledge of the initial state and velocity is subject to some uncertainty.

Journal ArticleDOI
TL;DR: The effectiveness of the main ideas developed in this paper is illustrated using several examples, including a practical problem of excitation control of a hydrogenerator, and a new sufficient condition is introduced for the value function to converge to a bounded neighborhood of the optimal value function.
Abstract: Policy iteration approximate dynamic programming (DP) is an important algorithm for solving optimal decision and control problems. In this paper, we focus on the problem associated with policy approximation in policy iteration approximate DP for discrete-time nonlinear systems using infinite-horizon undiscounted value functions. Taking policy approximation error into account, we demonstrate asymptotic stability of the control policy under our problem setting, show boundedness of the value function during each policy iteration step, and introduce a new sufficient condition for the value function to converge to a bounded neighborhood of the optimal value function. Aiming for practical implementation of an approximate policy, we consider using Volterra series, which has been extensively covered in controls literature for its good theoretical properties and for its success in practical applications. We illustrate the effectiveness of the main ideas developed in this paper using several examples including a practical problem of excitation control of a hydrogenerator.

Journal ArticleDOI
TL;DR: A general analytical framework is established for continuous-time stochastic control problems for an ambiguity-averse agent (AAA) with time-inconsistent preference, where the control problems do not satisfy Bellman's principle of optimality.