
Showing papers on "Bellman equation published in 2018"


Proceedings Article
03 Jul 2018
TL;DR: The authors reformulate the Bellman optimality equation into a primal-dual optimization problem using Nesterov's smoothing technique and the Legendre-Fenchel transformation, and develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem.
Abstract: When function approximation is used, solving the Bellman optimality equation with stability guarantees has remained a major open problem in reinforcement learning for decades. The fundamental difficulty is that the Bellman operator may become an expansion in general, resulting in oscillating and even divergent behavior of popular algorithms like Q-learning. In this paper, we revisit the Bellman equation, and reformulate it into a novel primal-dual optimization problem using Nesterov’s smoothing technique and the Legendre-Fenchel transformation. We then develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem where any differentiable function class may be used. We provide what we believe to be the first convergence guarantee for general nonlinear function approximation, and analyze the algorithm’s sample complexity. Empirically, our algorithm compares favorably to state-of-the-art baselines in several benchmark control problems.
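As a hedged illustration (our notation; the paper's actual objective also involves dual variables and a squared smoothed Bellman error), the smoothing step replaces the hard max in the Bellman optimality equation with an entropy-regularized maximization:

$$V_\lambda(s) \;=\; \max_{\pi(\cdot\mid s)} \sum_{a} \pi(a\mid s)\Big(R(s,a) + \gamma\,\mathbb{E}_{s'\mid s,a}\big[V_\lambda(s')\big]\Big) \;+\; \lambda\, H\big(\pi(\cdot\mid s)\big).$$

Applying the Legendre-Fenchel (convex conjugate) transformation to the resulting smoothed Bellman error then yields the saddle-point problem over the value function, the policy, and a dual function that the Smoothed Bellman Error Embedding algorithm minimizes with any differentiable function class.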

224 citations


Journal ArticleDOI
TL;DR: It is theoretically proved that the iterative value function sequence strictly converges to the solution of the coupled Hamilton–Jacobi–Bellman equation and a novel online iterative scheme is proposed, which runs based on the data sampled from the augmented system and the gradient of the value function.
Abstract: This paper focuses on the distributed optimal cooperative control for continuous-time nonlinear multiagent systems (MASs) with completely unknown dynamics via adaptive dynamic programming (ADP) technology. By introducing predesigned extra compensators, the augmented neighborhood error systems are derived, which successfully circumvents the system-knowledge requirement for ADP. It is revealed that the optimal consensus protocols actually work as the solutions of the MAS differential game. A policy iteration algorithm is adopted, and it is theoretically proved that the iterative value function sequence strictly converges to the solution of the coupled Hamilton–Jacobi–Bellman equation. Based on this point, a novel online iterative scheme is proposed, which runs based on the data sampled from the augmented system and the gradient of the value function. Neural networks are employed to implement the algorithm, and the weights are updated, in the least-squares sense, to the ideal value, which yields approximated optimal consensus protocols. Finally, a numerical example is given to illustrate the effectiveness of the proposed scheme.

142 citations


Proceedings Article
03 Jul 2018
TL;DR: In this paper, the authors consider the exploration/exploitation problem in reinforcement learning and show that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy.
Abstract: We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
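Schematically, in our notation (the paper states this per time-step with a local uncertainty term $\nu$), the uncertainty Bellman equation mirrors the ordinary Bellman equation with uncertainty propagated in place of reward:

$$u(s,a) \;=\; \nu(s,a) \;+\; \gamma^{2} \sum_{s'} P(s'\mid s,a) \sum_{a'} \pi(a'\mid s')\, u(s',a'),$$

and its unique fixed point upper-bounds the posterior variance of $Q^{\pi}(s,a)$, so $\sqrt{u}$ can serve as an exploration bonus in place of count-based terms.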

131 citations


Journal ArticleDOI
TL;DR: Monotonicity of the local value iteration ADP algorithm is presented, which shows that under some special conditions of the initial value function and the learning rate function, the iterative value function can monotonically converge to the optimum.
Abstract: In this paper, convergence properties are established for the newly developed discrete-time local value iteration adaptive dynamic programming (ADP) algorithm. The present local iterative ADP algorithm permits an arbitrary positive semidefinite function to initialize the algorithm. Employing a state-dependent learning rate function, for the first time, the iterative value function and iterative control law can be updated in a subset of the state space instead of the whole state space, which effectively relaxes the computational burden. A new analysis method for the convergence property is developed to prove that the iterative value functions will converge to the optimum under some mild constraints. Monotonicity of the local value iteration ADP algorithm is presented, which shows that under some special conditions of the initial value function and the learning rate function, the iterative value function can monotonically converge to the optimum. Finally, three simulation examples and comparisons are given to illustrate the performance of the developed algorithm.
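A minimal tabular sketch of the local-update idea (our toy construction; the paper treats general nonlinear systems with neural-network approximators and an undiscounted cost): each iteration sweeps only a subset of the state space and blends the old value with the one-step backup through a state-dependent learning rate.

import numpy as np

# Toy deterministic system on a finite state set; dynamics and utility are assumed.
n_states, n_actions = 50, 4
rng = np.random.default_rng(0)
next_state = rng.integers(0, n_states, size=(n_states, n_actions))  # f(x, u)
utility = rng.random((n_states, n_actions))                         # U(x, u)
gamma = 0.95            # discount added only to keep this random toy bounded

V = np.zeros(n_states)                                   # positive semidefinite initialization
alpha = lambda x: 0.5 + 0.5 * x / n_states               # state-dependent learning rate in (0, 1]

for it in range(200):
    subset = rng.choice(n_states, size=n_states // 5, replace=False)  # local sweep only
    for x in subset:
        backup = np.min(utility[x] + gamma * V[next_state[x]])        # one-step Bellman backup
        V[x] = (1 - alpha(x)) * V[x] + alpha(x) * backup               # relaxed local update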

128 citations


Journal ArticleDOI
TL;DR: This paper develops algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming, first approximating the optimal policy by means of neural networks in the spirit of deep reinforcement learning, and then the value function by Monte Carlo regression.
Abstract: This paper develops algorithms for high-dimensional stochastic control problems based on deep learning and dynamic programming. Unlike classical approximate dynamic programming approaches, we first approximate the optimal policy by means of neural networks in the spirit of deep reinforcement learning, and then the value function by Monte Carlo regression. This is achieved in the dynamic programming recursion by performance or hybrid iteration, and regress-now methods from numerical probability. We provide a theoretical justification of these algorithms. Consistency and rate of convergence for the control and value function estimates are analyzed and expressed in terms of the universal approximation error of the neural networks, and of the statistical error when estimating the network functions, leaving aside the optimization error. Numerical results on various applications are presented in a companion paper (arxiv.org/abs/1812.05916) and illustrate the performance of the proposed algorithms.
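A rough sketch of the backward recursion under toy assumptions (one-dimensional controlled diffusion, quadratic costs), with the paper's neural networks replaced by a grid search over actions and a polynomial regress-now value estimate, purely to make the two-stage structure (policy step, then Monte Carlo value regression) concrete:

import numpy as np

# Toy controlled dynamics X_{t+1} = X_t + a*dt + sigma*dW with quadratic costs (all assumed).
T_steps, dt, sigma, n_paths = 10, 0.1, 0.3, 2000
actions = np.linspace(-1.0, 1.0, 21)
running_cost = lambda x, a: 0.5 * (x**2 + a**2) * dt
terminal_cost = lambda x: x**2

rng = np.random.default_rng(1)
coefs = [None] * (T_steps + 1)
grid = np.linspace(-3, 3, 50)
coefs[T_steps] = np.polyfit(grid, terminal_cost(grid), 4)

for n in reversed(range(T_steps)):
    x = rng.normal(0.0, 1.0, n_paths)                       # training states at time n
    dW = rng.normal(0.0, np.sqrt(dt), (n_paths, actions.size))
    x_next = x[:, None] + actions[None, :] * dt + sigma * dW
    # "Policy" step: per state, pick the action minimizing cost + estimated continuation value.
    q = running_cost(x[:, None], actions[None, :]) + np.polyval(coefs[n + 1], x_next)
    realized = q.min(axis=1)
    # "Value" step: regress-now Monte Carlo regression of the realized cost-to-go.
    coefs[n] = np.polyfit(x, realized, 4)

print("approximate value at x = 0, t = 0:", np.polyval(coefs[0], 0.0))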

83 citations


Journal ArticleDOI
TL;DR: An adaptive optimal control approach is developed by using the value iteration-based Q-learning (VIQL) with the critic-only structure to design the adaptive constrained optimal controller based on the gradient descent scheme.
Abstract: Reinforcement learning has proved to be a powerful tool for solving optimal control problems over the past few years. However, the data-based constrained optimal control problem of nonaffine nonlinear discrete-time systems has rarely been studied. To solve this problem, an adaptive optimal control approach is developed by using value iteration-based Q-learning (VIQL) with a critic-only structure. Most existing constrained control methods require a particular performance index and only suit linear or affine nonlinear systems, which is restrictive in practice. To overcome this problem, a system transformation is first introduced with a general performance index, and the constrained optimal control problem is converted to an unconstrained optimal control problem. By introducing the action-state value function, i.e., the Q-function, the VIQL algorithm is proposed to learn the optimal Q-function of the data-based unconstrained optimal control problem. The convergence results of the VIQL algorithm are established with an easy-to-realize initial condition $Q^{(0)}(x,a)\geqslant 0$. To implement the VIQL algorithm, a critic-only structure is developed, where only one neural network is required to approximate the Q-function. The converged Q-function obtained from the critic-only VIQL method is employed to design the adaptive constrained optimal controller based on a gradient descent scheme. Finally, the effectiveness of the developed adaptive control method is tested on three examples with computer simulation.
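A tabular stand-in for the VIQL recursion, just to make the iteration visible (the paper works with continuous states and a critic neural network; the sampled data set, costs, and the discount used to keep this random toy bounded are all our assumptions):

import numpy as np

n_states, n_actions = 30, 5
rng = np.random.default_rng(2)
sampled_next = rng.integers(0, n_states, (n_states, n_actions))  # sampled x' for each (x, a)
utility = rng.random((n_states, n_actions))                      # performance index U(x, a)

Q = np.zeros((n_states, n_actions))         # easy-to-realize initialization Q0(x, a) >= 0
for k in range(300):
    # Value-iteration-style Q-update evaluated only on the sampled data set.
    Q = utility + 0.95 * Q[sampled_next].min(axis=-1)

controller = Q.argmin(axis=1)               # greedy (cost-minimizing) control from the converged Q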

83 citations


Journal ArticleDOI
TL;DR: A distributed control scheme for an interconnected system composed of uncertain input affine nonlinear subsystems with event triggered state feedback is presented by using a novel hybrid learning scheme-based approximate dynamic programming with online exploration.
Abstract: In this paper, a distributed control scheme for an interconnected system composed of uncertain input-affine nonlinear subsystems with event-triggered state feedback is presented by using a novel hybrid learning scheme-based approximate dynamic programming with online exploration. First, an approximate solution to the Hamilton–Jacobi–Bellman equation is generated with event-sampled neural network (NN) approximation, and subsequently, a near-optimal control policy for each subsystem is derived. Artificial NNs are utilized as function approximators to develop a suite of identifiers and to learn the dynamics of each subsystem. The NN weight tuning rules for the identifier and the event-triggering condition are derived using Lyapunov stability theory. Taking into account the effects of NN approximation of the system dynamics and bootstrapping, a novel NN weight update is presented to approximate the optimal value function. Finally, a novel strategy to incorporate exploration into the online control framework, using the identifiers, is introduced to reduce the overall cost at the expense of additional computation during the initial online learning phase. The system states and the NN weight estimation errors are regulated, and locally uniformly ultimately bounded results are achieved. The analytical results are substantiated using simulation studies.

80 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the stochastic optimal control problem of McKean-Vlasov stochastically differential equation where the coefficients may depend upon the joint law of the state and control.
Abstract: We consider the stochastic optimal control problem of McKean-Vlasov stochastic differential equations where the coefficients may depend upon the joint law of the state and control. By using feedback controls, we reformulate the problem into a deterministic control problem with only the marginal distribution of the process as controlled state variable, and prove that the dynamic programming principle holds in its general form. Then, by relying on the notion of differentiability with respect to probability measures recently introduced by P.L. Lions in [32], and a special Itô formula for flows of probability measures, we derive the (dynamic programming) Bellman equation for the mean-field stochastic control problem, and prove a verification theorem in our McKean-Vlasov framework. We give explicit solutions to the Bellman equation for the linear quadratic mean-field control problem, with applications to mean-variance portfolio selection and a systemic risk model. We also consider a notion of lifted viscosity solutions for the Bellman equation, and show the viscosity property and uniqueness of the value function to the McKean-Vlasov control problem. Finally, we consider the case of McKean-Vlasov control problems with open-loop controls and discuss the associated dynamic programming equation, which we compare with the case of closed-loop controls.

79 citations


Journal ArticleDOI
TL;DR: A simultaneous policy iteration (SPI) algorithm is developed to solve the optimal regulation problem within the framework of adaptive dynamic programming, and actor and critic networks are employed to approximate the optimal control and the optimal value function.
Abstract: This paper presents a novel robust regulation method for a class of continuous-time nonlinear systems subject to unmatched perturbations. To begin with, the robust regulation problem is transformed into an optimal regulation problem by constructing a value function for the auxiliary system. Then, a simultaneous policy iteration (SPI) algorithm is developed to solve the optimal regulation problem within the framework of adaptive dynamic programming. To implement the SPI algorithm, actor and critic networks are employed to approximate the optimal control and the optimal value function, respectively, and the Monte Carlo integration method is applied to obtain the unknown weight parameters. Finally, two examples, including a power system, are provided to demonstrate the applicability of the developed approach.

68 citations


Posted Content
TL;DR: In this paper, the authors propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world.
Abstract: We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.
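A schematic of the core loop under toy assumptions (point-mass dynamics, quadratic reward, a fixed stand-in value function, naive random-shooting trajectory optimization; none of this is the paper's actual setup): a short-horizon planner is run at every step, with the learned value function supplying the terminal reward so the plan can look beyond its horizon.

import numpy as np

rng = np.random.default_rng(3)
H, n_candidates = 10, 256                       # planning horizon and sampled action sequences

def dynamics(x, u):                             # toy point-mass model (assumed)
    return x + 0.1 * u

def reward(x, u):                               # toy quadratic reward (assumed)
    return -np.sum(x**2, axis=-1) - 0.01 * np.sum(u**2, axis=-1)

def value_fn(x):                                # stand-in for the learned global value function
    return -2.0 * np.sum(x**2, axis=-1)

def plan(x0):                                   # local trajectory optimization by random shooting
    u_seq = rng.uniform(-1, 1, (n_candidates, H, x0.size))
    x = np.repeat(x0[None, :], n_candidates, axis=0)
    ret = np.zeros(n_candidates)
    for t in range(H):
        ret += reward(x, u_seq[:, t])
        x = dynamics(x, u_seq[:, t])
    ret += value_fn(x)                          # terminal value closes the short horizon
    return u_seq[np.argmax(ret), 0]             # MPC: execute only the first action

x = np.array([1.0, -0.5])
for step in range(50):
    x = dynamics(x, plan(x))                    # act online; value learning would happen offline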

67 citations


Journal ArticleDOI
TL;DR: A dynamic discount factor embedded in the iterative Bellman equation is proposed to prevent biased estimation of the action-value function caused by inconstant time-step intervals, and the trained agent is shown to outperform a fixed timing plan in all testing cases, reducing total system delay by 20%.
Abstract: By improving the efficiency of road networks through advanced traffic signal control, intelligent transportation systems help characterize a smart city. Recently, owing to significant progress in artificial intelligence, machine learning-based frameworks for adaptive traffic signal control have attracted considerable attention. In particular, the deep Q-learning neural network is a model-free technique that can be applied to optimal action selection problems. However, setting a variable green time is a key mechanism for reflecting traffic fluctuations, so time steps need not be fixed intervals in the reinforcement learning framework. In this study, the authors propose a dynamic discount factor embedded in the iterative Bellman equation to prevent biased estimation of the action-value function caused by the effects of inconstant time-step intervals. Moreover, the action is added to the input layer of the neural network in the training process, and the output layer is the estimated action-value for the denoted action. The trained neural network can then be used to generate the action that leads to an optimal estimated value within a finite set as the agent's policy. Preliminary results show that the trained agent outperforms a fixed timing plan in all testing cases, reducing total system delay by 20%.
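The abstract does not give the exact functional form of the dynamic discount; one natural reading (ours, not necessarily the authors' formula) is that a transition spanning a variable interval $\Delta t$ is discounted as $\gamma^{\Delta t}$ in the Q-update,

$$Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\Big(r + \gamma^{\Delta t}\max_{a'}Q(s',a') - Q(s,a)\Big),$$

so that longer green phases between decision points are discounted more heavily than a fixed per-step $\gamma$ would imply.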

Journal ArticleDOI
TL;DR: This work formulates the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem under a Bayesian framework, and derives an approximately optimal allocation policy that possesses both one-step-ahead and asymptotic optimality for independent normal sampling distributions.
Abstract: Under a Bayesian framework, we formulate the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem, and derive the associated Bellman equation. Using a value function approximation, we derive an approximately optimal allocation policy. We show that this policy is not only computationally efficient but also possesses both one-step-ahead and asymptotic optimality for independent normal sampling distributions. Moreover, the proposed allocation policy is easily generalizable in the approximate dynamic programming paradigm.

Journal ArticleDOI
TL;DR: In this paper, a pseudospectral collocation approximation of the PDE dynamics and an iterative method for the nonlinear Hamilton-Jacobi-Bellman (HJB) equation associated to the feedback synthesis are proposed.
Abstract: A procedure for the numerical approximation of high-dimensional Hamilton–Jacobi–Bellman (HJB) equations associated to optimal feedback control problems for semilinear parabolic equations is proposed. Its main ingredients are a pseudospectral collocation approximation of the PDE dynamics and an iterative method for the nonlinear HJB equation associated to the feedback synthesis. The latter is known as the successive Galerkin approximation. It can also be interpreted as Newton iteration for the HJB equation. At every step, the associated linear generalized HJB equation is approximated via a separable polynomial approximation ansatz. Stabilizing feedback controls are obtained from solutions to the HJB equations for systems of dimension up to fourteen.

Journal ArticleDOI
TL;DR: In this article, a stochastic control problem for a class of nonlinear kernels is considered and a dynamic programming principle for this control problem in an abstract setting is presented, which is then used to provide a semimartingale characterization of the value function.
Abstract: We consider a stochastic control problem for a class of nonlinear kernels. More precisely, our problem of interest consists in the optimization, over a set of possibly nondominated probability measures, of solutions of backward stochastic differential equations (BSDEs). Since BSDEs are nonlinear generalizations of the traditional (linear) expectations, this problem can be understood as stochastic control of a family of nonlinear expectations, or equivalently of nonlinear kernels. Our first main contribution is to prove a dynamic programming principle for this control problem in an abstract setting, which we then use to provide a semimartingale characterization of the value function. We next explore several applications of our results. We first obtain a wellposedness result for second order BSDEs (as introduced in Soner, Touzi and Zhang [Probab. Theory Related Fields 153 (2012) 149–190]) which does not require any regularity assumption on the terminal condition and the generator. Then we prove a nonlinear optional decomposition in a robust setting, extending recent results of Nutz [Stochastic Process. Appl. 125 (2015) 4543–4555], which we then use to obtain a super-hedging duality in uncertain, incomplete and nonlinear financial markets. Finally, we relate, under additional regularity assumptions, the value function to a viscosity solution of an appropriate path–dependent partial differential equation (PPDE).

Journal ArticleDOI
TL;DR: Effective ways for verifying the Slater type conditions are investigated and other conditions which are based on lower semicontinuity of the optimal value function of the inner maximization problem of the DRO are introduced.
Abstract: A key step in solving minimax distributionally robust optimization (DRO) problems is to reformulate the inner maximization w.r.t. the probability measure as a semi-infinite programming problem through Lagrange duality. Slater type conditions have been widely used for strong duality (zero dual gap) when the ambiguity set is defined through moments. In this paper, we investigate effective ways for verifying the Slater type conditions and introduce other conditions which are based on lower semicontinuity of the optimal value function of the inner maximization problem. Moreover, we propose two discretization schemes for solving the DRO, one for the dualized DRO and the other directly through the ambiguity set of the DRO. In the absence of strong duality, the discretization scheme via Lagrange duality may provide an upper bound for the optimal value of the DRO whereas the direct discretization approach provides a lower bound. Two cutting plane schemes are consequently proposed: one for the discretized dualized DRO and the other for the minimax DRO with discretized ambiguity set. Convergence analysis is presented for the approximation schemes in terms of the optimal value, optimal solutions and stationary points. Comparative numerical results are reported for the resulting algorithms.

Journal ArticleDOI
TL;DR: This paper presents a novel adaptive dynamic programming (ADP)-based self-learning robust optimal control scheme for input-affine continuous-time nonlinear systems with mismatched disturbances, designed by modifying the optimal control law of the auxiliary system.

Journal ArticleDOI
TL;DR: A model-free off-policy integral reinforcement learning algorithm is proposed to solve the optimal robust output containment problem of heterogeneous MAS, in real time, without requiring any knowledge of the system dynamics.
Abstract: This paper investigates the optimal robust output containment problem of general linear heterogeneous multiagent systems (MAS) with completely unknown dynamics. A model-based algorithm using offline policy iteration (PI) is first developed, where the ${p}$-copy internal model principle is utilized to address the system parameter variations. This offline PI algorithm requires the nominal model of each agent, which may not be available in most real-world applications. To address this issue, a discounted performance function is introduced to express the optimal robust output containment problem as an optimal output-feedback design problem with bounded ${\mathcal {L}_{2}}$-gain. To solve this problem online in real time, a Bellman equation is first developed to evaluate a certain control policy and find the updated control policies, simultaneously, using only the state/output information measured online. Then, using this Bellman equation, a model-free off-policy integral reinforcement learning algorithm is proposed to solve the optimal robust output containment problem of heterogeneous MAS, in real time, without requiring any knowledge of the system dynamics. Simulation results are provided to verify the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: A duality-based reformulation method that converts the infinite-dimensional minimax problem into a semi-infinite program that can be solved using existing convergent algorithms and it is proved that there is no duality gap, and that this approach thus preserves optimality.

Journal ArticleDOI
TL;DR: An output feedback Q-learning algorithm is developed for finding the optimal strategies of the discrete-time linear quadratic zero-sum game, which encompasses the H-infinity optimal control problem, and is shown to converge to the nominal GARE solution.

Journal ArticleDOI
TL;DR: This paper presents an approximate optimal distributed control scheme for a known interconnected system composed of input affine nonlinear subsystems using event-triggered state and output feedback via a novel hybrid learning scheme to reduce the convergence time for the learning algorithm.
Abstract: This paper presents an approximate optimal distributed control scheme for a known interconnected system composed of input affine nonlinear subsystems using event-triggered state and output feedback via a novel hybrid learning scheme. First, the cost function for the overall system is redefined as the sum of cost functions of individual subsystems. A distributed optimal control policy for the interconnected system is developed using the optimal value function of each subsystem. To generate the optimal control policy forward in time, neural networks are employed to reconstruct the unknown optimal value function at each subsystem online. In order to retain the advantages of event-triggered feedback for an adaptive optimal controller, a novel hybrid learning scheme is proposed to reduce the convergence time for the learning algorithm. The development is based on the observation that, in event-triggered feedback, the sampling instants are dynamic and result in variable interevent times. To relax the requirement of entire state measurements, an extended nonlinear observer is designed at each subsystem to recover the system internal states from the measurable feedback. Using a Lyapunov-based analysis, it is demonstrated that the system states and the observer errors remain locally uniformly ultimately bounded and that the control policy converges to a neighborhood of the optimal policy. Simulation results are presented to demonstrate the performance of the developed controller.

Posted Content
TL;DR: Structural results about the value function, the optimal policy, and the worst-case optimal transport adversarial model are obtained, exposing a rich structure embedded in the DRO problem and enabling efficient optimization procedures that have the same sample and iteration complexity as a natural non-DRO benchmark algorithm, such as stochastic gradient descent.
Abstract: We consider optimal transport based distributionally robust optimization (DRO) problems with locally strongly convex transport cost functions and affine decision rules. Under conventional convexity assumptions on the underlying loss function, we obtain structural results about the value function, the optimal policy, and the worst-case optimal transport adversarial model. These results expose a rich structure embedded in the DRO problem (e.g., strong convexity even if the non-DRO problem was not strongly convex, a suitable scaling of the Lagrangian for the DRO constraint, etc.), which are crucial for the design of efficient algorithms. As a consequence of these results, one can develop efficient optimization procedures which have the same sample and iteration complexity as a natural non-DRO benchmark algorithm such as stochastic gradient descent.

Journal ArticleDOI
TL;DR: A generalized policy iteration (GPI) algorithm with approximation errors is developed for solving infinite horizon optimal control problems for nonlinear systems and provides a general structure of discrete-time iterative adaptive dynamic programming algorithms by which most of the discrete-time reinforcement learning algorithms can be described using the GPI structure.
Abstract: In this paper, a generalized policy iteration (GPI) algorithm with approximation errors is developed for solving infinite horizon optimal control problems for nonlinear systems. The developed stable GPI algorithm provides a general structure of discrete-time iterative adaptive dynamic programming algorithms, by which most of the discrete-time reinforcement learning algorithms can be described using the GPI structure. It is for the first time that approximation errors are explicitly considered in the GPI algorithm. The properties of the stable GPI algorithm with approximation errors are analyzed. The admissibility of the approximate iterative control law can be guaranteed if the approximation errors satisfy the admissibility criteria. The convergence of the developed algorithm is established, which shows that the iterative value function is convergent to a finite neighborhood of the optimal performance index function, if the approximate errors satisfy the convergence criterion. Finally, numerical examples and comparisons are presented.
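A tabular sketch of the GPI structure (our toy MDP and discounting; the paper treats undiscounted nonlinear optimal control and explicitly tracks the approximation errors): each round performs a finite number of policy evaluation sweeps followed by one improvement step, so a single sweep recovers value iteration and an unbounded number of sweeps recovers policy iteration.

import numpy as np

n_states, n_actions, gamma = 20, 3, 0.9
rng = np.random.default_rng(4)
trans = rng.integers(0, n_states, (n_states, n_actions))   # deterministic transitions (assumed)
util = rng.random((n_states, n_actions))                   # utility (cost) function (assumed)

V = np.zeros(n_states)
policy = np.zeros(n_states, dtype=int)
idx = np.arange(n_states)
for k in range(100):
    N_k = 3                                                # evaluation sweeps this round
    for _ in range(N_k):                                   # partial policy evaluation
        V = util[idx, policy] + gamma * V[trans[idx, policy]]
    policy = (util + gamma * V[trans]).argmin(axis=1)      # policy improvement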

Journal ArticleDOI
TL;DR: This work proposes novel dynamic programming algorithms that alleviate the curse of dimensionality in problems that exhibit certain low-rank structure, and demonstrates the algorithms running in real time on board a quadcopter during a flight experiment under motion capture.
Abstract: Motion planning and control problems are embedded and essential in almost all robotics applications. These problems are often formulated as stochastic optimal control problems and solved using dynamic programming algorithms. Unfortunately, most existing algorithms that guarantee convergence to optimal solutions suffer from the curse of dimensionality: the run time of the algorithm grows exponentially with the dimension of the state space of the system. We propose novel dynamic programming algorithms that alleviate the curse of dimensionality in problems that exhibit certain low-rank structure. The proposed algorithms are based on continuous tensor decompositions recently developed by the authors. Essentially, the algorithms represent high-dimensional functions (e.g., the value function) in a compressed format, and directly perform dynamic programming computations (e.g., value iteration, policy iteration) in this format. Under certain technical assumptions, the new algorithms guarantee convergence towards optimal solutions with arbitrary precision. Furthermore, the run times of the new algorithms scale polynomially with the state dimension and polynomially with the ranks of the value function. This approach realizes substantial computational savings in "compressible" problem instances, where value functions admit low-rank approximations. We demonstrate the new algorithms in a wide range of problems, including a simulated six-dimensional agile quadcopter maneuvering example and a seven-dimensional aircraft perching example. In some of these examples, we estimate computational savings of up to 10 orders of magnitude over standard value iteration algorithms. We further demonstrate the algorithms running in real time on board a quadcopter during a flight experiment under motion capture.
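The simplest matrix analogue of the compressed-format idea (our illustration only; the paper uses continuous tensor decompositions in much higher dimensions and genuine controlled dynamics): run value iteration on a 2-D grid, but re-compress the value function by truncated SVD after every Bellman sweep so all computation stays low-rank.

import numpy as np

n, rank, gamma = 64, 5, 0.95
xs = np.linspace(-1, 1, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
cost = X**2 + Y**2                         # assumed running cost on the grid
V = np.zeros((n, n))

def compress(M, r):                        # truncated SVD keeps only the leading r ranks
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

for it in range(200):
    # Crude Bellman backup: the best "action" moves one grid cell in one of four
    # directions (a stand-in for real controlled dynamics).
    shifted = np.stack([np.roll(V, 1, 0), np.roll(V, -1, 0),
                        np.roll(V, 1, 1), np.roll(V, -1, 1)])
    V = compress(cost + gamma * shifted.min(axis=0), rank)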

Proceedings ArticleDOI
01 Dec 2018
TL;DR: A primal-dual distributed GTD algorithm is proposed and it is proved that it almost surely converges to a set of stationary points of the optimization problem.
Abstract: The goal of this paper is to study a distributed version of the gradient temporal-difference (GTD) learning algorithm for multi-agent Markov decision processes (MDPs). The temporal-difference (TD) learning is a reinforcement learning (RL) algorithm that learns an infinite horizon discounted cost function (or value function) for a given fixed policy without the model knowledge. In the distributed RL case each agent receives local reward through local processing. Information exchange over sparse communication network allows the agents to learn the global value function corresponding to a global reward, which is a sum of local rewards. In this paper, the problem is converted into a constrained convex optimization problem with a consensus constraint. We then propose a primal-dual distributed GTD algorithm and prove that it almost surely converges to a set of stationary points of the optimization problem.
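In schematic form (our notation, not the paper's exact statement), the construction casts distributed GTD as a consensus-constrained convex problem of the type

$$\min_{\theta_1,\dots,\theta_N}\ \sum_{i=1}^{N} J_i(\theta_i) \quad \text{subject to} \quad \theta_i = \theta_j \ \text{ for every edge } (i,j) \text{ of the communication graph},$$

where $J_i$ is agent $i$'s local objective built from its local reward; the Lagrangian of this problem is then driven to a saddle point by stochastic primal-dual updates, so each agent learns the global value function using only local rewards and neighbour communication.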

Journal ArticleDOI
TL;DR: In this paper, a stochastic optimization model for hydropower generation reservoirs was proposed, in which the transition probability matrix was calculated based on copula functions, and the value function of the last period was calculated by stepwise iteration.

Journal ArticleDOI
TL;DR: In this paper, the fully nonlinear stochastic Hamilton-Jacobi-Bellman (HJB) equation is studied for the optimal stochastic control problem of stochastic differential equations with random coefficients.
Abstract: In this paper we study the fully nonlinear stochastic Hamilton–Jacobi–Bellman (HJB) equation for the optimal stochastic control problem of stochastic differential equations with random coefficients...

Posted Content
TL;DR: This work finds that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models, suggesting that Dyna may be a viable approach to model-based reinforcement learning in the Arcade Learning Environment and other high-dimensional problems.
Abstract: Dyna is a fundamental approach to model-based reinforcement learning (MBRL) that interleaves planning, acting, and learning in an online setting. In the most typical application of Dyna, the dynamics model is used to generate one-step transitions from selected start states from the agent's history, which are used to update the agent's value function or policy as if they were real experiences. In this work, one-step Dyna was applied to several games from the Arcade Learning Environment (ALE). We found that the model-based updates offered surprisingly little benefit over simply performing more updates with the agent's existing experience, even when using a perfect model. We hypothesize that to get the most from planning, the model must be used to generate unfamiliar experience. To test this, we experimented with the "shape" of planning in multiple different concrete instantiations of Dyna, performing fewer, longer rollouts, rather than many short rollouts. We found that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models. In addition to these findings regarding Dyna in general, our results represent, to our knowledge, the first time that a learned dynamics model has been successfully used for planning in the ALE, suggesting that Dyna may be a viable approach to MBRL in the ALE and other high-dimensional problems.
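A tabular Dyna-Q toy that exposes the "planning shape" knob discussed above (the paper's experiments use deep networks on ALE; the MDP, budget split, and hyperparameters here are all assumptions): a fixed planning budget can be spent on many one-step model updates or on fewer, longer model rollouts.

import numpy as np

n_states, n_actions, gamma, alpha, eps = 25, 4, 0.95, 0.1, 0.1
rng = np.random.default_rng(5)
true_next = rng.integers(0, n_states, (n_states, n_actions))   # hidden environment (assumed)
true_rew = rng.random((n_states, n_actions))
Q = np.zeros((n_states, n_actions))
model = {}                                       # learned one-step model: (s, a) -> (r, s')

def q_update(s, a, r, s2):
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

rollout_len, n_rollouts = 5, 4                   # planning shape: budget = rollout_len * n_rollouts
s = 0
for step in range(2000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    r, s2 = true_rew[s, a], int(true_next[s, a])
    q_update(s, a, r, s2)                        # learn from real experience
    model[(s, a)] = (r, s2)
    for _ in range(n_rollouts):                  # planning with the learned model
        ps, pa = list(model)[rng.integers(len(model))]
        for _ in range(rollout_len):             # longer rollouts reach less familiar states
            pr, ps2 = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
            ps, pa = ps2, int(Q[ps2].argmax())
            if (ps, pa) not in model:
                break
    s = s2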

Journal ArticleDOI
TL;DR: In this article, an optimal control problem in the Mayer's form in the space of probability measures on R n endowed with the Wasserstein distance is studied, where the knowledge of the initial state and velocity is subject to some uncertainty.

Journal ArticleDOI
TL;DR: The effectiveness of the main ideas developed in this paper is illustrated using several examples, including a practical problem of excitation control of a hydrogenerator, and a new sufficient condition is introduced for the value function to converge to a bounded neighborhood of the optimal value function.
Abstract: Policy iteration approximate dynamic programming (DP) is an important algorithm for solving optimal decision and control problems. In this paper, we focus on the problem associated with policy approximation in policy iteration approximate DP for discrete-time nonlinear systems using infinite-horizon undiscounted value functions. Taking policy approximation error into account, we demonstrate asymptotic stability of the control policy under our problem setting, show boundedness of the value function during each policy iteration step, and introduce a new sufficient condition for the value function to converge to a bounded neighborhood of the optimal value function. Aiming for practical implementation of an approximate policy, we consider using Volterra series, which has been extensively covered in controls literature for its good theoretical properties and for its success in practical applications. We illustrate the effectiveness of the main ideas developed in this paper using several examples including a practical problem of excitation control of a hydrogenerator.

Journal ArticleDOI
TL;DR: A general analytical framework is established for continuous-time stochastic control problems for an ambiguity-averse agent (AAA) with time-inconsistent preference, where the control problems do not satisfy Bellman's principle of optimality.