
Showing papers on "Bellman equation published in 2023"


Journal ArticleDOI
TL;DR: In this paper, the authors propose a simple and intuitive objective function for benchmark outperformance, based on the quadratic deviation (QD) from an elevated benchmark, and provide closed-form solutions for the controls under idealized assumptions.
Abstract: We analyze dynamic investment strategies for benchmark outperformance using two widely-used objectives of practical interest to investors: (i) maximizing the information ratio (IR), and (ii) obtaining a favorable tracking difference (cumulative outperformance) relative to the benchmark. In the case of the tracking difference, we propose a simple and intuitive objective function based on the quadratic deviation (QD) from an elevated benchmark. In order to gain some intuition about these strategies, we provide closed-form solutions for the controls under idealized assumptions. For more realistic cases, we represent the control using a Neural Network (NN) and directly solve a sampled optimization problem, which approximates the original optimal stochastic control formulation. Unlike the typical approach based on dynamic programming (DP), e.g. reinforcement learning, solving the sampled optimization with an NN as a control avoids computing conditional expectations and leads to an optimization problem with a small number of variables. In addition, our NN parameter size is independent of the number of portfolio rebalancing times. Under some assumptions, we prove that a traditional dynamic programming approach results in a high-dimensional problem, whereas directly solving for the control without using DP yields a low-dimensional problem. Our analytical and numerical results illustrate that, compared with IR-optimal strategies with the same expected value of terminal wealth, the QD-optimal investment strategies result in comparatively more diversified asset allocations during certain periods of the investment time horizon.
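
The central idea of the abstract, avoiding dynamic programming by optimizing a parameterized control directly on sampled paths, can be illustrated with a deliberately simplified sketch: a constant-proportion control (a stand-in for the paper's neural network) is fitted by minimizing the sampled quadratic deviation from an elevated benchmark. The GBM dynamics, rates, correlation, and elevation factor below are illustrative assumptions, not the paper's calibration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch: pick a constant stock fraction p (stand-in for an NN control)
# by minimizing the sampled quadratic deviation (QD) from an elevated benchmark.
rng = np.random.default_rng(0)
T, n_steps, n_paths = 10.0, 120, 5_000
dt = T / n_steps
mu, sigma, r = 0.08, 0.18, 0.02           # risky asset drift/vol, risk-free rate (assumed)
mu_b, sigma_b = 0.06, 0.12                # benchmark dynamics (assumed)
elevation = 1.02 ** T                      # "elevated" benchmark multiplier (assumed)

Z = rng.standard_normal((n_paths, n_steps))                       # risky-asset shocks
Zb = 0.7 * Z + np.sqrt(1 - 0.7**2) * rng.standard_normal((n_paths, n_steps))

def sampled_qd(p):
    """Average quadratic deviation of terminal wealth from the elevated benchmark."""
    W = np.ones(n_paths)      # investor wealth
    B = np.ones(n_paths)      # benchmark wealth
    for k in range(n_steps):
        W *= 1 + (r + p * (mu - r)) * dt + p * sigma * np.sqrt(dt) * Z[:, k]
        B *= 1 + mu_b * dt + sigma_b * np.sqrt(dt) * Zb[:, k]
    return np.mean((W - elevation * B) ** 2)

res = minimize_scalar(sampled_qd, bounds=(0.0, 2.0), method="bounded")
print(f"QD-optimal constant stock fraction (toy setting): {res.x:.3f}")
```

Because the same sampled shocks are reused for every candidate control, this is a single deterministic optimization over the control parameters, which is the point the abstract makes about bypassing conditional expectations.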

8 citations


Journal ArticleDOI
TL;DR: In this paper, a variant of actor-critic that employs Monte Carlo rollouts during the policy search updates is proposed, yielding a controllable bias that depends on the number of critic evaluations and enabling convergence-rate guarantees.
Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.
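
A minimal tabular sketch of the variant described above: an actor-critic loop in which the critic value of a state-action pair is estimated by averaging a fixed number of Monte Carlo rollouts, so the bias and variance of the policy gradient step are controlled by the rollout count. The two-state MDP, softmax policy, step sizes, and horizon are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-state, 2-action MDP (assumed for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                    # R[s, a]
              [0.0, 2.0]])
gamma, horizon = 0.95, 50

def softmax_policy(theta, s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def rollout_return(theta, s, a):
    """One truncated Monte Carlo rollout of the discounted return from (s, a)."""
    g, disc = 0.0, 1.0
    for _ in range(horizon):
        g += disc * R[s, a]
        s = rng.choice(2, p=P[s, a])
        a = rng.choice(2, p=softmax_policy(theta, s))
        disc *= gamma
    return g

def mc_q(theta, s, a, n_rollouts):
    """Critic estimate: average of n_rollouts Monte Carlo returns; the bias/variance
    of the resulting policy gradient shrinks as n_rollouts grows."""
    return np.mean([rollout_return(theta, s, a) for _ in range(n_rollouts)])

theta = np.zeros((2, 2))
alpha, n_rollouts = 0.05, 10
for _ in range(200):
    s = rng.integers(2)
    pi = softmax_policy(theta, s)
    a = rng.choice(2, p=pi)
    q_hat = mc_q(theta, s, a, n_rollouts)
    grad_log = -pi
    grad_log[a] += 1.0                       # d/dtheta log pi(a|s) for the softmax policy
    theta[s] += alpha * q_hat * grad_log     # policy gradient ascent step
print("Learned policy probabilities per state:",
      [softmax_policy(theta, s).round(2) for s in range(2)])
```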

4 citations


Journal ArticleDOI
TL;DR: In this article, a Weighted Dueling Double Deep Q-Network with embedded human Expertise (WD3QNE) is proposed for real-time treatment of sepsis.
Abstract: Deep Reinforcement Learning (DRL) has been increasingly attempted in assisting clinicians for real-time treatment of sepsis. While a value function quantifies the performance of policies in such decision-making processes, most value-based DRL algorithms cannot evaluate the target value function precisely and are not as safe as clinical experts. In this study, we propose a Weighted Dueling Double Deep Q-Network with embedded human Expertise (WD3QNE). A target Q value function with adaptive dynamic weight is designed to improve the estimation accuracy, and human expertise in decision-making is leveraged. In addition, the random forest algorithm is employed for feature selection to improve model interpretability. We test our algorithm against state-of-the-art value function methods in terms of expected return, survival rate, action distribution and external validation. The results demonstrate that WD3QNE obtains the highest survival rate of 97.81% on the MIMIC-III dataset. Our proposed method is capable of providing reliable treatment decisions with embedded clinician expertise.
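
The key ingredient, a target Q value formed by weighting a Double-DQN style target against a plain max target, can be sketched as follows. The specific blending rule and the array stand-ins for the dueling networks are assumptions for illustration; the paper's actual adaptive dynamic weight is not reproduced here.

```python
import numpy as np

def weighted_double_q_target(q_online_next, q_target_next, reward, gamma, w):
    """Sketch of a weighted target: blend the Double-DQN target (online net selects the
    action, target net evaluates it) with the plain max target. The weight w in [0, 1]
    stands in for the paper's adaptive dynamic weight (assumed fixed here)."""
    a_star = np.argmax(q_online_next, axis=1)                     # action selection
    double_dqn = q_target_next[np.arange(len(a_star)), a_star]    # action evaluation
    plain_max = q_target_next.max(axis=1)
    blended = w * double_dqn + (1.0 - w) * plain_max
    return reward + gamma * blended

# Toy batch: 3 transitions, 4 discrete actions (values are illustrative).
rng = np.random.default_rng(2)
q_online_next = rng.normal(size=(3, 4))
q_target_next = rng.normal(size=(3, 4))
reward = np.array([1.0, 0.0, 0.5])
print(weighted_double_q_target(q_online_next, q_target_next, reward, 0.99, w=0.75))
```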

3 citations


Journal ArticleDOI
TL;DR: In this article, an Itô formula along flows of probability measures associated with general semimartingales is established, which enables the derivation of dynamic programming equations and verification theorems for McKean-Vlasov controls with jump diffusions.

3 citations


Journal ArticleDOI
01 Jan 2023
TL;DR: In this paper, a neural network (NN) approach is proposed to obtain approximate solutions for high-dimensional optimal control problems, and its effectiveness is demonstrated using examples from multiagent path finding.
Abstract: We propose a neural network (NN) approach that yields approximate solutions for high-dimensional optimal control (OC) problems and demonstrate its effectiveness using examples from multiagent path finding. Our approach yields control in a feedback form, where the policy function is given by an NN. In particular, we fuse the Hamilton–Jacobi–Bellman (HJB) and Pontryagin maximum principle (PMP) approaches by parameterizing the value function with an NN. Our approach enables us to obtain approximately optimal controls in real time without having to solve an optimization problem. Once the policy function is trained, generating a control at a given space–time location takes milliseconds; in contrast, efficient nonlinear programming methods typically perform the same task in seconds. We train the NN offline using the objective function of the control problem and penalty terms that enforce the HJB equations. Therefore, our training algorithm does not involve data generated by another algorithm. By training on a distribution of initial states, we ensure the controls' optimality on a large portion of the state space. Our grid-free approach scales efficiently to dimensions where grids become impractical or infeasible. We apply our approach to several multiagent collision-avoidance problems in up to 150 dimensions. Furthermore, we empirically observe that the number of parameters in our approach scales linearly with the dimension of the control problem, thereby mitigating the curse of dimensionality.
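
Read together, the HJB/PMP fusion amounts to parameterizing the value function with an NN, deriving the feedback control from the Hamiltonian minimization, and training against the control objective plus HJB penalty terms. A schematic of that generic structure is below; the notation Phi_theta for the value network, the penalty weights beta_1 and beta_2, and the exact penalty terms are assumptions rather than the paper's precise loss.

```latex
% Schematic penalized training objective: trajectory cost plus penalties enforcing
% the HJB PDE and its terminal condition along sampled trajectories (deterministic
% dynamics f, running cost L, terminal cost G; beta_1, beta_2 are assumed weights).
\min_{\theta}\;
\mathbb{E}_{x_0\sim\rho}\Big[
  \int_0^T L\big(x(t),u_\theta(t,x(t))\big)\,dt + G\big(x(T)\big)
  \;+\; \beta_1 \int_0^T \Big|\partial_t \Phi_\theta
        + \min_u \big\{ L(x,u) + \nabla_x \Phi_\theta \cdot f(x,u) \big\}\Big|\,dt
  \;+\; \beta_2\,\big|\Phi_\theta(T,x(T)) - G(x(T))\big|
\Big],
\qquad
u_\theta(t,x) \;=\; \arg\min_u \big\{ L(x,u) + \nabla_x \Phi_\theta(t,x) \cdot f(x,u) \big\}.
```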

2 citations


Journal ArticleDOI
TL;DR: In this paper, a game-theoretic approach is adopted to study continuous-time non-Markovian stochastic control problems which are inherently time-inconsistent.
Abstract: We develop a theory for continuous-time non-Markovian stochastic control problems which are inherently time-inconsistent. Their distinguishing feature is that the classical Bellman optimality principle no longer holds. Our formulation is cast within the framework of a controlled non-Markovian forward stochastic differential equation, and a general objective functional setting. We adopt a game-theoretic approach to study such problems, meaning that we look for subgame perfect Nash equilibrium points. As a first novelty of this work, we introduce and motivate a refinement of the definition of equilibrium that allows us to establish a direct and rigorous proof of an extended dynamic programming principle, in the same spirit as in the classical theory. This in turn allows us to introduce a system consisting of an infinite family of backward stochastic differential equations analogous to the classical HJB equation. We prove that this system is fundamental, in the sense that its well-posedness is both necessary and sufficient to characterise the value function and equilibria. As a final step, we provide an existence and uniqueness result. Some examples and extensions of our results are also presented.

2 citations



Journal ArticleDOI
TL;DR: In this article, a first contribution towards a complete analysis of non-convex dynamic optimization problems within a dynamic programming approach is presented, and some continuity properties of the value function of the associated optimization problem are proved.
Abstract: A large number of recent studies consider a compartmental SIR model to study optimal control policies aimed at containing the diffusion of COVID-19 while minimizing the economic costs of preventive measures. Such problems are non-convex and standard results need not hold. We use a Dynamic Programming approach and prove some continuity properties of the value function of the associated optimization problem. We study the corresponding Hamilton-Jacobi-Bellman equation and show that the value function solves it in the viscosity sense. Finally, we discuss some optimality conditions. Our paper represents a first contribution towards a complete analysis of non-convex dynamic optimization problems, within a Dynamic Programming approach.

2 citations


Journal ArticleDOI
01 Jan 2023
TL;DR: In this article, an online dual event-triggered adaptive dynamic programming (ADP) optimal control algorithm is proposed for a class of nonlinear systems with constrained state and input, and two comparative experiments based on a robot arm model verify that the algorithm updates the control policies only when the system requires it while maintaining a satisfactory control effect.
Abstract: Taking safety and performance into consideration, the state and control input of an actual engineering system are often constrained. For this kind of problem, this article puts forward an online dual event-triggered (ET) adaptive dynamic programming (ADP) optimal control algorithm for a class of nonlinear systems with constrained state and input. First, the original system is transformed into another system through the barrier function; after that, a suitable value function with a nonquadratic utility function is designed to obtain the optimal control pair. In addition, on the premise of the asymptotic stability of the system, the trigger condition is devised, and the intersampling time analysis proves that the algorithm can avoid the Zeno phenomenon. Moreover, the critic, action, and disturbance neural networks (NNs) are trained to approximate the value function and control sequences, and the approximation error is proved to be uniformly ultimately bounded (UUB). Finally, two comparative experiments based on the robot arm model are simulated to verify that the algorithm updates the control policies only when the system requires it while keeping a satisfactory control effect, which can effectively decrease the number of data transfers and reduce the calculation burden.
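
For orientation, the "nonquadratic utility function" used to handle input constraints in the ADP literature is typically the integral-of-inverse-tanh form below, with saturation bound lambda and weight R; the barrier-function transformation maps the constrained state into an unconstrained one before this cost is minimized. This is the standard template rather than necessarily the exact choice made in the paper.

```latex
% Standard nonquadratic utility for a control constrained to |u| <= lambda,
% together with the resulting saturated optimal control; Q(x) penalizes the
% transformed (barrier) state and g(x) is the input gain of the dynamics.
U(x,u) \;=\; Q(x) \;+\; 2\int_{0}^{u} \lambda\,\tanh^{-1}\!\big(v/\lambda\big)\,R\,dv,
\qquad
u^{*}(x) \;=\; -\lambda \tanh\!\Big(\tfrac{1}{2\lambda}\,R^{-1} g(x)^{\top}\nabla V^{*}(x)\Big).
```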

2 citations


Journal ArticleDOI
TL;DR: In this paper, an event-triggered adaptive dynamic programming (ADP) method is proposed to deal with the H∞ problem with unknown dynamics and constrained input, which is regarded as a two-player zero-sum game with a nonquadratic value function.
Abstract: In this paper, an event-triggered adaptive dynamic programming (ADP) method is proposed to deal with the H∞ problem with unknown dynamics and constrained input. Firstly, the H∞-constrained problem is regarded as a two-player zero-sum game with a nonquadratic value function. Secondly, we develop the event-triggered Hamilton–Jacobi–Isaacs (HJI) equation, and an event-triggered ADP method is proposed to solve the HJI equation, which is equivalent to solving the Nash saddle point of the zero-sum game. An event-based single-critic neural network (NN) is applied to obtain the optimal value function, which reduces the communication resources and computational cost of algorithm implementation. For the event-triggered control, a triggering condition with the level of disturbance attenuation is developed to limit the number of sampled states, and Zeno behavior is avoided by proving the existence of a minimum triggering interval between events. It is proved theoretically that the closed-loop system is asymptotically stable and that the critic NN weight error is uniformly ultimately bounded (UUB). The learning performance of the proposed algorithm is verified by two examples.
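
For reference, the HJI equation being solved corresponds, in its continuous (non-triggered) form, to the standard zero-sum H∞ condition below, with the nonquadratic utility U(u) encoding the input constraint and gamma the disturbance-attenuation level; the event-triggered version evaluates the control only at sampled states. This is the generic template, not the paper's exact equation.

```latex
% Generic HJI equation for the constrained-input H-infinity (two-player zero-sum) problem
% with dynamics xdot = f(x) + g(x) u + k(x) w and attenuation level gamma:
0 \;=\; \nabla V^{\top}\!\big(f(x) + g(x)\,u^{*} + k(x)\,w^{*}\big)
      \;+\; Q(x) \;+\; U(u^{*}) \;-\; \gamma^{2}\,\|w^{*}\|^{2},
\qquad
w^{*}(x) \;=\; \tfrac{1}{2\gamma^{2}}\,k(x)^{\top}\nabla V(x).
```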

1 citation


Proceedings ArticleDOI
31 May 2023
TL;DR: In this article, the authors investigate how leveraging structural information about the system reduces sample complexity in model-based reinforcement learning, and show that there can be significant savings in sample complexity by exploiting such structural knowledge.
Abstract: Model-based Reinforcement Learning (RL) integrates learning and planning and has received increasing attention in recent years. However, learning the model can incur a significant cost (in terms of sample complexity), due to the need to obtain a sufficient number of samples for each state-action pair. In this paper, we investigate the benefits of leveraging structural information about the system in terms of reducing sample complexity. Specifically, we consider the setting where the transition probability matrix is a known function of a number of structural parameters, whose values are initially unknown. We then consider the problem of estimating those parameters based on the interactions with the environment. We characterize the difference between the Q estimates and the optimal Q value as a function of the number of samples. Our analysis shows that there can be a significant saving in sample complexity by leveraging structural information about the model. We illustrate the findings by considering how to control a queuing system with heterogeneous servers.

Journal ArticleDOI
TL;DR: In this paper, the authors study a dynamic optimal reinsurance and dividend-payout problem for an insurance company in a finite time horizon, where the insurance company is allowed to buy reinsurance contracts dynamically over the whole time horizon to cede its risk exposure to other reinsurance companies.
Abstract: This paper studies a dynamic optimal reinsurance and dividend-payout problem for an insurance company in a finite time horizon. The goal of the company is to maximize the expected cumulative discounted dividend payouts until bankruptcy or maturity, whichever comes earlier. The company is allowed to buy reinsurance contracts dynamically over the whole time horizon to cede its risk exposure with other reinsurance companies. This is a mixed singular–classical stochastic control problem, and the corresponding Hamilton–Jacobi–Bellman equation is a variational inequality with a fully nonlinear operator and subject to a gradient constraint. We obtain the [Formula: see text] smoothness of the value function and a comparison principle for its gradient function by the penalty approximation method so that one can establish an efficient numerical scheme to compute the value function. We find that the surplus-time space can be divided into three nonoverlapping regions by a risk-magnitude and time-dependent reinsurance barrier and a time-dependent dividend-payout barrier. The insurance company should be exposed to a higher risk as its surplus increases, be exposed to the entire risk once its surplus upward crosses the reinsurance barrier, and pay out all its reserves exceeding the dividend-payout barrier. The estimated localities of these regions are also provided. Funding: This work was supported by the Hong Kong Research Grants Council [Grants GRF 15202421 and GRF 15202817], the Guangdong Basic and Applied Basic Research Foundation [Grants 2021A1515012031 and 2022A1515010263], the National Natural Science Foundation of China [Grants 11901244 and 11971409], the PolyU-SDU Joint Research Center on Financial Mathematics, the CAS AMSS-PolyU Joint Laboratory of Applied Mathematics, and Hong Kong Polytechnic University.

Journal ArticleDOI
TL;DR: In this paper, event-triggered pinning optimal consensus control for a switched multi-agent system (SMAS) via a switched adaptive dynamic programming (ADP) method is investigated.

Journal ArticleDOI
TL;DR: In this article, the pricing of American options whose asset price dynamics follow Azzalini Itô-McKean skew Brownian motions is considered; the corresponding optimal stopping time problem is then formulated, and the main properties of its value function are provided.

Proceedings ArticleDOI
04 Jun 2023
TL;DR: In this paper, the authors proposed a networked policy gradient play algorithm for solving Markov potential games, where agents use stochastic gradients and local parameter values received from their neighbors to update their policies.
Abstract: We propose a networked policy gradient play algorithm for solving Markov potential games. In a Markov game, each agent has a reward function that depends on the actions of all the agents and a common dynamic state. A differentiable Markov potential game admits a potential value function that has local gradients equal to the gradients of agents’ local value functions. In the proposed algorithm, agents use parameterized policies that depend on the state and other agents’ policies. Agents use stochastic gradients and local parameter values received from their neighbors to update their policies. We show that the joint policy parameters converge to a first-order stationary point of a Markov potential game in expectation for general action and state spaces. Numerical results on the lake game exemplify the convergence of the proposed method.

Journal ArticleDOI
TL;DR: In this article, a hybrid and dynamic policy gradient (HDPG) method is proposed for bipedal locomotion, in which a multi-head critic learns a separate value function for each reward component, leading to hybrid policy gradients.
Abstract: Controlling a non-statically stable bipedal robot is challenging due to the complex dynamics and multi-criterion optimization involved. Recent works have demonstrated the effectiveness of deep reinforcement learning (DRL) for simulated and physical robots. In these methods, the rewards from different criteria are normally summed to learn a scalar function. However, a scalar is less informative and may be insufficient to derive effective information for each reward channel from the complex hybrid rewards. In this work, we propose a novel reward-adaptive reinforcement learning method for biped locomotion, allowing the control policy to be simultaneously optimized by multiple criteria using a dynamic mechanism. The proposed method applies a multi-head critic to learn a separate value function for each reward component, leading to hybrid policy gradients. We further propose dynamic weights, allowing each component to optimize the policy with different priorities. This hybrid and dynamic policy gradient (HDPG) design makes the agent learn more efficiently. We show that the proposed method outperforms summed-up-reward approaches and is able to transfer to physical robots. The MuJoCo results further demonstrate the effectiveness and generalization of HDPG.
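
The multi-head critic and dynamic weighting can be sketched in a few lines: each head estimates a value for one reward component, and the policy gradient combines the per-component advantages with dynamic weights. Network sizes, the weighting rule, and the use of PyTorch are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    """One value head per reward component (sizes are illustrative)."""
    def __init__(self, obs_dim, n_components, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_components))

    def forward(self, obs):
        h = self.body(obs)
        return torch.cat([head(h) for head in self.heads], dim=-1)  # (batch, n_components)

def hybrid_policy_gradient_loss(log_probs, returns, values, weights):
    """Weighted combination of per-component policy-gradient terms.
    returns/values: (batch, n_components); weights: (n_components,), standing in for
    some dynamic prioritization rule (assumed given here)."""
    advantages = (returns - values).detach()            # per-component advantages
    weighted_adv = (advantages * weights).sum(dim=-1)   # combine with dynamic weights
    return -(log_probs * weighted_adv).mean()

# Toy usage with random tensors (shapes only; no environment involved).
critic = MultiHeadCritic(obs_dim=8, n_components=3)
obs = torch.randn(32, 8)
values = critic(obs)
returns = torch.randn(32, 3)
log_probs = torch.randn(32, requires_grad=True)         # stand-in for policy log-probs
weights = torch.tensor([0.5, 0.3, 0.2])
loss = hybrid_policy_gradient_loss(log_probs, returns, values, weights)
loss.backward()
```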

Journal ArticleDOI
01 Jan 2023
TL;DR: SMIX(λ), as proposed in this paper, uses off-policy training to avoid the greedy assumption commonly made in CVF learning, and uses the λ-return as a proxy to compute the temporal difference (TD) error.
Abstract: Learning a stable and generalizable centralized value function (CVF) is a crucial but challenging task in multiagent reinforcement learning (MARL), as it has to deal with the issue that the joint action space increases exponentially with the number of agents in such scenarios. This article proposes an approach, named SMIX(λ), that uses off-policy training to achieve this by avoiding the greedy assumption commonly made in CVF learning. As importance sampling for such off-policy training is both computationally costly and numerically unstable, we propose to use the λ-return as a proxy to compute the temporal difference (TD) error. With this new loss function objective, we adopt a modified QMIX network structure as the base to train our model. By further connecting it with the Q(λ) approach from a unified expectation correction viewpoint, we show that the proposed SMIX(λ) is equivalent to Q(λ) and hence shares its convergence properties, while not suffering from the aforementioned curse-of-dimensionality problem inherent in MARL. Experiments on the StarCraft Multiagent Challenge (SMAC) benchmark demonstrate that our approach not only outperforms several state-of-the-art MARL methods by a large margin but also can be used as a general tool to improve the overall performance of other centralized training with decentralized execution (CTDE)-type algorithms by enhancing their CVFs.
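
The λ-return proxy mentioned above can be computed with the standard backward recursion G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λ G_{t+1}]; the short sketch below implements exactly that on a single trajectory, with the bootstrap values acting as placeholders for the centralized value function.

```python
import numpy as np

def lambda_returns(rewards, values_next, gamma=0.99, lam=0.8):
    """Backward recursion for lambda-returns:
    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    with the final step bootstrapped from V(s_T)."""
    T = len(rewards)
    G = np.zeros(T)
    g_next = values_next[-1]                 # bootstrap for the last step
    for t in reversed(range(T)):
        g_next = rewards[t] + gamma * ((1 - lam) * values_next[t] + lam * g_next)
        G[t] = g_next
    return G

# Toy trajectory: rewards and value estimates V(s_{t+1}) (illustrative numbers).
rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])
values_next = np.array([0.5, 0.8, 0.6, 1.5, 0.0])
print(lambda_returns(rewards, values_next))
```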

Journal ArticleDOI
TL;DR: In this article, a combined value iteration (CVI) framework is developed to address discounted optimal control problems for discrete-time affine nonlinear systems, and the admissibility of the iterative control policy derived from stabilizing value iteration is investigated.
Abstract: In this article, we develop a combined value iteration (CVI) framework to address discounted optimal control problems for discrete-time affine nonlinear systems. First, the admissibility of the iterative control policy generated by novel value iteration (NVI) is investigated. Note that the relaxation factor leads to an adjustable convergence speed. Second, the constraint condition for the discount factor is established to guarantee the admissibility of the iterative control policy derived from stabilizing value iteration (SVI). In addition, the monotonicity is discussed for the iterative cost function sequence. More importantly, CVI is constructed based on NVI and SVI. Third, for the control sequence produced by CVI, the system stability is ensured by the introduction of the attraction region. In the end, a numerical example is included to confirm the related theoretical results.
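
One common way a relaxation factor enters value iteration, and plausibly what underlies the "adjustable convergence speed" mentioned above, is a convex combination of the previous iterate with the Bellman update. The tabular sketch below contrasts plain value iteration with under-relaxed variants on a random MDP; how the relaxation factor enters the paper's NVI may differ, so treat this as a generic illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random discounted tabular MDP (assumed purely for illustration).
nS, nA, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, :]
R = rng.uniform(size=(nS, nA))

def bellman_operator(V):
    return (R + gamma * P @ V).max(axis=1)        # (T V)(s) = max_a [R(s,a) + gamma * E V]

def relaxed_value_iteration(omega, tol=1e-8, max_iter=10_000):
    """V_{k+1} = (1 - omega) V_k + omega * (T V_k); omega = 1 recovers plain VI."""
    V = np.zeros(nS)
    for k in range(max_iter):
        V_new = (1 - omega) * V + omega * bellman_operator(V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, k
        V = V_new
    return V, max_iter

for omega in (1.0, 0.7, 0.4):
    V, iters = relaxed_value_iteration(omega)
    print(f"omega={omega:>4}: converged in {iters} iterations, V[0]={V[0]:.4f}")
```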

Journal ArticleDOI
TL;DR: In this paper, the authors propose a methodology that combines theoretical and empirical aspects to model and solve the problem of finding the optimal market-based corporate financial structure in a risky environment, where the firm's asset value follows a geometric Brownian motion with a return adjusted by the probability of default, which is stochastic and follows an Ornstein-Uhlenbeck process.
Abstract: The purpose of this paper is to propose a methodology that combines both theoretical and empirical aspects to model and solve the problem of finding the optimal market-based corporate financial structure in a risky environment; the firm's asset value follows a geometric Brownian motion with a return adjusted by the probability of default, which is stochastic and follows an Ornstein-Uhlenbeck process. We consider a decision manager who optimizes the financial structure of the company and maximizes the expected utility of the shareholders' final wealth by solving a dynamic programming problem. The value function obeys a quadratic Hamilton-Jacobi-Bellman (HJB) equation and the explicit solution gives the optimal debt ratio. The Kalman filter approach is used to estimate the model parameters. We empirically examine the implications of the credit ratings on the optimal capital structure decision. The data used are daily, obtained from the 'Data Stream' database, and cover a sample of 16 U.S. companies of various sectors and different rating categories over the period from January 01, 2015, to December 31, 2015. The estimated average optimal debt ratios are relatively high, showing different patterns between and within investment and speculative-grade firms. In fact, the evolution of the optimal debt ratio according to the rating is influenced by a restrictive or an easy entry into the debt market and by the value of the estimated rate of return.

Journal ArticleDOI
TL;DR: In this paper, a new iterative adaptive dynamic programming algorithm, the discrete-time time-varying policy iteration (DTTV) algorithm, is developed, in which an iterative control law updates the iterative value function that approximates the optimal performance index function.
Abstract: Aimed at infinite-horizon optimal control problems for discrete time-varying nonlinear systems, this paper develops a new iterative adaptive dynamic programming algorithm, namely the discrete-time time-varying policy iteration (DTTV) algorithm. The iterative control law is designed to update the iterative value function, which approximates the optimal performance index function. The admissibility of the iterative control law is analyzed. The results show that the iterative value function converges non-increasingly to the optimal solution of the Bellman equation. To implement the algorithm, neural networks are employed and a new implementation structure is established, which avoids solving the generalized Bellman equation in each iteration. Finally, the optimal control laws for torsional pendulum and inverted pendulum systems are obtained by using the DTTV policy iteration algorithm, where the mass and pendulum bar length are permitted to be time-varying parameters. The effectiveness of the developed method is illustrated by numerical results and comparisons.

Journal ArticleDOI
TL;DR: In this article, the authors study the portfolio problem of an agent who is aware that a future pandemic can affect her health and personal finances, and show that the optimal portfolio strategy is significantly affected by the mere threat of a potential pandemic.

Journal ArticleDOI
TL;DR: The authors show that an optimistic modification of least-squares value iteration achieves a [Formula: see text] regret bound, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps.
Abstract: Modern reinforcement learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation trade-off. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial run time and polynomial sample complexity in this linear setting, without requiring a “simulator” or additional assumptions. Concretely, we prove that an optimistic modification of least-squares value iteration—a classical algorithm frequently studied in the linear setting—achieves [Formula: see text] regret, where d is the ambient dimension of feature space, H is the length of each episode, and T is the total number of steps. Importantly, such regret is independent of the number of states and actions. Funding: This work was supported by the Defense Advanced Research Projects Agency program on Lifelong Learning Machines.
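
The "optimistic modification" referred to above adds an exploration bonus of the form beta * sqrt(phi' Lambda^{-1} phi) to a least-squares regression of the Q-function in feature space. The compact sketch below follows the generic LSVI-with-bonus template for one backward pass over an episodic dataset; the feature map, dataset layout, and beta are placeholders and the paper's exact constants are not reproduced.

```python
import numpy as np

def optimistic_lsvi(features, rewards, next_features_all, H, beta, lam=1.0):
    """One backward pass of optimistic least-squares value iteration (generic sketch).
    features[h]         : (n, d) array of phi(s_h, a_h) for the collected transitions
    rewards[h]          : (n,) observed rewards
    next_features_all[h]: (n, A, d) features of every action at the observed next state
    Returns per-step weights w_h and Gram inverses so that
    Q_h(s, a) = min( phi(s, a) @ w_h + beta * sqrt(phi @ Lambda_h^{-1} @ phi), H )."""
    d = features[0].shape[1]
    weights = [np.zeros(d) for _ in range(H + 1)]       # w_H = 0 (zero terminal value)
    Lambda_inv = [None] * (H + 1)
    for h in reversed(range(H)):
        Phi = features[h]                               # (n, d)
        Lambda_inv[h] = np.linalg.inv(lam * np.eye(d) + Phi.T @ Phi)
        if h + 1 < H:                                   # optimistic value at the next step
            nxt = next_features_all[h]                  # (n, A, d)
            q = nxt @ weights[h + 1]
            bonus = beta * np.sqrt(np.einsum("nad,de,nae->na", nxt, Lambda_inv[h + 1], nxt))
            v_next = np.clip(q + bonus, None, H).max(axis=1)
        else:
            v_next = np.zeros(len(Phi))                 # zero value beyond the horizon
        weights[h] = Lambda_inv[h] @ (Phi.T @ (rewards[h] + v_next))   # ridge regression
    return weights, Lambda_inv

# Tiny toy call with random data (d=4 features, A=3 actions, n=50 transitions per step).
rng = np.random.default_rng(0)
H, n, d, A = 5, 50, 4, 3
feats = [rng.normal(size=(n, d)) for _ in range(H)]
rews = [rng.uniform(size=n) for _ in range(H)]
nxt_feats = [rng.normal(size=(n, A, d)) for _ in range(H)]
w, _ = optimistic_lsvi(feats, rews, nxt_feats, H, beta=0.5)
print([np.round(wh[:2], 3) for wh in w[:2]])
```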

Journal ArticleDOI
TL;DR: In this paper, the authors considered a risk model with two-sided jumps and proportional investment, where upward jumps and downward jumps represent gains and claims, respectively, and the error between the exact solution (ES) and the sinc approximate solution (SA) was analyzed.
Abstract: In this paper, we consider a risk model with two-sided jumps and proportional investment. The upward jumps and downward jumps represent gains and claims, respectively. We suppose that the company invests all of its surplus, in a certain proportion, in two types of assets: one risk-free (such as bank accounts) and the other risky (such as stocks). Our aim is to find the optimal admissible strategy (including the optimal dividend rate and the optimal ratio of investment in the risky asset) that maximizes the dividend value function, and to discuss the effects of a number of parameters on dividend payments. Firstly, the HJB equation of the dividend value function is obtained by stochastic analysis theory and the dynamic programming method, and the optimal admissible strategy is derived. Since the integro-differential equation satisfied by the dividend value function is difficult to solve, we turn to the sinc numerical method to solve it approximately. Then, the error between the exact solution (ES) and the sinc approximate solution (SA) is analyzed. Finally, the relative error between a special numerical solution and the ES is given, and some examples of sensitivity analysis are discussed. This study provides a theoretical basis for insurance companies to prevent risks better.

Journal ArticleDOI
TL;DR: In this paper, the relationship between the maximum principle and dynamic programming for a large class of optimal control problems with maximum running cost is investigated, and a global and a partial sensitivity relation that link the coextremal with the value function of the problem at hand are obtained.
Abstract: This paper is concerned with the relationship between the maximum principle and dynamic programming for a large class of optimal control problems with maximum running cost. Inspired by a technique introduced by Vinter in the 1980s, we are able to obtain jointly a global and a partial sensitivity relation that link the coextremal with the value function of the problem at hand. One of the main contributions of this work is that these relations are derived by using a single perturbed problem, and therefore, both sensitivity relations hold, at the same time, for the same coextremal. As a by-product, and thanks to the level-set approach, we obtain a new set of sensitivity relations for Mayer problems with state constraints. One important feature of this last result is that it holds under mild assumptions, without the need of imposing strong compatibility assumptions between the dynamics and the state constraints set.

Journal ArticleDOI
TL;DR: In this article, the authors derive a class of Hamilton-Jacobi-Bellman (HJB) equations and prove that the optimal value function of the maximum entropy control problem corresponds to the unique viscosity solution of the HJB equation.
Abstract: Maximum entropy reinforcement learning methods have been successfully applied to a range of challenging sequential decision-making and control tasks. However, most of the existing techniques are designed for discrete-time systems although there has been a growing interest to handle physical processes evolving in continuous time. As a first step toward their extension to continuous-time systems, this article aims to study the theory of maximum entropy optimal control in continuous time. Applying the dynamic programming principle, we derive a novel class of Hamilton–Jacobi–Bellman (HJB) equations and prove that the optimal value function of the maximum entropy control problem corresponds to the unique viscosity solution of the HJB equation. We further show that the optimal control is uniquely characterized as Gaussian in the case of control-affine systems and that, for linear-quadratic problems, the HJB equation is reduced to a Riccati equation, which can be used to obtain an explicit expression of the optimal control. The results of our numerical experiments demonstrate the performance of our maximum entropy method in continuous-time optimal control and reinforcement learning problems.
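
For orientation, entropy-regularized dynamic programming leads to an HJB equation of "soft-minimum" (log-integral) form; the display below is a schematic of that generic structure for deterministic dynamics f, running cost L, and entropy weight lambda, not the paper's exact statement. In the control-affine case the exponent is quadratic in u, which is why the optimal control comes out Gaussian.

```latex
% Schematic maximum-entropy HJB: the inner minimization over control distributions pi
% has the Gibbs form pi*(u) ~ exp(-(1/lambda)[L + grad V . f]); substituting it back
% replaces the usual min_u with a soft (log-integral) minimum.
0 \;=\; \partial_t V(t,x)
   \;+\; \min_{\pi \in \mathcal{P}(U)} \int_U \Big[\, L(x,u)
          + \nabla_x V(t,x)\!\cdot\! f(x,u) + \lambda \log \pi(u) \,\Big]\,\pi(du)
  \;=\; \partial_t V(t,x) \;-\; \lambda \log \int_U
        \exp\!\Big(-\tfrac{1}{\lambda}\big[\,L(x,u) + \nabla_x V(t,x)\!\cdot\! f(x,u)\,\big]\Big)\,du .
```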

Posted ContentDOI
24 Feb 2023
TL;DR: The authors propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work; the resulting uncertainty estimates are easily integrated into common exploration strategies and scale naturally beyond the tabular setting.
Abstract: We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation, but the over-approximation may result in inefficient exploration. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work. Moreover, our uncertainty quantification technique is easily integrated into common exploration strategies and scales naturally beyond the tabular setting by using standard deep reinforcement learning architectures. Experiments in difficult exploration tasks, both in tabular and continuous control settings, show that our sharper uncertainty estimates improve sample-efficiency.
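
In the tabular case, an uncertainty Bellman equation has the same fixed-point structure as the ordinary Bellman equation, with a local-uncertainty term in place of the reward. The sketch below solves such a system under a fixed policy; the local-uncertainty term u_local is left as a placeholder input, since the paper's contribution is precisely a sharper choice of that term.

```python
import numpy as np

def solve_uncertainty_bellman(P_mean, pi, u_local, gamma, n_iters=500):
    """Fixed-point iteration for a generic tabular uncertainty Bellman equation
        U(s,a) = u_local(s,a) + gamma**2 * sum_{s'} P_mean(s'|s,a) * sum_{a'} pi(a'|s') U(s',a').
    P_mean : (S, A, S) mean transition model, pi : (S, A) policy, u_local : (S, A).
    The form of u_local is where UBE variants differ; here it is simply an input."""
    U = np.zeros_like(u_local)
    for _ in range(n_iters):
        v_next = (pi * U).sum(axis=1)                  # (S,) expected uncertainty under pi
        U = u_local + gamma**2 * (P_mean @ v_next)     # (S, A)
    return U

# Toy problem: 4 states, 2 actions, uniform policy, arbitrary local uncertainty.
rng = np.random.default_rng(4)
S, A = 4, 2
P_mean = rng.dirichlet(np.ones(S), size=(S, A))
pi = np.full((S, A), 1.0 / A)
u_local = rng.uniform(size=(S, A))
print(solve_uncertainty_bellman(P_mean, pi, u_local, gamma=0.9).round(3))
```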

Journal ArticleDOI
TL;DR: In this article , a stochastic game in the context of forward-backward Stochastic Differential Equations (SDEs) is considered, where one player implements an impulse control while the opponent controls the system continuously, and it is shown that the upper and lower value functions are both solutions to the same Hamilton-Jacobi-Bellman-Isaacs obstacle problem.

Journal ArticleDOI
TL;DR: In this paper, an adaptive dynamic programming (ADP) based approach is proposed to solve the optimal tracking problem for completely unknown discrete-time systems, where the cost function considers tracking performance, energy consumption and, as a novelty, consecutive changes in the control inputs.
Abstract: Adaptive dynamic programming (ADP) based approaches are effective for solving nonlinear Hamilton–Jacobi–Bellman (HJB) equations in an approximative sense. This paper develops a novel ADP-based approach in which the focus is on minimizing the consecutive changes in control inputs over a finite horizon, in order to solve the optimal tracking problem for completely unknown discrete-time systems. To that end, the cost function considers within its arguments tracking performance, energy consumption and, as a novelty, consecutive changes in the control inputs. Through a suitable system transformation, the optimal tracking problem is transformed into a regulation problem with respect to the state tracking error. The latter leads to a novel performance index function over a finite horizon and a corresponding nonlinear HJB equation that is solved in an approximative iterative sense using a novel iterative ADP-based algorithm. A suitable neural-network-based structure is proposed to learn the initial admissible one-step zero control law. The proposed iterative ADP is implemented using the heuristic dynamic programming technique based on an actor-critic neural network structure. Finally, simulation studies are presented to illustrate the effectiveness of the proposed algorithm.
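
The modified performance index described here, tracking error, control energy, and additionally the consecutive change in the control input, can be written generically as below; the weights Q, R, S and the finite horizon N are placeholders, and the paper's exact index may differ.

```latex
% Generic finite-horizon index with a penalty on consecutive control changes,
% where e_k = x_k - r_k is the tracking error and Delta u_k = u_k - u_{k-1}:
J \;=\; \sum_{k=0}^{N-1} \Big( e_k^{\top} Q\, e_k \;+\; u_k^{\top} R\, u_k
        \;+\; \Delta u_k^{\top} S\, \Delta u_k \Big).
```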

Journal ArticleDOI
TL;DR: In this article, the authors study multi-asset pairs trading strategies that maximize the expected exponential utility of terminal wealth and provide some numerical results to show the characteristics of pairs trading.
Abstract: This paper studies multi-asset pairs trading strategies of maximizing the expected exponential utility of terminal wealth. We model the log-relationship between each pair of stock prices as an Ornstein-Uhlenbeck (O-U) process, and formulate a portfolio optimization problem. Using the classical stochastic control approach based on the Hamilton-Jacobi-Bellman (HJB) equation, we characterize the optimal strategies and provide a verification result for the value function. Finally, we give some numerical results to show the characteristics of pairs trading.


Posted ContentDOI
18 Apr 2023
TL;DR: In this paper, the authors propose a feasible policy iteration (FPI) algorithm that iteratively uses the feasible region of the last policy to constrain the current policy, in order to solve an optimal control problem under safety constraints.
Abstract: Safe reinforcement learning (RL) aims to solve an optimal control problem under safety constraints. Existing direct safe RL methods use the original constraint throughout the learning process. They either lack theoretical guarantees of the policy during iteration or suffer from infeasibility problems. To address this issue, we propose an indirect safe RL method called feasible policy iteration (FPI) that iteratively uses the feasible region of the last policy to constrain the current policy. The feasible region is represented by a feasibility function called constraint decay function (CDF). The core of FPI is a region-wise policy update rule called feasible policy improvement, which maximizes the return under the constraint of the CDF inside the feasible region and minimizes the CDF outside the feasible region. This update rule is always feasible and ensures that the feasible region monotonically expands and the state-value function monotonically increases inside the feasible region. Using the feasible Bellman equation, we prove that FPI converges to the maximum feasible region and the optimal state-value function. Experiments on classic control tasks and Safety Gym show that our algorithms achieve lower constraint violations and comparable or higher performance than the baselines.
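
The region-wise update rule can be made concrete with a small tabular sketch: inside the current feasible region the policy maximizes the return subject to staying feasible, and outside it the policy purely minimizes the CDF. The sign convention (feasible iff CDF <= 0), the tabular setting, and the greedy form of the update are assumptions for illustration; FPI itself is defined for general function approximation.

```python
import numpy as np

def feasible_policy_improvement(q_return, q_cdf, feasible_mask):
    """Region-wise greedy improvement (a tabular sketch of the idea described above).
    q_return      : (S, A) action-values of the return
    q_cdf         : (S, A) action-values of the constraint decay function (CDF);
                    we assume the convention 'feasible iff CDF <= 0'
    feasible_mask : (S,) boolean, states inside the current feasible region."""
    S, _ = q_return.shape
    policy = np.zeros(S, dtype=int)
    for s in range(S):
        if feasible_mask[s]:
            allowed = np.where(q_cdf[s] <= 0.0)[0]     # actions that keep the state feasible
            if len(allowed) == 0:                      # degenerate case: fall back to min-CDF
                policy[s] = int(np.argmin(q_cdf[s]))
            else:
                policy[s] = int(allowed[np.argmax(q_return[s, allowed])])
        else:
            policy[s] = int(np.argmin(q_cdf[s]))       # outside: only shrink the CDF
    return policy

# Toy call with random numbers (shapes only).
rng = np.random.default_rng(5)
q_ret, q_c = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
mask = q_c.min(axis=1) <= 0.0
print(feasible_policy_improvement(q_ret, q_c, mask))
```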