
Showing papers on "Bellman equation published in 2017"


Journal ArticleDOI
TL;DR: This paper studies a class of continuous-time stochastic control problems which are time-inconsistent in the sense that they do not admit a Bellman optimality principle, and derives an extension of the standard Hamilton–Jacobi–Bellman equation in the form of a system of nonlinear equations for the determination of the equilibrium strategy as well as the equilibrium value function.
Abstract: In this paper, which is a continuation of the discrete-time paper (Bjork and Murgoci in Finance Stoch. 18:545–592, 2014), we study a class of continuous-time stochastic control problems which, in various ways, are time-inconsistent in the sense that they do not admit a Bellman optimality principle. We study these problems within a game-theoretic framework, and we look for Nash subgame perfect equilibrium points. For a general controlled continuous-time Markov process and a fairly general objective functional, we derive an extension of the standard Hamilton–Jacobi–Bellman equation, in the form of a system of nonlinear equations, for the determination of the equilibrium strategy as well as the equilibrium value function. The main theoretical result is a verification theorem. As an application of the general theory, we study a time-inconsistent linear-quadratic regulator. We also present a study of time-inconsistency within the framework of a general equilibrium production economy of Cox–Ingersoll–Ross type (Cox et al. in Econometrica 53:363–384, 1985).
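For orientation, a hedged worked equation: the system of nonlinear equations above extends the single classical HJB equation, which, for a controlled Markov process with infinitesimal generator $\mathcal{A}^u$, running reward $C$, and terminal reward $F$, reads

\[
\partial_t V(t,x) + \sup_{u}\big\{ (\mathcal{A}^{u} V)(t,x) + C(t,x,u) \big\} = 0, \qquad V(T,x) = F(x).
\]

In the time-inconsistent setting, this single equation is replaced by a coupled system that determines the equilibrium strategy, the equilibrium value function, and auxiliary functions jointly.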

252 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider discrete-time infinite horizon problems of optimal control to a terminal set of states, and establish the uniqueness of the solution of Bellman's equation, and provide convergence results for value and policy iterations.
Abstract: In this paper, we consider discrete-time infinite horizon problems of optimal control to a terminal set of states. These are the problems that are often taken as the starting point for adaptive dynamic programming. Under very general assumptions, we establish the uniqueness of the solution of Bellman’s equation, and we provide convergence results for value and policy iterations.
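As a minimal illustration of the value iteration analyzed above, here is a hedged tabular sketch for a small terminal-set (stochastic shortest path) problem; the transition probabilities, stage costs, and the absorbing terminal state are made up for the example and do not reflect the paper's general assumptions.

import numpy as np

# Hypothetical 4-state example: state 3 is the absorbing, cost-free terminal state.
# P[a, s, s'] are transition probabilities under action a; c[s, a] are stage costs.
P = np.array([
    [[0.7, 0.2, 0.1, 0.0],
     [0.1, 0.6, 0.2, 0.1],
     [0.0, 0.1, 0.5, 0.4],
     [0.0, 0.0, 0.0, 1.0]],
    [[0.4, 0.3, 0.3, 0.0],
     [0.0, 0.3, 0.4, 0.3],
     [0.0, 0.0, 0.2, 0.8],
     [0.0, 0.0, 0.0, 1.0]],
])
c = np.array([[2.0, 4.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [0.0, 0.0]])

def value_iteration(P, c, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman operator J <- min_a [ c(s,a) + sum_s' P(s'|s,a) J(s') ]."""
    J = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = c.T + P @ J                # Q[a, s]
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    return J_new, Q.argmin(axis=0)

J_star, policy = value_iteration(P, c)
print("cost-to-go:", J_star, "policy:", policy)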

152 citations


Journal ArticleDOI
TL;DR: A new trigger threshold for discrete-time systems is designed and a detailed Lyapunov stability analysis shows that the proposed event-triggered controller can asymptotically stabilize the discrete-time systems.
Abstract: This paper presents the design of a novel adaptive event-triggered control method based on the heuristic dynamic programming (HDP) technique for nonlinear discrete-time systems with unknown system dynamics. In the proposed method, the control law is only updated when the event-triggered condition is violated. Compared with the periodic updates in the traditional adaptive dynamic programming (ADP) control, the proposed method can reduce the computation and transmission cost. An actor–critic framework is used to learn the optimal event-triggered control law and the value function. Furthermore, a model network is designed to estimate the system state vector. The main contribution of this paper is to design a new trigger threshold for discrete-time systems. A detailed Lyapunov stability analysis shows that our proposed event-triggered controller can asymptotically stabilize the discrete-time systems. Finally, we test our method on two different discrete-time systems, and the simulation results are included.
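A rough, hedged sketch of the event-triggered update logic described above (a fixed trigger threshold on a made-up scalar system; the paper's contribution is an adaptive threshold learned with the actor–critic networks, which is not reproduced here):

import numpy as np

def event_triggered_run(step, policy, x0, threshold=0.1, horizon=200):
    """Hold the last control between events; recompute it only when the gap between
    the current state and the last sampled state exceeds the trigger threshold."""
    x = np.asarray(x0, dtype=float)
    x_sampled = x.copy()
    u = policy(x_sampled)
    n_updates = 0
    for _ in range(horizon):
        if np.linalg.norm(x - x_sampled) > threshold:   # event condition violated
            x_sampled = x.copy()
            u = policy(x_sampled)                        # control law updated
            n_updates += 1
        x = step(x, u)                                   # plant evolves every step
    return x, n_updates

# Hypothetical stable scalar plant with a linear feedback policy.
x_final, updates = event_triggered_run(
    step=lambda x, u: 0.98 * x + 0.1 * u,
    policy=lambda x: -0.5 * x,
    x0=[1.0],
)
print(x_final, "control updates:", updates)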

148 citations


Journal ArticleDOI
TL;DR: This paper develops an off-policy reinforcement learning (RL) algorithm to solve optimal synchronization of multiagent systems by using the framework of graphical games and shows that the optimal distributed policies found by the proposed algorithm satisfy the global Nash equilibrium and synchronize all agents to the leader.
Abstract: This paper develops an off-policy reinforcement learning (RL) algorithm to solve optimal synchronization of multiagent systems. This is accomplished by using the framework of graphical games. In contrast to traditional control protocols, which require complete knowledge of agent dynamics, the proposed off-policy RL algorithm is a model-free approach, in that it solves the optimal synchronization problem without requiring any knowledge of the agent dynamics. A prescribed control policy, called the behavior policy, is applied to each agent to generate and collect data for learning. An off-policy Bellman equation is derived for each agent to learn the value function for the policy under evaluation, called the target policy, and to find an improved policy simultaneously. Actor and critic neural networks along with a least-squares approach are employed to approximate target control policies and value functions using the data generated by applying the prescribed behavior policies. Finally, an off-policy RL algorithm is presented that is implemented in real time and gives the approximate optimal control policy for each agent using only measured data. It is shown that the optimal distributed policies found by the proposed algorithm satisfy the global Nash equilibrium and synchronize all agents to the leader. Simulation results illustrate the effectiveness of the proposed method.

136 citations


Journal ArticleDOI
TL;DR: A novel mixed iterative adaptive dynamic programming algorithm is developed to solve the optimal battery energy management and control problem in smart residential microgrid systems and it is proven that the performance index function is finite under the iterative control law sequence.
Abstract: In this paper, a novel mixed iterative adaptive dynamic programming (ADP) algorithm is developed to solve the optimal battery energy management and control problem in smart residential microgrid systems. Based on the load and electricity rate data, two iterations are constructed, namely the $P$-iteration and the $V$-iteration. The $V$-iteration is implemented based on value iteration, which aims to obtain the iterative control law sequence in each period. The $P$-iteration is implemented based on policy iteration, which updates the iterative value function according to the iterative control law sequence. Properties of the developed mixed iterative ADP algorithm are analyzed. It is shown that the iterative value function is monotonically nonincreasing and converges to the solution of the Bellman equation. In each iteration, it is proven that the performance index function is finite under the iterative control law sequence. Finally, numerical results and comparisons are given to illustrate the performance of the developed algorithm.
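For readers unfamiliar with the two ingredients, here is a hedged, generic tabular sketch of a value-iteration-style control update interleaved with a policy-iteration-style evaluation step on a made-up finite MDP; the paper's actual algorithm operates on load and electricity-rate data over daily periods, which is not reproduced here.

import numpy as np

def greedy_update(P, r, V, gamma=0.95):
    """Value-iteration-style step: improve the control law from the current value."""
    Q = r + gamma * P @ V                       # Q[a, s]
    return Q.argmax(axis=0)

def evaluate_policy(P, r, policy, gamma=0.95):
    """Policy-iteration-style step: solve the linear Bellman equation for a fixed policy."""
    n_states = P.shape[1]
    P_pi = P[policy, np.arange(n_states), :]
    r_pi = r[policy, np.arange(n_states)]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))      # P[a, s, s'], 3 states, 2 actions
r = rng.uniform(0, 1, size=(2, 3))              # hypothetical stage rewards

V = np.zeros(3)
for _ in range(20):                             # alternate the two kinds of updates
    policy = greedy_update(P, r, V)
    V = evaluate_policy(P, r, policy)
print("value:", V, "control law:", policy)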

120 citations


Journal ArticleDOI
TL;DR: In this paper, the optimal control of general stochastic McKean-Vlasov equations under common noise is studied, and a dynamic programming principle is stated for the value function in the Wasserstein space of probability measures, proved from a flow property of the conditional law of the controlled state process.
Abstract: We study the optimal control of a general stochastic McKean-Vlasov equation. Such a problem is motivated originally from the asymptotic formulation of cooperative equilibrium for a large population of particles (players) in mean-field interaction under common noise. Our first main result is to state a dynamic programming principle for the value function in the Wasserstein space of probability measures, which is proved from a flow property of the conditional law of the controlled state process. Next, by relying on the notion of differentiability with respect to probability measures due to P.L. Lions [32], and Itô's formula along a flow of conditional measures, we derive the dynamic programming Hamilton-Jacobi-Bellman equation, and prove the viscosity property together with a uniqueness result for the value function. Finally, we solve explicitly the linear-quadratic stochastic McKean-Vlasov control problem and give an application to an interbank systemic risk model with common noise.

117 citations


Proceedings Article
17 Jul 2017
TL;DR: In this paper, the empirical policy evaluation problem is first transformed into a (quadratic) convex-concave saddle point problem, and then a primal-dual batch gradient method, as well as two stochastic variance reduction methods, are presented for solving the problem.
Abstract: Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states’ long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.
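As a hedged sketch of the saddle-point formulation described above (a plain primal-dual batch gradient on the empirical MSPBE saddle point with made-up data; the stochastic variance-reduction methods that are the paper's main contribution are omitted):

import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 500, 8, 0.9

# Fixed dataset of transitions (feature, reward, next feature) under the target policy.
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
reward = rng.normal(size=n)

# Empirical quantities defining the quadratic saddle point
#   min_theta max_w  w^T (b - A theta) - 0.5 * w^T C w
A = phi.T @ (phi - gamma * phi_next) / n
b = phi.T @ reward / n
C = phi.T @ phi / n

theta, w = np.zeros(d), np.zeros(d)
alpha, beta = 0.05, 0.05
for _ in range(5000):
    grad_theta = -A.T @ w                 # descent direction for the primal variable
    grad_w = b - A @ theta - C @ w        # ascent direction for the dual variable
    theta -= alpha * grad_theta
    w += beta * grad_w

# Sanity check against the direct LSTD solution A theta = b.
print(np.linalg.norm(theta - np.linalg.solve(A, b)))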

98 citations


Proceedings ArticleDOI
01 May 2017
TL;DR: A novel formulation of DDP is presented that is able to accommodate arbitrary nonlinear inequality constraints on both state and control and is shown to outperform other methods for accommodating constraints.
Abstract: Differential dynamic programming (DDP) is a widely used trajectory optimization technique that addresses nonlinear optimal control problems, and can readily handle nonlinear cost functions. However, it does not handle either state or control constraints. This paper presents a novel formulation of DDP that is able to accommodate arbitrary nonlinear inequality constraints on both state and control. The main insight in standard DDP is that a quadratic approximation of the value function can be derived using a recursive backward pass; however, the recursive formulae are only valid for unconstrained problems. The main technical contribution of the presented method is a derivation of the recursive quadratic approximation formula in the presence of nonlinear constraints, after a set of active constraints has been identified at each point in time. This formula is used in a new Constrained-DDP (CDDP) algorithm that iteratively determines these active sets and is guaranteed to converge toward a local minimum. CDDP is demonstrated on several underactuated optimal control problems up to 12D with obstacle avoidance and control constraints and is shown to outperform other methods for accommodating constraints.

95 citations


Journal ArticleDOI
TL;DR: Relations between model predictive control and reinforcement learning are studied for discrete-time linear time-invariant systems with state and input constraints and a quadratic value function.

87 citations


Journal ArticleDOI
TL;DR: Convergence properties of the local policy iteration algorithm are presented to show that the iterative value function is monotonically nonincreasing and converges to the optimum under some mild conditions, and the admissibility of the iterative control law is proven.
Abstract: In this paper, a discrete-time optimal control scheme is developed via a novel local policy iteration adaptive dynamic programming algorithm. In the discrete-time local policy iteration algorithm, the iterative value function and iterative control law can be updated in a subset of the state space, where the computational burden is relaxed compared with the traditional policy iteration algorithm. Convergence properties of the local policy iteration algorithm are presented to show that the iterative value function is monotonically nonincreasing and converges to the optimum under some mild conditions. The admissibility of the iterative control law is proven, which shows that the control system can be stabilized under any of the iterative control laws, even if the iterative control law is updated in a subset of the state space. Finally, two simulation examples are given to illustrate the performance of the developed method.
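An illustrative, hedged tabular analogue of the "local" idea above: policy improvement is carried out only on a chosen subset of states while the control law is left unchanged elsewhere. The finite MDP below is made up, and the paper's neural-network ADP implementation is not reproduced.

import numpy as np

def local_policy_iteration(P, r, subset, gamma=0.95, sweeps=50):
    """Improve the control law only on `subset`; evaluate it over the whole state space."""
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)           # initial admissible control law
    for _ in range(sweeps):
        P_pi = P[policy, np.arange(n_states), :]     # policy evaluation (full space)
        r_pi = r[policy, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        Q = r + gamma * P @ V                        # Q[a, s]
        policy[subset] = Q[:, subset].argmin(axis=0) # improvement only on the subset
    return V, policy

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(5), size=(2, 5))           # 5 states, 2 actions
r = rng.uniform(0, 1, size=(2, 5))                   # hypothetical stage costs
V, policy = local_policy_iteration(P, r, subset=np.array([0, 1, 2]))
print(V, policy)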

85 citations


Journal ArticleDOI
TL;DR: An approach to solving finite-time-horizon suboptimal feedback control problems for partial differential equations is proposed, based on solving dynamic programming equations on adaptive sparse grids; a semi-discrete optimal control problem is introduced and the feedback control is derived from the corresponding value function.
Abstract: An approach to solve finite time horizon suboptimal feedback control problems for partial differential equations is proposed by solving dynamic programming equations on adaptive sparse grids. A semi-discrete optimal control problem is introduced and the feedback control is derived from the corresponding value function. The value function can be characterized as the solution of an evolutionary Hamilton–Jacobi–Bellman (HJB) equation which is defined over a state space whose dimension is equal to the dimension of the underlying semi-discrete system. Besides a low-dimensional semi-discretization, it is important to solve the HJB equation efficiently to address the curse of dimensionality. We propose to apply a semi-Lagrangian scheme using spatially adaptive sparse grids. Sparse grids allow the discretization of the value functions in (higher) space dimensions since the curse of dimensionality of full grid methods arises to a much smaller extent. For additional efficiency, an adaptive grid refinement procedure is explored. The approach is illustrated for the wave equation, and an extension to equations of Schrödinger type is indicated. We present several numerical examples studying the effect that the parameters characterizing the sparse grid have on the accuracy of the value function and the optimal trajectory.
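A minimal, hedged 1D sketch of a semi-Lagrangian backward step of the kind mentioned above, on a plain uniform grid with linear interpolation; the dynamics, costs, and horizon are made up, and the sparse-grid adaptivity that is the point of the paper is not reproduced.

import numpy as np

# Finite-horizon toy problem: dx/dt = u, running cost x^2 + 0.1 u^2, terminal cost x^2.
x_grid = np.linspace(-2.0, 2.0, 201)
u_grid = np.linspace(-1.0, 1.0, 21)
dt, n_steps = 0.01, 100

V = x_grid ** 2                                   # terminal condition V(T, x)
for _ in range(n_steps):                          # march backward in time
    candidates = []
    for u in u_grid:
        x_foot = np.clip(x_grid + dt * u, x_grid[0], x_grid[-1])
        # Semi-Lagrangian update: running cost plus interpolated value at the foot point.
        candidates.append(dt * (x_grid ** 2 + 0.1 * u ** 2) + np.interp(x_foot, x_grid, V))
    V = np.min(np.stack(candidates), axis=0)

print("approximate value at x = 0:", np.interp(0.0, x_grid, V))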

Journal ArticleDOI
TL;DR: In this paper, a queue-aware power and rate allocation with constraints of average fronthaul consumption for delay-sensitive traffic is formulated as an infinite horizon constrained partially observed Markov decision process, which takes both the urgent queue state information and the imperfect channel state information at transmitters (CSIT) into account.
Abstract: The cloud radio access network (C-RAN) provides high spectral and energy efficiency performances, low expenditures, and intelligent centralized system structures to operators, which have attracted intense interest in both academia and industry. In this paper, a hybrid coordinated multipoint transmission (H-CoMP) scheme is designed for the downlink transmission in C-RANs and fulfills the flexible tradeoff between cooperation gain and fronthaul consumption. The queue-aware power and rate allocation with constraints of average fronthaul consumption for delay-sensitive traffic is formulated as an infinite horizon constrained partially observed Markov decision process, which takes both the urgent queue state information and the imperfect channel state information at transmitters (CSIT) into account. To deal with the curse of dimensionality involved with the equivalent Bellman equation, the linear approximation of postdecision value functions is utilized. A stochastic gradient algorithm is presented to allocate the queue-aware power and transmission rate with H-CoMP, which is robust against unpredicted traffic arrivals and uncertainties caused by the imperfect CSIT. Furthermore, to substantially reduce the computing complexity, an online learning algorithm is proposed to estimate the per-queue postdecision value functions and update the Lagrange multipliers. The simulation results demonstrate the performance gains of the proposed stochastic gradient algorithms and confirm the asymptotical convergence of the proposed online learning algorithm.

Journal ArticleDOI
TL;DR: This paper presents a Hamiltonian-driven framework of adaptive dynamic programming (ADP) for continuous-time nonlinear systems, which consists of the evaluation of an admissible control, the comparison between two different admissible policies with respect to the corresponding performance function, and the performance improvement of an admissible control.
Abstract: This paper presents a Hamiltonian-driven framework of adaptive dynamic programming (ADP) for continuous-time nonlinear systems, which consists of the evaluation of an admissible control, the comparison between two different admissible policies with respect to the corresponding performance function, and the performance improvement of an admissible control. It is shown that the Hamiltonian can serve as the temporal difference for continuous-time systems. In the Hamiltonian-driven ADP, the critic network is trained to output the value gradient. Then, the inner product between the critic and the system dynamics produces the value derivative. Under some conditions, the minimization of the Hamiltonian functional is equivalent to the value function approximation. An iterative algorithm starting from an arbitrary admissible control is presented for the optimal control approximation, with its convergence proof. The implementation is accomplished by a neural network approximation. Two simulation studies demonstrate the effectiveness of the Hamiltonian-driven ADP.
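To make the "Hamiltonian as temporal difference" statement concrete, a hedged worked equation (assuming, for illustration, control-affine dynamics $\dot{x}=f(x)+g(x)u$ and running cost $r(x,u)$): for a value-function estimate $\hat{V}$, the Hamiltonian

\[
H\big(x,u,\nabla\hat{V}\big)=\nabla\hat{V}(x)^{\top}\big(f(x)+g(x)u\big)+r(x,u)
\]

vanishes along the optimal pair, since the HJB equation is $\min_{u}H(x,u,\nabla V^{*})=0$; a nonzero Hamiltonian under $\hat{V}$ therefore plays the role of the temporal-difference residual used to train the critic.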

Journal ArticleDOI
TL;DR: In this article, a stochastic optimal control problem in which the state process follows a McKean-Vlasov dynamics is studied; the value function V is shown to admit a nonlinear Feynman-Kac representation with an autonomous forward process, and the dynamic programming principle is rigorously proved for V. The DPP is important in obtaining a characterization of the value function as a solution of a nonlinear partial differential equation (the so-called Hamilton-Jacobi-Bellman equation), in this case on the Wasserstein space of measures.
Abstract: We analyze a stochastic optimal control problem, where the state process follows a McKean-Vlasov dynamics and the diffusion coefficient can be degenerate. We prove that its value function V admits a nonlinear Feynman-Kac representation in terms of a class of forward-backward stochastic differential equations, with an autonomous forward process. We exploit this probabilistic representation to rigorously prove the dynamic programming principle (DPP) for V. The Feynman-Kac representation we obtain has an important role beyond its intermediary role in obtaining our main result: in fact, it would be useful in developing probabilistic numerical schemes for V. The DPP is important in obtaining a characterization of the value function as a solution of a nonlinear partial differential equation (the so-called Hamilton-Jacobi-Bellman equation), in this case on the Wasserstein space of measures. We should note that the usual way of solving these equations is through the Pontryagin maximum principle, which requires some convexity assumptions. There have been previous attempts to use the dynamic programming approach, but these works assumed a priori that the controls were of Markovian feedback type, which helps write the problem only in terms of the distribution of the state process (and the control problem becomes a deterministic problem). In this paper, we consider open-loop controls and derive the dynamic programming principle in this most general case. In order to obtain the Feynman-Kac representation and the randomized dynamic programming principle, we implement the so-called randomization method, which consists in formulating a new McKean-Vlasov control problem, expressed in weak form by taking the supremum over a family of equivalent probability measures. One of the main results of the paper is the proof that this latter control problem has the same value function V as the original control problem.

Journal ArticleDOI
TL;DR: In this article, a time-inconsistent stochastic optimal control problem with a recursive cost functional is studied, and an approximate equilibrium strategy is introduced, which is time-consistent and locally approximately optimal.
Abstract: A time-inconsistent stochastic optimal control problem with a recursive cost functional is studied. An equilibrium strategy is introduced, which is time-consistent and locally approximately optimal. By means of multiperson hierarchical differential games associated with partitions of the time interval, a family of approximate equilibrium strategies is constructed, and by sending the mesh size of the time interval partition to zero, an equilibrium Hamilton–Jacobi–Bellman (HJB) equation is derived, through which the equilibrium value function can be identified and the equilibrium strategy can be obtained. Moreover, a well-posedness result of the equilibrium HJB equation is established under certain conditions, and a verification theorem is proved. Finally, an illustrative example is presented, and some comparisons of different definitions of equilibrium strategy are put in order.

Journal ArticleDOI
TL;DR: It is proved that the iterative sequences of value functions and control policies converge to the optimal ones, and that a model-free iterative equation, derived from the model-based algorithm and integral reinforcement learning, is equivalent to the model-based iterative equation.

Posted Content
TL;DR: A Primal-Dual $\pi$ Learning method is proposed that makes primal-dual updates to the policy and value vectors as new data are revealed and gives a sublinear-time algorithm for solving the average-reward MDP.
Abstract: Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual $\pi$ Learning method in light of the linear duality between the value and policy. The $\pi$ learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed. For an infinite-horizon undiscounted Markov decision process with finite state space $S$ and finite action space $A$, the $\pi$ learning method finds an $\epsilon$-optimal policy using the following number of sample transitions $$ \tilde{O}( \frac{(\tau\cdot t^*_{mix})^2 |S| |A| }{\epsilon^2} ),$$ where $t^*_{mix}$ is an upper bound on the mixing times across all policies and $\tau$ is a parameter characterizing the range of stationary distributions across policies. The $\pi$ learning method also applies to the computational problem of MDPs where the transition probabilities and rewards are explicitly given as the input. In the case where each state transition can be sampled in $\tilde{O}(1)$ time, the $\pi$ learning method gives a sublinear-time algorithm for solving the average-reward MDP.
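For context, a hedged reminder of the linear duality being exploited, written here in the discounted case for concreteness (the paper works with its average-reward analogue): with initial distribution $\nu$ and discount $\gamma$,

\[
\begin{aligned}
\text{(values)}\quad & \min_{v}\ (1-\gamma)\sum_{s}\nu(s)\,v(s)
\quad\text{s.t.}\quad v(s)\ \ge\ r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,v(s')\quad\forall\,s,a,\\
\text{(policies)}\quad & \max_{\mu\ge 0}\ \sum_{s,a}\mu(s,a)\,r(s,a)
\quad\text{s.t.}\quad \sum_{a}\mu(s',a)\ =\ (1-\gamma)\,\nu(s')+\gamma\sum_{s,a}P(s'\mid s,a)\,\mu(s,a)\quad\forall\,s',
\end{aligned}
\]

and an optimal policy can be read off from the dual occupancy measure as $\pi(a\mid s)\propto\mu(s,a)$. Primal-dual $\pi$ learning makes sampled updates to both sides of such a primal-dual pair simultaneously.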

Journal ArticleDOI
TL;DR: This paper presents an approximate optimal control of nonlinear continuous-time systems in affine form by using adaptive dynamic programming (ADP) with event-sampled state and input vectors, and a numerical example is utilized to evaluate the performance of the near-optimal design.
Abstract: This paper presents an approximate optimal control of nonlinear continuous-time systems in affine form by using adaptive dynamic programming (ADP) with event-sampled state and input vectors. The knowledge of the system dynamics is relaxed by using a neural network (NN) identifier with event-sampled inputs. The value function, which becomes an approximate solution to the Hamilton–Jacobi–Bellman equation, is generated by using an event-sampled NN approximator. Subsequently, the NN identifier and the approximated value function are utilized to obtain the optimal control policy. Both the identifier and value function approximator weights are tuned only at the event-sampled instants, leading to an aperiodic update scheme. A novel adaptive event-sampling condition is designed to determine the sampling instants, such that the approximation accuracy and the stability are maintained. A positive lower bound on the minimum inter-sample time is guaranteed to avoid an accumulation point, and the dependence of the inter-sample time upon the NN weight estimates is analyzed. Local ultimate boundedness of the resulting nonlinear impulsive dynamical closed-loop system is shown. Finally, a numerical example is utilized to evaluate the performance of the near-optimal design. The net result is the design of an event-sampled ADP-based controller for nonlinear continuous-time systems.

Posted Content
TL;DR: This paper first transforms the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then presents a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem.
Abstract: Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.

Journal ArticleDOI
TL;DR: In this paper, the problem of computing the social optimum in models with heterogeneous agents subject to idiosyncratic shocks is analyzed; it is equivalent to a deterministic optimal control problem in which the state variable is the infinite-dimensional cross-sectional distribution.

Journal ArticleDOI
TL;DR: In this article, the Merton portfolio optimization problem in the presence of stochastic volatility is studied using asymptotic approximations when the volatility process is characterized by its time scales of fluctuation.
Abstract: We study the Merton portfolio optimization problem in the presence of stochastic volatility using asymptotic approximations when the volatility process is characterized by its time scales of fluctuation. This approach is tractable because it treats the incomplete markets problem as a perturbation around the complete market constant volatility problem for the value function, which is well understood. When volatility is fast mean-reverting, this is a singular perturbation problem for a nonlinear Hamilton-Jacobi-Bellman PDE, while when volatility is slowly varying, it is a regular perturbation. These analyses can be combined for multifactor multiscale stochastic volatility models. The asymptotics share remarkable similarities with the linear option pricing problem, which follows from some new properties of the Merton risk-tolerance function. We give examples in the family of mixtures of power utilities, and we also use our asymptotic analysis to suggest a "practical" strategy which does not require tracking the fast-moving volatility. In this paper, we present formal derivations of asymptotic approximations, and we provide a convergence proof in the case of power utility and single-factor stochastic volatility. We assess our approximation in a particular case where there is an explicit solution.
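For reference, a hedged reminder of the constant-volatility baseline around which the perturbation is carried out: with power utility $U(x)=x^{1-\gamma}/(1-\gamma)$, constant volatility $\sigma$, and excess return $\mu-r$, the classical Merton problem has the explicit optimal fraction of wealth in the risky asset

\[
\pi^{*}=\frac{\mu-r}{\gamma\,\sigma^{2}},
\]

and it is this well-understood complete-market solution (and the associated risk-tolerance function) that the fast and slow volatility factors perturb.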

Journal ArticleDOI
TL;DR: In this paper, the primal-dual methodology is generalized to a backward dynamic programming equation associated with time discretization schemes of (reflected) backward stochastic differential equations (BSDEs).
Abstract: We generalize the primal–dual methodology, which is popular in the pricing of early-exercise options, to a backward dynamic programming equation associated with time discretization schemes of (reflected) backward stochastic differential equations (BSDEs). Taking as an input some approximate solution of the backward dynamic program, which was precomputed, e.g., by least-squares Monte Carlo, this methodology enables us to construct a confidence interval for the unknown true solution of the time-discretized (reflected) BSDE at time 0. We numerically demonstrate the practical applicability of our method in two 5-dimensional nonlinear pricing problems where tight price bounds were previously unavailable.

Journal ArticleDOI
TL;DR: In this article, the authors present a fast and accurate computational method for solving and estimating a class of dynamic programming models with discrete and continuous choice variables, which are typically interpreted as "unobserved state variables" in structural econometric applications.
Abstract: We present a fast and accurate computational method for solving and estimating a class of dynamic programming models with discrete and continuous choice variables. The solution method we develop for structural estimation extends the endogenous grid-point method (EGM) to discrete-continuous (DC) problems. Discrete choices can lead to kinks in the value functions and discontinuities in the optimal policy rules, greatly complicating the solution of the model. We show how these problems are ameliorated in the presence of additive choice-specific independent and identically distributed extreme value taste shocks that are typically interpreted as "unobserved state variables" in structural econometric applications, or serve as "random noise" to smooth out kinks in the value functions in numerical applications. We present Monte Carlo experiments that demonstrate the reliability and efficiency of the DC-EGM algorithm and the associated maximum likelihood estimator for structural estimation of a life-cycle model of consumption with discrete retirement decisions. Keywords: life-cycle model; discrete and continuous choice; Bellman equation; Euler equation; retirement choice; endogenous grid-point method; nested fixed point algorithm; extreme value taste shocks; smoothed max function; structural estimation. JEL codes: C13, C63, D91.
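As a small, hedged illustration of the smoothing role of the extreme value taste shocks mentioned above (a generic logsum/smoothed-max function and the implied logit choice probabilities, not the DC-EGM algorithm itself):

import numpy as np

def smoothed_max(values, scale):
    """The 'logsum': equals, up to an additive constant, the expected maximum of the
    choice-specific values plus i.i.d. type-1 extreme value shocks with the given scale.
    As scale -> 0 it approaches the hard max, so kinks are smoothed out."""
    v = np.asarray(values, dtype=float)
    m = v.max()
    return m + scale * np.log(np.exp((v - m) / scale).sum())

def choice_probabilities(values, scale):
    """Multinomial-logit choice probabilities implied by the same taste shocks."""
    v = np.asarray(values, dtype=float)
    z = np.exp((v - v.max()) / scale)
    return z / z.sum()

# Hypothetical choice-specific values, e.g. 'retire' vs. 'keep working'.
v = [1.00, 1.05]
for scale in (0.5, 0.1, 0.01):
    print(scale, smoothed_max(v, scale), choice_probabilities(v, scale))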

Posted Content
TL;DR: In this paper, a new on-line scheme is presented to design the optimal coordination control for the consensus problem of multi-agent differential games by fuzzy adaptive dynamic programming (FADP), which brings together game theory, the generalized fuzzy hyperbolic model (GFHM) and adaptive dynamic programming.
Abstract: In this paper, a new on-line scheme is presented to design the optimal coordination control for the consensus problem of multi-agent differential games by fuzzy adaptive dynamic programming (FADP), which brings together game theory, the generalized fuzzy hyperbolic model (GFHM) and adaptive dynamic programming. In general, the optimal coordination control for multi-agent differential games is the solution of the coupled Hamilton-Jacobi (HJ) equations. Here, for the first time, GFHMs are used to approximate the solution (value functions) of the coupled HJ equations, based on the policy iteration (PI) algorithm. Namely, for each agent, a GFHM is used to capture the mapping between the local consensus error and the local value function. Since our scheme uses a single-network architecture for each agent (which eliminates the action network compared with the dual-network architecture), it is a more reasonable architecture for multi-agent systems. Furthermore, the approximate solution is utilized to obtain the optimal coordination controls. Finally, we give the stability analysis for our scheme, and prove that the weight estimation error and the local consensus error are uniformly ultimately bounded. Further, the control node trajectory is proven to be cooperatively uniformly ultimately bounded.

Journal ArticleDOI
TL;DR: In this paper, infinite horizon optimal control problems for nonlinear high-dimensional dynamical systems are studied and a reduced-order model is derived for the dynamical system, using the method of proper orthogonal decomposition (POD).
Abstract: In this paper, infinite horizon optimal control problems for nonlinear high-dimensional dynamical systems are studied. Nonlinear feedback laws can be computed via the value function, characterized as the unique viscosity solution to the corresponding Hamilton–Jacobi–Bellman (HJB) equation, which stems from the dynamic programming approach. However, the bottleneck is mainly due to the curse of dimensionality, and HJB equations are solvable only in a relatively small dimension. Therefore, a reduced-order model is derived for the dynamical system, using the method of proper orthogonal decomposition (POD). The resulting errors in the HJB equations are estimated by an a priori error analysis, which is utilized in the numerical approximation to ensure a desired accuracy for the POD method. Numerical experiments illustrate the theoretical findings.
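A brief, hedged sketch of how a POD basis is typically extracted from snapshots via the singular value decomposition (generic NumPy with a made-up snapshot matrix; the paper's a priori error analysis and the coupling to the HJB solver are not reproduced):

import numpy as np

rng = np.random.default_rng(4)

# Snapshot matrix: each column is a state of the high-dimensional system at one time.
n_dofs, n_snapshots = 400, 60
Y = rng.normal(size=(n_dofs, 5)) @ rng.normal(size=(5, n_snapshots))   # nearly low-rank
Y += 0.01 * rng.normal(size=Y.shape)

# POD basis: leading left singular vectors of the snapshot matrix.
U, s, _ = np.linalg.svd(Y, full_matrices=False)
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
ell = int(np.searchsorted(energy, 0.999)) + 1     # smallest basis capturing 99.9% energy
Phi = U[:, :ell]                                  # reduced basis, n_dofs x ell

# Project the snapshots onto the basis and measure the reconstruction error.
Y_r = Phi @ (Phi.T @ Y)
print("basis size:", ell, "relative error:", np.linalg.norm(Y - Y_r) / np.linalg.norm(Y))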

Journal ArticleDOI
TL;DR: This work proves existence and uniqueness of a solution to the BSDE system and characterizes both the value function and the optimal strategy in terms of the unique solution to that system.
Abstract: We study an optimal execution problem in illiquid markets with both instantaneous and persistent price impact and stochastic resilience, when only absolutely continuous trading strategies are admissible. In our model the value function can be described by a three-dimensional system of backward stochastic differential equations (BSDEs) with a singular terminal condition in one component. We prove existence and uniqueness of a solution to the BSDE system and characterize both the value function and the optimal strategy in terms of the unique solution to the BSDE system. Our existence proof is based on an asymptotic expansion of the BSDE system at the terminal time that allows us to express the system in terms of an equivalent system with a finite terminal value but a singular driver.

Journal ArticleDOI
TL;DR: In this paper, an adaptive dynamic programming (ADP)-based robust neural control scheme is developed for a class of unknown continuous-time (CT) non-linear systems, where only system data is necessary to update simultaneously the actor neural network (NN) weights and the critic NN weights.
Abstract: The design of robust controllers for continuous-time (CT) non-linear systems with completely unknown non-linearities is a challenging task. The inability to accurately identify the non-linearities online or offline motivates the design of robust controllers using adaptive dynamic programming (ADP). In this study, an ADP-based robust neural control scheme is developed for a class of unknown CT non-linear systems. To begin with, the robust non-linear control problem is converted into a non-linear optimal control problem via constructing a value function for the nominal system. Then an ADP algorithm is developed to solve the non-linear optimal control problem. The ADP algorithm employs actor-critic dual networks to approximate the control policy and the value function, respectively. Based on this architecture, only system data is necessary to update simultaneously the actor neural network (NN) weights and the critic NN weights. Meanwhile, the persistence of excitation assumption is no longer required by using the Monte Carlo integration method. The closed-loop system with unknown non-linearities is demonstrated to be asymptotically stable under the obtained optimal control. Finally, two examples are provided to validate the developed method.

Journal ArticleDOI
TL;DR: It is shown that under certain assumptions, the adjoint process in the Hybrid Minimum Principle and the gradient of the value function in Hybrid Dynamic Programming are governed by the same set of differential equations and have the same boundary conditions and hence are almost everywhere identical to each other along optimal trajectories.
Abstract: Hybrid optimal control problems are studied for a general class of hybrid systems, where autonomous and controlled state jumps are allowed at the switching instants, and in addition to terminal and running costs, switching between discrete states incurs costs. The statements of the Hybrid Minimum Principle and Hybrid Dynamic Programming are presented in this framework, and it is shown that under certain assumptions, the adjoint process in the Hybrid Minimum Principle and the gradient of the value function in Hybrid Dynamic Programming are governed by the same set of differential equations and have the same boundary conditions and hence are almost everywhere identical to each other along optimal trajectories. Analytic examples are provided to illustrate the results.

ReportDOI
TL;DR: In this article, an empirical framework is introduced to analyze the permanent-transitory decomposition of stochastic discount factor (SDF) processes in dynamic economies, where the permanent component characterizes pricing over long investment horizons.
Abstract: Stochastic discount factor (SDF) processes in dynamic economies admit a permanent-transitory decomposition in which the permanent component characterizes pricing over long investment horizons. This paper introduces an empirical framework to analyze the permanent-transitory decomposition of SDF processes. Specifically, we show how to estimate nonparametrically the solution to the Perron–Frobenius eigenfunction problem of Hansen and Scheinkman (2009). Our empirical framework allows researchers to (i) construct time series of the estimated permanent and transitory components and (ii) estimate the yield and the change of measure which characterize pricing over long investment horizons. We also introduce nonparametric estimators of the continuation value function in a class of models with recursive preferences by reinterpreting the value function recursion as a nonlinear Perron–Frobenius problem. We establish consistency and convergence rates of the eigenfunction estimators and asymptotic normality of the eigenvalue estimator and estimators of related functionals. As an application, we study an economy where the representative agent is endowed with recursive preferences, allowing for general (nonlinear) consumption and earnings growth dynamics.
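A toy, hedged finite-state analogue of the Perron–Frobenius eigenfunction problem referred to above: power iteration on a positive matrix standing in for the pricing operator. The matrix is made up, and the paper's nonparametric sieve estimators and asymptotic theory are not reproduced.

import numpy as np

rng = np.random.default_rng(5)

# Toy pricing operator on 6 states: entries ~ one-period SDF times transition probability.
n = 6
M = rng.uniform(0.1, 1.0, size=(n, n)) * rng.dirichlet(np.ones(n), size=n)

def perron_frobenius(M, iters=500):
    """Power iteration for the dominant eigenvalue rho and positive eigenfunction phi
    solving M phi = rho * phi."""
    phi = np.ones(M.shape[0])
    for _ in range(iters):
        v = M @ phi
        rho = np.linalg.norm(v)
        phi = v / rho
    return rho, phi / phi.sum()

rho, phi = perron_frobenius(M)
# In this discrete analogue, -log(rho) plays the role of the long-horizon yield, and the
# change of measure built from phi characterizes pricing over long investment horizons.
print("dominant eigenvalue:", rho, "eigenfunction:", phi)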

Journal ArticleDOI
TL;DR: This paper shows that the long-run average total cost is monotone in the service rate and the optimal control is a bang–bang control, and proposes a recursive algorithm to compute the value function related quantities by utilizing the MAM theory.
Abstract: In this paper, we study the optimal control of service rates in a queueing system with a Markovian arrival process (MAP) and exponential service times. The service rate is allowed to be state dependent, i.e., we can adjust the service rate according to the queue length and the phase of the MAP. The cost function consists of holding cost and operating cost. The goal is to find the optimal service rates that minimize the long-run average total cost. To achieve that, we use the matrix-analytic methods (MAM) together with the sensitivity-based optimization (SBO) theory. A performance difference formula is derived, which can quantify the difference of the long-run average total cost under any two different settings of service rates. Based on the difference formula, we show that the long-run average total cost is monotone in the service rate and the optimal control is a bang–bang control. We also show that, under some mild conditions, the optimal control policy of service rates is of a quasi-threshold-type. By utilizing the MAM theory, we propose a recursive algorithm to compute the value function related quantities. An iterative algorithm to efficiently find the optimal policy, which is similar to policy iteration, is proposed based on the SBO theory. We further study some special cases of the problem, such as the optimality of the threshold-type policy for the M/M/1 queue. Finally, a number of numerical examples are presented to demonstrate the main results and explore the impact of the phase of the MAP on the optimization in the MAP/M/1 queue.