
Showing papers on "Bellman equation published in 2014"


Journal ArticleDOI
TL;DR: An integral reinforcement learning algorithm on an actor-critic structure is developed to learn online the solution to the Hamilton-Jacobi-Bellman equation for partially-unknown constrained-input systems, and it is shown that, using this technique, an easy-to-check condition on the richness of the recorded data is sufficient to guarantee convergence to a near-optimal control law.

410 citations


Journal ArticleDOI
TL;DR: A novel approach based on the Q-learning algorithm is proposed to solve the infinite-horizon linear quadratic tracker (LQT) for unknown discrete-time systems in a causal manner, and the optimal control input is obtained by solving only an augmented algebraic Riccati equation (ARE).

397 citations


Journal ArticleDOI
TL;DR: An online learning algorithm is developed to solve the linear quadratic tracking (LQT) problem for partially-unknown continuous-time systems and it is shown that the value function is quadratic in terms of the state of the system and the command generator.
Abstract: In this technical note, an online learning algorithm is developed to solve the linear quadratic tracking (LQT) problem for partially-unknown continuous-time systems. It is shown that the value function is quadratic in terms of the state of the system and the command generator. Based on this quadratic form, an LQT Bellman equation and an LQT algebraic Riccati equation (ARE) are derived to solve the LQT problem. The integral reinforcement learning technique is used to find the solution to the LQT ARE online and without requiring the knowledge of the system drift dynamics or the command generator dynamics. The convergence of the proposed online algorithm to the optimal control solution is verified. To show the efficiency of the proposed approach, a simulation example is provided.
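As a reading aid, the quadratic structure described in the abstract can be sketched as follows; the discount rate γ, output map C, tracking weights Q and R, and the augmented state X stacking the system state and the reference are illustrative notation, not necessarily the paper's.

```latex
% Augmented state X = [x^T, r^T]^T; quadratic value function V(X) = X^T P X
% (up to a conventional factor of 1/2). Integral (IRL) form of the LQT Bellman
% equation over an interval [t, t+T]:
V\big(X(t)\big) = \int_{t}^{t+T} e^{-\gamma(\tau-t)}
    \Big[\big(Cx(\tau)-r(\tau)\big)^{\top} Q \big(Cx(\tau)-r(\tau)\big)
         + u(\tau)^{\top} R\, u(\tau)\Big]\, d\tau
    + e^{-\gamma T}\, V\big(X(t+T)\big)
```

Evaluating this integral relation along measured trajectories, rather than the differential Bellman equation, is what lets the online algorithm avoid knowledge of the drift and command-generator dynamics.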

320 citations


Journal ArticleDOI
TL;DR: An online adaptive policy learning algorithm (APLA) based on adaptive dynamic programming (ADP) is proposed for learning in real time the solution to the Hamilton-Jacobi-Isaacs (HJI) equation, which appears in the H∞ control problem.
Abstract: The problem of H∞ state feedback control of affine nonlinear discrete-time systems with unknown dynamics is investigated in this paper. An online adaptive policy learning algorithm (APLA) based on adaptive dynamic programming (ADP) is proposed for learning in real time the solution to the Hamilton-Jacobi-Isaacs (HJI) equation, which appears in the H∞ control problem. In the proposed algorithm, three neural networks (NNs) are utilized to find suitable approximations of the optimal value function and the saddle point feedback control and disturbance policies. Novel weight updating laws are given to tune the critic, actor, and disturbance NNs simultaneously by using data generated in real time along the system trajectories. Considering NN approximation errors, we provide the stability analysis of the proposed algorithm with a Lyapunov approach. Moreover, the need for the system input dynamics in the proposed algorithm is relaxed by using an NN identification scheme. Finally, simulation examples show the effectiveness of the proposed algorithm.

197 citations


Journal ArticleDOI
TL;DR: A theory for a general class of discrete-time stochastic control problems that, in various ways, are time-inconsistent in the sense that they do not admit a Bellman optimality principle is developed.
Abstract: We develop a theory for a general class of discrete-time stochastic control problems that, in various ways, are time-inconsistent in the sense that they do not admit a Bellman optimality principle. We attack these problems by viewing them within a game theoretic framework, and we look for subgame perfect Nash equilibrium points. For a general controlled Markov process and a fairly general objective functional, we derive an extension of the standard Bellman equation, in the form of a system of nonlinear equations, for the determination of the equilibrium strategy as well as the equilibrium value function. Most known examples of time-inconsistent stochastic control problems in the literature are easily seen to be special cases of the present theory. We also prove that for every time-inconsistent problem, there exists an associated time-consistent problem such that the optimal control and the optimal value function for the consistent problem coincide with the equilibrium control and value function, respectively, for the time-inconsistent problem. To exemplify the theory, we study some concrete examples, such as hyperbolic discounting and mean–variance control.

188 citations


Journal ArticleDOI
TL;DR: An integral reinforcement learning algorithm based on policy iteration to learn online the Nash equilibrium solution for a two-player zero-sum differential game with completely unknown linear continuous-time dynamics is developed.
Abstract: In this paper, we develop an integral reinforcement learning algorithm based on policy iteration to learn online the Nash equilibrium solution for a two-player zero-sum differential game with completely unknown linear continuous-time dynamics. This algorithm is a fully model-free method that solves the game algebraic Riccati equation forward in time. The developed algorithm updates the value function and the control and disturbance policies simultaneously. Convergence of the algorithm is established by showing its equivalence to Newton's method. To implement this algorithm, one critic network and two action networks are used to approximate the game value function and the control and disturbance policies, respectively, and the least squares method is used to estimate the unknown parameters. The effectiveness of the developed scheme is demonstrated in simulation by designing an H-infinity state feedback controller for a power system.

149 citations


Journal ArticleDOI
TL;DR: In this paper, a class of risk-sensitive mean-field stochastic differential games with exponential cost functions is studied and the corresponding mean field equilibria are characterized in terms of backward-forward macroscopic McKean-Vlasov equations, Fokker-Planck-Kolmogorov equations and HJB equations.
Abstract: In this paper, we study a class of risk-sensitive mean-field stochastic differential games. We show that under appropriate regularity conditions, the mean-field value of the stochastic differential game with exponentiated integral cost functional coincides with the value function satisfying a Hamilton–Jacobi–Bellman (HJB) equation with an additional quadratic term. We provide an explicit solution of the mean-field best response when the instantaneous cost functions are log-quadratic and the state dynamics are affine in the control. An equivalent mean-field risk-neutral problem is formulated and the corresponding mean-field equilibria are characterized in terms of backward-forward macroscopic McKean-Vlasov equations, Fokker-Planck-Kolmogorov equations, and HJB equations. We provide numerical examples on the mean field behavior to illustrate both linear and McKean-Vlasov dynamics.

132 citations


Journal ArticleDOI
TL;DR: Two theorems illustrate how this boundedness condition can be concluded from structural properties, such as controllability and stabilizability, of the strictly dissipative control systems under consideration.
Abstract: We investigate the exponential turnpike property for finite horizon undiscounted discrete time optimal control problems without any terminal constraints. Considering a class of strictly dissipative systems, we derive a boundedness condition for an auxiliary optimal value function which implies the exponential turnpike property. Two theorems illustrate how this boundedness condition can be concluded from structural properties like controllability and stabilizability of the control system under consideration.

116 citations


Journal ArticleDOI
TL;DR: Value iteration-based approximate/adaptive dynamic programming (ADP) as an approximate solution to infinite-horizon optimal control problems with deterministic dynamics and continuous state and action spaces is investigated and a relatively simple proof for the convergence of the outer-loop iterations to the optimal solution is provided.
Abstract: Value iteration-based approximate/adaptive dynamic programming (ADP) as an approximate solution to infinite-horizon optimal control problems with deterministic dynamics and continuous state and action spaces is investigated. The learning iterations are decomposed into an outer loop and an inner loop. A relatively simple proof for the convergence of the outer-loop iterations to the optimal solution is provided, based on an analogy between the value function during the iterations and the value function of a fixed-final-time optimal control problem. The inner loop is used to avoid numerically solving a set of nonlinear equations or a nonlinear optimization problem at each ADP iteration for the policy update. Sufficient conditions for the uniqueness of the solution to the policy update equation and for the convergence of the inner-loop iterations to the solution are obtained. Afterwards, the results are formulated as a learning algorithm for training a neurocontroller or creating a look-up table to be used for optimal control of nonlinear systems with different initial conditions. Finally, some of the features of the investigated method are numerically analyzed.
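As a rough illustration of the outer-loop structure described above, the following minimal sketch runs value iteration for a deterministic system on state and control grids; the paper's inner loop, which iteratively solves the policy-update equation, is replaced here by a plain search over the control grid, and the names f, g, and the grids are assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's algorithm): value iteration for a deterministic
# system x_next = f(x, u) with stage cost g(x, u) on one-dimensional grids.
# The outer loop is the Bellman backup; the paper's inner loop for the policy
# update is replaced by exhaustive search over the control grid.
def value_iteration(f, g, x_grid, u_grid, n_outer=200, tol=1e-6):
    V = np.zeros(len(x_grid))            # value-function estimate on the state grid
    policy = np.zeros(len(x_grid))       # greedy control at each grid point
    for _ in range(n_outer):             # outer loop
        V_new = np.empty_like(V)
        for i, x in enumerate(x_grid):
            costs = [g(x, u) + V[np.argmin(np.abs(x_grid - f(x, u)))] for u in u_grid]
            j = int(np.argmin(costs))
            policy[i], V_new[i] = u_grid[j], costs[j]
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, policy

# Illustrative use on a scalar linear system with quadratic cost:
# V, pi = value_iteration(lambda x, u: 0.9 * x + u, lambda x, u: x**2 + u**2,
#                         np.linspace(-2, 2, 81), np.linspace(-1, 1, 41))
```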

114 citations


Journal ArticleDOI
TL;DR: The method is proved to be consistent and stable, with convergence rates that are optimal with respect to mesh size, and suboptimal in the polynomial degree by only half an order.
Abstract: We propose an $hp$-version discontinuous Galerkin finite element method for fully nonlinear second-order elliptic Hamilton--Jacobi--Bellman equations with Cordes coefficients. The method is proved to be consistent and stable, with convergence rates that are optimal with respect to mesh size, and suboptimal in the polynomial degree by only half an order. Numerical experiments on problems with nonsmooth solutions and strongly anisotropic diffusion coefficients illustrate the accuracy and computational efficiency of the scheme. An existence and uniqueness result for strong solutions of the fully nonlinear problem and a semismoothness result for the nonlinear operator are also provided.

100 citations


Proceedings Article
08 Dec 2014
TL;DR: Compared with the classical DDP and a state-of-the-art GP-based policy search method, PDDP offers a superior combination of data-efficiency, learning speed, and applicability.
Abstract: We present a data-driven, probabilistic trajectory optimization framework for systems with unknown dynamics, called Probabilistic Differential Dynamic Programming (PDDP). PDDP takes into account uncertainty explicitly for dynamics models using Gaussian processes (GPs). Based on the second-order local approximation of the value function, PDDP performs Dynamic Programming around a nominal trajectory in Gaussian belief spaces. Different from typical gradient-based policy search methods, PDDP does not require a policy parameterization and learns a locally optimal, time-varying control policy. We demonstrate the effectiveness and efficiency of the proposed algorithm using two nontrivial tasks. Compared with the classical DDP and a state-of-the-art GP-based policy search method, PDDP offers a superior combination of data-efficiency, learning speed, and applicability.

Journal ArticleDOI
TL;DR: The existence of a weak solution to the system with prescribed initial and terminal conditions m_0, m_1 (positive and smooth) for the density m is proved; this is also a special case of an exact controllability result for the Fokker–Planck equation through some optimal transport field.
Abstract: We consider the planning problem for a class of mean field games, consisting of a coupled system of a Hamilton–Jacobi–Bellman equation for the value function u and a Fokker–Planck equation for the density m of the players, where one wishes to drive the density of players from the given initial configuration to a target one at time T through the optimal decisions of the agents. Assuming that the coupling F(x,m) in the cost criterion is monotone with respect to m, and that the Hamiltonian has some growth bounded below and above by quadratic functions, we prove the existence of a weak solution to the system with prescribed initial and terminal conditions m_0, m_1 (positive and smooth) for the density m. This is also a special case of an exact controllability result for the Fokker–Planck equation through some optimal transport field.
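For orientation, the coupled system referred to above has, in a commonly used form whose notation may differ from the paper's, a backward HJB equation for u and a forward Fokker–Planck equation for m, with the density prescribed at both ends of the time interval:

```latex
% Mean field games planning problem (illustrative form): sigma > 0 is the
% diffusion, H the Hamiltonian, F the monotone coupling.
\begin{aligned}
-\partial_t u - \sigma \Delta u + H(x, \nabla u) &= F(x, m), \\
\partial_t m - \sigma \Delta m - \operatorname{div}\!\big(m\, \nabla_p H(x, \nabla u)\big) &= 0, \\
m(0, \cdot) = m_0, \qquad m(T, \cdot) &= m_1 .
\end{aligned}
```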

Journal ArticleDOI
TL;DR: In this paper, the so-called pessimistic version of bilevel programming is studied and several types of lower subdifferential necessary optimality conditions are derived by using the lower-level value function approach and the Karush-Kuhn-Tucker representation of lower-level optimal solution maps.
Abstract: This article is devoted to the so-called pessimistic version of bilevel programming. Minimization problems of this type are challenging to handle partly because the corresponding value functions are often merely upper (while not lower) semicontinuous. Employing advanced tools of variational analysis and generalized differentiation, we provide rather general frameworks ensuring the Lipschitz continuity of the corresponding value functions. Several types of lower subdifferential necessary optimality conditions are then derived by using the lower-level value function approach and the Karush–Kuhn–Tucker representation of lower-level optimal solution maps. We also derive upper subdifferential necessary optimality conditions of a new type, which can be essentially stronger than the lower ones in some particular settings. Finally, certain links are established between the obtained necessary optimality conditions for the pessimistic and optimistic versions in bilevel programming.

Journal ArticleDOI
TL;DR: In this paper, the authors considered the optimal reinsurance and investment problem in an unobservable Markov-modulated compound Poisson risk model, where the intensity and jump size distribution are not known but have to be inferred from the observations of claim arrivals.
Abstract: We consider the optimal reinsurance and investment problem in an unobservable Markov-modulated compound Poisson risk model, where the intensity and jump size distribution are not known but have to be inferred from the observations of claim arrivals. Using a recently developed result from filtering theory, we reduce the partially observable control problem to an equivalent problem with complete observations. Then using stochastic control theory, we get the closed form expressions of the optimal strategies which maximize the expected exponential utility of terminal wealth. In particular, we investigate the effect of the safety loading and the unobservable factors on the optimal reinsurance strategies. With the help of a generalized Hamilton–Jacobi–Bellman equation where the derivative is replaced by Clarke’s generalized gradient as in Bauerle and Rieder (2007), we characterize the value function, which helps us verify that the strategies we constructed are optimal.

Journal ArticleDOI
TL;DR: This paper extends Carroll's endogenous grid method and its combination with value function iteration to a class of dynamic programming problems, such as problems with both discrete and continuous choices, in which the value function is non-smooth and non-concave.
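For context, the baseline that this paper extends can be sketched in a few lines; the consumption-savings setup below, with CRRA utility and deterministic income, is an illustrative assumption, whereas the paper's contribution is the extension to discrete-continuous choices with non-smooth, non-concave value functions.

```python
import numpy as np

# One step of Carroll's endogenous grid method (EGM) for a plain
# consumption-savings problem (illustrative baseline only).
# a_grid: exogenous grid of end-of-period assets; (m_next, c_next): next-period
# consumption policy on a cash-on-hand grid; CRRA marginal utility u'(c) = c**(-rho).
def egm_step(a_grid, m_next, c_next, R=1.03, beta=0.96, rho=2.0, y=1.0):
    m_prime = R * a_grid + y                       # next-period cash on hand
    c_prime = np.interp(m_prime, m_next, c_next)   # next-period consumption
    # invert the Euler equation u'(c) = beta * R * u'(c') for today's consumption
    c_now = (beta * R * c_prime ** (-rho)) ** (-1.0 / rho)
    m_now = a_grid + c_now                         # endogenous grid of cash on hand
    return m_now, c_now
```

The appeal of the method is that no root finding is needed: the Euler equation is inverted analytically and the grid over current resources is generated endogenously.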

Journal ArticleDOI
01 Dec 2014
TL;DR: The maximum causal entropy framework is extended to the infinite time horizon setting and a gradient-based algorithm for the maximum discounted causal entropy formulation is developed that enjoys the desired feature of being model agnostic, a property that is absent in many previous IRL algorithms.
Abstract: Inverse reinforcement learning (IRL) attempts to use demonstrations of “expert” decision making in a Markov decision process to infer a corresponding policy that shares the “structured, purposeful” qualities of the expert's actions. In this paper, we extend the maximum causal entropy framework, a notable paradigm in IRL, to the infinite time horizon setting. We consider two formulations (maximum discounted causal entropy and maximum average causal entropy) appropriate for the infinite horizon case and show that both result in optimization programs that can be reformulated as convex optimization problems, thus admitting efficient computation. We then develop a gradient-based algorithm for the maximum discounted causal entropy formulation that enjoys the desired feature of being model agnostic, a property that is absent in many previous IRL algorithms. We propose the stationary soft Bellman policy, a key building block in the gradient-based algorithm, and study its properties in depth; this not only leads to theoretical insight into its analytical properties, but also helps motivate a large toolkit of methods for implementing the gradient-based algorithm. Finally, we select three algorithms of this type and apply them to two problem instances involving demonstration data from a simple controlled queuing network model inspired by problems in air traffic management.
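The stationary soft Bellman policy mentioned above has a simple tabular counterpart, shown here purely for illustration with a known transition model and made-up array shapes; the paper's algorithm is gradient-based and model agnostic.

```python
import numpy as np
from scipy.special import logsumexp

# Stationary soft Bellman policy, tabular discounted case (illustrative sketch).
# P[a, s, s'] is a transition kernel and r[s, a] a reward table; both are
# assumptions used only to display the fixed-point structure.
def soft_bellman_policy(P, r, gamma=0.95, n_iter=1000):
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        # soft backup: Q(s,a) = r(s,a) + gamma * E[V(s') | s, a]
        Q = r + gamma * np.einsum('asn,n->sa', P, V)
        # soft value: V(s) = log sum_a exp Q(s,a)
        V = logsumexp(Q, axis=1)
    return np.exp(Q - V[:, None])   # pi(a|s) = exp(Q(s,a) - V(s))
```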

Journal ArticleDOI
TL;DR: The main results are to identify the right Hamilton--Jacobi--Bellman equation and to provide the maximal and minimal solutions, as well as conditions for uniqueness.
Abstract: This article is a continuation of a previous work where we studied infinite horizon control problems for which the dynamic, running cost, and control space may be different in two half-spaces of some Euclidean space $\mathbb{R}^N$. In this article we extend our results in several directions: (i) to more general domains; (ii) to consideration of finite horizon control problems; (iii) to weakening the controllability assumptions. We use a Bellman approach and our main results are to identify the right Hamilton--Jacobi--Bellman equation (and, in particular, the right conditions to be put on the interfaces separating the regions where the dynamic and running cost are different) and to provide the maximal and minimal solutions, as well as conditions for uniqueness. We also provide stability results for such equations.

Journal ArticleDOI
TL;DR: This work derives HJB equations and applies them to two examples, a portfolio optimization and a systemic risk model, and shows that Bellman's principle applies to the dynamic programming value function $V(\tau,\rho_\tau)$, where the dependency on $\rho$ is functional, as in P.L. Lions' analysis of mean-field games (2007).

01 Apr 2014
TL;DR: In this article, the authors introduce the idea of transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function.
Abstract: Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in cooperative decentralized settings, but are difficult to solve optimally (NEXP-Complete). As a new way of solving these problems, we introduce the idea of transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This approach makes use of the fact that planning can be accomplished in a centralized offline manner, while execution can still be distributed. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the first time. When the curse of dimensionality becomes too prohibitive, we refine this basic approach and present ways to combine heuristic search and compact representations that exploit the structure present in multi-agent domains, without losing the ability to eventually converge to an optimal solution. In particular, we introduce feature-based heuristic search that relies on feature-based compact representations, point-based updates and efficient action selection. A theoretical analysis demonstrates that our feature-based heuristic search algorithms terminate in finite time with an optimal solution. We include an extensive empirical analysis using well known benchmarks, thereby demonstrating our approach provides significant scalability improvements compared to the state of the art.

Posted Content
TL;DR: A general finite-horizon problem setting where the optimal value function is monotone is described, a convergence proof for Monotone-ADP is presented, and numerical results are shown for three application domains: optimal stopping, energy storage/allocation, and glycemic control for diabetes patients.
Abstract: Many sequential decision problems can be formulated as Markov Decision Processes (MDPs) where the optimal value function (or cost-to-go function) can be shown to satisfy a monotone structure in some or all of its dimensions. When the state space becomes large, traditional techniques, such as the backward dynamic programming algorithm (i.e., backward induction or value iteration), may no longer be effective in finding a solution within a reasonable time frame, and thus we are forced to consider other approaches, such as approximate dynamic programming (ADP). We propose a provably convergent ADP algorithm called Monotone-ADP that exploits the monotonicity of the value functions in order to increase the rate of convergence. In this paper, we describe a general finite-horizon problem setting where the optimal value function is monotone, present a convergence proof for Monotone-ADP under various technical assumptions, and show numerical results for three application domains: optimal stopping, energy storage/allocation, and glycemic control for diabetes patients. The empirical results indicate that by taking advantage of monotonicity, we can attain high quality solutions within a relatively small number of iterations, using up to two orders of magnitude less computation than is needed to compute the optimal solution exactly.
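The way monotonicity is exploited can be pictured, in a much-simplified one-dimensional form, as a projection applied after each stochastic update of the value estimate; the nondecreasing ordering, learning rate, and array layout below are illustrative assumptions rather than the paper's general setting.

```python
import numpy as np

# Simplified sketch of a Monotone-ADP-style update for a value function assumed
# nondecreasing in a one-dimensional state index (illustrative only).
def monotone_update(V, s_idx, observed_value, alpha=0.1):
    V = V.copy()
    # usual approximate dynamic programming smoothing at the visited state
    V[s_idx] = (1 - alpha) * V[s_idx] + alpha * observed_value
    # monotonicity projection: adjust neighbours so V stays nondecreasing
    V[:s_idx] = np.minimum(V[:s_idx], V[s_idx])
    V[s_idx + 1:] = np.maximum(V[s_idx + 1:], V[s_idx])
    return V
```

Each observation therefore informs not only the visited state but every state ordered above or below it, which is how the monotone structure increases the rate of convergence.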

Journal ArticleDOI
TL;DR: In this paper, the authors considered the problem of risk-sensitive control of continuous time Markov chains taking values in a discrete state space and developed a policy iteration algorithm for finding an optimal control.
Abstract: We study risk-sensitive control of continuous time Markov chains taking values in a discrete state space. We study both finite and infinite horizon problems. In the finite horizon problem we characterize the value function via the Hamilton–Jacobi–Bellman equation and obtain an optimal Markov control. We do the same for the infinite horizon discounted cost case. In the infinite horizon average cost case we establish the existence of an optimal stationary control under a certain Lyapunov condition. We also develop a policy iteration algorithm for finding an optimal control.

Journal ArticleDOI
TL;DR: In this paper, the authors studied the stochastic optimal control problem of fully coupled forward-backward stochastic differential equations (FBSDEs) and proved that the value functions are deterministic, satisfy the dynamic programming principle, and are viscosity solutions of the associated generalized Hamilton–Jacobi–Bellman equations.
Abstract: In this paper we study stochastic optimal control problems of fully coupled forward-backward stochastic differential equations (FBSDEs). The recursive cost functionals are defined by controlled fully coupled FBSDEs. We use a new method to prove that the value functions are deterministic, satisfy the dynamic programming principle, and are viscosity solutions to the associated generalized Hamilton--Jacobi--Bellman (HJB) equations. For this we generalize the notion of stochastic backward semigroup introduced by Peng [Topics on Stochastic Analysis, Science Press, Beijing, 1997, pp. 85--138]. We emphasize that when $\sigma$ depends on the second component of the solution $(Y, Z)$ of the BSDE it makes the stochastic control much more complicated and has as a consequence that the associated HJB equation is combined with an algebraic equation. We prove that the algebraic equation has a unique solution, and moreover, we also give the representation for this solution. On the other hand, we prove a new local existence...

Journal ArticleDOI
TL;DR: This paper proposes a particular form of the problem that exposes some useful properties of the gauge optimization framework (such as the variational properties of its value function), and yet maintains most of the generality of the abstract form of gauge optimization.
Abstract: Gauge functions significantly generalize the notion of a norm, and gauge optimization, as defined by [R. M. Freund, Math. Programming, 38 (1987), pp. 47--67], seeks the element of a convex set that is minimal with respect to a gauge function. This conceptually simple problem can be used to model a remarkable array of useful problems, including a special case of conic optimization, and related problems that arise in machine learning and signal processing. The gauge structure of these problems allows for a special kind of duality framework. This paper explores the duality framework proposed by Freund, and proposes a particular form of the problem that exposes some useful properties of the gauge optimization framework (such as the variational properties of its value function), and yet maintains most of the generality of the abstract form of gauge optimization.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: In this paper, the authors combine the structure of the Hamilton-Jacobi-Bellman equation and its reduction to a linear partial differential equation (PDE) with methods based on low-rank tensor representations, known as separated representations, to address the curse of dimensionality.
Abstract: The Hamilton-Jacobi-Bellman (HJB) equation provides the globally optimal solution to large classes of control problems. Unfortunately, this generality comes at a price: the calculation of such solutions is typically intractable for systems with more than moderate state space size due to the curse of dimensionality. This work combines recent results on the structure of the HJB equation, and its reduction to a linear partial differential equation (PDE), with methods based on low-rank tensor representations, known as separated representations, to address the curse of dimensionality. The result is an algorithm to solve optimal control problems that scales linearly with the number of states in a system and is applicable to systems that are nonlinear with stochastic forcing in finite-horizon, average-cost, and first-exit settings. The method is demonstrated on inverted pendulum, VTOL aircraft, and quadcopter models, with system dimensions two, six, and twelve, respectively.
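The reduction to a linear PDE referred to above is typically obtained through an exponential (desirability) transform; the sketch below is for control-affine dynamics with quadratic control cost in the stationary/first-exit case, with notation that may differ from the paper's and with λ a scalar linking the control weight to the noise covariance Σ.

```latex
% Exponential transform linearizing the HJB (illustrative form): state cost q(x),
% drift f(x), noise covariance \Sigma(x), desirability \Psi = exp(-V / \lambda).
\Psi(x) = e^{-V(x)/\lambda}
\quad\Longrightarrow\quad
0 = -\frac{q(x)}{\lambda}\,\Psi(x) + f(x)^{\top}\nabla\Psi(x)
    + \tfrac{1}{2}\,\mathrm{tr}\!\big(\Sigma(x)\,\nabla^{2}\Psi(x)\big)
```

Because the transformed equation is linear in Ψ, its solution can be represented in separated (low-rank tensor) form, which is what allows the solver to scale linearly with the number of states in the system.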

Journal ArticleDOI
TL;DR: A numerical algorithm to compute high-order approximate solutions to Bellman’s dynamic programming equation that arises in the optimal stabilization of discrete-time nonlinear control systems using a patchy technique to build local Taylor polynomial approximations defined on small domains.
Abstract: In this paper, we present a numerical algorithm to compute high-order approximate solutions to Bellman’s dynamic programming equation that arises in the optimal stabilization of discrete-time nonlinear control systems. The method uses a patchy technique to build local Taylor polynomial approximations defined on small domains, which are then patched together to create a piecewise smooth approximation. The numerical domain is dynamically computed as the level sets of the value function are propagated in reverse time under the closed-loop dynamics. The patch domains are constructed such that their radial boundaries are contained in the level sets of the value function and their lateral boundaries are constructed as invariant sets of the closed-loop dynamics. To minimize the computational effort, an adaptive subdivision algorithm is used to determine the number of patches on each level set depending on the relative error in the dynamic programming equation. Numerical tests in 2D and 3D are given to illustrate the accuracy of the method.

Journal ArticleDOI
TL;DR: In this paper, a state-reduced, recursive dynamic programming implementation of the DICE-2007 model is presented, which simplifies the carbon cycle and the temperature delay equations and solves the infinite planning horizon problem in an arbitrary time step.
Abstract: We introduce a version of the DICE-2007 model designed for uncertainty analysis. DICE is a widespread deterministic integrated assessment model of climate change. Climate change, long-term economic development, and their interactions are highly uncertain. The quantitative analysis of optimal mitigation policy under uncertainty requires a recursive dynamic programming implementation of integrated assessment models. Such implementations are subject to the curse of dimensionality. Every increase in the dimension of the state space is paid for by a combination of (exponentially) increasing processor time, lower quality of the value or policy function approximations, and reductions of the uncertainty domain. The paper promotes a state-reduced, recursive dynamic programming implementation of the DICE-2007 model. We achieve the reduction by simplifying the carbon cycle and the temperature delay equations. We compare our model’s performance and that of the DICE model to the scientific AOGCM models emulated by MAGICC 6.0 and find that our simplified model performs as well as the original DICE model. Our implementation solves the infinite planning horizon problem in an arbitrary time step. The paper is the first to carefully analyze the quality of the value function approximation using two different types of basis functions and systematically varying the dimension of the basis. We present the closed form, continuous time approximation to the exogenous (discretely and inductively defined) processes in DICE, and we present a numerically more efficient re-normalized Bellman equation that, in addition, can disentangle risk attitude from the propensity to smooth consumption over time.

Journal ArticleDOI
TL;DR: Results indicate how the uncertainty in the target motion, the tracker capabilities, and the time since the last observation can affect the control law, and simulations illustrate that the control can be applied to other continuous, smooth trajectories with no need for additional computation.
Abstract: An optimal feedback control is developed for a fixed-speed, fixed-altitude unmanned aerial vehicle (UAV) to maintain a nominal distance from a ground target in a way that anticipates its unknown future trajectory. Stochasticity is introduced in the problem by assuming that the target motion can be modeled as Brownian motion, which accounts for possible realizations of the unknown target kinematics. Moreover, the possibility for the interruption of observations is included by assuming that the duration of observation times of the target is exponentially distributed, giving rise to two discrete states of operation. A Bellman equation based on an approximating Markov chain that is consistent with the stochastic kinematics is used to compute an optimal control policy that minimizes the expected value of a cost function based on a nominal UAV-target distance. Results indicate how the uncertainty in the target motion, the tracker capabilities, and the time since the last observation can affect the control law, and simulations illustrate that the control can further be applied to other continuous, smooth trajectories with no need for additional computation.

Journal ArticleDOI
TL;DR: A novel mean-field framework is proposed that offers a more efficient modeling tool and a more accurate solution scheme in tackling directly the issue of nonseparability and deriving the optimal policies analytically for the multi-period mean-variance-type portfolio selection problems.
Abstract: When a dynamic optimization problem is not decomposable by a stage-wise backward recursion, it is nonseparable in the sense of dynamic programming. The classical dynamic programming-based optimal stochastic control methods would fail in such nonseparable situations as the principle of optimality no longer applies. Among these notorious nonseparable problems, the dynamic mean-variance portfolio selection formulation had posed a great challenge to our research community until recently. Different from the existing literature that invokes embedding schemes and auxiliary parametric formulations to solve the dynamic mean-variance portfolio selection formulation, we propose in this paper a novel mean-field framework that offers a more efficient modeling tool and a more accurate solution scheme in tackling directly the issue of nonseparability and deriving the optimal policies analytically for the multi-period mean-variance-type portfolio selection problems.

Journal ArticleDOI
TL;DR: This work considers power allocation for an access-controlled transmitter with energy harvesting capability based on causal observations of the channel fading state and proposes power allocation algorithms for both the finite- and infinite-horizon cases whose computational complexity is significantly lower than that of the standard discrete MDP method but with improved performance.
Abstract: We consider power allocation for an access-controlled transmitter with energy harvesting capability based on causal observations of the channel fading state. We assume that the system operates in a time-slotted fashion and the channel gain in each slot is a random variable which is independent across slots. Further, we assume that the transmitter is solely powered by a renewable energy source and the energy harvesting process can practically be predicted. With the additional access control for the transmitter and the maximum power constraint, we formulate the stochastic optimization problem of maximizing the achievable rate as a Markov decision process (MDP) with continuous state. To efficiently solve the problem, we define an approximate value function based on a piecewise linear fit in terms of the battery state. We show that with the approximate value function, the update in each iteration consists of a group of convex problems with a continuous parameter. Moreover, we derive the optimal solution to these convex problems in closed form. Further, we propose power allocation algorithms for both the finite- and infinite-horizon cases, whose computational complexity is significantly lower than that of the standard discrete MDP method but with improved performance. Extension to the case of a general payoff function and imperfect energy prediction is also considered. Finally, simulation results demonstrate that the proposed algorithms closely approach the optimal performance.
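For intuition only, a brute-force version of the finite-horizon recursion might look like the sketch below; the rate function, harvest model, and grids are assumptions, and the paper's actual algorithm avoids this cost by fitting a piecewise-linear value function in the battery state and solving the per-slot convex subproblems in closed form.

```python
import numpy as np

# Toy finite-horizon backward recursion for power allocation with an energy
# harvesting battery (illustrative; not the paper's low-complexity algorithm).
def backward_recursion(T, b_grid, p_grid, h_samples, e_harvest, b_max):
    V = np.zeros((T + 1, len(b_grid)))           # V[T] = 0 is the terminal value
    for t in range(T - 1, -1, -1):
        for i, b in enumerate(b_grid):
            vals = []
            for h in h_samples:                  # channel gain observed causally
                best = 0.0
                for p in p_grid[p_grid <= b]:    # cannot spend more than stored
                    b_next = min(b - p + e_harvest, b_max)
                    v_next = np.interp(b_next, b_grid, V[t + 1])
                    best = max(best, np.log1p(h * p) + v_next)
                vals.append(best)
            V[t, i] = np.mean(vals)              # expectation over the fading state
    return V
```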

Journal ArticleDOI
TL;DR: In this paper, the authors established some elementary results on solutions to the Bellman equation without introducing any topological assumption, and applied these results to two optimal growth models: one with a discontinuous production function and the other with a roughly increasing return.
Abstract: We establish some elementary results on solutions to the Bellman equation without introducing any topological assumption. Under a small number of conditions, we show that the Bellman equation has a unique solution in a certain set, that this solution is the value function, and that the value function can be computed by value iteration with an appropriate initial condition. In addition, we show that the value function can be computed by the same procedure under alternative conditions. We apply our results to two optimal growth models: one with a discontinuous production function and the other with “roughly increasing” returns.