Topic

Bellman equation

About: Bellman equation is a research topic. Over its lifetime, 5,884 publications have been published within this topic, receiving 135,589 citations.


Papers
Posted Content
TL;DR: In a newsvendor problem with partially observed Markovian demand, the optimal order is set to exceed the myopic optimal order, and a near-optimal solution is characterized by establishing that the value function is piecewise linear.
Abstract: We consider a newsvendor problem with partially observed Markovian demand. Demand is observed if it is less than the inventory; otherwise, only the event that it is greater than or equal to the inventory is observed. These observations are used to update the demand distribution from one period to the next. The state of the resulting dynamic programming equation is the current demand distribution, which is generally infinite dimensional. We use unnormalized probabilities to convert the nonlinear state transition equation into a linear one, which helps in proving the existence of an optimal feedback ordering policy. To learn more about the demand, the optimal order is set to exceed the myopic optimal order. The optimal cost decreases as the demand distribution decreases in the hazard rate order. In a special case with finitely many demand values, we characterize a near-optimal solution by establishing that the value function is piecewise linear.
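The unnormalized-probability trick in this abstract is easy to illustrate. Below is a minimal sketch (the finite demand grid, the transition matrix P, and all names are assumptions for illustration, not the paper's code) of how censored demand observations update an unnormalized belief linearly.

```python
import numpy as np

def update_unnormalized(q, P, y, d_obs):
    """One-period update of the unnormalized belief q over demand values.

    q:     length-n unnormalized belief over current demand (values 0..n-1)
    P:     n-by-n Markov transition matrix for the demand process
    y:     inventory level; demand d is observed exactly only when d < y
    d_obs: the observed demand if d < y, otherwise None (only {d >= y} known)
    """
    d = np.arange(len(q))
    # Likelihood of the observation: a point mass when demand was seen,
    # an indicator of the censored event {d >= y} when it was not.
    lik = (d == d_obs).astype(float) if d_obs is not None else (d >= y).astype(float)
    # Multiply by the likelihood and propagate one step WITHOUT normalizing;
    # skipping the normalization is what keeps this map linear in q.
    return (q * lik) @ P
```

Dividing q by q.sum() at any period recovers the usual (nonlinear) Bayesian posterior over demand.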

80 citations

Posted Content
TL;DR: A generalized version of the Bellman equation is proposed to learn a single parametric representation for optimal policies over the space of all possible preferences in MORL, with the goal of enabling few-shot adaptation to new tasks.
Abstract: We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
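The generalized Bellman backup can be sketched concretely. The following is a hypothetical numpy illustration (shapes, the sampled preference grid, and all names are assumptions, not the authors' implementation): Q-values are vectors over the K objectives, indexed by state, action, and a grid of J sampled preference vectors, and the backup maximizes over both actions and sampled preferences before propagating the full vector value.

```python
import numpy as np

def envelope_target(Q, s_next, r_vec, w, gamma=0.99):
    """Generalized Bellman target for a query preference vector w.

    Q:      array [S, A, J, K] of vector Q-values at J sampled preferences
    s_next: index of the next state
    r_vec:  K-dim vector reward observed at this transition
    w:      K-dim linear preference used to scalarize returns
    """
    # Scalarize every (action, sampled preference) pair with the query
    # preference w, then maximize over both (an envelope over preferences).
    scal = np.einsum('k,ajk->aj', w, Q[s_next])
    a_star, j_star = np.unravel_index(np.argmax(scal), scal.shape)
    # Back up the maximizer's full K-dim vector value, not just its scalar,
    # so one parametric model can cover the whole preference space.
    return r_vec + gamma * Q[s_next, a_star, j_star]
```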

80 citations

Journal ArticleDOI
TL;DR: An antialiasing trajectory optimization method based on Bellman's principle of optimality is developed; it is extremely simple to implement, and optimal feedback controls are obtained without recourse to the complexities of Hamilton–Jacobi theory.
Abstract: [Low-Thrust, High-Accuracy Trajectory Optimization. I. Michael Ross, Qi Gong, and Pooya Sekhavat, Naval Postgraduate School, Monterey, California 93943. Journal of Guidance, Control, and Dynamics, Vol. 30, No. 4, July–August 2007. DOI: 10.2514/1.23181]

Multirevolution, very low-thrust trajectory optimization problems have long been considered difficult problems due to their large time scales and high-frequency responses. By relating this difficulty to the well-known problem of aliasing in information theory, an antialiasing trajectory optimization method is developed. The method is based on Bellman's principle of optimality and is extremely simple to implement. Appropriate technical conditions are derived for generating candidate optimal solutions to a high accuracy. The proposed method is capable of detecting suboptimality by way of three simple tests. These tests are used for verifying the optimality of a candidate solution without the need for computing costates or other covectors that are necessary in the Pontryagin framework. The tests are universal in the sense that they can be used in conjunction with any numerical method, whether or not antialiasing is sought. Several low-thrust example problems are solved to illustrate the proposed ideas. It is shown that the antialiased solutions are, in fact, closed-loop solutions; hence, optimal feedback controls are obtained without recourse to the complexities of the Hamilton–Jacobi theory. Because the proposed method is easy to implement, it can be coded on an onboard computer for practical space guidance.

I. Introduction. Continuous-thrust trajectory optimization problems have served as one of the motivating problems for optimal control theory since its inception [1–4]. The classic problem posed by Moyer and Pinkham [2] is widely discussed in textbooks [1,3,4] and research articles [5–7]. When the continuity of thrust is removed from such problems, the results can be quite dramatic, as illustrated in Fig. 1. This trajectory was obtained using recent advances in optimal control techniques and is extensively discussed in [8]. In canonical units, the problem illustrated in Fig. 1 corresponds to doubling the semimajor axis (a_0 = 1, a_f = 2), doubling the eccentricity (e_0 = 0.1, e_f = 0.2), and rotating the line of apsides by 1 rad. Note that the extremal thrust steering program for minimizing fuel is not tangential over a significant portion of the trajectory. Furthermore, the last burn is a singular control, as demonstrated in Fig. 2 by the vanishing of the switching function. Although such finite-thrust problems can be solved quite readily nowadays, it has long been recognized [9–11] that new problems emerge as the thrust authority is reduced. These well-known challenges chiefly arise from the long flight times measured in terms of the number of orbital revolutions. Consequently, such problems are distinguished from finite-thrust problems as low-thrust problems, although the boundary between finite thrust and low thrust is not altogether sharp.

Although ad hoc techniques may circumvent some of the low-thrust challenges, it is not quite clear whether the solutions generated by such methods are verifiably optimal. As detailed in [8], the engineering feasibility of a space mission is dictated not by trajectory generation but by optimality. This is because fuel in space is extraordinarily expensive: the cost of a propellant is driven by the routine of space operations, or the lack of it, and not by the chemical composition of the fuel. In an effort to circumvent ad hoc techniques and efficiently solve emerging problems in finite- and low-thrust trajectory optimization, NASA brought together leading experts in the field to exchange ideas over several workshops. These workshops, held over 2003–2006, further clarified the scope of the problems, and ongoing efforts to address them are described in [12]. From a practical point of view, the goal is to quickly obtain verifiably optimal or near-optimal solutions to finite- and low-thrust problems so that alternative mission concepts can be analyzed.

[Fig. 1: A benchmark minimum-fuel finite-thrust orbit transfer problem, showing the initial orbit, transfer trajectory, and final orbit.]
[Fig. 2: Extremal thrust acceleration (control) program t ↦ u and the corresponding switching function t ↦ s for the trajectory shown in Fig. 1; the final burn is a singular control, where s vanishes.]
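Bellman's principle also suggests why such suboptimality tests can avoid costates: an optimal trajectory must remain optimal when re-solved from any interior point. A minimal sketch of a check in that spirit follows; `solve_ocp`, `interp_state`, and `cost_to_go` are hypothetical stand-ins for a trajectory optimizer and accessors on a candidate solution, not the authors' code.

```python
def bellman_optimality_test(candidate, solve_ocp, t_mid, rel_tol=1e-6):
    """Flag suboptimality by comparing the candidate's tail cost from t_mid
    against an independently re-solved subproblem started on the candidate."""
    x_mid = candidate.interp_state(t_mid)     # state on the candidate at t_mid
    tail = solve_ocp(t0=t_mid, x0=x_mid)      # re-solve the tail subproblem
    gap = candidate.cost_to_go(t_mid) - tail.cost
    # By Bellman's principle the gap must be ~0 on an optimal trajectory;
    # a clearly positive gap exposes suboptimality without computing costates.
    return gap <= rel_tol * max(1.0, abs(tail.cost))
```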

80 citations

Proceedings Article
11 Jun 2011
TL;DR: A new heuristic-search-based family of algorithms, FRET (Find, Revise, Eliminate Traps), is presented, and a preliminary empirical evaluation shows that FRET solves GSSPs much more efficiently than Value Iteration.
Abstract: Research in efficient methods for solving infinite-horizon MDPs has so far concentrated primarily on discounted MDPs and the more general stochastic shortest path problems (SSPs). These are MDPs with 1) an optimal value function V* that is the unique solution of the Bellman equation and 2) optimal policies that are the greedy policies w.r.t. V*. This paper's main contribution is the description of a new class of MDPs that have well-defined optimal solutions yet comply with neither 1 nor 2 above. We call this new class Generalized Stochastic Shortest Path (GSSP) problems. GSSP allows a more general reward structure than SSP and subsumes several established MDP types, including SSP, positive-bounded, negative, and discounted-reward models. While existing efficient heuristic search algorithms like LAO* and LRTDP are not guaranteed to converge to the optimal value function for GSSPs, we present a new heuristic-search-based family of algorithms, FRET (Find, Revise, Eliminate Traps). A preliminary empirical evaluation shows that FRET solves GSSPs much more efficiently than Value Iteration.
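For contrast with FRET, here is the Value Iteration baseline mentioned in the evaluation, written for a standard SSP where the Bellman equation does have V* as its unique solution; the list-of-matrices encoding and all names are illustrative assumptions.

```python
import numpy as np

def ssp_value_iteration(P, c, goal, tol=1e-8, max_iters=100_000):
    """P[a]: n-by-n transition matrix for action a; c[a]: length-n expected
    cost vector for action a; goal: iterable of absorbing, cost-free states."""
    n = P[0].shape[0]
    V = np.zeros(n)
    for _ in range(max_iters):
        # Bellman backup: V(s) = min_a [ c(s, a) + sum_s' P(s' | s, a) V(s') ]
        Q = np.stack([c[a] + P[a] @ V for a in range(len(P))])
        V_new = Q.min(axis=0)
        V_new[list(goal)] = 0.0               # goal states keep zero cost-to-go
        if np.max(np.abs(V_new - V)) < tol:   # converged to the fixed point
            return V_new
        V = V_new
    return V
```

In a GSSP, the more general reward structure can break both the uniqueness of this fixed point and the optimality of greedy policies, which is exactly the gap FRET addresses.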

79 citations

Journal ArticleDOI
TL;DR: In this paper, the authors consider the stochastic optimal control problem of McKean-Vlasov stochastic differential equations, where the coefficients may depend upon the joint law of the state and control.
Abstract: We consider the stochastic optimal control problem of McKean-Vlasov stochastic differential equations, where the coefficients may depend upon the joint law of the state and control. By using feedback controls, we reformulate the problem into a deterministic control problem with only the marginal distribution of the process as the controlled state variable, and prove that the dynamic programming principle holds in its general form. Then, relying on the notion of differentiability with respect to probability measures recently introduced by P.L. Lions in [32], and a special Itô formula for flows of probability measures, we derive the (dynamic programming) Bellman equation for the mean-field stochastic control problem and prove a verification theorem in our McKean-Vlasov framework. We give explicit solutions to the Bellman equation for the linear-quadratic mean-field control problem, with applications to mean-variance portfolio selection and a systemic risk model. We also consider a notion of lifted viscosity solutions for the Bellman equation, and show the viscosity property and uniqueness of the value function of the McKean-Vlasov control problem. Finally, we consider the case of the McKean-Vlasov control problem with open-loop controls and discuss the associated dynamic programming equation, which we compare with the case of closed-loop controls.
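For orientation, a schematic form of the Bellman equation on the space of probability measures studied in this line of work is shown below; the notation (drift b, diffusion σ, running cost f, terminal cost g, and the Lions derivative ∂_μ v) follows common mean-field control conventions and is an assumption, not a quotation from the paper.

```latex
% Schematic dynamic programming (Bellman) equation for a McKean-Vlasov
% control problem with value function v(t, \mu) on [0,T] x P_2(R^d):
\partial_t v(t,\mu)
  + \inf_{\alpha}\int_{\mathbb{R}^d}\Big[
      f\big(x,\mu,\alpha(x)\big)
      + \partial_\mu v(t,\mu)(x)\cdot b\big(x,\mu,\alpha(x)\big)
      + \tfrac{1}{2}\,\operatorname{tr}\!\Big(
          \partial_x\partial_\mu v(t,\mu)(x)\,
          \sigma\sigma^{\top}\big(x,\mu,\alpha(x)\big)\Big)
    \Big]\,\mu(dx) = 0,
\qquad
v(T,\mu) = \int_{\mathbb{R}^d} g(x,\mu)\,\mu(dx).
```

In the linear-quadratic case referenced above, this equation admits explicit solutions, which is what makes the mean-variance and systemic-risk applications tractable.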

79 citations


Network Information
Related Topics (5)

Topic | Papers | Citations | Relatedness
Optimal control | 68K | 1.2M | 87% related
Bounded function | 77.2K | 1.3M | 85% related
Markov chain | 51.9K | 1.3M | 85% related
Linear system | 59.5K | 1.4M | 84% related
Optimization problem | 96.4K | 2.1M | 83% related
Performance
Metrics
No. of papers in the topic in previous years

Year | Papers
2023 | 261
2022 | 537
2021 | 369
2020 | 411
2019 | 348
2018 | 353