
Showing papers on "Markov decision process" published in 1986



Journal ArticleDOI
TL;DR: It is shown that, if updating is done in sufficiently small steps, the group will converge to the policy that maximizes the long-term expected reward per step.
Abstract: The principal contribution of this paper is a new result on the decentralized control of finite Markov chains with unknown transition probabilities and rewards. One decentralized decision maker is associated with each state in which two or more actions (decisions) are available. Each decision maker uses a simple learning scheme, requiring minimal information, to update its action choice. It is shown that, if updating is done in sufficiently small steps, the group will converge to the policy that maximizes the long-term expected reward per step. The analysis is based on learning in sequential stochastic games and on certain properties, derived in this paper, of ergodic Markov chains. A new result on convergence in identical payoff games with a unique equilibrium point is also presented.

102 citations
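The learning scheme referred to in the abstract above is of the stochastic learning automaton type. As a rough illustration (not the paper's exact scheme), the sketch below shows a linear reward-inaction update for a single state's decision maker; the step size `theta` and the assumption of a reward normalized to [0, 1] are mine.

```python
import numpy as np

def lri_update(probs, chosen, reward, theta=0.01):
    """Linear reward-inaction (L_R-I) update for one decision maker.

    probs  : current action-choice probabilities (must sum to 1)
    chosen : index of the action that was just used
    reward : normalized reward in [0, 1] observed for that action
    theta  : small step size; convergence arguments require it to be small
    """
    probs = np.asarray(probs, dtype=float)
    e = np.zeros_like(probs)
    e[chosen] = 1.0
    # Move probability mass toward the chosen action in proportion to the reward.
    return probs + theta * reward * (e - probs)

# Example: a two-action decision maker updates after observing reward 0.8.
p = lri_update([0.5, 0.5], chosen=0, reward=0.8)
```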


Journal ArticleDOI
TL;DR: In this paper, a Lagrange multiplier formulation involving a dynamic programming equation is utilized to relate the constrained optimization to an unconstrained optimization parametrized by the multiplier, leading to a proof for the existence of a semi-simple optimal constrained policy.
Abstract: Optimal causal policies maximizing the time-average reward over a semi-Markov decision process (SMDP), subject to a hard constraint on a time-average cost, are considered. Rewards and costs depend on the state and action, and contain running as well as switching components. It is supposed that the state space of the SMDP is finite, and the action space compact metric. The policy determines an action at each transition point of the SMDP. Under an accessibility hypothesis, several notions of time average are equivalent. A Lagrange multiplier formulation involving a dynamic programming equation is utilized to relate the constrained optimization to an unconstrained optimization parametrized by the multiplier. This approach leads to a proof for the existence of a semi-simple optimal constrained policy. That is, there is at most one state for which the action is randomized between two possibilities; at all other states, the action is uniquely chosen. Affine forms for the rewards, costs and transition probabilities further reduce the optimal constrained policy to 'almost bang-bang' form, in which the optimal policy is not randomized, and is bang-bang except perhaps at one state. Under the same assumptions, one can alternatively find an optimal constrained policy that is strictly bang-bang, but may be randomized at one state. Application is made to flow control of a birth-and-death process (e.g., an M/M/s queue); under certain monotonicity restrictions on the reward and cost structure the preceding results apply, and in addition there is a simple acceptance region.

84 citations
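The Lagrange multiplier idea in the abstract above can be illustrated, in the simpler setting of a finite discounted MDP rather than the paper's time-average semi-Markov setting, by solving the unconstrained problem with reward r − λc for a fixed multiplier λ and then searching over λ until the cost budget is met. All names, the array layout, and the bisection stopping rule below are assumptions of this sketch.

```python
import numpy as np

def solve_unconstrained(P, r, c, lam, beta=0.95, iters=2000):
    """Value iteration for reward r - lam * c.
    P[s, a, s'] transitions, r[s, a] rewards, c[s, a] costs."""
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        q = (r - lam * c) + beta * P @ v          # q[s, a]
        v = q.max(axis=1)
    return q.argmax(axis=1), v                    # greedy policy and its value

def discounted_cost(P, c, policy, beta=0.95):
    """Expected discounted cost of a stationary deterministic policy."""
    nS = c.shape[0]
    Ppi = P[np.arange(nS), policy]
    cpi = c[np.arange(nS), policy]
    return np.linalg.solve(np.eye(nS) - beta * Ppi, cpi)

def constrained_policy(P, r, c, budget, lo=0.0, hi=100.0, tol=1e-4, beta=0.95):
    """Bisect on the multiplier until the greedy policy's cost meets the budget
    (checked here at the worst starting state).  The randomization at a single
    state that the paper establishes is omitted in this sketch."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        policy, _ = solve_unconstrained(P, r, c, lam, beta)
        if discounted_cost(P, c, policy, beta).max() > budget:
            lo = lam        # constraint violated: penalize cost more heavily
        else:
            hi = lam        # feasible: try a smaller penalty
    return policy
```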


Journal ArticleDOI
TL;DR: In this article, a two-stage optimization framework, which consists of a real-time model followed by a steady state model, is proposed, which is optimized with the generalized policy iteration procedure.
Abstract: In recognition of hydrologic uncertainty and seasonality, reservoir inflows are described as periodic Markov processes. The optimization of reservoir operations involves determination of the optimal release volumes in the successive time periods so that the expected total rewards resulting from the operations are maximized. A two-stage optimization framework, which consists of a real-time model followed by a steady state model, is proposed. The steady state model that describes the convergent nature of the prospective future operations is regarded as a periodic Markov decision process and is optimized with the generalized policy iteration procedure. This result is in turn used as an interim step for deriving the optimal immediate decisions for the current period in the real-time model. Significant computational efficiency results from this framework and the respective optimization procedure.

75 citations


Proceedings ArticleDOI
01 Dec 1986
TL;DR: The (optimal) design of many engineering systems can be adequately recast as a Markov decision process, where requirements on system performance are captured in the form of constraints.
Abstract: The (optimal) design of many engineering systems can be adequately recast as a Markov decision process, where requirements on system performance are captured in the form of constraints. In this paper, various optimality results for constrained Markov decision processes are briefly reviewed; the corresponding implementation issues are discussed and shown to lead to several problems of parameter estimation. Simple situations where such constrained problems naturally arise, are presented in the context of queueing systems, in order to illustrate various points of the theory. In each case, the structure of the optimal policy is exhibited.

65 citations


Journal ArticleDOI
TL;DR: In this paper, a new form of the optimality equation is derived for the case in which every stationary policy gives rise to an ergodic Markov chain, and conditions are given under which an unbounded solution to the average cost optimality equation exists and yields an optimal stationary policy.

50 citations


Journal ArticleDOI
TL;DR: It is shown that when the state space is finite the computation of the dynamic allocation indices can be handled by linear programming methods.
Abstract: We consider the multi-armed bandit problem. We show that when the state space is finite the computation of the dynamic allocation indices can be handled by linear programming methods.

48 citations
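One concrete way to obtain a dynamic allocation (Gittins) index from a linear program is the restart-in-state formulation sketched below: the index of state i equals (1 − β) times the value of state i in an auxiliary MDP in which every state offers a choice between continuing and restarting in i. This is offered as an illustration of the linear-programming route, not necessarily the formulation used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def gittins_index(P, r, i, beta=0.9):
    """Gittins (dynamic allocation) index of state i for one bandit arm,
    computed from the LP for the 'restart-in-i' MDP: in every state one may
    either continue (reward r[j], transitions P[j]) or restart in state i."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(r)
    I = np.eye(n)
    # The value function is the smallest V satisfying, for every state j,
    #   V[j] >= r[j] + beta * P[j] @ V      (continue)
    #   V[j] >= r[i] + beta * P[i] @ V      (restart in i)
    # so minimizing sum_j V[j] over these constraints recovers it.
    A_continue = beta * P - I
    b_continue = -r
    A_restart = np.tile(beta * P[i], (n, 1)) - I
    b_restart = -np.full(n, r[i])
    res = linprog(c=np.ones(n),
                  A_ub=np.vstack([A_continue, A_restart]),
                  b_ub=np.concatenate([b_continue, b_restart]),
                  bounds=[(None, None)] * n,
                  method="highs")
    return (1.0 - beta) * res.x[i]
```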


Journal ArticleDOI
TL;DR: Convergence theorems that, when applied to the case of bounded rewards, give stronger results than those in [9] are proved and bounds on the rates of convergence under several assumptions are given.
Abstract: A finite-state iterative scheme introduced by White [9] to approximate the optimal value function of denumerable-state Markov decision processes with bounded rewards is extended to the case of unbounded rewards. Convergence theorems that, when applied to the case of bounded rewards, give stronger results than those in [9] are proved. Moreover, bounds on the rates of convergence under several assumptions are given and the extended scheme is used to obtain policies with asymptotic optimality properties.

37 citations
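A deliberately naive version of such a finite-state approximation is sketched below: truncate the denumerable state space at N, redirect overflow probability mass to the boundary state, and run value iteration. This is a stand-in for the general idea only; White's scheme and the paper's extension to unbounded rewards involve more careful constructions, and the p_fn/r_fn interfaces are assumptions.

```python
import numpy as np

def truncate(p_fn, r_fn, n_actions, N):
    """Build a finite (N+1)-state approximation of a denumerable-state MDP by
    redirecting all transition probability beyond state N to state N itself.
    p_fn(s, a) -> {next_state: prob} and r_fn(s, a) -> float are assumed
    interfaces describing the original denumerable model."""
    P = np.zeros((N + 1, n_actions, N + 1))
    R = np.zeros((N + 1, n_actions))
    for s in range(N + 1):
        for a in range(n_actions):
            R[s, a] = r_fn(s, a)
            for t, prob in p_fn(s, a).items():
                P[s, a, min(t, N)] += prob
    return P, R

def value_iteration(P, R, beta=0.9, iters=1000):
    """Standard value iteration on the truncated model."""
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        v = (R + beta * P @ v).max(axis=1)
    return v

# Increasing N and comparing the resulting value functions on a fixed set of
# low states gives a practical check on the quality of the approximation.
```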


Journal ArticleDOI
TL;DR: This work examines a finite state, finite action dynamic program having a one-step transition value-function that is affine in an imprecisely known parameter and presents conditions that guarantee the existence of a parameter-independent strategy that maximizes the minimum value of its expected reward function over all possible parameter values.
Abstract: In order to model parameter imprecision associated with a problem's reward or preference structure, we examine a finite state, finite action dynamic program having a one-step transition value-function that is affine in an imprecisely known parameter. For the finite horizon case, we also assume that the terminal value function is affine in the imprecise parameter. We assume that the parameter of interest has no dynamics, no new information about its value is received once the decision process begins, and its imprecision is described by set inclusion. We seek the set of all parameter-independent strategies that are optimal for some value of the imprecisely known parameter. We present a successive approximations procedure for solving the finite horizon case and a policy iteration procedure for determining the solution of the discounted infinite horizon case. These algorithms are then applied to a decision analysis problem with imprecise utility function and to a Markov decision process with imprecise reward structure. We also present conditions that guarantee the existence of a parameter-independent strategy that maximizes, with respect to all other parameter invariant strategies, the minimum value of its expected reward function over all possible parameter values.

35 citations


Journal ArticleDOI
TL;DR: This paper presents a simple successive approximation approach to the characterization of optimal policies for finite horizon, semi-Markov decision processes by analyzing the optimal liquidation of an asset and shows that several aspects of the standard, discrete-time, infinite horizon optimal policy carry over to the continuous- time, finite horizon policy.
Abstract: This paper presents a simple successive approximation approach to the characterization of optimal policies for finite horizon, semi-Markov decision processes. Optimal policies are nonstationary, for in this setting they depend on both time and state. We illustrate this approach by analyzing the optimal liquidation of an asset; we also show that several aspects of the standard, discrete-time, infinite horizon optimal policy carry over to the continuous-time, finite horizon policy.

34 citations


Journal ArticleDOI
TL;DR: An efficient algorithm is developed which suggests a method for approximating g* and an associated average-return optimal policy and can be applied to some special cases, such as control of arrivals to a queue, control of the service rate, and controlled random walks.
Abstract: We consider a Markovian decision process with countable state space (states 0, 1, 2, ...) which is skip-free to the right (a transition from i to j is impossible if j > i + 1). In this type of system it is easy to calculate by forward recursion the maximal total expected reward going from state 0 to state i; the same can be done, of course, for the case where a constant g is subtracted from the one-period reward function (the g-revised reward). Let w_g(i) be the maximal total expected g-revised reward going from state 0 to state i. We show that w_g(.) satisfies the average-reward optimality equation. If w_g(.) satisfies a growth condition, then g = g*, the maximal average reward. For all other g, the function w_g increases or decreases so fast that this cannot be the case. Thus, in principle the solution w_g can be used to check whether g lies above or below g*, which suggests a method for approximating g* and an associated average-return optimal policy. We develop an efficient algorithm based on this idea. In a companion paper we shall show how the algorithm, or modifications of it, can be applied to some special cases, such as control of arrivals to a queue, control of the service rate, and controlled random walks.
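For a fixed policy (so transition probabilities P and one-period rewards r are given), the skip-free structure makes the forward recursion explicit: the expected g-revised reward h(i) collected in passing from state i to i + 1 satisfies a one-dimensional equation, and w_g(i) is the sum of h(0), ..., h(i − 1). The sketch below implements that fixed-policy recursion; the paper's algorithm additionally maximizes over actions at each state, and the names here are mine.

```python
import numpy as np

def forward_recursion(P, r, g, n):
    """Fixed-policy forward recursion for a skip-free-to-the-right chain.

    P[i, j] : transition probabilities with P[i, j] = 0 for j > i + 1
              (P[i, i + 1] > 0 is assumed for i < n)
    r[i]    : one-period reward in state i
    g       : constant subtracted from the reward (the g-revised reward)
    n       : compute w_g(0), ..., w_g(n)

    Returns w with w[i] = expected total g-revised reward from state 0 until
    the first visit to state i (uncontrolled-chain version of the recursion).
    """
    h = np.zeros(n)               # h[i]: expected g-revised reward from i to i+1
    for i in range(n):
        # From i the chain may fall back to j <= i; by skip-freeness it must
        # then pass through j+1, ..., i again before it can reach i+1.
        backlog = sum(P[i, j] * h[j:i].sum() for j in range(i))
        h[i] = (r[i] - g + backlog) / P[i, i + 1]
    return np.concatenate(([0.0], np.cumsum(h)))

# The growth behaviour of w as i increases is what the paper exploits to decide
# whether a candidate g lies above or below the maximal average reward g*.
```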

Journal ArticleDOI
Masami Kurano
TL;DR: A p-step contraction property for the average cost case is introduced and, by use of this property, the validity of the optimality equation and the existence of ε-optimal stationary policies are proved.
Abstract: We consider a Markov decision process with a Borel measurable cost function. We introduce a p-step contraction property for the average cost case. By use of this method, the validity of the optimality equation and the existence of ε-optimal stationary policies are proved. As some applications, the sequential replacement model and the inventory model are considered.

Book ChapterDOI
TL;DR: Various algorithms for the numerical solution of discounted stochastic games are discussed, and a new mathematical programming formulation is presented which permits the numerical solution of a game using a nonlinear programming code.

Book
01 Jan 1986
TL;DR: This book discusses linear programming, game theory, and decision making in the context of management science with a focus on dynamic programming.
Abstract: Introduction to management science. Mathematical review. Breakeven analysis. Forecasting. Introduction to linear programming. Linear programming. Model formulations. LP simplex method. Sensitivity analysis and duality. PERT/CPM. Transportation and assignment models. Other network models. Goal programming. Integer programming. Inventory models. Probability review. Decision making. Decision models. Markov processes. Game theory. Queuing analysis: waiting-line problems. Simulation. Dynamic programming. Calculus review. Nonlinear models. Implementation.


Journal ArticleDOI
TL;DR: In this paper, Hartley et al. extended the finite-state iterative scheme introduced by White for approximating the value function of denumerable-state Markov decision processes to the case of a denumerable multidimensional state space.

Journal ArticleDOI
TL;DR: This paper demonstrates how a Markov decision process (MDP) can be approximated to generate a policy bound, i.e., a function that bounds the optimal policy from below or from above for all states.
Abstract: This paper demonstrates how a Markov decision process (MDP) can be approximated to generate a policy bound, i.e., a function that bounds the optimal policy from below or from above for all states. We present sufficient conditions for several computationally attractive approximations to generate rigorous policy bounds. These approximations include approximating the optimal value function, replacing the original MDP with a separable approximate MDP, and approximating a stochastic MDP with its deterministic counterpart. An example from the field of fisheries management demonstrates the practical applicability of the results.

Journal ArticleDOI
TL;DR: In this article, a new derivation of the linear program corresponding to a Markov Decision Process (MDP) in steady state, which seeks to minimize discounted total expected cost, is given.
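For reference, the classical linear program whose optimal solution is the minimal discounted-cost value function has the form below, where the weights α(s) > 0 are arbitrary positive constants; this is the standard formulation, not necessarily the new derivation given in the paper.

```latex
\begin{aligned}
\max_{v}\quad & \sum_{s} \alpha(s)\, v(s) \\
\text{s.t.}\quad & v(s) \;\le\; c(s,a) + \beta \sum_{s'} p(s' \mid s,a)\, v(s')
\qquad \text{for all } s,\ a .
\end{aligned}
```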

Journal ArticleDOI
Zvi Rosberg, I. Gopal
TL;DR: The optimal control of hop-by-hop flow control in a computer network is shown to be a linear truncated function of the state and the explicit form is found when the arrival process of the messages is a Bernoulli process.
Abstract: The problem of hop-by-hop flow control in a computer network is formulated as a Markov decision process with a cost function composed of the delay of the messages and the buffer constraints. The optimal control is shown to be a linear truncated function of the state and the explicit form is found when the arrival process of the messages is a Bernoulli process. For a renewal arrival process, the long-run average cost of any policy with a linear truncated structure is expressed by a set of linear equations.
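A "linear truncated" control is simply a linear function of the state clipped to an admissible range; the tiny sketch below (with hypothetical parameter names) shows the shape of such a policy.

```python
def window(queue_length, threshold, max_window):
    """Linear truncated control: admit (threshold - queue_length) new messages
    into the hop, clipped to the range [0, max_window].  'threshold' and
    'max_window' are hypothetical parameters standing in for the constants the
    paper derives from the delay cost and the buffer constraints."""
    return max(0, min(threshold - queue_length, max_window))
```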

Journal Article
TL;DR: In this article, the relation between efficient policies in a multiobjective Markov decision process and efficient points in a related multiobjective linear program is clarified.
Abstract: We seek to clarify the relation between efficient policies in a multiobjective Markov decision process and the efficient points in a related multiobjective linear program.

Book ChapterDOI
01 Jan 1986
TL;DR: In Markov decision theory, a distinction is drawn between discrete-time Markov decision processes, semi-Markov decision processes, and continuous-time Markov decision processes.
Abstract: In Markov decision theory we distinguish (a) discrete-time Markov decision processes (b) semi-Markov decision processes (c) continuous time Markov decision processes.


Journal ArticleDOI
TL;DR: In this article, the development of a planning and decision support system (PDSS) to determine the number of communications satellites to purchase and the timing of these purchases for INTELSAT is described; the task was complicated by the length of time necessary to manufacture a spacecraft, the potential for spacecraft failure, uncertain future costs and capacity requirements, multiple, conflicting, and noncommensurate objectives, and various exogenous factors.
Abstract: Developing a planning and decision support system (PDSS) to determine the number of spacecraft (communications satellites) to purchase and the timing of these purchases for INTELSAT was complicated by the length of time necessary to manufacture a spacecraft, the potential for spacecraft failure, uncertain future costs and capacity requirements, multiple, conflicting, and noncommensurate objectives, and various exogenous factors. The PDSS uses the simulation of a large Markov decision process (MDP) to evaluate purchase strategies generated by (1) experts, (2) heuristic procedures, and (3) the solution of an aggregated version of the MDP, thus integrating knowledge engineering and formal reasoning approaches to decision aiding and problem solving.

Journal ArticleDOI
Cheng Kan
TL;DR: This paper gives a brief description of recent O.R. activity in China in four parts: mathematical programming; queueing theory and Markov decision processes; reliability theory; simulation.
Abstract: This paper gives a brief description of recent O.R. activity in China. It consists of four parts: mathematical programming; queueing theory and Markov decision processes; reliability theory; simulation. Emphasis is placed on the current situation of practical O.R.

Journal ArticleDOI
TL;DR: In this article, the existence of a solution to the optimality equation for discounted finite Markov decision processes is established by means of Birkhoff's fixed point theorem, and the proof yields the well-known linear programming formulation for the optimal value function.

Journal ArticleDOI
TL;DR: A computational comparison of the policy iteration algorithms for solving discounted Markov decision processes is described, examining the different forms of iterations, reordering, extrapolation and action elimination.
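As a baseline for the variants the comparison covers (different orderings, extrapolation, action elimination), plain policy iteration for a discounted MDP looks like the sketch below; the array layout and names are assumptions.

```python
import numpy as np

def policy_iteration(P, r, beta=0.95):
    """Howard's policy iteration for a discounted MDP.

    P[s, a, s'] : transition probabilities,  r[s, a] : one-step rewards.
    Exact policy evaluation by solving a linear system, then greedy
    improvement; the paper compares refinements of this basic loop."""
    nS, nA, _ = P.shape
    policy = np.zeros(nS, dtype=int)
    while True:
        # Policy evaluation: v = (I - beta * P_pi)^{-1} r_pi
        Ppi = P[np.arange(nS), policy]
        rpi = r[np.arange(nS), policy]
        v = np.linalg.solve(np.eye(nS) - beta * Ppi, rpi)
        # Policy improvement: act greedily with respect to v
        q = r + beta * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```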

Journal ArticleDOI
TL;DR: In this paper, a structural property for policies, the likelihood consistency property, was introduced for partially observed Markov decision problems, where the decision maker must formulate a policy of response to an unobservable transition to an undesirable state.

Journal ArticleDOI
TL;DR: In this paper, the authors derived bounds and variational characterizations for the solutions of variational Markov decision processes, and used them to measure the deviation of the current solution from optimality.