
Showing papers on "Markov decision process published in 1977"


Journal ArticleDOI
01 Oct 1977

1,016 citations


Journal ArticleDOI
TL;DR: The rate at which Markov decision processes converge as the horizon length increases can be important for computations and for judging the appropriateness of models.
Abstract: The rate at which Markov decision processes converge as the horizon length increases can be important for computations and judging the appropriateness of models. The convergence rate is commonly as...

51 citations


Journal ArticleDOI
TL;DR: A countable stage, countable state, finite action decision problem is considered where the objective is the maximization of the expectation of an arbitrary utility function defined on the sequence of states.
Abstract: A countable stage, countable state, finite action decision problem is considered where the objective is the maximization of the expectation of an arbitrary utility function defined on the sequence of states. Basic concepts are formulated, generalizing the standard notions of the optimality equations, conserving and unimprovable strategies, and strategy and value iteration. Analogues of positive, negative and convergent dynamic programming are analyzed.

45 citations
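
For orientation, in the classical special case where the utility of the state sequence is the expected discounted sum of rewards, the optimality equations that this framework generalizes take the familiar fixed-point form (the notation below is mine, not the paper's):

```latex
% Classical discounted, additive-reward special case of the optimality equations.
\[
  v^{*}(s) \;=\; \max_{a \in A(s)} \Bigl[ r(s,a) + \beta \sum_{s'} p(s' \mid s,a)\, v^{*}(s') \Bigr],
  \qquad s \in S .
\]
% A conserving strategy is one that attains the maximum on the right-hand side at every state.
```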


Journal ArticleDOI
TL;DR: Necessary and sufficient conditions are obtained which guarantee that the maximal total expected reward for a planning horizon of n epochs, minus n times the long-run average expected reward, has a finite limit as n → ∞ for each initial state and each final reward vector.
Abstract: This paper considers undiscounted Markov Decision Problems. For the general multichain case, we obtain necessary and sufficient conditions which guarantee that the maximal total expected reward for a planning horizon of n epochs minus n times the long run average expected reward has a finite limit as n → ∞ for each initial state and each final reward vector. In addition, we obtain a characterization of the chain and periodicity structure of the set of one-step and J-step maximal gain policies. Finally, we discuss the asymptotic properties of the undiscounted value-iteration method.

43 citations
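
Written out, with v_n(i) the maximal total expected reward over a planning horizon of n epochs from state i and g(i) the long-run average expected reward (my notation), the property characterized by the paper's conditions is the existence of the finite limit

```latex
\[
  \lim_{n \to \infty} \bigl( v_{n}(i) - n\, g(i) \bigr) = w(i) \in \mathbb{R}
  \qquad \text{for every initial state } i \text{ and every final reward vector } v_{0}.
\]
```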


Dissertation
01 Jan 1977
TL;DR: This dissertation introduces concepts and associated computational procedures that are applicable to a mathematical problem arising in the context of Operations Research and Stochastic Control to design a strategy for real-time decision-making on the basis of imperfect (state) information and finite memory.
Abstract: This dissertation introduces concepts and associated computational procedures that are applicable to a mathematical problem arising in the context of Operations Research and Stochastic Control. Briefly stated, the problem is to design a strategy for real-time decision-making on the basis of imperfect (state) information and finite memory. The plant (i.e. the object to be controlled) is modelled as a finite probabilistic system (FPS) or stationary discrete-time finite-input finite-output finite-state controlled stochastic process, a generalization of the partially-observed Markov decision model initiated by Drake (1962), which itself generalizes the Markov decision model of Bellman (1957a).

38 citations
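
The plant here is partially observed, so real-time decisions must be based on an information state rather than the true state. As a point of reference only, the sketch below shows the standard Bayesian belief (information-state) update used in partially observed Markov decision models; the dissertation's finite-memory strategies are not reproduced, and all names are illustrative.

```python
import numpy as np

def belief_update(belief, action, observation, P, O):
    """Standard Bayes update of the information state in a partially
    observed model (illustrative; not the dissertation's construction).

    belief : length-n array, current probability of each hidden state.
    P[a]   : n x n transition matrix under action a.
    O[a]   : n x m matrix, O[a][s', o] = probability of observing o
             after moving to hidden state s' under action a.
    """
    predicted = belief @ P[action]                        # predicted next-state distribution
    unnormalized = predicted * O[action][:, observation]  # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total

# Tiny example: 2 hidden states, 1 action, 2 observations.
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}
b = belief_update(np.array([0.5, 0.5]), action=0, observation=1, P=P, O=O)
print(b)  # posterior probability of each hidden state
```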


Journal ArticleDOI
TL;DR: A number of successive approximation algorithms are presented for the repeated two-person zero-sum game called the Markov game, under the criterion of total expected discounted rewards; stopping times are introduced in order to simplify the proofs.
Abstract: This paper presents a number of successive approximation algorithms for the repeated two-person zero-sum game called the Markov game, using the criterion of total expected discounted rewards. As Wessels [1977] did for Markov decision processes, stopping times are introduced in order to simplify the proofs. It is shown that each algorithm provides upper and lower bounds for the value of the game and nearly optimal stationary strategies for both players.

26 citations
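
The successive-approximation step underlying such algorithms is the Shapley-type operator: at each state, form the one-stage matrix game whose entries add the immediate payoff to the discounted continuation values, then replace the state's value by that matrix game's value. The sketch below is a generic illustration of this step only (each stage game solved by linear programming); the paper's stopping-time constructions, upper and lower bounds, and extraction of nearly optimal stationary strategies are not reproduced, and all names are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value of the zero-sum matrix game with payoff matrix A (row player
    maximizes), via the standard LP: maximize v s.t. x'A >= v, sum(x) = 1."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # variables: x_1..x_m, v; minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - sum_i x_i A[i, j] <= 0 for each column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1]

def markov_game_value(r, p, beta, iters=100):
    """r[s] : payoff matrix of the stage game in state s (row player maximizes).
    p[s][a1][a2] : next-state distribution for state s and action pair (a1, a2).
    Returns an approximation of the value vector of the discounted Markov game."""
    S = len(r)
    v = np.zeros(S)
    for _ in range(iters):
        v_new = np.empty(S)
        for s in range(S):
            m, n = r[s].shape
            Q = np.array([[r[s][a1, a2] + beta * p[s][a1][a2] @ v
                           for a2 in range(n)] for a1 in range(m)])
            v_new[s] = matrix_game_value(Q)
        v = v_new
    return v
```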


Journal ArticleDOI
TL;DR: In the policy-iteration algorithm resulting from this approach, the number of equations to be solved in any iteration step can be substantially reduced, and by its flexibility the algorithm allows us to exploit any structure of the particular problem to be solved.
Abstract: This paper provides a new approach for solving a wide class of Markov decision problems including problems in which the space is general and the system can be continuously controlled. The optimality criterion is the long-run average cost per unit time. We decompose the decision processes into a common underlying stochastic process and a sequence of interventions so that the decision processes can be embedded upon a reduced set of states. Consequently, in the policy-iteration algorithm resulting from this approach the number of equations to be solved in any iteration step can be substantially reduced. Further, by its flexibility, this algorithm allows us to exploit any structure of the particular problem to be solved.

25 citations



01 Jan 1977
TL;DR: In this article, the present state of the art of value-iteration and related successive approximation methods, as well as resulting turnpike properties, in both the discounted and undiscounted version of finite state and action Markov Decision Problems, are surveyed.
Abstract: A survey is given of the present state of the art of value-iteration and related successive approximation methods, as well as of resulting turnpike properties, in both the discounted and undiscounted version of finite state and action Markov Decision Problems.

18 citations
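
As a concrete companion to the survey's subject matter, here is a minimal value-iteration sketch for the discounted finite state-and-action case, in standard textbook form rather than any particular variant treated in the survey; the stopping rule is the usual contraction-based one.

```python
import numpy as np

def value_iteration(P, r, beta, tol=1e-8):
    """P[a] : S x S transition matrix for action a; r[a] : length-S rewards.
    Returns approximately optimal values and a greedy stationary policy."""
    actions = list(P.keys())
    S = len(r[actions[0]])
    v = np.zeros(S)
    while True:
        # One-step lookahead for every action, then take the maximum.
        Q = np.array([r[a] + beta * P[a] @ v for a in actions])
        v_new = Q.max(axis=0)
        # Stop when the greedy policy is guaranteed to be within tol of optimal.
        if np.max(np.abs(v_new - v)) < tol * (1 - beta) / (2 * beta):
            return v_new, [actions[i] for i in Q.argmax(axis=0)]
        v = v_new

# Two-state, two-action toy example.
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.1, 0.9]])}
r = {0: np.array([1.0, 0.0]), 1: np.array([0.5, 2.0])}
v_star, policy = value_iteration(P, r, beta=0.9)
print(v_star, policy)
```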


Journal ArticleDOI
TL;DR: Conditions are given for the existence of an average optimal strategy; to prove the continuity of the average costs as a function on the space of strategies, perturbation results for quasi-compact linear operators are used.
Abstract: In this paper stationary Markov decision problems are considered with arbitrary state space and compact space of strategies. Conditions are given for the existence of an average optimal strategy. This is done by using the fact that a continuous function on a compact space attains its minimum. To prove the continuity of the average costs as a function on the space of strategies, some perturbation results for quasi-compact linear operators are used. In a first set of conditions the boundedness of the one-period cost functions and the quasi-compactness of the Markov processes are assumed. In more general conditions, the boundedness of the cost functions is replaced by the boundedness, on a subset A of the state space, of the recurrence time and costs until A, and the quasi-compactness of the Markov processes is replaced by the quasi-compactness of the embedded Markov processes on A.

18 citations


01 Jan 1977
TL;DR: Finite state Markov decision processes with finite decision spaces for each state are considered; the optimality criterion is total expected discounted reward over an infinite time horizon.
Abstract: In this paper we consider finite state Markov decision processes with finite decision spaces for each state. The optimality criterion will be total expected discounted reward over an infinite time horizon. For these problems a great variety of optimization procedures has been developed. We divide them into two classes: policy improvement procedures and policy improvement-value determination procedures.
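
As a sketch of the second class in its simplest setting, here is a generic Howard-type policy improvement-value determination (policy iteration) routine for the discounted finite case; it is an illustration only, not any specific procedure from the paper, and the names are assumptions.

```python
import numpy as np

def policy_iteration(P, r, beta):
    """P[a] : S x S transition matrix for action a; r[a] : length-S rewards.
    Alternates exact value determination with greedy policy improvement."""
    actions = list(P.keys())
    S = len(r[actions[0]])
    policy = [actions[0]] * S
    while True:
        # Value determination: solve (I - beta * P_pi) v = r_pi for the current policy.
        P_pi = np.array([P[policy[s]][s] for s in range(S)])
        r_pi = np.array([r[policy[s]][s] for s in range(S)])
        v = np.linalg.solve(np.eye(S) - beta * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        Q = np.array([r[a] + beta * P[a] @ v for a in actions])
        new_policy = [actions[i] for i in Q.argmax(axis=0)]
        if new_policy == policy:
            return v, policy
        policy = new_policy
```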

Journal ArticleDOI
TL;DR: In this article, the equivalence of the sensitive optimality criteria as introduced by Veinott is shown, and the Laurent expansion of the total discounted expected return for the various policies is derived.
Abstract: Discrete time Markov decision processes with a countable state space are investigated. Under a condition of Liapunov function type the Laurent expansion of the total discounted expected return for the various policies is derived. Moreover, the equivalence of the sensitive optimality criteria as introduced by Veinott is shown.
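
For reference, the Laurent expansion in question is the classical one (written here in my notation, with interest rate ρ): for a fixed policy π and discount factor β close to one,

```latex
\[
  v_{\beta}^{\pi} \;=\; \sum_{n=-1}^{\infty} \rho^{\,n}\, y_{n}^{\pi},
  \qquad \rho = \frac{1-\beta}{\beta},
\]
% where y_{-1}^{\pi} is the gain (long-run average reward) and y_{0}^{\pi} the bias of \pi;
% the sensitive optimality criteria compare policies lexicographically through these terms.
```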

Journal ArticleDOI
TL;DR: The standard models of positive and negative dynamic programming are given in this context; for positive problems, unimprovable strategies are optimal and the optimal value sequence is the least solution of the optimality equations exceeding an obvious lower bound.
Abstract: The analysis of structured countable stage decision processes, initiated in Porteus [11], is continued. The standard models of positive and negative dynamic programming are given in this context, thus extending these results to criteria other than the usual expected sum of rewards, such as expected utility criteria, certain stochastic games, risk sensitive Markov decision processes, and maximin criteria. For positive problems, (what are called) unimprovable strategies are optimal and the optimal value sequence is the least solution of the optimality equations exceeding an obvious lower bound. For negative problems, conserving strategies are optimal, and if one strategy is a one-step improvement on another, then it nets a greater value. (This rules out cycling in the strategy iteration procedure.) Also, transfinite methods are used to prove that the optimal value sequence is the greatest solution of the optimality equations less than an obvious upper bound. We indicate how all these results can be extended ...

Journal ArticleDOI
TL;DR: In the discounted version of this finite-state Markov decision problem, it is shown that the optimal value is unique and the optimal strategy is pure and stationary; however, they are dependent on the starting state.
Abstract: A finite-state Markov decision process, in which, associated with each action in each state, there are two rewards, is considered. The objective is to optimize the ratio of the two rewards over an infinite horizon. In the discounted version of this decision problem, it is shown that the optimal value is unique and the optimal strategy is pure and stationary; however, they are dependent on the starting state. Also, a finite algorithm for computing the solution is given.
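
For a fixed pure stationary strategy in the discounted version, the ratio being optimized can be written down directly from the two discounted value vectors, as in the short sketch below (an illustration only; the paper's finite algorithm for optimizing over strategies is not reproduced). The dependence on the starting state noted in the abstract shows up in the final line.

```python
import numpy as np

def discounted_reward_ratio(P_pi, r1_pi, r2_pi, beta, start):
    """Ratio of the two total expected discounted rewards of a fixed
    stationary strategy, as a function of the starting state.

    P_pi : S x S transition matrix induced by the strategy.
    r1_pi, r2_pi : length-S one-step rewards (numerator / denominator).
    """
    S = P_pi.shape[0]
    resolvent = np.linalg.inv(np.eye(S) - beta * P_pi)
    v1 = resolvent @ r1_pi   # discounted value of the first reward
    v2 = resolvent @ r2_pi   # discounted value of the second reward
    return v1[start] / v2[start]
```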


Book ChapterDOI
01 Jan 1977
TL;DR: Finite state Markov decision processes with finite decision spaces for each state are considered; the optimality criterion is total expected discounted reward over an infinite time horizon.
Abstract: In this paper we consider finite state Markov decision processes with finite decision spaces for each state. The optimality criterion will be total expected discounted reward over an infinite time horizon. For these problems a great variety of optimization procedures has been developed. We divide them into two classes: policy improvement procedures and policy improvement-value determination procedures.

Journal ArticleDOI
TL;DR: A fresh perspective on the Markov reward process is presented, and a special case, the mean-variability model's decision rule of maximizing μ/σ, is worked out in detail.
Abstract: This paper presents a fresh perspective on the Markov reward process. In order to bring Howard's [Howard, R. A. 1969. Dynamic Programming and Markov Processes. The M.I.T. Press, 5th printing.] model closer to practical applicability, two very important aspects of the model are restated: (a) we make the rewards random variables instead of known constants, and (b) we allow for any decision rule over the moment set of the portfolio distribution, rather than assuming maximization of the expected value of the portfolio outcome. These modifications provide a natural setting for the rewards to be normally distributed, and thus applying mean-variance models becomes possible. An algorithm for solution is presented, and a special case, the mean-variability model's decision rule of maximizing μ/σ, is worked out in detail.
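
As a deliberately simplified illustration of the μ/σ decision rule (not the paper's algorithm, which works with the moments of the full portfolio distribution), one can score a stationary policy by the mean and standard deviation of its per-step random reward under the stationary distribution; all names below are illustrative.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic transition matrix P."""
    S = P.shape[0]
    A = np.vstack([P.T - np.eye(S), np.ones((1, S))])
    b = np.zeros(S + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def mu_over_sigma(P, reward_mean, reward_var):
    """Score a stationary policy by mu/sigma of its per-step reward at
    stationarity (a crude stand-in for the paper's portfolio criterion)."""
    pi = stationary_distribution(P)
    mu = pi @ reward_mean
    second_moment = pi @ (reward_var + reward_mean ** 2)
    sigma = np.sqrt(second_moment - mu ** 2)
    return mu / sigma
```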

Journal ArticleDOI
TL;DR: Finite state and action, discrete time normalized Markov decision chains, i.e., chains with transition matrices that are nonnegative with spectral radius not exceeding one, are considered; it is shown that the reward gained in period N is bounded by a polynomial, uniformly over the set of all policies.
Abstract: In this paper we consider finite state and action, discrete time parameter normalized Markov decision chains, i.e., Markov decision processes with transition matrices that are nonnegative with spectral radius not exceeding one (but not necessarily substochastic). We show that the reward gained in period N is bounded by a polynomial, uniformly over the set of all policies. The degree of this polynomial can be obtained by considering only the set of stationary policies. Extending and improving results of Sladky (1974) for the stochastic case, we obtain necessary and sufficient conditions for n-discount optimality of arbitrary (not necessarily stationary) policies.


Journal ArticleDOI
01 Oct 1977
TL;DR: In this paper, a general discrete decision process is formulated which includes both undiscounted and discounted semi-Markovian decision processes as special cases, and a policy-iteration algorithm is presented and shown to converge to an optimal policy.
Abstract: A general discrete decision process is formulated which includes both undiscounted and discounted semi-Markovian decision processes as special cases. A policy-iteration algorithm is presented and shown to converge to an optimal policy. Properties of the coupled functional equations are derived. Primal and dual linear programming formulations of the optimization problem are also given. An application is given to a Markov ratio decision process.
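
For orientation only (the paper's coupled functional equations cover the more general undiscounted and semi-Markovian cases), the primal linear program for the ordinary discounted Markov decision special case has the familiar form

```latex
\[
\begin{aligned}
  \text{minimize}   \quad & \sum_{s} v(s) \\
  \text{subject to} \quad & v(s) \;\ge\; r(s,a) + \beta \sum_{s'} p(s' \mid s,a)\, v(s')
  \quad \text{for all } s, a ,
\end{aligned}
\]
% whose dual variables can be read as discounted state-action frequencies and
% recover an optimal stationary policy.
```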

01 Jan 1977
TL;DR: The main result in this paper is the characterization of certain strong kinds of equilibrium points in Markov games with a countable set of players and uncountable decision sets.
Abstract: The main result in this paper is the characterization of certain strong kinds of equilibrium points in Markov games with a countable set of players and uncountable decision sets. Two person Markov games are studied beforehand, since this paper gives an extension of the existing theory for two person zero sum Markov games; finally we consider the special cases of N-person Markov games and Markov decision processes.


Journal ArticleDOI
TL;DR: Tarski's Principle for real closed fields is applied to a field of asymptotic expansions, which the authors term the field of real Puiseux series.
Abstract: The authors study two person, zero sum, stochastic games with zero stop probabilities. Two distinct formulations are emphasized: (1) the infinite stage game with payoffs discounted at an interest rate close to zero, and (2) the game with a large but finite number of stages. The authors give a complete theory of such games. The work implies all known existence theorems for optimal policies in Markov decision processes. It also generalizes all previous existence theorems for the value of a stochastic game. The approach differs from previous work in that it is algebraic and makes no use of the theory of Markov chains. The essential idea of this approach is to apply Tarski's Principle on real closed fields to a field of asymptotic expansions, which the authors term the field of real Puiseux series.

Journal ArticleDOI
TL;DR: The decision rules presented in the paper give a policy for estimating the transition probabilities successively from the viewpoint of dual control, leading to an optimal policy for the Markovian decision process with discounted rewards.
Abstract: This paper is concerned with an approach to Markovian decision processes with discounted rewards in which the transition probabilities are unknown. The processes are assumed to be finite-state, discrete-time and stationary. The decision rules presented in the paper give a policy for estimating the transition probabilities successively from the viewpoint of dual control, and the policy leads to an optimal ...
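
One simple reading of "estimating the transition probabilities successively" is a certainty-equivalence scheme: keep transition counts, re-estimate the matrices, and act greedily on the model re-solved with the current estimates. The sketch below implements only that naive reading and deliberately ignores the dual-control (exploration) aspect the paper emphasizes; the class and variable names are illustrative.

```python
import numpy as np

class AdaptiveDiscountedMDP:
    """Certainty-equivalence control of a finite MDP with unknown
    transition probabilities: maintain counts, re-estimate, re-solve."""

    def __init__(self, n_states, n_actions, rewards, beta):
        self.S, self.A, self.r, self.beta = n_states, n_actions, rewards, beta
        # Uniform pseudo-counts keep every estimated matrix a proper stochastic matrix.
        self.counts = np.ones((n_actions, n_states, n_states))

    def estimate(self):
        """Current estimate of the transition matrices, one per action."""
        return self.counts / self.counts.sum(axis=2, keepdims=True)

    def greedy_policy(self, iters=200):
        """Solve the discounted MDP for the current estimate and act greedily."""
        P_hat = self.estimate()
        v = np.zeros(self.S)
        for _ in range(iters):
            Q = np.array([self.r[a] + self.beta * P_hat[a] @ v
                          for a in range(self.A)])
            v = Q.max(axis=0)
        return Q.argmax(axis=0)   # greedy action for each state

    def observe(self, state, action, next_state):
        """Update the counts after each observed transition."""
        self.counts[action, state, next_state] += 1
```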

Book ChapterDOI
01 Jan 1977
TL;DR: An axiomatization of discounted dynamic programming with a continuous time parameter (CDP) when Markov policies are used is given, together with necessary and sufficient conditions for the existence of an optimal policy.
Abstract: We consider the problem of discounted dynamic programming with a continuous time parameter (CDP) when Markov policies are used. We give an axiomatization of such discounted CDP. We also give necessary and sufficient conditions for the existence of an optimal policy. Analogously to the discrete case, we formulate improvement theorems and a theorem on the existence of a (p, ε)-optimal policy in a class of semi-Markov policies.

Proceedings ArticleDOI
01 Dec 1977
TL;DR: This paper treats the computational solution of terminating stochastic games using concepts from deterministic matrix games and Markovian decision processes and a suboptimal approach suitable for large problems is presented.
Abstract: This paper treats the computational solution of terminating stochastic games using concepts from deterministic matrix games and Markovian decision processes. The algorithms due to Shapley and Pollatschek/Avi-Itzhak are discussed, and a suboptimal approach suitable for large problems is presented. Numerical results are given for an example comparing the three approaches.

Book ChapterDOI
01 Jan 1977
TL;DR: In this article, a recurrence formula for the difference between expected rewards and sojourn times generated by N transitions of a semi-Markov decision process with finite state space is presented.
Abstract: The paper presents a recurrence formula for the difference between expected rewards and sojourn times generated by N transitions of a semi-Markov decision process with finite state space. Using the recurrence formula, convergence of the policy iteration method can easily be verified, and necessary and sufficient optimality conditions for average optimal and more selective average-overtaking optimal policies are established.

Journal ArticleDOI
R.C.H. Cheng
TL;DR: An alternative approach is proposed which converts a problem involving the optimal control of a distributed parameter system into a so-called Markov decision process, which can be solved by mathematical programming methods.