
Showing papers on "Markov decision process published in 1982"


Journal ArticleDOI
TL;DR: A wide range of models in such areas as quality control, machine maintenance, internal auditing, learning, and optimal stopping are discussed within the POMDP-framework.
Abstract: This paper surveys models and algorithms dealing with partially observable Markov decision processes. A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process which permits uncertainty regarding the state of a Markov process and allows for state information acquisition. A general framework for finite state and action POMDP's is presented. Next, there is a brief discussion of the development of POMDP's and their relationship with other decision processes. A wide range of models in such areas as quality control, machine maintenance, internal auditing, learning, and optimal stopping are discussed within the POMDP-framework. Lastly, algorithms for computing optimal solutions to POMDP's are presented.
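Since the surveyed POMDP models all reduce to maintaining a probability distribution (a belief) over the unobserved state, a minimal sketch of the standard Bayes belief update may be useful; the two-state machine-maintenance numbers and the array layout below are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Standard Bayes update of a POMDP belief state.

    b : current belief over states, shape (S,)
    a : action index
    o : observation index
    T : transition probabilities, T[a, s, s'] = P(s' | s, a)
    O : observation probabilities, O[a, s', o] = P(o | s', a)
    """
    predicted = b @ T[a]                   # P(s' | b, a)
    unnormalized = predicted * O[a][:, o]  # weight by observation likelihood
    return unnormalized / unnormalized.sum()

# Tiny two-state machine-maintenance example (illustrative numbers only).
T = np.array([[[0.9, 0.1], [0.0, 1.0]],    # action 0: keep running
              [[1.0, 0.0], [1.0, 0.0]]])   # action 1: repair
O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P(obs | next state, action)
              [[0.8, 0.2], [0.3, 0.7]]])
b = np.array([1.0, 0.0])                   # start known to be in the good state
b = belief_update(b, a=0, o=1, T=T, O=O)   # observe a "bad" signal
print(b)
```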

703 citations


Journal ArticleDOI
TL;DR: Formulae are presented for the variance and higher moments of the present value of single-stage rewards in a finite Markov decision process, and similar formulae are exhibited for a semi-Markov decision process.
Abstract: Formulae are presented for the variance and higher moments of the present value of single-stage rewards in a finite Markov decision process. Similar formulae are exhibited for a semi-Markov decision process. There is a short discussion of the obstacles to using the variance formula in algorithms to maximize the mean minus a multiple of the standard deviation.
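The paper's own formulae are not reproduced here, but the following sketch shows one standard way the first two moments of the present value under a fixed policy can be obtained from two linear systems; the three-state chain, rewards, and discount factor are invented for illustration.

```python
import numpy as np

# Illustrative fixed-policy data (not from the paper): 3 states,
# state-dependent single-stage reward r, discount factor beta.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
r = np.array([1.0, 0.0, 2.0])
beta = 0.9
I = np.eye(3)

# First moment of the present value: v = r + beta * P v
v = np.linalg.solve(I - beta * P, r)

# Second moment: s_i = E[Y_i^2], where Y_i = r_i + beta * Y_{next state}
s = np.linalg.solve(I - beta**2 * P,
                    r**2 + 2 * beta * r * (P @ v))
variance = s - v**2
print(v, variance)
```

The second system follows from Y_i = r_i + beta * Y_J with J drawn from row i of P, which gives E[Y_i^2] = r_i^2 + 2*beta*r_i*(P v)_i + beta^2*(P s)_i.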

243 citations



Journal ArticleDOI
TL;DR: Global convergence of the algorithm is proven under very weak assumptions and the proof relates this technique to other iterative methods that have been suggested for general linear programs.
Abstract: An iterative aggregation procedure is described for solving large scale, finite state, finite action Markov decision processes (MDPs). At each iteration, an aggregate master problem and a sequence of smaller subproblems are solved. The weights used to form the aggregate master problem are based on the estimates from the previous iteration. Each subproblem is a finite state, finite action MDP with a reduced state space and unequal row sums. Global convergence of the algorithm is proven under very weak assumptions. The proof relates this technique to other iterative methods that have been suggested for general linear programs.
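As a rough, hedged illustration of the aggregation idea (not the paper's master/subproblem scheme), the snippet below forms an aggregate transition matrix and reward vector for one fixed policy from a partition of the states and per-cluster weights; in the algorithm described above these weights would come from the previous iteration rather than being uniform, and all data are made up.

```python
import numpy as np

P = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.1, 0.6, 0.3, 0.0],
              [0.0, 0.2, 0.5, 0.3],
              [0.0, 0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.0, 3.0])
clusters = [[0, 1], [2, 3]]            # illustrative partition of the states
w = np.array([0.5, 0.5, 0.5, 0.5])     # weights, summing to one within each cluster

K = len(clusters)
P_agg = np.zeros((K, K))
r_agg = np.zeros(K)
for a, Sa in enumerate(clusters):
    for b, Sb in enumerate(clusters):
        # weighted probability of moving from cluster a into cluster b
        P_agg[a, b] = sum(w[i] * P[i, Sb].sum() for i in Sa)
    r_agg[a] = sum(w[i] * r[i] for i in Sa)
print(P_agg, r_agg)   # small aggregate chain; its rows still sum to one
```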

65 citations



Journal ArticleDOI
TL;DR: It is shown that the algorithms are asymptotically optimal in the sense that the probability of selecting an optimal policy converges to unity.
Abstract: For a Markovian decision problem in which the transition probabilities are unknown, two learning algorithms are devised from the viewpoint of asymptotic optimality. At each stage, the algorithms select the decisions to be used on the basis not only of the estimates of the unknown probabilities but also of the uncertainty in those estimates. It is shown that the algorithms are asymptotically optimal in the sense that the probability of selecting an optimal policy converges to unity.
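The paper's algorithms are not reproduced here; as a generic sketch of the idea of acting on estimates plus a measure of their uncertainty, one can add an exploration bonus that shrinks as an action is observed more often, as below. The bonus form and the constant c are assumptions chosen for illustration only.

```python
import numpy as np

def select_action(q_estimates, n_visits, t, c=1.0):
    """Generic sketch (not the paper's rule): pick an action using value
    estimates plus an uncertainty bonus.

    q_estimates : estimated value of each action, computed from the current
                  estimates of the unknown transition probabilities
    n_visits    : how often each action has been tried in the current state
    t           : total number of decision epochs so far
    c           : assumed exploration constant
    """
    bonus = c * np.sqrt(np.log(max(t, 2)) / np.maximum(n_visits, 1))
    return int(np.argmax(q_estimates + bonus))

# Example: an action with a slightly lower estimate but little data is chosen.
print(select_action(np.array([1.0, 0.9]), np.array([50, 2]), t=52))
```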

31 citations


Journal ArticleDOI
TL;DR: In this article, families of alternative bandit processes are used as models for problems in a variety of areas; optimal strategies for these decision processes are determined by dynamic allocation indices, which are shown to play an important role in the evaluation of suboptimal strategies.
Abstract: Families of alternative bandit processes have been used as models for problems in a variety of areas. Optimal strategies for these decision processes are determined by dynamic allocation indices. These indices are here shown to play an important role in the evaluation of suboptimal strategies. Keywords: bandit problem; dynamic allocation index; Gittins index; Markov decision process; suboptimal strategies.

27 citations



Book ChapterDOI
01 Jan 1982
TL;DR: A sufficient condition for this to occur in the case where the problem can be modelled by a Markov decision process with costs depending only on the state of the process is presented.
Abstract: Some problems of stochastic allocation and scheduling have the property that there is a single strategy which minimizes the expected value of the costs incurred up to every finite time horizon. We present a sufficient condition for this to occur in the case where the problem can be modelled by a Markov decision process with costs depending only on the state of the process. The condition is used to establish the nature of the optimal strategies for problems of customer assignment, dynamic memory allocation, optimal gambling, maintenance and scheduling.

22 citations



Journal ArticleDOI
TL;DR: Gittins has shown that for a class of Markov decision processes called alternative bandit processes, optimal policies can easily be determined once the dynamic allocation indices (DAIs) for the constituent bandit processes are computed.

Journal ArticleDOI
TL;DR: In this paper, the authors consider several classes of control problems for Markov processes (continuous control, optimal stopping, impulse control) and study the discrete time approximation of the dynamic programming equation, using mainly an analytical approach.
Abstract: We consider several classes of control problems for Markov processes (continuous control, optimal stopping, impulse control). The formulation we use is valid for general Markov semigroups. We study the discrete time approximation of the dynamic programming equation, using mainly an analytical approach. Probabilistic interpretation is given for some of the results.
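As a hedged, finite-state illustration of what a discrete-time approximation of the dynamic programming equation looks like for optimal stopping (the paper's general Markov semigroup setting is not reproduced here), the recursion V_{n+1} = max(g, r + beta * P V_n) can be iterated to convergence; all numbers below are made up.

```python
import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])   # transition matrix of the uncontrolled chain
g = np.array([0.0, 1.0, 3.0])     # reward collected upon stopping
r = np.zeros(3)                   # running reward while continuing
beta = 0.95                       # discount factor

V = np.zeros(3)
for _ in range(1000):
    V_new = np.maximum(g, r + beta * (P @ V))
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
print(V)   # it is optimal to stop in the states where g attains the maximum
```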

Book ChapterDOI
TL;DR: In this chapter, the authors discuss Markov processes and functional analysis; the corresponding important analytical data are the infinitesimal generators (of transition semigroups) of Markov processes.
Abstract: This chapter discusses Markov processes and functional analysis. For the theory of Markov processes, the corresponding important analytical data are infinitesimal generators (of transition semigroups). Equivalent roles are played by Dirichlet forms in a large class of Markov processes. These notions, being relevant to diverse spaces of functions defined on the state space, may well be objects of independent interest without reference to the associated Markov processes on the state space. The Hille–Yosida theory of semigroups and the Beurling–Deny theory of Dirichlet spaces are also discussed in the chapter. The chapter outlines a difference between the formulation of a Markov process and that of other important stochastic processes, for example, a martingale.

Journal ArticleDOI
TL;DR: In this paper, a generalization of Markov Decision Processes with discreet time is presented, where the immediate rewards in every period are not deterministic but random, with the two first moments of the distribution given.
Abstract: In this article we present a generalization of Markov Decision Processes with discreet time where the immediate rewards in every period are not deterministic but random, with the two first moments of the distribution given. Formulas are developed to calculate the expected value and the variance of the reward of the process, which formulas generalize and partially correct other results. We make some observations about the distribution of rewards for processes with limited or unlimited horizon and with or without discounting. Applications with risk sensitive policies are possible; this is illustrated in a numerical example where the results are revalidated by simulation.



Journal ArticleDOI
TL;DR: Regardless of the method used, a straight application of one step of Howard's policy space method will give the desired results.
Abstract: In the general area of Markov decision processes, a lot of attention has been given to deriving upper and lower bounds for approximating the optimal performance level. These are, in themselves, not useful unless they can be used to derive an approximately optimal policy. The existing literature does this specifically in the context of the computational methods being used at the time. However, irrespective of the method used, a straight application of one step of Howard's policy space method will give the desired results.
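A hedged sketch of the point being made: given any approximate value function (for instance the midpoint of computed upper and lower bounds), one application of the policy-improvement step of Howard's method yields the policy that is greedy with respect to it. The two-action, three-state data below are illustrative, not from the paper.

```python
import numpy as np

def improve_policy(P, r, v, beta):
    """One policy-improvement step: in each state, choose the action that is
    greedy with respect to the (possibly approximate) value function v.

    P : transitions, P[a, s, s'] = p(s' | s, a)
    r : rewards,     r[a, s]
    v : approximate value function, shape (S,)
    """
    q = r + beta * (P @ v)           # q[a, s] = one-step lookahead value
    return np.argmax(q, axis=0)      # greedy action in every state

# Illustrative data (assumed for this sketch).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
r = np.array([[1.0, 0.5, 0.0],
              [2.0, 0.0, 1.0]])
v_approx = np.array([3.0, 2.0, 4.0])   # e.g. midpoint of upper/lower bounds
print(improve_policy(P, r, v_approx, beta=0.9))
```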

Journal ArticleDOI
TL;DR: In this article, a random walk type Markov decision process is considered, where the state space is an integer subset of IR and the action space is independent of i eI. The natural order is assumed, overI, and a quasi order, overK, together with aconditional convexity assumption on the returns.
Abstract: This paper considers a random walk type Markov decision process in which the state spaceI is an integer subset of IR m , and the action spaceK is independent ofi eI. The natural order , overI, and a quasi order, ′, overK, is assumed, together with aconditional convexity assumption on the returns {r i k }, and certain other assumptions about these rewards and the transition probabilities in relationship to the orders and ′.A negatively isotone policy is one for whichi i′→δ(i)⊁′)δ(i′) (i.e.δ(i) ′δ(i)′ orδ(i′) ′δi)). It is shown that, under specified conditions, a negatively isotone optimal policy exists. Some consideration is given to computational implications in particular relationship to Howard's policy space method.



Journal ArticleDOI
01 Dec 1982
TL;DR: It is shown how to treat the problems of time optimization and cost optimization of STEOR networks (GERT networks with only nodes of the “stochastic exclusive-or” type) within the scope of Markov decision processes and the related dynamic programming techniques.
Abstract: We show how to treat the problems of time optimization and cost optimization of STEOR networks (GERT networks with only nodes of the “stochastic exclusive-or” type) within the scope of Markov decision processes and present the related dynamic programming techniques.

Proceedings ArticleDOI
30 Aug 1982
TL;DR: A Markov decision process model is developed to analyze buffer assignment at the transport level of the ARPAnet protocol, and the result is a method for obtaining an assignment policy which is optimal with respect to a delay/throughput/overhead reward function.
Abstract: A Markov decision process model is developed to analyze buffer assignment at the transport level of the ARPAnet protocol. The result of the analysis is a method for obtaining an assignment policy which is optimal with respect to a delay/throughput/overhead reward function. The nature of the optimal policy is investigated by varying parameters of the reward function.

Journal ArticleDOI
TL;DR: It is shown that the multilayer control scheme of the above paper can be constructed by using available results on Markov renewal theory and semi-Markov decision processes.
Abstract: It is shown that the multilayer control scheme of the above paper can be constructed by using available results on Markov renewal theory and semi-Markov decision processes.



01 Jan 1982
TL;DR: It is demonstrated that significant improvements in system throughput can be realized by the use of optimal delayed resolution policies, and some simple procedures which use the values of model parameters can solve the analogue of the identification problem in adaptive control systems.
Abstract: A multi-class queue with typed servers has been used as an analytic model for a subnetwork node of a message switching network. In the context of such a model, this dissertation addresses the problem of designing an optimal policy for sharing finite buffers. Namely, given the values of the model parameters (i.e., the number of job classes, the buffer size, and the arrival and service functions), what buffer sharing policy should be used to obtain the optimal performance? Unlike past work in this area, no policy is assumed a priori. The search space of permissible buffer sharing policies is defined in terms of primitive actions and decisions. It is shown that an optimal policy must belong to the class of policies termed stationary delayed resolution policies. An iterative procedure based upon policy iteration methods for Markov decision processes is used to obtain the optimal delayed resolution policy. It is demonstrated that significant improvements in system throughput can be realized by the use of optimal delayed resolution policies. The policy iteration technique involves solving linear systems of equations. The order of the number of equations to be solved is a function of B^K, where K and B are the number of classes and the number of buffers, respectively. It therefore becomes intractable to solve for the optimal policy for reasonable values of B and K. A class of policies called SRS delayed resolution policies is proposed, and it is shown that the performance of the best SRS delayed resolution policies is close to that of the optimal delayed resolution policies. The buffer allocation model is applied to model a node of a store-and-forward network, and an analysis of the message delays with respect to a link level protocol is presented. It is shown that under some situations a non-delayed resolution policy may provide smaller message delays than a delayed resolution policy, and analysis is presented to determine the network parameters for which one class of policies is likely to provide smaller delays than the other. In networks where the traffic does not remain constant, the buffer sharing policies must be able to adapt to such a changing environment. Using the analogy of adaptive control systems, the adaptive policies may be obtained by generalizing the policies that are known to be optimal for static environments. It is shown that some simple procedures which use the values of the model parameters can solve the analogue of the identification problem in adaptive control systems.

Journal ArticleDOI
TL;DR: In this article, the authors consider a continuous time Markov decision process with a finite state space and show that the expected reward always tends to a limit as the time parameter $t \to \infty $.
Abstract: We consider a continuous time Markov decision process with a finite state space. There is a specified terminal reward, but the reward or cost rate is always zero. The maximum expected final gain can then be obtained by means of the exponential of a certain sublinear operator on $R^n$. This representation allows us to describe the asymptotic properties of the reward vector. We prove that the expected reward always tends to a limit as the time parameter $t \to \infty$. If we assume that it is allowed to stop the process in any state, then we can construct an almost optimal stationary control. Finally, we characterize the case where the asymptotic gain is independent of the initial state.
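A hedged sketch of the structure described above, with notation assumed here rather than taken from the paper: with zero running reward, action-dependent generator matrices Q(a), and terminal reward vector h, the maximal expected terminal reward satisfies a Bellman differential equation whose solution can be written as a nonlinear "exponential".

```latex
% Sketch under assumed notation: Q(a) are the controlled generator matrices,
% h is the terminal reward vector, and V_i(t) is the maximal expected reward
% over horizon t when starting in state i.
\[
  \frac{d}{dt}\,V(t) \;=\; \max_{a}\, Q(a)\,V(t), \qquad V(0) = h,
\]
% where the maximum is taken componentwise, so that
\[
  V(t) \;=\; e^{tA} h, \qquad (A v)_i \;=\; \max_{a}\,\bigl(Q(a)\,v\bigr)_i ,
\]
% with A a sublinear operator on R^n; the behaviour of V(t) as t -> infinity
% gives the asymptotic properties of the reward vector discussed above.
```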

Book ChapterDOI
01 Jan 1982
TL;DR: This paper discusses some aspects of the numerical analysis of discounted Markov decision processes, in particular the extent to which aggregation and disaggregation are advantageous.
Abstract: This paper discusses some aspects of the numerical analysis of discounted Markov decision processes. In particular, an attempt is made to exploit the special structure of the problem when choosing efficient algorithms. Examples of such special structures are the periodicity of demands and the structure of the action space in inventory models. For the latter, it is investigated to what extent aggregation and disaggregation are advantageous.