
Showing papers on "Markov decision process published in 1992"


Book
18 Dec 1992
TL;DR: In this paper, an introduction to optimal stochastic control for continuous time Markov processes and to the theory of viscosity solutions is given, as well as a concise introduction to two-controller, zero-sum differential games.
Abstract: This book is intended as an introduction to optimal stochastic control for continuous time Markov processes and to the theory of viscosity solutions. The authors approach stochastic control problems by the method of dynamic programming. The text provides an introduction to dynamic programming for deterministic optimal control problems, as well as to the corresponding theory of viscosity solutions. A new Chapter X gives an introduction to the role of stochastic optimal control in portfolio optimization and in pricing derivatives in incomplete markets. Chapter VI of the First Edition has been completely rewritten, to emphasize the relationships between logarithmic transformations and risk sensitivity. A new Chapter XI gives a concise introduction to two-controller, zero-sum differential games. Also covered are controlled Markov diffusions and viscosity solutions of Hamilton-Jacobi-Bellman equations. The authors have tried, through illustrative examples and selective material, to connect stochastic control theory with other mathematical areas (e.g. large deviations theory) and with applications to engineering, physics, management, and finance. In this Second Edition, new material on applications to mathematical finance has been added. Concise introductions to risk-sensitive control theory, nonlinear H-infinity control and differential games are also included.

3,885 citations


Journal ArticleDOI
TL;DR: In this article, the authors considered the multiarmed bandit problem and presented a new proof of the optimality of the Gittins index policy, which does not require an interchange argument.
Abstract: This paper considers the multiarmed bandit problem and presents a new proof of the optimality of the Gittins index policy. The proof is intuitive and does not require an interchange argument. The insight it affords is used to give a streamlined summary of previous research and to prove a new result: The optimal value function is a submodular set function of the available projects.

245 citations
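The index policy above can be made concrete with a toy computation. The following sketch (not from the paper) approximates the Gittins index of a single Bernoulli arm with a Beta(a, b) posterior by the standard calibration idea: binary-search for the constant per-step retirement reward at which retiring and continuing to play have equal value. The discount factor, truncation depth, and prior are illustrative assumptions.

from functools import lru_cache

BETA = 0.9    # discount factor (assumed)
T = 100       # truncation depth of the value recursion (assumed)

def continue_value(a, b, lam):
    """Value of the arm when retirement pays lam per step forever."""
    @lru_cache(maxsize=None)
    def V(a_, b_, t):
        retire = lam / (1.0 - BETA)
        p = a_ / (a_ + b_)
        if t == T:                          # truncate: play forever at the current mean
            return max(retire, p / (1.0 - BETA))
        play = p * (1.0 + BETA * V(a_ + 1, b_, t + 1)) + (1.0 - p) * BETA * V(a_, b_ + 1, t + 1)
        return max(retire, play)
    return V(a, b, 0)

def gittins_index(a, b, tol=1e-4):
    lo, hi = 0.0, 1.0                       # Bernoulli rewards live in [0, 1]
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        # If playing at least once still beats retiring at lam, the index exceeds lam.
        if continue_value(a, b, lam) > lam / (1.0 - BETA) + 1e-12:
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

print(gittins_index(1, 1))   # uniform prior: the index exceeds the posterior mean 0.5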


Proceedings ArticleDOI
24 Oct 1992
TL;DR: This paper considers the problem of paging under the assumption that the sequence of pages accessed is generated by a Markov chain, and draws on the theory of Markov decision processes to characterize the paging algorithm that achieves the optimal fault-rate on any Markov chain.
Abstract: This paper considers the problem of paging under the assumption that the sequence of pages accessed is generated by a Markov chain. The authors use this model to study the fault-rate of paging algorithms, a quantity of interest to practitioners. They first draw on the theory of Markov decision processes to characterize the paging algorithm that achieves optimal fault-rate on any Markov chain. They address the problem of efficiently devising a paging strategy with low fault-rate for a given Markov chain. They show that a number of intuitively good approaches fail. Their main result is an efficient procedure that, on any Markov chain, will give a paging algorithm with fault-rate at most a constant times optimal. Their techniques also show that some algorithms that do poorly in practice fail in the Markov setting, despite known (good) performance guarantees when the requests are generated independently from a probability distribution.

92 citations
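As a toy illustration of the first step the abstract describes (casting Markov-chain paging as a Markov decision process and solving for the fault-rate-optimal eviction rule), the sketch below runs relative value iteration for the average-cost criterion on a 3-page chain with a 2-page cache. The request chain P is invented; this is not the authors' efficient procedure, only the exact MDP baseline they compare against.

import itertools
import numpy as np

# Request chain over 3 pages (rows sum to 1; numbers are made up).
P = np.array([[0.1, 0.8, 0.1],
              [0.5, 0.1, 0.4],
              [0.3, 0.6, 0.1]])
PAGES, K = range(3), 2                     # cache holds K of the 3 pages

# States: (cache contents, pending request).
caches = [frozenset(c) for c in itertools.combinations(PAGES, K)]
states = [(c, r) for c in caches for r in PAGES]
idx = {s: i for i, s in enumerate(states)}

def step(cache, r, evict):
    """Serve request r (evicting `evict` on a fault) and draw the next request."""
    fault = r not in cache
    new_cache = frozenset((cache - {evict}) | {r}) if fault else cache
    cost = 1.0 if fault else 0.0
    return cost, [(idx[(new_cache, nr)], P[r, nr]) for nr in PAGES]

h = np.zeros(len(states))
for _ in range(2000):                      # relative value iteration, average-cost criterion
    h_new = np.empty_like(h)
    for i, (cache, r) in enumerate(states):
        actions = [None] if r in cache else sorted(cache)   # choose a victim only on a fault
        h_new[i] = min(cost + sum(p * h[j] for j, p in nxt)
                       for cost, nxt in (step(cache, r, a) for a in actions))
    gain = h_new[0] - h[0]                 # estimate of the optimal long-run fault rate
    h = h_new - h_new[0]
print("approximate optimal fault rate:", gain)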


Journal ArticleDOI
TL;DR: Provides a cost error bound for a general rolling horizon algorithm applied to infinite horizon nonhomogeneous Markov decision processes, and shows that the error goes to zero, for any fixed rolling horizon, as a Doeblin measure of control over the future decreases.
Abstract: By far the most common planning procedure found in practice is to approximate the solution to an infinite horizon problem by a series of rolling finite horizon solutions. Although many empirical studies have been done, this so-called rolling horizon procedure has been the subject of few analytic studies. We provide a cost error bound for a general rolling horizon algorithm when applied to infinite horizon nonhomogeneous Markov decision processes, both in the discounted and average cost cases. We show that a Doeblin coefficient of ergodicity acts much like a discount factor to reduce this error. In particular, we show that the error goes to zero for any fixed rolling horizon as this Doeblin measure of control over the future decreases. The theory is illustrated through an application to vehicle deployment.

81 citations
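A minimal sketch of the rolling horizon procedure analyzed in the paper, for a small finite nonhomogeneous MDP: at each period the controller solves a short finite-horizon dynamic program with the time-varying data and commits only to the first action. The state/action counts, discount factor, and randomly generated transition and reward data are placeholders.

import numpy as np

rng = np.random.default_rng(0)
S, A, HORIZON = 4, 2, 12                    # small example sizes (assumed)

# Nonhomogeneous data: a different transition kernel and reward matrix for every period t.
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(HORIZON)]   # P[t][s, a, s']
R = [rng.uniform(0.0, 1.0, size=(S, A)) for _ in range(HORIZON)]

def first_action(t0, s0, roll=3, beta=0.95):
    """Solve a `roll`-step dynamic program starting at period t0; return only its first action."""
    V = np.zeros(S)
    policy = None
    for t in reversed(range(t0, min(t0 + roll, HORIZON))):
        Q = R[t] + beta * P[t] @ V          # Q[s, a]
        V = Q.max(axis=1)
        policy = Q.argmax(axis=1)           # after the loop this is the period-t0 policy
    return policy[s0]

# Rolling horizon control of one trajectory.
s, total = 0, 0.0
for t in range(HORIZON):
    a = first_action(t, s)
    total += R[t][s, a]
    s = rng.choice(S, p=P[t][s, a])
print("realized reward along the trajectory:", total)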


Journal ArticleDOI
TL;DR: The objective of the study is to endogenize rework and scrap decisions in a multistage production process, via a Markov decision process model that is developed and solved with dynamic programming techniques.
Abstract: This study is motivated by a make-to-order marketing environment where an order is met from a single production lot size. The objective of the study is to endogenize rework and scrap decisions in a multistage production process. A Markov decision process model is developed and solved using dynamic programming techniques. The model assumes that demand is given, and material, processing and rework costs are linear in the production lot size. Modeling random yield at each stage of the production process is of key interest. The solution to the problem is characterized and the sensitivity of the solution to the parameters of the model is examined.

70 citations
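A toy backward recursion in the spirit of the model described (not the paper's exact formulation): after each stage the yield is random, and the controller decides whether to rework the defective units or scrap them, trading rework cost against the shortage penalty of missing the ordered quantity. All numbers and the specific yield model are illustrative assumptions.

from functools import lru_cache
from math import comb

# Illustrative data: 3 processing stages, ordered quantity D, production lot LOT.
STAGES, D, LOT = 3, 8, 12
Q_GOOD, Q_REWORK = 0.8, 0.6        # per-stage survival prob.; recovery prob. of rework
C_REWORK, C_SHORT = 1.0, 10.0      # rework cost per defective unit; shortage penalty per unit

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

@lru_cache(maxsize=None)
def V(stage, good):
    """Minimal expected remaining cost with `good` conforming units entering `stage`."""
    if stage == STAGES:
        return C_SHORT * max(0, D - good)
    expected = 0.0
    for g in range(good + 1):                       # random yield of this stage
        defective = good - g
        scrap = V(stage + 1, g)
        rework = C_REWORK * defective + sum(
            binom_pmf(r, defective, Q_REWORK) * V(stage + 1, g + r)
            for r in range(defective + 1))
        expected += binom_pmf(g, good, Q_GOOD) * min(scrap, rework)
    return expected

print("expected cost under the optimal rework/scrap policy:", V(0, LOT))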


Journal ArticleDOI
TL;DR: Two algorithms for the solution of the underlying limit Markov control problem are presented, a linear program possessing the Wolfe-Dantzig structure inherited from the ergodic 'nearly decomposable' assumption in the model and an aggregation-disaggregation policy improvement algorithm.
Abstract: A singularly perturbed Markov decision process with the limiting average reward criterion is considered. It is assumed that the underlying process is composed of n separate irreducible processes, and that the small perturbation is such that it unites these processes into a single irreducible process. Two algorithms for the solution of the underlying limit Markov control problem are presented. The first of these is a linear program possessing the Wolfe-Dantzig structure inherited from the ergodic 'nearly decomposable' assumption in the model. The second is an aggregation-disaggregation policy improvement algorithm.

69 citations
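Not the paper's perturbation-specific program, but a reminder of the underlying object: the standard occupation-measure linear program for a unichain average-reward MDP, which is the kind of problem the Wolfe-Dantzig structured LP specializes. The sketch below solves it with scipy.optimize.linprog on made-up data.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
S, A = 4, 3                                     # state and action counts (assumed)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, j]: transition probabilities
R = rng.uniform(0.0, 1.0, size=(S, A))          # one-step rewards

# Variables x[s, a]: long-run state-action frequencies (an occupation measure).
n = S * A
c = -R.reshape(n)                               # linprog minimizes, so negate the reward

# Balance constraints: sum_a x[j, a] - sum_{s, a} P[s, a, j] x[s, a] = 0 for every j,
# plus the normalization sum_{s, a} x[s, a] = 1.
A_eq = np.zeros((S + 1, n))
for j in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - P[s, a, j]
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
x = res.x.reshape(S, A)
print("optimal average reward:", -res.fun)
print("optimal (possibly randomized) policy:")
print(x / x.sum(axis=1, keepdims=True))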


Journal ArticleDOI
TL;DR: A unified approach to the asymptotic analysis of a Markov decision process disturbed by an ε-additive perturbation is proposed in this article, where the underlying control problem that needs to be understood is the limit Markov control problem.
Abstract: A unified approach to the asymptotic analysis of a Markov decision process disturbed by an ε-additive perturbation is proposed. Irrespective of whether the perturbation is regular or singular, the underlying control problem that needs to be understood is the limit Markov control problem. The properties of this problem are studied.

67 citations


Journal ArticleDOI
TL;DR: This work considers discrete time average cost Markov decision processes with countable state space and finite action sets and concludes that the Sennott conditions are the weakest.

66 citations


Journal ArticleDOI
TL;DR: In this article, the authors reconstruct a proof of a classical result due to Hardy and Littlewood, and provide either examples or complete citations for related cases that are not covered by the Hardy-Littlewood theorem.
Abstract: In this note, we reconstruct a proof of a classical result due to Hardy and Littlewood. While this result has played an important role in the modern theories of Markov decision processes and stochastic games, it is not that easy to find its proof in the literature in the format in which it has been applied. Furthermore, we supply either examples or complete citations for the other related cases which are not covered by the Hardy-Littlewood theorem.

51 citations


Proceedings ArticleDOI
16 Dec 1992
TL;DR: It is shown that there exists an optimal stationary policy (such that the decisions depend only on the actual number of customers in the queue); it is of a threshold type, and it uses randomization in at most one state.
Abstract: The author considers the problem of dynamic flow control of arriving packets into an infinite buffer. The service rate may depend on the state of the system, may change in time, and is unknown to the controller. The goal of the controller is to design an efficient policy which guarantees the best performance under the worst service conditions. The cost is composed of a holding cost, a cost for rejecting customers (packets) and a cost that depends on the quality of the service. The problem is studied in the framework of zero-sum Markov games, and a value iteration algorithm is used to solve it. It is shown that there exists an optimal stationary policy (such that the decisions depend only on the actual number of customers in the queue); it is of a threshold type, and it uses randomization in at most one state.

45 citations
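A generic sketch of the solution machinery the abstract invokes: Shapley-style value iteration for a discounted zero-sum Markov game, with the stage game at each state solved as a small linear program. The two-action, three-state data are invented; the paper's actual flow-control model (queue-length states, threshold structure, randomization in one state) is not reproduced here.

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M, where the row player minimizes cost."""
    nA, nB = M.shape
    c = np.zeros(nA + 1)
    c[-1] = 1.0                                           # minimize the guaranteed cost v
    A_ub = np.hstack([M.T, -np.ones((nB, 1))])            # x @ M[:, b] <= v for every column b
    b_ub = np.zeros(nB)
    A_eq = np.zeros((1, nA + 1))
    A_eq[0, :nA] = 1.0                                    # x is a probability vector
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * nA + [(None, None)])
    return res.x[-1]

rng = np.random.default_rng(2)
S, nA, nB, beta = 3, 2, 2, 0.9                            # toy sizes and discount (assumed)
cost = rng.uniform(0.0, 1.0, size=(S, nA, nB))            # stage cost c(s, a, b)
P = rng.dirichlet(np.ones(S), size=(S, nA, nB))           # P[s, a, b, s']

V = np.zeros(S)
for _ in range(200):                                      # Shapley value iteration
    V = np.array([matrix_game_value(cost[s] + beta * P[s] @ V) for s in range(S)])
print("discounted game value per state:", V)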


Journal ArticleDOI
TL;DR: It is shown that the operator-theoretical approach presented for multichain Markov decision processes with a countable state space, compact action sets and unbounded rewards can also be carried out under recurrence conditions.
Abstract: In a previous paper, Dekker and Hordijk (1988) presented an operator-theoretical approach for multichain Markov decision processes with a countable state space, compact action sets and unbounded rewards. Conditions were presented guaranteeing the existence of a Laurent series expansion for the discounted rewards, the existence of average and Blackwell optimal policies and the existence of solutions for the average and Blackwell optimality equations. While these assumptions were operator-oriented and formulated as conditions for the deviation matrix, we will show in this paper that the same approach can also be carried out under recurrence conditions. These new conditions seem easier to check in general and are especially suited for applications in queueing models.

Proceedings ArticleDOI
01 May 1992
TL;DR: A state-dependent call admission and routing policy for a multiservice circuit-switched network is analyzed; the numerical study shows that convergence of the analyzed strategy is achieved in at most two iterations, and demonstrates the good traffic efficiency of the approach.
Abstract: A state-dependent call admission and routing policy for a multiservice circuit-switched network is analyzed. The policy is based on decomposition of the Markov decision problem into a set of separable link problems. To provide an exact link analysis model, a value iteration algorithm is offered. This allows examination of the accuracy of several approximations used to reduce the complexity of the problem. The numerical study showed that convergence of the analyzed strategy is achieved in at most two iterations. The study also showed the good traffic efficiency of the approach and confirmed the predicted ability to control the distribution of the call classes' grade of service. The approach, together with its sensitivity analysis with respect to the arrival rates, provides a very general framework for studying, constructing, and optimizing other call admission and routing strategies. The results of the sensitivity analysis are used to compare the proposed decomposition approach with the decomposition approach developed by F.P. Kelly (1988) for optimization of a load sharing policy. Also, the relationship to other routing strategies based on Markov decision theory is investigated.
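A stripped-down sketch of the kind of per-link problem such a decomposition produces: one link with a fixed capacity, two call classes with different bandwidths and revenues, and an admission decision for each arriving class, solved by value iteration on the uniformized chain. Capacities, rates, and revenues are placeholders, and the routing layer and sensitivity analysis of the paper are not shown.

# One link with CAP bandwidth units and two call classes (all data assumed).
CAP = 10
b   = [1, 2]          # bandwidth required per call of each class
lam = [0.6, 0.3]      # arrival rates
mu  = [1.0, 0.5]      # per-call departure rates
rev = [1.0, 2.5]      # revenue collected when a call is accepted
GAMMA = 0.1           # continuous-time discount rate

states = [(n1, n2) for n1 in range(CAP + 1) for n2 in range(CAP + 1)
          if b[0] * n1 + b[1] * n2 <= CAP]
V = {s: 0.0 for s in states}

# Uniformization constant: bounds the total event rate in every state.
LAM = sum(lam) + (CAP // b[0]) * mu[0] + (CAP // b[1]) * mu[1]

for _ in range(2000):                      # value iteration on the uniformized chain
    V_new = {}
    for n in states:
        total = 0.0
        for k in range(2):
            up = (n[0] + (k == 0), n[1] + (k == 1))
            # Admission decision: accept (and collect rev[k]) only if feasible and worthwhile.
            accept = rev[k] + V[up] if up in V else float("-inf")
            total += lam[k] * max(accept, V[n])
            if n[k] > 0:                   # a class-k call in progress may depart
                down = (n[0] - (k == 0), n[1] - (k == 1))
                total += n[k] * mu[k] * V[down]
        slack = LAM - (sum(lam) + n[0] * mu[0] + n[1] * mu[1])
        total += slack * V[n]              # fictitious self-transition from uniformization
        V_new[n] = total / (LAM + GAMMA)
    V = V_new
print("expected discounted revenue of an empty link:", V[(0, 0)])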

Journal ArticleDOI
TL;DR: In this paper, the authors investigated semi-Markov games under discounted and limiting average payoff criteria, and proved the existence of a solution to the optimality equation under a natural ergodic condition.
Abstract: Semi-Markov games are investigated under discounted and limiting average payoff criteria. The issues of the existence of the value and of a pair of stationary optimal strategies are settled; the optimality equation is studied, and under a natural ergodic condition the existence of a solution to the optimality equation is proved for the limiting average case. Semi-Markov games provide useful flexibility in constructing recursive game models. All the work on Markov/semi-Markov decision processes and Markov (stochastic) games can be viewed as special cases of the developments in this paper.

Journal ArticleDOI
TL;DR: In this article, an approximate method combining dynamic programming and stochastic simulation in the determination of a set of descriptive parameters is suggested, which is used in the calculation of the multi-component replacement criterion for cows and heifers.

Proceedings ArticleDOI
Gary Bradski
07 Jun 1992
TL;DR: An optimal control solution to change of machine setup scheduling based on dynamic programming average cost per stage value iteration as set forth by M. Caramanis et al. (1991) is demonstrated.
Abstract: An optimal control solution to machine setup-change scheduling is demonstrated, based on the dynamic programming average-cost-per-stage value iteration set forth by M. Caramanis et al. (1991) for the 2-D case. The difficulty with the optimal approach lies in the explosive computational growth of the resulting solution. A method of reducing the computational complexity is developed using ideas from biology and neural networks. A real-time controller is described that uses a linear-log representation of state space with neural networks employed to fit cost surfaces.

Journal ArticleDOI
TL;DR: Two definitions of variability are introduced, namely, the expected time-average variability and time-average expected variability, and a randomized stationary policy is constructed that is ε-optimal for both criteria.
Abstract: Considered are time-average Markov Decision Processes (MDPs) with finite state and action spaces. Two definitions of variability are introduced, namely, the expected time-average variability and time-average expected variability. The two criteria are in general different, although they can both be employed to penalize for variance in the stream of rewards. For communicating MDPs, we construct a randomized stationary policy that is ε-optimal for both criteria; the policy is optimal and pure for a specific variability function. For general multichain MDPs, a state space decomposition leads to a similar result for the expected time-average variability. We also consider the problem of the decision maker choosing the initial state along with the policy.

Journal ArticleDOI
TL;DR: An iterative algorithm for computing an ε-optimal nonstationary policy with a very simple structure is presented, thereby allowing the decision maker to place more or less emphasis on the short-term versus the long-term rewards by varying their weights.
Abstract: The two most commonly considered reward criteria for Markov decision processes are the discounted reward and the long-term average reward. The first tends to “neglect” the future, concentrating on the short-term rewards, while the second one tends to do the opposite. We consider a new reward criterion consisting of the weighted combination of these two criteria, thereby allowing the decision maker to place more or less emphasis on the short-term versus the long-term rewards by varying their weights. The mathematical implications of the new criterion include: the deterministic stationary policies can be outperformed by the randomized stationary policies, which in turn can be outperformed by the nonstationary policies; an optimal policy might not exist. We present an iterative algorithm for computing an ε-optimal nonstationary policy with a very simple structure.
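A small numerical illustration (not the paper's algorithm) of the weighted criterion itself: for a fixed stationary policy on a toy chain, compute its discounted value and its long-run average reward, and blend them with a weight w. The chain, rewards, weight, and the particular normalization of the discounted part are assumptions.

import numpy as np

# A fixed stationary policy induces a Markov chain P and a reward vector r (toy data).
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
r = np.array([1.0, 0.2, 2.0])
beta, w = 0.9, 0.4              # discount factor and weight on the discounted part

# Discounted value: V = (I - beta P)^{-1} r.
V_disc = np.linalg.solve(np.eye(3) - beta * P, r)

# Long-run average reward: g = pi . r, with pi the stationary distribution of P.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
g = float(pi @ r)

# Weighted criterion per starting state: w * (1 - beta) * V_disc + (1 - w) * g.
# (Scaling the discounted part by (1 - beta) puts both terms on a per-step scale;
#  the paper's exact normalization may differ.)
print(w * (1.0 - beta) * V_disc + (1.0 - w) * g)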

Journal ArticleDOI
TL;DR: The result of Sennott [9] on the existence of optimal stationary policies in countable state Markov decision chains with finite action sets is generalized to arbitrary state space Markov decision chains.
Abstract: The result of Sennott [9] on the existence of optimal stationary policies in countable state Markov decision chains with finite action sets is generalized to arbitrary state space Markov decision chains. The assumption of finite action sets occurring in a global countable action space allows a particularly simple theoretical structure for the general state space Markov decision chain. Two examples illustrate the results. Example 1 is a system of parallel queues with stochastic work requirements, a movable server with controllable service rate, and a reject option. Example 2 is a system of parallel queues with stochastic controllable inputs, a movable server with fixed service rates, and a reject option.

Journal ArticleDOI
TL;DR: A Benders decomposition approach to solving this problem is given that evaluates the stopping rule, eliminates some suboptimal combinations of actions, and yields bounds on the maximum error that could result from the selection of a candidate action in the initial stage.
Abstract: We formulate a mixed integer program to determine whether a finite time horizon is a forecast horizon in a nonhomogeneous Markov decision process. We give a Benders decomposition approach to solving this problem that evaluates the stopping rule, eliminates some suboptimal combinations of actions, and yields bounds on the maximum error that could result from the selection of a candidate action in the initial stage. The integer program arising from the decomposition has special properties that allow efficient solution. We illustrate the approach with numerical examples.

Journal ArticleDOI
01 Jan 1992
TL;DR: Competitive Markov Decision Processes in which the controllers/players are antagonistic and aggregate their sequences of expected rewards according to “weighted” or “horizon-sensitive” criteria are considered.
Abstract: We consider Competitive Markov Decision Processes in which the controllers/players are antagonistic and aggregate their sequences of expected rewards according to “weighted” or “horizon-sensitive” criteria. These are either a convex combination of two discounted objectives, or of one discounted and one limiting average reward objective. In both cases we establish the existence of the game-theoretic value vector, and supply a description of ε-optimal non-stationary strategies.

Journal ArticleDOI
TL;DR: In this article, the authors proved the equivalence of ten stability/ergodicity conditions on the transition law of the model, which imply the existence of average optimal stationary policies for an arbitrary continuous and bounded reward function.
Abstract: We are concerned with Markov decision processes with countable state space and discrete-time parameter. The main structural restriction on the model is the following: under the action of any stationary policy the state space is a communicating class. In this context, we prove the equivalence of ten stability/ergodicity conditions on the transition law of the model, which imply the existence of average optimal stationary policies for an arbitrary continuous and bounded reward function; these conditions include the Lyapunov function condition (LFC) introduced by A. Hordijk. As a consequence of our results, the LFC is proved to be equivalent to the following: under the action of any stationary policy the corresponding Markov chain has a unique invariant distribution which depends continuously on the stationary policy being used. A weak form of the latter condition was used by one of the authors to establish the existence of optimal stationary policies using an approach based on renewal theory.

Journal ArticleDOI
TL;DR: A knowledge acquisition method that extracts the real-time scheduling rules from the optimal policy of the user-based semi-Markov decision processes and combines the human's knowledge of real-time scheduling with the optimization technique to create a better knowledge resource.
Abstract: This paper proposes a knowledge acquisition method that extracts the real-time scheduling rules from the optimal policy of the user-based semi-Markov decision processes. This method combines the human's knowledge of real-time scheduling with the optimization technique to create a better knowledge resource. A revised rule formation algorithm developed in trace-driven knowledge acquisition (TDKA) is used to generalize the optimal policy derived from semi-Markov decision processes. A hoist scheduling problem in circuit board production lines demonstrates the feasibility and the superior performance of the proposed method.

Journal ArticleDOI
TL;DR: Under a certain penalizing condition on the cost for unstable behavior, the existence of a stable stationary strategy which is strong average optimal is established.

Journal ArticleDOI
TL;DR: In a sense made precise in the paper, the algorithm offered is shown to attain asymptotically optimal performance, and rates are assured.
Abstract: A vigorous branch of automatic learning is directed at the task of locating a global minimum of an unknown multimodal function f(θ) on the basis of noisy observations L(θ(i)) = f(θ(i)) + W(θ(i)) taken at sequentially chosen control points θ(i). In all preceding convergence derivations known to the authors, the noise is postulated to depend on the past only through control selection. Here they allow the observation noise sequence to be stochastically dependent, in particular, a function of an unknown underlying Markov decision process, the observations being the stagewise losses. In a sense made precise in the paper, the algorithm offered is shown to attain asymptotically optimal performance, and rates are assured. A motivating example from queueing theory is offered, and connections with classical problems of Markov control theory and other disciplines are mentioned.
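The problem setting can be illustrated with a deliberately naive scheme (not the authors' algorithm, and with independent rather than Markov-modulated noise): sample candidate control points, average an increasing number of noisy observations at each, and keep the apparent minimizer. The objective and noise level are invented.

import numpy as np

rng = np.random.default_rng(4)

def f(theta):
    """Unknown multimodal objective (chosen here purely for illustration)."""
    return np.sin(3.0 * theta) + 0.1 * theta ** 2

def noisy_loss(theta):
    return f(theta) + rng.normal(scale=0.3)     # observation corrupted by noise

best_theta, best_est = None, np.inf
for i in range(1, 201):
    theta = rng.uniform(-3.0, 3.0)              # sequentially chosen control point
    n_obs = 5 + i // 10                         # average more observations as the search goes on
    est = np.mean([noisy_loss(theta) for _ in range(n_obs)])
    if est < best_est:
        best_theta, best_est = theta, est
print("apparent global minimizer:", best_theta)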

Journal ArticleDOI
TL;DR: Three computational approaches for solving a variance-penalised Markov decision process are developed, viz. parametric linear programming, parametric Lagrangean programming, and a parametric policy space approach.
Abstract: This paper develops three computational approaches for solving a variance-penalised Markov decision process, viz. parametric linear programming, parametric Lagrangean programming, and a parametric policy space approach. For a Markov decision model in which the average reward is reduced by a fraction of the mean variance, three types of solution methods are developed, namely parametric linear programming, parametric Lagrangean optimization, and parametric policy iteration algorithms.

Journal ArticleDOI
Qi Ying Hu
TL;DR: A new condition on the unbounded rewards is presented under which the discounted optimality equation has a unique solution; sufficient conditions are then given for the existence of a solution of the average optimality equation in discrete-time Markov decision processes.

Journal ArticleDOI
TL;DR: In this paper, the existence of an optimal stationary policy under structural restrictions on the model is proved; both the lim inf and lim sup average criteria are considered, and the arguments are based on well-known facts from Renewal Theory.
Abstract: We consider discrete-time average reward Markov decision processes with denumerable state space and bounded reward function. Under structural restrictions on the model the existence of an optimal stationary policy is proved; both the lim inf and lim sup average criteria are considered. In contrast to the usual approach, our results do not rely on the average reward optimality equation. Rather, the arguments are based on well-known facts from Renewal Theory.

Journal ArticleDOI
TL;DR: In this paper, the authors considered the vector-valued Markov decision process and considered the characterization of optimal stationary policies among the set of all (randomized, history-dependent) policies.

Journal ArticleDOI
TL;DR: This article applies an ergodic Markov chain process to teachers' decision making and obtains a measure of a teacher's ability as a decision maker in the process of asking questions in class, which may be helpful to the teacher in developing a better questioning technique that will encourage his students to think at higher levels.
Abstract: We apply an ergodic Markov chain process to teachers’ decision making. Through this we succeed in giving a new approach to the teachers’ ‘behaviour’ in the decision‐making process and obtain a measure of a teacher's ability as a decision maker in the process of asking questions in class. This measure may be helpful to the teacher of mathematics in developing a better questioning technique that will encourage his students to think at higher levels.
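For readers unfamiliar with the tool being applied, here is a minimal sketch of the quantity an ergodic Markov chain analysis rests on: the stationary distribution, which in this setting plays the role of the teacher's long-run profile over questioning levels. The three-state chain and its labels are hypothetical.

import numpy as np

# Hypothetical chain over three questioning levels: recall, comprehension, analysis.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Stationary distribution: solve pi P = pi together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print("long-run share of time at each questioning level:", pi)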

Proceedings ArticleDOI
16 Dec 1992
TL;DR: Almost sure convergence is established by an indirect argument that blends standard results on stochastic approximations with a version of the law of large numbers for martingale differences; these convergence properties provide an alternative proof of some of the properties of steering policies.
Abstract: The authors consider a specific multidimensional stochastic approximation scheme of the Robbins-Monro type that naturally arises in the study of steering policies for Markov decision processes. The usual convergence results (in the almost sure sense) do not seem to apply to this simple scheme. Almost sure convergence is established by an indirect argument that blends standard results on stochastic approximations with a version of the law of large numbers for martingale differences. These convergence properties provide an alternative proof for some of the properties of steering policies.
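Independently of the steering-policy application, a bare-bones reminder of what a Robbins-Monro scheme does: drive θ toward a root of the expectation E[g(θ, noise)] = 0 using only noisy evaluations and step sizes that sum to infinity but are square-summable. The target function and noise below are invented, and the noise is i.i.d. rather than the dependent noise treated in the paper.

import numpy as np

rng = np.random.default_rng(3)

def noisy_g(theta):
    """Noisy observation of g(theta) = theta - 2; the root being sought is theta* = 2."""
    return (theta - 2.0) + rng.normal(scale=0.5)

theta = 0.0
for n in range(1, 10001):
    a_n = 1.0 / n                      # step sizes: sum a_n diverges, sum a_n^2 converges
    theta -= a_n * noisy_g(theta)
print("Robbins-Monro estimate of the root:", theta)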