
Showing papers on "Markov decision process published in 1989"


Book ChapterDOI
01 Jan 1989
TL;DR: This chapter introduces the stochastic control processes, also known as Markov decision processes or Markov dynamic programs, and discusses (briefly) more general control systems, such as non-stationary CMP’s and semi-Markov control models.
Abstract: The objective of this chapter is to introduce the stochastic control processes we are interested in; these are the so-called (discrete-time) controlled Markov processes (CMP’s), also known as Markov decision processes or Markov dynamic programs. The main part is Section 1.2. It contains some basic definitions and the statement of the optimal and the adaptive control problems studied in this book. In Section 1.3 we present several examples; the idea is to illustrate the main concepts and provide sources for possible applications. Also in Section 1.3 we discuss (briefly) more general control systems, such as non-stationary CMP’s and semi-Markov control models. The chapter is concluded in Section 1.4 with some comments on related references.

399 citations
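For concreteness, the optimal control problem described in this chapter abstract can be written in the standard discounted form; the display below uses generic notation (state space X, admissible actions A(x), transition law P, one-stage reward r, discount factor β), not necessarily the chapter's own symbols.

```latex
% Generic discrete-time controlled Markov process (MDP) and its discounted
% optimal control problem; notation assumed, not taken from the chapter.
\[
V^{*}(x) \;=\; \sup_{\pi}\;\mathbb{E}^{\pi}_{x}\!\Big[\sum_{t=0}^{\infty}\beta^{t}\,r(x_t,a_t)\Big],
\qquad 0<\beta<1,
\]
\[
V^{*}(x) \;=\; \max_{a\in A(x)}\Big\{\,r(x,a)+\beta\sum_{y\in X}P(y\mid x,a)\,V^{*}(y)\Big\},
\qquad x\in X .
\]
```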


Journal ArticleDOI
TL;DR: In this paper, the authors considered infinite state Markov decision processes with unbounded costs and provided sufficient conditions for the existence of a distinguished state of smallest discounted value and a single stationary policy inducing an irreducible, ergodic Markov chain for which the average cost of a first passage from any state to the distinguished state is finite.
Abstract: We deal with infinite state Markov decision processes with unbounded costs. Three simple conditions, based on the optimal discounted value function, guarantee the existence of an expected average cost optimal stationary policy. Sufficient conditions are the existence of a distinguished state of smallest discounted value and a single stationary policy inducing an irreducible, ergodic Markov chain for which the average cost of a first passage from any state to the distinguished state is finite. A result to verify this is also given. Two examples illustrate the ease of applying the criteria.

199 citations
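In generic notation (symbols assumed here, not necessarily the authors'), writing V_α(i) for the optimal α-discounted value from state i, taking state 0 as the distinguished state and d as the single stationary policy, the two structural conditions quoted above amount roughly to:

```latex
\[
V_{\alpha}(0) \;=\; \min_{i} V_{\alpha}(i)\quad\text{for all }\alpha\in(0,1),
\qquad
\mathbb{E}^{\,d}_{i}\Big[\sum_{t=0}^{T_{0}-1} c\bigl(x_t, d(x_t)\bigr)\Big] \;<\; \infty
\quad\text{for all }i,
\]
```

where T_0 is the first passage time to state 0 under d.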


DOI
01 Jan 1989
TL;DR: The thesis develops methods to solve discrete-time finite-state partially observable Markov decision processes and proves that the policy improvement step in the iterative discretization procedure can be replaced by the approximation version of the linear support algorithm.
Abstract: The thesis develops methods to solve discrete-time finite-state partially observable Markov decision processes. For the infinite horizon problem, only the discounted reward case is considered. For the finite horizon problem, two new algorithms are developed. The first algorithm is called the relaxed region algorithm. For each support in the value function, this algorithm determines a region not smaller than its support region and modifies it implicitly in later steps until the exact support region is found. The second algorithm, called the linear support algorithm, systematically approximates the value function until all supports in the value function are found. The most important feature of this algorithm is that it can be modified to find an approximate value function. It has been shown that these two algorithms are more efficient than the one-pass algorithm. For the infinite horizon problem, it is first shown that the approximation version of the linear support algorithm can be used in place of the policy improvement step in a standard successive approximation method to obtain an $\epsilon$-optimal value function. Next, an iterative discretization procedure is developed which uses a small number of states to find new supports and improve the value function between two policy improvement steps. Since only a finite number of states are chosen in this process, some techniques developed for finite MDPs can be applied here. Finally, we prove that the policy improvement step in the iterative discretization procedure can be replaced by the approximation version of the linear support algorithm. The last part of the thesis deals with problems with continuous signals. We first show that if the signal processes are uniformly distributed, then the problem can be reformulated as a problem with a finite number of signals. Then the result is extended to the case where the signal processes are step functions. Since step functions can easily be used to approximate most probability distributions, this method can be used to approximate most problems with continuous signals. Finally, we present some conditions which guarantee that the linear support can be computed for any given state; the methods developed for the finite-signal case can then be easily modified and applied to problems for which these conditions hold.

173 citations
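As a small illustration of the object both new algorithms manipulate (this is not a reconstruction of the thesis algorithms themselves): a finite-horizon POMDP value function is piecewise linear and convex over belief states, i.e. the upper envelope of a finite set of linear supports (alpha-vectors), and evaluating a belief against that set is the basic operation. All names and numbers below are invented.

```python
import numpy as np

def value(belief, supports):
    """belief: length-n probability vector; supports: list of length-n alpha-vectors."""
    scores = [float(np.dot(alpha, belief)) for alpha in supports]
    best = int(np.argmax(scores))            # support attaining the upper envelope
    return scores[best], best

# Hypothetical 2-state example with three supports (all numbers invented).
supports = [np.array([1.0, 0.0]), np.array([0.2, 0.9]), np.array([0.6, 0.6])]
val, idx = value(np.array([0.4, 0.6]), supports)
print(f"value = {val:.2f}, attained by support {idx}")
```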


Journal ArticleDOI
TL;DR: It is shown that a round-robin type policy is optimal, and the same is conjectured for a steering policy that depends on the entire past history of the process, but whose implementation requires essentially no more storage than that of a pure policy.
Abstract: The Markov decision problem of locating a policy to maximize the long-run average reward subject to K long-run average cost constraints is considered. It is assumed that the state and action spaces are finite and the law of motion is unichain, that is, every pure policy gives rise to a Markov chain with one recurrent class. It is first proved that there exists an optimal stationary policy with a degree of randomization no greater than K; consequently, it is never necessary to randomize in more than K states. A linear program produces the optimal policy with limited randomization. For the special case of a single constraint, we also address the problem of finding optimal nonrandomized, but nonstationary, policies. We show that a round-robin type policy is optimal, and conjecture the same for a steering policy that depends on the entire past history of the process, but whose implementation requires essentially no more storage than that of a pure policy.

164 citations
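The linear program referred to is the standard one over long-run state-action frequencies; in generic notation (reward r, cost functions c_k, cost bounds γ_k, all symbols assumed here rather than taken from the paper):

```latex
\[
\max_{x\ge 0}\ \sum_{s,a} r(s,a)\,x_{sa}
\quad\text{s.t.}\quad
\sum_{a} x_{ja}-\sum_{s,a} P(j\mid s,a)\,x_{sa}=0 \ \ \forall j,
\qquad
\sum_{s,a} x_{sa}=1,
\]
\[
\sum_{s,a} c_{k}(s,a)\,x_{sa}\;\le\;\gamma_{k},\qquad k=1,\dots,K,
\]
```

with a stationary policy recovered as π(a|s) = x_{sa} / Σ_{a'} x_{sa'}; a basic optimal solution of this program randomizes in only a limited number of states, which is the sense of the "no more than K states" result quoted above.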


Journal ArticleDOI
TL;DR: This work considers a Markov decision process with both the expected limiting average, and the discounted total return criteria, appropriately modified to include a penalty for the variability in the stream of rewards.
Abstract: We consider a Markov decision process with both the expected limiting average, and the discounted total return criteria, appropriately modified to include a penalty for the variability in the stream of rewards. In both cases we formulate appropriate nonlinear programs in the space of state-action frequencies (averaged or discounted) whose optimal solutions are shown to be related to the optimal policies in the corresponding “variance-penalized MDP.” The analysis of one of the discounted cases is facilitated by the introduction of a “Cartesian product of two independent MDPs.”

164 citations
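One common way of writing such a variance-penalized criterion (the paper's exact definition may differ) is, with x the vector of averaged or discounted state-action frequencies and λ > 0 a penalty weight:

```latex
\[
\max_{\pi}\ \ \phi(\pi)\;-\;\lambda\,V(\pi),
\qquad
\phi(\pi)=\sum_{s,a} r(s,a)\,x_{sa}(\pi),
\qquad
V(\pi)=\sum_{s,a}\bigl(r(s,a)-\phi(\pi)\bigr)^{2}\,x_{sa}(\pi),
\]
```

which is a quadratic nonlinear program in the frequencies x, matching the formulation described in the abstract.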


Journal ArticleDOI
TL;DR: This paper first derives instantaneous and cumulative measures of Markov and Markov reward model behavior, and then compares the complexity of several competing algorithms for the computation of these measures.

161 citations


Journal ArticleDOI
TL;DR: Both linear programming and value-iteration MDP algorithms are coupled with a novel state descriptor in order to locate the optimal policy for reasonable-size problems (several T1 carriers in parallel for the access-port case, and small networks of T1 carriers for the network-access case).
Abstract: The problem of determining optimal access policies for circuit-switched networks that support traffic types with varying bandwidth requirements is addressed. The authors suppose that the network supports K classes of calls where each class is determined by a fixed route and a bandwidth requirement. A Markov decision process (MDP) approach is used to obtain optimal access policies for three models: the flexible scheme access-port model where a single link is shared; the contiguous scheme access-port model where wideband calls are required to occupy specific contiguous regions of the TDM frame; and the network-access model where a call holds several channels in different links simultaneously. Both linear programming and value-iteration MDP algorithms are coupled with a novel state descriptor in order to locate the optimal policy for reasonable-size problems (several T1 carriers in parallel for the access-port case, and small networks of T1 carriers for the network-access case).

153 citations


Journal ArticleDOI
01 Jan 1989-Leonardo
TL;DR: A survey of Markov-based efforts in automated composition with a tutorial demonstrating how various theoretical properties associated with Markov processes can be put to practical use, and a contrast with alternative compositional strategies.
Abstract: The author combines a survey of Markov-based efforts in automated composition with a tutorial demonstrating how various theoretical properties associated with Markov processes can be put to practical use. The historical background is traced from A. A. Markov’s original formulation through to the present. A digression into Markov-chain theory introduces ‘waiting counts’ and ‘stationary probabilities’. The author’s Demonstration 4 for solo clarinet illustrates how these properties affect the behavior of a melody composed using Markov chains. This simple example becomes a point of departure for increasingly general interpretations of the Markov process. The interpretation of ‘states’ is reevaluated in the light of recent musical efforts that employ Markov chains of higher-level objects and in the light of other efforts that incorporate relative attributes into the possible interpretations. Other efforts expand Markov’s original definition to embrace ‘Nth-order’ transitions, evolving transition matrices and chains of chains. The remainder of this article contrasts Markov processes with alternative compositional strategies.

130 citations
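The article's "waiting counts" and "stationary probabilities" are properties of ordinary first-order Markov chains, so a toy sketch suffices to show the mechanics; the notes and transition probabilities below are invented, not taken from the article or from Demonstration 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy first-order Markov melody generator (all data invented).
notes = ["C", "D", "E", "G"]
P = np.array([      # P[i, j] = probability of moving from notes[i] to notes[j]
    [0.1, 0.5, 0.3, 0.1],
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.3, 0.1, 0.4],
    [0.4, 0.2, 0.3, 0.1],
])

# Stationary probabilities (left eigenvector of P for eigenvalue 1) give the
# long-run frequency of each note, one of the properties the tutorial exploits.
eigvals, eigvecs = np.linalg.eig(P.T)
stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stat /= stat.sum()

state = 0
melody = [notes[state]]
for _ in range(15):
    state = rng.choice(len(notes), p=P[state])   # sample the next note
    melody.append(notes[state])
print(" ".join(melody))
print({n: round(float(p), 3) for n, p in zip(notes, stat)})
```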


Journal ArticleDOI
TL;DR: This paper explores the economics of investing in gradual process improvement, a key component, with empirically supported importance, of the well known Just-in-Time and Total Quality Control philosophies, and formulates a Markov decision process, which is applied to the problem of setup reduction and process quality improvement.
Abstract: This paper explores the economics of investing in gradual process improvement, a key component, with empirically supported importance, of the well known Just-in-Time and Total Quality Control philosophies. We formulate a Markov decision process, analyze it, and apply it to the problem of setup reduction and process quality improvement. Instead of a one-time investment opportunity for a large predictable technological advance, we allow many smaller investments over time, with potential process improvements of random magnitude. We use a somewhat nonstandard formulation of the immediate return, which facilitates the derivation of results. The policy that simply maximizes the immediate return, called the last chance policy, provides an upper bound on the optimal investment amount. Furthermore, if the last chance policy invests in process improvement, then so does the optimal policy. Each continues investing until a shared target state is attained. We derive fairly restrictive conditions that must be met for the policy of investing forever in process improvements to be optimal. Decreasing the uncertainty of the process (making the potential improvements more predictable) has a desirable effect: the total return is increased and the target state increases, so the ultimate system is more productive. Numerical examples are presented and analyzed.

128 citations


Journal Article
TL;DR: In this paper, a bridge service life prediction model using a Markov chain was developed to reflect the stochastic nature of bridge condition and service life, and a comparison of service life predictions by the statistical and Markov chain approaches was made.
Abstract: This paper describes the application of Markov chain technique in estimating bridge service life. The change of bridge conditions is a stochastic process and, therefore, the service life of bridges is related to the probabilities of condition transitions. A bridge service life prediction model, using the Markov chain, was developed to reflect the stochastic nature of bridge condition and service life. The paper includes a discussion on the concept of Markov chain, the development and application of the service life prediction model using the Markov chain, and the comparison of service life predictions by statistical and Markov chain approaches.

94 citations
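To show the kind of calculation a Markov-chain deterioration model enables (this is an illustrative sketch only; the transition probabilities below are invented, not the paper's data): treat the worst condition rating as absorbing, and read expected service life off the fundamental matrix of the absorbing chain.

```python
import numpy as np

# Condition ratings 4 (good) .. 1 (poor); state 1 is absorbing.
# Expected service life = expected number of yearly transitions before
# absorption, obtained from the fundamental matrix N = (I - Q)^(-1).
P = np.array([
    [0.85, 0.15, 0.00, 0.00],   # state 4
    [0.00, 0.80, 0.20, 0.00],   # state 3
    [0.00, 0.00, 0.75, 0.25],   # state 2
    [0.00, 0.00, 0.00, 1.00],   # state 1 (absorbing)
])
Q = P[:3, :3]                               # transitions among transient states
N = np.linalg.inv(np.eye(3) - Q)            # fundamental matrix
expected_life = N.sum(axis=1)               # expected steps to absorption
print(dict(zip(["from 4", "from 3", "from 2"], np.round(expected_life, 1))))
```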


Journal ArticleDOI
TL;DR: Assuming that a policy exists that meets the sample-path constraint, it is established that there exist nearly optimal stationary policies for communicating MDPs, and a parametric linear programming algorithm is given to construct nearly optimal stationary policies.
Abstract: We consider time-average Markov decision processes (MDPs), which accumulate a reward and cost at each decision epoch. A policy meets the sample-path constraint if the time-average cost is below a specified value with probability one. The optimization problem is to maximize the expected average reward over all policies that meet the sample-path constraint. The sample-path constraint is compared with the more commonly studied constraint of requiring the average expected cost to be less than a specified value. Although the two criteria are equivalent for certain classes of MDPs, their feasible and optimal policies differ for many nontrivial problems. In general, there do not exist optimal or nearly optimal stationary policies when the expected average-cost constraint is employed. Assuming that a policy exists that meets the sample-path constraint, we establish that there exist nearly optimal stationary policies for communicating MDPs. A parametric linear programming algorithm is given to construct nearly optimal stationary policies. The discussion relies on well known results from the theory of stochastic processes and linear programming. The techniques lead to simple proofs of the existence of optimal and nearly optimal stationary policies for unichain and deterministic MDPs, respectively.
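The contrast drawn in the abstract is between an almost-sure and an in-expectation requirement; in generic notation (cost c, bound γ, symbols assumed here), the two constraints read:

```latex
\[
\text{sample-path constraint:}\quad
\Pr_{\pi}\!\Big(\limsup_{T\to\infty}\tfrac{1}{T}\textstyle\sum_{t=0}^{T-1} c(x_t,a_t)\le \gamma\Big)=1,
\]
\[
\text{expected-cost constraint:}\quad
\limsup_{T\to\infty}\tfrac{1}{T}\,\mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{T-1} c(x_t,a_t)\Big]\le \gamma .
\]
```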

Journal ArticleDOI
TL;DR: Three algorithms to solve the infinite horizon, expected discounted total reward partially observed Markov decision process (POMDP), each integrating a successive approximations algorithm with an appropriately generalized numerical technique that has been shown to reduce CPU time until convergence for the completely observed case.
Abstract: We present three algorithms to solve the infinite horizon, expected discounted total reward partially observed Markov decision process (POMDP). Each algorithm integrates a successive approximations algorithm for the POMDP due to R. Smallwood and E. Sondik with an appropriately generalized numerical technique that has been shown to reduce CPU time until convergence for the completely observed case. The first technique is reward revision. The second technique is reward revision integrated with modified policy iteration. The third is a standard extrapolation. A numerical study indicates the potentially significant computational value of these algorithms.

Journal ArticleDOI
TL;DR: In this article, the existence of an expected average cost optimal stationary policy is proved for infinite state semi-Markov decision processes with nonnegative, unbounded costs and finite action sets.
Abstract: Semi-Markov decision processes underlie the control of many queueing systems. In this paper, we deal with infinite state semi-Markov decision processes with nonnegative, unbounded costs and finite action sets. Axioms for the existence of an expected average cost optimal stationary policy are presented. These conditions generalize the work in Sennott [22] for Markov decision processes. Verifiable conditions for the axioms to hold are obtained. The theory is applied to control of the M/G/1 queue with variable service parameter, with on-off server, and with batch processing, and to control of the G/M/m queue with variable arrival parameter and customer rejection. It is applied to a timesharing network of queues with a single server and finally to optimal routing of Poisson arrivals to parallel exponential servers. The final section extends the existence result to compact action spaces.

Journal ArticleDOI
TL;DR: In this article, the long run average cost control problem for discrete time Markov chains on a countable state space is studied in a very general framework and necessary and sufficient conditions for optimality in terms of the dynamic programming equations are given when an optimal stable stationary strategy is known to exist.
Abstract: The long-run average cost control problem for discrete time Markov chains on a countable state space is studied in a very general framework. Necessary and sufficient conditions for optimality in terms of the dynamic programming equations are given when an optimal stable stationary strategy is known to exist (e.g., for the situations studied in [Stochastic Differential Systems, Stochastic Control Theory and Applications, IMA Vol. Math. App. 10, Springer-Verlag, New York, Berlin, 1988, pp. 57–77]). A characterization of the desired solution of the dynamic programming equations is given in a special case. Also included is a novel convex analytic argument for deducing the existence of an optimal stable stationary strategy when that of a randomized one is known.
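The dynamic programming equations in question are the average-cost optimality equations; in standard notation (cost c, transition probabilities p_ij(a), optimal average cost ρ, relative value function h), they read:

```latex
\[
\rho + h(i) \;=\; \min_{a\in A(i)}\Big\{\,c(i,a)+\sum_{j} p_{ij}(a)\,h(j)\Big\},
\qquad i\in S .
\]
```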

Proceedings ArticleDOI
01 Oct 1989
TL;DR: This paper discusses likelihood ratio derivative estimation techniques for stochastic systems and presents estimators for time homogeneous discrete-time Markov chains, semi-Markov processes, non-time homogeneous continuous- time Markov Chains, and generalized semi- Markov processes.
Abstract: This paper discusses likelihood-ratio-derivative estimation techniques for stochastic systems. After a brief review of the basic concepts, likelihood-ratio-derivative estimators are presented for the following classes of stochastic processes: time-homogeneous discrete-time Markov chains, non-time-homogeneous discrete-time Markov chains, time-homogeneous continuous-time Markov chains, semi-Markov processes, non-time-homogeneous continuous-time Markov chains, and generalized semi-Markov processes.
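For the simplest of the listed classes, a time-homogeneous discrete-time Markov chain, the likelihood-ratio (score-function) derivative estimator multiplies the performance by the accumulated score of the sampled path. The sketch below is a generic illustration of that idea, not the paper's estimators; the chain, reward, and parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def P(theta):
    """Two-state transition matrix; only row 0 depends on theta (invented model)."""
    return np.array([[theta, 1 - theta],
                     [0.3,   0.7     ]])

def dlogP(theta, i, j):
    """d/d theta of log P(theta)[i, j]."""
    if i != 0:
        return 0.0
    return 1.0 / theta if j == 0 else -1.0 / (1 - theta)

reward = np.array([1.0, 0.0])            # reward 1 in state 0, 0 in state 1
theta, horizon, n_paths = 0.6, 20, 20000

estimates = []
for _ in range(n_paths):
    x, total, score = 0, 0.0, 0.0
    for _ in range(horizon):
        nxt = rng.choice(2, p=P(theta)[x])
        score += dlogP(theta, x, nxt)    # accumulate the score along the path
        x = nxt
        total += reward[x]
    estimates.append(total * score)      # LR estimate of d/d theta E[total reward]
print("estimated derivative:", np.mean(estimates))
```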

Journal ArticleDOI
TL;DR: The problem of converting a labor-intensive batch production process to one that incorporates flexible automation as a finite-state Markov decision process is formulated and the qualitative characteristics of optimal strategies for acquiring flexible automation are illustrated.
Abstract: We formulate the problem of converting a labor-intensive batch production process to one that incorporates flexible automation as a finite-state Markov decision process. Interest rates and the level of automated technology influence both operating and acquisition costs and are treated as random variables. The model specifies the optimal level of capacity to convert to flexible automation. The optimization criterion is the minimization of the sum of expected, discounted costs incurred over a finite planning horizon. The optimal acquisition strategy depends upon the time period, the current interest rate, the current level of technology, and a measure of the remaining capacity that is not automated. We investigate the structure of optimal acquisition strategies using mathematical analysis and simulation. Our objective is to illustrate the qualitative characteristics of optimal strategies for acquiring flexible automation. As a step toward the implementation of the model, we examine the qualitative consequen...

Journal ArticleDOI
TL;DR: In this article, the authors considered the average-cost Markov decision process with compact state and action spaces and bounded lower semicontinuous cost functions, and proved that under additional weak conditions there exists an optimal stationary policy in the usual sense.
Abstract: This paper studies the average-cost Markov decision process with compact state and action spaces and bounded lower semicontinuous cost functions. Following the idea of Borkar’s excellent papers [SIAM J. Control Optim., 21 (1983), pp. 652–666; 22 (1984), pp. 965–978], the general case where irreducibility is not assumed is considered under the hypothesis of Doeblin and the existence of a minimum pair of state and policy, which attains the infimum of the average expected cost over all initial states and policies, is established. Further, it is proved that under additional weak conditions there exists an optimal stationary policy in the usual sense.

Journal ArticleDOI
TL;DR: The purpose of this paper is to quantitatively formulate the problem of controlling resources in a distributed system so as to optimize a reward function, and to derive optimal control strategies using Markov decision theory.
Abstract: The authors quantitatively formulate the problem of controlling resources in a distributed system so as to optimize a reward function and derive optimal control strategies using Markov decision theory. The control variables treated are quite general; they could be control decisions related to system configuration, repair, diagnostics, files, or data. Two algorithms for resource control in distributed systems are derived for time-invariant and periodic environments, respectively. A detailed example to demonstrate the power and usefulness of the approach is provided.

Proceedings ArticleDOI
13 Dec 1989
TL;DR: Algorithms for adaptive control of unknown finite Markov chains are proposed and are easy to implement and converge to the optimal policy in finite time.
Abstract: Algorithms for adaptive control of unknown finite Markov chains are proposed. The algorithms consist of two parts: part one estimates the unknown parameters; part two computes the optimal policy. In this study the emphasis is on efficient online computation of the optimal policy. No a priori knowledge of the optimal policy is assumed. The optimal policy is computed recursively online. At each step a small amount of computation is required. At each transition of the chain, only the act corresponding to the present state of the chain is updated. The algorithms are easy to implement and converge to the optimal policy in finite time.

Journal ArticleDOI
TL;DR: A procedure for identifying forecast horizons in nonhomogeneous Markov decision processes, based on convergence results for relative value functions, is developed and a closed form expression for computing sufficiently long horizons to guarantee epsilon optimality is presented.
Abstract: A procedure for identifying forecast horizons in nonhomogeneous Markov decision processes, based on convergence results for relative value functions, is developed. Two different algorithmic implementations of this procedure are discussed, and a closed form expression for computing sufficiently long horizons to guarantee epsilon optimality is presented.

Book
01 Jan 1989
TL;DR: Time Series and Econometric Models: Examples.
Abstract: Deterministic Models and Their Control Problems. Stochastic Models. Stochastic Control Problems. Time Series and Econometric Models: Examples. Estimation. Convergence Questions. Adaptive Control Systems and Bayesian Optimal Control Problems. Linear Rational Expectations Models. Approximations in Sequential Decision Processes. References. Appendix: Markov Processes.

Journal ArticleDOI
TL;DR: In this article, an average-reward Markov decision process (MDP) with discrete-time parameter, denumerable state space, and bounded reward function is considered, and necessary and sufficient conditions are given so that the optimality equations have a bounded solution with an additional property.
Abstract: An average-reward Markov decision process (MDP) with discrete-time parameter, denumerable state space, and bounded reward function is considered. With such a model, we associate a family of MDPs. Then, we determine necessary conditions for the existence of a bounded solution to the optimality equation for each one of the models in the family. Moreover, necessary and sufficient conditions are given so that the optimality equations have a bounded solution with an additional property.

Proceedings ArticleDOI
13 Dec 1989
TL;DR: A state-dependent routing policy for a multi-service circuit-switched network is synthesized and it is shown that the proposed model provides good traffic efficiency and automatic flow control, and that by means of the call reward parameters one can almost independently control the grade of service of each call class.
Abstract: A state-dependent routing policy for a multi-service circuit-switched network is synthesized. To meet different requirements, the objective function is defined as the mean value of reward from the network. The theory of Markov decision processes is applied to find the optimal routing policy. It is shown that under the link independence assumption the problem can be decomposed into a set of link analysis problems. In this approach the optimal decision is a function of state-dependent link shadow prices, which are interpreted as prices for using each link from the path. The approach is implementable even for large systems if certain approximations are used. It is shown that the proposed model provides good traffic efficiency and automatic flow control, and that by means of the call reward parameters one can almost independently control the grade of service of each call class.

Journal ArticleDOI
TL;DR: In this article, a self-contained approach based on the Drazin generalized inverse is used to derive many basic results in discrete time, finite state Markov decision processes, including the average reward evaluation equations, Laurent series expansions, as well as the finite test for Blackwell optimality.
Abstract: A new self-contained approach based on the Drazin generalized inverse is used to derive many basic results in discrete time, finite state Markov decision processes. A product form representation for the transition matrix of a stationary policy gives new derivations of the average reward evaluation equations, Laurent series expansions, as well as the finite test for Blackwell optimality. This representation also suggests new computational methods.
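For reference, the average reward evaluation equations mentioned in the abstract are, in standard notation for a stationary policy with transition matrix P and reward vector r (gain g, bias h); the Drazin-inverse product-form representation of the policy's transition matrix gives one route to solving this system, but the display below is the standard form, not the paper's specific derivation.

```latex
\[
(I-P)\,g \;=\; 0,
\qquad
g + (I-P)\,h \;=\; r .
\]
```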

Proceedings ArticleDOI
13 Dec 1989
TL;DR: In this paper, the authors considered partially observable Markov decision processes with finite or countable state and observation spaces and finite control space, and provided sufficient conditions for a bounded solution to the average-cost optimality equation to exist.
Abstract: Consideration is given to partially observable Markov decision processes with finite or countable (core) state and observation spaces and finite control space. Following a standard approach, an equivalent completely observed problem is formulated, with the same finite control space but with an uncountable state space, namely, the space of probability distributions of the original core state space. It is observed that some characteristics induced in the original problem due to the finiteness, or countability, of the spaces involved are retained by the equivalent problem. Sufficient conditions are derived for a bounded solution to the average-cost optimality equation to exist. These results are illustrated in the context of machine replacement problems. Structural properties for average-cost policies are obtained for a two-state replacement problem and are similar to available results for discount optimal policies. The set of assumptions used seems to be significantly less restrictive than others currently available.
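The state of the equivalent completely observed problem is the belief (posterior distribution) over core states; in generic notation (core-state transition probabilities P(j|i,a), observation probabilities O(y|j,a), current belief b — symbols assumed here), the belief after applying control a and observing y is:

```latex
\[
b'(j) \;=\;
\frac{O(y\mid j,a)\,\sum_{i} P(j\mid i,a)\,b(i)}
     {\sum_{j'} O(y\mid j',a)\,\sum_{i} P(j'\mid i,a)\,b(i)}\,,
\]
```

so the equivalent problem is a Markov decision process on the (uncountable) simplex of such distributions, as stated in the abstract.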

Journal ArticleDOI
TL;DR: Optimality problems in infinite horizon, discrete time, vector criterion Markov and semi-Markov decision processes are expressed as standard problems of multi-objective linear programming as discussed by the authors, and methods for solving these problems are overviewed and simple numerical examples are given.
Abstract: Optimality problems in infinite horizon, discrete time, vector criterion Markov and semi-Markov decision processes are expressed as standard problems of multiobjective linear programming. Processes with discounting, absorbing processes and completely ergodic processes without discounting are investigated. The common properties and special structure of derived multiobjective linear programming problems are overviewed. Computational simplicities associated with these problems in comparison with general multiobjective linear programming problems are discussed. Methods for solving these problems are overviewed and simple numerical examples are given.

Proceedings ArticleDOI
13 Dec 1989
TL;DR: In this paper, it was shown that if each block is controlled by only one agent, then it is possible to obtain policies arbitrarily close to the optimal control policy by making use of the fact that the coupling between the blocks is weak.
Abstract: For Markov chains controlled by a team of agents there is no generally applicable method for obtaining the optimal control policy if the delay in information sharing between the agents is more than one step. The authors consider such a problem for a Markov chain whose transition probability matrix consists of blocks, with the coupling between the blocks being on the order of epsilon, where epsilon is a small parameter. It is shown that if each block is controlled by only one agent, then it is possible to obtain policies arbitrarily close to the optimal control policy by making use of the fact that the coupling between the blocks is weak. The authors present a complete set of results for the finite-horizon case and discuss possible extensions to the infinite-horizon case.

Journal ArticleDOI
TL;DR: In this paper, the authors present a general model for the formulation and solution of the risk-sensitive dynamic decision problem that maximizes the certain equivalent of the discounted rewards of a time-varying Markov decision process.
Abstract: This paper presents a general model for the formulation and solution of the risk‐sensitive dynamic decision problem that maximizes the certain equivalent of the discounted rewards of a time‐varying Markov decision process. The problem is solved by applying the principle of optimality and stochastic dynamic programming to the immediate rewards and the certain equivalent associated with the remaining transitions of a time‐varying Markov process over a finite or infinite time horizon, under the assumptions of constant risk aversion and discounting of future cash flows. The solution provides transient and stationary optimal decision policies that depend on the presence or absence of discounting. The construction equipment replacement problem serves as an example application of the model to illustrate the solution methodology and the sensitivity of the optimal policy to the discount factor and the degree of risk aversion.
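Under constant risk aversion, the certain equivalent is typically the exponential-utility one; the display below is a sketch of the kind of recursion the abstract describes (risk-aversion coefficient γ, discount factor β, and the value functions v_t are symbols assumed here, not the paper's own notation, and the exact interaction of discounting with the certain equivalent may differ in the paper).

```latex
\[
\operatorname{CE}_{\gamma}(X) \;=\; -\tfrac{1}{\gamma}\,\ln \mathbb{E}\bigl[e^{-\gamma X}\bigr],
\qquad
v_t(i) \;=\; \max_{a}\Big\{\, r_t(i,a) \;+\; \beta\,\operatorname{CE}_{\gamma}\!\bigl(v_{t+1}(X_{t+1})\mid X_t=i,\,a\bigr)\Big\}.
\]
```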

Journal ArticleDOI
TL;DR: In this article, the authors extend the concept of decision and forecast horizons from classes of stationary to classes of nonstationary Markov decision problems and obtain the horizons explicitly for a family of inventory models.
Abstract: The paper extends the concept of decision and forecast horizons from classes of stationary to classes of nonstationary Markov decision problems. The horizons are explicitly obtained for a family of inventory models. The family is indexed by nonstationary Markov chains and deterministic sequences. For the proof, only reference to earlier work on the stationary case is made.

Proceedings ArticleDOI
13 Dec 1989
TL;DR: In this paper, a stationary policy maximizes one of these criteria, namely, the expected long-run average variability, and an algorithm that produces such an optimal stationary policy is given.
Abstract: Time-average Markov decision processes with finite state and action spaces are considered. Several definitions of variability are introduced and compared. It is shown that a stationary policy maximizes one of these criteria, namely, the expected long-run average variability. An algorithm that produces such an optimal stationary policy is given.