
Showing papers on "Markov decision process published in 1994"


Book
15 Apr 1994
TL;DR: Puterman provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models, focusing primarily on infinite horizon discrete time models and models with discrete state spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous-time discrete state models.
Abstract: From the Publisher: The past decade has seen considerable theoretical and applied research on Markov decision processes, as well as the growing use of these models in ecology, economics, communications engineering, and other fields where outcomes are uncertain and sequential decision-making processes are needed. A timely response to this increased activity, Martin L. Puterman's new work provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models. It discusses all major research directions in the field, highlights many significant applications of Markov decision process models, and explores numerous important topics that have previously been neglected or given cursory coverage in the literature. Markov Decision Processes focuses primarily on infinite horizon discrete time models and models with discrete state spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous-time discrete state models. The book is organized around optimality criteria, using a common framework centered on the optimality (Bellman) equation for presenting results. The results are presented in a "theorem-proof" format and elaborated on through both discussion and examples, including results that are not available in any other book. A two-state Markov decision process model, presented in Chapter 3, is analyzed repeatedly throughout the book and demonstrates many results and algorithms. Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria. It also explores several topics that have received little or no attention in other books, including modified policy iteration, multichain models with average reward criterion, and sensitive optimality. In addition, a Bibliographic Remarks section in each chapter comments on relevant historical developments.

11,625 citations


MonographDOI
TL;DR: Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.
Abstract: From the Publisher: The past decade has seen considerable theoretical and applied research on Markov decision processes, as well as the growing use of these models in ecology, economics, communications engineering, and other fields where outcomes are uncertain and sequential decision-making processes are needed. A timely response to this increased activity, Martin L. Puterman's new work provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models. It discusses all major research directions in the field, highlights many significant applications of Markov decision process models, and explores numerous important topics that have previously been neglected or given cursory coverage in the literature. Markov Decision Processes focuses primarily on infinite horizon discrete time models and models with discrete state spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous-time discrete state models. The book is organized around optimality criteria, using a common framework centered on the optimality (Bellman) equation for presenting results. The results are presented in a "theorem-proof" format and elaborated on through both discussion and examples, including results that are not available in any other book. A two-state Markov decision process model, presented in Chapter 3, is analyzed repeatedly throughout the book and demonstrates many results and algorithms. Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria. It also explores several topics that have received little or no attention in other books, including modified policy iteration, multichain models with average reward criterion, and sensitive optimality. In addition, a Bibliographic Remarks section in each chapter comments on relevant historical developments.

5,188 citations


Book ChapterDOI
10 Jul 1994
TL;DR: A Q-learning-like algorithm for finding optimal policies in Markov games is described, and its application is demonstrated on a simple two-player game in which the optimal policy is probabilistic.
Abstract: In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.
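A hedged sketch of the update behind such a Q-learning-like algorithm for two-player zero-sum Markov games (generic notation, not necessarily the paper's): the agent's action is a, the opponent's is o, and the state value is the value of a matrix game over the current Q estimates, which is why the learned policy can be probabilistic.

```latex
Q(s,a,o) \leftarrow (1-\alpha)\,Q(s,a,o) + \alpha\bigl(r + \gamma\,V(s')\bigr),
\qquad
V(s) = \max_{\pi \in \Delta(A)}\ \min_{o \in O}\ \sum_{a \in A} \pi(a)\,Q(s,a,o).
```

The inner max-min is a small linear program solved at each visited state, and the greedy stochastic policy is the maximizing \pi.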

2,643 citations


Journal ArticleDOI
TL;DR: The Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, is studied to establish its convergence under conditions more general than previously available.
Abstract: We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.
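As a concrete illustration of the algorithm whose convergence is analyzed, here is a minimal tabular Q-learning step; the function name and array layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One asynchronous tabular Q-learning update on the array Q[state, action].

    Convergence results of the kind discussed above typically require step sizes
    alpha_t(s, a) with sum(alpha_t) = infinity and sum(alpha_t**2) < infinity,
    and that every state-action pair is updated infinitely often.
    """
    td_target = r + gamma * np.max(Q[s_next])   # sampled one-step Bellman backup
    Q[s, a] += alpha * (td_target - Q[s, a])    # stochastic approximation step
    return Q
```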

747 citations


Book ChapterDOI
TL;DR: This chapter summarizes the ability of the models to track the shift in departure rates induced by the 1982 window plan, with all forecasts based on utility function parameters estimated from data prior to 1982.
Abstract: This chapter summarizes the ability of the models to track the shift in departure rates induced by the 1982 window plan. All forecasts were based on the estimated utility function parameters using data prior to 1982. Using these parameters, predictions were generated from all four models after incorporating the extra bonus provisions of the window plan. The structural models were generally able to accurately predict the large increase in departure rates induced by the window plan, although once again none of the models was able to capture the peak in departure rates at age 65. On the other hand, the reduced-form probit model predicted that the window plan had essentially no effect on departure rates. Other reduced-form specifications greatly overpredicted departure rates under the window plan.

607 citations


Book
01 Jan 1994
TL;DR: An operations research textbook covering model building, linear algebra, linear and integer programming (including the Gauss-Jordan method for solving systems of linear equations and the branch-and-bound method for integer programs), nonlinear programming, decision making under uncertainty, game theory, inventory models, Markov chains, dynamic programming, queueing theory, simulation, and forecasting.
Abstract: 1. INTRODUCTION TO MODEL BUILDING. An Introduction to Modeling. The Seven-Step Model-Building Process. Examples. 2. BASIC LINEAR ALGEBRA. Matrices and Vectors. Matrices and Systems of Linear Equations. The Gauss-Jordan Method for Solving Systems of Linear Equations. Linear Independence and Linear Dependence. The Inverse of a Matrix. Determinants. 3. INTRODUCTION TO LINEAR PROGRAMMING. What is a Linear Programming Problem? The Graphical Solution of Two-Variable Linear Programming Problems. Special Cases. A Diet Problem. A Work-Scheduling Problem. A Capital Budgeting Problem. Short-term Financial Planning. Blending Problems. Production Process Models. Using Linear Programming to Solve Multiperiod Decision Problems: An Inventory Model. Multiperiod Financial Models. Multiperiod Work Scheduling. 4. THE SIMPLEX ALGORITHM AND GOAL PROGRAMMING. How to Convert an LP to Standard Form. Preview of the Simplex Algorithm. The Simplex Algorithm. Using the Simplex Algorithm to Solve Minimization Problems. Alternative Optimal Solutions. Unbounded LPs. The LINDO Computer Package. Matrix Generators, LINGO, and Scaling of LPs. Degeneracy and the Convergence of the Simplex Algorithm. The Big M Method. The Two-Phase Simplex Method. Unrestricted-in-Sign Variables. Karmarkar's Method for Solving LPs. Multiattribute Decision-Making in the Absence of Uncertainty: Goal Programming. Solving LPs with Spreadsheets. 5. SENSITIVITY ANALYSIS: AN APPLIED APPROACH. A Graphical Introduction to Sensitivity Analysis. The Computer and Sensitivity Analysis. Managerial Use of Shadow Prices. What Happens to the Optimal z-value if the Current Basis is No Longer Optimal? 6. SENSITIVITY ANALYSIS AND DUALITY. A Graphical Introduction to Sensitivity Analysis. Some Important Formulas. Sensitivity Analysis. Sensitivity Analysis When More Than One Parameter is Changed: The 100% Rule. Finding the Dual of an LP. Economic Interpretation of the Dual Problem. The Dual Theorem and Its Consequences. Shadow Prices. Duality and Sensitivity Analysis. 7. TRANSPORTATION, ASSIGNMENT, AND TRANSSHIPMENT PROBLEMS. Formulating Transportation Problems. Finding Basic Feasible Solutions for Transportation Problems. The Transportation Simplex Method. Sensitivity Analysis for Transportation Problems. Assignment Problems. Transshipment Problems. 8. NETWORK MODELS. Basic Definitions. Shortest Path Problems. Maximum Flow Problems. CPM and PERT. Minimum Cost Network Flow Problems. Minimum Spanning Tree Problems. The Network Simplex Method. 9. INTEGER PROGRAMMING. Introduction to Integer Programming. Formulating Integer Programming Problems. The Branch-and-Bound Method for Solving Pure Integer Programming Problems. The Branch-and-Bound Method for Solving Mixed Integer Programming Problems. Solving Knapsack Problems by the Branch-and-Bound Method. Solving Combinatorial Optimization Problems by the Branch-and-Bound Method. Implicit Enumeration. The Cutting Plane Algorithm. 10. ADVANCED TOPICS IN LINEAR PROGRAMMING. The Revised Simplex Algorithm. The Product Form of the Inverse. Using Column Generation to Solve Large-Scale LPs. The Dantzig-Wolfe Decomposition Algorithm. The Simplex Method for Upper-Bounded Variables. Karmarkar's Method for Solving LPs. 11. NONLINEAR PROGRAMMING. Review of Differential Calculus. Introductory Concepts. Convex and Concave Functions. Solving NLPs with One Variable. Golden Section Search. Unconstrained Maximization and Minimization with Several Variables. The Method of Steepest Ascent. Lagrange Multipliers.
The Kuhn-Tucker Conditions. Quadratic Programming. Separable Programming. The Method of Feasible Directions. Pareto Optimality and Tradeoff Curves. 12. REVIEW OF CALCULUS AND PROBABILITY. Review of Integral Calculus. Differentiation of Integrals. Basic Rules of Probability. Bayes' Rule. Random Variables. Mean, Variance, and Covariance. The Normal Distribution. Z-Transforms. Review Problems. 13. DECISION MAKING UNDER UNCERTAINTY. Decision Criteria. Utility Theory. Flaws in Expected Utility Maximization: Prospect Theory and Framing Effects. Decision Trees. Bayes' Rule and Decision Trees. Decision Making with Multiple Objectives. The Analytic Hierarchy Process. Review Problems. 14. GAME THEORY. Two-Person Zero-Sum and Constant-Sum Games: Saddle Points. Two-Person Zero-Sum Games: Randomized Strategies, Domination, and Graphical Solution. Linear Programming and Zero-Sum Games. Two-Person Nonconstant-Sum Games. Introduction to n-Person Game Theory. The Core of an n-Person Game. The Shapley Value. 15. DETERMINISTIC EOQ INVENTORY MODELS. Introduction to Basic Inventory Models. The Basic Economic Order Quantity Model. Computing the Optimal Order Quantity When Quantity Discounts Are Allowed. The Continuous Rate EOQ Model. The EOQ Model with Back Orders Allowed. Multiple Product Economic Order Quantity Models. Review Problems. 16. PROBABILISTIC INVENTORY MODELS. Single Period Decision Models. The Concept of Marginal Analysis. The News Vendor Problem: Discrete Demand. The News Vendor Problem: Continuous Demand. Other One-Period Models. The EOQ with Uncertain Demand: the (r, q) and (s, S) Models. The EOQ with Uncertain Demand: the Service Level Approach to Determining Safety Stock Level. Periodic Review Policy. The ABC Inventory Classification System. Exchange Curves. Review Problems. 17. MARKOV CHAINS. What is a Stochastic Process? What is a Markov Chain? N-Step Transition Probabilities. Classification of States in a Markov Chain. Steady-State Probabilities and Mean First Passage Times. Absorbing Chains. Work-Force Planning Models. 18. DETERMINISTIC DYNAMIC PROGRAMMING. Two Puzzles. A Network Problem. An Inventory Problem. Resource Allocation Problems. Equipment Replacement Problems. Formulating Dynamic Programming Recursions. The Wagner-Whitin Algorithm and the Silver-Meal Heuristic. Forward Recursions. Using Spreadsheets to Solve Dynamic Programming Problems. Review Problems. 19. PROBABILISTIC DYNAMIC PROGRAMMING. When Current Stage Costs are Uncertain but the Next Period's State is Certain. A Probabilistic Inventory Model. How to Maximize the Probability of a Favorable Event Occurring. Further Examples of Probabilistic Dynamic Programming Formulations. Markov Decision Processes. Review Problems. 20. QUEUING THEORY. Some Queuing Terminology. Modeling Arrival and Service Processes. Birth-Death Processes. The M/M/1/GD/∞/∞ Queuing System and the Queuing Formula L = λW. The M/M/1/GD/c/∞ Queuing System. The M/M/s/GD/∞/∞ Queuing System. The M/G/∞/GD/∞/∞ and GI/G/∞/GD/∞/∞ Models. The M/G/1/GD/∞/∞ Queuing System. Finite Source Models: The Machine Repair Model. Exponential Queues in Series and Open Queuing Networks. How to Tell whether Inter-arrival Times and Service Times Are Exponential. The M/G/s/GD/s/∞ System (Blocked Customers Cleared). Closed Queuing Networks. An Approximation for the G/G/m Queuing System. Priority Queuing Models. Transient Behavior of Queuing Systems. Review Problems. 21. SIMULATION. Basic Terminology. An Example of a Discrete Event Simulation. Random Numbers and Monte Carlo Simulation.
An Example of Monte Carlo Simulation. Simulations with Continuous Random Variables. An Example of a Stochastic Simulation. Statistical Analysis in Simulations. Simulation Languages. The Simulation Process. 22. SIMULATION WITH PROCESS MODEL. Simulating an M/M/1 Queuing System. Simulating an M/M/2 System. A Series System. Simulating Open Queuing Networks. Simulating Erlang Service Times. What Else Can Process Models Do? 23. SPREADSHEET SIMULATION WITH @RISK. Introduction to @RISK: The Newsperson Problem. Modeling Cash Flows From A New Product. Bidding Models. Reliability and Warranty Modeling. Risk General Function. Risk Cumulative Function. Risktrigen Function. Creating a Distribution Based on a Point Forecast. Forecasting Income of a Major Corporation. Using Data to Obtain Inputs For New Product Simulations. Playing Craps with @RISK. Project Management. Simulating the NBA Finals. 24. FORECASTING. Moving Average Forecasting Methods. Simple Exponential Smoothing. Holt's Method: Exponential Smoothing with Trend. Winter's Method: Exponential Smoothing with Seasonality. Ad Hoc Forecasting. Simple Linear Regression. Fitting Non-Linear Relationships. Multiple Regression. Answers to Selected Problems. Index.

427 citations


Proceedings Article
01 Jan 1994
TL;DR: This work proposes and analyzes a new learning algorithm for a certain class of non-Markov decision problems; the algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy.
Abstract: Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither the algorithms nor the analyses generally continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a Monte-Carlo policy evaluation combined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy. Although the space of stochastic policies is continuous--even for a discrete action space--our algorithm is computationally tractable.

404 citations


Proceedings Article
01 Jan 1994
TL;DR: This work proposes algorithms similar to those named above, adapted to the solution of semi-Markov Decision Problems, and demonstrates these algorithms by applying them to the problem of determining the optimal control for a simple queueing system.
Abstract: Semi-Markov Decision Problems are continuous time generalizations of discrete time Markov Decision Problems. A number of reinforcement learning algorithms have been developed recently for the solution of Markov Decision Problems, based on the ideas of asynchronous dynamic programming and stochastic approximation. Among these are TD(λ), Q-learning, and Real-time Dynamic Programming. After reviewing semi-Markov Decision Problems and Bellman's optimality equation in that context, we propose algorithms similar to those named above, adapted to the solution of semi-Markov Decision Problems. We demonstrate these algorithms by applying them to the problem of determining the optimal control for a simple queueing system. We conclude with a discussion of circumstances under which these algorithms may be usefully applied.
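For continuous-time discounting at rate β, the Q-learning analogue for semi-Markov decision problems can be sketched as follows (a generic reconstruction with sojourn time τ and reward rate ρ over the sojourn, not a verbatim statement from the paper):

```latex
Q(s,a) \leftarrow Q(s,a)
  + \alpha\Bigl(\tfrac{1 - e^{-\beta\tau}}{\beta}\,\rho
  + e^{-\beta\tau}\,\max_{a'} Q(s',a') - Q(s,a)\Bigr).
```

The discounted reward accumulated over the random sojourn replaces the one-step reward, and the transition-dependent factor e^{-\beta\tau} replaces the fixed discount factor of the discrete-time case.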

328 citations


Journal ArticleDOI
TL;DR: A methodology, the Latent Markov Decision Process (LMDP), which explicitly recognizes the presence of random errors in the measurement of the condition of infrastructure facilities and minimizes the sum of inspection and M & R costs is presented.
Abstract: State-of-the-art decision-making models in the area of infrastructure maintenance and rehabilitation (which are based on the Markov Decision Process) do not take into account the uncertainty in the measurement of facility condition. This paper presents a methodology, the Latent Markov Decision Process (LMDP), which explicitly recognizes the presence of random errors in the measurement of the condition of infrastructure facilities. Two versions of the LMDP are presented. In the first version, the inspection schedule is fixed, which is the usual assumption made in state-of-the-art models. The second version of the LMDP minimizes the sum of inspection and M & R costs. An empirical comparison of the two versions of the LMDP and the traditional MDP illustrates the importance of incorporating measurement uncertainty in decision-making and of optimizing the inspection schedule.

197 citations


Book ChapterDOI
01 Jan 1994
TL;DR: The purpose of this paper is to draw the reader's attention to the problems of the expected value criterion in Markov decision processes and to give Dynamic Programming algorithms for an alternative criterion, namely the minimax criterion.
Abstract: Most Reinforcement Learning (RL) work supposes those policies to be optimal for sequential decision tasks that minimize the expected total discounted cost (e.g. Q-Learning; AHC architecture). On the other hand, it is well known that it is not always reliable and can be treacherous to use the expected value as a decision criterion. A lot of alternative decision criteria have been suggested in decision theory to obtain a more sophisticated consideration of risk, but most RL researchers have not concerned themselves with this subject until now. The purpose of this paper is to draw the reader's attention to the problems of the expected value criterion in Markov decision processes and to give Dynamic Programming algorithms for an alternative criterion, namely the minimax criterion. A counterpart to Watkins' Q-Learning with regard to the minimax criterion is presented. The new algorithm, called Q̂-learning, finds policies that minimize the worst-case total discounted cost.
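A hedged sketch of the worst-case counterpart to the Q-learning update for discounted cost c: starting from an optimistic lower bound, the estimate is only ever raised toward the worst transition observed so far,

```latex
\hat{Q}(s,a) \leftarrow \max\Bigl\{\hat{Q}(s,a),\;
  c(s,a,s') + \gamma\,\min_{a'} \hat{Q}(s',a')\Bigr\},
```

and the policy acts greedily by choosing \arg\min_a \hat{Q}(s,a). This only illustrates the shape of the minimax criterion; the paper's initialization and convergence conditions are the authoritative statement.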

188 citations


Journal ArticleDOI
TL;DR: The numerical procedures for calculating an optimal max-min strategy are based on successive approximations, reward revision, and modified policy iteration, and the bounds that are determined are at least as tight as currently available bounds for the case where the transition probabilities are precise.
Abstract: We present new numerical algorithms and bounds for the infinite horizon, discrete stage, finite state and action Markov decision process with imprecise transition probabilities. We assume that the transition probability mass vector for each state and action is described by a finite number of linear inequalities. This model of imprecision appears to be well suited for describing statistically determined confidence limits and/or natural language statements of likelihood. The numerical procedures for calculating an optimal max-min strategy are based on successive approximations, reward revision, and modified policy iteration. The bounds that are determined are at least as tight as currently available bounds for the case where the transition probabilities are precise.
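The max-min optimality equation underlying such a model can be written in generic form (a sketch, not the paper's exact notation) as

```latex
v(s) = \max_{a \in A(s)}\ \min_{p \in \mathcal{P}(s,a)}
       \Bigl[\, r(s,a) + \lambda \sum_{j} p(j)\, v(j) \Bigr],
```

where \mathcal{P}(s,a) is the polytope of transition probability vectors defined by the finitely many linear inequalities, so the inner minimization is itself a small linear program.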

Journal ArticleDOI
TL;DR: The research indicates that MDPs are a powerful and useful technique for bridge management systems, and addresses the issues of state‐space cardinality and compliance with the Markovian property.
Abstract: The typical infrastructure maintenance decision-making environment involves multiple objectives and uncertainty, and is dynamic. One of the most commonly used infrastructure models is a Markov decision process (MDP). MDP models have been applied to numerous sequential decision-making situations involving uncertainty and multiple objectives, including applications related to infrastructure problems. In this paper we explore the use of Markov models for bridge management systems. In particular, we explore two critical issues associated with the use of MDP models. The first involves state-space explosion, one of the most common problems with MDP models. We address the issues of state-space cardinality and present approaches for dealing with the complexity. The second issue with MDP models is compliance with the Markovian property. With both issues we use the Virginia bridge system and data to illustrate the concepts. Our research indicates that MDPs are a powerful and useful technique for bridge management systems.

Journal ArticleDOI
TL;DR: An upper bound on performance loss is derived that is slightly tighter than that in Bertsekas (1987), and the extension of the bound to Q-learning is shown to provide a partial theoretical rationale for the approximation of value functions.
Abstract: Many reinforcement learning approaches can be formulated using the theory of Markov decision processes and the associated method of dynamic programming (DP). The value of this theoretical understanding, however, is tempered by many practical concerns. One important question is whether DP-based approaches that use function approximation rather than lookup tables can avoid catastrophic effects on performance. This note presents a result of Bertsekas (1987) which guarantees that small errors in the approximation of a task's optimal value function cannot produce arbitrarily bad performance when actions are selected by a greedy policy. We derive an upper bound on performance loss that is slightly tighter than that in Bertsekas (1987), and we show the extension of the bound to Q-learning (Watkins, 1989). These results provide a partial theoretical rationale for the approximation of value functions, an issue of great practical importance in reinforcement learning.
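The flavor of the bound (stated here in its classical generic form; the paper's version is slightly tighter) is that a uniformly small value-function error cannot be amplified arbitrarily by greedy action selection:

```latex
\|\hat{V} - V^{*}\|_{\infty} \le \varepsilon
\quad\Longrightarrow\quad
V^{*}(s) - V^{\pi_{\hat{V}}}(s) \;\le\; \frac{2\gamma\varepsilon}{1-\gamma}
\quad \text{for all } s,
```

where \pi_{\hat{V}} is any policy that is greedy with respect to \hat{V}; an analogous statement holds when the greedy policy is computed from an approximate Q-function.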

01 Dec 1994
TL;DR: It is argued that the witness algorithm is superior to existing algorithms for solving POMDP problems in an important complexity-theoretic sense.
Abstract: Markov decision processes (MDPs) are a mathematical formalization of problems in which a decision-maker must choose how to act to maximize its reward over a series of interactions with its environment. Partially observable Markov decision processes (POMDPs) generalize the MDP framework to the case where the agent must make its decisions in partial ignorance of its current situation. This paper describes the POMDP framework and presents some well-known results from the field. It then presents a novel method called the witness algorithm for solving POMDP problems and analyzes its computational complexity. The paper argues that the witness algorithm is superior to existing algorithms for solving POMDPs in an important complexity-theoretic sense.

Proceedings Article
01 Aug 1994
TL;DR: This work explores a method for generating abstractions that allow approximately optimal policies to be constructed; computational gains are achieved through reduction of the state space.
Abstract: Recently Markov decision processes and optimal control policies have been applied to the problem of decision-theoretic planning. However, the classical methods for generating optimal policies are highly intractable, requiring explicit enumeration of large state spaces. We explore a method for generating abstractions that allow approximately optimal policies to be constructed; computational gains are achieved through reduction of the state space. Abstractions are generated by identifying propositions that are "relevant" either through their direct impact on utility, or their influence on actions. This information is gleaned from the representation of utilities and actions. We prove bounds on the loss in value due to abstraction and describe some preliminary experimental results.

Journal ArticleDOI
TL;DR: This paper proves the existence of optimal mixed stationary policies for constrained problems when the constraints are of the same nature as the objective functions, and provides linear programming algorithms for the computation of optimal policies.
Abstract: This paper deals with constrained average reward Semi-Markov Decision Processes (SMDPs) with finite state and action sets. We consider two average reward criteria. The first criterion is time-average rewards, which equal the lower limits of the expected average rewards per unit time, as the horizon tends to infinity. The second criterion is ratio-average rewards, which equal the lower limits of the ratios of the expected total rewards during the first n steps to the expected total duration of these n steps as n → ∞. For both criteria, we prove the existence of optimal mixed stationary policies for constrained problems when the constraints are of the same nature as the objective functions. For unichain problems, we show the existence of randomized stationary policies which are optimal for both criteria. However, optimal mixed stationary policies may be different for each of these criteria even for unichain problems. We provide linear programming algorithms for the computation of optimal policies.

01 Aug 1994
TL;DR: A new algorithm, Witness, is proposed that has empirically been faster than the existing exact techniques; the report itself is aimed at readers who do not have much experience with the techniques or concepts of POMDPs.
Abstract: The main objective of this report is to provide implementation details for the more popular exact algorithms for solving finite horizon partially observable Markov decision processes (POMDPs). Along with the existing algorithms, a new algorithm, Witness, is proposed that has empirically been faster than the existing exact techniques. In addition to algorithmic details, the basic formulas and concepts of POMDPs are presented, as well as explanations and discussion about the basic form of POMDP solutions. This document is aimed at those who do not have a lot of experience with the techniques or concepts of POMDPs.
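One of the basic POMDP formulas such a tutorial covers is the belief-state update after taking action a and receiving observation z (generic notation assumed here):

```latex
b'(s') = \frac{O(z \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}
              {\sum_{\sigma} O(z \mid \sigma, a) \sum_{s} T(\sigma \mid s, a)\, b(s)},
```

which recasts the POMDP as a fully observable MDP over belief states; exact algorithms such as Witness then represent the finite horizon value function over beliefs as the upper surface of a finite set of linear functions.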

Journal ArticleDOI
TL;DR: It is shown that there exists an optimal stationary policy (such that the decisions depend only on the actual number of customers in the queue); it is of a threshold type, and it uses randomization in at most one state.
Abstract: Considers the problem of dynamic flow control of arriving packets into an infinite buffer. The service rate may depend on the state of the system, may change in time, and is unknown to the controller. The goal of the controller is to design an efficient policy which guarantees the best performance under the worst service conditions. The cost is composed of a holding cost, a cost of rejecting customers (packets), and a cost that depends on the quality of the service. The problem is studied in the framework of zero-sum Markov games, and a value iteration algorithm is used to solve it. It is shown that there exists an optimal stationary policy (such that the decisions depend only on the actual number of customers in the queue); it is of a threshold type, and it uses randomization in at most one state.
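The value iteration used for such zero-sum Markov games applies, state by state, the value operator of a one-stage matrix game (a generic sketch; a is the controller's admission decision, b the adversary's choice of service condition):

```latex
v_{n+1}(x) = \operatorname{val}_{a,b}
  \Bigl[\, c(x,a,b) + \beta \sum_{y} P(y \mid x,a,b)\, v_n(y) \Bigr],
```

where \operatorname{val} denotes the minimax value of the matrix game; the threshold form of the optimal policy, with randomization in at most one state, is then read off the fixed point.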

Journal ArticleDOI
TL;DR: A stationary policy and an initial state in an MDP (Markov decision process) induce a stationary probability distribution of the reward; the problem addressed is generating the Pareto optima, in the sense of high mean and low variance of that stationary distribution.
Abstract: A stationary policy and an initial state in an MDP (Markov decision process) induce a stationary probability distribution of the reward. The problem analyzed here is generating the Pareto optima in the sense of high mean and low variance of the stationary distribution. In the unichain case, Pareto optima can be computed either with policy improvement or with a linear program having the same number of variables and one more constraint than the formulation for gain-rate optimization. The same linear program suffices in the multichain case if the ergodic class is an element of choice.

Journal ArticleDOI
TL;DR: It is shown that for this criterion there need not exist, for some positive ε, an ε-optimal randomized stationary strategy, even when the state and action sets are finite; however, ε-optimal Markov strategies that are stationary from some time onward exist, and an explicit algorithm is provided for their computation.
Abstract: We consider a discrete time Markov Decision Process with infinite horizon. The criterion to be maximized is the sum of a number of standard discounted rewards, each with a different discount factor. Situations in which such criteria arise include modeling investments, production, projects of different durations, systems with multiple criteria, and some axiomatic formulations of multi-attribute preference theory. We show that for this criterion, for some positive ε, there need not exist an ε-optimal randomized stationary strategy, even when the state and action sets are finite. However, ε-optimal Markov nonrandomized strategies and optimal Markov strategies exist under weak conditions. We exhibit ε-optimal Markov strategies which are stationary from some time onward. When both state and action spaces are finite, there exists an optimal Markov strategy with this property. We provide an explicit algorithm for the computation of such strategies and give a description of the set of optimal strategies.
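The criterion in question can be written generically (a sketch of the weighted-discount objective, not the paper's exact notation) as

```latex
\sup_{\pi}\; \mathbb{E}^{\pi}_{x_0}
  \Bigl[\, \sum_{k=1}^{K} \sum_{t=0}^{\infty} \beta_k^{\,t}\, r_k(x_t, a_t) \Bigr],
\qquad 0 \le \beta_k < 1 ,
```

and the mixture of discount factors makes the relative weight of the reward components change over time, which is the informal reason that stationary strategies can fail to be ε-optimal while time-dependent Markov strategies suffice.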

09 May 1994
TL;DR: A Markov fuzzy process, which represents transitions of grades of fuzzy sets, is constructed with a transition possibility measure and a general state space, and the results are used to solve fuzzy dynamic programming with optimal stopping times and with general state and action spaces under fuzzy transitions.
Abstract: This paper constructs a Markov fuzzy process, which represents transitions of grades of fuzzy sets, with a transition possibility measure and a general state space. We analyse Snell's optimal stopping problem for the process and we apply the results to solve fuzzy dynamic programming with optimal stopping times and with general state spaces and action spaces under fuzzy transitions.

Journal ArticleDOI
TL;DR: The complexity of the policy improvement algorithm for Markov decision processes is considered and it is shown that four variants of the algorithm require exponential time in the worst case.
Abstract: We consider the complexity of the policy improvement algorithm for Markov decision processes. We show that four variants of the algorithm require exponential time in the worst case.

Proceedings Article
01 Aug 1994
TL;DR: New algorithms for local planning over Markov decision processes are presented; they are shown to expand the agent's knowledge where the world warrants it, with appropriate responsiveness to time pressure and randomness, and an introspective variant uses an internal representation of what computational work has already been done.
Abstract: We present new algorithms for local planning over Markov decision processes. The base-level algorithm possesses several interesting features for control of computation, based on selecting computations according to their expected benefit to decision quality. The algorithms are shown to expand the agent's knowledge where the world warrants it, with appropriate responsiveness to time pressure and randomness. We then develop an introspective algorithm, using an internal representation of what computational work has already been done. This strategy extends the agent's knowledge base where warranted by the agent's world model and the agent's knowledge of the work already put into various parts of this model. It also enables the agent to act so as to take advantage of the computational savings inherent in staying in known parts of the state space. The control flexibility provided by this strategy, which incorporates natural problem-solving methods, directs computational effort toward where it is needed more effectively than previous approaches, offering greater hope of scalability to large domains.

Journal ArticleDOI
TL;DR: The theory of discounted constrained Markov decision processes with countable state and action spaces and general multi-chain structure is established, and finite approximation schemes are introduced for which the convergence of optimal values and policies is shown for both the discounted and the expected average cost.
Abstract: The purpose of this paper is twofold: first, to establish the theory of discounted constrained Markov decision processes with countable state and action spaces and general multi-chain structure; second, to introduce finite approximation methods. We define the occupation measures and obtain properties of the set of all achievable occupation measures under the different admissible policies. We establish the optimality of stationary policies for the constrained control problem, and obtain an LP with a countable number of decision variables through which stationary optimal policies are computed. Since for such an LP one cannot expect to find an optimal solution in a finite number of operations, we present two schemes for finite approximations and establish the convergence of optimal values and policies for both the discounted and the expected average cost, with unbounded cost. Sometimes it turns out to be easier to solve the problem with infinite state space than the problem with finite yet large state space. Based on the optimal policy for the problem with infinite state space, we construct policies which are almost optimal for the problem with truncated state space. This method is applied to obtain an ε-optimal policy for a problem of optimal priority assignment under constraints for a system of K finite queues.
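For the discounted case, the LP over occupation measures mentioned above has the familiar generic shape (sketched here; in the paper the index set is countable rather than finite):

```latex
\begin{aligned}
\max_{\rho \ge 0}\quad & \sum_{x,a} \rho(x,a)\, r(x,a) \\
\text{s.t.}\quad & \sum_{a} \rho(y,a)
   = (1-\beta)\,\mu(y) + \beta \sum_{x,a} \rho(x,a)\, P(y \mid x,a)
   \quad \text{for all } y, \\
 & \sum_{x,a} \rho(x,a)\, c_k(x,a) \le d_k \quad \text{for } k = 1, \dots, K,
\end{aligned}
```

where \mu is the initial distribution; a stationary optimal policy is recovered from an optimal \rho via \pi(a \mid x) = \rho(x,a) / \sum_{a'} \rho(x,a').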

Journal ArticleDOI
TL;DR: New characterizations of the Hamiltonian cycles of a directed graph, and a new LP-relaxation of the Traveling Salesman Problem are derived via an embedding of these combinatorial optimization problems in suitably perturbed controlled Markov chains.
Abstract: In this paper we derive new characterizations of the Hamiltonian cycles of a directed graph, and a new LP-relaxation of the Traveling Salesman Problem. Our results are obtained via an embedding of these combinatorial optimization problems in suitably perturbed controlled Markov chains. This embedding lends probabilistic interpretation to many of the quantities of interest, which in turn lead naturally to the introduction of a quadratic entropy-like function.

Journal ArticleDOI
TL;DR: Bounds on the value function and a suboptimal design for the partially observed Markov decision process are developed, an a priori measure of the quality of these bounds is given, and it is shown that larger M implies tighter bounds.
Abstract: We develop bounds on the value function and a suboptimal design for the partially observed Markov decision process. These bounds and suboptimal design are based on the M most recent observations and actions. An a priori measure of the quality of these bounds is given. We show that larger M implies tighter bounds. An operations count analysis indicates that (#A · #Z)^(M+1) · #S multiplications and additions are required per successive approximations iteration of the suboptimal design algorithm, where #A, #Z, and #S denote the cardinalities of the action, observation, and state spaces A, Z, and S, respectively, suggesting the algorithm is of potential use for problems with large state spaces. A preliminary numerical study indicates that the quality of the suboptimal design can be excellent.

Proceedings Article
01 Jan 1994
TL;DR: An application of reinforcement learning to a linear-quadratic differential game is presented. The results show that advantage updating converges faster than Q-learning in all simulations, and that advantage updating converges regardless of the time step duration, whereas Q-learning is unable to converge as the time step duration grows small.
Abstract: An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and Q-learning are compared. The results show that advantage updating converges faster than Q-learning in all simulations. The results also show advantage updating converges regardless of the time step duration; Q-learning is unable to converge as the time step duration grows small.
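The scaling idea behind advantage updating, which is the usual explanation for the robustness to small time steps reported above, can be sketched as

```latex
A^{*}(x,u) \;=\; \frac{Q^{*}(x,u) - \max_{u'} Q^{*}(x,u')}{\Delta t},
```

so that, while the differences between Q-values of different actions shrink on the order of \Delta t as the time step goes to zero (making them hard to distinguish from noise), the advantages stay well scaled. This normalization is a generic description of advantage updating, not a restatement of the residual gradient form used in the paper.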

01 Jan 1994
TL;DR: This paper shows the convergence of the value iteration (or successive approximations) algorithm for average cost (AC) Markov control processes on Borel spaces, with possibly unbounded cost, under appropriate hypotheses on weighted norms for the cost function and the transition law.
Abstract: This paper shows the convergence of the value iteration (or successive approximations) algorithm for average cost (AC) Markov control processes on Borel spaces, with possibly unbounded cost, under appropriate hypotheses on weighted norms for the cost function and the transition law. It is also shown that the aforementioned convergence implies strong forms of AC-optimality and the existence of forecast horizons.
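The value iteration (successive approximations) scheme whose convergence is studied applies the average-cost dynamic programming operator repeatedly; in generic form (a sketch, with cost c and transition law Q on a Borel state space),

```latex
v_{n+1}(x) = \min_{a \in A(x)} \Bigl[\, c(x,a) + \int v_n(y)\, Q(dy \mid x,a) \Bigr],
```

and the convergence statements concern the differences v_{n+1} - v_n approaching the optimal average cost, and v_n (suitably centered) approaching a solution of the average-cost optimality equation, under weighted-norm bounds on c and Q.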

Journal ArticleDOI
TL;DR: The bound emerges as a solution to a dual pair of convex programming problems, where the primal problem describes mean flows through the network and the dual problem describes implied costs and surplus values at resources of the network.
Abstract: We describe a procedure for bounding the performance of dynamic routing schemes for loss or queueing networks. The bound is developed from a network flow synthesis of a collection of Markov decision processes, one for each resource of the network. The bound emerges as a solution to a dual pair of convex programming problems, where the primal problem describes mean flows through the network and the dual problem describes implied costs and surplus values at resources of the network. The bound is particularly appropriate for large highly connected networks, where it may be approached by simple trunk reservation or threshold routing schemes.

Journal ArticleDOI
TL;DR: Two properties of the set of Markov chains induced by the deterministic policies in a Markov decision chain are studied, called µ-uniform geometric ergodicity and µ- uniform geometric recurrence, which imply the existence of deterministic average and sensitive optimal policies.
Abstract: This paper studies two properties of the set of Markov chains induced by the deterministic policies in a Markov decision chain. These properties are called µ-uniform geometric ergodicity and µ-uniform geometric recurrence. µ-uniform ergodicity generalises a quasi-compactness condition. It can be interpreted as a strong version of stability, as it implies that the Markov chains generated by the deterministic stationary policies are uniformly stable. µ-uniform geometric recurrence can be shown to be equivalent to the simultaneous Doeblin condition if µ is bounded. Both properties imply the existence of deterministic average and sensitive optimal policies. The second key theorem in this paper shows the equivalence of µ-uniform geometric ergodicity and weak µ-uniform geometric recurrence under appropriate continuity conditions. In the literature numerous recurrence conditions have been used. The first key theorem derives the relation between several of these conditions, which interestingly turn out to be equivalent in most cases.