
Showing papers on "Markov decision process published in 1995"


Proceedings Article
18 Aug 1995
TL;DR: It is argued that, although MDPs can be solved efficiently in theory, more study is needed to reveal practical algorithms for solving large problems quickly; to encourage future research, some alternative methods of analysis are sketched that rely on the structure of MDPs.
Abstract: Markov decision problems (MDPs) provide the foundations for a number of problems of interest to AI researchers studying automated planning and reinforcement learning. In this paper, we summarize results regarding the complexity of solving MDPs and the running time of MDP solution algorithms. We argue that, although MDPs can be solved efficiently in theory, more study is needed to reveal practical algorithms for solving large problems quickly. To encourage future research, we sketch some alternative methods of analysis that rely on the structure of MDPs.
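
As a concrete reference point for the "MDP solution algorithms" whose running times are analyzed above, here is a minimal value-iteration sketch for a finite MDP; the transition and reward arrays at the bottom are illustrative placeholders, not data from the paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Value iteration for a finite MDP.

    P: transition probabilities, shape (A, S, S), P[a, s, s'] = Pr(s' | s, a)
    R: expected immediate rewards, shape (A, S)
    Returns a (near-)optimal value function and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=0)   # greedy policy w.r.t. the converged values

# Tiny illustrative 2-state, 2-action example (made-up numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V, policy = value_iteration(P, R)
```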

554 citations


Proceedings Article
20 Aug 1995
TL;DR: This work presents an algorithm, called structured policy iteration (SPI), that constructs optimal policies without explicit enumeration of the state space, and retains the fundamental computational steps of the commonly used modified policy iteration algorithm, but exploits the variable and propositional independencies reflected in a temporal Bayesian network representation of MDPs.
Abstract: Markov decision processes (MDPs) have recently been applied to the problem of modeling decision-theoretic planning. While traditional methods for solving MDPs are often practical for small state spaces, their effectiveness for large AI planning problems is questionable. We present an algorithm, called structured policy iteration (SPI), that constructs optimal policies without explicit enumeration of the state space. The algorithm retains the fundamental computational steps of the commonly used modified policy iteration algorithm, but exploits the variable and propositional independencies reflected in a temporal Bayesian network representation of MDPs. The principles behind SPI can be applied to any structured representation of stochastic actions, policies and value functions, and the algorithm itself can be used in conjunction with recent approximation methods.
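
The flat algorithm that SPI structures is modified policy iteration; a plain table-based version, for reference, looks like the sketch below. SPI itself works on decision-tree/Bayesian-network representations rather than these dense arrays, so this is only the unstructured baseline.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.95, k_eval=10, max_iters=100):
    """Flat modified policy iteration: a few evaluation sweeps per
    improvement step. P has shape (A, S, S); R has shape (A, S)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        # Partial policy evaluation: k_eval backups under the current policy.
        P_pi = P[policy, np.arange(n_states), :]   # (S, S) rows for chosen actions
        R_pi = R[policy, np.arange(n_states)]      # (S,)
        for _ in range(k_eval):
            V = R_pi + gamma * P_pi @ V
        # Policy improvement: greedy with respect to the current values.
        Q = R + gamma * P @ V                      # (A, S)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```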

437 citations



Proceedings Article
20 Aug 1995
TL;DR: Smooth Partially Observable Value Approximation (SPOVA) is introduced, a new approximation method that can quickly yield good approximations which can improve over time and can be combined with reinforcement learning methods, a combination that was very effective in test cases.
Abstract: The problem of making optimal decisions in uncertain conditions is central to Artificial Intelligence. If the state of the world is known at all times, the world can be modeled as a Markov Decision Process (MDP). MDPs have been studied extensively and many methods are known for determining optimal courses of action or policies. The more realistic case, where state information is only partially observable, is captured by Partially Observable Markov Decision Processes (POMDPs), which have received much less attention. The best exact algorithms for these problems can be very inefficient in both space and time. We introduce Smooth Partially Observable Value Approximation (SPOVA), a new approximation method that can quickly yield good approximations which can improve over time. This method can be combined with reinforcement learning methods, a combination that was very effective in our test cases.
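
The core idea behind SPOVA is to replace the hard maximum over value vectors that appears in exact POMDP value functions with a smooth, differentiable approximation that can be tuned incrementally. One common way to write such a smoothing (shown here only as an illustration; the exponent k and the vectors alpha_i are generic symbols, not necessarily the paper's exact parameterization) is

$$\tilde V(b) \;=\; \Big( \sum_i \big( \alpha_i \cdot b \big)^{k} \Big)^{1/k},$$

where b is a belief state and the alpha_i are adjustable value vectors with nonnegative products alpha_i . b; as k grows, the expression approaches max_i alpha_i . b, recovering the exact piecewise-linear shape.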

186 citations


Proceedings Article
20 Aug 1995
TL;DR: This paper presents algorithms that decompose planning problems into smaller problems given an arbitrary partition of the state space and shows how properties of a specified partition affect the time and storage required for these algorithms.
Abstract: This paper is concerned with modeling planning problems involving uncertainty as discrete-time, finite-state stochastic automata. Solving planning problems is reduced to computing policies for Markov decision processes. Classical methods for solving Markov decision processes cannot cope with the size of the state spaces for typical problems encountered in practice. As an alternative, we investigate methods that decompose global planning problems into a number of local problems, solve the local problems separately, and then combine the local solutions to generate a global solution. We present algorithms that decompose planning problems into smaller problems given an arbitrary partition of the state space. The local problems are interpreted as Markov decision processes and solutions to the local problems are interpreted as policies restricted to the subsets of the state space defined by the partition. One algorithm relies on constructing and solving an abstract version of the original decision problem. A second algorithm iteratively approximates parameters of the local problems to converge to an optimal solution. We show how properties of a specified partition affect the time and storage required for these algorithms.

175 citations


Journal ArticleDOI
TL;DR: Two general approaches to extending reinforcement learning to hidden-state tasks are examined; the first unifies recent approaches such as the Lion algorithm, the G-algorithm, and CS-QL, and assumes that, by appropriate control of perception, the external states can be identified at each point in time from the immediate sensory inputs.

132 citations


Journal ArticleDOI

126 citations


Journal ArticleDOI
TL;DR: The authors present a complete (and discrete) classification of both the maximal achievable target levels and of their corresponding percentiles and provide an algorithm for computing a deterministic policy corresponding to any feasible target-percentile pair.
Abstract: Addresses the following basic feasibility problem for infinite-horizon Markov decision processes (MDPs): can a policy be found that achieves a specified value (target) of the long-run limiting average reward at a specified probability level (percentile)? Related optimization problems of maximizing the target for a specified percentile and vice versa are also considered. The authors present a complete (and discrete) classification of both the maximal achievable target levels and of their corresponding percentiles. The authors also provide an algorithm for computing a deterministic policy corresponding to any feasible target-percentile pair. Next the authors consider similar problems for an MDP with multiple rewards and/or constraints. This case presents some difficulties and leads to several open problems. An LP-based formulation provides constructive solutions for most cases.
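
In symbols, the basic feasibility question can be stated as follows (notation chosen here for illustration; one standard formalization uses the liminf of the running average): does there exist a policy pi with

$$\Pr_\pi\Big( \liminf_{T \to \infty} \tfrac{1}{T} \sum_{t=1}^{T} r_t \;\ge\; v \Big) \;\ge\; \alpha,$$

where v is the target level and alpha the percentile? The related optimization problems fix one of v, alpha and maximize the other.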

115 citations


Proceedings ArticleDOI
13 Dec 1995
TL;DR: A general, unified framework for hybrid control problems that encompasses several types of hybrid phenomena and several models of hybrid systems is proposed, and an existence result is obtained for optimal controls.
Abstract: The authors previously (1994) proposed a general, unified framework for hybrid control problems that encompasses several types of hybrid phenomena and several models of hybrid systems. An existence result was obtained for optimal controls. The value function associated with this problem satisfies a set of "generalized quasi-variational inequalities" (GQVIs). We give a classification of the types of hybrid systems models covered by our framework and algorithms. We review our general framework and results. Then, we outline three explicit approaches for computing the solutions to the GQVIs that arise in optimal hybrid control. The approaches are generalizations to hybrid systems of shooting methods for boundary value problems, impulse control for piecewise-deterministic processes (PDPs), and value and policy iteration for piecewise-continuous dynamical systems. In the central case, we make clear the strong connection between impulse control for PDPs and optimal hybrid control. This allows us to give exact and approximate ("epsilon-optimal") algorithms for computing the value function associated with such problems and give some theoretical results. Also following previous work, we find that we can compute optimal solutions via linear programming (LP). The resulting LP problems are in general large, but sparse. In each case, the underlying feedback controls can be subsequently computed. Illustrative examples of each algorithm are solved in our framework.

103 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider a process control procedure with fixed sample sizes and sampling intervals, where the fraction defective is the quality variable of interest, and they show that relatively standard cost assumptions lead to the formulation of the process control problem as a partially observed Markov decision process.
Abstract: We consider a process control procedure with fixed sample sizes and sampling intervals, where the fraction defective is the quality variable of interest, a standard attributes control chart methodology. We show that relatively standard cost assumptions lead to the formulation of the process control problem as a partially observed Markov decision process, where the posterior probability of a process shift is a sufficient statistic for decision making. We characterize features of the optimal solution and show that the optimal policy has a simple control limit structure. Numerical results are provided which indicate that the procedure may provide significant savings over non-Bayesian techniques.
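
A minimal sketch of the kind of Bayesian control-limit rule characterized above: the posterior probability of a process shift is updated after each attributes sample, and an intervention is signalled once it crosses a limit. The shift probability, fraction-defective levels, sample size, and threshold below are illustrative assumptions, not values from the paper.

```python
from scipy.stats import binom

def posterior_shift_prob(prior, x, n, p_in, p_out, q_shift):
    """One-step Bayes update of the probability that the process has shifted.

    prior   : probability of the shifted (out-of-control) state before this sample
    x, n    : number of defectives observed in a sample of size n
    p_in    : fraction defective when in control
    p_out   : fraction defective after a shift
    q_shift : probability of a shift occurring between samples
    """
    pred = prior + (1.0 - prior) * q_shift          # predict the state forward
    num = pred * binom.pmf(x, n, p_out)
    den = num + (1.0 - pred) * binom.pmf(x, n, p_in)
    return num / den

# Control-limit policy: intervene when the posterior exceeds a threshold.
posterior, limit = 0.0, 0.6                          # illustrative values
for x in [1, 0, 2, 4, 5]:                            # defect counts per sample of 20
    posterior = posterior_shift_prob(posterior, x, 20, 0.02, 0.15, 0.05)
    if posterior > limit:
        print("signal: investigate / repair the process")
        posterior = 0.0                              # assume the process is restored
```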

86 citations


Journal ArticleDOI
TL;DR: The properties of Markov decision processes which affect the performance of solution algorithms are identified, and a new problem generation technique is described which allows all of these properties to be controlled.
Abstract: Comparisons of the performance of solution algorithms for Markov decision processes rely heavily on problem generators to provide sizeable sets of test problems. Existing generation techniques allow little control over the properties of the test problems and often result in problems which are not typical of real-world examples. This paper identifies the properties of Markov decision processes which affect the performance of solution algorithms, and also describes a new problem generation technique which allows all of these properties to be controlled.

01 Nov 1995
TL;DR: A new algorithm is offered, called the witness algorithm, which can compute updated value functions efficiently on a restricted class of POMDPs in which the number of linear facets is not too great; it is found to be the fastest algorithm over a wide range of POMDP sizes.
Abstract: We examine the problem of performing exact dynamic-programming updates in partially observable Markov decision processes (POMDPs) from a computational complexity viewpoint. Dynamic-programming updates are a crucial operation in a wide range of POMDP solution methods and we find that it is intractable to perform these updates on piecewise-linear convex value functions for general POMDPs. We offer a new algorithm, called the witness algorithm, which can compute updated value functions efficiently on a restricted class of POMDPs in which the number of linear facets is not too great. We compare the witness algorithm to existing algorithms analytically and empirically and find that it is the fastest algorithm over a wide range of POMDP sizes.
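
For context, the value functions these dynamic-programming updates manipulate are piecewise-linear and convex: a finite set of vectors, one per linear facet, with the value of a belief state given by the best dot product. A minimal evaluation sketch (the vectors and belief are illustrative, not from the paper):

```python
import numpy as np

def pwlc_value(belief, alpha_vectors):
    """Value of a belief state under a piecewise-linear convex value function
    represented by a set of alpha vectors (one per linear facet)."""
    values = alpha_vectors @ belief        # dot product with each facet
    best = int(np.argmax(values))
    return values[best], best              # value and the dominating facet

alpha_vectors = np.array([[1.0, 0.0],      # illustrative 2-state POMDP facets
                          [0.4, 0.4],
                          [0.0, 1.0]])
belief = np.array([0.7, 0.3])
value, facet = pwlc_value(belief, alpha_vectors)
```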

Journal ArticleDOI
TL;DR: It is proved that if a feasible policy exists, then there exists an optimal policy which is stationary (nonrandomized) from some step onward, and randomized Markov before this step, with the total number of actions added by randomization bounded by the number of constraints.
Abstract: This paper deals with constrained optimization of Markov Decision Processes. Both objective function and constraints are sums of standard discounted rewards, but each with a different discount factor. Such models arise, e.g., in production and in applications involving multiple time scales. We prove that if a feasible policy exists, then there exists an optimal policy which is (i) stationary (nonrandomized) from some step onward, (ii) randomized Markov before this step, with the total number of actions added by randomization bounded by the number of constraints. Optimality of such policies for multi-criteria problems is also established. These new policies have the pleasing aesthetic property that the amount of randomization they require over any trajectory is restricted by the number of constraints. This result is new even for constrained optimization with a single discount factor, where the optimality of randomized stationary policies is known. However, a randomized stationary policy may req...
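
Spelled out, the class of problems treated is of the following form (symbols are my own: r_0 is the objective reward, r_1, ..., r_K the constraint rewards, each with its own discount factor):

$$\max_{\pi}\; \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \beta_0^{\,t}\, r_0(x_t, a_t)\Big] \quad \text{subject to} \quad \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \beta_k^{\,t}\, r_k(x_t, a_t)\Big] \;\ge\; c_k, \qquad k = 1, \dots, K.$$

The structural result above says that an optimal pi can be taken randomized Markov only up to some finite step, and stationary deterministic thereafter.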

Book ChapterDOI
09 Jul 1995
TL;DR: The mathematical analysis shows the proposed method is a stochastic gradient ascent on discounted reward in Markov decision processes (MDPs), and is related to the average-reward framework, and assures that it can be extended to continuous environments.
Abstract: Reinforcement learning systems are often required to find not deterministic policies, but stochastic ones. They are also required to gain more reward while learning. Q-learning has not been designed for stochastic policies, and does not guarantee rational behavior partway through learning. This paper presents a new reinforcement learning approach based on a simple credit-assignment scheme for finding memory-less policies. It satisfies the above requirements by treating the policy and the exploration strategy identically. The mathematical analysis shows the proposed method is a stochastic gradient ascent on discounted reward in Markov decision processes (MDPs), and is related to the average-reward framework. The analysis assures that the proposed method can be extended to continuous environments. We also investigate its behavior in comparison with Q-learning on a small MDP example and a non-Markovian one.
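
The paper's specific credit-assignment scheme is not reproduced here; the sketch below is a generic REINFORCE-style stochastic gradient ascent on a tabular softmax policy, included only to make the "gradient ascent on discounted reward for a memory-less stochastic policy" idea concrete. The chain environment at the bottom is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_episode(theta, env_reset, env_step, gamma=0.95, lr=0.01, horizon=100):
    """One REINFORCE episode with a tabular softmax (Boltzmann) policy.
    theta: policy parameters of shape (n_states, n_actions)."""
    s, traj = env_reset(), []
    for _ in range(horizon):
        probs = softmax(theta[s])
        a = rng.choice(len(probs), p=probs)
        s_next, r, done = env_step(s, a)
        traj.append((s, a, r))
        s = s_next
        if done:
            break
    G = 0.0
    for s, a, r in reversed(traj):            # discounted return-to-go
        G = r + gamma * G
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0                    # grad of log pi(a|s) for a softmax policy
        theta[s] += lr * G * grad_log         # ascend the reward gradient
    return theta

# Illustrative 3-state chain: action 1 moves right, action 0 stays; reward at the end.
def env_reset():
    return 0

def env_step(s, a):
    s_next = min(s + a, 2)
    return s_next, float(s_next == 2), s_next == 2

theta = np.zeros((3, 2))
for _ in range(200):
    theta = reinforce_episode(theta, env_reset, env_step)
```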

Journal ArticleDOI
TL;DR: It is shown that a unified framework consisting of a sequential diagram, an influence diagram, and a common formulation table for the problem's data, suffices for compact and consistent representation, economical formulation, and efficient solution of (asymmetric) decision problems.
Abstract: In this paper we introduce a new graph, the sequential decision diagram, to aid in the modeling, formulation, and solution of sequential decision problems under uncertainty. While as compact as an influence diagram, the sequential diagram captures the asymmetric and sequential aspects of decision problems as effectively as decision trees. We show that a unified framework consisting of a sequential diagram, an influence diagram, and a common formulation table for the problem’s data, suffices for compact and consistent representation, economical formulation, and efficient solution of (asymmetric) decision problems. In addition to asymmetry, the framework exploits other sources of computational efficiency, such as conditional independence and value function decomposition, making it also useful in evaluating dynamic-programming problems. The formulation table and recursive algorithm can be readily implemented in computers for solving large-scale problems. Examples are provided to illustrate the methodology in both...

Journal ArticleDOI
TL;DR: In this article, the authors study the Markov decision process under the maximization of the probability that total discounted rewards exceed a target level, focusing on the dynamic programming equations of the model.
Abstract: The Markov decision process is studied under the maximization of the probability that total discounted rewards exceed a target level. We focus on and study the dynamic programming equations of the model. We give various properties of the optimal return operator and, for the infinite planning-horizon model, we characterize the optimal value function as a maximal fixed point of the previous operator. Various turnpike results relating the finite and infinite-horizon models are also given.
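
One standard way to write dynamic programming equations for this criterion (an illustration of the type of recursion involved; the paper's exact notation may differ) augments the state with the level still to be exceeded:

$$U(x, \tau) \;=\; \max_{a \in A(x)} \sum_{x'} p(x' \mid x, a)\; U\!\Big(x', \tfrac{\tau - r(x,a)}{\beta}\Big),$$

where beta is the discount factor and U(x, tau) is interpreted as the maximal probability that the total discounted reward earned from state x exceeds tau; the recursion follows because exceeding tau from x after earning r(x,a) requires the remaining discounted reward to exceed (tau - r(x,a))/beta.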

Journal ArticleDOI
TL;DR: A survey of stochastic games in queues, where both tools and applications are considered, and the structural properties of best policies of the controller, worst-case policies of nature, and of the value function are illustrated.
Abstract: Zero-sum stochastic games model situations where two persons, called players, control some dynamic system, and both have opposite objectives. One player typically wishes to minimize a cost which has to be paid to the other player. Such a game may also be used to model problems with a single controller who has only partial information on the system: the dynamics of the system may depend on some parameter that is unknown to the controller, and may vary in time in an unpredictable way. A worst-case criterion may be considered, where the unknown parameter is assumed to be chosen by “nature” (called player 1), and the objective of the controller (player 2) is then to design a policy that guarantees the best performance under worst-case behaviour of nature. The purpose of this paper is to present a survey of stochastic games in queues, where both tools and applications are considered. The first part is devoted to the tools. We present some existing tools for solving finite horizon and infinite horizon discounted Markov games with unbounded cost, and develop new ones that are typically applicable in queueing problems. We then present some new tools and theory of expected average cost stochastic games with unbounded cost. In the second part of the paper we present a survey on existing results on worst-case control of queues, and illustrate the structural properties of best policies of the controller, worst-case policies of nature, and of the value function. Using the theory developed in the first part of the paper, we extend some of the above results, which were known to hold for finite horizon costs or for the discounted cost, to the expected average cost.
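
For the discounted zero-sum games used as tools in the first part, the basic fixed-point equation is Shapley's: in every state the value is that of a one-shot matrix game built from the immediate cost and the discounted continuation values. In generic notation (not the paper's),

$$V(x) \;=\; \operatorname{val}_{A \times B}\Big[\, c(x, a, b) \;+\; \beta \sum_{x'} p(x' \mid x, a, b)\, V(x') \,\Big],$$

where val denotes the minimax value of the matrix game over the two players' actions a and b, c is the cost paid to player 1, and beta is the discount factor. The survey then extends such tools to unbounded costs and to the expected average cost criterion.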

Journal ArticleDOI
TL;DR: It is shown that by employing the theory of constrained Markov Decision Processes this problem can be reformulated as a linear Mixed Integer Program.
Abstract: In this paper, we consider the problem of locating discretionary facilities on a network. In contrast to previous work in the area, we no longer assume that information on customers' flows along all paths of the network is known (in practice such information is rarely available). Assuming that the fraction of customers that travel from any node to any adjacent node in the network is available, the problem of locating the facilities so as to maximize the fraction of customers that pass by a facility before reaching their destination is formulated as a nonlinear Integer Program. It is shown that by employing the theory of constrained Markov Decision Processes this problem can be reformulated as a linear Mixed Integer Program. The paper presents some preliminary computational results for this formulation as well as results for a greedy heuristic algorithm.

Journal ArticleDOI
TL;DR: In this article, a Markov decision process (MDP) model and structural reliability theory are used to generate a long-term maintenance policy based on minimum expected lifetime cost, with respect to the chosen initial design and used to select an optimum initial design to minimize cost and maintain acceptable reliability.
Abstract: An optimal structural design can be described as the synthesis of the initial structural design and its maintenance/management policy over the design lifetime. By using a Markov decision process (MDP) model and structural reliability theory, a designer at the initial design stage can incorporate a reliability-based model of the lifetime process of the structure. The MDP systematically characterizes the process, including decisions, costs, and system performance. In addition, it provides a solution of this dynamic problem through a static method, thus retaining computational tractability. A long-term maintenance policy based on minimum expected lifetime cost can be generated, with respect to the chosen initial design, and used to select an optimum initial design to minimize cost and maintain acceptable reliability. For existing structures, the MDP gives decisionmakers future maintenance policy, which leads to identification of the minimum discounted expected future cost of the structure, based on its present condition, and maintains acceptable reliability.

Book ChapterDOI
01 Jan 1995
TL;DR: The historical basis of reinforcement learning and some of the current work from a computer scientist’s point of view are surveyed and an assessment of the practical utility of current reinforcement-learning systems is assessed.
Abstract: This paper surveys the historical basis of reinforcement learning and some of the current work from a computer scientist’s point of view. It is an outgrowth of a number of talks given by the authors, including a NATO Advanced Study Institute and tutorials at AAAI’94 and Machine Learning’94. Reinforcement learning is a popular model of the learning problems that are encountered by an agent that learns behavior through trial-and-error interactions with a dynamic environment. It has a strong family resemblance to work in psychology, but differs considerably in the details and in the use of the word “reinforcement.” It is appropriately thought of as a class of problems, rather than as a set of techniques. The paper addresses a variety of subproblems in reinforcement learning, including exploration vs. exploitation, learning from delayed reinforcement, learning and using models, generalization and hierarchy, and hidden state. It concludes with a survey of some practical systems and an assessment of the practical utility of current reinforcement-learning systems

Journal ArticleDOI
TL;DR: It is shown that, if a problem without delay satisfies sufficient conditions for monotonicity of an optimal policy, then the same problem with information and/or action delay also has monotonic (e.g., threshold) optimal policies.
Abstract: We consider a discrete-time Markov decision process with a partially ordered state space and two feasible control actions in each state. Our goal is to find general conditions, which are satisfied in a broad class of applications to control of queues, under which an optimal control policy is monotonic. An advantage of our approach is that it easily extends to problems with both information and action delays, which are common in applications to high-speed communication networks, among others. The transition probabilities are stochastically monotone and the one-stage reward submodular. We further assume that transitions from different states are coupled, in the sense that the state after a transition is distributed as a deterministic function of the current state and two random variables, one of which is controllable and the other uncontrollable. Finally, we make a monotonicity assumption about the sample-path effect of a pairwise switch of the actions in consecutive stages. Using induction on the horizon length, we demonstrate that optimal policies for the finite- and infinite-horizon discounted problems are monotonic. We apply these results to a single queueing facility with control of arrivals and/or services, under very general conditions. In this case, our results imply that an optimal control policy has threshold form. Finally, we show how monotonicity of an optimal policy extends in a natural way to problems with information and/or action delay, including delays of more than one time unit. Specifically, we show that, if a problem without delay satisfies our sufficient conditions for monotonicity of an optimal policy, then the same problem with information and/or action delay also has monotonic (e.g., threshold) optimal policies.

Book ChapterDOI
01 Apr 1995
TL;DR: This paper suggests utilizing task-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and includes an example in which the online reinforcement learning approach is applied to a simple problem of stochastic scheduling.
Abstract: Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a task in state i may be interpreted as a particular component of the maximum-value function associated with the "restart-in-i" process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model; it suggests utilizing task-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and includes an example in which the online reinforcement learning approach is applied to a simple problem of stochastic scheduling, one instance drawn from a wide class of problems that may be formulated as bandit problems.
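
The restart-in-i characterization mentioned in the abstract can be written as a small Bellman equation (generic notation, stated here only as an illustration): in every state j of the task, one may either continue or restart the task from state i,

$$V_i(j) \;=\; \max\Big\{\, r(j) + \beta \sum_k p(j,k)\, V_i(k), \;\; r(i) + \beta \sum_k p(i,k)\, V_i(k) \Big\},$$

and the Gittins index of state i is recovered (up to normalization) from the component V_i(i) of this maximum-value function. The proposal above is to learn these restart-problem values online with per-state Q-learning agents instead of solving the model directly.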

01 Apr 1995
TL;DR: It is shown how W-learning may be used to define spaces of agent-collections whose action selection is learnt rather than hand-designed, which is the kind of solution-space that may be searched with a genetic algorithm.
Abstract: W-learning is a self-organising action-selection scheme for systems with multiple parallel goals, such as autonomous mobile robots. It uses ideas drawn from the subsumption architecture for mobile robots (Brooks), implementing them with the Q-learning algorithm from reinforcement learning (Watkins). Brooks explores the idea of multiple sensing-and-acting agents within a single robot, more than one of which is capable of controlling the robot on its own if allowed. I introduce a model where the agents are not only autonomous, but are in fact engaged in direct competition with each other for control of the robot. Interesting robots are ones where no agent achieves total victory, but rather the state-space is fragmented among different agents. Having the agents operate by Q-learning proves to be a way to implement this, leading to a local, incremental algorithm (W-learning) to resolve competition. I present a sketch proof that this algorithm converges when the world is a discrete, finite Markov decision process. For each state, competition is resolved with the most likely winner of the state being the agent that is most likely to suffer the most if it does not win. In this way, W-learning can be viewed as `fair' resolution of competition. In the empirical section, I show how W-learning may be used to define spaces of agent-collections whose action selection is learnt rather than hand-designed. This is the kind of solution-space that may be searched with a genetic algorithm.
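
Since W-learning is built on Q-learning, each competing agent's underlying update is the standard one-step Q-learning rule; a minimal tabular sketch follows (the W-learning competition rule itself is not reproduced, only hinted at in the comments).

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update for a tabular Q array of
    shape (n_states, n_actions)."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Each W-learning agent i would maintain its own table Q_i driven by its own
# reward signal; which agent controls the robot in a state is decided by how
# much each agent expects to suffer if it is not the one obeyed there.
Q = np.zeros((10, 4))            # illustrative sizes
Q = q_update(Q, s=3, a=1, r=1.0, s_next=4)
```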

Journal ArticleDOI
TL;DR: A brief account is provided of the methods being developed by reinforcement learning researchers, what is novel about them, and what their advantages might be over classical applications of dynamic programming to large-scale stochastic optimal control problems.

Proceedings ArticleDOI
13 Dec 1995
TL;DR: In this article, a connection between continuous timed Petri nets and Markov decision processes is made, and the authors characterize the subclass of continuous timed Petri nets corresponding to undiscounted average cost structure.
Abstract: We set up a connection between continuous timed Petri nets (the fluid version of usual timed Petri nets) and Markov decision processes. We characterize the subclass of continuous timed Petri nets corresponding to undiscounted average cost structure. This subclass satisfies conservation laws and shows a linear growth: one obtains as mere application of existing results for dynamic programming the existence of an asymptotic throughput. This rate can be computed using Howard type algorithms, or by an extension of the well known cycle time formula for timed event graphs. We present an illustrating example and briefly sketch the relation with the discrete case.
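
For the timed event graph special case mentioned at the end, the classical cycle time formula determines the asymptotic behaviour directly; in its usual form for timed event graphs (notation mine), the cycle time is

$$\lambda \;=\; \max_{C \in \mathcal{C}} \; \frac{\sum_{p \in C} \tau_p}{\sum_{p \in C} m_p},$$

the maximum over circuits C of the total holding time divided by the total initial marking along the circuit, the asymptotic throughput being 1/lambda. The paper's point is that the larger subclass of continuous timed Petri nets identified above can be handled by Howard-type (policy iteration) algorithms instead.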

Journal ArticleDOI
TL;DR: The performance of both the residual-gradient and non-residual-gradient forms of advantage updating and Q-learning is compared, demonstrating that advantage updating converges faster than Q-learning in all simulations.
Abstract: An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual-gradient form of advantage updating. The game is a Markov decision process with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. Although a missile and plane scenario was the chosen test bed, the reinforcement learning approach presented here is equally applicable to biologically based systems, such as a predator pursuing prey. The reinforcement learning algorithm for optimal control is modified for differential games to find the minimax point rather than the maximum. Simulation results are compared to the analytical solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual-gradient and non...

Book ChapterDOI
20 Sep 1995
TL;DR: A novel generic program is presented for a certain class of optimisation problems, named sequential decision processes, a perfect example of a class of problems which has until now escaped solution by a single program.
Abstract: 1 Motivation A generic program for a class of problems is a program which can be used to solve each problem in that class by suitably instantiating its parameters. The most celebrated example of a generic program is the algorithm of Aho, Hopcroft and Ullman for the algebraic path problem [1]. Here we have a single program, parameterised by a number of operators, that solves numerous seemingly disparate problems. Furthermore, the applicability of the generic program is elegantly stated as the condition that the parameters form a closed semi-ring. Examples of such generic algorithms are admittedly scarce, and one might be led to believe that the majority of programs cannot be expressed in such a generic manner. I personally do not share that view, for the following two reasons: Firstly, little effort has gone into attempts to classify existing algorithms, reflecting the fact that the computing community as a whole places more value on the invention of new algorithms than on the organisation of existing knowledge. In certain specialised areas where organisational tools are indispensable (for example in the design of architecture-independent parallel algorithms), numerous generic algorithms have been discovered, and form the core of the subject [13]. Secondly, to express generic algorithms one requires a higher-order notation in which both programs and types can be parameters to other programs. Such higher-order notations encourage the programmer to exploit genericity where possible, and indeed many functional programmers now do so as a matter of course. More traditional programming notations do not offer similar support for writing generic programs. This paper is an attempt to persuade you of my viewpoint by presenting a novel generic program for a certain class of optimisation problems, named sequential decision processes. This class was originally identified by Richard Bellman in his pioneering work on dynamic programming [4]. It is a perfect example of a class of problems which are very much alike, but which has until now escaped solution by a single program. Those readers who have followed some of the work that Richard Bird and I have been doing over the last five years [6, 7] will recognise many individual examples: all of these have now been unified. The point of this observation is that even when you are on the lookout for generic programs, it can take a rather long time to discover them. The presentation below will follow that earlier work,

Book ChapterDOI
11 Sep 1995
TL;DR: Techniques from operations research are brought to bear on the problem of choosing optimal actions in partially observable stochastic domains, and it is shown how a finite-memory controller can be extracted from the solution to a POMDP.
Abstract: In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. In many cases, we have developed new ways of viewing the problem that are, perhaps, more consistent with the AI perspective. We begin by introducing the theory of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). We then outline a novel algorithm for solving POMDPs off line and show how, in many cases, a finite-memory controller can be extracted from the solution to a POMDP. We conclude with a simple example.
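
The belief-state machinery that underlies this kind of POMDP solution method reduces the POMDP to an MDP over probability distributions: after taking action a in belief state b and observing o, the new belief is obtained by Bayes' rule (standard notation, not specific to this paper),

$$b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}, \qquad \Pr(o \mid b, a) \;=\; \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s),$$

where T is the transition model and O the observation model. Exact off-line algorithms, and the finite-memory controllers extracted from their solutions, operate on value functions defined over these beliefs.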


Journal ArticleDOI
TL;DR: In this article, the problem of determining an optimal search and disposition strategy is formulated as a Markov decision process whose state space is polynomial in the number of possible records in the file.