
Showing papers on "Markov decision process published in 1998"


Journal ArticleDOI
TL;DR: A novel algorithm for solving POMDPs offline is outlined, along with how, in some cases, a finite-memory controller can be extracted from the solution to a POMDP.

4,283 citations


Journal ArticleDOI
TL;DR: This paper gives a comprehensive description of Markov modelling for economic evaluation, including a discussion of the assumptions on which the type of model is based, most notably the memoryless quality of Markov models, often termed the ‘Markovian assumption’.
Abstract: Markov models are often employed to represent stochastic processes, that is, random processes that evolve over time. In a healthcare context, Markov models are particularly suited to modelling chronic disease. In this article, we describe the use of Markov models for economic evaluation of healthcare interventions. The intuitive way in which Markov models can handle both costs and outcomes makes them a powerful tool for economic evaluation modelling. The time component of Markov models can offer advantages over standard decision tree models, particularly with respect to discounting. This paper gives a comprehensive description of Markov modelling for economic evaluation, including a discussion of the assumptions on which the type of model is based, most notably the memoryless quality of Markov models, often termed the ‘Markovian assumption’. A hypothetical example of a drug intervention to slow the progression of a chronic disease is employed to demonstrate the modelling technique, and the possible methods of analysing Markov models are explored. Analysts should be aware of the limitations of Markov models, particularly the Markovian assumption, although the adept modeller will often find ways around this problem.
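
To make the mechanics concrete, the following is a minimal sketch of a cohort Markov model with discounting of costs and quality-adjusted life years (QALYs). The states, transition probabilities, costs, utilities and discount rate are illustrative assumptions, not values from the article.

# Minimal cohort Markov model for a hypothetical chronic-disease intervention.
# All states, transition probabilities, costs, utilities and the discount
# rate below are illustrative assumptions, not values from the article.

import numpy as np

states = ["asymptomatic", "progressive", "dead"]

# One-cycle (one-year) transition probabilities; rows sum to 1, "dead" is absorbing.
P = np.array([
    [0.85, 0.10, 0.05],
    [0.00, 0.80, 0.20],
    [0.00, 0.00, 1.00],
])

cost_per_year = np.array([500.0, 3000.0, 0.0])    # cost of one cycle in each state
utility_per_year = np.array([0.95, 0.60, 0.0])    # quality-of-life weights
discount = 0.035                                  # annual discount rate

cohort = np.array([1.0, 0.0, 0.0])  # everyone starts asymptomatic
total_cost = 0.0
total_qalys = 0.0

for year in range(30):                    # 30 one-year Markov cycles
    df = 1.0 / (1.0 + discount) ** year   # discount factor for this cycle
    total_cost += df * cohort @ cost_per_year
    total_qalys += df * cohort @ utility_per_year
    cohort = cohort @ P                   # Markovian (memoryless) transition

print(f"Discounted cost:  {total_cost:.0f}")
print(f"Discounted QALYs: {total_qalys:.2f}")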

845 citations


01 Jan 1998
TL;DR: This work looks at sequential decision making in environments where the actions have probabilistic outcomes and the system state is only partially observable, and it considers a number of approaches for deriving policies that yield sub-optimal control, empirically exploring their performance on a range of problems.
Abstract: Automated sequential decision making is crucial in many contexts. In the face of uncertainty, this task becomes even more important, though at the same time, computing optimal decision policies becomes more complex. The more sources of uncertainty there are, the harder the problem becomes to solve. In this work, we look at sequential decision making in environments where the actions have probabilistic outcomes and in which the system state is only partially observable. We focus on using a model called a partially observable Markov decision process (POMDP) and explore algorithms which address computing both optimal and approximate policies for use in controlling processes that are modeled using POMDPs. Although solving for the optimal policy is PSPACE-complete (or worse), the study and improvement of exact algorithms lends insight into the optimal solution structure as well as providing a basis for approximate solutions. We present some improvements, analysis and empirical comparisons for some existing and some novel approaches for computing the optimal POMDP policy exactly. Since it is also hard (NP-complete or worse) to derive close approximations to the optimal solution for POMDPs, we consider a number of approaches for deriving policies that yield sub-optimal control and empirically explore their performance on a range of problems. These approaches borrow and extend ideas from a number of areas; from the more mathematically motivated techniques in reinforcement learning and control theory to entirely heuristic control rules.
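
Both exact and approximate POMDP methods rest on maintaining a belief state, a probability distribution over the hidden system state that is updated by Bayes' rule after each action and observation. The sketch below shows that update for a made-up two-state model; the transition and observation matrices are illustrative assumptions, not taken from the thesis.

# Bayesian belief-state update, the basic operation underlying both exact and
# approximate POMDP algorithms. The two-state model below is a made-up example.

import numpy as np

T = {"listen": np.array([[1.0, 0.0],      # T[a][s, s'] = P(s' | s, a)
                         [0.0, 1.0]])}
Z = {"listen": np.array([[0.85, 0.15],    # Z[a][s', o] = P(o | s', a)
                         [0.15, 0.85]])}

def belief_update(b, a, o):
    """Return the new belief P(s' | b, a, o) for belief vector b, action a, observation o."""
    predicted = b @ T[a]                   # predict: sum_s b(s) P(s' | s, a)
    unnormalized = predicted * Z[a][:, o]  # correct: weight by P(o | s', a)
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])
b = belief_update(b, "listen", o=0)
print(b)   # belief shifts toward the state that makes observation 0 likely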

428 citations


Journal ArticleDOI
Ralph Neuneier1, Oliver Mihatsch1
01 Dec 1998
TL;DR: This risk-sensitive reinforcement learning algorithm is based on a very different philosophy and reflects important properties of the classical exponential utility framework, but avoids its serious drawbacks for learning.
Abstract: Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us the lesson that this criterion is not always the most suitable because many applications require robust control strategies which also take into account the variance of the return. Classical control literature provides several techniques to deal with risk-sensitive optimization goals like the so-called worst-case optimality criterion exclusively focusing on risk-avoiding policies or classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm. Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms which converge with probability one under the usual conditions.
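
One way to realize the idea of transforming temporal differences rather than returns is to weight positive and negative TD errors asymmetrically inside an otherwise standard Q-learning update, as sketched below. The specific weighting function and the parameter kappa are illustrative choices, not necessarily the exact transformation used in the paper.

# Risk-sensitive Q-learning sketch: the TD error, not the return, is passed
# through an asymmetric transformation before being applied. The weighting
# used here is an illustrative choice, not the paper's exact function.

from collections import defaultdict

alpha, gamma = 0.1, 0.95
kappa = 0.5                     # hypothetical risk parameter in (-1, 1)
Q = defaultdict(float)

def transform(delta, kappa):
    # For kappa > 0, negative surprises are weighted more heavily (risk-averse).
    return (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta

def risk_sensitive_update(s, a, r, s_next, actions):
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * transform(delta, kappa)

# Example: a bad surprise moves Q[(s, a)] down more than an equally large
# good surprise would move it up.
risk_sensitive_update("s0", "a0", -1.0, "s1", ["a0", "a1"])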

248 citations


01 Jan 1998
TL;DR: This dissertation introduces the HAM (hierarchy of abstract machines) approach for generating hierarchical, temporally abstract actions and shows that traditional MDP algorithms can be used to optimally refine HAMs for new tasks.
Abstract: This dissertation investigates the use of hierarchy and problem decomposition as a means of solving large, stochastic, sequential decision problems. These problems are framed as Markov decision problems (MDPs). The new technical content of this dissertation begins with a discussion of the concept of temporal abstraction. Temporal abstraction is shown to be equivalent to the transformation of a policy defined over a region of an MDP to an action in a semi-Markov decision problem (SMDP). Several algorithms are presented for performing this transformation efficiently. This dissertation introduces the HAM approach for generating hierarchical, temporally abstract actions. This method permits the partial specification of abstract actions in a way that corresponds to an abstract plan or strategy. Abstract actions specified as HAMs can be optimally refined for new tasks by solving a reduced SMDP. The formal results show that traditional MDP algorithms can be used to optimally refine HAMs for new tasks. This can be achieved in much less time than it would take to learn a new policy for the task from scratch. HAMs complement some novel decomposition algorithms that are presented in this dissertation. These algorithms work by constructing a cache of policies for different regions of the MDP and then optimally combining the cached solutions to produce a global solution that is within provable bounds of the optimal solution. Together, the methods developed in this dissertation provide important tools for producing good policies for large MDPs. Unlike some ad hoc methods, these methods provide strong formal guarantees. They use prior knowledge in a principled way, and they reduce larger MDPs into smaller ones while maintaining a well-defined relationship between the smaller problem and the larger problem.

244 citations


Proceedings Article
24 Jul 1998
TL;DR: A hierarchical model (using an abstract MDP) that works with macro-actions only and significantly reduces the size of the state space is proposed, and reusing macro-actions across multiple related MDPs is shown to justify the computational overhead of macro-action generation.
Abstract: We investigate the use of temporally abstract actions, or macro-actions, in the solution of Markov decision processes. Unlike current models that combine both primitive actions and macro-actions and leave the state space unchanged, we propose a hierarchical model (using an abstract MDP) that works with macro-actions only, and that significantly reduces the size of the state space. This is achieved by treating macro-actions as local policies that act in certain regions of state space, and by restricting states in the abstract MDP to those at the boundaries of regions. The abstract MDP approximates the original and can be solved more efficiently. We discuss several ways in which macro-actions can be generated to ensure good solution quality. Finally, we consider ways in which macro-actions can be reused to solve multiple, related MDPs, and we show that this can justify the computational overhead of macro-action generation.

236 citations


Proceedings Article
01 Jul 1998
TL;DR: A technique for computing approximately optimal solutions to stochastic resource allocation problems modeled as Markov decision processes (MDPs) is presented, along with heuristic techniques for dealing with several classes of constraints that use the solutions for individual MDPs to construct an approximate global solution.
Abstract: We present a technique for computing approximately optimal solutions to stochastic resource allocation problems modeled as Markov decision processes (MDPs). We exploit two key properties to avoid explicitly enumerating the very large state and action spaces associated with these problems. First, the problems are composed of multiple tasks whose utilities are independent. Second, the actions taken with respect to (or resources allocated to) a task do not influence the status of any other task. We can therefore view each task as an MDP. However, these MDPs are weakly coupled by resource constraints: actions selected for one MDP restrict the actions available to others. We describe heuristic techniques for dealing with several classes of constraints that use the solutions for individual MDPs to construct an approximate global solution. We demonstrate this technique on problems involving thousands of tasks, approximating the solution to problems that are far beyond the reach of standard methods.
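
The decomposition can be illustrated with a toy allocation step: each task's MDP is solved separately to obtain its value as a function of the resources it receives, and a heuristic then splits a shared budget across tasks. The greedy marginal-gain rule below is only one example of such a heuristic, and the numbers are hypothetical; the paper describes several, more elaborate constraint-handling schemes.

def greedy_allocate(task_values, budget):
    """task_values[i][k] = value of task i when given k units of the resource
    (obtained, e.g., by solving task i's MDP for each k). Assumes values are
    non-decreasing in k."""
    alloc = [0] * len(task_values)
    for _ in range(budget):
        # Give the next unit to the task with the largest marginal gain.
        gains = [
            values[alloc[i] + 1] - values[alloc[i]] if alloc[i] + 1 < len(values) else float("-inf")
            for i, values in enumerate(task_values)
        ]
        best = max(range(len(gains)), key=gains.__getitem__)
        if gains[best] == float("-inf"):
            break
        alloc[best] += 1
    return alloc

# Hypothetical per-task values for 0..3 resource units, shared budget of 4.
print(greedy_allocate([[0, 5, 8, 9], [0, 4, 7, 9], [0, 6, 7, 7]], budget=4))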

220 citations


Proceedings ArticleDOI
12 May 1998
TL;DR: A stochastic model for dialogue systems based on a Markov decision process is introduced, showing that the problem of dialogue strategy design can be stated as an optimization problem and solved by a variety of methods, including the reinforcement learning approach.
Abstract: We introduce a stochastic model for dialogue systems based on a Markov decision process. Within this framework we show that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach. The advantages of this new paradigm include objective evaluation of dialogue systems and their automatic design and adaptation. We show some preliminary results on learning a dialogue strategy for an air travel information system.

212 citations


Proceedings Article
24 Jul 1998
TL;DR: This paper shows empirically that Sarsa(λ), a well known family of RL algorithms that use eligibility traces, can work very well on hidden state problems that have good memoryless policies, i.e., on RL problems in which there may well be very poor observability but there also exists a mapping from immediate observations to actions that yields near-optimal return.
Abstract: Recent research on hidden-state reinforcement learning (RL) problems has concentrated on overcoming partial observability by using memory to estimate state. However, such methods are computationally extremely expensive and thus have very limited applicability. This emphasis on state estimation has come about because it has been widely observed that the presence of hidden state or partial observability renders popular RL methods such as Q-learning and Sarsa useless. However, this observation is misleading in two ways: first, the theoretical results supporting it only apply to RL algorithms that do not use eligibility traces, and second these results are worst-case results, which leaves open the possibility that there may be large classes of hidden-state problems in which RL algorithms work well without any state estimation. In this paper we show empirically that Sarsa(λ), a well known family of RL algorithms that use eligibility traces, can work very well on hidden state problems that have good memoryless policies, i.e., on RL problems in which there may well be very poor observability but there also exists a mapping from immediate observations to actions that yields near-optimal return. We apply conventional Sarsa(λ) to four test problems taken from the recent work of Littman, Littman Cassandra and Kaelbling, Parr and Russell, and Chrisman, and in each case we show that it is able to find the best, or a very good, memoryless policy without any of the computational expense of state estimation.
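
For reference, the Sarsa(λ) variant applied here is the standard tabular algorithm with eligibility traces, run directly on immediate observations rather than on estimated states. The sketch below uses accumulating traces and an assumed env.reset()/env.step() interface; both are illustrative choices rather than details taken from the paper.

# Tabular Sarsa(lambda) over immediate observations (no state estimation).
# The environment interface (env.reset / env.step -> observation, reward, done)
# is assumed for illustration.

import random
from collections import defaultdict

def epsilon_greedy(Q, obs, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda act: Q[(obs, act)])

def sarsa_lambda(env, actions, episodes=1000,
                 alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        traces = defaultdict(float)          # eligibility traces
        obs = env.reset()
        a = epsilon_greedy(Q, obs, actions, epsilon)
        done = False
        while not done:
            next_obs, r, done = env.step(a)
            next_a = epsilon_greedy(Q, next_obs, actions, epsilon)
            target = r + (0.0 if done else gamma * Q[(next_obs, next_a)])
            delta = target - Q[(obs, a)]
            traces[(obs, a)] += 1.0          # accumulating traces
            for key in list(traces):
                Q[key] += alpha * delta * traces[key]
                traces[key] *= gamma * lam
            obs, a = next_obs, next_a
    return Q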

155 citations


Book ChapterDOI
01 Oct 1998
TL;DR: In this paper, the authors investigate the existence of optimal policies and provide algorithms for the construction of such policies, where the desirable properties of infinite histories are specified as a finite number of events represented as ω-regular sets.
Abstract: Desirable properties of the infinite histories of a finite-state Markov decision process are specified in terms of a finite number of events represented as ω-regular sets. An infinite history of the process produces a reward which depends on the properties it satisfies. The authors investigate the existence of optimal policies and provide algorithms for the construction of such policies.

149 citations



Book ChapterDOI
21 Apr 1998
TL;DR: New Bellman equations that are satisfied for sets of multi-time models are defined, which can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
Abstract: We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among "options", whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
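
As a concrete illustration of what such equations look like, the option-style Bellman equation below is written in the notation of the later options literature; it is a sketch of the idea, with the discounting over a temporally extended course of action folded into the model, rather than a verbatim reproduction of the paper's equations.

\[
V^*(s) \;=\; \max_{o \in \mathcal{O}(s)} \Big[\, r(s,o) \;+\; \sum_{s'} p(s' \mid s,o)\, V^*(s') \,\Big],
\]
where, for an option \(o\) initiated in \(s\) at time \(t\) and terminating after \(k\) steps,
\[
r(s,o) \;=\; \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \right],
\qquad
p(s' \mid s,o) \;=\; \sum_{k=1}^{\infty} \gamma^{k}\, \Pr\!\left( s_{t+k} = s',\, k \right).
\]
A primitive action is recovered as the one-step special case, which is why multi-time models can be used interchangeably with one-step models in value iteration, policy improvement and policy iteration.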

Proceedings ArticleDOI
16 Dec 1998
TL;DR: A simulation-based algorithm is proposed for optimizing the average reward in a Markov reward process that depends on a set of parameters; as a special case, it applies to Markov decision processes where optimization takes place within a parametrized set of policies.
Abstract: We propose a simulation-based algorithm for optimizing the average reward in a Markov reward process that depends on a set of parameters. As a special case, the method applies to Markov decision processes where optimization takes place within a parametrized set of policies. The algorithm involves the simulation of a single sample path, and can be implemented online. A convergence result (with probability 1) is provided.
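
A heavily simplified sketch of how such a single-sample-path, online scheme can look is given below: a running estimate of the average reward is maintained alongside an eligibility vector of score functions, and the policy parameters are nudged in the estimated gradient direction. The trace decay, step sizes and the environment/policy interfaces are illustrative assumptions; the paper's actual update rules and convergence conditions are more carefully specified.

# Simplified online likelihood-ratio sketch in the spirit of the algorithm
# described above; interfaces and constants are illustrative assumptions.

import numpy as np

def optimize_average_reward(env, policy, grad_log_policy, theta,
                            steps=100_000, beta=0.01, alpha=1e-3, decay=0.99):
    rho = 0.0                        # running estimate of the average reward
    z = np.zeros_like(theta)         # eligibility vector of score functions
    s = env.reset()
    for _ in range(steps):
        a = policy(theta, s)                      # sample action from pi_theta
        s_next, r = env.step(a)
        rho += beta * (r - rho)                   # track the average reward
        z = decay * z + grad_log_policy(theta, s, a)   # decayed score trace
        theta = theta + alpha * (r - rho) * z     # stochastic gradient step
        s = s_next
    return theta, rho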

Proceedings Article
Ronald Parr1
24 Jul 1998
TL;DR: In this article, a large stochastic decision problem is divided into smaller pieces, and information is communicated between the different problem pieces, allowing intelligent decisions to be made about which piece requires the most attention.
Abstract: This paper presents two new approaches to decomposing and solving large Markov decision problems (MDPs), a partial decoupling method and a complete decoupling method. In these approaches, a large, stochastic decision problem is divided into smaller pieces. The first approach builds a cache of policies for each part of the problem independently, and then combines the pieces in a separate, light-weight step. A second approach also divides the problem into smaller pieces, but information is communicated between the different problem pieces, allowing intelligent decisions to be made about which piece requires the most attention. Both approaches can be used to find optimal policies or approximately optimal policies with provable bounds. These algorithms also provide a framework for the efficient transfer of knowledge across problems that share similar structure.

Journal ArticleDOI
TL;DR: The increased use of modelling techniques as a methodological tool in the economic evaluation of health care technologies has, in the main, been limited to two approaches – decision trees and Markov chain models.
Abstract: The increased use of modelling techniques as a methodological tool in the economic evaluation of health care technologies has, in the main, been limited to two approaches – decision trees and Markov chain models. The former are suited to modelling simple scenarios that occur over a short time period, whilst Markov chain models allow longer time periods to be modelled, in continuous time, where the timing of an event is uncertain. In the context of economic evaluation, a less well developed technique is discrete event simulation, which may allow even greater flexibility.

01 Aug 1998
TL;DR: It is argued that options and their models provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge.
Abstract: Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options: whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches or joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and to plan and learn with them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (1998) and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions to improve over SMDP methods with no additional cost. We also introduce intra-option temporal-difference methods that are able to learn from fragments of an option's execution. Finally, we propose a notion of subgoal which can be used to improve the options themselves. Overall, we argue that options and their models provide hitherto missing aspects of a powerful, clear, and expressive framework for representing and organizing knowledge.
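
To illustrate the intra-option idea, the sketch below applies a one-step update to every option whose policy (deterministic here, for simplicity) would have chosen the primitive action actually taken, so a single transition can update many options at once. The Q table and the option objects with policy and beta (termination) members are assumed structures for the example, not interfaces from the paper.

# Intra-option value-learning sketch: one transition (s, a, r, s_next) updates
# every option consistent with the action taken. Data structures are assumed.

def intra_option_q_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.95):
    for o in options:
        if o.policy(s) != a:          # only options that would have chosen a
            continue
        # Value of continuing the option in s_next vs. terminating there.
        continuation = (1.0 - o.beta(s_next)) * Q[(s_next, o)]
        termination = o.beta(s_next) * max(Q[(s_next, other)] for other in options)
        target = r + gamma * (continuation + termination)
        Q[(s, o)] += alpha * (target - Q[(s, o)])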

Journal ArticleDOI
TL;DR: In this article, a hidden Markov model (HMM) is proposed to model the source trajectory including source maneuver uncertainty, and the probability of position transition is deduced from the probabilities of velocity transitions, themselves directly related to the source maneuvering capability.
Abstract: Classical bearings-only target-motion analysis (TMA) is restricted to sources with constant motion parameters (usually position and velocity). However, most interesting sources have maneuvering abilities, thus degrading the performance of classical TMA. In the passive sonar context a long-time source-observer encounter is realistic, so the source maneuver possibilities may be important in regard to the source and array baseline. This advocates for the consideration and modeling of the whole source trajectory including source maneuver uncertainty. With that aim, a convenient framework is the hidden Markov model (HMM). A basic idea consists of a two-levels discretization of the state-space. The probabilities of position transition are deduced from the probabilities of velocity transitions which, themselves, are directly related to the source maneuvering capability. The source state sequence estimation is achieved by means of classical dynamic programming (DP). This approach does not require any prior information relative to the source maneuvers. However, the probabilistic nature of the source trajectory confers a major role to the optimization of the observer maneuvers. This problem is then solved by using the general framework of the Markov decision process (MDP).
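
The dynamic-programming step, finding the most probable sequence of discretized source states given the measurements, is the classical Viterbi recursion. A generic log-domain version is sketched below; the two-level discretization into position/velocity cells and the specific transition and observation models of the sonar setting are not reproduced here.

# Generic log-domain Viterbi dynamic programming for the most likely hidden
# state sequence; model matrices are assumed inputs, not the paper's models.

import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi[s]: log prior; log_A[s, s']: log transition probability;
    log_B[s, o]: log observation likelihood; obs: list of observation indices.
    Returns the most likely state sequence."""
    n_states = len(log_pi)
    T = len(obs)
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            cand = score[t - 1] + log_A[:, s]
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]] + log_B[s, obs[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]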

Journal ArticleDOI
TL;DR: A generalisation of Dubois and Prade's qualitative decision theory to multi-stage decision making is proposed, and the computation of an optimal policy is carried out in a manner similar to dynamic programming.

Journal ArticleDOI
TL;DR: This paper establishes, in particular, a calculation approach for the value function of the CMDP based on finite state approximation, and presents another type of LP that allows the computation of optimal mixed stationary-deterministic policies.
Abstract: The aim of this paper is to investigate the Lagrangian approach and a related Linear Programming (LP) that appear in constrained Markov decision processes (CMDPs) with a countable state space and total expected cost criteria (of which the expected discounted cost is a special case). We consider transient MDPs and MDPs with uniform Lyapunov functions, and obtain for these an LP which is the dual of another one that has been shown to provide the optimal values and stationary policies [3, 4]. We show that there is no duality gap between these LPs under appropriate conditions. In obtaining the Linear Program for the general transient case, we establish, in particular, a calculation approach for the value function of the CMDP based on finite state approximation. Unlike previous approaches for state approximations for CMDPs (most of which were derived for the contracting framework), we do not need here any Slater type condition. We finally present another type of LP that allows the computation of optimal mixed stationary-deterministic policies.
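
For orientation, the kind of occupation-measure linear program involved has the following representative form, written here for a finite state space and a single objective cost; the countable-state transient and Lyapunov cases treated in the paper require additional care.

\[
\min_{\rho \ge 0} \;\sum_{s,a} \rho(s,a)\, c(s,a)
\quad \text{subject to} \quad
\sum_{a} \rho(s,a) \;-\; \sum_{s',a'} \rho(s',a')\, P(s \mid s',a') \;=\; \beta(s) \quad \forall s,
\]
\[
\sum_{s,a} \rho(s,a)\, d_k(s,a) \;\le\; V_k \quad \forall k,
\]
where \(\rho\) is the total expected state–action occupation measure, \(\beta\) the initial distribution, \(c\) the cost to be minimized, and the \(d_k\) the constrained costs with bounds \(V_k\).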

Book ChapterDOI
01 Jan 1998
TL;DR: It is shown that Continuous Timed Petri Nets (CTPN) can be modeled by generalized polynomial recurrent equations in the (min,+) semiring and a correspondence between CTPN and Markov decision processes is established.
Abstract: We show that Continuous Timed Petri Nets (CTPN) can be modeled by generalized polynomial recurrent equations in the (min,+) semiring. We establish a correspondence between CTPN and Markov decision processes. We survey the basic system theoretical results available: behavioral (inputoutput) properties, algebraic representations, asymptotic regime. A particular attention is paid to the subclass of stable systems (with asymptotic linear growth).

Journal ArticleDOI
TL;DR: This paper provides an introductory discussion for an important concept, the performance potentials of Markov processes, and its relations with perturbation analysis (PA), average-cost Markov decision processes, Poisson equations, α-potentials, the fundamental matrix, and the group inverse of the transition matrix.
Abstract: This paper provides an introductory discussion for an important concept, the performance potentials of Markov processes, and its relations with perturbation analysis (PA), average-cost Markov decision processes (MDP), Poisson equations, α-potentials, the fundamental matrix, and the group inverse of the transition matrix (or the infinitesimal generators). Applications to single sample path-based performance sensitivity estimation and performance optimization are also discussed. On-line algorithms for performance sensitivity estimates and on-line schemes for policy iteration methods are presented. The approach is closely related to reinforcement learning algorithms.
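
For reference, the Poisson equation mentioned here relates the potential (or bias) vector g to the one-step performance f and the average reward; in one common discrete-time formulation it reads

\[
(I - P)\,g \;+\; \eta\, e \;=\; f, \qquad \eta = \pi f,
\]
or componentwise \(g(i) = f(i) - \eta + \sum_{j} P(i,j)\, g(j)\), where \(P\) is the transition matrix, \(\pi\) its stationary distribution, \(e\) the all-ones vector, and \(\eta\) the average reward. Perturbation-analysis and policy-iteration formulas can then be expressed in terms of \(g\).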

Journal ArticleDOI
TL;DR: It is shown that the only gain (average) optimal stationary policies whose gain and bias satisfy the optimality equation are of control limit type, that there are at most two such policies, and that, if there are two, they occur consecutively.
Abstract: This paper studies an admission control M/M/1 queueing system. It shows that the only gain (average) optimal stationary policies with gain and bias which satisfy the optimality equation are of control limit type, that there are at most two and, if there are two, they occur consecutively. Conditions are provided which ensure the existence of two gain optimal control limit policies and are illustrated with an example. The main result is that bias optimality distinguishes these two gain optimal policies and that the larger of the two control limits is the unique bias optimal stationary policy. Consequently it is also Blackwell optimal. This result is established by appealing to the third optimality equation of the Markov decision process and some observations concerning the structure of solutions of the second optimality equation.

Proceedings Article
24 Jul 1998
TL;DR: A family of algorithms for structured reachability analysis of MDPs is developed that is suitable when an initial state (or set of states) is known and can be used to eliminate variables or variable values from the problem description, reducing the size of the MDP and making it easier to solve.
Abstract: Recent research in decision theoretic planning has focussed on making the solution of Markov decision processes (MDPs) more feasible. We develop a family of algorithms for structured reachability analysis of MDPs that are suitable when an initial state (or set of states) is known. Using compact, structured representations of MDPs (e.g., Bayesian networks), our methods, which vary in the tradeoff between complexity and accuracy, produce structured descriptions of (estimated) reachable states that can be used to eliminate variables or variable values from the problem description, reducing the size of the MDP and making it easier to solve. One contribution of our work is the extension of ideas from GRAPHPLAN to deal with the distributed nature of action representations typically embodied within Bayes nets and the problem of correlated action effects. We also demonstrate that our algorithm can be made more complete by using k-ary constraints instead of binary constraints. Another contribution is the illustration of how the compact representation of reachability constraints can be exploited by several existing (exact and approximate) abstraction algorithms for MDPs.

01 Jan 1998
TL;DR: Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

Journal ArticleDOI
TL;DR: The study of expectation optimality criteria (the standard criteria) has constituted most previous work in the area of Markov decision processes, but the optimal policies obtained are not reliable when only a single or a few decision processes are considered.

Proceedings ArticleDOI
16 Dec 1998
TL;DR: It is argued that a natural choice for the initial value function is the value function for the associated deterministic control problem based upon a fluid model, or the approximate solution to Poisson’s equation obtained from the LP of Kumar and Meyn.
Abstract: This paper considers in parallel the scheduling problem for multiclass queueing networks, and optimization of Markov decision processes. It is shown that the value iteration algorithm may perform poorly when the algorithm is not initialized properly. When the algorithm is initialized with a stochastic Lyapunov function, convergence is guaranteed and each intermediate policy is stabilizing. For the network scheduling problem it is argued that a natural choice for the initial value function is the value function for the associated deterministic control problem based upon a fluid model, or the approximate solution to Poisson's equation obtained from the LP of Kumar and Meyn (1996). Numerical studies show that either choice may lead to fast convergence to an optimal policy.
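
A bare-bones form of the undiscounted value iteration recursion being initialized here, with the initial value function V0 supplied by the user (e.g., a stochastic Lyapunov function or a fluid-model value function), might look as follows; the cost and transition arrays are placeholders for illustration.

import numpy as np

def value_iteration(P, c, V0, iters=200):
    """Undiscounted value iteration V_{n+1}(x) = min_a [ c(x, a) + sum_y P_a(x, y) V_n(y) ],
    started from a user-supplied V0 rather than from zero.
    P: dict mapping each action a to its transition matrix; c: dict mapping a to its cost vector."""
    V = np.asarray(V0, dtype=float)
    for _ in range(iters):
        # Under suitable conditions the increments V_{n+1} - V_n approach the
        # optimal average cost, and the minimizing actions define the policy.
        V = np.min([c[a] + P[a] @ V for a in P], axis=0)
    return V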

Book
30 Apr 1998
TL;DR: This book, with a foreword by Alain Bensoussan, covers the modelling of Markov chains and Markov processes, as well as random walks, stochastic differential equations, jump processes, volatility models, and dynamic optimization.
Abstract: Foreword Alain Bensoussan. 1. Dynamics, Stochastic Models and Uncertainty. 2. Modelling Markov Chains and Markov Processes. 3. Random Walks and Stochastic Differential Equations. 4. Jump Processes and Special Problems. 5. Memory, Volatility Models and the Range Process. 6. Dynamic Optimization. 7. Numerical and Optimization Techniques.

Journal ArticleDOI
TL;DR: A new value iteration method is proposed for the classical average cost Markovian decision problem, under the assumption that all stationary policies are unichain and that there exists a state that is recurrent under all stationary policies.
Abstract: We propose a new value iteration method for the classical average cost Markovian decision problem, under the assumption that all stationary policies are unichain and that, furthermore, there exists a state that is recurrent under all stationary policies. This method is motivated by a relation between the average cost problem and an associated stochastic shortest path problem. Contrary to the standard relative value iteration, our method involves a weighted sup-norm contraction, and for this reason it admits a Gauss--Seidel implementation. Computational tests indicate that the Gauss--Seidel version of the new method substantially outperforms the standard method for difficult problems.

Proceedings Article
01 Dec 1998
TL;DR: This paper shows how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning.
Abstract: In robotics and other control applications it is commonplace to have a preexisting set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.

Journal ArticleDOI
TL;DR: A new model, named controlled Markov set-chains and based on Markov set-chains, is introduced, and its optimization under a partial order is discussed; a numerical example illustrates the theoretical results and the computation.
Abstract: In the framework of discounted Markov decision processes, we consider the case that the transition probability varies in some given domain at each time and its variation is unknown or unobservable. To this end we introduce a new model, named controlled Markov set-chains, based on Markov set-chains, and discuss its optimization under some partial order. Also, a numerical example is given to explain the theoretical results and the computation.