
Showing papers on "Markov decision process published in 1997"


Book ChapterDOI
01 Oct 1997
TL;DR: This paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small POMDPs taken from the learning literature, but shows that none are able to solve a slightly larger and noisier problem based on robot navigation.
Abstract: Partially observable Markov decision processes (POMDPs) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of POMDPs is motivated by a need to address realistic problems, existing techniques for finding optimal behavior do not appear to scale well and have been unable to find satisfactory policies for problems with more than a dozen states. After a brief review of POMDPs, this paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small POMDPs taken from the learning literature. In contrast, we show that none are able to solve a slightly larger and noisier problem based on robot navigation. We find that a combination of two novel approaches performs well on these problems and suggest methods for scaling to even larger and more complicated domains.

663 citations
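POMDP solution methods like those surveyed here operate on belief states: probability distributions over the hidden state, updated by Bayes' rule after each action and observation. As a point of reference (not code from the paper), here is a minimal sketch of that belief update; the array layouts T and O are illustrative assumptions.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update for a discrete POMDP.

    b : current belief over states, shape (S,)
    a : action index
    o : observation index
    T : transition probabilities, T[a, s, s'] = P(s' | s, a)
    O : observation probabilities, O[a, s', o] = P(o | s', a)
    """
    predicted = b @ T[a]                   # push belief through the transition model
    unnormalized = predicted * O[a, :, o]  # weight by likelihood of the observation
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / norm

# Tiny two-state, one-action example (numbers are illustrative only).
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, o=1, T=T, O=O))
```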


Proceedings Article
01 Aug 1997
TL;DR: It is found that incremental pruning is presently the most efficient exact method for solving POMDPs.
Abstract: Most exact algorithms for general partially observable Markov decision processes (POMDPs) use a form of dynamic programming in which a piecewise-linear and convex representation of one value function is transformed into another. We examine variations of the "incremental pruning" method for solving this problem and compare them to earlier algorithms from theoretical and empirical perspectives. We find that incremental pruning is presently the most efficient exact method for solving POMDPs.

441 citations
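Exact POMDP value iteration represents each value function as a finite set of alpha-vectors, and pruning discards vectors that are useless at every belief. The following is a hedged sketch of the cheap pointwise-dominance pre-filter only; the incremental pruning method studied in the paper additionally uses linear programs to remove vectors dominated by combinations of others.

```python
import numpy as np

def pointwise_prune(vectors):
    """Remove alpha-vectors dominated at every belief point (pre-filter only)."""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(vectors) if j != i
        )
        if not dominated:
            kept.append(v)
    return kept

# Three vectors over a 2-state belief space; the middle one is dominated.
vs = [np.array([1.0, 0.0]), np.array([0.4, 0.4]), np.array([0.5, 0.5])]
print(pointwise_prune(vs))
```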


BookDOI
Masaaki Kijima
TL;DR: Markov processes for stochastic modeling.
Abstract: Markov processes for stochastic modeling.

437 citations


Book
01 Jan 1997
TL;DR: This dissertation presents methods for the formal modeling and specification of probabilistic systems, and algorithms for the automated verification of these systems, which rely on the theory of Markov decision processes and exploit a connection between the graph-theoretical and probabilistic properties of these processes.
Abstract: This dissertation presents methods for the formal modeling and specification of probabilistic systems, and algorithms for the automated verification of these systems. Our system models describe the behavior of a system in terms of probability, nondeterminism, fairness and time. The formal specification languages we consider are based on extensions of branching-time temporal logics, and enable the expression of single-event and long-run average system properties. This latter class of properties, not expressible with previous formal languages, includes most of the performance properties studied in the field of performance evaluation, such as system throughput and average response time. Our choice of system models and specification languages has been guided by the goal of providing efficient verification algorithms. The algorithms rely on the theory of Markov decision processes, and exploit a connection between the graph-theoretical and probabilistic properties of these processes. This connection also leads to new results about classical problems, such as an extension to the solvable cases of the stochastic shortest path problem, an improved algorithm for the computation of reachability probabilities, and new results on the average reward problem for semi-Markov decision processes.

435 citations


Journal ArticleDOI
TL;DR: It is argued that the ICL provides a natural and concise representation for multi-agent decision-making under uncertainty that allows for the representation of structured probability tables, the dynamic construction of networks and a way to handle uncertainty and decisions in a logical representation.

428 citations


Journal ArticleDOI
TL;DR: This paper gives the explicit form for a class of adaptive policies that possess optimal increase rate properties for the total expected finite horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law.
Abstract: In this paper we consider the problem of adaptive control for Markov Decision Processes. We give the explicit form for a class of adaptive policies that possess optimal increase rate properties for the total expected finite horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of the proposed policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations.

255 citations


Proceedings Article
27 Jul 1997
TL;DR: This work provides an algorithm for finding the coarsest homogeneous refinement of any partition of the state space of an MDP, and shows that simple variations on this algorithm are equivalent or closely similar to several different recently published algorithms for finding optimal solutions to factored Markov decision processes.
Abstract: We use the notion of stochastic bisimulation homogeneity to analyze planning problems represented as Markov decision processes (MDPs). Informally, a partition of the state space for an MDP is said to be homogeneous if for each action, states in the same block have the same probability of being carried to each other block. We provide an algorithm for finding the coarsest homogeneous refinement of any partition of the state space of an MDP. The resulting partition can be used to construct a reduced MDP which is minimal in a well defined sense and can be used to solve the original MDP. Our algorithm is an adaptation of known automata minimization algorithms, and is designed to operate naturally on factored or implicit representations in which the full state space is never explicitly enumerated. We show that simple variations on this algorithm are equivalent or closely similar to several different recently published algorithms for finding optimal solutions to (partially or fully observable) factored Markov decision processes, thereby providing alternative descriptions of the methods and results regarding those algorithms.

228 citations
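The splitting step behind such homogeneity algorithms can be illustrated on an explicit (non-factored) MDP: states stay in the same block only if, for every action, they assign the same probability mass to every current block. A rough sketch under that reading, with a hypothetical dictionary-based transition model P[(s, a)]; the paper's algorithm performs the same refinement on factored representations without enumerating states.

```python
from collections import defaultdict

def refine_once(partition, actions, P):
    """One block-splitting pass toward a homogeneous (bisimulation) partition.

    P[(s, a)] is a dict mapping next_state -> probability.  Iterating until
    the number of blocks stops growing yields the coarsest homogeneous
    refinement of the initial partition.
    """
    block_of = {s: i for i, blk in enumerate(partition) for s in blk}
    new_partition = []
    for blk in partition:
        groups = defaultdict(set)
        for s in blk:
            signature = []
            for a in actions:
                mass = defaultdict(float)
                for s2, p in P[(s, a)].items():
                    mass[block_of[s2]] += p
                signature.append(tuple(sorted((b, round(m, 9)) for b, m in mass.items())))
            groups[tuple(signature)].add(s)
        new_partition.extend(groups.values())
    return new_partition

# Toy 4-state chain: states 2 and 3 are absorbing; start from a reward-based split.
P = {
    (0, "a"): {2: 1.0},
    (1, "a"): {3: 1.0},
    (2, "a"): {2: 1.0},
    (3, "a"): {3: 1.0},
}
partition = [{0, 1}, {2}, {3}]
while True:
    refined = refine_once(partition, ["a"], P)
    if len(refined) == len(partition):
        break
    partition = refined
print(partition)   # four singleton blocks: 0 and 1 behave differently
```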


Journal ArticleDOI
TL;DR: A Markov analysis performed with current computer software programs provides a flexible and convenient means of modeling long-term scenarios, however, novices should be aware of several potential pitfalls when attempting to use these programs.
Abstract: Clinical decisions often have long-term implications. Analysts encounter difficulties when employing conventional decision-analytic methods to model these scenarios. This occurs because probability and utility variables often change with time and conventional decision trees do not easily capture this dynamic quality. A Markov analysis performed with current computer software programs provides a flexible and convenient means of modeling long-term scenarios. However, novices should be aware of several potential pitfalls when attempting to use these programs. When deciding how to model a given clinical problem, the analyst must weigh the simplicity and clarity of a conventional tree against the fidelity of a Markov analysis. In direct comparisons, both approaches gave the same qualitative answers.

228 citations
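In practice, the Markov analyses described here iterate a patient cohort through a small set of health states, accumulating discounted utility per cycle. A minimal illustrative sketch with made-up state names and numbers (not taken from the paper):

```python
import numpy as np

# Hypothetical three-state clinical model: Well, Sick, Dead (absorbing).
P = np.array([
    [0.85, 0.10, 0.05],   # from Well
    [0.00, 0.70, 0.30],   # from Sick
    [0.00, 0.00, 1.00],   # from Dead
])
utility = np.array([1.0, 0.6, 0.0])   # quality weight per cycle in each state
discount = 0.97                        # per-cycle discount factor

cohort = np.array([1.0, 0.0, 0.0])     # everyone starts Well
qalys = 0.0
for cycle in range(50):                # 50 yearly cycles
    qalys += (discount ** cycle) * cohort @ utility
    cohort = cohort @ P                # advance the cohort one cycle
print(f"Discounted QALYs per patient: {qalys:.2f}")
```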


Book ChapterDOI
01 Jan 1997
TL;DR: Risk-sensitive control is an area of significant current interest in stochastic control theory; it generalizes the classical, risk-neutral approach by minimizing the expectation of an exponential of the sum of costs, a criterion that depends not only on the expected cost but on higher-order moments as well.
Abstract: Risk-sensitive control is an area of significant current interest in stochastic control theory. It is a generalization of the classical, risk-neutral approach, in which we seek to minimize the expectation of an exponential of the sum of costs, a criterion that depends not only on the expected cost but on higher-order moments as well.

219 citations
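Concretely, the risk-sensitive criterion is the scaled log of an expected exponential of the accumulated cost; a standard small-risk expansion (stated here as a general identity, not quoted from the chapter) shows why higher-order moments enter:

$$
J_\theta(\pi) \;=\; \frac{1}{\theta}\,\log \mathbb{E}^{\pi}\!\left[\exp\!\Big(\theta \sum_{t} c(x_t,u_t)\Big)\right]
\;=\; \mathbb{E}^{\pi}\Big[\sum_{t} c(x_t,u_t)\Big]
\;+\; \frac{\theta}{2}\,\operatorname{Var}^{\pi}\Big[\sum_{t} c(x_t,u_t)\Big]
\;+\; O(\theta^{2}),
$$

so that as $\theta \to 0$ the criterion recovers the risk-neutral expected cost, while $\theta > 0$ additionally penalizes the variance and higher-order cumulants of the total cost.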


Journal ArticleDOI
TL;DR: An abstraction technique for MDPs that allows approximately optimal solutions to be computed quickly and described methods by which the abstract solution can be viewed as a set of default reactions that can be improved incrementally, and used as a heuristic for search-based planning or other MDP methods.

179 citations


Proceedings ArticleDOI
10 Dec 1997
TL;DR: A hierarchical algorithm approach for efficient solution of sensor scheduling problems with large numbers of objects, based on a combination of stochastic dynamic programming and nondifferentiable optimization techniques is described.
Abstract: This paper studies the problem of dynamic scheduling of multi-mode sensor resources for the problem of classification of multiple unknown objects. Because of the uncertain nature of the object types, the problem is formulated as a partially observed Markov decision problem with a large state space. The paper describes a hierarchical algorithm approach for efficient solution of sensor scheduling problems with large numbers of objects, based on a combination of stochastic dynamic programming and nondifferentiable optimization techniques. The algorithm is illustrated with an application involving classification of 10,000 unknown objects.

Proceedings ArticleDOI
14 Dec 1997
TL;DR: A stochastic model for dialogue systems based on the Markov decision process is introduced, showing that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach.
Abstract: We introduce a stochastic model for dialogue systems based on the Markov decision process. Within this framework we show that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach. The advantages of this new paradigm include objective evaluation of dialogue systems and their automatic design and adaptation. We show some preliminary results on learning a dialogue strategy for an air travel information system.
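Once dialogue management is cast as an MDP over dialogue states (e.g., which information slots are filled) and system actions, the strategy can be learned with standard reinforcement learning. Below is a hedged tabular Q-learning sketch against a toy slot-filling simulator; the environment, state encoding, and reward numbers are illustrative assumptions, not the paper's air-travel system.

```python
import random
from collections import defaultdict

class ToyDialogueEnv:
    """Hypothetical slot-filling simulator: state = number of slots filled (0-3)."""
    actions = ["ask", "present"]

    def reset(self):
        return 0

    def step(self, state, action):
        if action == "ask":
            nxt = min(state + 1, 3) if random.random() < 0.8 else state
            return nxt, -1.0, False          # small cost per extra turn
        # "present" ends the dialogue; it succeeds only with all 3 slots filled.
        return state, (20.0 if state == 3 else -10.0), True

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning of a dialogue strategy over (state, action) pairs."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    # The greedy policy with respect to Q is the learned dialogue strategy.
    return Q

Q = q_learning(ToyDialogueEnv())
print(max(["ask", "present"], key=lambda act: Q[(3, act)]))   # expect "present"
```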

Journal ArticleDOI
TL;DR: In the control of multiclass queueing networks, it is found that there is a close connection between optimization of the network and optimal control of a far simpler fluid network model.
Abstract: The average cost optimal control problem is addressed for Markov decision processes with unbounded cost. It is found that the policy iteration algorithm generates a sequence of policies which are c-regular, where c is the cost function under consideration. This result only requires the existence of an initial c-regular policy and an irreducibility condition on the state space. Furthermore, under these conditions the sequence of relative value functions generated by the algorithm is bounded from below and "nearly" decreasing, from which it follows that the algorithm is always convergent. Under further conditions, it is shown that the algorithm does compute a solution to the optimality equations and hence an optimal average cost policy. These results provide elementary criteria for the existence of optimal policies for Markov decision processes with unbounded cost and recover known results for the standard linear-quadratic-Gaussian problem. In particular, in the control of multiclass queueing networks, it is found that there is a close connection between optimization of the network and optimal control of a far simpler fluid network model.
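For a finite state and action space, the policy iteration algorithm analysed here alternates exact policy evaluation with greedy improvement. The following is a textbook discounted-cost sketch; the paper's contribution, convergence for the average-cost criterion with unbounded costs under c-regularity and irreducibility, is not captured by this simplification.

```python
import numpy as np

def policy_iteration(P, c, gamma=0.95, iters=100):
    """Policy iteration for a finite MDP with cost minimization.

    P : array (A, S, S) of transition probabilities
    c : array (S, A) of one-step costs
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = c_pi exactly.
        P_pi = P[policy, np.arange(S), :]
        c_pi = c[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
        # Policy improvement: greedy (cost-minimizing) one-step lookahead.
        Q = c.T + gamma * P @ V           # shape (A, S)
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```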

ReportDOI
01 Jan 1997
TL;DR: The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather toPresent a conceptual framework that might serve as an introduction to a more rigorous study of RL.
Abstract: The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented, along with the most popular RL algorithms. Section (1) presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section (3) gives a description of the most widely used reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section (4) some of the ancillary issues of RL are briefly discussed, such as choosing an exploration strategy and a discount factor. The conclusion is given in Section (5). Finally, Section (6) is a glossary of commonly used terms followed by references and bibliography.
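As one concrete instance of the algorithms the tutorial surveys, here is a hedged sketch of tabular TD(lambda) policy evaluation with accumulating eligibility traces; the environment interface (reset/step) and the policy function are assumptions introduced for illustration, not code from the report.

```python
import numpy as np

def td_lambda(env, policy, n_states, episodes=1000, alpha=0.05, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) policy evaluation with accumulating traces.

    `env` is a hypothetical episodic environment with reset() -> state and
    step(state, action) -> (next_state, reward, done); `policy(state)` returns
    the action to take.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)            # eligibility traces
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(s, policy(s))
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0                   # accumulating trace for the visited state
            V += alpha * delta * e        # update all recently visited states
            e *= gamma * lam              # decay the traces
            s = s2
    return V
```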

Proceedings Article
01 Dec 1997
TL;DR: A more general form of temporally abstract model is introduced, the multi-time model, and its suitability for planning and learning by virtue of its relationship to the Bellman equations is established.
Abstract: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a grid world planning task.

Proceedings Article
27 Jul 1997
TL;DR: A simple variable-grid solution method which yields good results on relatively large problems with modest computational effort is described.
Abstract: Partially observable Markov decision processes (POMDPs) are an appealing tool for modeling planning problems under uncertainty. They incorporate stochastic action and sensor descriptions and easily capture goal-oriented and process-oriented tasks. Unfortunately, POMDPs are very difficult to solve. Exact methods cannot handle problems with much more than 10 states, so approximate methods must be used. In this paper, we describe a simple variable-grid solution method which yields good results on relatively large problems with modest computational effort.

Proceedings Article
01 Dec 1997
TL;DR: This paper presents a new theoretically sound dynamic programming algorithm for finding an optimal policy for the composite MDP, analyzes various aspects of this algorithm, and illustrates its use on a simple merging problem.
Abstract: We are frequently called upon to perform multiple tasks that compete for our attention and resources. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to efficiently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoretically sound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem.

Proceedings Article
01 Aug 1997
TL;DR: A method for solving implicit (factored) Markov decision processes (MDPs) with very large state spaces using an ε-homogeneous partition, and algorithms that operate on BMDPs to find policies that are approximately optimal with respect to the original MDP, are presented.
Abstract: We present a method for solving implicit (factored) Markov decision processes (MDPs) with very large state spaces. We introduce a property of state space partitions which we call ε-homogeneity. Intuitively, an ε-homogeneous partition groups together states that behave approximately the same under all or some subset of policies. Borrowing from recent work on model minimization in computer-aided software verification, we present an algorithm that takes a factored representation of an MDP and a value 0 ≤ ε ≤ 1 and computes a factored ε-homogeneous partition of the state space. This partition defines a family of related MDPs: those MDPs with state space equal to the blocks of the partition, and transition probabilities "approximately" like those of any (original MDP) state in the source block. To formally study such families of MDPs, we introduce the new notion of a "bounded parameter MDP" (BMDP), which is a family of (traditional) MDPs defined by specifying upper and lower bounds on the transition probabilities and rewards. We describe algorithms that operate on BMDPs to find policies that are approximately optimal with respect to the original MDP. In combination, our method for reducing a large implicit MDP to a possibly much smaller BMDP using an ε-homogeneous partition, and our methods for selecting actions in BMDPs, constitute a new approach for analyzing large implicit MDPs. Among its advantages, this new approach provides insight into existing algorithms for solving implicit MDPs, provides useful connections to work in automata theory and model minimization, and suggests methods, which involve varying ε, to trade time and space (specifically in terms of the size of the corresponding state space) for solution quality.

Book ChapterDOI
TL;DR: The notion of a bounded parameter Markov decision process (BMDP) is introduced as a generalization of the familiar exact MDP to represent variation or uncertainty concerning the parameters of sequential decision problems in cases where no prior probabilities on the parameter values are available.
Abstract: In this paper, we introduce the notion of a bounded parameter Markov decision process (BMDP) as a generalization of the familiar exact MDP. A bounded parameter MDP is a set of exact MDPs specified by giving upper and lower bounds on transition probabilities and rewards (all the MDPs in the set share the same state and action space). BMDPs form an efficiently solvable special case of the already known class of MDPs with imprecise parameters (MDPIPs). Bounded parameter MDPs can be used to represent variation or uncertainty concerning the parameters of sequential decision problems in cases where no prior probabilities on the parameter values are available. Bounded parameter MDPs can also be used in aggregation schemes to represent the variation in the transition probabilities for different base states aggregated together in the same aggregate state.
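Solving a BMDP typically uses interval Bellman backups: the pessimistic value of a state-action pair is obtained by letting an adversary push the free probability mass toward the worst next states while respecting the stated bounds. The sketch below is one plausible realization of such a lower-bound backup, not necessarily the paper's exact procedure; the argument names are illustrative.

```python
import numpy as np

def worst_case_backup(V, p_lo, p_hi, reward, gamma=0.95):
    """Lower (pessimistic) Bellman backup for one state-action pair of a BMDP.

    p_lo, p_hi : per-next-state lower/upper transition probability bounds.
    The slack mass (1 - sum(p_lo)) is assigned to the next states with the
    lowest values first, up to their upper bounds.  An optimistic backup is
    the same computation with the values sorted in the opposite order.
    """
    order = np.argsort(V)                 # worst next states first
    p = p_lo.astype(float)
    slack = 1.0 - p.sum()
    for s in order:
        add = min(p_hi[s] - p[s], slack)
        p[s] += add
        slack -= add
        if slack <= 1e-12:
            break
    return reward + gamma * p @ V

# Two next states; the bounds allow shifting up to 0.3 of mass between them.
V = np.array([0.0, 10.0])
print(worst_case_backup(V, p_lo=np.array([0.3, 0.4]),
                        p_hi=np.array([0.6, 0.7]), reward=1.0))   # -> 4.8
```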

Proceedings Article
01 Dec 1997
TL;DR: In this article, a new policy iteration algorithm for partially observable Markov decision processes is presented that is simpler and more efficient than an earlier algorithm of Sondik (1971, 1978).
Abstract: A new policy iteration algorithm for partially observable Markov decision processes is presented that is simpler and more efficient than an earlier policy iteration algorithm of Sondik (1971, 1978). The key simplification is representation of a policy as a finite-state controller. This representation makes policy evaluation straightforward. The paper's contribution is to show that the dynamic-programming update used in the policy improvement step can be interpreted as the transformation of a finite-state controller into an improved finite-state controller. The new algorithm consistently outperforms value iteration as an approach to solving infinite-horizon problems.
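With the policy represented as a finite-state controller, policy evaluation reduces to solving a linear system over (controller node, hidden state) pairs. Below is a hedged numpy sketch of that evaluation step; the array layouts for the POMDP model (T, O, R) are assumptions for illustration.

```python
import numpy as np

def evaluate_controller(node_action, node_next, T, O, R, gamma=0.95):
    """Evaluate a finite-state controller for a POMDP.

    node_action[n]   : action taken at controller node n
    node_next[n][o]  : successor node after observing o
    T[a, s, s'], O[a, s', o], R[s, a] : POMDP model arrays
    Returns V with V[n, s] = value of running the controller from node n
    when the hidden state is s, via the linear Bellman equations.
    """
    N, S, Obs = len(node_action), T.shape[1], O.shape[2]
    A_mat = np.eye(N * S)
    b = np.zeros(N * S)
    for n in range(N):
        a = node_action[n]
        for s in range(S):
            row = n * S + s
            b[row] = R[s, a]
            for s2 in range(S):
                for o in range(Obs):
                    n2 = node_next[n][o]
                    A_mat[row, n2 * S + s2] -= gamma * T[a, s, s2] * O[a, s2, o]
    return np.linalg.solve(A_mat, b).reshape(N, S)
```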

Journal ArticleDOI
TL;DR: A modified minimal repair/replacement problem that is formulated as a Markov decision process is studied, and it is shown that a control limit policy, or in particular a (t, T) policy, is optimal over the space of all possible policies under the discounted cost criterion.

Proceedings Article
27 Jul 1997
TL;DR: Novel incremental versions of a grid-based linear interpolation method and a simple lower bound method with Sondik's updates are introduced, along with a new method for computing an initial upper bound: the fast informed bound method.
Abstract: Partially observable Markov decision processes (POMDPs) allow one to model complex dynamic decision or control problems that include both action outcome uncertainty and imperfect observability. The control problem is formulated as a dynamic optimization problem with a value function combining costs or rewards from multiple steps. In this paper we propose, analyse and test various incremental methods for computing bounds on the value function for control problems with infinite discounted horizon criteria. The methods described and tested include novel incremental versions of a grid-based linear interpolation method and a simple lower bound method with Sondik's updates. Both of these can work with arbitrary points of the belief space and can be enhanced by various heuristic point selection strategies. Also introduced is a new method for computing an initial upper bound: the fast informed bound method. This method is able to improve significantly on the standard and commonly used upper bound computed by the MDP-based method. The quality of the resulting bounds is tested on a maze navigation problem with 20 states, 6 actions and 8 observations.

Dissertation
01 Jan 1997
TL;DR: Experimental results show that methods that preserve the shape of the value function over updates, such as the newly designed incremental linear vector and fast informed bound methods, tend to outperform other methods on the control performance test.
Abstract: Partially observable Markov decision processes (POMDPs) can be used to model complex control problems that include both action outcome uncertainty and imperfect observability. A control problem within the POMDP framework is expressed as a dynamic optimization problem with a value function that combines costs or rewards from multiple steps. Although the POMDP framework is more expressive than other simpler frameworks, like Markov decision processes (MDPs), its associated optimization methods are more demanding computationally and only very small problems can be solved exactly in practice. The thesis focuses on two possible approaches that can be used to solve larger problems: approximation methods and exploitation of additional problem structure. First, a number of new efficient approximation methods and improvements of existing algorithms are proposed. These include (1) the fast informed bound method based on approximate dynamic programming updates that lead to piecewise linear and convex value functions with a constant number of linear vectors, (2) a grid-based point interpolation method that supports variable grids, (3) an incremental version of the linear vector method that updates value function derivatives, as well as (4) various heuristics for selecting grid-points. The new and existing methods are experimentally tested and compared on a set of three infinite discounted horizon problems of different complexity. The experimental results show that methods that preserve the shape of the value function over updates, such as the newly designed incremental linear vector and fast informed bound methods, tend to outperform other methods on the control performance test. Second, the thesis presents a number of techniques for exploiting additional structure in the model of complex control problems. These are studied as applied to a medical therapy planning problem--the management of patients with chronic ischemic heart disease. The new extensions proposed include factored and hierarchically structured models that combine the advantages of the POMDP and MDP frameworks and cut down the size and complexity of the information state space.

Journal ArticleDOI
TL;DR: The goal of this paper is to provide a theory of N-person Markov games with unbounded cost, for a countable state space and compact action spaces; the zero-sum two-player game is investigated as a special case, for which the convergence of the value iteration algorithm is established.
Abstract: The goal of this paper is to provide a theory of N-person Markov games with unbounded cost, for a countable state space and compact action spaces. We investigate both the finite and infinite horizon problems. For the latter, we consider the discounted cost as well as the expected average cost. We present conditions for the infinite horizon problems under which equilibrium policies exist for all players within the stationary policies, and show that the costs in equilibrium satisfy the optimality equations. Similar results are obtained for the finite horizon costs, for which equilibrium policies are shown to exist for all players within the Markov policies. As a special case of N-person games, we investigate the zero-sum two-player game, for which we establish the convergence of the value iteration algorithm. We conclude by studying an application of a zero-sum Markov game in a queueing model.

Proceedings Article
23 Aug 1997
TL;DR: An abstraction mechanism is used to generate abstract MDPs associated with different objectives, and several methods for merging the policies for these different objectives are considered.
Abstract: We describe an approach to goal decomposition for a certain class of Markov decision processes (MDPs). An abstraction mechanism is used to generate abstract MDPs associated with different objectives, and several methods for merging the policies for these different objectives are considered. In one technique, causal (least-commitment) structures are generated for

Journal ArticleDOI
TL;DR: Some illustrative examples are provided to show how to model complex stochastic decision systems by using dependent-chance programming and how to solve these models by employing a Monte Carlo simulation based genetic algorithm.

Journal ArticleDOI
TL;DR: This paper proposes a new approximation scheme to transform a POMDP into another one where additional information is provided by an oracle and uses its optimal policy to construct an approximate policy for the original POMDP.
Abstract: Partially observable Markov decision processes (POMDPs) are a natural model for planning problems where effects of actions are nondeterministic and the state of the world is not completely observable. It is difficult to solve POMDPs exactly. This paper proposes a new approximation scheme. The basic idea is to transform a POMDP into another one where additional information is provided by an oracle. The oracle informs the planning agent that the current state of the world is in a certain region. The transformed POMDP is consequently said to be region observable. It is easier to solve than the original POMDP. We propose to solve the transformed POMDP and use its optimal policy to construct an approximate policy for the original POMDP. By controlling the amount of additional information that the oracle provides, it is possible to find a proper tradeoff between computational time and approximation quality. In terms of algorithmic contributions, we study in detail how to exploit region observability in solving the transformed POMDP. To facilitate the study, we also propose a new exact algorithm for general POMDPs. The algorithm is conceptually simple and yet is significantly more efficient than all previous exact algorithms.

Journal ArticleDOI
TL;DR: It is shown that the optimal value function of an MDP is monotone with respect to appropriately defined stochastic order relations, and conditions for continuity withrespect to suitable probability metrics are found.
Abstract: The present work deals with the comparison of discrete-time Markov decision processes (MDPs) which differ only in their transition probabilities. We show that the optimal value function of an MDP is monotone with respect to appropriately defined stochastic order relations. We also find conditions for continuity with respect to suitable probability metrics. The results are applied to some well-known examples, including inventory control and optimal stopping.

Proceedings Article
27 Jul 1997
TL;DR: It is shown how an NMDP, in which temporal logic is used to specify history dependence, can be automatically converted into an equivalent MDP by adding appropriate temporal variables.
Abstract: Markov Decision Processes (MDPs), currently a popular method for modeling and solving decision theoretic planning problems, are limited by the Markovian assumption: rewards and dynamics depend on the current state only, and not on previous history. Non-Markovian decision processes (NMDPs) can also be defined, but then the more tractable solution techniques developed for MDPs cannot be directly applied. In this paper, we show how an NMDP, in which temporal logic is used to specify history dependence, can be automatically converted into an equivalent MDP by adding appropriate temporal variables. The resulting MDP can be represented in a structured fashion and solved using structured policy construction methods. In many cases, this offers significant computational advantages over previous proposals for solving NMDPs.
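The construction can be illustrated for the simplest past-tense property, "a goal state has been visited at some point": adding one boolean history variable makes the reward Markovian in the augmented state. The sketch below covers only this special case and uses hypothetical data structures; the paper handles general past temporal logic formulas systematically.

```python
from itertools import product

def add_temporal_variable(states, goal_states, transitions):
    """Augment an MDP's state with a boolean 'seen_goal' history variable.

    transitions[s] is a dict: action -> {next_state: probability}.
    The reward that previously depended on the whole trajectory ("did we
    ever reach a goal?") becomes a function of the augmented state.
    """
    aug_states = list(product(states, [False, True]))
    aug_transitions = {}
    for s, seen in aug_states:
        aug_transitions[(s, seen)] = {}
        for a, succ in transitions[s].items():
            aug_transitions[(s, seen)][a] = {
                (s2, seen or (s2 in goal_states)): p for s2, p in succ.items()
            }
    return aug_states, aug_transitions

def reward(aug_state):
    # The formerly history-dependent reward now reads one temporal variable.
    _, seen_goal = aug_state
    return 1.0 if seen_goal else 0.0

# Toy example: two base states, state 1 is the goal.
states, goals = [0, 1], {1}
transitions = {0: {"go": {0: 0.5, 1: 0.5}}, 1: {"go": {0: 1.0}}}
aug_states, aug_T = add_temporal_variable(states, goals, transitions)
```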

Journal ArticleDOI
TL;DR: The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation.
Abstract: The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation. Convergence analysis, approximation issues and an example are studied.
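In tabular form, the two-time-scale structure amounts to updating the critic with a markedly larger step size than the actor, so the value estimates effectively equilibrate relative to the slowly changing policy. The following is a hedged sketch with a softmax actor; the environment interface, state/action counts, and step sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def actor_critic(env, n_states, n_actions, steps=50000,
                 critic_lr=0.05, actor_lr=0.005, gamma=0.99):
    """Tabular actor-critic with a softmax policy and two step sizes.

    `env` is a hypothetical environment with reset() -> state and
    step(state, action) -> (next_state, reward, done).
    """
    theta = np.zeros((n_states, n_actions))   # actor: policy parameters
    V = np.zeros(n_states)                     # critic: value estimates
    s = env.reset()
    for _ in range(steps):
        prefs = theta[s] - theta[s].max()
        pi = np.exp(prefs) / np.exp(prefs).sum()
        a = np.random.choice(n_actions, p=pi)
        s2, r, done = env.step(s, a)
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]
        V[s] += critic_lr * delta                 # fast time scale (critic)
        grad_log = -pi
        grad_log[a] += 1.0                        # gradient of log softmax policy
        theta[s] += actor_lr * delta * grad_log   # slow time scale (actor)
        s = env.reset() if done else s2
    return theta, V
```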