
Showing papers on "Markov decision process published in 1997"


Book ChapterDOI
01 Oct 1997
TL;DR: This paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small POMDPs taken from the learning literature, but shows that none are able to solve a slightly larger and noisier problem based on robot navigation.
Abstract: Partially observable Markov decision processes (POMDPs) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of POMDPs is motivated by a need to address realistic problems, existing techniques for finding optimal behavior do not appear to scale well and have been unable to find satisfactory policies for problems with more than a dozen states. After a brief review of POMDPs, this paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small POMDPs taken from the learning literature. In contrast, we show that none are able to solve a slightly larger and noisier problem based on robot navigation. We find that a combination of two novel approaches performs well on these problems and suggest methods for scaling to even larger and more complicated domains.

663 citations
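POMDP solution methods like those surveyed here operate on belief states: probability distributions over the hidden state, updated by Bayes' rule after each action and observation. As a point of reference (not code from the paper), here is a minimal sketch of that belief update; the array layouts T and O are illustrative assumptions.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update for a discrete POMDP.

    b : current belief over states, shape (S,)
    a : action index
    o : observation index
    T : transition probabilities, T[a, s, s'] = P(s' | s, a)
    O : observation probabilities, O[a, s', o] = P(o | s', a)
    """
    predicted = b @ T[a]                   # push belief through the transition model
    unnormalized = predicted * O[a, :, o]  # weight by likelihood of the observation
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / norm

# Tiny two-state, one-action example (numbers are illustrative only).
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, o=1, T=T, O=O))
```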


Proceedings Article
01 Aug 1997
TL;DR: It is found that incremental pruning is presently the most efficient exact method for solving POMDPs.
Abstract: Most exact algorithms for general partially observable Markov decision processes (POMDPs) use a form of dynamic programming in which a piecewise-linear and convex representation of one value function is transformed into another. We examine variations of the "incremental pruning" method for solving this problem and compare them to earlier algorithms from theoretical and empirical perspectives. We find that incremental pruning is presently the most efficient exact method for solving POMDPs.

441 citations
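Exact POMDP value iteration represents each value function as a finite set of alpha-vectors, and pruning discards vectors that are useless at every belief. The following is a hedged sketch of the cheap pointwise-dominance pre-filter only; the incremental pruning method studied in the paper additionally uses linear programs to remove vectors dominated by combinations of others.

```python
import numpy as np

def pointwise_prune(vectors):
    """Remove alpha-vectors dominated at every belief point (pre-filter only)."""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            np.all(w >= v) and np.any(w > v)
            for j, w in enumerate(vectors) if j != i
        )
        if not dominated:
            kept.append(v)
    return kept

# Three vectors over a 2-state belief space; the middle one is dominated.
vs = [np.array([1.0, 0.0]), np.array([0.4, 0.4]), np.array([0.5, 0.5])]
print(pointwise_prune(vs))
```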


BookDOI
Masaaki Kijima
TL;DR: Markov processes for stochastic modeling.
Abstract: Markov processes for stochastic modeling.

437 citations


Book
01 Jan 1997
TL;DR: This dissertation presents methods for the formal modeling and specification of probabilistic systems, and algorithms for the automated verification of these systems, which rely on the theory of Markov decision processes and exploit a connection between the graph-theoretical and probabilistic properties of these processes.
Abstract: This dissertation presents methods for the formal modeling and specification of probabilistic systems, and algorithms for the automated verification of these systems. Our system models describe the behavior of a system in terms of probability, nondeterminism, fairness and time. The formal specification languages we consider are based on extensions of branching-time temporal logics, and enable the expression of single-event and long-run average system properties. This latter class of properties, not expressible with previous formal languages, includes most of the performance properties studied in the field of performance evaluation, such as system throughput and average response time. Our choice of system models and specification languages has been guided by the goal of providing efficient verification algorithms. The algorithms rely on the theory of Markov decision processes, and exploit a connection between the graph-theoretical and probabilistic properties of these processes. This connection also leads to new results about classical problems, such as an extension to the solvable cases of the stochastic shortest path problem, an improved algorithm for the computation of reachability probabilities, and new results on the average reward problem for semi-Markov decision processes.

435 citations


Journal ArticleDOI
TL;DR: It is argued that the ICL provides a natural and concise representation for multi-agent decision-making under uncertainty that allows for the representation of structured probability tables, the dynamic construction of networks and a way to handle uncertainty and decisions in a logical representation.

428 citations


Journal ArticleDOI
TL;DR: This paper gives the explicit form for a class of adaptive policies that possess optimal increase rate properties for the total expected finite horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law.
Abstract: In this paper we consider the problem of adaptive control for Markov Decision Processes. We give the explicit form for a class of adaptive policies that possess optimal increase rate properties for the total expected finite horizon reward, under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of the proposed policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations.

255 citations


Proceedings Article
27 Jul 1997
TL;DR: This work provides an algorithm for finding the coarsest homogeneous refinement of any partition of the state space of an MDP, and shows that simple variations on this algorithm are equivalent or closely similar to several different recently published algorithms for finding optimal solutions to factored Markov decision processes.
Abstract: We use the notion of stochastic bisimulation homogeneity to analyze planning problems represented as Markov decision processes (MDPs). Informally, a partition of the state space for an MDP is said to be homogeneous if for each action, states in the same block have the same probability of being carried to each other block. We provide an algorithm for finding the coarsest homogeneous refinement of any partition of the state space of an MDP. The resulting partition can be used to construct a reduced MDP which is minimal in a well defined sense and can be used to solve the original MDP. Our algorithm is an adaptation of known automata minimization algorithms, and is designed to operate naturally on factored or implicit representations in which the full state space is never explicitly enumerated. We show that simple variations on this algorithm are equivalent or closely similar to several different recently published algorithms for finding optimal solutions to (partially or fully observable) factored Markov decision processes, thereby providing alternative descriptions of the methods and results regarding those algorithms.

228 citations
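The splitting step behind such homogeneity algorithms can be illustrated on an explicit (non-factored) MDP: states stay in the same block only if, for every action, they assign the same probability mass to every current block. A rough sketch under that reading, with a hypothetical dictionary-based transition model P[(s, a)]; the paper's algorithm performs the same refinement on factored representations without enumerating states.

```python
from collections import defaultdict

def refine_once(partition, actions, P):
    """One block-splitting pass toward a homogeneous (bisimulation) partition.

    P[(s, a)] is a dict mapping next_state -> probability.  Iterating until
    the number of blocks stops growing yields the coarsest homogeneous
    refinement of the initial partition.
    """
    block_of = {s: i for i, blk in enumerate(partition) for s in blk}
    new_partition = []
    for blk in partition:
        groups = defaultdict(set)
        for s in blk:
            signature = []
            for a in actions:
                mass = defaultdict(float)
                for s2, p in P[(s, a)].items():
                    mass[block_of[s2]] += p
                signature.append(tuple(sorted((b, round(m, 9)) for b, m in mass.items())))
            groups[tuple(signature)].add(s)
        new_partition.extend(groups.values())
    return new_partition

# Toy 4-state chain: states 2 and 3 are absorbing; start from a reward-based split.
P = {
    (0, "a"): {2: 1.0},
    (1, "a"): {3: 1.0},
    (2, "a"): {2: 1.0},
    (3, "a"): {3: 1.0},
}
partition = [{0, 1}, {2}, {3}]
while True:
    refined = refine_once(partition, ["a"], P)
    if len(refined) == len(partition):
        break
    partition = refined
print(partition)   # four singleton blocks: 0 and 1 behave differently
```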


Journal ArticleDOI
TL;DR: A Markov analysis performed with current computer software programs provides a flexible and convenient means of modeling long-term scenarios, however, novices should be aware of several potential pitfalls when attempting to use these programs.
Abstract: Clinical decisions often have long-term implications. Analysts encounter difficulties when employing conventional decision-analytic methods to model these scenarios. This occurs because probability and utility variables often change with time and conventional decision trees do not easily capture this dynamic quality. A Markov analysis performed with current computer software programs provides a flexible and convenient means of modeling long-term scenarios. However, novices should be aware of several potential pitfalls when attempting to use these programs. When deciding how to model a given clinical problem, the analyst must weigh the simplicity and clarity of a conventional tree against the fidelity of a Markov analysis. In direct comparisons, both approaches gave the same qualitative answers.

228 citations
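In practice, the Markov analyses described here iterate a patient cohort through a small set of health states, accumulating discounted utility per cycle. A minimal illustrative sketch with made-up state names and numbers (not taken from the paper):

```python
import numpy as np

# Hypothetical three-state clinical model: Well, Sick, Dead (absorbing).
P = np.array([
    [0.85, 0.10, 0.05],   # from Well
    [0.00, 0.70, 0.30],   # from Sick
    [0.00, 0.00, 1.00],   # from Dead
])
utility = np.array([1.0, 0.6, 0.0])   # quality weight per cycle in each state
discount = 0.97                        # per-cycle discount factor

cohort = np.array([1.0, 0.0, 0.0])     # everyone starts Well
qalys = 0.0
for cycle in range(50):                # 50 yearly cycles
    qalys += (discount ** cycle) * cohort @ utility
    cohort = cohort @ P                # advance the cohort one cycle
print(f"Discounted QALYs per patient: {qalys:.2f}")
```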


Book ChapterDOI
01 Jan 1997
TL;DR: Risk-sensitive control is an area of significant current interest in stochastic control theory; it generalizes the classical, risk-neutral approach by minimizing the expectation of an exponential of the sum of costs, a criterion that depends not only on the expected cost but on higher-order moments as well.
Abstract: Risk-sensitive control is an area of significant current interest in stochastic control theory. It is a generalization of the classical, risk-neutral approach, in which we seek to minimize the expectation of an exponential of the sum of costs, a criterion that depends not only on the expected cost but on higher-order moments as well.

219 citations
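Concretely, the risk-sensitive criterion is the scaled log of an expected exponential of the accumulated cost; a standard small-risk expansion (stated here as a general identity, not quoted from the chapter) shows why higher-order moments enter:

$$
J_\theta(\pi) \;=\; \frac{1}{\theta}\,\log \mathbb{E}^{\pi}\!\left[\exp\!\Big(\theta \sum_{t} c(x_t,u_t)\Big)\right]
\;=\; \mathbb{E}^{\pi}\Big[\sum_{t} c(x_t,u_t)\Big]
\;+\; \frac{\theta}{2}\,\operatorname{Var}^{\pi}\Big[\sum_{t} c(x_t,u_t)\Big]
\;+\; O(\theta^{2}),
$$

so that as $\theta \to 0$ the criterion recovers the risk-neutral expected cost, while $\theta > 0$ additionally penalizes the variance and higher-order cumulants of the total cost.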


Journal ArticleDOI
TL;DR: An abstraction technique for MDPs that allows approximately optimal solutions to be computed quickly and described methods by which the abstract solution can be viewed as a set of default reactions that can be improved incrementally, and used as a heuristic for search-based planning or other MDP methods.

179 citations


Proceedings ArticleDOI
10 Dec 1997
TL;DR: A hierarchical algorithm approach for efficient solution of sensor scheduling problems with large numbers of objects, based on a combination of stochastic dynamic programming and nondifferentiable optimization techniques is described.
Abstract: This paper studies the problem of dynamic scheduling of multi-mode sensor resources for the problem of classification of multiple unknown objects. Because of the uncertain nature of the object types, the problem is formulated as a partially observed Markov decision problem with a large state space. The paper describes a hierarchical algorithm approach for efficient solution of sensor scheduling problems with large numbers of objects, based on a combination of stochastic dynamic programming and nondifferentiable optimization techniques. The algorithm is illustrated with an application involving classification of 10,000 unknown objects.

Proceedings ArticleDOI
14 Dec 1997
TL;DR: A stochastic model for dialogue systems based on the Markov decision process is introduced, showing that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach.
Abstract: We introduce a stochastic model for dialogue systems based on the Markov decision process. Within this framework we show that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach. The advantages of this new paradigm include objective evaluation of dialogue systems and their automatic design and adaptation. We show some preliminary results on learning a dialogue strategy for an air travel information system.
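Once dialogue management is cast as an MDP over dialogue states (e.g., which information slots are filled) and system actions, the strategy can be learned with standard reinforcement learning. Below is a hedged tabular Q-learning sketch against a toy slot-filling simulator; the environment, state encoding, and reward numbers are illustrative assumptions, not the paper's air-travel system.

```python
import random
from collections import defaultdict

class ToyDialogueEnv:
    """Hypothetical slot-filling simulator: state = number of slots filled (0-3)."""
    actions = ["ask", "present"]

    def reset(self):
        return 0

    def step(self, state, action):
        if action == "ask":
            nxt = min(state + 1, 3) if random.random() < 0.8 else state
            return nxt, -1.0, False          # small cost per extra turn
        # "present" ends the dialogue; it succeeds only with all 3 slots filled.
        return state, (20.0 if state == 3 else -10.0), True

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning of a dialogue strategy over (state, action) pairs."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    # The greedy policy with respect to Q is the learned dialogue strategy.
    return Q

Q = q_learning(ToyDialogueEnv())
print(max(["ask", "present"], key=lambda act: Q[(3, act)]))   # expect "present"
```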

Journal ArticleDOI
TL;DR: In the control of multiclass queueing networks, it is found that there is a close connection between optimization of the network and optimal control of a far simpler fluid network model.
Abstract: The average cost optimal control problem is addressed for Markov decision processes with unbounded cost. It is found that the policy iteration algorithm generates a sequence of policies which are c-regular, where c is the cost function under consideration. This result only requires the existence of an initial c-regular policy and an irreducibility condition on the state space. Furthermore, under these conditions the sequence of relative value functions generated by the algorithm is bounded from below and "nearly" decreasing, from which it follows that the algorithm is always convergent. Under further conditions, it is shown that the algorithm does compute a solution to the optimality equations and hence an optimal average cost policy. These results provide elementary criteria for the existence of optimal policies for Markov decision processes with unbounded cost and recover known results for the standard linear-quadratic-Gaussian problem. In particular, in the control of multiclass queueing networks, it is found that there is a close connection between optimization of the network and optimal control of a far simpler fluid network model.
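For a finite state and action space, the policy iteration algorithm analysed here alternates exact policy evaluation with greedy improvement. The following is a textbook discounted-cost sketch; the paper's contribution, convergence for the average-cost criterion with unbounded costs under c-regularity and irreducibility, is not captured by this simplification.

```python
import numpy as np

def policy_iteration(P, c, gamma=0.95, iters=100):
    """Policy iteration for a finite MDP with cost minimization.

    P : array (A, S, S) of transition probabilities
    c : array (S, A) of one-step costs
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = c_pi exactly.
        P_pi = P[policy, np.arange(S), :]
        c_pi = c[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
        # Policy improvement: greedy (cost-minimizing) one-step lookahead.
        Q = c.T + gamma * P @ V           # shape (A, S)
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```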

ReportDOI
01 Jan 1997
TL;DR: The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather toPresent a conceptual framework that might serve as an introduction to a more rigorous study of RL.
Abstract: The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented, along with the most popular RL algorithms. Section (1) presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section (3) gives a description of the most widely used reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section (4) some of the ancillary issues of RL are briefly discussed, such as choosing an exploration strategy and a discount factor. The conclusion is given in Section (5). Finally, Section (6) is a glossary of commonly used terms followed by references and bibliography.
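As one concrete instance of the algorithms the tutorial surveys, here is a hedged sketch of tabular TD(lambda) policy evaluation with accumulating eligibility traces; the environment interface (reset/step) and the policy function are assumptions introduced for illustration, not code from the report.

```python
import numpy as np

def td_lambda(env, policy, n_states, episodes=1000, alpha=0.05, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) policy evaluation with accumulating traces.

    `env` is a hypothetical episodic environment with reset() -> state and
    step(state, action) -> (next_state, reward, done); `policy(state)` returns
    the action to take.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)            # eligibility traces
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(s, policy(s))
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0                   # accumulating trace for the visited state
            V += alpha * delta * e        # update all recently visited states
            e *= gamma * lam              # decay the traces
            s = s2
    return V
```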

Proceedings Article
01 Dec 1997
TL;DR: A more general form of temporally abstract model is introduced, the multi-time model, and its suitability for planning and learning by virtue of its relationship to the Bellman equations is established.
Abstract: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a grid world planning task.

Proceedings Article
27 Jul 1997
TL;DR: A simple variable-grid solution method which yields good results on relatively large problems with modest computational effort is described.
Abstract: Partially observable Markov decision processes (POMDPs) are an appealing tool for modeling planning problems under uncertainty. They incorporate stochastic action and sensor descriptions and easily capture goal-oriented and process-oriented tasks. Unfortunately, POMDPs are very difficult to solve. Exact methods cannot handle problems with much more than 10 states, so approximate methods must be used. In this paper, we describe a simple variable-grid solution method which yields good results on relatively large problems with modest computational effort.

Proceedings Article
01 Dec 1997
TL;DR: This paper presents a new theoretically sound dynamic programming algorithm for finding an optimal policy for the composite MDP, analyzes various aspects of this algorithm, and illustrates its use on a simple merging problem.
Abstract: We are frequently called upon to perform multiple tasks that compete for our attention and resources. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to efficiently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoretically sound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem.

Proceedings Article
01 Aug 1997
TL;DR: A method for solving implicit (factored) Markov decision processes (MDPs) with very large state spaces using an ε-homogeneous partition, and algorithms that operate on BMDPs to find policies that are approximately optimal with respect to the original MDP, are presented.
Abstract: We present a method for solving implicit (factored) Markov decision processes (MDPs) with very large state spaces. We introduce a property of state space partitions which we call ε-homogeneity. Intuitively, an ε-homogeneous partition groups together states that behave approximately the same under all or some subset of policies. Borrowing from recent work on model minimization in computer-aided software verification, we present an algorithm that takes a factored representation of an MDP and a value 0 ≤ ε ≤ 1 and computes a factored ε-homogeneous partition of the state space. This partition defines a family of related MDPs: those MDPs with state space equal to the blocks of the partition, and transition probabilities "approximately" like those of any (original MDP) state in the source block. To formally study such families of MDPs, we introduce the new notion of a "bounded parameter MDP" (BMDP), which is a family of (traditional) MDPs defined by specifying upper and lower bounds on the transition probabilities and rewards. We describe algorithms that operate on BMDPs to find policies that are approximately optimal with respect to the original MDP. In combination, our method for reducing a large implicit MDP to a possibly much smaller BMDP using an ε-homogeneous partition, and our methods for selecting actions in BMDPs, constitute a new approach for analyzing large implicit MDPs. Among its advantages, this new approach provides insight into existing algorithms for solving implicit MDPs, provides useful connections to work in automata theory and model minimization, and suggests methods, which involve varying ε, to trade time and space (specifically in terms of the size of the corresponding state space) for solution quality.

Book ChapterDOI
TL;DR: The notion of a bounded parameter Markov decision process (BMDP) is introduced as a generalization of the familiar exact MDP to represent variation or uncertainty concerning the parameters of sequential decision problems in cases where no prior probabilities on the parameter values are available.
Abstract: In this paper, we introduce the notion of a bounded parameter Markov decision process (BMDP) as a generalization of the familiar exact MDP. A bounded parameter MDP is a set of exact MDPs specified by giving upper and lower bounds on transition probabilities and rewards (all the MDPs in the set share the same state and action space). BMDPs form an efficiently solvable special case of the already known class of MDPs with imprecise parameters (MDPIPs). Bounded parameter MDPs can be used to represent variation or uncertainty concerning the parameters of sequential decision problems in cases where no prior probabilities on the parameter values are available. Bounded parameter MDPs can also be used in aggregation schemes to represent the variation in the transition probabilities for different base states aggregated together in the same aggregate state.
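Solving a BMDP typically uses interval Bellman backups: the pessimistic value of a state-action pair is obtained by letting an adversary push the free probability mass toward the worst next states while respecting the stated bounds. The sketch below is one plausible realization of such a lower-bound backup, not necessarily the paper's exact procedure; the argument names are illustrative.

```python
import numpy as np

def worst_case_backup(V, p_lo, p_hi, reward, gamma=0.95):
    """Lower (pessimistic) Bellman backup for one state-action pair of a BMDP.

    p_lo, p_hi : per-next-state lower/upper transition probability bounds.
    The slack mass (1 - sum(p_lo)) is assigned to the next states with the
    lowest values first, up to their upper bounds.  An optimistic backup is
    the same computation with the values sorted in the opposite order.
    """
    order = np.argsort(V)                 # worst next states first
    p = p_lo.astype(float)
    slack = 1.0 - p.sum()
    for s in order:
        add = min(p_hi[s] - p[s], slack)
        p[s] += add
        slack -= add
        if slack <= 1e-12:
            break
    return reward + gamma * p @ V

# Two next states; the bounds allow shifting up to 0.3 of mass between them.
V = np.array([0.0, 10.0])
print(worst_case_backup(V, p_lo=np.array([0.3, 0.4]),
                        p_hi=np.array([0.6, 0.7]), reward=1.0))   # -> 4.8
```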

Proceedings Article
01 Dec 1997
TL;DR: In this article, a new policy iteration algorithm for partially observable Markov decision processes is presented that is simpler and more efficient than an earlier algorithm of Sondik (1971, 1978).
Abstract: A new policy iteration algorithm for partially observable Markov decision processes is presented that is simpler and more efficient than an earlier policy iteration algorithm of Sondik (1971, 1978). The key simplification is representation of a policy as a finite-state controller. This representation makes policy evaluation straightforward. The paper's contribution is to show that the dynamic-programming update used in the policy improvement step can be interpreted as the transformation of a finite-state controller into an improved finite-state controller. The new algorithm consistently outperforms value iteration as an approach to solving infinite-horizon problems.
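With the policy represented as a finite-state controller, policy evaluation reduces to solving a linear system over (controller node, hidden state) pairs. Below is a hedged numpy sketch of that evaluation step; the array layouts for the POMDP model (T, O, R) are assumptions for illustration.

```python
import numpy as np

def evaluate_controller(node_action, node_next, T, O, R, gamma=0.95):
    """Evaluate a finite-state controller for a POMDP.

    node_action[n]   : action taken at controller node n
    node_next[n][o]  : successor node after observing o
    T[a, s, s'], O[a, s', o], R[s, a] : POMDP model arrays
    Returns V with V[n, s] = value of running the controller from node n
    when the hidden state is s, via the linear Bellman equations.
    """
    N, S, Obs = len(node_action), T.shape[1], O.shape[2]
    A_mat = np.eye(N * S)
    b = np.zeros(N * S)
    for n in range(N):
        a = node_action[n]
        for s in range(S):
            row = n * S + s
            b[row] = R[s, a]
            for s2 in range(S):
                for o in range(Obs):
                    n2 = node_next[n][o]
                    A_mat[row, n2 * S + s2] -= gamma * T[a, s, s2] * O[a, s2, o]
    return np.linalg.solve(A_mat, b).reshape(N, S)
```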

Journal ArticleDOI
TL;DR: A modified minimal repair/replacement problem that is formulated as a Markov decision process is studied, and it is shown that a control limit policy, or in particular a (t, T) policy, is optimal over the space of all possible policies under the discounted cost criterion.

Proceedings Article
27 Jul 1997
TL;DR: Novel incremental versions of a grid-based linear interpolation method and a simple lower bound method with Sondik's updates are introduced, along with a new method for computing an initial upper bound: the fast informed bound method.
Abstract: Partially observable Markov decision processes (POMDPs) allow one to model complex dynamic decision or control problems that include both action outcome uncertainty and imperfect observability. The control problem is formulated as a dynamic optimization problem with a value function combining costs or rewards from multiple steps. In this paper we propose, analyse and test various incremental methods for computing bounds on the value function for control problems with infinite discounted horizon criteria. The methods described and tested include novel incremental versions of a grid-based linear interpolation method and a simple lower bound method with Sondik's updates. Both of these can work with arbitrary points of the belief space and can be enhanced by various heuristic point selection strategies. Also introduced is a new method for computing an initial upper bound: the fast informed bound method. This method is able to improve significantly on the standard and commonly used upper bound computed by the MDP-based method. The quality of the resulting bounds is tested on a maze navigation problem with 20 states, 6 actions and 8 observations.

Dissertation
01 Jan 1997
TL;DR: Experimental results show that methods that preserve the shape of the value function over updates, such as the newly designed incremental linear vector and fast informed bound methods, tend to outperform other methods on the control performance test.
Abstract: Partially observable Markov decision processes (POMDPs) can be used to model complex control problems that include both action outcome uncertainty and imperfect observability. A control problem within the POMDP framework is expressed as a dynamic optimization problem with a value function that combines costs or rewards from multiple steps. Although the POMDP framework is more expressive than other simpler frameworks, like Markov decision processes (MDPs), its associated optimization methods are more demanding computationally and only very small problems can be solved exactly in practice. The thesis focuses on two possible approaches that can be used to solve larger problems: approximation methods and exploitation of additional problem structure. First, a number of new efficient approximation methods and improvements of existing algorithms are proposed. These include (1) the fast informed bound method based on approximate dynamic programming updates that lead to piecewise linear and convex value functions with a constant number of linear vectors, (2) a grid-based point interpolation method that supports variable grids, (3) an incremental version of the linear vector method that updates value function derivatives, as well as (4) various heuristics for selecting grid-points. The new and existing methods are experimentally tested and compared on a set of three infinite discounted horizon problems of different complexity. The experimental results show that methods that preserve the shape of the value function over updates, such as the newly designed incremental linear vector and fast informed bound methods, tend to outperform other methods on the control performance test. Second, the thesis presents a number of techniques for exploiting additional structure in the model of complex control problems. These are studied as applied to a medical therapy planning problem--the management of patients with chronic ischemic heart disease. The new extensions proposed include factored and hierarchically structured models that combine the advantages of the POMDP and MDP frameworks and cut down the size and complexity of the information state space.

Journal ArticleDOI
TL;DR: The goal of this paper is to provide a theory of N-person Markov games with unbounded cost, for a countable state space and compact action spaces; the zero-sum two-player game is investigated as a special case, for which the convergence of the value iteration algorithm is established.
Abstract: The goal of this paper is to provide a theory of N-person Markov games with unbounded cost, for a countable state space and compact action spaces. We investigate both the finite and infinite horizon problems. For the latter, we consider the discounted cost as well as the expected average cost. We present conditions for the infinite horizon problems under which equilibrium policies exist for all players within the stationary policies, and show that the costs in equilibrium satisfy the optimality equations. Similar results are obtained for the finite horizon costs, for which equilibrium policies are shown to exist for all players within the Markov policies. As a special case of N-person games, we investigate the zero-sum two-player game, for which we establish the convergence of the value iteration algorithm. We conclude by studying an application of a zero-sum Markov game in a queueing model.

Proceedings Article
23 Aug 1997
TL;DR: An abstraction mechanism is used to generate abstract MDPs associated with different objectives, and several methods for merging the policies for these different objectives are considered.
Abstract: We describe an approach to goal decomposition for a certain class of Markov decision processes (MDPs). An abstraction mechanism is used to generate abstract MDPs associated with different objectives, and several methods for merging the policies for these different objectives are considered. In one technique, causal (least-commitment) structures are generated for

Journal ArticleDOI
TL;DR: Some illustrative examples are provided to show how to model complex stochastic decision systems by using dependent-chance programming and how to solve these models by employing a Monte Carlo simulation based genetic algorithm.

Journal ArticleDOI
TL;DR: This paper proposes a new approximation scheme to transform a POMDP into another one where additional information is provided by an oracle and uses its optimal policy to construct an approximate policy for the original POMDP.
Abstract: Partially observable Markov decision processes (POMDPs) are a natural model for planning problems where effects of actions are nondeterministic and the state of the world is not completely observable. It is difficult to solve POMDPs exactly. This paper proposes a new approximation scheme. The basic idea is to transform a POMDP into another one where additional information is provided by an oracle. The oracle informs the planning agent that the current state of the world is in a certain region. The transformed POMDP is consequently said to be region observable. It is easier to solve than the original POMDP. We propose to solve the transformed POMDP and use its optimal policy to construct an approximate policy for the original POMDP. By controlling the amount of additional information that the oracle provides, it is possible to find a proper tradeoff between computational time and approximation quality. In terms of algorithmic contributions, we study in detail how to exploit region observability in solving the transformed POMDP. To facilitate the study, we also propose a new exact algorithm for general POMDPs. The algorithm is conceptually simple and yet is significantly more efficient than all previous exact algorithms.

Journal ArticleDOI
TL;DR: It is shown that the optimal value function of an MDP is monotone with respect to appropriately defined stochastic order relations, and conditions for continuity withrespect to suitable probability metrics are found.
Abstract: The present work deals with the comparison of discrete-time Markov decision processes (MDPs) which differ only in their transition probabilities. We show that the optimal value function of an MDP is monotone with respect to appropriately defined stochastic order relations. We also find conditions for continuity with respect to suitable probability metrics. The results are applied to some well-known examples, including inventory control and optimal stopping.

Proceedings Article
27 Jul 1997
TL;DR: It is shown how an NMDP, in which temporal logic is used to specify history dependence, can be automatically converted into an equivalent MDP by adding appropriate temporal variables.
Abstract: Markov Decision Processes (MDPs), currently a popular method for modeling and solving decision theoretic planning problems, are limited by the Markovian assumption: rewards and dynamics depend on the current state only, and not on previous history. Non-Markovian decision processes (NMDPs) can also be defined, but then the more tractable solution techniques developed for MDPs cannot be directly applied. In this paper, we show how an NMDP, in which temporal logic is used to specify history dependence, can be automatically converted into an equivalent MDP by adding appropriate temporal variables. The resulting MDP can be represented in a structured fashion and solved using structured policy construction methods. In many cases, this offers significant computational advantages over previous proposals for solving NMDPs.
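The construction can be illustrated for the simplest past-tense property, "a goal state has been visited at some point": adding one boolean history variable makes the reward Markovian in the augmented state. The sketch below covers only this special case and uses hypothetical data structures; the paper handles general past temporal logic formulas systematically.

```python
from itertools import product

def add_temporal_variable(states, goal_states, transitions):
    """Augment an MDP's state with a boolean 'seen_goal' history variable.

    transitions[s] is a dict: action -> {next_state: probability}.
    The reward that previously depended on the whole trajectory ("did we
    ever reach a goal?") becomes a function of the augmented state.
    """
    aug_states = list(product(states, [False, True]))
    aug_transitions = {}
    for s, seen in aug_states:
        aug_transitions[(s, seen)] = {}
        for a, succ in transitions[s].items():
            aug_transitions[(s, seen)][a] = {
                (s2, seen or (s2 in goal_states)): p for s2, p in succ.items()
            }
    return aug_states, aug_transitions

def reward(aug_state):
    # The formerly history-dependent reward now reads one temporal variable.
    _, seen_goal = aug_state
    return 1.0 if seen_goal else 0.0

# Toy example: two base states, state 1 is the goal.
states, goals = [0, 1], {1}
transitions = {0: {"go": {0: 0.5, 1: 0.5}}, 1: {"go": {0: 1.0}}}
aug_states, aug_T = add_temporal_variable(states, goals, transitions)
```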

Journal ArticleDOI
TL;DR: The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation.
Abstract: The actor-critic algorithm of Barto and others for simulation-based optimization of Markov decision processes is cast as a two-time-scale stochastic approximation. Convergence analysis, approximation issues and an example are studied.
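In tabular form, the two-time-scale structure amounts to updating the critic with a markedly larger step size than the actor, so the value estimates effectively equilibrate relative to the slowly changing policy. The following is a hedged sketch with a softmax actor; the environment interface, state/action counts, and step sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def actor_critic(env, n_states, n_actions, steps=50000,
                 critic_lr=0.05, actor_lr=0.005, gamma=0.99):
    """Tabular actor-critic with a softmax policy and two step sizes.

    `env` is a hypothetical environment with reset() -> state and
    step(state, action) -> (next_state, reward, done).
    """
    theta = np.zeros((n_states, n_actions))   # actor: policy parameters
    V = np.zeros(n_states)                     # critic: value estimates
    s = env.reset()
    for _ in range(steps):
        prefs = theta[s] - theta[s].max()
        pi = np.exp(prefs) / np.exp(prefs).sum()
        a = np.random.choice(n_actions, p=pi)
        s2, r, done = env.step(s, a)
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]
        V[s] += critic_lr * delta                 # fast time scale (critic)
        grad_log = -pi
        grad_log[a] += 1.0                        # gradient of log softmax policy
        theta[s] += actor_lr * delta * grad_log   # slow time scale (actor)
        s = env.reset() if done else s2
    return theta, V
```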