Showing papers on "Markov decision process published in 1996"

PDF

Open Access

Book•

[...]

Dimitri P. Bertsekas, John N. Tsitsiklis

01 Jan 1996

TL;DR: This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of neural networks and dynamic programming to complex problems of planning, optimal decision making, and intelligent control.

...read moreread less

Abstract: From the Publisher: This is the first textbook that fully explains the neuro-dynamic programming/reinforcement learning methodology, which is a recent breakthrough in the practical application of neural networks and dynamic programming to complex problems of planning, optimal decision making, and intelligent control

...read moreread less

3,665 citations

Book•DOI•

Competitive Markov decision processes

[...]

Jerzy A. Filar¹, Koos Vrieze•Institutions (1)

University of South Australia¹

01 Dec 1996

TL;DR: In this article, the authors present a series of courses and prerequisites for the development of stochastic games with a focus on reducing the complexity of the problem of finding the optimal solution.

...read moreread less

Abstract: 1 Introduction.- 1.0 Background.- 1.1 Raison d'Etre and Limitations.- 1.2 A Menu of Courses and Prerequisites.- 1.3 For the Cognoscenti.- 1.4 Style and Nomenclature.- I Mathematical Programming Perspective.- 2 Markov Decision Processes: The Noncompetitive Case.- 2.0 Introduction.- 2.1 The Summable Markov Decision Processes.- 2.2 The Finite Horizon Markov Decision Process.- 2.3 Linear Programming and the Summable Markov Decision Models.- 2.4 The Irreducible Limiting Average Process.- 2.5 Application: The Hamiltonian Cycle Problem.- 2.6 Behavior and Markov Strategies.- 2.7 Policy Improvement and Newton's Method in Summable MDPs.- 2.8 Connection Between the Discounted and the Limiting Average Models.- 2.9 Linear Programming and the Multichain Limiting Average Process.- 2.10 Bibliographic Notes.- 2.11 Problems.- 3 Stochastic Games via Mathematical Programming.- 3.0 Introduction.- 3.1 The Discounted Stochastic Games.- 3.2 Linear Programming and the Discounted Stochastic Games.- 3.3 Modified Newton's Method and the Discounted Stochastic Games.- 3.4 Limiting Average Stochastic Games: The Issues.- 3.5 Zero-Sum Single-Controller Limiting Average Game.- 3.6 Application: The Travelling Inspector Model.- 3.7 Nonlinear Programming and Zero-Sum Stochastic Games.- 3.8 Nonlinear Programming and General-Sum Stochastic Games.- 3.9 Shapley's Theorem via Mathematical Programming.- 3.10 Bibliographic Notes.- 3.11 Problems.- II Existence, Structure and Applications.- 4 Summable Stochastic Games.- 4.0 Introduction.- 4.1 The Stochastic Game Model.- 4.2 Transient Stochastic Games.- 4.2.1 Stationary Strategies.- 4.2.2 Extension to Nonstationary Strategies.- 4.3 Discounted Stochastic Games.- 4.3.1 Introduction.- 4.3.2 Solutions of Discounted Stochastic Games.- 4.3.3 Structural Properties.- 4.3.4 The Limit Discount Equation.- 4.4 Positive Stochastic Games.- 4.5 Total Reward Stochastic Games.- 4.6 Nonzero-Sum Discounted Stochastic Games.- 4.6.1 Existence of Equilibrium Points.- 4.6.2 A Nonlinear Compementarity Problem.- 4.6.3 Perfect Equilibrium Points.- 4.7 Bibliographic Notes.- 4.8 Problems.- 5 Average Reward Stochastic Games.- 5.0 Introduction.- 5.1 Irreducible Stochastic Games.- 5.2 Existence of the Value.- 5.3 Stationary Strategies.- 5.4 Equilibrium Points.- 5.5 Bibliographic Notes.- 5.6 Problems.- 6 Applications and Special Classes of Stochastic Games.- 6.0 Introduction.- 6.1 Economic Competition and Stochastic Games.- 6.2 Inspection Problems and Single-Control Games.- 6.3 The Presidency Game and Switching-Control Games.- 6.4 Fishery Games and AR-AT Games.- 6.5 Applications of SER-SIT Games.- 6.6 Advertisement Models and Myopic Strategies.- 6.7 Spend and Save Games and the Weighted Reward Criterion.- 6.8 Bibliographic Notes.- 6.9 Problems.- Appendix G Matrix and Bimatrix Games and Mathematical Programming.- G.1 Introduction.- G.2 Matrix Game.- G.3 Linear Programming.- G.4 Bimatrix Games.- G.5 Mangasarian-Stone Algorithm for Bimatrix Games.- G.6 Bibliographic Notes.- Appendix H A Theorem of Hardy and Littlewood.- H.1 Introduction.- H.2 Preliminaries, Results and Examples.- H.3 Proof of the Hardy-Littlewood Theorem.- Appendix M Markov Chains.- M.1 Introduction.- M.2 Stochastic Matrix.- M.3 Invariant Distribution.- M.4 Limit Discounting.- M.5 The Fundamental Matrix.- M.6 Bibliographic Notes.- Appendix P Complex Varieties and the Limit Discount Equation.- P.1 Background.- P.2 Limit Discount Equation as a Set of Simultaneous Polynomials.- P.3 Algebraic and Analytic Varieties.- P.4 Solution of the Limit Discount Equation via Analytic Varieties.- References.

...read moreread less

1,191 citations

Proceedings Article•

Planning, Learning and Coordination in Multiagent Decision Processes

[...]

Craig Boutilier¹•Institutions (1)

University of British Columbia¹

17 Mar 1996

TL;DR: The extent to which methods from single-agent planning and learning can be applied in multiagent settings is investigated and the decomposition of sequential decision processes so that coordination can be learned locally, at the level of individual states.

...read moreread less

Abstract: There has been a growing interest in AI in the design of multiagent systems, especially in multiagent cooperative planning. In this paper, we investigate the extent to which methods from single-agent planning and learning can be applied in multiagent settings. We survey a number of different techniques from decision-theoretic planning and reinforcement learning and describe a number of interesting issues that arise with regard to coordinating the policies of individual agents. To this end, we describe multiagent Markov decision processes as a general model in which to frame this discussion. These are special n-person cooperative games in which agents share the same utility function. We discuss coordination mechanisms based on imposed conventions (or social laws) as well as learning methods for coordination. Our focus is on the decomposition of sequential decision processes so that coordination can be learned (or imposed) locally, at the level of individual states. We also discuss the use of structured problem representations and their role in the generalization of learned conventions and in approximation.

...read moreread less

496 citations

Algorithms for Sequential Decision Making

[...]

Michael L. Littman

01 Jan 1996

TL;DR: This thesis shows how to answer the question ``What should I do now?

...read moreread less

Abstract: Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question ``What should I do now?'''' In this thesis, I show how to answer this question when ``now'''' is one of a finite set of states, ``do'''' is one of a finite set of actions, ``should'''' is maximize a long-run measure of reward, and ``I'''' is an automated planning or learning system (agent). In particular, I collect basic results concerning methods for finding optimal (or near-optimal) behavior in several different kinds of model environments: Markov decision processes, in which the agent always knows its state; partially observable Markov decision processes (POMDPs), in which the agent must piece together its state on the basis of observations it makes; and Markov games, in which the agent is in direct competition with an opponent. The thesis is written from a computer-science perspective, meaning that many mathematical details are not discussed, and descriptions of algorithms and the complexity of problems are emphasized. New results include an improved algorithm for solving POMDPs exactly over finite horizons, a method for learning minimax-optimal policies for Markov games, a pseudopolynomial bound for policy iteration, and a complete complexity theory for finding zero-reward POMDP policies.

...read moreread less

410 citations

Journal Article•DOI•

Average reward reinforcement learning: foundations, algorithms, and empirical results

[...]

Sridhar Mahadevan¹•Institutions (1)

University of South Florida¹

01 Jan 1996-Machine Learning

TL;DR: This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework, and a detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels.

...read moreread less

Abstract: This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.

...read moreread less

397 citations

Book Chapter•DOI•

Chapter 14 Numerical dynamic programming in economics

[...]

John Rust¹•Institutions (1)

University of Wisconsin-Madison¹

01 Jan 1996-Handbook of Computational Economics

TL;DR: This chapter explores the numerical methods for solving dynamic programming (DP) problems and focuses on continuous Markov decision processes (MDPs) because these problems arise frequently in economic applications.

...read moreread less

Abstract: Publisher Summary This chapter explores the numerical methods for solving dynamic programming (DP) problems. The DP framework has been extensively used in economics because it is sufficiently rich to model almost any problem involving sequential decision making over time and under uncertainty. The chapter focuses on continuous Markov decision processes (MDPs) because these problems arise frequently in economic applications. Although, complexity theory suggests a number of useful algorithms, the theory has relatively little to say about important practical issues, such as determining the point at which various exponential-time algorithms such as Chebyshev approximation methods start to blow up, making it optimal to switch to polynomial-time algorithms. In future work, it will be essential to provide numerical comparisons of a broader range of methods over a broader range of test problems, including problems of moderate to high dimensionality.

...read moreread less

319 citations

Proceedings Article•

Computing optimal policies for partially observable decision processes using compact representations

[...]

Craig Boutilier¹, David Poole¹•Institutions (1)

University of British Columbia¹

04 Aug 1996

TL;DR: This paper uses Bayesian networks with structured conditional probability matrices to represent PomDPs, and uses this model to structure the belief space for POMDP algorithms, allowing irrelevant distinctions to be ignored.

...read moreread less

Abstract: Partially-observable Markov decision processes provide a general model for decision theoretic planning problems, allowing trade-offs between various courses of actions to be determined under conditions of uncertainty, and incorporating partial observations made by an agent. Dynamic programming algorithms based on the belief state of an agent can be used to construct optimal policies without explicit consideration of past history, but at high computational cost. In this paper, we discuss how structured representations of system dynamics can be incorporated in classic POMDP solution algorithms. We use Bayesian networks with structured conditional probability matrices to represent POMDPs, and use this model to structure the belief space for POMDP algorithms, allowing irrelevant distinctions to be ignored. Apart from speeding up optimal policy construction, we suggest that such representations can be exploited in the development of useful approximation methods.

...read moreread less

238 citations

Proceedings Article•

A Generalized Reinforcement-Learning Model: Convergence and Applications

[...]

Michael L. Littman, Csaba Szepesva ''ri

01 Feb 1996

TL;DR: This paper shows how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes M DPs, two-player games and MDP’s under a worst-case optimality criterion as special cases.

...read moreread less

Abstract: Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence. Keywords: Reinforcement learning, Q-learning convergence, Markov games

...read moreread less

236 citations

Journal Article•DOI•

Risk sensitive control of Markov processes in countable state space

[...]

Daniel Hernández-Hernández¹, Steven I. Marcus¹•Institutions (1)

University of Maryland, College Park¹

11 Nov 1996-Systems & Control Letters

TL;DR: In this article, the authors considered infinite horizon risk-sensitive control of Markov processes with discrete time and denumerable state space, and proved that there exists a bounded solution to the dynamic programming equation.

...read moreread less

138 citations

Book•

Planning under uncertainty: structural assumptions and computational leverage

[...]

Craig Boutilier, Thomas Dean, Steve Hanks

01 Jan 1996

TL;DR: This paper provides a general characterization of the representational requirements for this class of problems, and it is described how to achieve computational leverage using representations that make different types of dependency information explicit.

...read moreread less

Abstract: The problem of planning under uncertainty has been addressed by researchers in many different fields, adopting rather different perspectives on the problem. Unfortunately, these researchers are not always aware of the relationships among these different problem formulations, often resulting in confusion and duplicated effort. Many probabilisticplanning or decision making problems can be characterized as a class of Markov decision processes that allow for significant compression in representing the underlying system dynamics. It is for this class of problems that we as experts in intensional representations are advantageously positioned to contribute efficient solution methods. This paper provides a general characterization of the representational requirements for this class of problems, and we describe how to achieve computational leverage using representations that make different types of dependency information explicit.

...read moreread less

129 citations

Proceedings Article•

Hidden Markov Decision Trees

[...]

Michael I. Jordan¹, Zoubin Ghahramani², Lawrence K. Saul¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Toronto²

03 Dec 1996

TL;DR: A time series model that can be viewed as a decision tree with Markov temporal structure is studied and a Viterbi-like assumption is made to pick out a single most likely state sequence.

...read moreread less

Abstract: We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.

...read moreread less

Journal Article•DOI•

Constrained Discounted Dynamic Programming

[...]

Eugene A. Feinberg¹, Adam Shwartz²•Institutions (2)

Stony Brook University¹, Technion – Israel Institute of Technology²

01 Nov 1996-Mathematics of Operations Research

TL;DR: This paper proves the existence and structure of optimal policies, and establishes Pareto optimality of policies from the two classes described above for multi-criteria problems, and describes an algorithm to compute optimal policies with properties i--iii for constrained problems.

...read moreread less

Abstract: This paper deals with constrained optimization of Markov Decision Processes with a countable state space, compact action sets, continuous transition probabilities, and upper semicontinuous reward functions. The objective is to maximize the expected total discounted reward for one reward function, under several inequality constraints on similar criteria with other reward functions. Suppose a feasible policy exists for a problem with M constraints. We prove two results on the existence and structure of optimal policies. First, we show that there exists a randomized stationary optimal policy which requires at most M actions more than a nonrandomized stationary one. This result is known for several particular cases. Second, we prove that there exists an optimal policy which is i stationary nonrandomized from some step onward, ii randomized Markov before this step, but the total number of actions which are added by randomization is at most M, iii the total number of actions that are added by nonstationarity is at most M. We also establish Pareto optimality of policies from the two classes described above for multi-criteria problems. We describe an algorithm to compute optimal policies with properties i--iii for constrained problems. The policies that satisfy properties i--iii have the pleasing aesthetic property that the amount of randomization they require over any trajectory is restricted by the number of constraints. In contrast, a randomized stationary policy may require an infinite number of randomizations over time.

...read moreread less

Proceedings Article•

Rewarding behaviors

[...]

Fahiem Bacchus¹, Craig Boutilier², Adam J. Grove³•Institutions (3)

University of Waterloo¹, University of British Columbia², Princeton University³

04 Aug 1996

TL;DR: An algorithm is considered which, given a decision process with non-Markovian rewards expressed in this manner, automatically constructs an equivalent MDP (with Markovian reward structure), allowing optimal policy construction using standard techniques.

...read moreread less

Abstract: Markov decision processes (MDPs) are a very popular tool for decision theoretic planning (DTP), partly because of the welldeveloped, expressive theory that includes effective solution techniques. But the Markov assumption--that dynamics and rewards depend on the current state only, and not on history-- is often inappropriate. This is especially true of rewards: we frequently wish to associate rewards with behaviors that extend over time. Of course, such reward processes can be encoded in an MDP should we have a rich enough state space (where states encode enough history). However it is often difficult to "hand craft" suitable state spaces that encode an appropriate amount of history. We consider this problem in the case where non-Markovian rewards are encoded by assigning values to formulas of a temporal logic. These formulas characterize the value of temporally extended behaviors. We argue that this allows a natural representation of many commonly encountered non-Markovian rewards. The main result is an algorithm which, given a decision process with non-Markovian rewards expressed in this manner, automatically constructs an equivalent MDP (with Markovian reward structure), allowing optimal policy construction using standard techniques.

...read moreread less

Journal Article•DOI•

Optimal software rejuvenation for tolerating soft failures

[...]

András Pfening, Sachin Garg¹, Antonio Puliafito², Miklós Telek, Kishor S. Trivedi¹ - Show less +1 more•Institutions (2)

Duke University¹, University of Catania²

01 Oct 1996-Performance Evaluation

TL;DR: This paper addresses the problem of determining the optimal time to rejuvenate a server type software which experiences “soft failures” because of aging and develops Markov decision models for such a system for two different queuing policies.

...read moreread less

Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms

[...]

Csaba Szepesv ''ari¹, Michael L. Littman•Institutions (1)

University of Szeged¹

01 Nov 1996

TL;DR: A new generalized model that subsumes MDPs as well as many of the recent variations is described and generalizations of value iteration, policy iteration, model-based reinforcement-learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions are developed.

...read moreread less

Abstract: The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement-learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, sarsa, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

...read moreread less

Journal Article•DOI•

Fair-efficient call admission control policies for broadband networks—a game theoretic framework

[...]

Zbigniew Dziong¹, L.G. Mason¹•Institutions (1)

Institut national de la recherche scientifique¹

01 Feb 1996-IEEE ACM Transactions on Networking

TL;DR: The authors investigate the Nash, Raiffa-Kalai-Smorodinsky, and modified Thomson (Cao) arbitration solutions from game theory for call admission strategies in broadband networks and indicates that the schemes provide some attractive features especially when compared to traditional control objectives: blocking equalization and traffic maximization.

...read moreread less

Abstract: A fundamental problem in connection oriented multiservice networks (ATM and STM) is finding the optimal policy for call acceptance. One seeks an admission control policy that efficiently utilizes network resources while at the same time being fair to the various call classes being supported. The theory of cooperative games provides a natural and precise framework for formulating such multicriterion problems as well as solution concepts. The authors describe how this framework can be used for analysis and synthesis of call admission strategies in broadband networks. In particular they investigate the Nash (1950), Raiffa-Kalai-Smorodinsky (Raiffa, 1953; Kalai and Smorodinsky, 1975), and modified Thomson (Cao, 1982) arbitration solutions from game theory. The performance of all solutions is evaluated by applying the value iteration algorithm from Markov decision theory. The approach is illustrated on a one-link network example for which the exact solutions can be achieved. The results indicate that the arbitration schemes from game theory provide some attractive features especially when compared to traditional control objectives: blocking equalization and traffic maximization. The authors also compare the optimal solutions with some simplified policies belonging to four different classes: complete sharing, coordinate convex, trunk reservation, and dynamic trunk reservation. The comparison indicates that in many cases, the trunk reservation and dynamic trunk reservation policies can provide fair, efficient solutions, close to the optimal ones.

...read moreread less

Proceedings Article•

Multidimensional Triangulation and Interpolation for Reinforcement Learning

[...]

Scott Davies¹•Institutions (1)

Carnegie Mellon University¹

03 Dec 1996

TL;DR: This paper studies interpolation techniques that can result in vast improvements in the online behavior of the resulting control systems: multilinear interpolation, and an interpolation algorithm based on an interesting regular triangulation of d-dimensional space.

...read moreread less

Abstract: Dynamic Programming, Q-learning and other discrete Markov Decision Process solvers can be applied to continuous d-dimensional state-spaces by quantizing the state space into an array of boxes. This is often problematic above two dimensions: a coarse quantization can lead to poor policies, and fine quantization is too expensive. Possible solutions are variable-resolution discretization, or function approximation by neural nets. A third option, which has been little studied in the reinforcement learning literature, is interpolation on a coarse grid. In this paper we study interpolation techniques that can result in vast improvements in the online behavior of the resulting control systems: multilinear interpolation, and an interpolation algorithm based on an interesting regular triangulation of d-dimensional space. We adapt these interpolators under three reinforcement learning paradigms: (i) offline value iteration with a known model, (ii) Q-learning, and (iii) online value iteration with a previously unknown model learned from data. We describe empirical results, and the resulting implications for practical learning of continuous non-linear dynamic control.

...read moreread less

Planning in stochastic domains : problem characteristics and approximations (version II)

[...]

Nevin L. Zhang, Wenju Liu

01 Jan 1996

TL;DR: This paper proposes a new method for attacking the core problem known as dynamic programming updates that one has to face in solving POMDPs and has shown elsewhere that the new method is signi cantly more e cient that the best previous method.

...read moreread less

Abstract: This paper is about planning in stochastic domains by means of partially observable Markov decision processes POMDPs POMDPs are di cult to solve and approxima tion is a must in real world applications Approximation methods can be classi ed into those that solve a POMDP directly and those that approximate a POMDP model by a simpler model Only one previous method falls into the second category It approximates POMDPs by using fully observable Markov decision processes MDPs We propose to approximate POMDPs by using what we call region observable POMDPs Region ob servable POMDPs are more complex than MDPs and yet still solvable They have been empirically shown to yield signi cantly better approximate policies than MDPs In the process of designing an algorithm for solving region observable POMDPs we also propose a new method for attacking the core problem known as dynamic programming updates that one has to face in solving POMDPs We have shown elsewhere that the new method is signi cantly more e cient that the best previous method

...read moreread less

Proceedings Article•

Approximate Value Trees in Structured Dynamic Programming.

[...]

Craig Boutilier¹, Richard Dearden¹•Institutions (1)

University of British Columbia¹

01 Jan 1996

TL;DR: A method for detecting convergence, proving errors bounds on the resulting approximately optimal value functions and policies, and describing some preliminary experimental results are proposed.

...read moreread less

Abstract: We propose and examine a method of approximate dynamic programming for Markov decision processes based on structured problem representations. We assume an MDP is represented using a dynamic Bayesian network, and construct value functions using decision trees as our function representation. The size of the representation is kept within acceptable limits by pruning these value trees so that leaves represent possible ranges of values, thus approximating the value functions produced during optimization. We propose a method for detecting convergence,prove errors bounds on the resulting approximately optimal value functions and policies, and describe some preliminary experimental results.

...read moreread less

Proceedings Article•DOI•

Active gesture recognition using partially observable Markov decision processes

[...]

Trevor Darrell¹, Alex Pentland•Institutions (1)

Massachusetts Institute of Technology¹

25 Aug 1996

TL;DR: A foveated gesture recognition system that guides an active camera to foveate salient features based on a reinforcement learning paradigm and uses a new multiple-model Q-learning formulation.

...read moreread less

Abstract: We present a foveated gesture recognition system that guides an active camera to foveate salient features based on a reinforcement learning paradigm. Using vision routines previously implemented for an interactive environment, we determine the spatial location of salient body parts of a user and guide an active camera to obtain images of gestures of expressions. A hidden-state reinforcement learning paradigm based on the partially observable Markov decision process (POMDP) is used to implement this visual attention. The attention module selects targets to foveate based on the goal of successful recognition, and uses a new multiple-model Q-learning formulation. Given a set of target and distracter gestures, our system can learn where to foveate to maximally discriminate a particular gesture.

...read moreread less

Book Chapter•DOI•

On measurability and representation of strategic measures in Markov decision processes

[...]

Eugene A. Feinberg

01 Jan 1996

Journal Article•DOI•

On the complexity of partially observed Markov decision processes

[...]

Dima Burago, Michel de Rougemont, Anatol Slissenko

01 May 1996-Theoretical Computer Science

TL;DR: The paper considers the complexity of constructing optimal policies (strategies) for some type of partially observed Markov decision processes, and shows that the problem of constructing even a very weak approximation to an optimal strategy is NP-hard.

...read moreread less

Dynamic programming and Markov decision processes

[...]

Anders R. Kristensen

01 Jan 1996

Journal Article•DOI•

Adaptive optimization of renewable natural resources: Solution algorithms and a computer program

[...]

Byron K. Williams¹•Institutions (1)

University of Vermont¹

16 Dec 1996-Ecological Modelling

TL;DR: Recursive algorithms and a computer program are described for solution of the adaptive optimization problem of adaptive resource management, defined in terms of Markov decision processes, with an objective of maximizing long-term harvest value.

...read moreread less

Journal Article•DOI•

Constrained Markov decision processes with total cost criteria : Occupation measures and primal LP

[...]

Eitan Altman¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

01 Feb 1996-Mathematical Methods of Operations Research

TL;DR: In this paper, the authors study the properties of the set of occupation measures achieved by different classes of policies and present conditions under which optimal policies exist within these classes. And they conclude by introducing an equivalent infinite linear program.

...read moreread less

Abstract: This paper is the third in a series on constrained Markov decision processes (CMDPs) with a countable state space and unbounded cost. In the previous papers we studied the expected average and the discounted cost. We analyze in this paper the total cost criterion. We study the properties of the set of occupation measures achieved by different classes of policies; we then focus on stationary policies and on mixed deterministic policies and present conditions under which optimal policies exist within these classes. We conclude by introducing an equivalent infinite Linear Program.

...read moreread less

Journal Article•DOI•

On constrained Markov decision processes

[...]

Moshe Haviv¹•Institutions (1)

Hebrew University of Jerusalem¹

01 Jul 1996-Operations Research Letters

TL;DR: A multichain Markov decision process with constraints on the expected state-action frequencies may lead to a unique optimal policy which does not satisfy Bellman's principle of optimality, but the model with sample-path constraints does not suffer from this drawback.

...read moreread less

Proceedings Article•

Local Bandit Approximation for Optimal Learning Problems

[...]

Michael O. Duff¹, Andrew G. Barto¹•Institutions (1)

University of Massachusetts Amherst¹

03 Dec 1996

TL;DR: This paper proposes an approximate approach in which bandit processes are used to model, in a certain "local" sense, a given MDP.

...read moreread less

Abstract: In general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDP's) require a prohibitive amount of computation-the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain "local" sense, a given MDP. Bandit processes constitute an important subclass of MDP's, and have optimal learning strategies (defined in terms of Gittins indices) that can be computed relatively efficiently. Thus, one scheme for achieving approximately-optimal learning for general MDP's proceeds by taking actions suggested by strategies that are optimal with respect to local bandit models.

...read moreread less

Book Chapter•DOI•

Markov Control Processes

[...]

Onésimo Hernández-Lerma¹, Jean B. Lasserre²•Institutions (2)

CINVESTAV¹, Centre national de la recherche scientifique²

01 Jan 1996

TL;DR: The main objective of this chapter is to set the stage for the rest of the book by formally introducing the controlled stochastic processes in which the authors are interested.

...read moreread less

Abstract: The main objective of this chapter is to set the stage for the rest of the book by formally introducing the controlled stochastic processes in which we are interested. An informal discussion of the main concepts, namely, Markov control models, control policies, and Markov control processes (MCPs), was already presented in §1.2. Their meaning is made precise in this chapter.

...read moreread less

Journal Article•DOI•

Development of New Network Optimization Model for Oklahoma Department of Transportation

[...]

Xin Chen, Stuart W Hudson, Masoud Pajoh¹, William Dickinson¹•Institutions (1)

Oklahoma Department of Transportation¹

01 Jan 1996-Transportation Research Record

TL;DR: A new pavement network optimization model based on the Markov decision process (MDP) is presented, a global optimization model in which the entire network can be optimized without being divided into mutually independent groups.

...read moreread less

Abstract: A new pavement network optimization model based on the Markov decision process (MDP) is presented. The new model is a global optimization model in which the entire network can be optimized without being divided into mutually independent groups. Current MDP models in use for pavement management use only one routine maintenance model for all types of rehabilitation and reconstruction treatments. The new formulation provides separate routine maintenance models for each type of treatment, which is more realistic than the currently available formulations. Methods for estimating pavement maintenance and rehabilitation benefits are described. These methods can be used for the optimization models with objectives of maximization when inadequate data are available to consider road user costs. The model has been applied to the network-level pavement management system for the Oklahoma Department of Transportation. Results of example runs are discussed.

...read moreread less

Journal Article•DOI•

Optimizing long-term hydro-power production using Markov decision processes

[...]

Bernard F. Lamond¹, Abdeslem Boukhtouta¹•Institutions (1)

Laval University¹

01 Jul 1996-International Transactions in Operational Research

TL;DR: Recent research on MDP computation, with application to hydro-power systems is surveyed, and three main approaches are discussed: (i) discrete DP, (ii) numerical approximation of the expected future reward function, and (iii) analytic solution of the DP recursion.

...read moreread less