
Showing papers on "Markov decision process" published in 2003


Book
01 Jan 2003
TL;DR: This book develops the Markov chain approximation method for the numerical solution of continuous-time stochastic control problems, covering continuous-time models, controlled Markov chains, dynamic programming equations, construction of the approximating chains, computational methods, the ergodic cost problem, and heavy traffic and singular control.
Abstract:
Introduction
1 Review of Continuous Time Models: 1.1 Martingales and Martingale Inequalities; 1.2 Stochastic Integration; 1.3 Stochastic Differential Equations: Diffusions; 1.4 Reflected Diffusions; 1.5 Processes with Jumps
2 Controlled Markov Chains: 2.1 Recursive Equations for the Cost; 2.2 Optimal Stopping Problems; 2.3 Discounted Cost; 2.4 Control to a Target Set and Contraction Mappings; 2.5 Finite Time Control Problems
3 Dynamic Programming Equations: 3.1 Functionals of Uncontrolled Processes; 3.2 The Optimal Stopping Problem; 3.3 Control Until a Target Set Is Reached; 3.4 A Discounted Problem with a Target Set and Reflection; 3.5 Average Cost Per Unit Time
4 Markov Chain Approximation Method: Introduction; 4.1 Markov Chain Approximation; 4.2 Continuous Time Interpolation; 4.3 A Markov Chain Interpolation; 4.4 A Random Walk Approximation; 4.5 A Deterministic Discounted Problem; 4.6 Deterministic Relaxed Controls
5 Construction of the Approximating Markov Chains: 5.1 One Dimensional Examples; 5.2 Numerical Simplifications; 5.3 The General Finite Difference Method; 5.4 A Direct Construction; 5.5 Variable Grids; 5.6 Jump Diffusion Processes; 5.7 Reflecting Boundaries; 5.8 Dynamic Programming Equations; 5.9 Controlled and State Dependent Variance
6 Computational Methods for Controlled Markov Chains: 6.1 The Problem Formulation; 6.2 Classical Iterative Methods; 6.3 Error Bounds; 6.4 Accelerated Jacobi and Gauss-Seidel Methods; 6.5 Domain Decomposition; 6.6 Coarse Grid-Fine Grid Solutions; 6.7 A Multigrid Method; 6.8 Linear Programming
7 The Ergodic Cost Problem: Formulation and Algorithms; 7.1 Formulation of the Control Problem; 7.2 A Jacobi Type Iteration; 7.3 Approximation in Policy Space; 7.4 Numerical Methods; 7.5 The Control Problem; 7.6 The Interpolated Process; 7.7 Computations; 7.8 Boundary Costs and Controls
8 Heavy Traffic and Singular Control: 8.1 Motivating Examples

2,274 citations


Journal ArticleDOI
TL;DR: The new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework.
Abstract: We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method which can use efficiently (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.

1,405 citations
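
To make the procedure concrete, here is a minimal sketch of LSPI's core loop: an LSTD-Q solve for the weights of a linear state-action value function, wrapped in greedy policy improvement over the same reusable batch of samples. The function names, the regularization term, and the sample format (s, a, r, s') are illustrative choices, not specifics from the paper.

```python
import numpy as np

def lstdq(samples, phi, policy, n_features, gamma=0.95):
    """One LSTD-Q evaluation step: fit w so that Q(s, a) ~ phi(s, a) . w
    for the given policy, using samples collected by any behavior policy."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for (s, a, r, s_next) in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # action the current policy would take
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(n_features), b)  # small ridge for stability

def lspi(samples, phi, actions, n_features, gamma=0.95, n_iter=20, tol=1e-4):
    """Approximate policy iteration: greedy improvement wrapped around
    repeated LSTD-Q evaluations, reusing the same sample set each iteration."""
    w = np.zeros(n_features)
    for _ in range(n_iter):
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, phi, policy, n_features, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

Because the batch is reused unchanged in every iteration and the critic learns Q rather than V, the samples can come from any behavior policy, which is the off-policy property highlighted in the abstract.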


Journal ArticleDOI
TL;DR: This work reviews several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed and discusses extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability.
Abstract: Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.

1,175 citations
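
The reliance on semi-Markov decision process theory shows up most concretely in the SMDP form of the Q-learning update, where the discount factor is raised to the power of the duration of the temporally-extended activity. The sketch below is a generic illustration, not taken from the review; the options-style interface (option.policy, option.terminates, env.step) and the dictionary Q keyed by (state, option) are assumptions.

```python
def smdp_q_update(Q, s, option, env, options, alpha=0.1, gamma=0.99):
    """Execute one option to termination and apply the SMDP Q-learning update:
    Q(s, o) += alpha * (R + gamma**k * max_o' Q(s', o') - Q(s, o)),
    where R is the discounted reward accumulated over the option's k steps.
    Q is a mapping, e.g. collections.defaultdict(float), from (state, option) to value."""
    state, total_reward, discount, k = s, 0.0, 1.0, 0
    done = False
    while not done:
        a = option.policy(state)                    # follow the option's own policy
        next_state, r, done = env.step(state, a)
        total_reward += discount * r
        discount *= gamma
        k += 1
        state = next_state
        if option.terminates(state):                # option-level termination condition
            break
    best_next = 0.0 if done else max(Q[(state, o)] for o in options)
    Q[(s, option)] += alpha * (total_reward + gamma**k * best_next - Q[(s, option)])
    return state
```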


Journal ArticleDOI
TL;DR: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the ``optimism under uncertainty'' bias used in many RL algorithms.
Abstract: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, it is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the ``optimism under uncertainty'' bias used in many RL algorithms. (4) It is simpler, more general, and more efficient than Brafman and Tennenholtz's LSG algorithm for learning in single controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) It is the only algorithm for learning in repeated games, to date, which is provably efficient, considerably improving and simplifying previous algorithms by Banos and by Megiddo.

1,011 citations
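
A rough sketch of the optimistic-model construction that gives R-MAX its name, for the single-agent MDP case: insufficiently visited state-action pairs are routed to a fictitious maximally rewarding absorbing state, and the agent plans in that model. The visit threshold m, the tabular bookkeeping, and the discounted value-iteration planner are illustrative simplifications; the paper works with average reward and also covers stochastic games.

```python
import numpy as np

def rmax_model(n_states, n_actions, counts, sums_r, sums_p, m, r_max):
    """Build the optimistic R-MAX model: state-action pairs visited fewer than
    m times lead to a fictitious absorbing state that returns maximal reward."""
    S = n_states + 1                      # extra index = fictitious optimistic state
    P = np.zeros((S, n_actions, S))
    R = np.full((S, n_actions), r_max)    # optimistic by default
    P[:, :, n_states] = 1.0               # unknown pairs (and the extra state) go there
    for s in range(n_states):
        for a in range(n_actions):
            if counts[s, a] >= m:         # "known" pair: plug in empirical estimates
                P[s, a, :n_states] = sums_p[s, a] / counts[s, a]
                P[s, a, n_states] = 0.0
                R[s, a] = sums_r[s, a] / counts[s, a]
    return P, R

def plan_greedy(P, R, gamma=0.95, iters=1000):
    """Value iteration in the optimistic model; acting greedily w.r.t. it
    either earns near-optimal reward or visits unknown pairs (implicit exploration)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * P @ V, axis=1)
    return np.argmax(R + gamma * P @ V, axis=1)
```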


Journal ArticleDOI
TL;DR: This article proposes and analyzes a class of actor-critic algorithms in which the critic uses temporal difference learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction, based on information provided by the critic.
Abstract: In this article, we propose and analyze a class of actor-critic algorithms. These are two-time-scale algorithms in which the critic uses temporal difference learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction, based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actor-critic algorithms for Markov decision processes with Polish state and action spaces. We state and prove two results regarding their convergence.

634 citations
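
A minimal sketch of the two-time-scale pattern: a linear TD critic updated with a larger step size, and a parameterized actor nudged in an approximate gradient direction using the critic's estimate with a smaller step size. The softmax parameterization, the shared feature map phi_sa, and the step sizes are assumptions for illustration, not the article's construction (which, notably, prescribes critic features spanning a subspace tied to the actor's parameterization).

```python
import numpy as np

def softmax_policy(theta, phi_sa, state, actions):
    prefs = np.array([phi_sa(state, a) @ theta for a in actions])
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def actor_critic_step(theta, w, transition, phi_sa, actions,
                      gamma=0.99, alpha_critic=0.05, alpha_actor=0.005):
    """One update from a SARSA-style transition (s, a, r, s', a').
    The critic (w) learns Q by temporal differences on the fast time scale;
    the actor (theta) follows a critic-based gradient estimate on the slow scale."""
    s, a, r, s_next, a_next = transition
    q, q_next = phi_sa(s, a) @ w, phi_sa(s_next, a_next) @ w
    td_error = r + gamma * q_next - q
    w = w + alpha_critic * td_error * phi_sa(s, a)              # fast time scale

    probs = softmax_policy(theta, phi_sa, s, actions)
    # score function of a softmax policy: phi(s, a) - E_pi[phi(s, .)]
    score = phi_sa(s, a) - sum(p * phi_sa(s, b) for p, b in zip(probs, actions))
    theta = theta + alpha_actor * (phi_sa(s, a) @ w) * score    # slow time scale
    return theta, w
```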


Journal ArticleDOI
TL;DR: This paper presents two approximate solution algorithms that exploit structure in factored MDPs by using an approximate value function represented as a linear combination of basis functions, where each basis function involves only a small subset of the domain variables.
Abstract: This paper addresses the problem of planning under uncertainty in large Markov Decision Processes (MDPs). Factored MDPs represent a complex state space using state variables and the transition model using a dynamic Bayesian network. This representation often allows an exponential reduction in the representation size of structured MDPs, but the complexity of exact solution algorithms for such MDPs can grow exponentially in the representation size. In this paper, we present two approximate solution algorithms that exploit structure in factored MDPs. Both use an approximate value function represented as a linear combination of basis functions, where each basis function involves only a small subset of the domain variables. A key contribution of this paper is that it shows how the basic operations of both algorithms can be performed efficiently in closed form, by exploiting both additive and context-specific structure in a factored MDP. A central element of our algorithms is a novel linear program decomposition technique, analogous to variable elimination in Bayesian networks, which reduces an exponentially large LP to a provably equivalent, polynomial-sized one. One algorithm uses approximate linear programming, and the second approximate dynamic programming. Our dynamic programming algorithm is novel in that it uses an approximation based on max-norm, a technique that more directly minimizes the terms that appear in error bounds for approximate MDP algorithms. We provide experimental results on problems with over 10^40 states, demonstrating a promising indication of the scalability of our approach, and compare our algorithm to an existing state-of-the-art approach, showing, in some problems, exponential gains in computation time.

503 citations
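
The representational starting point, a value function written as a weighted combination of basis functions that each touch only a few state variables, can be shown in a few lines. The variables, scopes, and weights below are invented for illustration; the paper's contribution, performing the LP and DP operations on this representation in closed form, is not reproduced here.

```python
import numpy as np

# Each basis function has a small scope: the subset of state variables it reads.
# The factored value function is V(x) = sum_i w_i * h_i(x restricted to scope_i).
basis = [
    {"scope": ("machine0",), "h": lambda v: 1.0 if v["machine0"] == "working" else 0.0},
    {"scope": ("machine1",), "h": lambda v: 1.0 if v["machine1"] == "working" else 0.0},
    {"scope": ("machine0", "machine1"),            # a pairwise interaction term
     "h": lambda v: 1.0 if v["machine0"] == v["machine1"] == "working" else 0.0},
]
weights = np.array([2.0, 2.0, 0.5])                # would be fit by the LP/DP algorithms

def factored_value(state, basis, weights):
    """Evaluate V(x) without enumerating the exponential joint state space:
    each basis function only looks at the variables in its own scope."""
    return sum(w * b["h"]({v: state[v] for v in b["scope"]})
               for w, b in zip(weights, basis))

x = {"machine0": "working", "machine1": "failed"}
print(factored_value(x, basis, weights))           # 2.0
```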


Journal ArticleDOI
TL;DR: The generalization of bisimulation to stochastic processes yields a non-trivial notion of state equivalence that guarantees that the optimal policy for the reduced model immediately induces a corresponding optimal policy for the original model.

354 citations


Proceedings Article
21 Aug 2003
TL;DR: This work investigates methods for planning in a Markov Decision Process where the cost function is chosen by an adversary after the authors fix their policy, and develops efficient algorithms for matrix games where such best-response oracles (e.g., value iteration for a fixed cost vector) exist.
Abstract: We investigate methods for planning in a Markov Decision Process where the cost function is chosen by an adversary after we fix our policy. As a running example, we consider a robot path planning problem where costs are influenced by sensors that an adversary places in the environment. We formulate the problem as a zero-sum matrix game where rows correspond to deterministic policies for the planning player and columns correspond to cost vectors the adversary can select. For a fixed cost vector, fast algorithms (such as value iteration) are available for solving MDPs. We develop efficient algorithms for matrix games where such best response oracles exist. We show that for our path planning problem these algorithms are at least an order of magnitude faster than direct solution of the linear programming formulation.

280 citations
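
The zero-sum matrix game underlying the formulation can be solved by a standard linear program once (part of) the payoff matrix is in hand; the paper's algorithms grow that matrix lazily with a best-response oracle such as value iteration, which is only alluded to here. The sketch below, with an invented 3-policy-by-2-cost-vector matrix, computes the planning player's minimax mixture over deterministic policies.

```python
import numpy as np
from scipy.optimize import linprog

def solve_planner_mixture(M):
    """Planner's minimax mixed strategy over rows (policies) of the cost
    matrix M, where M[i, j] = cost of policy i under adversary cost vector j.
    Minimize v subject to: expected cost under every column <= v."""
    n_rows, n_cols = M.shape
    c = np.zeros(n_rows + 1); c[-1] = 1.0                    # minimize v
    A_ub = np.hstack([M.T, -np.ones((n_cols, 1))])           # M^T p - v <= 0
    b_ub = np.zeros(n_cols)
    A_eq = np.zeros((1, n_rows + 1)); A_eq[0, :n_rows] = 1.0 # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_rows], res.x[-1]                         # mixture over policies, game value

# Tiny illustrative matrix: 3 candidate policies vs 2 adversary cost vectors.
M = np.array([[1.0, 4.0],
              [3.0, 2.0],
              [5.0, 1.0]])
p, v = solve_planner_mixture(M)
print(p.round(3), round(v, 3))    # mixes the first two policies, game value 2.5
```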


Journal ArticleDOI
TL;DR: A complexity analysis of planning under uncertainty is presented, showing the "probabilistic classical planning" problem to be formally undecidable, as is any problem statement where the agent operates over an infinite or indefinite time horizon and has available only probabilistic information about the system's state.

273 citations


Journal ArticleDOI
TL;DR: A modeling formalism, Boolean logic Driven Markov Processes (BDMP), that enables the analyst to combine concepts inherited from fault trees and Markov models in a new way, allowing the definition of complex dynamic models while remaining nearly as readable and easy to build as fault trees.

232 citations


Proceedings Article
09 Aug 2003
TL;DR: This paper presents an approach to the generalization problem based on a new framework of relational Markov Decision Processes (RMDPs), and proves that a polynomial number of sampled environments suffices to achieve performance close to the performance achievable when optimizing over the entire space.
Abstract: A longstanding goal in planning research is the ability to generalize plans developed for some set of environments to a new but similar environment, with minimal or no replanning. Such generalization can both reduce planning time and allow us to tackle larger domains than the ones tractable for direct planning. In this paper, we present an approach to the generalization problem based on a new framework of relational Markov Decision Processes (RMDPs). An RMDP can model a set of similar environments by representing objects as instances of different classes. In order to generalize plans to multiple environments, we define an approximate value function specified in terms of classes of objects and, in a multiagent setting, by classes of agents. This class-based approximate value function is optimized relative to a sampled subset of environments, and computed using an efficient linear programming method. We prove that a polynomial number of sampled environments suffices to achieve performance close to the performance achievable when optimizing over the entire space. Our experimental results show that our method generalizes plans successfully to new, significantly larger, environments, with minimal loss of performance relative to environment-specific planning. We demonstrate our approach on a real strategic computer war game.

Proceedings Article
09 Dec 2003
TL;DR: This work considers the problem of maximizing the total long-term value of the system despite the self-interest of agents; the online mechanism design problem induces a Markov Decision Process (MDP) which, when solved, can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium.
Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total long-term value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium.

Book
20 Feb 2003
TL;DR: This book presents quantitative decision methods for forest resource management, ranging from linear, goal, integer, and dynamic programming to CPM/PERT, simulation of even- and uneven-aged stand management, Markov chain projection of forest landscape and income under risk, and Markov decision processes for optimizing forest income and biodiversity.
Abstract:
Preface
Introduction
Principles of Linear Programming: Formulations
Principles of Linear Programming: Solutions
Even-Aged Management: A First Model
Area- and Volume-Control Management with Linear Programming
A Dynamic Model of the Even-Aged Forest
Economic Objectives and Environmental Policies for Even-Aged Forests
Managing the Uneven-Aged Forest with Linear Programming
Economic and Environmental Management of Uneven-Aged Forests
Multiple Objectives Management with Goal Programming
Forest Resource Programming Models with Integer Variables
Project Management with CPM/PERT
Multistage Decision Making with Dynamic Programming
Simulation of Uneven-Aged Stand Management
Simulation of Even-Aged Forest Management
Projecting Forest Landscape and Income Under Risk with Markov Chains
Optimizing Forest Income and Biodiversity with Markov Decision Processes
Analysis of Forest Resource Investments
Econometric Analysis and Forecasting of Forest Product Markets
Appendix A: Compounding and Discounting
Appendix B: Elements of Matrix Algebra

Journal ArticleDOI
TL;DR: In this article, the authors consider a single inventory point facing independent stochastic demand and item returns and show average cost optimality of an (s,S)-order policy in this model.
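
For readers unfamiliar with the policy class, the sketch below simulates the long-run average cost of an (s,S) rule, order up to S whenever the inventory position falls to or below s, in a toy periodic-review system with stochastic demand and item returns. All distributions and cost parameters are invented; the paper's contribution is proving average-cost optimality of such a policy in its model, not this simulation.

```python
import numpy as np

def simulate_sS(s, S, periods=10_000, seed=0):
    """Periodic-review inventory with independent stochastic demand and returns,
    controlled by an (s, S) order policy; reports the long-run average cost."""
    rng = np.random.default_rng(seed)
    inv, total_cost = S, 0.0
    order_cost, unit_cost, holding, backlog_penalty = 50.0, 2.0, 1.0, 10.0
    for _ in range(periods):
        if inv <= s:                                  # (s, S) rule: order up to S
            total_cost += order_cost + unit_cost * (S - inv)
            inv = S
        demand = rng.poisson(5)
        returns = rng.poisson(1)                      # returned items rejoin stock
        inv = inv - demand + returns
        total_cost += holding * max(inv, 0) + backlog_penalty * max(-inv, 0)
    return total_cost / periods

print(simulate_sS(s=5, S=25))
```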

Journal ArticleDOI
TL;DR: The application of a Markov process and a DES model to an economic evaluation comparing alternative adjuvant therapies for early breast cancer indicates that the use of DES may be beneficial only when the available data demonstrates particular characteristics.
Abstract: Markov models have traditionally been used to evaluate the cost-effectiveness of competing health care technologies that require the description of patient pathways over extended time horizons. Discrete event simulation (DES) is a more flexible, but more complicated decision modelling technique, that can also be used to model extended time horizons. Through the application of a Markov process and a DES model to an economic evaluation comparing alternative adjuvant therapies for early breast cancer, this paper compares the respective processes and outputs of these alternative modelling techniques. DES displays increased flexibility in two broad areas, though the outputs from the two modelling techniques were similar. These results indicate that the use of DES may be beneficial only when the available data demonstrates particular characteristics.
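
For context, a Markov cohort model of the kind this comparison starts from: a cohort distributed over health states is advanced by a transition matrix each cycle while discounted costs and QALYs accumulate. The three states, transition probabilities, costs, and utilities below are invented placeholders, not the adjuvant-therapy model evaluated in the paper.

```python
import numpy as np

# Hypothetical 3-state model: disease-free, recurrence, dead (absorbing).
P = np.array([[0.90, 0.07, 0.03],      # transition probabilities per annual cycle
              [0.00, 0.75, 0.25],
              [0.00, 0.00, 1.00]])
cycle_cost = np.array([1_000.0, 12_000.0, 0.0])   # cost per person-year in each state
utility    = np.array([0.85, 0.60, 0.0])          # QALY weight per person-year

def run_cohort(P, cycle_cost, utility, horizon=30, discount=0.035):
    dist = np.array([1.0, 0.0, 0.0])              # everyone starts disease-free
    total_cost = total_qaly = 0.0
    for t in range(horizon):
        d = 1.0 / (1.0 + discount) ** t
        total_cost += d * dist @ cycle_cost
        total_qaly += d * dist @ utility
        dist = dist @ P                            # advance the cohort one cycle
    return total_cost, total_qaly

cost, qalys = run_cohort(P, cycle_cost, utility)
print(f"discounted cost per patient: {cost:.0f}, QALYs: {qalys:.2f}")
```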

Proceedings ArticleDOI
14 Jul 2003
TL;DR: A novel algorithm is presented that is the first effective technique to solve optimally a class of transition-independent decentralized MDPs and lays the foundation for further work in this area on both exact and approximate solutions.
Abstract: There has been substantial progress with formal models for sequential decision making by individual agents using the Markov decision process (MDP). However, similar treatment of multi-agent systems is lacking. A recent complexity result, showing that solving decentralized MDPs is NEXP-hard, provides a partial explanation. To overcome this complexity barrier, we identify a general class of transition-independent decentralized MDPs that is widely applicable. The class consists of independent collaborating agents that are tied together through a global reward function that depends upon both of their histories. We present a novel algorithm for solving this class of problems and examine its properties. The result is the first effective technique to solve optimally a class of decentralized MDPs. This lays the foundation for further work in this area on both exact and approximate solutions.

Proceedings Article
21 Aug 2003
TL;DR: The metric-E3 algorithm as mentioned in this paper is a generalization of the E3 algorithm, which assumes a black box for approximate planning and finds a near optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number of neighborhoods required for accurate local modeling.
Abstract: We present metric-E3, a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of accurate local models. The algorithm is a generalization of the E3 algorithm of Kearns and Singh, and assumes a black box for approximate planning. Unlike the original E3, metric-E3 finds a near-optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number of the state space. Informally, the covering number is the number of neighborhoods required for accurate local modeling.

Journal ArticleDOI
TL;DR: This paper shows how one can formulate this problem as a Markov decision process with recourse that considers decision making throughout the process life cycle and at different hierarchical levels, and decomposes the problem into a sequence of single-period subproblems, each of which is a two-stage stochastic program with recourse.

Journal ArticleDOI
TL;DR: Two results are derived, one for constant and one for random delays, for reducing an MDP with delays to an MDP without delays that differs only in the size of the state space.

Abstract: Markov decision processes (MDPs) may involve three types of delays. First, state information, rather than being available instantaneously, may arrive with a delay (observation delay). Second, an action may take effect at a later decision stage rather than immediately (action delay). Third, the cost induced by an action may be collected after a number of stages (cost delay). We derive two results, one for constant and one for random delays, for reducing an MDP with delays to an MDP without delays, which differs only in the size of the state space. The results are based on the intuition that costs may be collected asynchronously, i.e., at a stage other than the one in which they are induced, as long as they are discounted properly.
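
The constant-delay reduction can be pictured as state augmentation: the agent conditions on the most recent observed state together with the actions taken since, so the delayed process is again Markov and only the state-space size grows. The wrapper below sketches this for a constant observation delay k with a hypothetical env interface; random delays and the asynchronous collection of costs analyzed in the paper are not shown.

```python
from collections import deque

class ConstantObservationDelay:
    """Wrap an MDP so observations arrive k steps late. The augmented state
    (delayed observation, pending actions) is again Markov, so any standard
    MDP solver or learner can be applied to the wrapped process unchanged."""

    def __init__(self, env, k):
        self.env, self.k = env, k

    def reset(self):
        s = self.env.reset()
        self.pending = deque([None] * self.k, maxlen=self.k)          # actions not yet "seen"
        self.delayed = deque([s] * (self.k + 1), maxlen=self.k + 1)   # buffer of true states
        return (s, tuple(self.pending))                               # augmented initial state

    def step(self, action):
        s_next, reward, done = self.env.step(action)
        self.pending.append(action)
        self.delayed.append(s_next)
        observed = self.delayed[0]                                    # state from k steps ago
        return (observed, tuple(self.pending)), reward, done
```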

Journal ArticleDOI
TL;DR: This work considers a queueing system, commonly found in inbound telephone call centers, that processes two types of work, and determines the structure of effective routing policies that are optimal within the class of priority policies.
Abstract: We consider a queueing system, commonly found in inbound telephone call centers, that processes two types of work. Type-H jobs arrive at rate λH, are processed at rate μH, and are served first come, first served within class. A service-level constraint of the form E[delay] ≤ α or P{delay ≤ β} ≥ α limits the delay in queue that these jobs may face. An infinite backlog of type-L jobs awaits processing at rate μL, and there is no service-level constraint on this type of work. A pool of c identical servers processes all jobs, and a system controller must maximize the rate at which type-L jobs are processed, subject to the service-level constraint placed on the type-H work. We formulate the problem as a constrained, average-cost Markov decision process and determine the structure of effective routing policies. When the expected service times of the two classes are the same, these policies are globally optimal, and the computation time required to find the optimal policy is about that required to calculate the normalizing constant for a simple M/M/c system. When the expected service times of the two classes differ, the policies are optimal within the class of priority policies, and the optimal policy parameters can be determined through the solution of a linear program with O(c^3) variables and O(c^2) constraints.

Journal ArticleDOI
TL;DR: Two new probabilistic planning techniques are described, c-MAXPLAN and ZANDER, that generate contingent plans in probabilistic propositional domains by transforming the planning problem into a stochastic satisfiability problem and solving that problem instead.

Proceedings Article
09 Aug 2003
TL;DR: A new algorithm, GM-Sarsa(0), finds approximate solutions to multiple-goal reinforcement learning problems that are modeled as composite Markov decision processes, yielding good policies in the context of the composite task.

Abstract: We present a new algorithm, GM-Sarsa(0), for finding approximate solutions to multiple-goal reinforcement learning problems that are modeled as composite Markov decision processes. According to our formulation different sub-goals are modeled as MDPs that are coupled by the requirement that they share actions. Existing reinforcement learning algorithms address similar problem formulations by first finding optimal policies for the component MDPs, and then merging these into a policy for the composite task. The problem with such methods is that policies that are optimized separately may or may not perform well when they are merged into a composite solution. Instead of searching for optimal policies for the component MDPs in isolation, our approach finds good policies in the context of the composite task.
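
A compact sketch of the greatest-mass idea as described: each sub-goal MDP keeps its own action values, the executed action maximizes their sum, and every module is then updated on-policy with that shared action. The environment interface (env.step returning one reward per module), the tabular Q arrays, and the epsilon-greedy exploration are assumptions for illustration.

```python
import numpy as np

def gm_sarsa_episode(Q_modules, env, n_actions, alpha=0.1, gamma=0.99, eps=0.1, rng=None):
    """Q_modules[i] is an (n_states x n_actions) array for sub-goal MDP i.
    Actions are chosen greedily w.r.t. the SUM of module values (greatest mass),
    and each module is updated on-policy (Sarsa) with the jointly selected action."""
    rng = rng or np.random.default_rng(0)

    def select(state):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        combined = sum(Q[state] for Q in Q_modules)       # greatest-mass action values
        return int(np.argmax(combined))

    state, done = env.reset(), False
    action = select(state)
    while not done:
        next_state, rewards, done = env.step(action)      # one reward per module
        next_action = select(next_state)
        for Q, r in zip(Q_modules, rewards):
            target = r if done else r + gamma * Q[next_state][next_action]
            Q[state][action] += alpha * (target - Q[state][action])
        state, action = next_state, next_action
```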

Patent
21 May 2003
TL;DR: In this article, a Markov Decision Process (MDP) methodology is used to generate a simplified transition matrix representative of the potential state transitions for account holders, and a data structure is constructed to implement the transition matrix computationally in different sizes.

Abstract: A method and system is disclosed for enabling the accurate determination of price points (APRs), credit limits, and other discretionary levels for each cardholder that maximize Net Present Value (NPV) for the portfolio, given constraints on quantities such as risk of default. In accordance with one embodiment, the present invention uses a Markov Decision Process (MDP) methodology to generate a simplified transition matrix representative of the potential state transitions for account holders. This model applies account-level historical information on purchases, payments, profitability, and delinquency risk to make these decisions. In addition, a data structure is disclosed, constructed to implement a transition matrix computationally in different sizes.
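
Under a fixed pricing policy, the portfolio-value computation the patent builds on reduces to discounting a Markov reward chain over account states. The toy sketch below uses invented states, transition probabilities, and monthly profits; the patent's matrix-generation method and decision optimization are not reproduced.

```python
import numpy as np

# Hypothetical account states: current, 30-days delinquent, charged-off (absorbing).
P = np.array([[0.93, 0.05, 0.02],     # monthly transition matrix under one fixed
              [0.40, 0.45, 0.15],     # APR / credit-limit policy
              [0.00, 0.00, 1.00]])
profit = np.array([12.0, -5.0, 0.0])  # expected monthly profit per account in each state

def npv(P, profit, start, monthly_rate=0.01, horizon=120):
    """Expected discounted profit (NPV) of an account starting in `start`,
    accumulated over the Markov chain of account states."""
    dist = np.zeros(len(profit)); dist[start] = 1.0
    total = 0.0
    for t in range(horizon):
        total += dist @ profit / (1.0 + monthly_rate) ** t
        dist = dist @ P
    return total

print(round(npv(P, profit, start=0), 2))
```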

Proceedings Article
09 Dec 2003
TL;DR: In this paper, the authors explore approximate policy iteration, replacing the usual cost function learning step with a learning step in policy space, and give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve.
Abstract: We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs.

Book
01 Jan 2003
TL;DR: A formal framework and approximate planning algorithms that exploit structure in factored MDPs to solve problems with many trillions of states and actions very efficiently are built, demonstrating that the formal framework yields effective plans, complex agent coordination, and successful generalization in some of the largest planning problems in the literature.
Abstract: Many real-world tasks require multiple decision makers (agents) to coordinate their actions in order to achieve common long-term goals. Examples include: manufacturing systems, where managers of a factory coordinate to maximize profit; rescue robots that, after an earthquake, must safely find victims as fast as possible; or sensor networks, where multiple sensors collaborate to perform a large-scale sensing task under strict power constraints. All of these tasks require the solution of complex long-term multiagent planning problems in uncertain dynamic environments. Factored Markov decision processes (MDPs) allow us to represent complex uncertain dynamic systems very compactly by exploiting problem-specific structure. Specifically, the state of the system is described by a set of variables that evolve stochastically over time using a representation, called a dynamic Bayesian network, that often allows for an exponential reduction in representation complexity. However, the complexity of exact solution algorithms for such MDPs grows exponentially in the number of variables, and in the number of agents. This thesis builds a formal framework and approximate planning algorithms that exploit structure in factored MDPs to solve problems with many trillions of states and actions very efficiently. The main contributions of this thesis include:
Factored linear programs: a novel LP decomposition technique, using ideas from inference in Bayesian networks, that can exploit problem structure to reduce exponentially-large LPs to polynomially-sized ones that are provably equivalent.
Factored approximate planning: a suite of algorithms, building on our factored LP decomposition technique, that exploit structure in factored MDPs to obtain exponential reductions in planning time.
Distributed coordination: an efficient distributed multiagent decision making algorithm, where the coordination structure arises naturally from the factored representation of the system dynamics.
Generalization in relational MDPs: a framework for obtaining general solutions from a small set of environments, allowing agents to act in new environments without replanning.
Empirical evaluation: a detailed evaluation on a variety of large-scale tasks, including multiagent coordination in a real strategic computer game, demonstrating that our formal framework yields effective plans, complex agent coordination, and successful generalization in some of the largest planning problems in the literature.

Proceedings Article
09 Dec 2003
TL;DR: This work proposes an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities, via Kullback-Leibler divergence bounds.
Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to real-world problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.
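
A sketch of a robust Bellman recursion with Kullback-Leibler uncertainty sets, using the standard scalar dual of the inner worst-case expectation (a one-dimensional minimization over a temperature-like parameter). This is one common way to evaluate that inner problem and is meant only to illustrate the idea; it is not necessarily the numerical routine used in the paper, and the radius eps, costs, and discounting below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def worst_case_expectation(p, v, eps):
    """sup { q . v : KL(q || p) <= eps } via its scalar dual:
    min_{beta > 0} beta*eps + beta*log( sum_i p_i * exp(v_i / beta) ).
    Numerically stabilized by subtracting max(v)."""
    v_max = np.max(v)
    def dual(beta):
        return beta * eps + beta * np.log(np.sum(p * np.exp((v - v_max) / beta))) + v_max
    res = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded")
    return min(res.fun, v_max)            # the sup never exceeds the largest entry of v

def robust_value_iteration(P, C, eps, gamma=0.95, iters=200):
    """Robust Bellman recursion for a cost-minimizing agent: nature may perturb
    each nominal row P[s, a] within a KL ball of radius eps."""
    n_states, n_actions = C.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[C[s, a] + gamma * worst_case_expectation(P[s, a], V, eps)
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.min(axis=1)
    return V

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # nominal transition model P[s, a, s']
C = np.array([[1.0, 2.0],
              [0.5, 3.0]])                 # immediate costs C[s, a]
print(robust_value_iteration(P, C, eps=0.1).round(2))
```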

Proceedings Article
09 Aug 2003
TL;DR: The notion of SMDP homomorphism is introduced and it is argued that it provides a useful tool for a rigorous study of abstraction for SMDPs.
Abstract: To operate effectively in complex environments learning agents require the ability to selectively ignore irrelevant details and form useful abstractions. In this article we consider the question of what constitutes a useful abstraction in a stochastic sequential decision problem modeled as a semi-Markov Decision Process (SMDPs). We introduce the notion of SMDP homomorphism and argue that it provides a useful tool for a rigorous study of abstraction for SMDPs. We present an SMDP minimization framework and an abstraction framework for factored MDPs based on SMDP homomorphisms. We also model different classes of abstractions that arise in hierarchical systems. Although we use the options framework for purposes of illustration, the ideas are more generally applicable. We also show that the conditions for abstraction we employ are a generalization of earlier work by Dietterich as applied to the options framework.

DissertationDOI
01 Apr 2003
TL;DR: This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting; one chapter describes an application written for the Bunyip cluster that won the international Gordon-Bell prize for price/performance in 2001.
Abstract: Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder. The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author’s knowledge, no other policy-gradient algorithms have performed well at such tasks. The high variance of Monte-Carlo methods requires lengthy simulation and hence a super-computer to train agents within a reasonable time. The ANU “Bunyip” Linux cluster was built with such tasks in mind. It was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon-Bell prize for price/performance in 2001.

Journal ArticleDOI
TL;DR: It is shown that performance sensitivity formulas and policy iteration algorithms of semi-Markov decision processes can be derived based on the performance potential and realization matrix, and it is indicated that performance sensitivities and optimization depend only on first-order statistics.
Abstract: Recent research indicates that Markov decision processes (MDPs) can be viewed from a sensitivity point of view; and the perturbation analysis (PA), MDPs, and reinforcement learning (RL) are three closely related areas in optimization of discrete-event dynamic systems that can be modeled as Markov processes. The goal of this paper is two-fold. First, we develop the PA theory for semi-Markov processes (SMPs); and then we extend the aforementioned results about the relation among PA, MDP, and RL to SMPs. In particular, we show that performance sensitivity formulas and policy iteration algorithms of semi-Markov decision processes can be derived based on the performance potential and realization matrix. Both the long-run average and discounted-cost problems are considered. This approach provides a unified framework for both problems, and the long-run average problem corresponds to the discounted factor being zero. The results indicate that performance sensitivities and optimization depend only on first-order statistics. Single sample path-based implementations are discussed.

Journal ArticleDOI
TL;DR: This paper proposes a simple analytical model called M time scale Markov decision process (MMDPs) for hierarchically structured sequential decision making processes, where decisions in each level in the M-level hierarchy are made in M different discrete time scales.
Abstract: This paper proposes a simple analytical model called M time scale Markov decision process (MMDPs) for hierarchically structured sequential decision making processes, where decisions at each level of the M-level hierarchy are made on M different discrete time scales. In this model, the state-space and the control-space of each level in the hierarchy are nonoverlapping with those of the other levels, respectively, and the hierarchy is structured in a "pyramid" sense such that a decision made at level m (slower time scale) and/or the level-m state will affect the evolutionary decision making process of the lower level m+1 (faster time scale) until a new decision is made at the higher level, but the lower level decisions themselves do not affect the transition dynamics of higher levels. The performance produced by the lower level decisions will affect the higher level decisions. A hierarchical objective function is defined such that the finite-horizon value of following a (nonstationary) policy at level m+1 over a decision epoch of level m, plus an immediate reward at level m, is the single-step reward for the decision making process at level m. From this we define a "multi-level optimal value function" and derive a "multi-level optimality equation." We discuss how to solve MMDPs exactly and study some approximation methods, along with heuristic sampling-based schemes, for solving them.