
Showing papers on "Markov decision process published in 2001"


Journal ArticleDOI
TL;DR: In this article, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies is proposed.
Abstract: Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1] (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
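To make the recursion concrete, here is a minimal sketch of a GPOMDP-style estimator on a made-up two-state POMDP: an eligibility trace of policy score functions is discounted by β and correlated with the observed rewards to form the gradient estimate. The environment, the softmax parameterization, and all constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state POMDP (hypothetical, for illustration only): the agent sees a noisy
# observation of the state and picks one of two actions via a softmax policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s'] transition probabilities
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])     # R[a, s] reward
OBS_NOISE = 0.2                            # probability of observing the wrong state

def policy_probs(theta, obs):
    """Softmax policy over 2 actions; theta has one row of parameters per observation."""
    prefs = theta[obs]
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def gpomdp(theta, beta=0.9, T=50_000):
    """Biased estimate of the average-reward gradient via a beta-discounted eligibility trace."""
    z = np.zeros_like(theta)       # eligibility trace of score functions
    delta = np.zeros_like(theta)   # running gradient estimate
    s = 0
    for t in range(1, T + 1):
        obs = s if rng.random() > OBS_NOISE else 1 - s
        probs = policy_probs(theta, obs)
        a = rng.choice(2, p=probs)
        # gradient of log pi(a | obs; theta) for the softmax parameterization
        grad_log = np.zeros_like(theta)
        grad_log[obs] = -probs
        grad_log[obs, a] += 1.0
        z = beta * z + grad_log                    # discounted trace, beta in [0, 1)
        r = R[a, s]
        delta += (r * z - delta) / t               # running average of r_t * z_t
        s = rng.choice(2, p=P[a, s])
    return delta

theta = np.zeros((2, 2))
print(gpomdp(theta))
```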

587 citations


Proceedings Article
03 Jan 2001
TL;DR: This work presents a principled and efficient planning algorithm for cooperative multiagent dynamic systems that avoids the exponential blowup in the state and action space and is an efficient alternative to more complicated algorithms even in the single agent case.
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.

479 citations


Journal ArticleDOI
Michael L. Littman
TL;DR: A set of reinforcement-learning algorithms based on estimating value functions is described, along with convergence theorems for these algorithms, presented in a way that makes it easy to reason about the behavior of simultaneous learners in a shared environment.

404 citations


Proceedings Article
28 Jun 2001
TL;DR: The first algorithm for off-policy temporal-difference learning that is stable with linear function approximation is introduced, and it is proved that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation to the action-value function for an arbitrary target policy.
Abstract: We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal, learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length. Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet for which the approximate values found by Q-learning diverge to infinity. This problem prompted the development of residual gradient methods (Baird, 1995), which are stable but much slower than Q-learning, and fitted value iteration (Gordon, 1995, 1999), which is also stable but limited to restricted, weaker-than-linear function approximators. Of course, Q-learning has been used with linear function approximation since its invention (Watkins, 1989), often with good results, but the soundness of this approach is no longer an open question. There exist non-pathological Markov decision processes for which it diverges; it is absolutely unsound in this sense. A sensible response is to turn to some of the other reinforcement learning methods, such as Sarsa, that are also efficient and for which soundness remains a possibility. An important distinction here is between methods that must follow the policy they are learning about, called on-policy methods, and those that can learn from behavior generated by a different policy, called off-policy methods. Q-learning is an off-policy method in that it learns the optimal policy even when actions are selected according to a more exploratory or even random policy. Q-learning requires only that all actions be tried in all states, whereas on-policy methods like Sarsa require that they be selected with specific probabilities. Although the off-policy capability of Q-learning is appealing, it is also the source of at least part of its instability problems. For example, in one version of Baird's counterexample, the TD(λ) algorithm, which underlies both Q-learning and Sarsa, is applied with linear function approximation to learn the action-value function Q for a given policy π.
Operating in an on-policy mode, updating state–action pairs according to the same distribution they would be experienced under π, this method is stable and convergent near the best possible solution (Tsitsiklis and Van Roy, 1997; Tadic, 2001). However, if state-action pairs are updated according to a different distribution, say that generated by following the greedy policy, then the estimated values again diverge to infinity. This and related counterexamples suggest that at least some of the reason for the instability of Q-learning is that it is an off-policy method; they also make it clear that this part of the problem can be studied in a purely policy-evaluation context. Despite these problems, there remains substantial reason for interest in off-policy learning methods. Several researchers have argued for an ambitious extension of reinforcement learning ideas into modular, multi-scale, and hierarchical architectures (Sutton, Precup & Singh, 1999; Parr, 1998; Parr & Russell, 1998; Dietterich, 2000). These architectures rely on off-policy learning to learn about multiple subgoals and multiple ways of behaving from a single stream of experience. For these approaches to be feasible, some efficient way of combining off-policy learning and function approximation must be found. Because the problems with current off-policy methods become apparent in a policy evaluation setting, it is there that we focus in this paper. In previous work we considered multi-step off-policy policy evaluation in the tabular case. In this paper we introduce the first off-policy policy evaluation method consistent with linear function approximation. Our mathematical development focuses on the episodic case, and in fact on a single episode. Given a starting state and action, we show that the expected off-policy update under our algorithm is the same as the expected on-policy update under conventional TD(λ). This, together with some variance conditions, allows us to prove convergence and bounds on the error in the asymptotic approximation identical to those obtained by Tsitsiklis and Van Roy (1997; Bertsekas and Tsitsiklis, 1996). 1. Notation and Main Result. We consider the standard episodic reinforcement learning framework (see, e.g., Sutton & Barto, 1998) in which a learning agent interacts with a Markov decision process (MDP). Our notation focuses on a single episode of T time steps, s_0, a_0, r_1, s_1, a_1, r_2, ..., r_T, s_T, with states s_t ∈ S, actions a_t ∈ A, and rewards r_t ∈ ℝ. We take the initial state and action, s_0 and a_0, to be given arbitrarily. Given a state and action, s_t and a_t, the next reward, r_{t+1}, is a random variable with mean r̄_{s_t}^{a_t}, and the next state, s_{t+1}, is chosen with probabilities p_{s_t s_{t+1}}^{a_t}. The final state is a special terminal state that may not occur on any preceding time step. Given a state, s_t, 0 < t < T, the action a_t is selected according to probability π(s_t, a_t) or b(s_t, a_t), depending on whether policy π or policy b is in force. We always use π to denote the target policy, the policy that we are learning about. In the on-policy case, π is also used to generate the actions of the episode. In the off-policy case, the actions are instead generated by b, which we call the behavior policy. In either case, we seek an approximation to the action-value function Q : S × A → ℝ for the target policy π: Q(s, a) = E_π{ r_{t+1} + γ r_{t+2} + ··· + γ^{T−t−1} r_T | s_t = s, a_t = a }, where 0 ≤ γ ≤ 1 is a discount-rate parameter.
We consider approximations that are linear in a set of feature vectors {φ_{sa}}, s ∈ S, a ∈ A: Q(s, a) ≈ θ⊤φ_{sa} = Σ_{i=1}^{n} θ(i) φ_{sa}(i).
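The following sketch shows, in simplified form, how the ingredients above fit together: linear action-value features, a TD(λ) eligibility trace, and likelihood ratios π/b that reweight updates collected under the behavior policy. It is a naive trajectory-weighted variant for illustration only, not the variance-controlled update analyzed in the paper, and the small episodic MDP and both policies are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny episodic MDP used purely for illustration (hypothetical, not from the paper):
# states 0 and 1, terminal state 2, two actions.  Q(s, a) is approximated linearly
# with one-hot features, so the linear case coincides with a table and is easy to check.
N_S, N_A, TERMINAL = 3, 2, 2
gamma, lam, alpha = 0.9, 0.8, 0.02

def phi(s, a):
    f = np.zeros(N_S * N_A)
    f[s * N_A + a] = 1.0
    return f

def step(s, a):
    """Toy dynamics: action 0 tends to terminate with reward 1, action 1 with reward 0."""
    if a == 0:
        return (TERMINAL, 1.0) if rng.random() < 0.5 else (1 - s, 0.0)
    return (TERMINAL, 0.0) if rng.random() < 0.3 else (1 - s, 0.2)

b  = np.full((N_S, N_A), 0.5)                         # behavior policy: uniform random
pi = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # target policy to evaluate

theta = np.zeros(N_S * N_A)
for episode in range(20_000):
    s = int(rng.integers(2))
    a = rng.choice(N_A, p=b[s])
    z = phi(s, a)              # accumulating eligibility trace
    rho = 1.0                  # product of importance-sampling ratios pi/b along the episode
    while True:
        s2, r = step(s, a)
        if s2 == TERMINAL:
            delta = r - theta @ phi(s, a)
            theta += alpha * rho * delta * z
            break
        a2 = rng.choice(N_A, p=b[s2])
        delta = r + gamma * theta @ phi(s2, a2) - theta @ phi(s, a)
        theta += alpha * rho * delta * z
        rho *= pi[s2, a2] / b[s2, a2]          # weight later updates by how likely pi is to do this
        z = gamma * lam * z + phi(s2, a2)
        s, a = s2, a2

print(theta.reshape(N_S, N_A))                 # approximate Q^pi(s, a)
```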

363 citations


Journal ArticleDOI
TL;DR: This paper proposes a simulation-based algorithm for optimizing the average reward in a finite-state Markov reward process that depends on a set of parameters and relies on the regenerative structure of finite-state Markov processes.
Abstract: This paper proposes a simulation-based algorithm for optimizing the average reward in a finite-state Markov reward process that depends on a set of parameters. As a special case, the method applies to Markov decision processes where optimization takes place within a parametrized set of policies. The algorithm relies on the regenerative structure of finite-state Markov processes, involves the simulation of a single sample path, and can be implemented online. A convergence result (with probability 1) is provided.

344 citations


Proceedings Article
04 Aug 2001
TL;DR: This technique uses an MDP whose dynamics is represented in a variant of the situation calculus allowing for stochastic actions and produces a logical description of the optimal value function and policy by constructing a set of first-order formulae that minimally partition state space according to distinctions made by the value function and policy.
Abstract: We present a dynamic programming approach for the solution of first-order Markov decision processes. This technique uses an MDP whose dynamics is represented in a variant of the situation calculus allowing for stochastic actions. It produces a logical description of the optimal value function and policy by constructing a set of first-order formulae that minimally partition state space according to distinctions made by the value function and policy. This is achieved through the use of an operation known as decision-theoretic regression. In effect, our algorithm performs value iteration without explicit enumeration of either the state or action spaces of the MDP. This allows problems involving relational fluents and quantification to be solved without requiring explicit state space enumeration or conversion to propositional form.

262 citations


Proceedings ArticleDOI
28 May 2001
TL;DR: A multi-agent extension to Markov decision processes is presented in which communication can be modeled as an explicit action that incurs a cost; it provides a foundation for a quantified study of agent coordination policies and offers both motivation and insight for the design of heuristic approaches.
Abstract: In multi-agent cooperation, agents share a common goal, which is evaluated through a global utility function. However, agents typically cannot observe the global state of an uncertain environment, and therefore they must communicate with each other in order to share the information needed for deciding which actions to take. We argue that, when communication incurs a cost (due to resource consumption, for example), whether to communicate or not also becomes a decision to make. Hence, the communication decision becomes part of the overall agent decision problem. In order to explicitly address this problem, we present a multi-agent extension to Markov decision processes in which communication can be modeled as an explicit action that incurs a cost. This framework provides a foundation for a quantified study of agent coordination policies and provides both motivation and insight into the design of heuristic approaches. An example problem is studied under this framework. From this example we can see the impact communication policies have on the overall agent policies, and what implications we can draw for the design of agent coordination policies.

237 citations


Journal ArticleDOI
TL;DR: An algorithm is given for improving any given strategy by local computation of single policy updates, and conditions are investigated for the resulting strategy to be optimal.
Abstract: We introduce the notion of LImited Memory Influence Diagram (LIMID) to describe multistage decision problems in which the traditional assumption of no forgetting is relaxed. This can be relevant in situations with multiple decision makers or when decisions must be prescribed under memory constraints, such as in partially observed Markov decision processes (POMDPs). We give an algorithm for improving any given strategy by local computation of single policy updates and investigate conditions for the resulting strategy to be optimal.

229 citations


Journal ArticleDOI
TL;DR: This paper gives the first rigorous convergence analysis of analogues of Watkins's Q-learning algorithm, applied to average cost control of finite-state Markov chains, using the ODE method.
Abstract: This paper gives the first rigorous convergence analysis of analogues of Watkins's Q-learning algorithm, applied to average cost control of finite-state Markov chains. We discuss two algorithms which may be viewed as stochastic approximation counterparts of two existing algorithms for recursively computing the value function of the average cost problem---the traditional relative value iteration (RVI) algorithm and a recent algorithm of Bertsekas based on the stochastic shortest path (SSP) formulation of the problem. Both synchronous and asynchronous implementations are considered and analyzed using the ODE method. This involves establishing asymptotic stability of associated ODE limits. The SSP algorithm also uses ideas from two-time-scale stochastic approximation.
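A minimal sketch of a relative-value-iteration (RVI) style Q-learning update for the average-cost setting, in the spirit of the algorithms analyzed, is shown below: the discount factor is replaced by subtracting the Q-value at a fixed reference state-action pair, which then tracks the average cost. The three-state MDP, the reference pair, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3-state, 2-action average-cost MDP for illustration.
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.3, 0.3, 0.4], [0.2, 0.2, 0.6]],
              [[0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]])  # P[s, a, s']
C = np.array([[2.0, 1.0], [1.5, 3.0], [0.5, 2.5]])  # per-step cost C[s, a]

n_s, n_a = C.shape
Q = np.zeros((n_s, n_a))
ref = (0, 0)        # reference state-action pair; Q[ref] plays the role of the average cost
alpha = 0.01

s = 0
for t in range(100_000):
    a = rng.integers(n_a) if rng.random() < 0.1 else int(np.argmin(Q[s]))  # epsilon-greedy
    s2 = rng.choice(n_s, p=P[s, a])
    # RVI-style update: subtract Q at the reference pair instead of discounting,
    # so Q converges to relative (differential) values.
    target = C[s, a] + Q[s2].min() - Q[ref]
    Q[s, a] += alpha * (target - Q[s, a])
    s = s2

print("estimated average cost:", Q[ref])
print("greedy policy:", Q.argmin(axis=1))
```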

208 citations


Book ChapterDOI
TL;DR: A novel technique for model checking quantitative reachability properties on Markov decision processes is reported, together with its prototype implementation; the analysis is performed on an abstraction of the model under analysis that is significantly smaller than the original model.
Abstract: We report on a novel development to model check quantitative reachability properties on Markov decision processes together with its prototype implementation. The innovation of the technique is that the analysis is performed on an abstraction of the model under analysis. Such an abstraction is significantly smaller than the original model and may safely refute or accept the required property. Otherwise, the abstraction is refined and the process repeated. As the numerical analysis necessary to determine the validity of the property is more costly than the refinement process, the technique profits from applying such numerical analysis on smaller state spaces.

175 citations


Book ChapterDOI
01 Jan 2001
TL;DR: In this paper, a general framework for defining probabilistic process languages is presented, and a family of preorders is defined and shown to be precongruences with respect to the algebraic operators that can be defined in the general framework.
Abstract: In this chapter, we adopt Probabilistic Transition Systems as a basic model for probabilistic processes, in which probabilistic and nondeterministic choices are independent concepts. The model is essentially a nondeterministic version of Markov decision processes or probabilistic automata of Rabin. We develop a general framework to define probabilistic process languages to describe probabilistic transition systems. In particular, we show how operators for nonprobabilistic process algebras can be lifted to probabilistic process algebras in a uniform way similar to de Simone format. To establish a notion of refinement, we present a family of preorders including probabilistic bisimulation and simulation, and probabilistic testing pre-orders as well as their logical or denotational characterization. These preorders are shown to be precongruences with respect to the algebraic operators that can be defined in our general framework. Finally, we give a short account of the important work on extending the successful field of model checking to probabilistic settings and a brief discussion on current research in the area.

Journal ArticleDOI
TL;DR: This paper proposes a method for accelerating the convergence of value iteration, a well-known algorithm for finding optimal policies for POMDPs; the method has been evaluated on an array of benchmark problems and was found to be very effective.
Abstract: Partially observable Markov decision processes (POMDPs) have recently become popular among many AI researchers because they serve as a natural model for planning under uncertainty. Value iteration is a well-known algorithm for finding optimal policies for POMDPs. It typically takes a large number of iterations to converge. This paper proposes a method for accelerating the convergence of value iteration. The method has been evaluated on an array of benchmark problems and was found to be very effective: It enabled value iteration to converge after only a few iterations on all the test problems.

Proceedings Article
04 Aug 2001
TL;DR: This paper presents the first approximate MDP solution algorithms - both value and policy iteration - that use max-norm projection, thereby directly optimizing the quantity required to obtain the best error bounds.
Abstract: Markov Decision Processes (MDPs) provide a coherent mathematical framework for planning under uncertainty. However, exact MDP solution algorithms require the manipulation of a value function, which specifies a value for each state in the system. Most real-world MDPs are too large for such a representation to be feasible, preventing the use of exact MDP algorithms. Various approximate solution algorithms have been proposed, many of which use a linear combination of basis functions as a compact approximation to the value function. Almost all of these algorithms use an approximation based on the (weighted) L2-norm (Euclidean distance); this approach prevents the application of standard convergence results for MDP algorithms, all of which are based on max-norm. This paper makes two contributions. First, it presents the first approximate MDP solution algorithms - both value and policy iteration - that use max-norm projection, thereby directly optimizing the quantity required to obtain the best error bounds. Second, it shows how these algorithms can be applied efficiently in the context of factored MDPs, where the transition model is specified using a dynamic Bayesian network.
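The max-norm projection at the heart of these algorithms can be posed as a small linear program: minimize ε subject to |Φw − y| ≤ ε componentwise, where y would come from a Bellman backup. A sketch with hypothetical features and targets, assuming scipy is available:

```python
import numpy as np
from scipy.optimize import linprog

# Max-norm projection: find weights w minimizing  max_s | (Phi w)(s) - y(s) |.
# Phi and y below are small hypothetical examples; y would be a one-step backup.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])          # basis functions evaluated at 4 states
y = np.array([0.2, 1.1, 1.9, 3.2])    # e.g. Bellman backup values

n_states, k = Phi.shape
# Variables: [w_1, ..., w_k, eps].  Minimize eps subject to -eps <= Phi w - y <= eps.
c = np.zeros(k + 1)
c[-1] = 1.0
A_ub = np.vstack([np.hstack([Phi, -np.ones((n_states, 1))]),     #  Phi w - eps <= y
                  np.hstack([-Phi, -np.ones((n_states, 1))])])   # -Phi w - eps <= -y
b_ub = np.concatenate([y, -y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * k + [(0, None)])
w, eps = res.x[:k], res.x[-1]
print("weights:", w, "max-norm error:", eps)
```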

Journal ArticleDOI
TL;DR: Experimental results show that the power management method based on a Markov decision process outperforms heuristic methods by as much as 44% in terms of power dissipation savings for a given level of system performance.
Abstract: The goal of a dynamic power management policy is to reduce the power consumption of an electronic system by putting system components into different states, each representing a certain performance and power consumption level. The policy determines the type and timing of these transitions based on the system history, workload, and performance constraints. In this paper we propose a new abstract model of a power-managed electronic system. We formulate the problem of system-level power management as a controlled optimization problem based on the theories of continuous-time Markov decision processes and stochastic networks. This problem is solved exactly using linear programming or heuristically using "policy iteration." Our method is compared with existing heuristic methods for different workload statistics. Experimental results show that the power management method based on a Markov decision process outperforms heuristic methods by as much as 44% in terms of power dissipation savings for a given level of system performance.
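As a rough illustration of the optimization machinery, the sketch below runs policy iteration on a tiny, hypothetical discrete-time abstraction of a power-managed device; the paper's actual model is continuous-time and is also solved exactly with linear programming, so this conveys only the flavor of the "policy iteration" route, and every number is made up.

```python
import numpy as np

# States: 0 = ON idle, 1 = ON busy, 2 = SLEEP idle, 3 = SLEEP with a request waiting.
# Actions: 0 = keep mode, 1 = switch mode.  Transition probabilities are hypothetical.
P = np.zeros((2, 4, 4))
P[0, 0] = [0.6, 0.4, 0.0, 0.0]   # ON idle, keep: request arrives w.p. 0.4
P[0, 1] = [0.5, 0.5, 0.0, 0.0]   # ON busy, keep: service completes w.p. 0.5
P[0, 2] = [0.0, 0.0, 0.6, 0.4]   # SLEEP idle, keep
P[0, 3] = [0.0, 0.0, 0.0, 1.0]   # SLEEP with request, keep: request keeps waiting
P[1, 0] = [0.0, 0.0, 0.6, 0.4]   # ON idle, switch to SLEEP
P[1, 1] = [0.0, 0.0, 0.0, 1.0]   # ON busy, switch to SLEEP: request now waits
P[1, 2] = [0.6, 0.4, 0.0, 0.0]   # SLEEP idle, wake up
P[1, 3] = [0.0, 1.0, 0.0, 0.0]   # SLEEP with request, wake up and serve
power   = np.array([1.0, 2.0, 0.1, 0.1])      # power cost per state
latency = np.array([0.0, 0.0, 0.0, 3.0])      # penalty for a waiting request
switch_cost = np.array([0.0, 0.5])            # energy cost of a mode transition
gamma = 0.95

def cost(s, a):
    return power[s] + latency[s] + switch_cost[a]

policy = np.zeros(4, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = c_pi
    P_pi = np.array([P[policy[s], s] for s in range(4)])
    c_pi = np.array([cost(s, policy[s]) for s in range(4)])
    V = np.linalg.solve(np.eye(4) - gamma * P_pi, c_pi)
    # Policy improvement: greedy with respect to V
    Q = np.array([[cost(s, a) + gamma * P[a, s] @ V for a in range(2)] for s in range(4)])
    new_policy = Q.argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("policy (0 = keep mode, 1 = switch):", policy)
print("expected discounted cost:", np.round(V, 2))
```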

Journal ArticleDOI
TL;DR: It is shown that for several variations of partially observable Markov decision processes, polynomial-time algorithms for finding control policies are unlikely to or simply don't have guarantees of finding policies within a constant factor or a constant summand of optimal.
Abstract: We show that for several variations of partially observable Markov decision processes, polynomial-time algorithms for finding control policies are unlikely to or simply don't have guarantees of finding policies within a constant factor or a constant summand of optimal. Here "unlikely" means "unless some complexity classes collapse," where the collapses considered are P = NP, P = PSPACE, or P = EXP. Until or unless these collapses are shown to hold, any control-policy designer must choose between such performance guarantees and efficient computation.

Dissertation
01 Jan 2001
TL;DR: Two ways to accelerate value iteration for POMDPs are investigated, both built on conducting dynamic programming (DP) updates over a belief subspace, a subset of the belief space: one reduces the number of DP updates needed for convergence, and the other reduces the cost of each DP update.
Abstract: A Partially Observable Markov Decision Process (POMDP) is a general sequential decision-making model where the effects of actions are nondeterministic and only partial information about world states is available. However, finding near-optimal solutions for POMDPs is computationally difficult. Value iteration is a standard algorithm for solving POMDPs. It conducts a sequence of dynamic programming (DP) updates to improve value functions. Value iteration is inefficient for two reasons. First, a DP update is expensive due to the need to account for all belief states in a continuous belief space. Second, value iteration needs to conduct a large number of DP updates before its convergence. This thesis investigates two ways to accelerate value iteration. The work presented centers around the idea of conducting DP updates and therefore value iteration over a belief subspace, a subset of belief space. The first use of belief subspace is to reduce the number of DP updates for value iteration to converge. We design a computationally cheap procedure considering a belief subspace which consists of a finite number of belief states. It is used as an additional step for improving value functions. Due to additional improvements by the procedure, value iteration conducts fewer DP updates and therefore is more efficient. The second use of belief subspace is to reduce the complexity of DP updates. We establish a framework on how to carry out value iteration over a belief subspace determined by a POMDP model. Whether the belief subspace is smaller than the belief space is model dependent. If this is true for a POMDP, value iteration over the belief subspace is expected to be more efficient. Based on this framework, we study three POMDP classes with special problem characteristics and propose different value iteration algorithms for them. (1) An informative POMDP assumes that an agent always has a good idea about the world states. The subspace determined by the model is much smaller than the belief space. Value iteration over the belief subspace is more efficient for this POMDP class. (2) A near-discernible POMDP assumes that the agent can get a good idea about states once in a while if it executes some particular actions. For such a POMDP, the belief subspace determined by the model can be of the same size as the belief space. We propose an anytime value iteration algorithm which focuses the computations on a small belief subspace and gradually expands it. (3) A more general class than near-discernible POMDPs assumes that the agent can get a good idea about states with a high likelihood once in a while if it executes some particular actions. For such POMDPs, we adapt the anytime algorithm to conduct value iteration over a growing belief subspace.

01 Jan 2001
TL;DR: This edited volume surveys sequence learning, covering sequence clustering and prediction with Markov models and neural networks, symbolic sequence discovery, sequential decision making (including hierarchical reinforcement learning and hidden-mode Markov decision processes), and biologically inspired sequence learning models.
Abstract: Contents: Introduction to Sequence Learning; Sequence Clustering and Learning with Markov Models; Sequence Learning via Bayesian Clustering by Dynamics; Using Dynamic Time Warping to Bootstrap HMM-Based Clustering of Time Series; Sequence Prediction and Recognition with Neural Networks; Anticipation Model for Sequential Learning of Complex Sequences; Bidirectional Dynamics for Protein Secondary Structure Prediction; Time in Connectionist Models; On the Need for a Neural Abstract Machine; Sequence Discovery with Symbolic Methods; Sequence Mining in Categorical Domains: Algorithms and Applications; Sequence Learning in the ACT-R Cognitive Architecture: Empirical Analysis of a Hybrid Model; Sequential Decision Making; Sequential Decision Making Based on Direct Search; Automatic Segmentation of Sequences through Hierarchical Reinforcement Learning; Hidden-Mode Markov Decision Processes for Nonstationary Sequential Decision Making; Pricing in Agent Economies Using Neural Networks and Multi-agent Q-Learning; Biologically Inspired Sequence Learning Models; Multiple Forward Model Architecture for Sequence Processing; Integration of Biologically Inspired Temporal Mechanisms into a Cortical Framework for Sequence Processing; Attentive Learning of Sequential Handwriting Movements: A Neural Network Model.

Proceedings Article
03 Jan 2001
TL;DR: This paper presents a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form.
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker—usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time.
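A minimal sketch of the approximate-linear-programming idea with constraint generation is given below, on a small dense MDP generated at random; in the factored setting targeted by the paper, the search for the most violated constraint would itself be done by optimization rather than brute force. All sizes, the basis functions, and the state-relevance weights are illustrative assumptions (scipy assumed available).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

# Hypothetical small MDP; in practice the point of the method is that the state
# space is huge and constraints are generated lazily instead of enumerated.
n_s, n_a, gamma = 20, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))      # P[s, a] is a distribution over s'
R = rng.random((n_s, n_a))
Phi = np.column_stack([np.ones(n_s), np.arange(n_s), np.arange(n_s) ** 2 / n_s])  # 3 basis functions
k = Phi.shape[1]
alpha_dist = np.full(n_s, 1.0 / n_s)                  # state-relevance weights

def constraint_row(s, a):
    # Phi[s] @ w >= R[s, a] + gamma * P[s, a] @ (Phi @ w)
    # rewritten as (gamma * P[s, a] @ Phi - Phi[s]) @ w <= -R[s, a]
    return gamma * P[s, a] @ Phi - Phi[s], -R[s, a]

c = alpha_dist @ Phi                                  # minimize sum_s alpha(s) * Phi[s] @ w
active = [(s, 0) for s in range(n_s)]                 # start with a small constraint set
for it in range(50):
    rows, rhs = zip(*(constraint_row(s, a) for s, a in active))
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=[(None, None)] * k)
    w = res.x
    # Find the most violated Bellman constraint over all (s, a); brute force here.
    V = Phi @ w
    viol = R + gamma * np.einsum('saj,j->sa', P, V) - V[:, None]
    s_star, a_star = np.unravel_index(np.argmax(viol), viol.shape)
    if viol[s_star, a_star] <= 1e-8:
        break
    active.append((s_star, a_star))

print("weights:", w, "constraint-generation iterations:", it + 1)
```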

Dissertation
01 Jan 2001
TL;DR: This thesis considers three complications that arise from applying reinforcement learning to a real-world application, and employs importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with few data.
Abstract: This thesis considers three complications that arise from applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent to cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with few data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, this method provides a complete return surface which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and natural solutions based on our sampling method. The thesis concludes with example results from employing our algorithm to the domain of automated electronic market-making. Thesis Supervisor: Tomaso Poggio Title: Professor of Brain and Cognitive Science
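The core likelihood-ratio idea can be sketched in a few lines: weight each return observed under the behavior policy by the ratio of target-policy to behavior-policy action probabilities. The thesis's estimator is considerably more refined (and extends to function approximation and multiple objectives); the toy task, both policies, and the weighted variant shown here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 2-action task for illustration of ordinary vs. weighted importance sampling.
def behavior_probs(obs):
    return np.array([0.5, 0.5])

def target_probs(obs, temperature=0.3):
    prefs = np.array([1.0, 0.0]) / temperature
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def collect_episode(T=5):
    """One episode under the behavior policy, recording the cumulative likelihood ratio."""
    log_ratio, ret = 0.0, 0.0
    for t in range(T):
        obs = rng.random()
        pb = behavior_probs(obs)
        a = rng.choice(2, p=pb)
        log_ratio += np.log(target_probs(obs)[a]) - np.log(pb[a])
        ret += 1.0 if (a == 0 and obs < 0.8) else 0.0   # toy reward
    return np.exp(log_ratio), ret

ratios, returns = np.array([collect_episode() for _ in range(10_000)]).T
print("ordinary IS estimate of target-policy return :", np.mean(ratios * returns))
print("weighted IS estimate (lower variance, biased):", np.sum(ratios * returns) / np.sum(ratios))
```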

Journal Article
TL;DR: In this paper, the authors generalize Faustmann's approach by recognizing that future stand states and prices are known only as probabilistic distributions, and the objective function is then the expected discounted value of returns, over an infinite horizon.
Abstract: Faustmann's formula gives the land value, or the forest value of land with trees, under deterministic assumptions regarding future stand growth and prices, over an infinite horizon. Markov decision process (MDP) models generalize Faustmann's approach by recognizing that future stand states and prices are known only as probabilistic distributions. The objective function is then the expected discounted value of returns, over an infinite horizon. It gives the land or the forest value in a stochastic environment. In MDP models, the laws of motion between stand-price states are Markov chains. Faustmann's formula is a special case where the probability of movement from one state to another is equal to unity. MDP models apply whether the stand state is bare land, or any state with trees, be it even- or uneven-aged. Decisions change the transition probabilities between stand states through silvicultural interventions. Decisions that maximize land or forest value depend only on the stand-price state, independently of how it was reached. Furthermore, to each stand-price state corresponds one single best decision. The solution of the MDP gives simultaneously the best decision for each state, and the forest value (land plus trees), given the stand state and following the best policy. Numerical solutions use either successive approximation, or linear programming. Examples with deterministic and stochastic cases show in particular the convergence of the MDP model to Faustmann's formula when the future is assumed known with certainty. In this deterministic environment, Faustmann's rule is independent of the distribution of stands in the forest. FOR. SCI. 47(4):466-474.
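A minimal sketch of the stationary MDP formulation solved by successive approximation, with a hypothetical three-state stand model (bare land, young, mature) and made-up transition probabilities and revenues; with deterministic growth the same recursion reduces to a Faustmann-style rule.

```python
import numpy as np

# Hypothetical stand-state MDP: states are stand classes (0 = bare land / just harvested,
# 1 = young, 2 = mature), actions are 0 = wait, 1 = harvest.  All numbers are made up.
P = np.zeros((2, 3, 3))
P[0, 0] = [0.1, 0.9, 0.0]   # wait on bare land: regeneration
P[0, 1] = [0.0, 0.3, 0.7]   # wait on young stand: may mature
P[0, 2] = [0.0, 0.0, 1.0]   # wait on mature stand
P[1] = np.tile([1.0, 0.0, 0.0], (3, 1))   # harvest: back to bare land
revenue = np.array([[0.0, 0.0],     # bare land: nothing to harvest
                    [0.0, 20.0],    # young stand harvest value
                    [0.0, 100.0]])  # mature stand harvest value
discount = 1 / 1.05                 # 5% discount rate per period

V = np.zeros(3)
for _ in range(1000):               # successive approximation (value iteration)
    Q = revenue + discount * np.einsum('asj,j->sa', P, V)
    V = Q.max(axis=1)

print("forest value by stand state:", np.round(V, 1))     # V[0] is the bare-land (land) value
print("best decision per state (0 = wait, 1 = harvest):", Q.argmax(axis=1))
```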

01 Jan 2001
TL;DR: The authors demonstrate that the uncertain model approach can be used to solve a class of nearly Markovian Decision Problems, providing lower bounds on performance in stochastic models with higher-order interactions.
Abstract: The authors consider the fundamental problem of finding good policies in uncertain models. It is demonstrated that although the general problem of finding the best policy with respect to the worst model is NP-hard, in the special case of a convex uncertainty set the problem is tractable. A stochastic dynamic game is proposed, and the security equilibrium solution of the game is shown to correspond to the value function under the worst model and the optimal controller. The authors demonstrate that the uncertain model approach can be used to solve a class of nearly Markovian Decision Problems, providing lower bounds on performance in stochastic models with higher-order interactions. The framework considered establishes connections between and generalizes paradigms of stochastic optimal, mini-max, and H∞/robust control. Applications are considered, including robustness in reinforcement learning, planning in nearly Markovian decision processes, and bounding error due to sensor discretization in noisy, continuous state-spaces.
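A sketch of worst-case (robust) value iteration under one particularly simple convex uncertainty set, {(1 − ε)·p_nominal + ε·q : q an arbitrary distribution}, for which the inner minimization has a closed form; the paper treats general convex sets through a stochastic game, so this only conveys the flavor of the approach, and the MDP below is random and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical MDP; the adversary may shift a fraction eps of transition mass anywhere.
n_s, n_a, gamma, eps = 5, 2, 0.9, 0.15
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))     # nominal transition model P[s, a, s']
R = rng.random((n_s, n_a))

V = np.zeros(n_s)
for _ in range(500):
    # Worst-case expected next value: adversary moves eps of the mass to the worst state.
    worst_next = (1 - eps) * np.einsum('saj,j->sa', P, V) + eps * V.min()
    V = (R + gamma * worst_next).max(axis=1)

V_nom = np.zeros(n_s)
for _ in range(500):                                  # nominal value iteration, for comparison
    V_nom = (R + gamma * np.einsum('saj,j->sa', P, V_nom)).max(axis=1)

print("robust (worst-case) values:", np.round(V, 3))
print("nominal values            :", np.round(V_nom, 3))
```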

Journal ArticleDOI
TL;DR: A method is proposed in which, depending on the number of modems available and the arrival and departure rates of different classes of customers, a decision is made whether to accept or reject a log-on request; this maximizes the discounted value to ISPs while improving service levels for higher-class customers.
Abstract: In this paper we study strategies for better utilizing the network capacity of Internet Service Providers (ISPs) when they are faced with stochastic and dynamic arrivals and departures of customers attempting to log-on or log-off, respectively. We propose a method in which, depending on the number of modems available, and the arrival and departure rates of different classes of customers, a decision is made whether to accept or reject a log-on request. The problem is formulated as a continuous time Markov Decision Process for which optimal policies can be readily derived using techniques such as value iteration. This decision maximizes the discounted value to ISPs while improving service levels for higher class customers. The methodology is similar to yield management techniques successfully used in airlines, hotels, etc. However, there are sufficient differences, such as no predefined time horizon or reservations, that make this model interesting to pursue and challenging. This work was completed in collaboration with one of the largest ISPs in Connecticut. The problem is topical, and approaches such as those proposed here are sought by users. © 2001 John Wiley & Sons, Inc., Naval Research Logistics 48:348–362, 2001
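A sketch of the kind of admission-control model described, uniformized into a discrete-time value iteration: decisions are taken at log-on events, lump-sum revenue is collected on acceptance, and departures occur at a per-customer rate. The number of modems, rates, revenues, and discount rate are invented, and the paper's exact formulation may differ.

```python
import numpy as np
from itertools import product

# Hypothetical two-class admission-control model: C modems, Poisson log-on attempts,
# exponential session lengths.  All rates and revenues are made up.
C = 10
lam = np.array([3.0, 1.0])      # arrival rates of class 1 (regular) and class 2 (premium)
mu = 1.0                        # departure (log-off) rate per connected customer
rev = np.array([1.0, 5.0])      # lump-sum revenue per accepted log-on, by class
beta = 0.1                      # continuous-time discount rate
Lam = lam.sum() + C * mu        # uniformization constant
disc = Lam / (Lam + beta)       # per-event discount factor after uniformization

V = np.zeros((C + 1, C + 1))    # V[n1, n2], defined for n1 + n2 <= C
for _ in range(2000):
    V_new = np.zeros_like(V)
    for n1, n2 in product(range(C + 1), range(C + 1)):
        if n1 + n2 > C:
            continue
        total = 0.0
        for k, (n1a, n2a) in enumerate([(n1 + 1, n2), (n1, n2 + 1)]):
            if n1 + n2 < C:
                accept = rev[k] + V[n1a, n2a]      # accept the log-on request
                total += lam[k] * max(accept, V[n1, n2])
            else:
                total += lam[k] * V[n1, n2]        # no modem free: forced rejection
        total += n1 * mu * V[max(n1 - 1, 0), n2] + n2 * mu * V[n1, max(n2 - 1, 0)]
        total += (C - n1 - n2) * mu * V[n1, n2]    # fictitious self-loop from uniformization
        V_new[n1, n2] = disc * total / Lam
    V = V_new

print("value of an empty system:", round(V[0, 0], 2))
print("accept a class-1 request with 9 modems busy (n1=9, n2=0)?",
      bool(rev[0] + V[10, 0] >= V[9, 0]))
```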

Journal ArticleDOI
TL;DR: A TD algorithm for estimating the variance of return in MDP (Markov decision process) environments and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, a typical criterion in risk-avoiding control, are presented.
Abstract: Estimating probability distributions on returns provides various sophisticated decision making schemes for control problems in Markov environments, including risk-sensitive control, efficient exploration of environments and so on. Many reinforcement learning algorithms, however, have simply relied on the expected return. This paper provides a scheme of decision making using the mean and variance of return distributions. It presents a TD algorithm for estimating the variance of return in MDP (Markov decision process) environments and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, which is a typical criterion in risk-avoiding control. Empirical results demonstrate the behavior of the algorithms and the validity of the criterion for risk-avoiding sequential decision tasks.
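The variance estimate can be obtained from a second TD recursion alongside the usual one, using the second-moment Bellman equation M(s) = E[r² + 2γrV(s′) + γ²M(s′)] and Var(s) = M(s) − V(s)². The sketch below applies this standard idea to a hypothetical Markov reward chain and is not claimed to be the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)

# TD estimation of the mean V(s) and second moment M(s) of the return for a fixed
# policy on a small hypothetical Markov reward chain; Var(s) = M(s) - V(s)^2.
n_s, gamma, alpha = 4, 0.9, 0.01
P = rng.dirichlet(np.ones(n_s), size=n_s)            # fixed-policy transition matrix
r_mean = np.array([1.0, 0.0, 2.0, -1.0])
r_std  = np.array([0.5, 1.0, 0.1, 2.0])

V = np.zeros(n_s)
M = np.zeros(n_s)
s = 0
for t in range(200_000):
    s2 = rng.choice(n_s, p=P[s])
    r = r_mean[s] + r_std[s] * rng.standard_normal()
    V[s] += alpha * (r + gamma * V[s2] - V[s])                               # usual TD(0)
    M[s] += alpha * (r ** 2 + 2 * gamma * r * V[s2] + gamma ** 2 * M[s2] - M[s])  # second moment
    s = s2

print("mean return V     :", np.round(V, 2))
print("variance of return:", np.round(M - V ** 2, 2))
```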

01 Jan 2001
TL;DR: An approximation scheme is proposed in which each vector is represented as a linear combination of basis functions, providing a compact approximation to the value function; this representation can be exploited for efficient computations in approximate value and policy iteration algorithms in the context of factored POMDPs.
Abstract: Partially Observable Markov Decision Processes (POMDPs) provide a coherent mathematical framework for planning under uncertainty when the state of the system cannot be fully observed. However, the problem of finding an exact POMDP solution is intractable. Computing such a solution requires the manipulation of a piecewise linear convex value function, which specifies a value for each possible belief state. This value function can be represented by a set of vectors, each one with dimension equal to the size of the state space. In nontrivial problems, however, these vectors are too large for such a representation to be feasible, preventing the use of exact POMDP algorithms. We propose an approximation scheme where each vector is represented as a linear combination of basis functions to provide a compact approximation to the value function. We also show that this representation can be exploited to allow for efficient computations in approximate value and policy iteration algorithms in the context of factored POMDPs, where the transition model is specified using a dynamic Bayesian network.

Journal ArticleDOI
TL;DR: A class of terminating Markov decision processes with an exponential risk-averse objective function and compact constraint sets is considered, and the existence of a real-valued optimal cost function that can be achieved by a stationary policy is established.

Proceedings Article
28 Jun 2001
TL;DR: It is proved that if an MDP possesses a symmetry, then the optimal value function and Q function are similarly symmetric and there exists a symmetric optimal policy.
Abstract: This paper examines the notion of symmetry in Markov decision processes (MDPs). We define symmetry for an MDP and show how it can be exploited for more effective learning in single agent systems as well as multiagent systems and multirobot systems. We prove that if an MDP possesses a symmetry, then the optimal value function and Q function are similarly symmetric and there exists a symmetric optimal policy. If an MDP is known to possess a symmetry, this knowledge can be applied to decrease the number of training examples needed for algorithms like Q learning and value iteration. It can also be used to directly restrict the hypothesis space.
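One simple way to exploit a known symmetry is sketched below on a hypothetical mirror-symmetric corridor task: every observed transition is also replayed through the symmetry map (s, a) → (σ(s), σ(a)), so Q-learning effectively sees twice the experience and the learned Q-values respect the symmetry. The task and all constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 1-D corridor of 7 cells with a goal at each end (reward 1), symmetric
# under the reflection s -> 6 - s with the actions left/right swapped.
n_s, n_a, gamma, alpha, eps = 7, 2, 0.9, 0.1, 0.2    # actions: 0 = left, 1 = right

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_s - 1)
    done = s2 in (0, n_s - 1)
    return s2, (1.0 if done else 0.0), done

def mirror(s, a, r, s2):
    """Image of a transition under the symmetry map."""
    return n_s - 1 - s, 1 - a, r, n_s - 1 - s2

Q = np.zeros((n_s, n_a))
for episode in range(2000):
    s, done = 3, False                               # start in the middle
    while not done:
        a = rng.integers(n_a) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Update both the real transition and its symmetric image.
        for (ss, aa, rr, ss2) in [(s, a, r, s2), mirror(s, a, r, s2)]:
            target = rr if done else rr + gamma * Q[ss2].max()
            Q[ss, aa] += alpha * (target - Q[ss, aa])
        s = s2

print(np.round(Q, 2))      # rows are symmetric under (s, a) -> (6 - s, 1 - a)
```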

01 Sep 2001
TL;DR: This work extends the model minimization framework proposed by Dean and Givan to include symmetries, and bases the framework on concepts derived from finite state automata and group theory.
Abstract: Current solution and modelling approaches to Markov Decision Processes (MDPs) scale poorly with the size of the MDP. Model minimization methods address this issue by exploiting redundancy in problem specification to reduce the size of the MDP model. Symmetries in a problem specification can give rise to special forms of redundancy that are not exploited by existing minimization methods. In this work we extend the model minimization framework proposed by Dean and Givan to include symmetries. We base our framework on concepts derived from finite state automata and group theory.


Journal ArticleDOI
TL;DR: The algorithm is based on a ‘sensitivity formula’ for the risk-sensitive cost and is shown to converge with probability one to the desired solution for finite Markov chains.

Proceedings Article
01 Jan 2001
TL;DR: A novel dialogue model based on the partially observable Markov decision process (POMDP) is proposed, which uses hidden system states and user intentions as the state set, parser results and low-level information as the observation set, and domain actions and dialogue repair actions as the action set.
Abstract: Stochastic models such as the Markov decision process (MDP) have been used to model the dialogue manager. An MDP-based system degrades quickly as uncertainty about the user's intention increases. We propose a novel dialogue model based on the partially observable Markov decision process (POMDP). We use hidden system states and user intentions as the state set, parser results and low-level information as the observation set, and domain actions and dialogue repair actions as the action set. Here the low-level information is extracted from different input modalities using Bayesian networks. Because of the limitations of exact algorithms, we focus on heuristic methods and their applicability in dialogue management.
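The belief update that such a POMDP dialogue manager maintains over hidden user intentions can be sketched as a standard Bayes filter; the intention states, parser observations, and all probabilities below are hypothetical.

```python
import numpy as np

# Belief update over hidden user intentions -- the core recursion a POMDP dialogue
# manager maintains.  States, observations, and probabilities are made up.
states = ["wants_weather", "wants_traffic", "wants_nothing"]
obs    = ["parse_weather", "parse_traffic", "parse_unclear"]

# P(s' | s) for a single "ask clarification" action: intentions mostly persist.
T = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.1, 0.1, 0.8]])
# P(o | s'): the parser is noisy, hence the partial observability.
O = np.array([[0.7, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.15, 0.15, 0.7]])

def belief_update(b, o_idx):
    b_pred = T.T @ b                  # predict: push the belief through the dynamics
    b_new = O[:, o_idx] * b_pred      # correct: weight by the observation likelihood
    return b_new / b_new.sum()

b = np.full(3, 1 / 3)                 # start uniform over intentions
for o in ["parse_unclear", "parse_weather", "parse_weather"]:
    b = belief_update(b, obs.index(o))
    print(o, "->", dict(zip(states, np.round(b, 2))))
```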