
Showing papers on "Markov decision process published in 2011"


BookDOI
04 Aug 2011
TL;DR: This book discusses the challenges of dynamic programming and the three curses of dimensionality, and covers approximate dynamic programming (ADP) topics ranging from experimental comparisons of stepsize formulas to direct ADP for online applications.
Abstract: Preface. Acknowledgments. 1. The challenges of dynamic programming. 1.1 A dynamic programming example: a shortest path problem. 1.2 The three curses of dimensionality. 1.3 Some real applications. 1.4 Problem classes. 1.5 The many dialects of dynamic programming. 1.6 What is new in this book? 1.7 Bibliographic notes. 2. Some illustrative models. 2.1 Deterministic problems. 2.2 Stochastic problems. 2.3 Information acquisition problems. 2.4 A simple modeling framework for dynamic programs. 2.5 Bibliographic notes. Problems. 3. Introduction to Markov decision processes. 3.1 The optimality equations. 3.2 Finite horizon problems. 3.3 Infinite horizon problems. 3.4 Value iteration. 3.5 Policy iteration. 3.6 Hybrid value-policy iteration. 3.7 The linear programming method for dynamic programs. 3.8 Monotone policies. 3.9 Why does it work? 3.10 Bibliographic notes. Problems. 4. Introduction to approximate dynamic programming. 4.1 The three curses of dimensionality (revisited). 4.2 The basic idea. 4.3 Sampling random variables. 4.4 ADP using the post-decision state variable. 4.5 Low-dimensional representations of value functions. 4.6 So just what is approximate dynamic programming? 4.7 Experimental issues. 4.8 Dynamic programming with missing or incomplete models. 4.9 Relationship to reinforcement learning. 4.10 But does it work? 4.11 Bibliographic notes. Problems. 5. Modeling dynamic programs. 5.1 Notational style. 5.2 Modeling time. 5.3 Modeling resources. 5.4 The states of our system. 5.5 Modeling decisions. 5.6 The exogenous information process. 5.7 The transition function. 5.8 The contribution function. 5.9 The objective function. 5.10 A measure-theoretic view of information. 5.11 Bibliographic notes. Problems. 6. Stochastic approximation methods. 6.1 A stochastic gradient algorithm. 6.2 Some stepsize recipes. 6.3 Stochastic stepsizes. 6.4 Computing bias and variance. 6.5 Optimal stepsizes. 6.6 Some experimental comparisons of stepsize formulas. 6.7 Convergence. 6.8 Why does it work? 6.9 Bibliographic notes. Problems. 7. Approximating value functions. 7.1 Approximation using aggregation. 7.2 Approximation methods using regression models. 7.3 Recursive methods for regression models. 7.4 Neural networks. 7.5 Batch processes. 7.6 Why does it work? 7.7 Bibliographic notes. Problems. 8. ADP for finite horizon problems. 8.1 Strategies for finite horizon problems. 8.2 Q-learning. 8.3 Temporal difference learning. 8.4 Policy iteration. 8.5 Monte Carlo value and policy iteration. 8.6 The actor-critic paradigm. 8.7 Bias in value function estimation. 8.8 State sampling strategies. 8.9 Starting and stopping. 8.10 A taxonomy of approximate dynamic programming strategies. 8.11 Why does it work? 8.12 Bibliographic notes. Problems. 9. Infinite horizon problems. 9.1 From finite to infinite horizon. 9.2 Algorithmic strategies. 9.3 Stepsizes for infinite horizon problems. 9.4 Error measures. 9.5 Direct ADP for online applications. 9.6 Finite horizon models for steady state applications. 9.7 Why does it work? 9.8 Bibliographic notes. Problems. 10. Exploration vs. exploitation. 10.1 A learning exercise: the nomadic trucker. 10.2 Learning strategies. 10.3 A simple information acquisition problem. 10.4 Gittins indices and the information acquisition problem. 10.5 Variations. 10.6 The knowledge gradient algorithm. 10.7 Information acquisition in dynamic programming. 10.8 Bibliographic notes. Problems. 11. Value function approximations for special functions. 11.1 Value functions versus gradients.
11.2 Linear approximations. 11.3 Piecewise linear approximations. 11.4 The SHAPE algorithm. 11.5 Regression methods. 11.6 Cutting planes. 11.7 Why does it work? 11.8 Bibliographic notes. Problems. 12. Dynamic resource allocation. 12.1 An asset acquisition problem. 12.2 The blood management problem. 12.3 A portfolio optimization problem. 12.4 A general resource allocation problem. 12.5 A fleet management problem. 12.6 A driver management problem. 12.7 Bibliographic references. Problems. 13. Implementation challenges. 13.1 Will ADP work for your problem? 13.2 Designing an ADP algorithm for complex problems. 13.3 Debugging an ADP algorithm. 13.4 Convergence issues. 13.5 Modeling your problem. 13.6 Online vs. offline models. 13.7 If it works, patent it!
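
As a concrete illustration of the exact methods surveyed in Chapter 3 (value iteration, policy iteration), here is a minimal value-iteration sketch for a small finite MDP. It is not taken from the book; the matrices P and R and the two-state example are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    # P[a]: |S| x |S| transition matrix for action a; R[a]: |S|-vector of
    # expected one-step rewards; gamma: discount factor in (0, 1).
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # values and a greedy policy
        V = V_new

# Two states, two actions, made-up numbers.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
R = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
V, policy = value_iteration(P, R)
```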

2,300 citations


Journal ArticleDOI
TL;DR: GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
Abstract: Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter $\beta\in [0,1)$ (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter $\beta$ is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
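
A rough sketch of the kind of estimator the abstract describes, paraphrased from the description above rather than taken from the paper; `env_step`, `sample_action`, and `grad_log_policy` are hypothetical callables standing in for the POMDP and the parameterized stochastic policy.

```python
import numpy as np

def gpomdp_gradient(theta, obs, env_step, sample_action, grad_log_policy,
                    beta=0.9, horizon=10_000):
    # Single-trajectory, biased gradient estimate of the average reward.
    z = np.zeros_like(theta)       # eligibility trace (same size as theta)
    delta = np.zeros_like(theta)   # running gradient estimate
    for t in range(horizon):
        action = sample_action(theta, obs)          # u_t ~ policy(. | theta, y_t)
        z = beta * z + grad_log_policy(theta, obs, action)
        obs, reward = env_step(action)              # observe y_{t+1}, r_{t+1}
        delta += (reward * z - delta) / (t + 1)     # running average of r * z
    return delta
```

Larger values of beta reduce the bias at the cost of higher variance, which is the bias-variance trade-off mentioned in the abstract.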

645 citations


Book
26 Sep 2011
TL;DR: The theory of Markov processes has its origins in the studies by A. A. Markov (1856-1922) of sequences of experiments "connected in a chain" and in the attempts to describe mathematically the physical phenomenon known as Brownian motion.
Abstract: At first there was the Markov property. The theory of stochastic processes, which can be considered as an extension of probability theory, allows the modeling of the evolution of systems through time. It cannot be properly understood just as pure mathematics, separated from the body of experience and examples that have brought it to life. The theory of stochastic processes entered a period of intensive development, which is not finished yet, when the idea of the Markov property was brought in. Not even a serious study of the renewal processes is possible without using the strong tool of Markov processes. The modern theory of Markov processes has its origins in the studies by A. A. Markov (1856-1922) of sequences of experiments "connected in a chain" and in the attempts to describe mathematically the physical phenomenon known as Brownian motion. Later, many generalizations (in fact all kinds of weakenings of the Markov property) of Markov type stochastic processes were proposed. Some of them have led to new classes of stochastic processes and useful applications. Let us mention some of them: systems with complete connections [90, 91, 45, 86]; K-dependent Markov processes [44]; semi-Markov processes, and so forth. The semi-Markov processes generalize the renewal processes as well as the Markov jump processes and have numerous applications, especially in reliability.

486 citations


Journal ArticleDOI
01 Feb 2011
TL;DR: It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control.
Abstract: Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. In control system theory, these types of methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller that has equivalent performance with the optimal state variable feedback gain.

406 citations


Journal ArticleDOI
TL;DR: This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space and expresses the agents' autonomy by postulating that their models are not directly manipulable or observable by other agents.
Abstract: This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space. Agents maintain beliefs over physical states of the environment and over models of other agents, and they use Bayesian updates to maintain their beliefs over time. The solutions map belief states to actions. Models of other agents may include their belief states and are related to agent types considered in games of incomplete information. We express the agents' autonomy by postulating that their models are not directly manipulable or observable by other agents. We show that important properties of POMDPs, such as convergence of value iteration, the rate of convergence, and piece-wise linearity and convexity of the value functions carry over to our framework. Our approach complements a more traditional approach to interactive settings which uses Nash equilibria as a solution paradigm. We seek to avoid some of the drawbacks of equilibria which may be non-unique and do not capture off-equilibrium behaviors. We do so at the cost of having to represent, process and continuously revise models of other agents. Since the agents' beliefs may be arbitrarily nested, the optimal solutions to decision making problems are only asymptotically computable. However, approximate belief updates and approximately optimal plans are computable. We illustrate our framework using a simple application domain, and we show examples of belief updates and value functions.
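
The nested-belief machinery described above builds on the ordinary POMDP belief update over physical states. A minimal sketch of that single-agent building block follows (the array encodings are assumptions of this sketch, and beliefs over other agents' models are not shown):

```python
import numpy as np

def belief_update(b, action, observation, T, O):
    # b: current belief, an |S|-vector summing to 1.
    # T[a]: |S| x |S| transition matrix; O[a][s', o] = P(o | s', a).
    predicted = T[action].T @ b                 # sum_s T(s' | s, a) b(s)
    updated = O[action][:, observation] * predicted
    return updated / updated.sum()              # Bayes rule, then normalize
```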

369 citations


Book
08 Jun 2011
TL;DR: Theory of Finite Horizon Markov Decision Processes and Financial Markets (Part I) and Theory of Optimal Stopping Problems (Part IV).
Abstract: Preface.- 1.Introduction and First Examples.- Part I Finite Horizon Optimization Problems and Financial Markets.- 2.Theory of Finite Horizon Markov Decision Processes.- 3.The Financial Markets.- 4.Financial Optimization Problems.- Part II Partially Observable Markov Decision Problems.- 5.Partially Observable Markov Decision Processes.- 6.Partially Observable Markov Decision Problems in Finance.- Part III Infinite Horizon Optimization Problems.- 7.Theory of Infinite Horizon Markov Decision Processes.- 8.Piecewise Deterministic Markov Decision Processes.- 9.Optimization Problems in Finance and Insurance.- Part IV Stopping Problems.- 10.Theory of Optimal Stopping Problems.- 11.Stopping Problems in Finance.- Part V Appendix.- A.Tools from Analysis.- B.Tools from Probability.- C.Tools from Mathematical Finance.- References.- Index.

346 citations


Proceedings Article
12 Dec 2011
TL;DR: A probabilistic algorithm that allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert's policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.

336 citations


Book ChapterDOI
13 Jun 2011
TL;DR: Methods to analyse Markov decision processes, which model both stochastic and nondeterministic behaviour, and a wide range of their properties, including specifications in the temporal logics PCTL and LTL, probabilistic safety properties and cost- or reward-based measures are described.
Abstract: This tutorial provides an introduction to probabilistic model checking, a technique for automatically verifying quantitative properties of probabilistic systems. We focus on Markov decision processes (MDPs), which model both stochastic and nondeterministic behaviour. We describe methods to analyse a wide range of their properties, including specifications in the temporal logics PCTL and LTL, probabilistic safety properties and cost- or reward-based measures. We also discuss multi-objective probabilistic model checking, used to analyse trade-offs between several different quantitative properties. Applications of the techniques in this tutorial include performance and dependability analysis of networked systems, communication protocols and randomised distributed algorithms. Since such systems often comprise several components operating in parallel, we also cover techniques for compositional modelling and verification of multi-component probabilistic systems. Finally, we describe three large case studies which illustrate practical applications of the various methods discussed in the tutorial.
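
One core computation behind PCTL properties such as Pmax=? [F goal] is a fixed-point iteration over the MDP. The sketch below is a generic illustration of that step (the dictionary-based MDP encoding is an assumption of this sketch, not the input format of any particular tool):

```python
def max_reach_prob(states, actions, P, goal, max_iters=100_000, tol=1e-10):
    # P[(s, a)] is a list of (next_state, probability) pairs; actions(s)
    # returns the actions enabled in s (assumed non-empty for non-goal states).
    x = {s: (1.0 if s in goal else 0.0) for s in states}
    for _ in range(max_iters):
        x_new = {}
        for s in states:
            if s in goal:
                x_new[s] = 1.0
            else:
                x_new[s] = max(sum(p * x[t] for t, p in P[(s, a)])
                               for a in actions(s))
        if max(abs(x_new[s] - x[s]) for s in states) < tol:
            return x_new
        x = x_new
    return x
```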

333 citations


Journal ArticleDOI
TL;DR: The Markov Reward Model Checker is a software tool for verifying properties over probabilistic models that supports PCTL and CSL model checking, and their reward extensions.

319 citations


Proceedings Article
14 Jun 2011
TL;DR: This paper proposes a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent.
Abstract: We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near-)optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using learned MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.
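
A hedged sketch of the kind of stochastic-gradient step the abstract describes, under a simplified reading of the method (it is not the authors' implementation; `feat` is a hypothetical feature map over trajectories, and the baseline likelihood terms are dropped for brevity):

```python
import numpy as np

def relent_irl_step(theta, expert_trajs, baseline_trajs, feat, lr=0.01):
    # feat(traj) -> d-dimensional feature vector of a trajectory.
    f_expert = np.mean([feat(t) for t in expert_trajs], axis=0)
    f_base = np.array([feat(t) for t in baseline_trajs])
    # Importance weights: baseline trajectories that score well under the
    # current reward theta^T f(traj) receive more weight.
    w = np.exp(f_base @ theta)
    w /= w.sum()
    grad = f_expert - w @ f_base   # expert features minus weighted baseline features
    return theta + lr * grad
```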

319 citations


Journal ArticleDOI
TL;DR: Numerical results suggest that incorporating the statistical knowledge into the scheduling policies can result in significant savings, especially for short tasks, and it is demonstrated with real price data from Commonwealth Edison that scheduling with mismatched modeling and online parameter estimation can still provide significant economic advantages to consumers.
Abstract: The problem of causally scheduling power consumption to minimize the expected cost at the consumer side is considered. The price of electricity is assumed to be time-varying. The scheduler has access to past and current prices, but only statistical knowledge about future prices, which it uses to make an optimal decision in each time period. The scheduling problem is naturally cast as a Markov decision process. Algorithms to find decision thresholds for both noninterruptible and interruptible loads under a deadline constraint are then developed. Numerical results suggest that incorporating the statistical knowledge into the scheduling policies can result in significant savings, especially for short tasks. It is demonstrated with real price data from Commonwealth Edison that scheduling with mismatched modeling and online parameter estimation can still provide significant economic advantages to consumers.
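
To make the threshold idea concrete, here is a heavily simplified toy version in the spirit of the abstract (my own simplification, not the paper's algorithm): a single unit-length, noninterruptible task must start by a hard deadline, prices take finitely many levels and follow a Markov chain, and the task is started as soon as the current price drops below the expected cost of waiting.

```python
import numpy as np

def waiting_thresholds(prices, M, T):
    # prices: cost of starting at each price level; M: price transition matrix;
    # T: number of slots before the deadline. Start at slot t in level i iff
    # prices[i] <= thresholds[t][i].
    prices = np.asarray(prices, dtype=float)
    V = prices.copy()                   # at the deadline the task must start
    thresholds = []
    for _ in range(T - 1):
        cont = M @ V                    # expected cost-to-go if we wait one slot
        thresholds.append(cont)
        V = np.minimum(prices, cont)    # act now or wait, whichever is cheaper
    return list(reversed(thresholds)), V
```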

Journal ArticleDOI
Yinyu Ye
TL;DR: It is proved that the classic policy-iteration method and the original simplex method with the most-negative-reduced-cost pivoting rule of Dantzig are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate.
Abstract: We prove that the classic policy-iteration method [Howard, R. A. 1960. Dynamic Programming and Markov Processes. MIT, Cambridge] and the original simplex method with the most-negative-reduced-cost pivoting rule of Dantzig are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration and simplex methods is superior to that of the only known strongly polynomial-time interior-point algorithm [Ye, Y. 2005. A new complexity result on solving the Markov decision problem. Math. Oper. Res. 30(3) 733-749] for solving this problem. The result is surprising because the simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming problem [Klee, V., G. J. Minty. 1972. How good is the simplex method? Technical report. O. Shisha, ed. Inequalities III. Academic Press, New York], the simplex method with the smallest index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates [Melekopoglou, M., A. Condon. 1994. On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS J. Comput. 6(2) 188-192], and the policy-iteration method was recently shown to be exponential for solving undiscounted MDPs under the average cost criterion. We also extend the result to solving MDPs with transient substochastic transition matrices whose spectral radii are uniformly below one.
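
For reference, the classic (Howard) policy-iteration method that the paper analyses, written in its standard textbook form rather than as code from the paper:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    # P[a]: |S| x |S| transition matrix; R[a]: |S|-vector; 0 < gamma < 1.
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy improvement (switch every improvable state).
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```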

Book ChapterDOI
17 Jan 2011
TL;DR: It is conjectured that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
Abstract: Upper Confidence Trees are a very efficient tool for solving Markov Decision Processes; originating in difficult games like the game of Go, it is in particular surprisingly efficient in high dimensional problems. It is known that it can be adapted to continuous domains in some cases (in particular continuous action spaces). We here present an extension of Upper Confidence Trees to continuous stochastic problems. We (i) show a deceptive problem on which the classical Upper Confidence Tree approach does not work, even with arbitrarily large computational power and with progressive widening (ii) propose an improvement, termed double-progressive widening, which takes care of the compromise between variance (we want infinitely many simulations for each action/state) and bias (we want sufficiently many nodes to avoid a bias by the first nodes) and which extends the classical progressive widening (iii) discuss its consistency and show experimentally that it performs well on the deceptive problem and on experimental benchmarks. We guess that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
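
A hedged sketch of the widening rule the abstract refers to (the constants and node structure below are illustrative assumptions): a node visited n times may have at most ceil(C * n**alpha) children, and the same test gates both the set of tried actions and the set of sampled outcomes, which is what makes the widening "double".

```python
import math
import random

def may_expand(num_children, num_visits, C=1.0, alpha=0.5):
    # Progressive widening test applied at a tree node.
    return num_children < math.ceil(C * max(num_visits, 1) ** alpha)

def choose_outcome(node, simulate_transition, C=1.0, alpha=0.5):
    # node.outcomes: dict of sampled next states -> child nodes;
    # node.visits: visit count (hypothetical attributes of this sketch).
    if may_expand(len(node.outcomes), node.visits, C, alpha):
        return simulate_transition()            # draw a brand-new next state
    return random.choice(list(node.outcomes))   # reuse an existing child
```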

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work addresses shape grammar parsing for facade segmentation using Reinforcement Learning, formulating the problem as a Hierarchical Markov Decision Process with a recursive binary split grammar so as to efficiently find the optimal parse of a given facade in terms of the authors' shape grammar.
Abstract: We address shape grammar parsing for facade segmentation using Reinforcement Learning (RL). Shape parsing entails simultaneously optimizing the geometry and the topology (e.g. number of floors) of the facade, so as to optimize the fit of the predicted shape with the responses of pixel-level 'terminal detectors'. We formulate this problem in terms of a Hierarchical Markov Decision Process, by employing a recursive binary split grammar. This allows us to use RL to efficiently find the optimal parse of a given facade in terms of our shape grammar. Building on the RL paradigm, we exploit state aggregation to speedup computation, and introduce image-driven exploration in RL to accelerate convergence. We achieve state-of-the-art results on facade parsing, with a significant speed-up compared to existing methods, and substantial robustness to initial conditions. We demonstrate that the method can also be applied to interactive segmentation, and to a broad variety of architectural styles.

Journal ArticleDOI
TL;DR: The implementation of Rényi divergence via the sequential Monte Carlo method is presented and the performance of the proposed reward function is demonstrated by a numerical example, where a moving range-only sensor is controlled to estimate the number and the states of several moving objects using the PHD filter.
Abstract: The context is sensor control for multi-object Bayes filtering in the framework of partially observed Markov decision processes (POMDPs). The current information state is represented by the multi-object probability density function (pdf), while the reward function associated with each sensor control (action) is the information gain measured by the alpha or Rényi divergence. Assuming that both the predicted and updated state can be represented by independent identically distributed (IID) cluster random finite sets (RFSs) or, as a special case, the Poisson RFSs, this work derives the analytic expressions of the corresponding Rényi divergence based information gains. The implementation of the Rényi divergence via the sequential Monte Carlo method is presented. The performance of the proposed reward function is demonstrated by a numerical example, where a moving range-only sensor is controlled to estimate the number and the states of several moving objects using the PHD filter.
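
For reference, the order-alpha Rényi divergence between an updated density p and a predicted density q, in its standard single-density form (the paper derives the analogous expressions for IID-cluster and Poisson RFS multi-object densities):

```latex
D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha - 1}\,
  \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, \mathrm{d}x,
  \qquad \alpha > 0,\ \alpha \neq 1,
```

with the Kullback-Leibler divergence recovered in the limit alpha -> 1; the reward for a candidate sensor action is the divergence between the updated and the predicted information state.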

Journal Article
TL;DR: This paper introduces the Bayes-Adaptive Partially Observable Markov Decision Processes, a new framework that can be used to simultaneously learn a model of the POMDP domain through interaction with the environment, and track the state of the system under partial observability.
Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent's return improve as a function of experience.

Proceedings ArticleDOI
10 Apr 2011
TL;DR: This study considers smart grids with two classes of energy users - traditional energy users and opportunistic energy users (e.g., smart meters or smart appliances), and investigates pricing and dispatch at two timescales, via day-ahead scheduling and real-time scheduling.
Abstract: Integrating volatile renewable energy resources into the bulk power grid is challenging, due to the reliability requirement that the load and generation in the system remain balanced all the time. In this study, we tackle this challenge for smart grid with integrated wind generation, by leveraging multi-timescale dispatch and scheduling. Specifically, we consider smart grids with two classes of energy users - traditional energy users and opportunistic energy users (e.g., smart meters or smart appliances), and investigate pricing and dispatch at two timescales, via day-ahead scheduling and real-time scheduling. In day-ahead scheduling, with the statistical information on wind generation and energy demands, we characterize the optimal procurement of the energy supply and the day-ahead retail price for the traditional energy users; in real-time scheduling, with the realization of wind generation and the load of traditional energy users, we optimize real-time prices to manage the opportunistic energy users so as to achieve system-wide reliability. More specifically, when the opportunistic users are non-persistent, we obtain closed-form solutions to the two-level scheduling problem. For the persistent case, we treat the scheduling problem as a multi-timescale Markov decision process. We show that it can be recast, explicitly, as a classic Markov decision process with continuous state and action spaces, the solution to which can be found via standard techniques.

Journal ArticleDOI
TL;DR: A technique based on a combination of mechanistic population-scale models, Markov decision process theory and game theory that facilitates the evaluation of game theoretic decisions at both individual and community scales is presented.
Abstract: Reconciling the interests of individuals with the interests of communities is a major challenge in designing and implementing health policies. In this paper, we present a technique based on a combination of mechanistic population-scale models, Markov decision process theory and game theory that facilitates the evaluation of game theoretic decisions at both individual and community scales. To illustrate our technique, we provide solutions to several variants of the simple vaccination game including imperfect vaccine efficacy and differential waning of natural and vaccine immunity. In addition, we show how path-integral approaches can be applied to the study of models in which strategies are fixed waiting times rather than exponential random variables. These methods can be applied to a wide variety of decision problems with population-dynamic feedbacks.

Proceedings ArticleDOI
15 Dec 2011
TL;DR: The proposed V2G control algorithm is evaluated using both the simulated price and the actual price from PJM in 2010 to show that it can work effectively in the real electricity market and it is able to increase the profit significantly compared with the conventional EV charging scheme.
Abstract: The vehicle-to-grid (V2G) system enables energy flow from the electric vehicles (EVs) to the grid. The distributed power of the EVs can either be sold to the grid or be used to provide frequency regulation service when V2G is implemented. A V2G control algorithm is necessary to decide whether the EV should be charged, discharged, or provide frequency regulation service in each hour. The V2G control problem is further complicated by the price uncertainty, where the electricity price is determined dynamically every hour. In this paper, we study the real-time V2G control problem under price uncertainty. We model the electricity price as a Markov chain with unknown transition probabilities and formulate the problem as a Markov decision process (MDP). This model features implicit estimation of the impact of future electricity prices and current control operation on long-term profits. The Q-learning algorithm is then used to adapt the control operation to the hourly available price in order to maximize the profit for the EV owner during the whole parking time. We evaluate our proposed V2G control algorithm using both the simulated price and the actual price from PJM in 2010. Simulation results show that our proposed algorithm can work effectively in the real electricity market and it is able to increase the profit significantly compared with the conventional EV charging scheme.
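
A hedged sketch of the kind of tabular Q-learning update the abstract describes; the state encoding, action names, exploration scheme, and reward definition below are assumptions of this sketch, not the authors' exact formulation.

```python
import random
from collections import defaultdict

ACTIONS = ["charge", "discharge", "regulate"]
Q = defaultdict(float)                 # Q[(state, action)] -> estimated value

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Standard one-step Q-learning backup.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def epsilon_greedy(state, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# Each hour: state might encode (hour, price bin, battery level), and reward
# would be that hour's revenue minus cost for the chosen operation.
```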

Journal ArticleDOI
TL;DR: This work investigates the problem of minimizing the Average-Value-at-Risk (AVaRτ) of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process and shows that this problem can be reduced to an ordinary MDP with extended state space and given conditions under which an optimal policy exists.
Abstract: We investigate the problem of minimizing the Average-Value-at-Risk (AVaRτ) of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process (MDP). We show that this problem can be reduced to an ordinary MDP with extended state space and give conditions under which an optimal policy exists. We also give a time-consistent interpretation of the AVaRτ. At the end we consider a numerical example which is a simple repeated casino game. It is used to discuss the influence of the risk aversion parameter τ of the AVaRτ-criterion.
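
For orientation, AVaR admits the well-known Rockafellar-Uryasev representation, stated here in one common convention for a cost X at risk level τ (the paper's exact definition and the details of the state-space extension are in the paper itself):

```latex
\mathrm{AVaR}_{\tau}(X) \;=\; \inf_{z \in \mathbb{R}}
  \Big\{ z + \tfrac{1}{1-\tau}\, \mathbb{E}\big[(X - z)^{+}\big] \Big\}.
```

Roughly speaking, the outer minimization over the auxiliary variable z, combined with a state component that tracks the cost accumulated so far, is what allows the problem to be recast as an ordinary MDP on an extended state space.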

Journal ArticleDOI
TL;DR: In order to meet performance and robustness objectives, a new class of policies, called the Log rule, is proposed; these policies are radial sum-rate monotone (RSM) and provably throughput-optimal, and it can be shown that an RSM policy minimizes the asymptotic probability of sum-queue overflow.
Abstract: This paper considers the design of multiuser opportunistic packet schedulers for users sharing a time-varying wireless channel from performance and robustness points of view. For a simplified model falling in the classical Markov decision process framework, we numerically compute and characterize mean-delay-optimal scheduling policies. The computed policies exhibit radial sum-rate monotonicity: As users' queues grow linearly, the scheduler allocates service in a manner that deemphasizes the balancing of unequal queues in favor of maximizing current system throughput (being opportunistic). This is in sharp contrast to previously proposed throughput-optimal policies, e.g., Exp rule and MaxWeight (with any positive exponent of queue length). In order to meet performance and robustness objectives, we propose a new class of policies, called the Log rule, that are radial sum-rate monotone (RSM) and provably throughput-optimal. In fact, it can also be shown that an RSM policy minimizes the asymptotic probability of sum-queue overflow. We use extensive simulations to explore various possible design objectives for opportunistic schedulers. When users see heterogenous channels, we find that emphasizing queue balancing, e.g., Exp rule and MaxWeight, may excessively compromise the overall delay. Finally, we discuss approaches to implement the proposed policies for scheduling and resource allocation in OFDMA-based multichannel systems.

Journal ArticleDOI
TL;DR: This paper develops an upper bound on the performance of any arbitrary scheduler, and formulates and solves the optimal scheduling problem as a Markov Decision Process (MDP), assuming that complete state information about the relays is available at the source nodes.
Abstract: This paper considers wireless sensor networks (WSNs) with energy harvesting and cooperative communications and develops energy efficient scheduling strategies for such networks. In order to maximize the long-term utility of the network, the scheduling problem considered in this paper addresses the following question: given an estimate of the current network state, should a source transmit its data directly to the destination or use a relay to help with the transmission? We first develop an upper bound on the performance of any arbitrary scheduler. Next, the optimal scheduling problem is formulated and solved as a Markov Decision Process (MDP), assuming that complete state information about the relays is available at the source nodes. We then relax the assumption of the availability of full state information, and formulate the scheduling problem as a Partially Observable Markov Decision Process (POMDP) and show that it can be decomposed into an equivalent MDP problem. Simulation results are used to show the performance of the schedulers.

Journal ArticleDOI
TL;DR: A rigorous and unified framework for simultaneously utilizing both physical-layer and system-level techniques to minimize energy consumption, under delay constraints, in the presence of stochastic and unknown traffic and channel conditions is proposed.
Abstract: We consider the problem of energy-efficient point-to-point transmission of delay-sensitive data (e.g., multimedia data) over a fading channel. We propose a rigorous and unified framework for simultaneously utilizing both physical-layer and system-level techniques to minimize energy consumption, under delay constraints, in the presence of stochastic and unknown traffic and channel conditions. We formulate the problem as a Markov decision process and solve it online using reinforcement learning. The advantages of the proposed online method are that i) it does not require a priori knowledge of the traffic arrival and channel statistics to determine the jointly optimal physical-layer and system-level power management strategies; ii) it exploits partial information about the system so that less information needs to be learned than when using conventional reinforcement learning algorithms; and iii) it obviates the need for action exploration, which severely limits the adaptation speed and run-time performance of conventional reinforcement learning algorithms.

Proceedings Article
01 Jan 2011
TL;DR: Ye (2011) showed that both the simplex method with Dantzig's pivoting rule and Howard's policy iteration algorithm solve discounted Markov decision processes (MDPs) with a constant discount factor in strongly polynomial time.
Abstract: Ye [2011] showed recently that the simplex method with Dantzig's pivoting rule, as well as Howard's policy iteration algorithm, solve discounted Markov decision processes (MDPs), with a constant discount factor, in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most $O\left(\frac{mn}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations, where $n$ is the number of states, $m$ is the total number of actions in the MDP, and $0 < \gamma < 1$ is the discount factor.

Journal ArticleDOI
TL;DR: This work provides the first distance-estimation scheme for metrics based on bisimulation for continuous probabilistic transition systems and shows that the optimal value function associated with a discounted infinite-horizon planning task is continuous with respect to metric distances.
Abstract: In recent years, various metrics have been developed for measuring the behavioral similarity of states in probabilistic transition systems [J. Desharnais et al., Proceedings of CONCUR'99, Springer-Verlag, London, 1999, pp. 258-273; F. van Breugel and J. Worrell, Proceedings of ICALP'01, Springer-Verlag, London, 2001, pp. 421-432]. In the context of finite Markov decision processes (MDPs), we have built on these metrics to provide a robust quantitative analogue of stochastic bisimulation [N. Ferns, P. Panangaden, and D. Precup, Proceedings of UAI-04, AUAI Press, Arlington, VA, 2004, pp. 162-169] and an efficient algorithm for its calculation [N. Ferns, P. Panangaden, and D. Precup, Proceedings of UAI-06, AUAI Press, Arlington, VA, 2006, pp. 174-181]. In this paper, we seek to properly extend these bisimulation metrics to MDPs with continuous state spaces. In particular, we provide the first distance-estimation scheme for metrics based on bisimulation for continuous probabilistic transition systems. Our work, based on statistical sampling and infinite dimensional linear programming, is a crucial first step in formally guiding real-world planning, where tasks are usually continuous and highly stochastic in nature, e.g., robot navigation, and often a substitution with a parametric model or crude finite approximation must be made. We show that the optimal value function associated with a discounted infinite-horizon planning task is continuous with respect to metric distances. Thus, our metrics allow one to reason about the quality of solution obtained by replacing one model with another. Alternatively, they may potentially be used directly for state aggregation.

Book ChapterDOI
18 Apr 2011
TL;DR: This paper studies the synthesis problem for PCTL in PMDPs, and synthesises the parameter valuations under which F is true, using existing decision procedures to check whether F holds on each of the Markov processes represented by the hyper-rectangle.
Abstract: In parametric Markov decision processes (PMDPs), transition probabilities are not fixed, but are given as functions over a set of parameters. A PMDP denotes a family of concrete MDPs. This paper studies the synthesis problem for PCTL in PMDPs: Given a specification F in PCTL, we synthesise the parameter valuations under which F is true. First, we divide the possible parameter space into hyper-rectangles. We use existing decision procedures to check whether F holds on each of the Markov processes represented by the hyper-rectangle. As it is normally impossible to cover the whole parameter space by hyper-rectangles, we allow a limited area to remain undecided. We also consider an extension of PCTL with reachability rewards. To demonstrate the applicability of the approach, we apply our technique on a case study, using a preliminary implementation.

Book ChapterDOI
28 Jun 2011
TL;DR: This work frames the problem of optimally selecting teaching actions using a decision-theoretic approach and shows how to formulate teaching as a partially-observable Markov decision process (POMDP) planning problem.
Abstract: Both human and automated tutors must infer what a student knows and plan future actions to maximize learning. Though substantial research has been done on tracking and modeling student learning, there has been significantly less attention on planning teaching actions and how the assumed student model impacts the resulting plans. We frame the problem of optimally selecting teaching actions using a decision-theoretic approach and show how to formulate teaching as a partially-observable Markov decision process (POMDP) planning problem. We consider three models of student learning and present approximate methods for finding optimal teaching actions given the large state and action spaces that arise in teaching. An experimental evaluation of the resulting policies on a simple concept-learning task shows that framing teacher action planning as a POMDP can accelerate learning relative to baseline performance.

Journal Article
TL;DR: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert; this paper presents IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP).
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings.

Proceedings Article
11 Jun 2011
TL;DR: A new algorithm is presented that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs) and addresses planning in continuous-action MDPs.
Abstract: In this paper, we present a new algorithm that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs). Our algorithm, Hierarchical Optimistic Optimization applied to Trees (HOOT) addresses planning in continuous-action MDPs. Empirical results are given that show that the performance of our algorithm meets or exceeds that of a similar discrete action planner by eliminating the problem of manual discretization of the action space.

Journal ArticleDOI
TL;DR: It is shown in this paper that, when the collision constraints are tight, the optimal access strategy can be implemented by a simple memoryless access policy with periodic channel sensing; extensions to multiple secondary users are also presented.
Abstract: The problem of cognitive access of channels of primary users by a secondary user is considered. The transmissions of primary users are modeled as independent continuous-time Markovian on-off processes. A secondary cognitive user employs a slotted transmission format, and it senses one of the possible channels before transmission. The objective of the cognitive user is to maximize its throughput subject to collision constraints imposed by the primary users. The optimal access strategy is in general a solution of a constrained partially observable Markov decision process, which involves a constrained optimization in an infinite dimensional functional space. It is shown in this paper that, when the collision constraints are tight, the optimal access strategy can be implemented by a simple memoryless access policy with periodic channel sensing. Analytical expressions are given for the thresholds on collision probabilities for which memoryless access performs optimally. Extensions to multiple secondary users are also presented. Numerical and theoretical results are presented to validate and extend the analysis for different practical scenarios.