
Showing papers on "Markov decision process published in 2011"


BookDOI
04 Aug 2011
TL;DR: This book discusses the challenges of dynamic programming and the three curses of dimensionality, and covers approximate dynamic programming (ADP) topics ranging from experimental comparisons of stepsize formulas to direct ADP for online applications.
Abstract: Preface. Acknowledgments. 1. The challenges of dynamic programming. 1.1 A dynamic programming example: a shortest path problem. 1.2 The three curses of dimensionality. 1.3 Some real applications. 1.4 Problem classes. 1.5 The many dialects of dynamic programming. 1.6 What is new in this book? 1.7 Bibliographic notes. 2. Some illustrative models. 2.1 Deterministic problems. 2.2 Stochastic problems. 2.3 Information acquisition problems. 2.4 A simple modeling framework for dynamic programs. 2.5 Bibliographic notes. Problems. 3. Introduction to Markov decision processes. 3.1 The optimality equations. 3.2 Finite horizon problems. 3.3 Infinite horizon problems. 3.4 Value iteration. 3.5 Policy iteration. 3.6 Hybrid value-policy iteration. 3.7 The linear programming method for dynamic programs. 3.8 Monotone policies. 3.9 Why does it work? 3.10 Bibliographic notes. Problems. 4. Introduction to approximate dynamic programming. 4.1 The three curses of dimensionality (revisited). 4.2 The basic idea. 4.3 Sampling random variables. 4.4 ADP using the post-decision state variable. 4.5 Low-dimensional representations of value functions. 4.6 So just what is approximate dynamic programming? 4.7 Experimental issues. 4.8 Dynamic programming with missing or incomplete models. 4.9 Relationship to reinforcement learning. 4.10 But does it work? 4.11 Bibliographic notes. Problems. 5. Modeling dynamic programs. 5.1 Notational style. 5.2 Modeling time. 5.3 Modeling resources. 5.4 The states of our system. 5.5 Modeling decisions. 5.6 The exogenous information process. 5.7 The transition function. 5.8 The contribution function. 5.9 The objective function. 5.10 A measure-theoretic view of information. 5.11 Bibliographic notes. Problems. 6. Stochastic approximation methods. 6.1 A stochastic gradient algorithm. 6.2 Some stepsize recipes. 6.3 Stochastic stepsizes. 6.4 Computing bias and variance. 6.5 Optimal stepsizes. 6.6 Some experimental comparisons of stepsize formulas. 6.7 Convergence. 6.8 Why does it work? 6.9 Bibliographic notes. Problems. 7. Approximating value functions. 7.1 Approximation using aggregation. 7.2 Approximation methods using regression models. 7.3 Recursive methods for regression models. 7.4 Neural networks. 7.5 Batch processes. 7.6 Why does it work? 7.7 Bibliographic notes. Problems. 8. ADP for finite horizon problems. 8.1 Strategies for finite horizon problems. 8.2 Q-learning. 8.3 Temporal difference learning. 8.4 Policy iteration. 8.5 Monte Carlo value and policy iteration. 8.6 The actor-critic paradigm. 8.7 Bias in value function estimation. 8.8 State sampling strategies. 8.9 Starting and stopping. 8.10 A taxonomy of approximate dynamic programming strategies. 8.11 Why does it work? 8.12 Bibliographic notes. Problems. 9. Infinite horizon problems. 9.1 From finite to infinite horizon. 9.2 Algorithmic strategies. 9.3 Stepsizes for infinite horizon problems. 9.4 Error measures. 9.5 Direct ADP for online applications. 9.6 Finite horizon models for steady state applications. 9.7 Why does it work? 9.8 Bibliographic notes. Problems. 10. Exploration vs. exploitation. 10.1 A learning exercise: the nomadic trucker. 10.2 Learning strategies. 10.3 A simple information acquisition problem. 10.4 Gittins indices and the information acquisition problem. 10.5 Variations. 10.6 The knowledge gradient algorithm. 10.7 Information acquisition in dynamic programming. 10.8 Bibliographic notes. Problems. 11. Value function approximations for special functions. 11.1 Value functions versus gradients.
11.2 Linear approximations. 11.3 Piecewise linear approximations. 11.4 The SHAPE algorithm. 11.5 Regression methods. 11.6 Cutting planes. 11.7 Why does it work? 11.8 Bibliographic notes. Problems. 12. Dynamic resource allocation. 12.1 An asset acquisition problem. 12.2 The blood management problem. 12.3 A portfolio optimization problem. 12.4 A general resource allocation problem. 12.5 A fleet management problem. 12.6 A driver management problem. 12.7 Bibliographic references. Problems. 13. Implementation challenges. 13.1 Will ADP work for your problem? 13.2 Designing an ADP algorithm for complex problems. 13.3 Debugging an ADP algorithm. 13.4 Convergence issues. 13.5 Modeling your problem. 13.6 Online vs. offline models. 13.7 If it works, patent it!
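
As a concrete illustration of the exact methods surveyed in Chapter 3 (value iteration, policy iteration), here is a minimal value-iteration sketch for a small finite MDP. It is not taken from the book; the matrices P and R and the two-state example are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    # P[a]: |S| x |S| transition matrix for action a; R[a]: |S|-vector of
    # expected one-step rewards; gamma: discount factor in (0, 1).
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # values and a greedy policy
        V = V_new

# Two states, two actions, made-up numbers.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
R = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
V, policy = value_iteration(P, R)
```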

2,300 citations


Journal ArticleDOI
TL;DR: GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies, is introduced.
Abstract: Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter $\beta\in [0,1)$ (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter $\beta$ is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
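
A rough sketch of the kind of estimator the abstract describes, paraphrased from the description above rather than taken from the paper; `env_step`, `sample_action`, and `grad_log_policy` are hypothetical callables standing in for the POMDP and the parameterized stochastic policy.

```python
import numpy as np

def gpomdp_gradient(theta, obs, env_step, sample_action, grad_log_policy,
                    beta=0.9, horizon=10_000):
    # Single-trajectory, biased gradient estimate of the average reward.
    z = np.zeros_like(theta)       # eligibility trace (same size as theta)
    delta = np.zeros_like(theta)   # running gradient estimate
    for t in range(horizon):
        action = sample_action(theta, obs)          # u_t ~ policy(. | theta, y_t)
        z = beta * z + grad_log_policy(theta, obs, action)
        obs, reward = env_step(action)              # observe y_{t+1}, r_{t+1}
        delta += (reward * z - delta) / (t + 1)     # running average of r * z
    return delta
```

Larger values of beta reduce the bias at the cost of higher variance, which is the bias-variance trade-off mentioned in the abstract.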

645 citations


Book
26 Sep 2011
TL;DR: The theory of Markov processes has its origins in the studies by A. A. Markov (1856-1922) of sequences of experiments "connected in a chain" and in the attempts to describe mathematically the physical phenomenon known as Brownian motion.
Abstract: At first there was the Markov property. The theory of stochastic processes, which can be considered as an extension of probability theory, allows the modeling of the evolution of systems through time. It cannot be properly understood just as pure mathematics, separated from the body of experience and examples that have brought it to life. The theory of stochastic processes entered a period of intensive development, which is not finished yet, when the idea of the Markov property was brought in. Not even a serious study of the renewal processes is possible without using the strong tool of Markov processes. The modern theory of Markov processes has its origins in the studies by A. A. Markov (1856-1922) of sequences of experiments "connected in a chain" and in the attempts to describe mathematically the physical phenomenon known as Brownian motion. Later, many generalizations (in fact all kinds of weakenings of the Markov property) of Markov type stochastic processes were proposed. Some of them have led to new classes of stochastic processes and useful applications. Let us mention some of them: systems with complete connections [90, 91, 45, 86]; K-dependent Markov processes [44]; semi-Markov processes, and so forth. The semi-Markov processes generalize the renewal processes as well as the Markov jump processes and have numerous applications, especially in reliability.

486 citations


Journal ArticleDOI
01 Feb 2011
TL;DR: It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control.
Abstract: Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. In control system theory, these types of methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller that has equivalent performance with the optimal state variable feedback gain.

406 citations


Journal ArticleDOI
TL;DR: This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space and expresses the agents' autonomy by postulating that their models are not directly manipulable or observable by other agents.
Abstract: This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space. Agents maintain beliefs over physical states of the environment and over models of other agents, and they use Bayesian updates to maintain their beliefs over time. The solutions map belief states to actions. Models of other agents may include their belief states and are related to agent types considered in games of incomplete information. We express the agents' autonomy by postulating that their models are not directly manipulable or observable by other agents. We show that important properties of POMDPs, such as convergence of value iteration, the rate of convergence, and piece-wise linearity and convexity of the value functions carry over to our framework. Our approach complements a more traditional approach to interactive settings which uses Nash equilibria as a solution paradigm. We seek to avoid some of the drawbacks of equilibria which may be non-unique and do not capture off-equilibrium behaviors. We do so at the cost of having to represent, process and continuously revise models of other agents. Since the agents' beliefs may be arbitrarily nested, the optimal solutions to decision making problems are only asymptotically computable. However, approximate belief updates and approximately optimal plans are computable. We illustrate our framework using a simple application domain, and we show examples of belief updates and value functions.
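
The nested-belief machinery described above builds on the ordinary POMDP belief update over physical states. A minimal sketch of that single-agent building block follows (the array encodings are assumptions of this sketch, and beliefs over other agents' models are not shown):

```python
import numpy as np

def belief_update(b, action, observation, T, O):
    # b: current belief, an |S|-vector summing to 1.
    # T[a]: |S| x |S| transition matrix; O[a][s', o] = P(o | s', a).
    predicted = T[action].T @ b                 # sum_s T(s' | s, a) b(s)
    updated = O[action][:, observation] * predicted
    return updated / updated.sum()              # Bayes rule, then normalize
```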

369 citations


Book
08 Jun 2011
TL;DR: Theory of Finite Horizon Markov Decision Processes and Financial Markets (Part I) and Theory of Optimal Stopping Problems (Part IV).
Abstract: Preface.- 1.Introduction and First Examples.- Part I Finite Horizon Optimization Problems and Financial Markets.- 2.Theory of Finite Horizon Markov Decision Processes.- 3.The Financial Markets.- 4.Financial Optimization Problems.- Part II Partially Observable Markov Decision Problems.- 5.Partially Observable Markov Decision Processes.- 6.Partially Observable Markov Decision Problems in Finance.- Part III Infinite Horizon Optimization Problems.- 7.Theory of Infinite Horizon Markov Decision Processes.- 8.Piecewise Deterministic Markov Decision Processes.- 9.Optimization Problems in Finance and Insurance.- Part IV Stopping Problems.- 10.Theory of Optimal Stopping Problems.- 11.Stopping Problems in Finance.- Part V Appendix.- A.Tools from Analysis.- B.Tools from Probability.- C.Tools from Mathematical Finance.- References.- Index.

346 citations


Proceedings Article
12 Dec 2011
TL;DR: A probabilistic algorithm that allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.
Abstract: We present a probabilistic algorithm for nonlinear inverse reinforcement learning. The goal of inverse reinforcement learning is to learn the reward function in a Markov decision process from expert demonstrations. While most prior inverse reinforcement learning algorithms represent the reward as a linear combination of a set of features, we use Gaussian processes to learn the reward as a nonlinear function, while also determining the relevance of each feature to the expert's policy. Our probabilistic algorithm allows complex behaviors to be captured from suboptimal stochastic demonstrations, while automatically balancing the simplicity of the learned reward structure against its consistency with the observed actions.

336 citations


Book ChapterDOI
13 Jun 2011
TL;DR: Methods to analyse Markov decision processes, which model both stochastic and nondeterministic behaviour, and a wide range of their properties, including specifications in the temporal logics PCTL and LTL, probabilistic safety properties and cost- or reward-based measures are described.
Abstract: This tutorial provides an introduction to probabilistic model checking, a technique for automatically verifying quantitative properties of probabilistic systems. We focus on Markov decision processes (MDPs), which model both stochastic and nondeterministic behaviour. We describe methods to analyse a wide range of their properties, including specifications in the temporal logics PCTL and LTL, probabilistic safety properties and cost- or reward-based measures. We also discuss multi-objective probabilistic model checking, used to analyse trade-offs between several different quantitative properties. Applications of the techniques in this tutorial include performance and dependability analysis of networked systems, communication protocols and randomised distributed algorithms. Since such systems often comprise several components operating in parallel, we also cover techniques for compositional modelling and verification of multi-component probabilistic systems. Finally, we describe three large case studies which illustrate practical applications of the various methods discussed in the tutorial.
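
One core computation behind PCTL properties such as Pmax=? [F goal] is a fixed-point iteration over the MDP. The sketch below is a generic illustration of that step (the dictionary-based MDP encoding is an assumption of this sketch, not the input format of any particular tool):

```python
def max_reach_prob(states, actions, P, goal, max_iters=100_000, tol=1e-10):
    # P[(s, a)] is a list of (next_state, probability) pairs; actions(s)
    # returns the actions enabled in s (assumed non-empty for non-goal states).
    x = {s: (1.0 if s in goal else 0.0) for s in states}
    for _ in range(max_iters):
        x_new = {}
        for s in states:
            if s in goal:
                x_new[s] = 1.0
            else:
                x_new[s] = max(sum(p * x[t] for t, p in P[(s, a)])
                               for a in actions(s))
        if max(abs(x_new[s] - x[s]) for s in states) < tol:
            return x_new
        x = x_new
    return x
```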

333 citations


Journal ArticleDOI
TL;DR: The Markov Reward Model Checker is a software tool for verifying properties over probabilistic models that supports PCTL and CSL model checking, and their reward extensions.

319 citations


Proceedings Article
14 Jun 2011
TL;DR: This paper proposes a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent.
Abstract: We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is optimally acting in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near-)optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using learned MDP models. Empirical results on simulated car racing, gridworld and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.
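
A hedged sketch of the kind of stochastic-gradient step the abstract describes, under a simplified reading of the method (it is not the authors' implementation; `feat` is a hypothetical feature map over trajectories, and the baseline likelihood terms are dropped for brevity):

```python
import numpy as np

def relent_irl_step(theta, expert_trajs, baseline_trajs, feat, lr=0.01):
    # feat(traj) -> d-dimensional feature vector of a trajectory.
    f_expert = np.mean([feat(t) for t in expert_trajs], axis=0)
    f_base = np.array([feat(t) for t in baseline_trajs])
    # Importance weights: baseline trajectories that score well under the
    # current reward theta^T f(traj) receive more weight.
    w = np.exp(f_base @ theta)
    w /= w.sum()
    grad = f_expert - w @ f_base   # expert features minus weighted baseline features
    return theta + lr * grad
```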

319 citations


Journal ArticleDOI
TL;DR: Numerical results suggest that incorporating the statistical knowledge into the scheduling policies can result in significant savings, especially for short tasks, and it is demonstrated with real price data from Commonwealth Edison that scheduling with mismatched modeling and online parameter estimation can still provide significant economic advantages to consumers.
Abstract: The problem of causally scheduling power consumption to minimize the expected cost at the consumer side is considered. The price of electricity is assumed to be time-varying. The scheduler has access to past and current prices, but only statistical knowledge about future prices, which it uses to make an optimal decision in each time period. The scheduling problem is naturally cast as a Markov decision process. Algorithms to find decision thresholds for both noninterruptible and interruptible loads under a deadline constraint are then developed. Numerical results suggest that incorporating the statistical knowledge into the scheduling policies can result in significant savings, especially for short tasks. It is demonstrated with real price data from Commonwealth Edison that scheduling with mismatched modeling and online parameter estimation can still provide significant economic advantages to consumers.
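
To make the threshold idea concrete, here is a heavily simplified toy version in the spirit of the abstract (my own simplification, not the paper's algorithm): a single unit-length, noninterruptible task must start by a hard deadline, prices take finitely many levels and follow a Markov chain, and the task is started as soon as the current price drops below the expected cost of waiting.

```python
import numpy as np

def waiting_thresholds(prices, M, T):
    # prices: cost of starting at each price level; M: price transition matrix;
    # T: number of slots before the deadline. Start at slot t in level i iff
    # prices[i] <= thresholds[t][i].
    prices = np.asarray(prices, dtype=float)
    V = prices.copy()                   # at the deadline the task must start
    thresholds = []
    for _ in range(T - 1):
        cont = M @ V                    # expected cost-to-go if we wait one slot
        thresholds.append(cont)
        V = np.minimum(prices, cont)    # act now or wait, whichever is cheaper
    return list(reversed(thresholds)), V
```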

Journal ArticleDOI
Yinyu Ye
TL;DR: It is proved that the classic policy-iteration method and the original simplex method with the most-negative-reduced-cost pivoting rule of Dantzig are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate.
Abstract: We prove that the classic policy-iteration method [Howard, R. A. 1960. Dynamic Programming and Markov Processes. MIT, Cambridge] and the original simplex method with the most-negative-reduced-cost pivoting rule of Dantzig are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration and simplex methods is superior to that of the only known strongly polynomial-time interior-point algorithm [Ye, Y. 2005. A new complexity result on solving the Markov decision problem. Math. Oper. Res. 30(3) 733-749] for solving this problem. The result is surprising because the simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming problem [Klee, V., G. J. Minty. 1972. How good is the simplex method? Technical report. O. Shisha, ed. Inequalities III. Academic Press, New York], the simplex method with the smallest index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates [Melekopoglou, M., A. Condon. 1994. On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS J. Comput. 6(2) 188-192], and the policy-iteration method was recently shown to be exponential for solving undiscounted MDPs under the average cost criterion. We also extend the result to solving MDPs with transient substochastic transition matrices whose spectral radii are uniformly below one.
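
For reference, the classic (Howard) policy-iteration method that the paper analyses, written in its standard textbook form rather than as code from the paper:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    # P[a]: |S| x |S| transition matrix; R[a]: |S|-vector; 0 < gamma < 1.
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy improvement (switch every improvable state).
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```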

Book ChapterDOI
17 Jan 2011
TL;DR: It is conjectured that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
Abstract: Upper Confidence Trees are a very efficient tool for solving Markov Decision Processes; originating in difficult games like the game of Go, it is in particular surprisingly efficient in high dimensional problems. It is known that it can be adapted to continuous domains in some cases (in particular continuous action spaces). We here present an extension of Upper Confidence Trees to continuous stochastic problems. We (i) show a deceptive problem on which the classical Upper Confidence Tree approach does not work, even with arbitrarily large computational power and with progressive widening (ii) propose an improvement, termed double-progressive widening, which takes care of the compromise between variance (we want infinitely many simulations for each action/state) and bias (we want sufficiently many nodes to avoid a bias by the first nodes) and which extends the classical progressive widening (iii) discuss its consistency and show experimentally that it performs well on the deceptive problem and on experimental benchmarks. We guess that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
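
A hedged sketch of the widening rule the abstract refers to (the constants and node structure below are illustrative assumptions): a node visited n times may have at most ceil(C * n**alpha) children, and the same test gates both the set of tried actions and the set of sampled outcomes, which is what makes the widening "double".

```python
import math
import random

def may_expand(num_children, num_visits, C=1.0, alpha=0.5):
    # Progressive widening test applied at a tree node.
    return num_children < math.ceil(C * max(num_visits, 1) ** alpha)

def choose_outcome(node, simulate_transition, C=1.0, alpha=0.5):
    # node.outcomes: dict of sampled next states -> child nodes;
    # node.visits: visit count (hypothetical attributes of this sketch).
    if may_expand(len(node.outcomes), node.visits, C, alpha):
        return simulate_transition()            # draw a brand-new next state
    return random.choice(list(node.outcomes))   # reuse an existing child
```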

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work addresses shape grammar parsing for facade segmentation using Reinforcement Learning, formulating the problem as a Hierarchical Markov Decision Process with a recursive binary split grammar so as to efficiently find the optimal parse of a given facade in terms of the authors' shape grammar.
Abstract: We address shape grammar parsing for facade segmentation using Reinforcement Learning (RL). Shape parsing entails simultaneously optimizing the geometry and the topology (e.g. number of floors) of the facade, so as to optimize the fit of the predicted shape with the responses of pixel-level 'terminal detectors'. We formulate this problem in terms of a Hierarchical Markov Decision Process, by employing a recursive binary split grammar. This allows us to use RL to efficiently find the optimal parse of a given facade in terms of our shape grammar. Building on the RL paradigm, we exploit state aggregation to speedup computation, and introduce image-driven exploration in RL to accelerate convergence. We achieve state-of-the-art results on facade parsing, with a significant speed-up compared to existing methods, and substantial robustness to initial conditions. We demonstrate that the method can also be applied to interactive segmentation, and to a broad variety of architectural styles.

Journal ArticleDOI
TL;DR: The implementation of Rényi divergence via the sequential Monte Carlo method is presented and the performance of the proposed reward function is demonstrated by a numerical example, where a moving range-only sensor is controlled to estimate the number and the states of several moving objects using the PHD filter.
Abstract: The context is sensor control for multi-object Bayes filtering in the framework of partially observed Markov decision processes (POMDPs). The current information state is represented by the multi-object probability density function (pdf), while the reward function associated with each sensor control (action) is the information gain measured by the alpha or Rényi divergence. Assuming that both the predicted and updated state can be represented by independent identically distributed (IID) cluster random finite sets (RFSs) or, as a special case, the Poisson RFSs, this work derives the analytic expressions of the corresponding Rényi divergence based information gains. The implementation of the Rényi divergence via the sequential Monte Carlo method is presented. The performance of the proposed reward function is demonstrated by a numerical example, where a moving range-only sensor is controlled to estimate the number and the states of several moving objects using the PHD filter.
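
For reference, the order-alpha Rényi divergence between an updated density p and a predicted density q, in its standard single-density form (the paper derives the analogous expressions for IID-cluster and Poisson RFS multi-object densities):

```latex
D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha - 1}\,
  \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, \mathrm{d}x,
  \qquad \alpha > 0,\ \alpha \neq 1,
```

with the Kullback-Leibler divergence recovered in the limit alpha -> 1; the reward for a candidate sensor action is the divergence between the updated and the predicted information state.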

Journal Article
TL;DR: This paper introduces the Bayes-Adaptive Partially Observable Markov Decision Processes, a new framework that can be used to simultaneously learn a model of the POMDP domain through interaction with the environment, and track the state of the system under partial observability.
Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent's return improve as a function of experience.

Proceedings ArticleDOI
10 Apr 2011
TL;DR: This study considers smart grids with two classes of energy users - traditional energy users and opportunistic energy users (e.g., smart meters or smart appliances), and investigates pricing and dispatch at two timescales, via day-ahead scheduling and real-time scheduling.
Abstract: Integrating volatile renewable energy resources into the bulk power grid is challenging, due to the reliability requirement that the load and generation in the system remain balanced all the time. In this study, we tackle this challenge for smart grid with integrated wind generation, by leveraging multi-timescale dispatch and scheduling. Specifically, we consider smart grids with two classes of energy users - traditional energy users and opportunistic energy users (e.g., smart meters or smart appliances), and investigate pricing and dispatch at two timescales, via day-ahead scheduling and real-time scheduling. In day-ahead scheduling, with the statistical information on wind generation and energy demands, we characterize the optimal procurement of the energy supply and the day-ahead retail price for the traditional energy users; in real-time scheduling, with the realization of wind generation and the load of traditional energy users, we optimize real-time prices to manage the opportunistic energy users so as to achieve system-wide reliability. More specifically, when the opportunistic users are non-persistent, we obtain closed-form solutions to the two-level scheduling problem. For the persistent case, we treat the scheduling problem as a multi-timescale Markov decision process. We show that it can be recast, explicitly, as a classic Markov decision process with continuous state and action spaces, the solution to which can be found via standard techniques.

Journal ArticleDOI
TL;DR: A technique based on a combination of mechanistic population-scale models, Markov decision process theory and game theory that facilitates the evaluation of game theoretic decisions at both individual and community scales is presented.
Abstract: Reconciling the interests of individuals with the interests of communities is a major challenge in designing and implementing health policies. In this paper, we present a technique based on a combination of mechanistic population-scale models, Markov decision process theory and game theory that facilitates the evaluation of game theoretic decisions at both individual and community scales. To illustrate our technique, we provide solutions to several variants of the simple vaccination game including imperfect vaccine efficacy and differential waning of natural and vaccine immunity. In addition, we show how path-integral approaches can be applied to the study of models in which strategies are fixed waiting times rather than exponential random variables. These methods can be applied to a wide variety of decision problems with population-dynamic feedbacks.

Proceedings ArticleDOI
15 Dec 2011
TL;DR: The proposed V2G control algorithm is evaluated using both the simulated price and the actual price from PJM in 2010 to show that it can work effectively in the real electricity market and it is able to increase the profit significantly compared with the conventional EV charging scheme.
Abstract: The vehicle-to-grid (V2G) system enables energy flow from the electric vehicles (EVs) to the grid. The distributed power of the EVs can either be sold to the grid or be used to provide frequency regulation service when V2G is implemented. A V2G control algorithm is necessary to decide whether the EV should be charged, discharged, or provide frequency regulation service in each hour. The V2G control problem is further complicated by the price uncertainty, where the electricity price is determined dynamically every hour. In this paper, we study the real-time V2G control problem under price uncertainty. We model the electricity price as a Markov chain with unknown transition probabilities and formulate the problem as a Markov decision process (MDP). This model features implicit estimation of the impact of future electricity prices and current control operation on long-term profits. The Q-learning algorithm is then used to adapt the control operation to the hourly available price in order to maximize the profit for the EV owner during the whole parking time. We evaluate our proposed V2G control algorithm using both the simulated price and the actual price from PJM in 2010. Simulation results show that our proposed algorithm can work effectively in the real electricity market and it is able to increase the profit significantly compared with the conventional EV charging scheme.
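
A hedged sketch of the kind of tabular Q-learning update the abstract describes; the state encoding, action names, exploration scheme, and reward definition below are assumptions of this sketch, not the authors' exact formulation.

```python
import random
from collections import defaultdict

ACTIONS = ["charge", "discharge", "regulate"]
Q = defaultdict(float)                 # Q[(state, action)] -> estimated value

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Standard one-step Q-learning backup.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def epsilon_greedy(state, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# Each hour: state might encode (hour, price bin, battery level), and reward
# would be that hour's revenue minus cost for the chosen operation.
```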

Journal ArticleDOI
TL;DR: This work investigates the problem of minimizing the Average-Value-at-Risk (AVaRτ) of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process and shows that this problem can be reduced to an ordinary MDP with extended state space and given conditions under which an optimal policy exists.
Abstract: We investigate the problem of minimizing the Average-Value-at-Risk (AVaRτ) of the discounted cost over a finite and an infinite horizon which is generated by a Markov Decision Process (MDP). We show that this problem can be reduced to an ordinary MDP with extended state space and give conditions under which an optimal policy exists. We also give a time-consistent interpretation of the AVaRτ. At the end we consider a numerical example which is a simple repeated casino game. It is used to discuss the influence of the risk aversion parameter τ of the AVaRτ-criterion.
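
For orientation, AVaR admits the well-known Rockafellar-Uryasev representation, stated here in one common convention for a cost X at risk level τ (the paper's exact definition and the details of the state-space extension are in the paper itself):

```latex
\mathrm{AVaR}_{\tau}(X) \;=\; \inf_{z \in \mathbb{R}}
  \Big\{ z + \tfrac{1}{1-\tau}\, \mathbb{E}\big[(X - z)^{+}\big] \Big\}.
```

Roughly speaking, the outer minimization over the auxiliary variable z, combined with a state component that tracks the cost accumulated so far, is what allows the problem to be recast as an ordinary MDP on an extended state space.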

Journal ArticleDOI
TL;DR: In order to meet performance and robustness objectives, a new class of policies, called the Log rule, is proposed; these policies are radial sum-rate monotone (RSM) and provably throughput-optimal, and it can be shown that an RSM policy minimizes the asymptotic probability of sum-queue overflow.
Abstract: This paper considers the design of multiuser opportunistic packet schedulers for users sharing a time-varying wireless channel from performance and robustness points of view. For a simplified model falling in the classical Markov decision process framework, we numerically compute and characterize mean-delay-optimal scheduling policies. The computed policies exhibit radial sum-rate monotonicity: As users' queues grow linearly, the scheduler allocates service in a manner that deemphasizes the balancing of unequal queues in favor of maximizing current system throughput (being opportunistic). This is in sharp contrast to previously proposed throughput-optimal policies, e.g., Exp rule and MaxWeight (with any positive exponent of queue length). In order to meet performance and robustness objectives, we propose a new class of policies, called the Log rule, that are radial sum-rate monotone (RSM) and provably throughput-optimal. In fact, it can also be shown that an RSM policy minimizes the asymptotic probability of sum-queue overflow. We use extensive simulations to explore various possible design objectives for opportunistic schedulers. When users see heterogenous channels, we find that emphasizing queue balancing, e.g., Exp rule and MaxWeight, may excessively compromise the overall delay. Finally, we discuss approaches to implement the proposed policies for scheduling and resource allocation in OFDMA-based multichannel systems.

Journal ArticleDOI
TL;DR: This paper develops an upper bound on the performance of any arbitrary scheduler, and formulates and solves the optimal scheduling problem as a Markov Decision Process (MDP), assuming that complete state information about the relays is available at the source nodes.
Abstract: This paper considers wireless sensor networks (WSNs) with energy harvesting and cooperative communications and develops energy efficient scheduling strategies for such networks. In order to maximize the long-term utility of the network, the scheduling problem considered in this paper addresses the following question: given an estimate of the current network state, should a source transmit its data directly to the destination or use a relay to help with the transmission? We first develop an upper bound on the performance of any arbitrary scheduler. Next, the optimal scheduling problem is formulated and solved as a Markov Decision Process (MDP), assuming that complete state information about the relays is available at the source nodes. We then relax the assumption of the availability of full state information, and formulate the scheduling problem as a Partially Observable Markov Decision Process (POMDP) and show that it can be decomposed into an equivalent MDP problem. Simulation results are used to show the performance of the schedulers.

Journal ArticleDOI
TL;DR: A rigorous and unified framework for simultaneously utilizing both physical-layer and system-level techniques to minimize energy consumption, under delay constraints, in the presence of stochastic and unknown traffic and channel conditions is proposed.
Abstract: We consider the problem of energy-efficient point-to-point transmission of delay-sensitive data (e.g., multimedia data) over a fading channel. We propose a rigorous and unified framework for simultaneously utilizing both physical-layer and system-level techniques to minimize energy consumption, under delay constraints, in the presence of stochastic and unknown traffic and channel conditions. We formulate the problem as a Markov decision process and solve it online using reinforcement learning. The advantages of the proposed online method are that i) it does not require a priori knowledge of the traffic arrival and channel statistics to determine the jointly optimal physical-layer and system-level power management strategies; ii) it exploits partial information about the system so that less information needs to be learned than when using conventional reinforcement learning algorithms; and iii) it obviates the need for action exploration, which severely limits the adaptation speed and run-time performance of conventional reinforcement learning algorithms.

Proceedings Article
01 Jan 2011
TL;DR: Ye (2011) showed that both the simplex method with Dantzig's pivoting rule and Howard's policy iteration algorithm solve discounted Markov decision processes (MDPs) with a constant discount factor in strongly polynomial time.
Abstract: Ye [2011] showed recently that the simplex method with Dantzig's pivoting rule, as well as Howard's policy iteration algorithm, solve discounted Markov decision processes (MDPs), with a constant discount factor, in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most $O\left(\frac{mn}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations, where $n$ is the number of states, $m$ is the total number of actions in the MDP, and $0 < \gamma < 1$ is the discount factor.

Journal ArticleDOI
TL;DR: This work provides the first distance-estimation scheme for metrics based on bisimulation for continuous probabilistic transition systems and shows that the optimal value function associated with a discounted infinite-horizon planning task is continuous with respect to metric distances.
Abstract: In recent years, various metrics have been developed for measuring the behavioral similarity of states in probabilistic transition systems [J. Desharnais et al., Proceedings of CONCUR'99, Springer-Verlag, London, 1999, pp. 258-273; F. van Breugel and J. Worrell, Proceedings of ICALP'01, Springer-Verlag, London, 2001, pp. 421-432]. In the context of finite Markov decision processes (MDPs), we have built on these metrics to provide a robust quantitative analogue of stochastic bisimulation [N. Ferns, P. Panangaden, and D. Precup, Proceedings of UAI-04, AUAI Press, Arlington, VA, 2004, pp. 162-169] and an efficient algorithm for its calculation [N. Ferns, P. Panangaden, and D. Precup, Proceedings of UAI-06, AUAI Press, Arlington, VA, 2006, pp. 174-181]. In this paper, we seek to properly extend these bisimulation metrics to MDPs with continuous state spaces. In particular, we provide the first distance-estimation scheme for metrics based on bisimulation for continuous probabilistic transition systems. Our work, based on statistical sampling and infinite dimensional linear programming, is a crucial first step in formally guiding real-world planning, where tasks are usually continuous and highly stochastic in nature, e.g., robot navigation, and often a substitution with a parametric model or crude finite approximation must be made. We show that the optimal value function associated with a discounted infinite-horizon planning task is continuous with respect to metric distances. Thus, our metrics allow one to reason about the quality of solution obtained by replacing one model with another. Alternatively, they may potentially be used directly for state aggregation.

Book ChapterDOI
18 Apr 2011
TL;DR: This paper studies the synthesis problem for PCTL in PMDPs, and synthesises the parameter valuations under which F is true, using existing decision procedures to check whether F holds on each of the Markov processes represented by the hyper-rectangle.
Abstract: In parametric Markov decision processes (PMDPs), transition probabilities are not fixed, but are given as functions over a set of parameters. A PMDP denotes a family of concrete MDPs. This paper studies the synthesis problem for PCTL in PMDPs: Given a specification F in PCTL, we synthesise the parameter valuations under which F is true. First, we divide the possible parameter space into hyper-rectangles. We use existing decision procedures to check whether F holds on each of the Markov processes represented by the hyper-rectangle. As it is normally impossible to cover the whole parameter space by hyper-rectangles, we allow a limited area to remain undecided. We also consider an extension of PCTL with reachability rewards. To demonstrate the applicability of the approach, we apply our technique on a case study, using a preliminary implementation.

Book ChapterDOI
28 Jun 2011
TL;DR: This work frames the problem of optimally selecting teaching actions using a decision-theoretic approach and shows how to formulate teaching as a partially-observable Markov decision process (POMDP) planning problem.
Abstract: Both human and automated tutors must infer what a student knows and plan future actions to maximize learning. Though substantial research has been done on tracking and modeling student learning, there has been significantly less attention on planning teaching actions and how the assumed student model impacts the resulting plans. We frame the problem of optimally selecting teaching actions using a decision-theoretic approach and show how to formulate teaching as a partially-observable Markov decision process (POMDP) planning problem. We consider three models of student learning and present approximate methods for finding optimal teaching actions given the large state and action spaces that arise in teaching. An experimental evaluation of the resulting policies on a simple concept-learning task shows that framing teacher action planning as a POMDP can accelerate learning relative to baseline performance.

Journal Article
TL;DR: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert; this paper presents IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP).
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings.

Proceedings Article
11 Jun 2011
TL;DR: A new algorithm is presented that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs) and addresses planning in continuous-action MDPs.
Abstract: In this paper, we present a new algorithm that integrates recent advances in solving continuous bandit problems with sample-based rollout methods for planning in Markov Decision Processes (MDPs). Our algorithm, Hierarchical Optimistic Optimization applied to Trees (HOOT) addresses planning in continuous-action MDPs. Empirical results are given that show that the performance of our algorithm meets or exceeds that of a similar discrete action planner by eliminating the problem of manual discretization of the action space.

Journal ArticleDOI
TL;DR: It is shown in this paper that, when the collision constraints are tight, the optimal access strategy can be implemented by a simple memoryless access policy with periodic channel sensing; extensions to multiple secondary users are also presented.
Abstract: The problem of cognitive access of channels of primary users by a secondary user is considered. The transmissions of primary users are modeled as independent continuous-time Markovian on-off processes. A secondary cognitive user employs a slotted transmission format, and it senses one of the possible channels before transmission. The objective of the cognitive user is to maximize its throughput subject to collision constraints imposed by the primary users. The optimal access strategy is in general a solution of a constrained partially observable Markov decision process, which involves a constrained optimization in an infinite dimensional functional space. It is shown in this paper that, when the collision constraints are tight, the optimal access strategy can be implemented by a simple memoryless access policy with periodic channel sensing. Analytical expressions are given for the thresholds on collision probabilities for which memoryless access performs optimally. Extensions to multiple secondary users are also presented. Numerical and theoretical results are presented to validate and extend the analysis for different practical scenarios.