
Showing papers on "Markov decision process" published in 2012


Book
01 Jan 2012
TL;DR: The authors present a comprehensive book of 504 main pages divided into 17 chapters, with appendices covering multivariate analysis, basic tests in statistics, probability theory and convergence, random number generators, and Markov processes.
Abstract: This comprehensive book offers 504 main pages divided into 17 chapters. In addition, five very useful and clearly written appendices are provided, covering multivariate analysis, basic tests in statistics, probability theory and convergence, random number generators and Markov processes. Some of the topics covered in the book include: stochastic approximation in nonlinear search and optimization; evolutionary computations; reinforcement learning via temporal differences; mathematical model selection; and computer-simulation-based optimizations. Over 250 exercises are provided in the book, though only a small number of them have solutions included in the volume. A separate solution manual is available, as is a very informative webpage. The book may serve as either a reference for researchers and practitioners in many fields or as an excellent graduate level textbook.

1,163 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming, and surveys efficient extensions of the foundational algorithms.
Abstract: Situated in between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, accompanied by the definition of value functions and policies. The main part of this text deals with introducing foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality with respect to the goal of learning sequential decisions. Additionally, it surveys efficient extensions of the foundational algorithms, differing mainly in the way feedback given by the environment is used to speed up learning, and in the way they concentrate on relevant parts of the problem. For both model-based and model-free settings, these efficient extensions have proven useful in scaling up to larger problems.
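To make the dynamic-programming side of this chapter concrete, here is a minimal tabular value-iteration sketch in Python; the transition and reward arrays are hypothetical toy values, not taken from the text.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration for a finite MDP.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = P(s' | s, a).
    R: expected immediate rewards, shape (S, A).
    Returns the optimal state values V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, policy = value_iteration(P, R)
```

Reinforcement-learning methods such as Q-learning estimate the same action values from sampled transitions instead of a known model, which is the other class of algorithms the chapter covers.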

281 citations


Posted Content
TL;DR: A probabilistic inverse optimal control (inverse reinforcement learning) algorithm is presented that scales gracefully with task dimensionality and is suitable for large, continuous domains where even computing a full policy is impractical.
Abstract: Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods.

272 citations


Book
19 Apr 2012

252 citations


Journal ArticleDOI
TL;DR: This paper derives a channel hopping defense strategy using the Markov decision process approach with the assumption of perfect knowledge, and proposes two learning schemes for secondary users to gain knowledge of adversaries to handle cases without perfect knowledge.
Abstract: Crucial to the successful deployment of cognitive radio networks, security issues have begun to receive research interests recently. In this paper, we focus on defending against the jamming attack, one of the major threats to cognitive radio networks. Secondary users can exploit the flexible access to multiple channels as the means of anti-jamming defense. We first investigate the situation where a secondary user can access only one channel at a time and hop among different channels, and model it as an anti-jamming game. Analyzing the interaction between the secondary user and attackers, we derive a channel hopping defense strategy using the Markov decision process approach with the assumption of perfect knowledge, and then propose two learning schemes for secondary users to gain knowledge of adversaries to handle cases without perfect knowledge. In addition, we extend to the scenario where secondary users can access all available channels simultaneously, and redefine the anti-jamming game with randomized power allocation as the defense strategy. We derive the Nash equilibrium for this Colonel Blotto game which minimizes the worst-case damage. Finally, simulation results are presented to verify the performance.

242 citations


Journal ArticleDOI
TL;DR: A comprehensive survey is given on several major systematic approaches to delay-aware control problems, namely the equivalent-rate constraint approach, the Lyapunov stability drift approach, and the approximate Markov decision process approach using stochastic learning.
Abstract: In this paper, a comprehensive survey is given on several major systematic approaches to dealing with delay-aware control problems, namely the equivalent-rate constraint approach, the Lyapunov stability drift approach, and the approximate Markov decision process approach using stochastic learning. These approaches essentially embrace most of the existing literature regarding delay-aware resource control in wireless systems. They have their relative pros and cons in terms of performance, complexity, and implementation issues. For each of the approaches, the problem setup, the general solution, and the design methodology are discussed. Applications of these approaches to delay-aware resource allocation are illustrated with examples in single-hop wireless networks. Furthermore, recent results regarding delay-aware multihop routing designs in general multihop networks are elaborated. Finally, the delay performances of various approaches are compared through simulations using an example of uplink OFDMA systems.
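For the Lyapunov stability drift approach mentioned in the survey, the generic drift-plus-penalty criterion (a standard form assumed here for illustration, not a result reproduced from the paper) chooses the control action in each slot to minimize a bound on the following quantity.

```latex
% Queues Q_i(t), quadratic Lyapunov function L(t) = (1/2) \sum_i Q_i(t)^2,
% one-slot conditional drift \Delta(t), and a penalty p(t) such as transmit power:
\Delta(t) + V\,\mathbb{E}\!\left[\,p(t) \mid \mathbf{Q}(t)\,\right],
\qquad
\Delta(t) \;=\; \mathbb{E}\!\left[\,L(t+1) - L(t) \mid \mathbf{Q}(t)\,\right],
% where the parameter V >= 0 trades average queue backlog (delay) against average penalty.
```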

210 citations


Posted Content
TL;DR: In this paper, a general formulation of safety through ergodicity is proposed, and an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration is presented, in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies.
Abstract: In environments with uncertain dynamics exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. The essence of ergodicity is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows for exploration algorithms that operate by simply favoring states that have rarely been visited before. For most physical systems this assumption is impractical as the systems would break before any reasonable exploration has taken place, i.e., most physical systems don't satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method is able to explore better than classical exploration methods.

209 citations


Posted Content
TL;DR: In this paper, the authors present metrics for measuring the similarity of states in a finite Markov decision process (MDP) based on the notion of bisimulation, with an aim towards solving discounted infinite horizon reinforcement learning tasks.
Abstract: We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
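The metrics described here are defined as a fixed point of an operator over state pairs; a schematic form (following the standard bisimulation-metric construction, with constants and notation possibly differing from the paper) is shown below.

```latex
% Bisimulation metric for a finite MDP: the unique fixed point d of the operator
% below, with weights c_R, c_T >= 0 and T_K(d) the Kantorovich (Wasserstein-1)
% distance induced by d on next-state distributions.
d(s, s') \;=\; \max_{a \in A}\Bigl(
    c_R \,\bigl|\, R(s, a) - R(s', a) \,\bigr|
    \;+\; c_T \, T_K(d)\bigl(P(\cdot \mid s, a),\, P(\cdot \mid s', a)\bigr)
\Bigr).
```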

201 citations


Journal ArticleDOI
TL;DR: This work models such a system at the flow level, considering a dynamic user configuration, and derives optimal sleep/wake up schemes based on the information on traffic load and user localization in the cell, in the cases where this information is complete, partial or delayed.
Abstract: We study, in this work, optimal sleep/wake up schemes for the base stations of network-operated femto cells deployed within macro cells for the purpose of offloading part of its traffic. Our aim is to minimize the energy consumption of the overall heterogeneous network while preserving the Quality of Service (QoS) experienced by users. We model such a system at the flow level, considering a dynamic user configuration, and derive, using Markov Decision Processes (MDPs), optimal sleep/wake up schemes based on the information on traffic load and user localization in the cell, in the cases where this information is complete, partial or delayed. Our results quantify the energy consumption and QoS perceived by the users in each of these cases and identify the tradeoffs between those two quantities. We also illustrate numerically the optimal policies in different traffic scenarios.

185 citations


Posted Content
TL;DR: In this paper, the authors provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP) with S states and A actions whose optimal bias vector has span bounded by H.
Abstract: We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.

183 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: A procedure from probabilistic model checking is used to combine the system model with an automaton representing the specification and this new MDP is transformed into an equivalent form that satisfies assumptions for stochastic shortest path dynamic programming.
Abstract: We present a method for designing a robust control policy for an uncertain system subject to temporal logic specifications. The system is modeled as a finite Markov Decision Process (MDP) whose transition probabilities are not exactly known but are known to belong to a given uncertainty set. A robust control policy is generated for the MDP that maximizes the worst-case probability of satisfying the specification over all transition probabilities in this uncertainty set. To this end, we use a procedure from probabilistic model checking to combine the system model with an automaton representing the specification. This new MDP is then transformed into an equivalent form that satisfies assumptions for stochastic shortest path dynamic programming. A robust version of dynamic programming solves for an ε-suboptimal robust control policy with time complexity O(log(1/ε)) times that for the non-robust case.
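A much-simplified sketch of the robust dynamic-programming step is given below, assuming the uncertainty set is just a finite collection of candidate transition models; the paper's actual uncertainty sets, the product construction with the specification automaton, and the stochastic shortest path transformation are not reproduced here.

```python
import numpy as np

def robust_value_iteration(P_models, R, gamma=0.95, iters=1000):
    """Robust value iteration against a finite set of transition models.

    P_models: candidate transition models, shape (K, S, A, S).
    R:        rewards, shape (S, A) (assumed model-independent here).
    Each backup maximizes over actions the worst-case value over models.
    """
    K, S, A, _ = P_models.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q_all = R + gamma * (P_models @ V)   # Q under each model, shape (K, S, A)
        Q_worst = Q_all.min(axis=0)          # adversarial choice of model
        V = Q_worst.max(axis=1)              # greedy choice of action
    return V, Q_worst.argmax(axis=1)

# Two hypothetical transition models of a 2-state, 2-action system.
P_models = np.array([
    [[[0.9, 0.1], [0.3, 0.7]], [[0.6, 0.4], [0.1, 0.9]]],
    [[[0.8, 0.2], [0.4, 0.6]], [[0.5, 0.5], [0.2, 0.8]]],
])
R = np.array([[0.0, 1.0], [1.0, 0.0]])
V_robust, robust_policy = robust_value_iteration(P_models, R)
```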

Journal ArticleDOI
TL;DR: A discounted infinite-horizon Markov decision process for scheduling cancer treatments in radiation therapy units is formulated and solved to identify good policies for allocating available treatment capacity to incoming demand, while reducing wait times in a cost-effective manner.

Book ChapterDOI
01 Jan 2012
TL;DR: A basic learning framework based on the economic research into game theory is described, together with a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.
Abstract: Reinforcement Learning was originally developed for Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic stationary environment. It guarantees convergence to the optimal policy, provided that the agent can sufficiently experiment and the environment in which it is operating is Markovian. However, when multiple agents apply reinforcement learning in a shared environment, this might be beyond the MDP model. In such systems, the optimal policy of an agent depends not only on the environment, but on the policies of the other agents as well. These situations arise naturally in a variety of domains, such as: robotics, telecommunications, economics, distributed control, auctions, traffic light control, etc. In these domains multi-agent learning is used, either because of the complexity of the domain or because control is inherently decentralized. In such systems it is important that agents are capable of discovering good solutions to the problem at hand either by coordinating with other learners or by competing with them. This chapter focuses on the application of reinforcement learning techniques in multi-agent systems. We describe a basic learning framework based on the economic research into game theory, and illustrate the additional complexity that arises in such systems. We also describe a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.

Book ChapterDOI
01 Jan 2012
TL;DR: This chapter presents the POMDP model by focusing on the differences with fully observable MDPs, and shows how optimal policies for POMDPs can be represented.
Abstract: For reinforcement learning in environments in which an agent has access to a reliable state signal, methods based on the Markov decision process (MDP) have had many successes. In many problem domains, however, an agent suffers from limited sensing capabilities that preclude it from recovering a Markovian state signal from its perceptions. Extending the MDP framework, partially observable Markov decision processes (POMDPs) allow for principled decision making under conditions of uncertain sensing. In this chapter we present the POMDP model by focusing on the differences with fully observable MDPs, and we show how optimal policies for POMDPs can be represented. Next, we give a review of model-based techniques for policy computation, followed by an overview of the available model-free methods for POMDPs. We conclude by highlighting recent trends in POMDP reinforcement learning.
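Since a POMDP agent acts on a belief over hidden states rather than on the state itself, the central computational primitive is the Bayesian belief update; a minimal sketch (the standard Bayes filter, with hypothetical toy models, not code from the chapter) is given below.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update for a discrete POMDP.

    b: current belief over states, shape (S,)
    a: index of the action taken
    o: index of the observation received
    T: transition model, T[s, a, s'] = P(s' | s, a), shape (S, A, S)
    O: observation model, O[s', a, o] = P(o | s', a), shape (S, A, Z)
    """
    predicted = b @ T[:, a, :]          # predictive distribution over next states
    unnormalized = O[:, a, o] * predicted
    return unnormalized / unnormalized.sum()

# Hypothetical 2-state, 2-action, 2-observation POMDP.
T = np.array([[[0.7, 0.3], [0.1, 0.9]],
              [[0.4, 0.6], [0.8, 0.2]]])
O = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, o=1, T=T, O=O)
```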

Journal ArticleDOI
TL;DR: A computational framework for automatic deployment of a robot with sensor and actuator noise from a temporal logic specification over a set of properties that are satisfied by the regions of a partitioned environment is described.
Abstract: We describe a computational framework for automatic deployment of a robot with sensor and actuator noise from a temporal logic specification over a set of properties that are satisfied by the regions of a partitioned environment. We model the motion of the robot in the environment as a Markov decision process (MDP) and translate the motion specification to a formula of probabilistic computation tree logic (PCTL). As a result, the robot control problem is mapped to that of generating an MDP control policy from a PCTL formula. We present algorithms for the synthesis of such policies for different classes of PCTL formulas. We illustrate our method with simulation and experimental results.
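As an illustration of the kind of PCTL motion specification involved, a hypothetical property such as "reach the destination region with probability at least 0.9 while never entering an obstacle region" (not an example taken from the paper) can be written as follows.

```latex
% Hypothetical PCTL specification over atomic propositions labelling regions:
\mathcal{P}_{\geq 0.9}\left[\, \neg\,\mathrm{obstacle} \;\; \mathcal{U} \;\; \mathrm{destination} \,\right]
```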

Journal ArticleDOI
A. Sultan1
TL;DR: A cognitive radio setting is considered in which the secondary user is an energy harvester with a finite-capacity battery; the optimal policy is illustrated, compared with a myopic policy, and the variation of throughput with various system parameters is investigated.
Abstract: We consider a cognitive radio setting in which the secondary user is an energy harvester with a finite capacity battery. The primary user operates in a time-slotted fashion. At the beginning of each time slot, the secondary user, aiming at maximizing its throughput, may remain idle or carry out spectrum sensing to detect primary activity. The decision is determined by the secondary belief regarding primary activity and the amount of stored energy. We formulate this problem as a Markov decision process. We illustrate the optimal policy, compare it with a myopic policy, and investigate the variation of throughput with various system parameters.

Journal ArticleDOI
TL;DR: In this paper, sufficient conditions are presented for the existence of stationary optimal policies for average cost Markov decision processes with Borel state and action sets and weakly continuous transition probabilities.
Abstract: This paper presents sufficient conditions for the existence of stationary optimal policies for average cost Markov decision processes with Borel state and action sets and weakly continuous transition probabilities. The one-step cost functions may be unbounded, and the action sets may be noncompact. The main contributions of this paper are: (i) general sufficient conditions for the existence of stationary discount optimal and average cost optimal policies and descriptions of properties of value functions and sets of optimal actions, (ii) a sufficient condition for the average cost optimality of a stationary policy in the form of optimality inequalities, and (iii) approximations of average cost optimal actions by discount optimal actions.
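Contribution (ii) refers to an average cost optimality inequality; its standard schematic form (notation and the precise conditions on the relative value function h are as in the paper, not restated here) is shown below.

```latex
% Average cost optimality inequality (ACOI): rho is the optimal average cost,
% h a relative value function, c the one-step cost, and q the transition kernel.
\rho + h(x) \;\geq\; \min_{a \in A(x)} \Bigl\{\, c(x, a) + \int_X h(y)\, q(dy \mid x, a) \,\Bigr\},
\qquad x \in X,
% and a stationary policy attaining the minimum for every x is average-cost
% optimal under the paper's conditions.
```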

Proceedings ArticleDOI
17 Sep 2012
TL;DR: This work develops an algorithm that resolves nondeterminism probabilistically, and then uses multiple rounds of sampling and Reinforcement Learning to provably improve resolutions of nondeterminism with respect to satisfying a Bounded Linear Temporal Logic (BLTL) property.
Abstract: Statistical Model Checking (SMC) is a computationally very efficient verification technique based on selective system sampling. One well identified shortcoming of SMC is that, unlike probabilistic model checking, it cannot be applied to systems featuring nondeterminism, such as Markov Decision Processes (MDP). We address this limitation by developing an algorithm that resolves nondeterminism probabilistically, and then uses multiple rounds of sampling and Reinforcement Learning to provably improve resolutions of nondeterminism with respect to satisfying a Bounded Linear Temporal Logic (BLTL) property. Our algorithm thus reduces an MDP to a fully probabilistic Markov chain on which SMC may be applied to give an approximate solution to the problem of checking the probabilistic BLTL property. We integrate our algorithm in a parallelised modification of the PRISM simulation framework. Extensive validation with both new and PRISM benchmarks demonstrates that the approach scales very well in scenarios where symbolic algorithms fail to do so.

Book ChapterDOI
29 Oct 2012
TL;DR: In this article, the sample complexity of learning near-optimal behavior in finite-state discounted Markov Decision Processes (MDPs) is studied, and upper and lower bounds on sample complexity are shown, including a new bound for a modified version of UCRL with only cubic dependence on the horizon.
Abstract: We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors provided the transition matrix is not too dense.

Book
03 Jul 2012
TL;DR: Markov Decision Processes (MDPs) are widely used in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics, and are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes.
Abstract: Markov Decision Processes (MDPs) are widely popular in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics. They are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes. MDPs are actively researched in two related subareas of AI, probabilistic planning and reinforcement learning. Probabilistic planning assumes known models for the agent's goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives. On the other hand, reinforcement learning additionally learns these models based on the feedback the agent gets from the environment. This book provides a concise introduction to the use of MDPs for solving probabilistic planning problems, with an emphasis on the algorithmic perspective. It covers the whole spectrum of the field, from the basics to state-of-the-art optimal and approximation algorithms. We first describe the theoretical foundations of MDPs and the fundamental solution techniques for them. We then discuss modern optimal algorithms based on heuristic search and the use of structured representations. A major focus of the book is on the numerous approximation schemes for MDPs that have been developed in the AI literature. These include determinization-based approaches, sampling techniques, heuristic functions, dimensionality reduction, and hierarchical representations. Finally, we briefly introduce several extensions of the standard MDP classes that model and solve even more complex planning problems. Table of Contents: Introduction / MDPs / Fundamental Algorithms / Heuristic Search Algorithms / Symbolic Algorithms / Approximation Algorithms / Advanced Notes

Journal ArticleDOI
TL;DR: An approximate dynamic programming approach to network revenue management models with customer choice is developed, approximating the value function of the Markov decision process with a non-linear function that is separable across resource inventory levels.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper analyzes different sampling criteria, including a novel density-based criterion, demonstrates the importance of combining exploration and exploitation sampling criteria, and proposes a novel feedback-driven framework based on reinforcement learning.
Abstract: Active learning aims to reduce the amount of labels required for classification. The main difficulty is to find a good trade-off between exploration and exploitation of the labeling process that depends, among other things, on the classification task, the distribution of the data and the employed classification scheme. In this paper, we analyze different sampling criteria, including a novel density-based criterion, and demonstrate the importance of combining exploration and exploitation sampling criteria. We also show that a time-varying combination of sampling criteria often improves performance. Finally, by formulating the criteria selection as a Markov decision process, we propose a novel feedback-driven framework based on reinforcement learning. Our method does not require prior information on the dataset or the sampling criteria but rather is able to adapt the sampling strategy during the learning process by experience. We evaluate our approach on three challenging object recognition datasets and show superior performance to previous active learning methods.
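A heavily simplified stand-in for the feedback-driven criterion selection described above is a bandit-style rule over the candidate sampling criteria; the sketch below assumes the reward is, for example, the change in validation accuracy after labelling, and is not the paper's actual MDP formulation.

```python
import random

def select_criterion(q_values, epsilon=0.1):
    """Epsilon-greedy choice among candidate sampling criteria."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def update_criterion_value(q_values, counts, chosen, reward):
    """Incremental average of observed rewards for the chosen criterion."""
    counts[chosen] += 1
    q_values[chosen] += (reward - q_values[chosen]) / counts[chosen]

# Hypothetical criteria: 0 = uncertainty sampling, 1 = density-based, 2 = random.
q_values, counts = [0.0, 0.0, 0.0], [0, 0, 0]
# In the active-learning loop: pick c = select_criterion(q_values), label the
# sample chosen by criterion c, measure the resulting reward (e.g. accuracy gain),
# then call update_criterion_value(q_values, counts, c, reward).
```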

Journal Article
TL;DR: A performance bound is reported for the widely used least-squares policy iteration (LSPI) algorithm based on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function.
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
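For reference, the LSTD solution whose finite-sample behaviour is analyzed here can be computed from a batch of transitions as follows; this is a generic implementation sketch (with a small ridge term added for numerical stability), not code from the paper.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.99, reg=1e-6):
    """Least-squares temporal-difference (LSTD) solution for a fixed policy.

    phi:      features of visited states,   shape (n, d)
    phi_next: features of successor states, shape (n, d)
    rewards:  observed rewards,             shape (n,)
    Returns weights theta such that V(s) is approximated by phi(s) . theta.
    """
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next)   # d x d matrix
    b = phi.T @ rewards                    # d-dimensional vector
    return np.linalg.solve(A + reg * np.eye(d), b)
```

LSPI then alternates such least-squares evaluation steps (over state–action features) with greedy policy improvement, and the paper analyzes how the evaluation error propagates through those iterations.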

Book
14 Nov 2012
TL;DR: Two-time-scale Markov chains and their asymptotic expansions are developed and applied to Markov decision problems, near-optimal controls, numerical methods, and hybrid LQG problems with switching.
Abstract: Prologue and Preliminaries: Introduction and Overview.- Mathematical Preliminaries.- Markovian Models.- Two-Time-Scale Markov Chains: Asymptotic Expansions of Solutions for Forward Equations.- Occupation Measures: Asymptotic Properties and Ramification.- Asymptotic Expansions of Solutions for Backward Equations.- Applications: MDPs, Near-optimal Controls, Numerical Methods, and LQG with Switching: Markov Decision Problems.- Stochastic Control of Dynamical Systems.- Numerical Methods for Control and Optimization.- Hybrid LQG Problems.- References.- Index.

Proceedings Article
26 Jun 2012
TL;DR: This paper proposes a general formulation of safety through ergodicity, shows that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard, and presents an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration.
Abstract: In environments with uncertain dynamics exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. The essence of ergodicity is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows for exploration algorithms that operate by simply favoring states that have rarely been visited before. For most physical systems this assumption is impractical as the systems would break before any reasonable exploration has taken place, i.e., most physical systems don't satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method is able to explore better than classical exploration methods.

Proceedings ArticleDOI
24 Jun 2012
TL;DR: This paper formulates the problem as a Constrained Markov Decision Process (CMDP), obtains an optimal randomized bidding strategy through linear programming, and compares several adaptive check-pointing schemes in terms of monetary costs and job completion time.
Abstract: With the recent introduction of Spot Instances in the Amazon Elastic Compute Cloud (EC2), users can bid for resources and thus control the balance of reliability versus monetary costs. Mechanisms and tools that deal with the cost-reliability trade-offs under this schema are of great value for users seeking to lessen their costs while maintaining high reliability. In this paper, we propose a set of bidding strategies to minimize the cost and volatility of resource provisioning. Essentially, to derive an optimal bidding strategy, we formulate this problem as a Constrained Markov Decision Process (CMDP). Based on this model, we are able to obtain an optimal randomized bidding strategy through linear programming. Using real Instance Price traces and workload models, we compare several adaptive check-pointing schemes in terms of monetary costs and job completion time. We evaluate our model and demonstrate how users should bid optimally on Spot Instances to reach different objectives with desired levels of confidence.
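The step from a CMDP to a linear program goes through occupation measures; the generic reduction (sketched below with scipy for a discounted model and hypothetical inputs, not the paper's bidding model) also shows why the resulting optimal strategy is naturally randomized.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(P, r, c, budget, mu0, gamma=0.95):
    """Solve a discounted constrained MDP via an LP over occupation measures x(s, a).

    P: transitions (S, A, S); r: rewards (S, A); c: costs (S, A);
    budget: bound on expected discounted cost; mu0: initial state distribution (S,).
    """
    S, A, _ = P.shape
    n = S * A
    obj = -r.reshape(n)                     # maximize reward <=> minimize -reward
    # Flow conservation: sum_a x(s',a) - gamma * sum_{s,a} P(s'|s,a) x(s,a) = mu0(s')
    A_eq = np.zeros((S, n))
    for s_next in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = float(s == s_next) - gamma * P[s, a, s_next]
    # Expected discounted cost must stay within the budget.
    A_ub = c.reshape(1, n)
    res = linprog(obj, A_ub=A_ub, b_ub=[budget], A_eq=A_eq, b_eq=mu0,
                  bounds=(0, None))
    x = res.x.reshape(S, A)
    # Randomized policy pi(a | s); states with zero occupation measure can be
    # assigned an arbitrary action.
    policy = x / x.sum(axis=1, keepdims=True)
    return policy, x
```

Because the cost constraint can make every deterministic policy suboptimal or infeasible, the LP generally returns a strictly randomized policy, which matches the randomized bidding strategy described above.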

Proceedings Article
26 Jun 2012
TL;DR: A probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical.
Abstract: Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods.
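The maximum-entropy trajectory model that this family of inverse optimal control methods builds on can be written schematically as below; the paper's specific contribution, a local approximation around the demonstrated trajectories that removes the global-optimality requirement, is not reproduced here.

```latex
% Maximum-entropy IOC/IRL model: trajectories are exponentially more likely
% the higher their cumulative reward under the parameters \theta.
P(\tau \mid \theta) \;=\; \frac{\exp\!\bigl(r_\theta(\tau)\bigr)}{Z(\theta)},
\qquad
Z(\theta) \;=\; \int \exp\!\bigl(r_\theta(\tau)\bigr)\, d\tau,
\qquad
\theta^{\star} \;=\; \arg\max_{\theta} \sum_{i=1}^{N} \log P(\tau_i \mid \theta),
% where \tau_1, ..., \tau_N are the expert demonstrations and the partition
% function Z(\theta) is the intractable term that must be approximated.
```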

Book ChapterDOI
01 Jan 2012
TL;DR: The tradeoff between value and information, explored using the INFO-RL algorithm, provides a principled justification for stochastic (soft) policies and is used to show that these optimal policies are also robust to uncertainties in settings with only partial knowledge of the MDP parameters.
Abstract: Interactions between an organism and its environment are commonly treated in the framework of Markov Decision Processes (MDP). While standard MDP is aimed solely at maximizing expected future rewards (value), the circular flow of information between the agent and its environment is generally ignored. In particular, the information gained from the environment by means of perception and the information involved in the process of action selection (i.e., control) are not treated in the standard MDP setting. In this paper, we focus on the control information and show how it can be combined with the reward measure in a unified way. Both of these measures satisfy the familiar Bellman recursive equations, and their linear combination (the free-energy) provides an interesting new optimization criterion. The tradeoff between value and information, explored using our INFO-RL algorithm, provides a principled justification for stochastic (soft) policies. We use computational learning theory to show that these optimal policies are also robust to uncertainties in settings with only partial knowledge of the MDP parameters.

Proceedings Article
22 Jul 2012
TL;DR: This paper extends the DVF methodology to address full local observability, limited share of information, and communication breaks, and applies it to a real-world multi-robot exploration application where each robot locally computes a strategy that minimizes the interactions between the robots and maximizes the space coverage of the team, even under communication constraints.
Abstract: Recent works on multi-agent sequential decision making using decentralized partially observable Markov decision processes have been concerned with interaction-oriented resolution techniques and provide promising results. These techniques take advantage of local interactions and coordination. In this paper, we propose an approach based on an interaction-oriented resolution of decentralized decision makers. To this end, distributed value functions (DVF) have been used by decoupling the multi-agent problem into a set of individual agent problems. However, existing DVF techniques assume permanent and free communication between the agents. In this paper, we extend the DVF methodology to address full local observability, limited share of information and communication breaks. We apply our new DVF in a real-world application consisting of multi-robot exploration where each robot computes locally a strategy that minimizes the interactions between the robots and maximizes the space coverage of the team even under communication constraints. Our technique has been implemented and evaluated in simulation and in real-world scenarios during a robotic challenge for the exploration and mapping of an unknown environment. Experimental results from real-world scenarios and from the challenge are given where our system was vice-champion.

Journal ArticleDOI
TL;DR: An online actor–critic reinforcement learning algorithm with function approximation is developed for a problem of control under inequality constraints, and its asymptotic almost sure convergence to a locally optimal solution is proved.
Abstract: We develop an online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance on this setting and converges to a feasible point.
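The Lagrangian relaxation underlying such constrained actor–critic schemes takes the standard two-timescale form sketched below; the symbols are schematic, and the paper's exact update rules and step-size conditions are not reproduced.

```latex
% Long-run average cost J(\theta) under policy parameters \theta, constraint
% functionals G_k(\theta) <= \alpha_k, and Lagrange multipliers \lambda_k >= 0:
L(\theta, \lambda) \;=\; J(\theta) \;+\; \sum_k \lambda_k \bigl( G_k(\theta) - \alpha_k \bigr).
% The actor updates \theta using a sampled estimate of \nabla_\theta L on a faster
% timescale, while each multiplier ascends on a slower timescale,
\lambda_k \;\leftarrow\; \Bigl[\, \lambda_k + \eta \bigl( \hat{G}_k - \alpha_k \bigr) \Bigr]_{+},
% which drives the policy toward a locally optimal feasible point.
```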