
Showing papers on "Markov decision process" published in 2012


Book
01 Jan 2012
TL;DR: The authors present a comprehensive book of 504 main pages divided into 17 chapters, with appendices covering multivariate analysis, basic tests in statistics, probability theory and convergence, random number generators, and Markov processes.
Abstract: This comprehensive book offers 504 main pages divided into 17 chapters. In addition, five very useful and clearly written appendices are provided, covering multivariate analysis, basic tests in statistics, probability theory and convergence, random number generators and Markov processes. Some of the topics covered in the book include: stochastic approximation in nonlinear search and optimization; evolutionary computations; reinforcement learning via temporal differences; mathematical model selection; and computer-simulation-based optimizations. Over 250 exercises are provided in the book, though only a small number of them have solutions included in the volume. A separate solution manual is available, as is a very informative webpage. The book may serve as either a reference for researchers and practitioners in many fields or as an excellent graduate level textbook.

1,163 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming, and surveys efficient extensions of the foundational algorithms.
Abstract: Situated in between supervised learning and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision making problems in which there is limited feedback. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, accompanied by the definition of value functions and policies. The main part of this text deals with introducing foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality with respect to the goal of learning sequential decisions. Additionally, it surveys efficient extensions of the foundational algorithms, differing mainly in the way feedback given by the environment is used to speed up learning, and in the way they concentrate on relevant parts of the problem. For both model-based and model-free settings, these efficient extensions have proven useful in scaling up to larger problems.
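To make the dynamic-programming side of this chapter concrete, here is a minimal tabular value-iteration sketch in Python; the transition and reward arrays are hypothetical toy values, not taken from the text.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration for a finite MDP.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = P(s' | s, a).
    R: expected immediate rewards, shape (S, A).
    Returns the optimal state values V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, policy = value_iteration(P, R)
```

Reinforcement-learning methods such as Q-learning estimate the same action values from sampled transitions instead of a known model, which is the other class of algorithms the chapter covers.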

281 citations


Posted Content
TL;DR: A probabilistic inverse optimal control (inverse reinforcement learning) algorithm is presented that scales gracefully with task dimensionality and is suitable for large, continuous domains where even computing a full policy is impractical.
Abstract: Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods.

272 citations


Book
19 Apr 2012

252 citations


Journal ArticleDOI
TL;DR: This paper derives a channel hopping defense strategy using the Markov decision process approach with the assumption of perfect knowledge, and proposes two learning schemes for secondary users to gain knowledge of adversaries to handle cases without perfect knowledge.
Abstract: Crucial to the successful deployment of cognitive radio networks, security issues have begun to receive research interests recently. In this paper, we focus on defending against the jamming attack, one of the major threats to cognitive radio networks. Secondary users can exploit the flexible access to multiple channels as the means of anti-jamming defense. We first investigate the situation where a secondary user can access only one channel at a time and hop among different channels, and model it as an anti-jamming game. Analyzing the interaction between the secondary user and attackers, we derive a channel hopping defense strategy using the Markov decision process approach with the assumption of perfect knowledge, and then propose two learning schemes for secondary users to gain knowledge of adversaries to handle cases without perfect knowledge. In addition, we extend to the scenario where secondary users can access all available channels simultaneously, and redefine the anti-jamming game with randomized power allocation as the defense strategy. We derive the Nash equilibrium for this Colonel Blotto game which minimizes the worst-case damage. Finally, simulation results are presented to verify the performance.

242 citations


Journal ArticleDOI
TL;DR: A comprehensive survey is given on several major systematic approaches to delay-aware control problems, namely the equivalent-rate constraint approach, the Lyapunov stability drift approach, and the approximate Markov decision process approach using stochastic learning.
Abstract: In this paper, a comprehensive survey is given on several major systematic approaches to dealing with delay-aware control problems, namely the equivalent-rate constraint approach, the Lyapunov stability drift approach, and the approximate Markov decision process approach using stochastic learning. These approaches essentially embrace most of the existing literature regarding delay-aware resource control in wireless systems. They have their relative pros and cons in terms of performance, complexity, and implementation issues. For each of the approaches, the problem setup, the general solution, and the design methodology are discussed. Applications of these approaches to delay-aware resource allocation are illustrated with examples in single-hop wireless networks. Furthermore, recent results regarding delay-aware multihop routing designs in general multihop networks are elaborated. Finally, the delay performances of various approaches are compared through simulations using an example of uplink OFDMA systems.
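For the Lyapunov stability drift approach mentioned in the survey, the generic drift-plus-penalty criterion (a standard form assumed here for illustration, not a result reproduced from the paper) chooses the control action in each slot to minimize a bound on the following quantity.

```latex
% Queues Q_i(t), quadratic Lyapunov function L(t) = (1/2) \sum_i Q_i(t)^2,
% one-slot conditional drift \Delta(t), and a penalty p(t) such as transmit power:
\Delta(t) + V\,\mathbb{E}\!\left[\,p(t) \mid \mathbf{Q}(t)\,\right],
\qquad
\Delta(t) \;=\; \mathbb{E}\!\left[\,L(t+1) - L(t) \mid \mathbf{Q}(t)\,\right],
% where the parameter V >= 0 trades average queue backlog (delay) against average penalty.
```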

210 citations


Posted Content
TL;DR: In this paper, a general formulation of safety through ergodicity is proposed, and an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration is presented, in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies.
Abstract: In environments with uncertain dynamics exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. The essence of ergodicity is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows for exploration algorithms that operate by simply favoring states that have rarely been visited before. For most physical systems this assumption is impractical as the systems would break before any reasonable exploration has taken place, i.e., most physical systems don't satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method is able to explore better than classical exploration methods.

209 citations


Posted Content
TL;DR: In this paper, the authors present metrics for measuring the similarity of states in a finite Markov decision process (MDP) based on the notion of bisimulation, with an aim towards solving discounted infinite horizon reinforcement learning tasks.
Abstract: We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
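The metrics described here are defined as a fixed point of an operator over state pairs; a schematic form (following the standard bisimulation-metric construction, with constants and notation possibly differing from the paper) is shown below.

```latex
% Bisimulation metric for a finite MDP: the unique fixed point d of the operator
% below, with weights c_R, c_T >= 0 and T_K(d) the Kantorovich (Wasserstein-1)
% distance induced by d on next-state distributions.
d(s, s') \;=\; \max_{a \in A}\Bigl(
    c_R \,\bigl|\, R(s, a) - R(s', a) \,\bigr|
    \;+\; c_T \, T_K(d)\bigl(P(\cdot \mid s, a),\, P(\cdot \mid s', a)\bigr)
\Bigr).
```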

201 citations


Journal ArticleDOI
TL;DR: This work models such a system at the flow level, considering a dynamic user configuration, and derives optimal sleep/wake up schemes based on the information on traffic load and user localization in the cell, in the cases where this information is complete, partial or delayed.
Abstract: We study, in this work, optimal sleep/wake up schemes for the base stations of network-operated femto cells deployed within macro cells for the purpose of offloading part of its traffic. Our aim is to minimize the energy consumption of the overall heterogeneous network while preserving the Quality of Service (QoS) experienced by users. We model such a system at the flow level, considering a dynamic user configuration, and derive, using Markov Decision Processes (MDPs), optimal sleep/wake up schemes based on the information on traffic load and user localization in the cell, in the cases where this information is complete, partial or delayed. Our results quantify the energy consumption and QoS perceived by the users in each of these cases and identify the tradeoffs between those two quantities. We also illustrate numerically the optimal policies in different traffic scenarios.

185 citations


Posted Content
TL;DR: In this paper, the authors provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP) with S states and A actions whose optimal bias vector has span bounded by H.
Abstract: We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.

183 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: A procedure from probabilistic model checking is used to combine the system model with an automaton representing the specification and this new MDP is transformed into an equivalent form that satisfies assumptions for stochastic shortest path dynamic programming.
Abstract: We present a method for designing a robust control policy for an uncertain system subject to temporal logic specifications. The system is modeled as a finite Markov Decision Process (MDP) whose transition probabilities are not exactly known but are known to belong to a given uncertainty set. A robust control policy is generated for the MDP that maximizes the worst-case probability of satisfying the specification over all transition probabilities in this uncertainty set. To this end, we use a procedure from probabilistic model checking to combine the system model with an automaton representing the specification. This new MDP is then transformed into an equivalent form that satisfies assumptions for stochastic shortest path dynamic programming. A robust version of dynamic programming solves for an ε-suboptimal robust control policy with time complexity O(log(1/ε)) times that for the non-robust case.
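A much-simplified sketch of the robust dynamic-programming step is given below, assuming the uncertainty set is just a finite collection of candidate transition models; the paper's actual uncertainty sets, the product construction with the specification automaton, and the stochastic shortest path transformation are not reproduced here.

```python
import numpy as np

def robust_value_iteration(P_models, R, gamma=0.95, iters=1000):
    """Robust value iteration against a finite set of transition models.

    P_models: candidate transition models, shape (K, S, A, S).
    R:        rewards, shape (S, A) (assumed model-independent here).
    Each backup maximizes over actions the worst-case value over models.
    """
    K, S, A, _ = P_models.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q_all = R + gamma * (P_models @ V)   # Q under each model, shape (K, S, A)
        Q_worst = Q_all.min(axis=0)          # adversarial choice of model
        V = Q_worst.max(axis=1)              # greedy choice of action
    return V, Q_worst.argmax(axis=1)

# Two hypothetical transition models of a 2-state, 2-action system.
P_models = np.array([
    [[[0.9, 0.1], [0.3, 0.7]], [[0.6, 0.4], [0.1, 0.9]]],
    [[[0.8, 0.2], [0.4, 0.6]], [[0.5, 0.5], [0.2, 0.8]]],
])
R = np.array([[0.0, 1.0], [1.0, 0.0]])
V_robust, robust_policy = robust_value_iteration(P_models, R)
```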

Journal ArticleDOI
TL;DR: A discounted infinite-horizon Markov decision process for scheduling cancer treatments in radiation therapy units is formulated and solved to identify good policies for allocating available treatment capacity to incoming demand, while reducing wait times in a cost-effective manner.

Book ChapterDOI
01 Jan 2012
TL;DR: A basic learning framework based on the economic research into game theory is described, together with a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.
Abstract: Reinforcement Learning was originally developed for Markov Decision Processes (MDPs). It allows a single agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic stationary environment. It guarantees convergence to the optimal policy, provided that the agent can sufficiently experiment and the environment in which it is operating is Markovian. However, when multiple agents apply reinforcement learning in a shared environment, this might be beyond the MDP model. In such systems, the optimal policy of an agent depends not only on the environment, but on the policies of the other agents as well. These situations arise naturally in a variety of domains, such as: robotics, telecommunications, economics, distributed control, auctions, traffic light control, etc. In these domains multi-agent learning is used, either because of the complexity of the domain or because control is inherently decentralized. In such systems it is important that agents are capable of discovering good solutions to the problem at hand either by coordinating with other learners or by competing with them. This chapter focuses on the application of reinforcement learning techniques in multi-agent systems. We describe a basic learning framework based on the economic research into game theory, and illustrate the additional complexity that arises in such systems. We also describe a representative selection of algorithms for the different areas of multi-agent reinforcement learning research.

Book ChapterDOI
01 Jan 2012
TL;DR: This chapter presents the POMDP model by focusing on the differences with fully observable MDPs, and shows how optimal policies for POMDPs can be represented.
Abstract: For reinforcement learning in environments in which an agent has access to a reliable state signal, methods based on the Markov decision process (MDP) have had many successes. In many problem domains, however, an agent suffers from limited sensing capabilities that preclude it from recovering a Markovian state signal from its perceptions. Extending the MDP framework, partially observable Markov decision processes (POMDPs) allow for principled decision making under conditions of uncertain sensing. In this chapter we present the POMDP model by focusing on the differences with fully observable MDPs, and we show how optimal policies for POMDPs can be represented. Next, we give a review of model-based techniques for policy computation, followed by an overview of the available model-free methods for POMDPs. We conclude by highlighting recent trends in POMDP reinforcement learning.
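Since a POMDP agent acts on a belief over hidden states rather than on the state itself, the central computational primitive is the Bayesian belief update; a minimal sketch (the standard Bayes filter, with hypothetical toy models, not code from the chapter) is given below.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update for a discrete POMDP.

    b: current belief over states, shape (S,)
    a: index of the action taken
    o: index of the observation received
    T: transition model, T[s, a, s'] = P(s' | s, a), shape (S, A, S)
    O: observation model, O[s', a, o] = P(o | s', a), shape (S, A, Z)
    """
    predicted = b @ T[:, a, :]          # predictive distribution over next states
    unnormalized = O[:, a, o] * predicted
    return unnormalized / unnormalized.sum()

# Hypothetical 2-state, 2-action, 2-observation POMDP.
T = np.array([[[0.7, 0.3], [0.1, 0.9]],
              [[0.4, 0.6], [0.8, 0.2]]])
O = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, o=1, T=T, O=O)
```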

Journal ArticleDOI
TL;DR: A computational framework for automatic deployment of a robot with sensor and actuator noise from a temporal logic specification over a set of properties that are satisfied by the regions of a partitioned environment is described.
Abstract: We describe a computational framework for automatic deployment of a robot with sensor and actuator noise from a temporal logic specification over a set of properties that are satisfied by the regions of a partitioned environment. We model the motion of the robot in the environment as a Markov decision process (MDP) and translate the motion specification to a formula of probabilistic computation tree logic (PCTL). As a result, the robot control problem is mapped to that of generating an MDP control policy from a PCTL formula. We present algorithms for the synthesis of such policies for different classes of PCTL formulas. We illustrate our method with simulation and experimental results.
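As an illustration of the kind of PCTL motion specification involved, a hypothetical property such as "reach the destination region with probability at least 0.9 while never entering an obstacle region" (not an example taken from the paper) can be written as follows.

```latex
% Hypothetical PCTL specification over atomic propositions labelling regions:
\mathcal{P}_{\geq 0.9}\left[\, \neg\,\mathrm{obstacle} \;\; \mathcal{U} \;\; \mathrm{destination} \,\right]
```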

Journal ArticleDOI
A. Sultan1
TL;DR: A cognitive radio setting is considered in which the secondary user is an energy harvester with a finite-capacity battery; the optimal policy is illustrated, compared with a myopic policy, and the variation of throughput with various system parameters is investigated.
Abstract: We consider a cognitive radio setting in which the secondary user is an energy harvester with a finite capacity battery. The primary user operates in a time-slotted fashion. At the beginning of each time slot, the secondary user, aiming at maximizing its throughput, may remain idle or carry out spectrum sensing to detect primary activity. The decision is determined by the secondary belief regarding primary activity and the amount of stored energy. We formulate this problem as a Markov decision process. We illustrate the optimal policy, compare it with a myopic policy, and investigate the variation of throughput with various system parameters.

Journal ArticleDOI
TL;DR: In this paper, sufficient conditions are presented for the existence of stationary optimal policies for average cost Markov decision processes with Borel state and action sets and weakly continuous transition probabilities.
Abstract: This paper presents sufficient conditions for the existence of stationary optimal policies for average cost Markov decision processes with Borel state and action sets and weakly continuous transition probabilities. The one-step cost functions may be unbounded, and the action sets may be noncompact. The main contributions of this paper are: (i) general sufficient conditions for the existence of stationary discount optimal and average cost optimal policies and descriptions of properties of value functions and sets of optimal actions, (ii) a sufficient condition for the average cost optimality of a stationary policy in the form of optimality inequalities, and (iii) approximations of average cost optimal actions by discount optimal actions.
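Contribution (ii) refers to an average cost optimality inequality; its standard schematic form (notation and the precise conditions on the relative value function h are as in the paper, not restated here) is shown below.

```latex
% Average cost optimality inequality (ACOI): rho is the optimal average cost,
% h a relative value function, c the one-step cost, and q the transition kernel.
\rho + h(x) \;\geq\; \min_{a \in A(x)} \Bigl\{\, c(x, a) + \int_X h(y)\, q(dy \mid x, a) \,\Bigr\},
\qquad x \in X,
% and a stationary policy attaining the minimum for every x is average-cost
% optimal under the paper's conditions.
```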

Proceedings ArticleDOI
17 Sep 2012
TL;DR: This work develops an algorithm that resolves nondeterminism probabilistically, and then uses multiple rounds of sampling and Reinforcement Learning to provably improve resolutions of nondeterminism with respect to satisfying a Bounded Linear Temporal Logic (BLTL) property.
Abstract: Statistical Model Checking (SMC) is a computationally very efficient verification technique based on selective system sampling. One well identified shortcoming of SMC is that, unlike probabilistic model checking, it cannot be applied to systems featuring nondeterminism, such as Markov Decision Processes (MDP). We address this limitation by developing an algorithm that resolves nondeterminism probabilistically, and then uses multiple rounds of sampling and Reinforcement Learning to provably improve resolutions of nondeterminism with respect to satisfying a Bounded Linear Temporal Logic (BLTL) property. Our algorithm thus reduces an MDP to a fully probabilistic Markov chain on which SMC may be applied to give an approximate solution to the problem of checking the probabilistic BLTL property. We integrate our algorithm in a parallelised modification of the PRISM simulation framework. Extensive validation with both new and PRISM benchmarks demonstrates that the approach scales very well in scenarios where symbolic algorithms fail to do so.

Book ChapterDOI
29 Oct 2012
TL;DR: In this article, the sample complexity of learning near-optimal behavior in finite-state discounted Markov Decision Processes (MDPs) is studied, and upper and lower bounds on sample complexity are shown, including a new bound for a modified version of UCRL with only cubic dependence on the horizon.
Abstract: We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors provided the transition matrix is not too dense.

Book
03 Jul 2012
TL;DR: Markov Decision Processes (MDPs) are widely used in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics, and are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes.
Abstract: Markov Decision Processes (MDPs) are widely popular in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics. They are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes. MDPs are actively researched in two related subareas of AI, probabilistic planning and reinforcement learning. Probabilistic planning assumes known models for the agent's goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives. On the other hand, reinforcement learning additionally learns these models based on the feedback the agent gets from the environment. This book provides a concise introduction to the use of MDPs for solving probabilistic planning problems, with an emphasis on the algorithmic perspective. It covers the whole spectrum of the field, from the basics to state-of-the-art optimal and approximation algorithms. We first describe the theoretical foundations of MDPs and the fundamental solution techniques for them. We then discuss modern optimal algorithms based on heuristic search and the use of structured representations. A major focus of the book is on the numerous approximation schemes for MDPs that have been developed in the AI literature. These include determinization-based approaches, sampling techniques, heuristic functions, dimensionality reduction, and hierarchical representations. Finally, we briefly introduce several extensions of the standard MDP classes that model and solve even more complex planning problems. Table of Contents: Introduction / MDPs / Fundamental Algorithms / Heuristic Search Algorithms / Symbolic Algorithms / Approximation Algorithms / Advanced Notes

Journal ArticleDOI
TL;DR: An approximate dynamic programming approach to network revenue management models with customer choice is developed, approximating the value function of the Markov decision process with a non-linear function that is separable across resource inventory levels.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper analyzes different sampling criteria, including a novel density-based criterion, demonstrates the importance of combining exploration and exploitation sampling criteria, and proposes a novel feedback-driven framework based on reinforcement learning.
Abstract: Active learning aims to reduce the amount of labels required for classification. The main difficulty is to find a good trade-off between exploration and exploitation of the labeling process that depends, among other things, on the classification task, the distribution of the data and the employed classification scheme. In this paper, we analyze different sampling criteria, including a novel density-based criterion, and demonstrate the importance of combining exploration and exploitation sampling criteria. We also show that a time-varying combination of sampling criteria often improves performance. Finally, by formulating the criteria selection as a Markov decision process, we propose a novel feedback-driven framework based on reinforcement learning. Our method does not require prior information on the dataset or the sampling criteria but rather is able to adapt the sampling strategy during the learning process by experience. We evaluate our approach on three challenging object recognition datasets and show superior performance to previous active learning methods.
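A heavily simplified stand-in for the feedback-driven criterion selection described above is a bandit-style rule over the candidate sampling criteria; the sketch below assumes the reward is, for example, the change in validation accuracy after labelling, and is not the paper's actual MDP formulation.

```python
import random

def select_criterion(q_values, epsilon=0.1):
    """Epsilon-greedy choice among candidate sampling criteria."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def update_criterion_value(q_values, counts, chosen, reward):
    """Incremental average of observed rewards for the chosen criterion."""
    counts[chosen] += 1
    q_values[chosen] += (reward - q_values[chosen]) / counts[chosen]

# Hypothetical criteria: 0 = uncertainty sampling, 1 = density-based, 2 = random.
q_values, counts = [0.0, 0.0, 0.0], [0, 0, 0]
# In the active-learning loop: pick c = select_criterion(q_values), label the
# sample chosen by criterion c, measure the resulting reward (e.g. accuracy gain),
# then call update_criterion_value(q_values, counts, c, reward).
```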

Journal Article
TL;DR: A performance bound is reported for the widely used least-squares policy iteration (LSPI) algorithm based on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function.
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
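For reference, the LSTD solution whose finite-sample behaviour is analyzed here can be computed from a batch of transitions as follows; this is a generic implementation sketch (with a small ridge term added for numerical stability), not code from the paper.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.99, reg=1e-6):
    """Least-squares temporal-difference (LSTD) solution for a fixed policy.

    phi:      features of visited states,   shape (n, d)
    phi_next: features of successor states, shape (n, d)
    rewards:  observed rewards,             shape (n,)
    Returns weights theta such that V(s) is approximated by phi(s) . theta.
    """
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next)   # d x d matrix
    b = phi.T @ rewards                    # d-dimensional vector
    return np.linalg.solve(A + reg * np.eye(d), b)
```

LSPI then alternates such least-squares evaluation steps (over state–action features) with greedy policy improvement, and the paper analyzes how the evaluation error propagates through those iterations.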

Book
14 Nov 2012
TL;DR: Two-time-scale Markov chains and their asymptotic expansions are developed and applied to Markov decision problems, near-optimal controls, numerical methods, and hybrid LQG problems with switching.
Abstract: Prologue and Preliminaries: Introduction and Overview.- Mathematical Preliminaries.- Markovian Models.- Two-Time-Scale Markov Chains: Asymptotic Expansions of Solutions for Forward Equations.- Occupation Measures: Asymptotic Properties and Ramification.- Asymptotic Expansions of Solutions for Backward Equations.- Applications: MDPs, Near-optimal Controls, Numerical Methods, and LQG with Switching: Markov Decision Problems.- Stochastic Control of Dynamical Systems.- Numerical Methods for Control and Optimization.- Hybrid LQG Problems.- References.- Index.

Proceedings Article
26 Jun 2012
TL;DR: This paper proposes a general formulation of safety through ergodicity, shows that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard, and presents an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration.
Abstract: In environments with uncertain dynamics exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. The essence of ergodicity is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows for exploration algorithms that operate by simply favoring states that have rarely been visited before. For most physical systems this assumption is impractical as the systems would break before any reasonable exploration has taken place, i.e., most physical systems don't satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At the core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method is able to explore better than classical exploration methods.

Proceedings ArticleDOI
24 Jun 2012
TL;DR: This paper formulates the problem as a Constrained Markov Decision Process (CMDP), obtains an optimal randomized bidding strategy through linear programming, and compares several adaptive check-pointing schemes in terms of monetary costs and job completion time.
Abstract: With the recent introduction of Spot Instances in the Amazon Elastic Compute Cloud (EC2), users can bid for resources and thus control the balance of reliability versus monetary costs. Mechanisms and tools that deal with the cost-reliability trade-offs under this schema are of great value for users seeking to lessen their costs while maintaining high reliability. In this paper, we propose a set of bidding strategies to minimize the cost and volatility of resource provisioning. Essentially, to derive an optimal bidding strategy, we formulate this problem as a Constrained Markov Decision Process (CMDP). Based on this model, we are able to obtain an optimal randomized bidding strategy through linear programming. Using real Instance Price traces and workload models, we compare several adaptive check-pointing schemes in terms of monetary costs and job completion time. We evaluate our model and demonstrate how users should bid optimally on Spot Instances to reach different objectives with desired levels of confidence.
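The step from a CMDP to a linear program goes through occupation measures; the generic reduction (sketched below with scipy for a discounted model and hypothetical inputs, not the paper's bidding model) also shows why the resulting optimal strategy is naturally randomized.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(P, r, c, budget, mu0, gamma=0.95):
    """Solve a discounted constrained MDP via an LP over occupation measures x(s, a).

    P: transitions (S, A, S); r: rewards (S, A); c: costs (S, A);
    budget: bound on expected discounted cost; mu0: initial state distribution (S,).
    """
    S, A, _ = P.shape
    n = S * A
    obj = -r.reshape(n)                     # maximize reward <=> minimize -reward
    # Flow conservation: sum_a x(s',a) - gamma * sum_{s,a} P(s'|s,a) x(s,a) = mu0(s')
    A_eq = np.zeros((S, n))
    for s_next in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = float(s == s_next) - gamma * P[s, a, s_next]
    # Expected discounted cost must stay within the budget.
    A_ub = c.reshape(1, n)
    res = linprog(obj, A_ub=A_ub, b_ub=[budget], A_eq=A_eq, b_eq=mu0,
                  bounds=(0, None))
    x = res.x.reshape(S, A)
    # Randomized policy pi(a | s); states with zero occupation measure can be
    # assigned an arbitrary action.
    policy = x / x.sum(axis=1, keepdims=True)
    return policy, x
```

Because the cost constraint can make every deterministic policy suboptimal or infeasible, the LP generally returns a strictly randomized policy, which matches the randomized bidding strategy described above.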

Proceedings Article
26 Jun 2012
TL;DR: A probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical.
Abstract: Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods.
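The maximum-entropy trajectory model that this family of inverse optimal control methods builds on can be written schematically as below; the paper's specific contribution, a local approximation around the demonstrated trajectories that removes the global-optimality requirement, is not reproduced here.

```latex
% Maximum-entropy IOC/IRL model: trajectories are exponentially more likely
% the higher their cumulative reward under the parameters \theta.
P(\tau \mid \theta) \;=\; \frac{\exp\!\bigl(r_\theta(\tau)\bigr)}{Z(\theta)},
\qquad
Z(\theta) \;=\; \int \exp\!\bigl(r_\theta(\tau)\bigr)\, d\tau,
\qquad
\theta^{\star} \;=\; \arg\max_{\theta} \sum_{i=1}^{N} \log P(\tau_i \mid \theta),
% where \tau_1, ..., \tau_N are the expert demonstrations and the partition
% function Z(\theta) is the intractable term that must be approximated.
```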

Book ChapterDOI
01 Jan 2012
TL;DR: The tradeoff between value and information, explored using the INFO-RL algorithm, provides a principled justification for stochastic (soft) policies and is used to show that these optimal policies are also robust to uncertainties in settings with only partial knowledge of the MDP parameters.
Abstract: Interactions between an organism and its environment are commonly treated in the framework of Markov Decision Processes (MDP). While standard MDP is aimed solely at maximizing expected future rewards (value), the circular flow of information between the agent and its environment is generally ignored. In particular, the information gained from the environment by means of perception and the information involved in the process of action selection (i.e., control) are not treated in the standard MDP setting. In this paper, we focus on the control information and show how it can be combined with the reward measure in a unified way. Both of these measures satisfy the familiar Bellman recursive equations, and their linear combination (the free-energy) provides an interesting new optimization criterion. The tradeoff between value and information, explored using our INFO-RL algorithm, provides a principled justification for stochastic (soft) policies. We use computational learning theory to show that these optimal policies are also robust to uncertainties in settings with only partial knowledge of the MDP parameters.

Proceedings Article
22 Jul 2012
TL;DR: This paper extends the DVF methodology to address full local observability, limited share of information, and communication breaks, and applies it to a real-world multi-robot exploration application where each robot locally computes a strategy that minimizes the interactions between the robots and maximizes the space coverage of the team, even under communication constraints.
Abstract: Recent works on multi-agent sequential decision making using decentralized partially observable Markov decision processes have been concerned with interaction-oriented resolution techniques and provide promising results. These techniques take advantage of local interactions and coordination. In this paper, we propose an approach based on an interaction-oriented resolution of decentralized decision makers. To this end, distributed value functions (DVF) have been used by decoupling the multi-agent problem into a set of individual agent problems. However, existing DVF techniques assume permanent and free communication between the agents. In this paper, we extend the DVF methodology to address full local observability, limited share of information and communication breaks. We apply our new DVF in a real-world application consisting of multi-robot exploration where each robot computes locally a strategy that minimizes the interactions between the robots and maximizes the space coverage of the team even under communication constraints. Our technique has been implemented and evaluated in simulation and in real-world scenarios during a robotic challenge for the exploration and mapping of an unknown environment. Experimental results from real-world scenarios and from the challenge are given where our system was vice-champion.

Journal ArticleDOI
TL;DR: An online actor–critic reinforcement learning algorithm with function approximation is developed for a problem of control under inequality constraints, and its asymptotic almost sure convergence to a locally optimal solution is proved.
Abstract: We develop an online actor–critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance on this setting and converges to a feasible point.
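The Lagrangian relaxation underlying such constrained actor–critic schemes takes the standard two-timescale form sketched below; the symbols are schematic, and the paper's exact update rules and step-size conditions are not reproduced.

```latex
% Long-run average cost J(\theta) under policy parameters \theta, constraint
% functionals G_k(\theta) <= \alpha_k, and Lagrange multipliers \lambda_k >= 0:
L(\theta, \lambda) \;=\; J(\theta) \;+\; \sum_k \lambda_k \bigl( G_k(\theta) - \alpha_k \bigr).
% The actor updates \theta using a sampled estimate of \nabla_\theta L on a faster
% timescale, while each multiplier ascends on a slower timescale,
\lambda_k \;\leftarrow\; \Bigl[\, \lambda_k + \eta \bigl( \hat{G}_k - \alpha_k \bigr) \Bigr]_{+},
% which drives the policy toward a locally optimal feasible point.
```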