Showing papers on "Markov decision process" published in 2008


Book
15 Dec 2008
TL;DR: This exciting and pioneering new overview of multiagent systems, which are online systems composed of multiple interacting intelligent agents (as in online trading), offers a new computer science perspective on multiagent systems, while integrating ideas from operations research, game theory, economics, logic, and even philosophy and linguistics.
Abstract: This exciting and pioneering new overview of multiagent systems, which are online systems composed of multiple interacting intelligent agents (as in online trading), offers a new computer science perspective on multiagent systems, while integrating ideas from operations research, game theory, economics, logic, and even philosophy and linguistics. The authors emphasize foundations to create a broad and rigorous treatment of their subject, with thorough presentations of distributed problem solving, game theory, multiagent communication and learning, social choice, mechanism design, auctions, cooperative game theory, and modal logics of knowledge and belief. For each topic, basic concepts are introduced, examples are given, proofs of key results are offered, and algorithmic considerations are examined. An appendix covers background material in probability theory, classical logic, Markov decision processes and mathematical programming. Written by two of the leading researchers of this engaging field, this book will surely serve as THE reference for researchers in the fastest-growing area of computer science, and be used as a text for advanced undergraduate or graduate courses.

2,068 citations


Proceedings ArticleDOI
25 Jun 2008
TL;DR: This work has developed a new point-based POMDP algorithm that exploits the notion of optimally reachable belief spaces to improve computational efficiency and substantially outperformed one of the fastest existing point-based algorithms.
Abstract: In Proc. Robotics: Science & Systems, 2008. Motion planning in uncertain and dynamic environments is an essential capability for autonomous robots. Partially observable Markov decision processes (POMDPs) provide a principled mathematical framework for solving such problems, but they are often avoided in robotics due to high computational complexity. Our goal is to create practical POMDP algorithms and software for common robotic tasks. To this end, we have developed a new point-based POMDP algorithm that exploits the notion of optimally reachable belief spaces to improve computational efficiency. In simulation, we successfully applied the algorithm to a set of common robotic tasks, including instances of coastal navigation, grasping, mobile robot exploration, and target tracking, all modeled as POMDPs with a large number of states. In most of the instances studied, our algorithm substantially outperformed one of the fastest existing point-based algorithms. A software package implementing our algorithm will soon be released at http://motion.comp.nus.edu.sg/projects/pomdp/pomdp.html.
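
The core operation shared by point-based POMDP solvers of this kind is the alpha-vector backup at a sampled belief point. The sketch below is a minimal illustration of that backup, not the paper's algorithm or released software; the model arrays T[a,s,s'], Z[a,s',o], R[a,s] and the current alpha-vector set are assumed inputs.

```python
import numpy as np

# Minimal point-based backup sketch (illustrative only, not the paper's implementation).
# Assumed inputs: T[a,s,s'] transition probs, Z[a,s',o] observation probs, R[a,s] rewards,
# alphas = current (non-empty) list of alpha-vectors, belief = length-S distribution.

def point_based_backup(belief, alphas, T, Z, R, gamma=0.95):
    """Return the best backed-up alpha-vector and its value at this belief point."""
    A, S, _ = T.shape
    O = Z.shape[2]
    best_val, best_alpha = -np.inf, None
    for a in range(A):
        alpha_a = R[a].astype(float).copy()
        for o in range(O):
            # For each observation, choose the existing alpha-vector that is best at the
            # successor belief, folded back through the transition and observation models.
            g = [gamma * (T[a] * Z[a][:, o]).dot(alpha) for alpha in alphas]
            alpha_a += g[int(np.argmax([belief.dot(gi) for gi in g]))]
        if belief.dot(alpha_a) > best_val:
            best_val, best_alpha = belief.dot(alpha_a), alpha_a
    return best_alpha, best_val
```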

897 citations


Book ChapterDOI
29 Mar 2008
TL;DR: Turn-based stochastic games on infinite graphs induced by game probabilistic lossy channel systems (GPLCS) are decidable, which generalizes the decidability result for PLCS-induced Markov decision processes in [10].
Abstract: We consider turn-based stochastic games on infinite graphs induced by game probabilistic lossy channel systems (GPLCS), the game version of probabilistic lossy channel systems (PLCS). We study games with Büchi (repeated reachability) objectives and almost-sure winning conditions. These games are pure memoryless determined and, under the assumption that the target set is regular, a symbolic representation of the set of winning states for each player can be effectively constructed. Thus, turn-based stochastic games on GPLCS are decidable. This generalizes the decidability result for PLCS-induced Markov decision processes in [10].

570 citations


Journal ArticleDOI
TL;DR: The objectives here are to survey the various existing online POMDP methods, analyze their properties and discuss their advantages and disadvantages; and to thoroughly evaluate these online approaches in different environments under various metrics.
Abstract: Partially Observable Markov Decision Processes (POMDPs) provide a rich framework for sequential decision-making under uncertainty in stochastic domains. However, solving a POMDP is often intractable except for small problems due to their complexity. Here, we focus on online approaches that alleviate the computational complexity by computing good local policies at each decision step during the execution. Online algorithms generally consist of a lookahead search to find the best action to execute at each time step in an environment. Our objectives here are to survey the various existing online POMDP methods, analyze their properties and discuss their advantages and disadvantages; and to thoroughly evaluate these online approaches in different environments under various metrics (return, error bound reduction, lower bound improvement). Our experimental results indicate that state-of-the-art online heuristic search methods can handle large POMDP domains efficiently.
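
As a rough illustration of the lookahead search that these online methods build on, the sketch below performs a depth-limited expectimax over the belief tree. It is a generic outline, not any of the surveyed algorithms; ACTIONS, OBS, reward, obs_prob, and belief_update are hypothetical placeholders for a POMDP model.

```python
import numpy as np

# Generic online lookahead sketch (not one of the surveyed algorithms). Hypothetical
# placeholders: ACTIONS, OBS, reward(b, a), obs_prob(o, b, a), belief_update(b, a, o).

def lookahead(b, depth, gamma=0.95):
    """Depth-limited expectimax over beliefs; returns (value estimate, best action)."""
    if depth == 0:
        return 0.0, None          # real algorithms plug in offline lower/upper bounds here
    best_val, best_act = -np.inf, None
    for a in ACTIONS:
        val = reward(b, a)
        for o in OBS:
            p = obs_prob(o, b, a)
            if p > 0:
                v, _ = lookahead(belief_update(b, a, o), depth - 1, gamma)
                val += gamma * p * v
        if val > best_val:
            best_val, best_act = val, a
    return best_val, best_act

# At each decision step the agent would execute lookahead(current_belief, depth=D)[1].
```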

557 citations


Journal ArticleDOI
TL;DR: A theoretical analysis of Model-based Interval Estimation (MBIE) and a new variation called MBIE-EB is presented, proving efficiency guarantees for both algorithms even under worst-case conditions.

503 citations


Journal Article
TL;DR: A theoretical analysis of the performance of sampling-based fitted value iteration (FVI) for solving infinite state-space, discounted-reward Markovian decision processes (MDPs) is developed, under the assumption that a generative model of the environment is available.
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well-behaved in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrarily good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted Lp-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is "aligned" with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.
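
A simplified sketch of sampling-based fitted value iteration under a generative model is given below. It is one possible instantiation, not the paper's exact procedure: states is an assumed list of sampled state feature vectors, generative_model(s, a) is an assumed sampler returning (next_state, reward), and a tree ensemble stands in for the paper's general class of function approximators.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Simplified sampling-based FVI sketch (illustrative; assumed inputs described above).
def fitted_value_iteration(states, actions, generative_model, gamma=0.95,
                           n_iters=50, n_next=10):
    states = np.asarray(states, dtype=float)       # sampled state feature vectors
    V = lambda X: np.zeros(len(X))                 # initial value-function estimate
    for _ in range(n_iters):
        targets = []
        for s in states:
            q_values = []
            for a in actions:
                samples = [generative_model(s, a) for _ in range(n_next)]
                next_states = np.array([ns for ns, _ in samples], dtype=float)
                rewards = np.array([r for _, r in samples], dtype=float)
                q_values.append(np.mean(rewards + gamma * V(next_states)))
            targets.append(max(q_values))          # sampled Bellman optimality backup
        reg = ExtraTreesRegressor(n_estimators=50).fit(states, targets)
        V = lambda X, reg=reg: reg.predict(np.asarray(X, dtype=float))
    return V
```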

441 citations


Journal ArticleDOI
TL;DR: The numerical results show that the proposed scheme performs better than other vertical handoff decision algorithms, namely, simple additive weighting, the technique for order preference by similarity to ideal solution, and Grey relational analysis.
Abstract: The architecture for the Beyond 3rd Generation (B3G) or 4th Generation (4G) wireless networks aims at integrating various heterogeneous wireless access networks. One of the major design issues is the support of vertical handoff. Vertical handoff occurs when a mobile terminal switches from one network to another (e.g., from wireless local area network to code-division multiple-access 1x radio transmission technology). The objective of this paper is to determine the conditions under which vertical handoff should be performed. The problem is formulated as a Markov decision process with the objective of maximizing the total expected reward per connection. The network resources that are utilized by the connection are captured by a link reward function. A signaling cost is used to model the signaling and processing load incurred on the network when vertical handoff is performed. The value iteration algorithm is used to compute a stationary deterministic policy. For performance evaluation, voice and data applications are considered. The numerical results show that our proposed scheme performs better than other vertical handoff decision algorithms, namely, simple additive weighting, the technique for order preference by similarity to ideal solution, and Grey relational analysis.
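
The value iteration computation mentioned above is standard; a minimal sketch is shown below, with the per-state, per-action net reward formed from an illustrative link reward minus signaling cost in the spirit of the paper's formulation. P, R_link, and C_sig are placeholder arrays, not the paper's actual model.

```python
import numpy as np

# Generic value iteration sketch for an MDP whose reward combines a link reward and a
# signaling cost. Assumed placeholder arrays: P[a,s,s'], R_link[a,s], C_sig[a,s].

def value_iteration(P, R_link, C_sig, gamma=0.95, tol=1e-6):
    A, S, _ = P.shape
    R = R_link - C_sig                             # net reward of action a in state s
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum('asn,n->as', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)         # stationary deterministic policy
        V = V_new
```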

420 citations


Journal ArticleDOI
TL;DR: A framework of constrained Markov decision processes is presented, and the optimal access policy is derived via a linear program and it is demonstrated that periodic sensing yields negligible loss of throughput when the constraint on interference is tight.
Abstract: The problem of opportunistic access of parallel channels occupied by primary users is considered. Under a continuous-time Markov chain modeling of the channel occupancy by the primary users, a slotted transmission protocol for secondary users using a periodic sensing strategy with optimal dynamic access is proposed. To maximize channel utilization while limiting interference to primary users, a framework of constrained Markov decision processes is presented, and the optimal access policy is derived via a linear program. Simulations are used for performance evaluation. It is demonstrated that periodic sensing yields negligible loss of throughput when the constraint on interference is tight.
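
The kind of linear program derived from a constrained MDP can be sketched with occupation-measure variables, as below. This is an illustrative average-reward formulation solved with scipy, not the paper's exact program; P, reward, interference, and c_max are placeholder inputs.

```python
import numpy as np
from scipy.optimize import linprog

# Occupation-measure LP sketch for a constrained average-reward MDP (throughput reward,
# interference cost). Assumed inputs: P[s,a,s'], reward[s,a], interference[s,a], c_max.

def solve_cmdp(P, reward, interference, c_max):
    S, A, _ = P.shape
    n = S * A                                     # decision variables rho(s,a)
    c = -reward.reshape(n)                        # linprog minimizes, so negate the reward
    # Balance constraints: sum_a rho(s',a) - sum_{s,a} rho(s,a) P(s'|s,a) = 0 for all s'
    A_eq = np.zeros((S + 1, n))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] -= P[s, a, sp]
        for a in range(A):
            A_eq[sp, sp * A + a] += 1.0
    A_eq[S, :] = 1.0                              # occupation measure sums to one
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0
    # Interference constraint: expected interference cost must not exceed c_max
    A_ub = interference.reshape(1, n)
    b_ub = [c_max]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    rho = res.x.reshape(S, A)
    policy = rho / np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)  # randomized policy
    return policy, -res.fun
```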

398 citations


Journal ArticleDOI
TL;DR: A method to dynamically schedule patients with different priorities to a diagnostic facility in a public health-care setting is presented, together with analytical results giving the form of the optimal linear value function approximation and the resulting policy.
Abstract: We present a method to dynamically schedule patients with different priorities to a diagnostic facility in a public health-care setting. Rather than maximizing revenue, the challenge facing the resource manager is to dynamically allocate available capacity to incoming demand to achieve wait-time targets in a cost-effective manner. We model the scheduling process as a Markov decision process. Because the state space is too large for a direct solution, we solve the equivalent linear program through approximate dynamic programming. For a broad range of cost parameter values, we present analytical results that give the form of the optimal linear value function approximation and the resulting policy. We investigate the practical implications and the quality of the policy through simulation.

361 citations


Proceedings Article
08 Dec 2008
TL;DR: This work presents a reinforcement learning algorithm with total regret O(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps.
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DS √AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm.
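
The diameter D defined above can be computed for a known MDP by solving, for each target state, a stochastic shortest-path problem (minimum expected number of steps to reach it) and taking the maximum over all start-target pairs. The sketch below shows that computation under an assumed transition array P[s,a,s']; it illustrates the definition only and is not part of the paper's learning algorithm.

```python
import numpy as np

# Sketch: compute the diameter D of a (communicating) MDP as defined above.
# Assumed input: P[s,a,s'] transition probabilities.

def diameter(P, n_iters=10_000, tol=1e-9):
    S, A, _ = P.shape
    D = 0.0
    for target in range(S):
        h = np.zeros(S)                          # expected hitting times to `target`
        for _ in range(n_iters):
            # One step always costs 1, then follow the action with the best expected hitting time.
            h_new = 1.0 + np.einsum('san,n->sa', P, h).min(axis=1)
            h_new[target] = 0.0                  # already at the target
            if np.max(np.abs(h_new - h)) < tol:
                break
            h = h_new
        D = max(D, h_new.max())
    return D
```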

344 citations


Book
16 Sep 2008
TL;DR: This book provides practical, current applications of Markov processes, covers HMMs, point processes, and Monte Carlo methods, and includes enough theory to help students gain a thorough understanding of the subject.
Abstract: Markov processes are used to model systems with limited memory. They are used in many areas including communications systems, transportation networks, image segmentation and analysis, biological systems and DNA sequence analysis, random atomic motion and diffusion in physics, social mobility, population studies, epidemiology, animal and insect migration, queueing systems, resource management, dams, financial engineering, actuarial science, and decision systems. This book, which is written for upper level undergraduate and graduate students, and researchers, presents a unified treatment of Markov processes. In addition to traditional topics such as Markovian queueing systems, the book discusses such topics as continuous-time random walk, correlated random walk, Brownian motion, diffusion processes, hidden Markov models, Markov random fields, Markov point processes and Markov chain Monte Carlo. Continuous-time random walk is currently used in econophysics to model the financial market, which has traditionally been modelled as a Brownian motion. Correlated random walk is popularly used in ecological studies to model animal and insect movement. Hidden Markov models are used in speech analysis and DNA sequence analysis while Markov random fields and Markov point processes are used in image analysis. Thus, the book is designed to have a very broad appeal.

Journal ArticleDOI
TL;DR: A cognitive radio that can coexist with multiple parallel WLAN channels while abiding by an interference constraint is designed and it is shown that optimal CMA admits structured solutions, simplifying practical implementations.
Abstract: In this paper we design a cognitive radio that can coexist with multiple parallel WLAN channels while abiding by an interference constraint. The interaction between both systems is characterized by measurement and coexistence is enhanced by predicting the WLAN's behavior based on a continuous-time Markov chain model. Cognitive medium access (CMA) is derived from this model by recasting the problem as one of constrained Markov decision processes. Solutions are obtained by linear programming. Furthermore, we show that optimal CMA admits structured solutions, simplifying practical implementations. Preliminary results for the partially observable case are presented. The performance of the proposed schemes is evaluated for a typical WLAN coexistence setup and shows a significant performance improvement.

Proceedings ArticleDOI
05 Jul 2008
TL;DR: This work shows how to frame apprenticeship learning as a linear programming problem, and shows that using an off-the-shelf LP solver to solve this problem results in a substantial improvement in running time over existing methods, up to two orders of magnitude faster in the authors' experiments.
Abstract: In apprenticeship learning, the goal is to learn a policy in a Markov decision process that is at least as good as a policy demonstrated by an expert. The difficulty arises in that the MDP's true reward function is assumed to be unknown. We show how to frame apprenticeship learning as a linear programming problem, and show that using an off-the-shelf LP solver to solve this problem results in a substantial improvement in running time over existing methods, up to two orders of magnitude faster in our experiments. Additionally, our approach produces stationary policies, while all existing methods for apprenticeship learning output policies that are "mixed", i.e. randomized combinations of stationary policies. The technique used is general enough to convert any mixed policy to a stationary policy.
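
A heavily simplified sketch of casting apprenticeship learning as a linear program over occupancy measures is given below; the paper's actual formulation may differ in detail. The assumed inputs are P[s,a,s'], features F[s,a,k], expert feature expectations mu_E, an initial state distribution mu0, and the discount factor.

```python
import numpy as np
from scipy.optimize import linprog

# Simplified occupancy-measure LP sketch for apprenticeship learning (illustrative only).
def apprenticeship_lp(P, F, mu_E, mu0, gamma=0.95):
    S, A, K = F.shape
    n = S * A
    # Variables: [x(s,a) for all s,a] followed by a scalar margin B; maximize B.
    c = np.zeros(n + 1)
    c[-1] = -1.0                                  # linprog minimizes, so maximize B via -B
    # Feature margin constraints: sum_{s,a} x(s,a) F[s,a,k] - B >= mu_E[k]
    A_ub = np.zeros((K, n + 1))
    b_ub = np.zeros(K)
    for k in range(K):
        A_ub[k, :n] = -F[:, :, k].reshape(n)
        A_ub[k, -1] = 1.0
        b_ub[k] = -mu_E[k]
    # Bellman flow constraints: sum_a x(s,a) - gamma * sum_{s',a'} P[s',a',s] x(s',a') = mu0[s]
    A_eq = np.zeros((S, n + 1))
    b_eq = mu0.astype(float)
    for s in range(S):
        for a in range(A):
            A_eq[s, s * A + a] += 1.0
        for sp in range(S):
            for a in range(A):
                A_eq[s, sp * A + a] -= gamma * P[sp, a, s]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    x = res.x[:n].reshape(S, A)
    return x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)   # stationary policy
```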

Book
25 Aug 2008
TL;DR: The self-contained approach of this book will appeal not only to researchers in MDPs, stochastic modeling and control, and simulation, but will also be a valuable source of tuition and reference for students of control and operations research.
Abstract: Markov decision process (MDP) models are widely used for modeling sequential decision-making problems that arise in engineering, economics, computer science, and the social sciences. Many real-world problems modeled by MDPs have huge state and/or action spaces, so that the curse of dimensionality makes practical solution of the resulting models intractable. In other cases, the system of interest is too complex to allow explicit specification of some of the MDP model parameters, but simulation samples are readily available (e.g., for random transitions and costs). For these settings, various sampling and population-based algorithms have been developed to overcome the difficulties of computing an optimal solution in terms of a policy and/or value function. Specific approaches include adaptive sampling, evolutionary policy iteration, evolutionary random policy search, and model reference adaptive search. This substantially enlarged new edition reflects the latest developments in novel algorithms and their underpinning theories, and presents an updated account of the topics that have emerged since the publication of the first edition. It includes innovative material on MDPs, both in constrained settings and with uncertain transition properties; a game-theoretic method for solving MDPs; theories for developing roll-out based algorithms; and details of approximate stochastic annealing, a population-based on-line simulation-based algorithm. The self-contained approach of this book will appeal not only to researchers in MDPs, stochastic modeling and control, and simulation, but will also be a valuable source of tuition and reference for students of control and operations research.

Proceedings ArticleDOI
05 Jul 2008
TL;DR: The algorithm can be viewed as an extension of standard reinforcement learning for MDPs where, instead of repeatedly backing up maximal expected rewards, it backs up the set of expected rewards that are maximal for some set of linear preferences.
Abstract: We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for all linear preference assignments over the multiple reward criteria at once. The algorithm can be viewed as an extension to standard reinforcement learning for MDPs where instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector, w). We present the algorithm along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector, w.
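
As the abstract notes, for one fixed weight vector w the method reduces to standard value iteration on the scalarized reward. The sketch below shows only that special case (the full algorithm instead backs up sets of reward vectors); P and the vector-valued reward R are assumed placeholder arrays.

```python
import numpy as np

# Scalarized special case of multi-criteria value iteration (illustrative sketch).
# Assumed inputs: P[a,s,s'] transitions, R[a,s,k] vector rewards, w length-K weights.

def scalarized_value_iteration(P, R, w, gamma=0.95, tol=1e-8):
    A, S, _ = P.shape
    r = R.dot(w)                                   # r[a,s] = sum_k w[k] * R[a,s,k]
    V = np.zeros(S)
    while True:
        Q = r + gamma * np.einsum('asn,n->as', P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)         # optimal policy for this weight vector
        V = V_new
```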

Proceedings Article
08 Dec 2008
TL;DR: The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm, and it is proved that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity.
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
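
A one-step sketch of the GTD update described above is given below: a secondary weight vector w tracks the expected TD(0) update E[delta * phi], and the parameter vector theta follows a stochastic gradient of its squared L2 norm. Feature vectors and step sizes are illustrative placeholders.

```python
import numpy as np

# One GTD(0) step (sketch of the update rule described above, not the authors' code).
# phi, phi_next: feature vectors of the current and next state; alpha, beta: step sizes.

def gtd_step(theta, w, phi, reward, phi_next, gamma=0.99, alpha=0.01, beta=0.05):
    delta = reward + gamma * phi_next.dot(theta) - phi.dot(theta)     # TD(0) error
    theta = theta + alpha * (phi - gamma * phi_next) * phi.dot(w)     # gradient-like update
    w = w + beta * (delta * phi - w)                                  # track E[delta * phi]
    return theta, w
```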

Book ChapterDOI
23 Jun 2008
TL;DR: The feasibility of this approach to automatically generating hints for an intelligent tutor that learns is demonstrated by extracting MDPs from four semesters of student solutions in a logic proof tutor and by calculating the probability that hints can be generated at any point in a given problem.
Abstract: We have proposed a novel application of Markov decision processes (MDPs), a reinforcement learning technique, to automatically generate hints for an intelligent tutor that learns. We demonstrate the feasibility of this approach by extracting MDPs from four semesters of student solutions in a logic proof tutor, and calculating the probability that we will be able to generate hints at any point in a given problem. Our results indicate that extracted MDPs and our proposed hint-generating functions will be able to provide hints over 80% of the time. Our results also indicate that we can provide valuable tradeoffs between hint specificity and the amount of data used to create an MDP.
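
The hint-generation idea can be sketched roughly as follows: estimate an MDP from logged student solution paths, value-iterate with a reward on completed proofs, and use the best-valued observed successor of the current state as the source of a hint. This is an illustrative reconstruction, not the authors' system; traces and goal_states are assumed inputs.

```python
from collections import defaultdict

# Sketch: derive a hint policy from logged solution paths (illustrative reconstruction).
# traces: list of state-identifier sequences; goal_states: set of completed-proof states.

def build_hint_policy(traces, goal_states, gamma=0.9, step_penalty=-1.0,
                      goal_reward=100.0, n_iters=200):
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for s, s_next in zip(trace, trace[1:]):
            counts[s][s_next] += 1                 # observed student "actions"
    V = defaultdict(float, {g: goal_reward for g in goal_states})
    for _ in range(n_iters):
        for s, succs in counts.items():
            if s in goal_states:
                continue
            V[s] = max(step_penalty + gamma * V[s_next] for s_next in succs)
    # Hint for state s: the previously observed successor with the highest value.
    return {s: max(succs, key=lambda s_next: V[s_next])
            for s, succs in counts.items() if s not in goal_states}
```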

Proceedings ArticleDOI
19 Mar 2008
TL;DR: The problem of developing energy efficient transmission strategies for Body Sensor Networks with energy harvesting is addressed as a Markov Decision Process (MDP) and the performance of the transmission policy derived is compared with that of energy balancing as well as aggressive policies.
Abstract: This paper addresses the problem of developing energy efficient transmission strategies for Body Sensor Networks (BSNs) with energy harvesting capabilities. It is assumed that two transmission modes that allow a tradeoff between the energy consumption and packet error probability are available to the sensors. Decision policies are developed to determine the transmission mode to use at a given instant of time in order to maximize the quality of coverage. The problem is formulated in a Markov Decision Process (MDP) framework and an upper bound on the performance of arbitrary policies is determined. Our results show that the quality of coverage associated with the MDP formulation outperforms the other policies.

Proceedings Article
08 Dec 2008
TL;DR: After constructing an inference scheme which combines slice sampling and dynamic programming, it is demonstrated how the infinite factorial hidden Markov model can be used for blind source separation.
Abstract: We introduce a new probability distribution over a potentially infinite number of binary Markov chains which we call the Markov Indian buffet process. This process extends the IBP to allow temporal dependencies in the hidden variables. We use this stochastic process to build a nonparametric extension of the factorial hidden Markov model. After constructing an inference scheme which combines slice sampling and dynamic programming we demonstrate how the infinite factorial hidden Markov model can be used for blind source separation.

Journal ArticleDOI
TL;DR: In this paper, the authors study and provide efficient algorithms for multi-objective model checking problems for Markov Decision Processes (MDPs) and show that one can compute an approximate Pareto curve with respect to a set of ω-regular properties in time polynomial in the size of the MDP.
Abstract: We study and provide efficient algorithms for multi-objective model checking problems for Markov Decision Processes (MDPs). Given an MDP, M, and given multiple linear-time (ω-regular or LTL) properties φ_i, and probabilities r_i ∈ [0,1], i=1,...,k, we ask whether there exists a strategy σ for the controller such that, for all i, the probability that a trajectory of M controlled by σ satisfies φ_i is at least r_i. We provide an algorithm that decides whether there exists such a strategy and if so produces it, and which runs in time polynomial in the size of the MDP. Such a strategy may require the use of both randomization and memory. We also consider more general multi-objective ω-regular queries, which we motivate with an application to assume-guarantee compositional reasoning for probabilistic systems. Note that there can be trade-offs between different properties: satisfying property φ_1 with high probability may necessitate satisfying φ_2 with low probability. Viewing this as a multi-objective optimization problem, we want information about the "trade-off curve" or Pareto curve for maximizing the probabilities of different properties. We show that one can compute an approximate Pareto curve with respect to a set of ω-regular properties in time polynomial in the size of the MDP. Our quantitative upper bounds use LP methods. We also study qualitative multi-objective model checking problems, and we show that these can be analysed by purely graph-theoretic methods, even though the strategies may still require both randomization and memory.

Book ChapterDOI
16 Sep 2008
TL;DR: This work shows how a spoken dialogue system can be represented as a Partially Observable Markov Decision Process (POMDP) with composite observations consisting of discrete elements representing dialogue acts and continuous components representing confidence scores.
Abstract: This work shows how a spoken dialogue system can be represented as a Partially Observable Markov Decision Process (POMDP) with composite observations consisting of discrete elements representing dialogue acts and continuous components representing confidence scores. Using a testbed simulated dialogue management problem and recently developed optimisation techniques, we demonstrate that this continuous POMDP can outperform traditional approaches in which confidence score is tracked discretely. Further, we present a method for automatically improving handcrafted dialogue managers by incorporating POMDP belief state monitoring, including confidence score information. Experiments on the test-bed system show significant improvements for several example handcrafted dialogue managers across a range of operating conditions.
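
Belief monitoring with the composite observations described above can be sketched as follows: each observation is a (dialogue act, confidence score) pair, with the continuous score modelled here, purely for illustration, by two Beta densities (one for correct recognitions, one for misrecognitions). T, act_match, and the Beta parameters are assumed placeholders, not the paper's model.

```python
import numpy as np
from scipy.stats import beta

# Illustrative belief update with a composite (dialogue act, confidence) observation.
# Assumed placeholders: T[a,s,s'] transitions; act_match[s',act] = 1 if a user in state s'
# would truly produce that act; Beta parameters for correct vs. erroneous recognitions.

def belief_update(b, a, act, conf, T, act_match, p_correct=0.8,
                  beta_ok=(5.0, 2.0), beta_err=(2.0, 5.0)):
    pred = b.dot(T[a])                                    # predicted distribution over s'
    match = act_match[:, act].astype(bool)
    like = np.where(match,
                    p_correct * beta.pdf(conf, *beta_ok),
                    (1 - p_correct) * beta.pdf(conf, *beta_err))
    new_b = pred * like
    z = new_b.sum()
    return new_b / z if z > 0 else pred                   # renormalise (fallback if z == 0)
```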

Journal ArticleDOI
TL;DR: An overview of quantitative methodologies for the study of stage-sequential development based on extensions of Markov chain modeling is presented, with a special case of the mixture latent Markov model, the so-called mover‐stayer model, used in this study.
Abstract: This article presents an overview of quantitative methodologies for the study of stage-sequential development based on extensions of Markov chain modeling. Four methods are presented that exemplify the flexibility of this approach: the manifest Markov model, the latent Markov model, latent transition analysis, and the mixture latent Markov model. A special case of the mixture latent Markov model, the so-called mover‐stayer model, is used in this study. Unconditional and conditional models are estimated for the manifest Markov model and the latent Markov model, where the conditional models include a measure of poverty status. Issues of model specification, estimation, and testing using the Mplus software environment are briefly discussed, and the Mplus input syntax is provided. The author applies these 4 methods to a single example of stage-sequential development in reading competency in the early school years, using data from the Early Childhood Longitudinal Study—Kindergarten Cohort.

Journal ArticleDOI
TL;DR: Two CT-scanners in a radiology department of a hospital providing medical service to three patient groups with different arrival patterns and cost-structures are considered; the problem is modeled as a Markov decision process and compared with three decision rules that can be applied in hospitals.
Abstract: We consider two CT-scanners in a radiology department of a hospital providing medical service to three patient groups with different arrival patterns and cost-structures: scheduled outpatients, non-scheduled inpatients, and emergency patients. Scheduled outpatients arrive based on an appointment schedule with some randomness due to no-shows. Inpatients and emergency patients arrive at random. The problem is to allocate the available resources dynamically to the patients of the groups such that the expected total reward consisting of revenues, waiting costs, and penalty costs is maximized. We model the problem as a Markov decision process and compare it with three decision rules which can be applied in hospitals.

Book ChapterDOI
29 Mar 2008
TL;DR: An alternative semantics for PBA is also discussed, in which it is required that almost all runs for an accepted word are accepting; this semantics turns out to be less powerful, but has a decidable emptiness problem.
Abstract: Probabilistic Büchi automata (PBA) are finite-state acceptors for infinite words where all choices are resolved by fixed distributions and where the accepted language is defined by the requirement that the measure of the accepting runs is positive. The main contribution of this paper is a complementation operator for PBA and a discussion on several algorithmic problems for PBA. All interesting problems, such as checking emptiness or equivalence for PBA or checking whether a finite transition system satisfies a PBA-specification, turn out to be undecidable. An important consequence of these results are several undecidability results for stochastic games with incomplete information, modelled by partially-observable Markov decision processes and ω-regular winning objectives. Furthermore, we discuss an alternative semantics for PBA where it is required that almost all runs for an accepted word are accepting, which turns out to be less powerful, but has a decidable emptiness problem.

Proceedings ArticleDOI
05 Jul 2008
TL;DR: This paper presents an approximation approach that allows us to treat the POMDP model parameters as additional hidden state in a "model-uncertainty" POMDP.
Abstract: Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent's knowledge and actions that increase an agent's reward. Unfortunately, most POMDPs are defined with a large number of parameters which are difficult to specify only from domain knowledge. In this paper, we present an approximation approach that allows us to treat the POMDP model parameters as additional hidden state in a "model-uncertainty" POMDP. Coupled with model-directed queries, our planner actively learns good policies. We demonstrate our approach on several POMDP problems.

Proceedings Article
13 Jul 2008
TL;DR: Applying a mix of techniques from algorithm analysis and the theory of Markov Decision Processes, this work provides efficient exact algorithms for directed acyclic graphs and (undirected) graphs of disjoint paths from source to destination with random two-valued edge costs.
Abstract: The Canadian Traveller problem is a stochastic shortest paths problem in which one learns the cost of an edge only when arriving at one of its endpoints. The goal is to find an optimal policy that minimizes the expected cost of travel. The problem is known to be #P-hard. Since there has been no significant progress on approximation algorithms for several decades, we have chosen to seek out special cases for which exact solutions exist, in the hope of demonstrating techniques that could lead to further progress. Applying a mix of techniques from algorithm analysis and the theory of Markov Decision Processes, we provide efficient exact algorithms for directed acyclic graphs and (undirected) graphs of disjoint paths from source to destination with random two-valued edge costs. We also give worst-case performance analysis and experimental data for two natural heuristics.

Proceedings ArticleDOI
12 May 2008
TL;DR: Together, the results allow us to exploit the problem structure as well as heuristics in a single framework that is based on collaborative graphical Bayesian games (CGBGs), and a preliminary experiment shows a speedup of two orders of magnitude.
Abstract: Decentralized partially observable Markov decision processes (Dec-POMDPs) constitute an expressive framework for multiagent planning under uncertainty, but solving them is provably intractable. We demonstrate how their scalability can be improved by exploiting locality of interaction between agents in a factored representation. Factored Dec-POMDP representations have been proposed before, but only for Dec-POMDPs whose transition and observation models are fully independent. Such strong assumptions simplify the planning problem, but result in models with limited applicability. By contrast, we consider general factored Dec-POMDPs for which we analyze the model dependencies over space (locality of interaction) and time (horizon of the problem). We also present a formulation of decomposable value functions. Together, our results allow us to exploit the problem structure as well as heuristics in a single framework that is based on collaborative graphical Bayesian games (CGBGs). A preliminary experiment shows a speedup of two orders of magnitude.

Proceedings Article
07 Dec 2008
TL;DR: Simulation results for a small example illustrate the potential for a coordinated control strategy to achieve better energy management than traditional schemes that control the computational and cooling subsystems separately.
Abstract: This paper presents a unified approach to data center energy management based on a modeling framework that characterizes the influence of key decision variables on computational performance, thermal generation, and power consumption. Temperature dynamics are modeled by a network of interconnected components reflecting the spatial distribution of servers, computer room air conditioning (CRAC) units, and non-computational components in the data center. A second network models the distribution of the computational load among the servers. Server power states influence both networks. Formulating the control problem as a Markov decision process (MDP), the coordinated cooling and load management strategy minimizes the integrated weighted sum of power consumption and computational performance. Simulation results for a small example illustrate the potential for a coordinated control strategy to achieve better energy management than traditional schemes that control the computational and cooling subsystems separately. These results suggest several directions for further research.