
Showing papers on "Markov decision process" published in 2005


Journal ArticleDOI
TL;DR: This work considers a robust control problem for a finite-state, finite-action Markov decision process, where uncertainty on the transition matrices is described in terms of possibly nonconvex sets, and shows that perfect duality holds for this problem, and that it can be solved with a variant of the classical dynamic programming algorithm, the "robust dynamic programming" algorithm.
Abstract: Optimal solutions to Markov decision problems may be very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of these probabilities is far from accurate. Hence, estimation errors are limiting factors in applying Markov decision processes to real-world problems. We consider a robust control problem for a finite-state, finite-action Markov decision process, where uncertainty on the transition matrices is described in terms of possibly nonconvex sets. We show that perfect duality holds for this problem, and that as a consequence, it can be solved with a variant of the classical dynamic programming algorithm, the "robust dynamic programming" algorithm. We show that a particular choice of the uncertainty sets, involving likelihood regions or entropy bounds, leads to both a statistically accurate representation of uncertainty and a complexity of the robust recursion that is almost the same as that of the classical recursion. Hence, robustness can be added at practically no extra computing cost. We derive similar results for other uncertainty sets, including one with a finite number of possible values for the transition matrices. We describe in a practical path planning example the benefits of using a robust strategy instead of the classical optimal strategy; even if the uncertainty level is only crudely guessed, the robust strategy yields a much better worst-case expected travel time.

740 citations
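
The sketch below is a minimal, hypothetical illustration of the robust ("min-max") value-iteration recursion described above, for the simplest uncertainty set mentioned in the abstract: a finite set of candidate transition matrices. The toy MDP, cost structure, and all names are invented for illustration; the likelihood-region and entropy-bound uncertainty sets would require an inner optimization that is not shown here.

```python
import numpy as np

def robust_value_iteration(P_candidates, cost, gamma=0.95, iters=500, tol=1e-8):
    """Worst-case (min-max) value iteration for a finite uncertainty set.

    P_candidates : list of candidate transition arrays, each of shape (A, S, S);
                   the adversary may pick a different candidate at every
                   state-action pair (a rectangularity-style assumption).
    cost         : array of shape (S, A) with immediate costs.
    """
    S, A = cost.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.empty((S, A))
        for a in range(A):
            # Worst-case expected future cost over the candidate models.
            worst = np.max([P[a] @ V for P in P_candidates], axis=0)
            Q[:, a] = cost[:, a] + gamma * worst
        V_new = Q.min(axis=1)            # the controller minimizes cost
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    def random_kernel():
        P = rng.random((A, S, S))
        return P / P.sum(axis=2, keepdims=True)
    candidates = [random_kernel() for _ in range(3)]   # three plausible models
    cost = rng.random((S, A))
    V, policy = robust_value_iteration(candidates, cost)
    print("robust values:", np.round(V, 3), "robust policy:", policy)
```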


Journal ArticleDOI
TL;DR: In this paper, the authors argue that it is more appropriate to view the problem of generating recommendations as a sequential optimization problem and, consequently, that Markov decision processes (MDPs) provide a more appropriate model for recommender systems.
Abstract: Typical recommender systems adopt a static view of the recommendation process and treat it as a prediction problem. We argue that it is more appropriate to view the problem of generating recommendations as a sequential optimization problem and, consequently, that Markov decision processes (MDPs) provide a more appropriate model for recommender systems. MDPs introduce two benefits: they take into account the long-term effects of each recommendation and the expected value of each recommendation. To succeed in practice, an MDP-based recommender system must employ a strong initial model, must be solvable quickly, and should not consume too much memory. In this paper, we describe our particular MDP model, its initialization using a predictive model, the solution and update algorithm, and its actual performance on a commercial site. We also describe the particular predictive model we used which outperforms previous models. Our system is one of a small number of commercially deployed recommender systems. As far as we know, it is the first to report experimental analysis conducted on a real commercial site. These results validate the commercial value of recommender systems, and in particular, of our MDP-based approach.

690 citations
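
As a loose illustration of the sequential-optimization view (not the authors' deployed system or predictive model), the sketch below solves a toy MDP recommender by value iteration: states are the user's last purchased item, actions are the items recommended next, and the transition probabilities stand in for the predictive model the paper initializes from. All data and names are hypothetical.

```python
import numpy as np

def recommender_value_iteration(trans, profit, gamma=0.9, iters=200):
    """Solve a toy MDP-based recommender.

    trans[a, s, s'] : probability that a user whose last purchase is item s
                      buys item s' next when item a is recommended
                      (would come from a predictive model in practice).
    profit[s']      : profit obtained when item s' is purchased.
    """
    A, S, _ = trans.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q[s, a] = expected immediate profit + discounted future value.
        Q = np.stack([trans[a] @ (profit + gamma * V) for a in range(A)], axis=1)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V        # recommendation policy and state values

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_items = 5                       # actions == items == states (last purchase)
    trans = rng.random((n_items, n_items, n_items))
    trans /= trans.sum(axis=2, keepdims=True)
    profit = rng.uniform(1.0, 10.0, size=n_items)
    policy, V = recommender_value_iteration(trans, profit)
    print("recommend item", policy, "per last-purchase state; values:", np.round(V, 2))
```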


Journal ArticleDOI
TL;DR: This work presents a randomized point-based value iteration algorithm called PERSEUS, which backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of each belief point in the set.
Abstract: Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Point-based approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent's belief space. We present a randomized point-based value iteration algorithm called PERSEUS. The algorithm performs approximate value backup stages, ensuring that in each backup stage the value of each point in the belief set is improved; the key observation is that a single backup may improve the value of many belief points. Contrary to other point-based methods, PERSEUS backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of each belief point in the set. We show how the same idea can be extended to dealing with continuous action spaces. Experimental results show the potential of PERSEUS in large scale POMDP problems.

674 citations
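
The following is a compact sketch of a single PERSEUS backup stage for a flat, discrete POMDP, assuming standard tabular transition, observation, and reward arrays (the names and shapes are assumptions, not the authors' code): beliefs are backed up in random order, and a belief is dropped from the "still to improve" set as soon as some newly generated alpha-vector improves its value.

```python
import numpy as np

def backup(b, alphas, T, Z, R, gamma):
    """Point-based POMDP backup at belief b.

    T[a, s, s'] : transition probabilities
    Z[a, s', o] : observation probabilities
    R[s, a]     : immediate rewards
    alphas      : current value function, array of shape (n_vectors, S)
    """
    A, S, _ = T.shape
    n_obs = Z.shape[2]
    best_val, best_vec = -np.inf, None
    for a in range(A):
        g_a = R[:, a].astype(float)
        for o in range(n_obs):
            # Project every current alpha-vector back through action a, observation o.
            g_ao = (T[a] * Z[a, :, o][None, :]) @ alphas.T   # shape (S, n_vectors)
            g_a = g_a + gamma * g_ao[:, np.argmax(b @ g_ao)]
        if b @ g_a > best_val:
            best_val, best_vec = b @ g_a, g_a
    return best_vec

def perseus_stage(B, alphas, T, Z, R, gamma, rng):
    """One randomized PERSEUS backup stage over a fixed belief set B (|B| x S)."""
    old_vals = np.max(B @ alphas.T, axis=1)
    new_alphas = []
    todo = np.arange(len(B))
    while len(todo):
        i = rng.choice(todo)                      # back up a randomly chosen belief
        alpha = backup(B[i], alphas, T, Z, R, gamma)
        if B[i] @ alpha < old_vals[i]:            # no improvement: keep the old best vector
            alpha = alphas[np.argmax(B[i] @ alphas.T)]
        new_alphas.append(alpha)
        new_vals = np.max(B @ np.array(new_alphas).T, axis=1)
        todo = todo[new_vals[todo] < old_vals[todo]]   # improved beliefs need no further backups
    return np.array(new_alphas)
```

Repeating perseus_stage over a fixed belief set, starting from a suitably pessimistic initial vector, approximates the value function; the point of the randomized stage is that it typically produces far fewer vectors than there are beliefs in the set.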


Journal ArticleDOI
TL;DR: It is proved that when this set of measures has a certain "rectangularity" property, all of the main results for finite and infinite horizon DP extend to natural robust counterparts.
Abstract: In this paper we propose a robust formulation for discrete time dynamic programming (DP). The objective of the robust formulation is to systematically mitigate the sensitivity of the DP optimal policy to ambiguity in the underlying transition probabilities. The ambiguity is modeled by associating a set of conditional measures with each state-action pair. Consequently, in the robust formulation each policy has a set of measures associated with it. We prove that when this set of measures has a certain "rectangularity" property, all of the main results for finite and infinite horizon DP extend to natural robust counterparts. We discuss techniques from Nilim and El Ghaoui [17] for constructing suitable sets of conditional measures that allow one to efficiently solve for the optimal robust policy. We also show that robust DP is equivalent to stochastic zero-sum games with perfect information.

585 citations


Journal ArticleDOI
TL;DR: In this paper, the authors extend the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space.
Abstract: This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models into the state space. Agents maintain beliefs over physical states of the environment and over models of other agents, and they use Bayesian updates to maintain their beliefs over time. The solutions map belief states to actions. Models of other agents may include their belief states and are related to agent types considered in games of incomplete information. We express the agents' autonomy by postulating that their models are not directly manipulable or observable by other agents. We show that important properties of POMDPs, such as convergence of value iteration, the rate of convergence, and piece-wise linearity and convexity of the value functions carry over to our framework. Our approach complements a more traditional approach to interactive settings which uses Nash equilibria as a solution paradigm. We seek to avoid some of the drawbacks of equilibria which may be non-unique and do not capture off-equilibrium behaviors. We do so at the cost of having to represent, process and continuously revise models of other agents. Since the agent's beliefs may be arbitrarily nested, the optimal solutions to decision making problems are only asymptotically computable. However, approximate belief updates and approximately optimal plans are computable. We illustrate our framework using a simple application domain, and we show examples of belief updates and value functions.

315 citations


Journal ArticleDOI
TL;DR: A model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies by weighting the original value function against the risk; it was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column.
Abstract: In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.

283 citations
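
Below is a rough, hypothetical sketch of the weighted value/risk idea on a toy chain world (state 0 is the error state). It keeps separate Q-estimates for the return and for the probability of entering the error state, acts greedily on their weighted combination, and adapts the weight so the observed risk stays near a user threshold. This is only in the spirit of the paper's heuristic; the environment, the adaptation rule, and all constants are invented.

```python
import numpy as np

# A tiny hypothetical chain world: state 0 is the "error" state, state N-1 the goal.
N, GOAL, ERROR = 8, 7, 0

def step(state, action, rng):
    """action 0 moves left, action 1 moves right; the move slips with probability 0.1."""
    move = -1 if action == 0 else 1
    if rng.random() < 0.1:
        move = -move
    nxt = int(np.clip(state + move, 0, N - 1))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt in (GOAL, ERROR)

def risk_sensitive_q_learning(omega=0.1, episodes=5000, alpha=0.1, gamma=0.95,
                              eps=0.1, xi_step=0.01, seed=0):
    """Weighted value/risk Q-learning sketch: act greedily on Q_val - xi * Q_risk
    and adapt the weight xi so the observed risk stays near the threshold omega."""
    rng = np.random.default_rng(seed)
    Q_val = np.zeros((N, 2))
    Q_risk = np.zeros((N, 2))
    xi, risk_estimate = 1.0, 0.0
    for _ in range(episodes):
        state, done, hit_error = N // 2, False, False
        while not done:
            if rng.random() < eps:
                action = int(rng.integers(2))
            else:
                action = int(np.argmax(Q_val[state] - xi * Q_risk[state]))
            nxt, reward, done = step(state, action, rng)
            risk_signal = 1.0 if nxt == ERROR else 0.0
            hit_error = hit_error or nxt == ERROR
            greedy_nxt = int(np.argmax(Q_val[nxt] - xi * Q_risk[nxt]))
            # Q-learning update for the original value criterion ...
            target_val = reward + (0.0 if done else gamma * Q_val[nxt].max())
            Q_val[state, action] += alpha * (target_val - Q_val[state, action])
            # ... and an undiscounted update for the risk (probability-of-error) criterion.
            target_risk = risk_signal + (0.0 if done else Q_risk[nxt, greedy_nxt])
            Q_risk[state, action] += alpha * (target_risk - Q_risk[state, action])
            state = nxt
        # Adapt the weight: raise xi when the observed risk exceeds the threshold.
        risk_estimate += 0.05 * (float(hit_error) - risk_estimate)
        xi = max(0.0, xi + xi_step * (risk_estimate - omega))
    return Q_val, Q_risk, xi

if __name__ == "__main__":
    Q_val, Q_risk, xi = risk_sensitive_q_learning()
    s0 = N // 2
    a0 = int(np.argmax(Q_val[s0] - xi * Q_risk[s0]))
    print(f"greedy action at start: {a0}, estimated risk {Q_risk[s0, a0]:.3f}, weight xi {xi:.3f}")
```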


Journal ArticleDOI
TL;DR: This thesis describes a scalable approach to POMDP planning which uses low-dimensional representations of the belief space and demonstrates how to make use of a variant of Principal Components Analysis (PCA) called Exponential family PCA in order to compress certain kinds of large real-world POMDPs, and find policies for these problems.
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are generally considered to be intractable for large models. The intractability of these algorithms is to a large extent a consequence of computing an exact, optimal policy over the entire belief space. However, in real-world POMDP problems, computing the optimal policy for the full belief space is often unnecessary for good control even for problems with complicated policy classes. The beliefs experienced by the controller often lie near a structured, low-dimensional subspace embedded in the high-dimensional belief space. Finding a good approximation to the optimal value function for only this subspace can be much easier than computing the full value function. We introduce a new method for solving large-scale POMDPs by reducing the dimensionality of the belief space. We use Exponential family Principal Components Analysis (Collins, Dasgupta, & Schapire, 2002) to represent sparse, high-dimensional belief spaces using small sets of learned features of the belief state. We then plan only in terms of the low-dimensional belief features. By planning in this low-dimensional space, we can find policies for POMDP models that are orders of magnitude larger than models that can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and on mobile robot navigation tasks.

244 citations
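
The snippet below illustrates only the belief-compression step on synthetic sparse beliefs. The thesis uses Exponential family PCA (Collins, Dasgupta, & Schapire, 2002), which respects the simplex structure of beliefs; since that is not available in standard libraries, ordinary PCA from scikit-learn is used here purely as a stand-in, with a clip-and-renormalize step to recover distributions. Planning in the low-dimensional feature space is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_states, n_beliefs, n_features = 500, 200, 5

# Sample beliefs (probability distributions over a large state space); in the
# thesis these would be collected from simulated trajectories, here they are
# synthetic and sparse.
beliefs = np.zeros((n_beliefs, n_states))
for i in range(n_beliefs):
    support = rng.choice(n_states, size=4, replace=False)
    beliefs[i, support] = rng.dirichlet(np.ones(4))

# NOTE: the thesis uses Exponential family PCA, which respects the
# non-negativity and sum-to-one structure of beliefs.  Ordinary PCA is used
# here only as a readily available stand-in.
pca = PCA(n_components=n_features)
features = pca.fit_transform(beliefs)            # low-dimensional belief features
reconstructed = pca.inverse_transform(features)

# Clip and renormalize so the reconstructions are again distributions.
reconstructed = np.clip(reconstructed, 0.0, None)
reconstructed /= reconstructed.sum(axis=1, keepdims=True)

err = np.abs(beliefs - reconstructed).sum(axis=1).mean()
print(f"compressed {n_states}-dim beliefs to {n_features} features; "
      f"mean L1 reconstruction error {err:.3f}")
```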


01 Jan 2005
TL;DR: This thesis first presents a Bounded Policy Iteration algorithm to robustly find a good policy represented by a small finite state controller, and describes three approaches that combine techniques capable of dealing with each source of intractability: VDC with BPI, VDC with Perseus, and state abstraction with Perseus.
Abstract: Partially observable Markov decision processes (POMDPs) provide a natural and principled framework to model a wide range of sequential decision making problems under uncertainty. To date, the use of POMDPs in real-world problems has been limited by the poor scalability of existing solution algorithms, which can only solve problems with up to ten thousand states. In fact, the complexity of finding an optimal policy for a finite-horizon discrete POMDP is PSPACE-complete. In practice, two important sources of intractability plague most solution algorithms: Large policy spaces and large state spaces. On the other hand, for many real-world POMDPs it is possible to define effective policies with simple rules of thumb. This suggests that we may be able to find small policies that are near optimal. This thesis first presents a Bounded Policy Iteration (BPI) algorithm to robustly find a good policy represented by a small finite state controller. Real-world POMDPs also tend to exhibit structural properties that can be exploited to mitigate the effect of large state spaces. To that effect, a value-directed compression (VDC) technique is also presented to reduce POMDP models to lower dimensional representations. In practice, it is critical to simultaneously mitigate the impact of complex policy representations and large state spaces. Hence, this thesis describes three approaches that combine techniques capable of dealing with each source of intractability: VDC with BPI, VDC with Perseus (a randomized point-based value iteration algorithm by Spaan and Vlassis [136]), and state abstraction with Perseus. The scalability of those approaches is demonstrated on two problems with more than 33 million states: synthetic network management and a real-world system designed to assist elderly persons with cognitive deficiencies to carry out simple daily tasks such as hand-washing. This represents an important step towards the deployment of POMDP techniques in ever larger, real-world, sequential decision making problems.

242 citations


Journal ArticleDOI
TL;DR: Demonstrates significant advantages of using real-time traffic information for optimal vehicle routing in a nonstationary stochastic network, in terms of total cost savings and vehicle usage reduction, while satisfying or improving service levels for just-in-time delivery.
Abstract: This paper examines the value of real-time traffic information to optimal vehicle routing in a nonstationary stochastic network. We present a systematic approach to aid in the implementation of transportation systems integrated with real-time information technology. We develop decision-making procedures for determining the optimal driver attendance time, optimal departure times, and optimal routing policies under time-varying traffic flows based on a Markov decision process formulation. With a numerical study carried out on an urban road network in Southeast Michigan, we demonstrate significant advantages when using this information in terms of total cost savings and vehicle usage reduction while satisfying or improving service levels for just-in-time delivery.

235 citations
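
A minimal backward-induction sketch in the spirit of this formulation: on a tiny hypothetical network with time-varying, stochastic arc travel times, it computes the expected remaining travel time for every (node, departure time) pair and then reads off an optimal departure time and first move. The network, congestion pattern, and horizon handling are all invented for illustration.

```python
import numpy as np

T = 60           # planning horizon (time steps)
DEST = 3
succ = {0: [1, 2], 1: [3, 2], 2: [3]}

def arc_distribution(i, j, t):
    """Random travel time on arc i -> j when departing at time t: the arc is
    slower with probability 0.3, and congestion raises the slow value mid-horizon."""
    base = {(0, 1): 5, (0, 2): 8, (1, 3): 10, (1, 2): 3, (2, 3): 6}[(i, j)]
    slow = base + (6 if 20 <= t < 40 else 2)
    return np.array([base, slow]), np.array([0.7, 0.3])

# V[i, t] = expected remaining travel time from node i when departing it at time t
# under the optimal routing policy; a crude penalty is applied past the horizon.
V = np.zeros((4, T + 1))
V[[0, 1, 2], T] = 1e3
policy = np.full((4, T + 1), -1)
for t in range(T - 1, -1, -1):
    for i in succ:                               # all non-destination nodes
        best, best_j = np.inf, -1
        for j in succ[i]:
            taus, probs = arc_distribution(i, j, t)
            arrival = np.minimum(t + taus, T).astype(int)
            expected = np.sum(probs * (taus + V[j, arrival]))
            if expected < best:
                best, best_j = expected, j
        V[i, t], policy[i, t] = best, best_j

dep = int(np.argmin(V[0, :T]))
print(f"best departure time from node 0: t = {dep}, expected travel time "
      f"{V[0, dep]:.2f}, first move -> node {policy[0, dep]}")
```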


Journal ArticleDOI
TL;DR: This work considers the simultaneous seat-inventory control of a set of parallel flights between a common origin and destination with dynamic customer choice among the flights as an extension of the classic multiperiod, single-flight "block demand" revenue management model.
Abstract: We consider the simultaneous seat-inventory control of a set of parallel flights between a common origin and destination with dynamic customer choice among the flights. We formulate the problem as an extension of the classic multiperiod, single-flight "block demand" revenue management model. The resulting Markov decision process is quite complex, owing to its multidimensional state space and the fact that the airline's inventory controls do affect the distribution of demand. Using stochastic comparisons, consumer-choice models, and inventory-pooling ideas, we derive easily computable upper and lower bounds for the value function of our model. We propose simulation-based techniques for solving the stochastic optimization problem and also describe heuristics based upon an extension of a well-known linear programming formulation. We provide numerical examples.

213 citations


Proceedings ArticleDOI
27 Nov 2005
TL;DR: A novel algorithm for the induction of Markov blankets from data, called Fast-IAMB, that employs a heuristic to quickly recover the Markov blanket and performs in many cases faster and more reliably than existing algorithms without adversely affecting the accuracy of the recovered Markov blankets.
Abstract: In this paper we address the problem of learning the Markov blanket of a quantity from data in an efficient manner. Markov blanket discovery can be used in the feature selection problem to find an optimal set of features for classification tasks, and is a frequently-used preprocessing phase in data mining, especially for high-dimensional domains. Our contribution is a novel algorithm for the induction of Markov blankets from data, called Fast-IAMB, that employs a heuristic to quickly recover the Markov blanket. Empirical results show that Fast-IAMB performs in many cases faster and more reliably than existing algorithms without adversely affecting the accuracy of the recovered Markov blankets.
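
For illustration, here is a basic IAMB-style Markov blanket recovery on synthetic discrete data, using empirical conditional mutual information with a fixed threshold as the (in)dependence test. Fast-IAMB's speed-up heuristic of adding several attributes per data pass is omitted; the threshold, test, and data-generating process are assumptions made for this sketch.

```python
import numpy as np

def cond_mutual_info(x, y, z_cols):
    """Empirical conditional mutual information I(X; Y | Z) for discrete data."""
    n = len(x)
    if z_cols.shape[1] == 0:
        z_keys = np.zeros(n, dtype=int)
    else:
        _, z_keys = np.unique(z_cols, axis=0, return_inverse=True)
    mi = 0.0
    for zk in np.unique(z_keys):
        mask = z_keys == zk
        pz, xs, ys = mask.mean(), x[mask], y[mask]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = np.mean((xs == xv) & (ys == yv))
                if pxy > 0:
                    px, py = np.mean(xs == xv), np.mean(ys == yv)
                    mi += pz * pxy * np.log(pxy / (px * py))
    return mi

def iamb(data, target, threshold=0.02):
    """Basic IAMB Markov blanket recovery: a growing phase that repeatedly adds
    the attribute most dependent on the target given the current blanket, and a
    shrinking phase that removes false positives.  (Fast-IAMB additionally adds
    several attributes per data pass; that speed-up is omitted here.)"""
    candidates = [v for v in range(data.shape[1]) if v != target]
    mb = []
    while True:                                   # growing phase
        scores = [(cond_mutual_info(data[:, v], data[:, target], data[:, mb]), v)
                  for v in candidates if v not in mb]
        if not scores:
            break
        best_score, best_v = max(scores)
        if best_score <= threshold:
            break
        mb.append(best_v)
    for v in list(mb):                            # shrinking phase
        rest = [u for u in mb if u != v]
        if cond_mutual_info(data[:, v], data[:, target], data[:, rest]) <= threshold:
            mb.remove(v)
    return sorted(mb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4000
    a = rng.integers(0, 2, n)                     # parent of the target
    b = rng.integers(0, 2, n)                     # parent of the target
    t = np.where(rng.random(n) < 0.9, a | b, rng.integers(0, 2, n))   # target
    c = np.where(rng.random(n) < 0.85, t, 1 - t)  # noisy child of the target
    d = rng.integers(0, 2, n)                     # irrelevant attribute
    data = np.column_stack([a, b, c, d, t])
    print("estimated Markov blanket of column 4 (the target):", iamb(data, target=4))
```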

Journal ArticleDOI
TL;DR: A semi-Markov decision process (SMDP) is built for maintenance policy optimization in condition-based preventive maintenance problems, and an approach for the joint optimization of the inspection rate and the maintenance policy is presented.

Journal ArticleDOI
TL;DR: A stylized partially observed Markov decision process (POMDP) framework is developed to study a dynamic pricing problem faced by sellers of fashion-like goods and proposes an active-learning heuristic pricing policy.
Abstract: In this paper, we develop a stylized partially observed Markov decision process (POMDP) framework to study a dynamic pricing problem faced by sellers of fashion-like goods. We consider a retailer that plans to sell a given stock of items during a finite sales season. The objective of the retailer is to dynamically price the product in a way that maximizes expected revenues. Our model brings together various types of uncertainties about the demand, some of which are resolvable through sales observations. We develop a rigorous upper bound for the seller's optimal dynamic decision problem and use it to propose an active-learning heuristic pricing policy. We conduct a numerical study to test the performance of four different heuristic dynamic pricing policies in order to gain insight into several important managerial questions that arise in the context of revenue management.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: Presents the first theoretical analysis of MBIE, proving its efficiency even under worst-case conditions, and introduces a new performance metric, average loss, relating it to its less "online" cousins from the literature.
Abstract: Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents the first theoretical analysis of MBIE, proving its efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less "online" cousins from the literature.
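
The sketch below shows the core MBIE idea as commonly described in the related literature: build L1 confidence intervals around the empirical transition estimates and run value iteration that is optimistic within those intervals (probability mass is shifted toward the highest-valued next state, up to half the interval width). The confidence-radius formula is one standard choice, the exploration loop that would interleave acting and re-solving is not shown, and all names and constants are assumptions.

```python
import numpy as np

def optimistic_distribution(p_hat, values, eps):
    """Maximize p @ values over {p : ||p - p_hat||_1 <= eps, p a distribution}
    by shifting up to eps/2 of probability mass onto the best next state."""
    p = p_hat.copy()
    best = np.argmax(values)
    budget = min(eps / 2.0, 1.0 - p[best])
    p[best] += budget
    for s in np.argsort(values):          # take the same mass from the worst states
        if budget <= 0:
            break
        if s == best:
            continue
        take = min(p[s], budget)
        p[s] -= take
        budget -= take
    return p

def mbie_q_values(counts, rewards, gamma=0.95, delta=0.05, iters=200):
    """Optimistic Q-values from empirical counts (a sketch of the MBIE idea).

    counts[s, a, s'] : observed transition counts
    rewards[s, a]    : empirical mean rewards
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)
    p_hat = counts / np.maximum(n_sa[..., None], 1)
    # One common form of the L1 confidence radius; unvisited pairs get the widest interval.
    eps = np.sqrt(2 * (np.log(2 ** S - 2) - np.log(delta)) / np.maximum(n_sa, 1))
    eps = np.where(n_sa == 0, 2.0, eps)
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        for s in range(S):
            for a in range(A):
                p_opt = optimistic_distribution(p_hat[s, a], V, eps[s, a])
                Q[s, a] = rewards[s, a] + gamma * p_opt @ V
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 2
    counts = rng.integers(0, 20, size=(S, A, S))   # pretend exploration data
    rewards = rng.random((S, A))
    Q = mbie_q_values(counts, rewards)
    print("optimistic greedy policy:", Q.argmax(axis=1))
```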

Journal ArticleDOI
TL;DR: An actor-critic type reinforcement learning algorithm is proposed and analyzed for constrained controlled Markov decision processes using multiscale stochastic approximation theory and the "envelope theorem" of mathematical economics.

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter describes MDP modeling in the context of medical treatment, discusses when MDPs are an appropriate technique, and examines the challenges and opportunities for applying MDPs to medical treatment decisions.
Abstract: Medical treatment decisions are often sequential and uncertain. Markov decision processes (MDPs) are an appropriate technique for modeling and solving such stochastic and dynamic decisions. This chapter gives an overview of MDP models and solution techniques. We describe MDP modeling in the context of medical treatment and discuss when MDPs are an appropriate technique. We review selected successful applications of MDPs to treatment decisions in the literature. We conclude with a discussion of the challenges and opportunities for applying MDPs to medical treatment decisions.

Journal ArticleDOI
TL;DR: An adaptive sampling algorithm that adaptively chooses which action to sample as the sampling process proceeds and generates an asymptotically unbiased estimator, whose bias is bounded by a quantity that converges to zero at rate (ln N)/N.
Abstract: Based on recent results for multiarmed bandit problems, we propose an adaptive sampling algorithm that approximates the optimal value of a finite-horizon Markov decision process (MDP) with finite state and action spaces. The algorithm adaptively chooses which action to sample as the sampling process proceeds and generates an asymptotically unbiased estimator, whose bias is bounded by a quantity that converges to zero at rate (ln N)/N, where N is the total number of samples that are used per state sampled in each stage. The worst-case running-time complexity of the algorithm is O((|A|N)^H), independent of the size of the state space, where |A| is the size of the action space and H is the horizon length. The algorithm can be used to create an approximate receding horizon control to solve infinite-horizon MDPs. To illustrate the algorithm, computational results are reported on simple examples from inventory control.
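
A small sketch of the adaptive multi-stage sampling estimator on a toy inventory problem: at each sampled state, actions are chosen with a UCB-style rule (the multiarmed-bandit connection mentioned in the abstract), each sample recurses one stage deeper, and the value estimate is the count-weighted average of the per-action sample means. The simulator, constants, and exact UCB form are illustrative assumptions.

```python
import math
import random

def adaptive_sampling_value(sim, state, actions, horizon, N, c=1.0):
    """UCB-based adaptive multi-stage sampling estimate of the optimal
    finite-horizon value of `state`, using only a generative model.

    sim(state, action) -> (reward, next_state) draws one sample transition.
    N is the per-state sampling budget at each stage (assumed >= len(actions))."""
    if horizon == 0:
        return 0.0
    counts = {a: 0 for a in actions}
    sums = {a: 0.0 for a in actions}
    for i in range(N):
        if i < len(actions):
            act = actions[i]                      # sample every action once first
        else:
            # then pick the action with the highest upper confidence bound
            act = max(actions, key=lambda a: sums[a] / counts[a]
                      + c * math.sqrt(math.log(i) / counts[a]))
        reward, nxt = sim(state, act)
        q_sample = reward + adaptive_sampling_value(sim, nxt, actions, horizon - 1, N, c)
        sums[act] += q_sample
        counts[act] += 1
    # The estimator is the count-weighted average of the per-action sample means.
    return sum(counts[a] * (sums[a] / counts[a]) for a in actions) / N

if __name__ == "__main__":
    random.seed(0)

    def inventory_sim(stock, order):
        """Toy inventory problem: capacity 5, uniform demand in {0,...,3}."""
        stock = min(stock + order, 5)
        demand = random.randint(0, 3)
        sold = min(stock, demand)
        reward = 4.0 * sold - 1.0 * order - 0.5 * (stock - sold)   # revenue - ordering - holding
        return reward, stock - sold

    est = adaptive_sampling_value(inventory_sim, state=2, actions=[0, 1, 2, 3],
                                  horizon=3, N=12)
    print("estimated optimal 3-stage value from stock level 2:", round(est, 2))
```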

Journal ArticleDOI
21 Nov 2005
TL;DR: This paper presents an efficient algorithm to compute the maximum (or minimum) probability to reach a set of goal states within a given time bound in a uniform CTMDP, and proves that these probabilities coincide for (time-abstract) history-dependent and Markovian schedulers that resolve nondeterminism either deterministically or in a randomized way.
Abstract: A continuous-time Markov decision process (CTMDP) is a generalization of a continuous-time Markov chain in which both probabilistic and nondeterministic choices co-exist. This paper presents an efficient algorithm to compute the maximum (or minimum) probability to reach a set of goal states within a given time bound in a uniform CTMDP, i.e., a CTMDP in which the delay time distribution per state visit is the same for all states. It furthermore proves that these probabilities coincide for (time-abstract) history-dependent and Markovian schedulers that resolve nondeterminism either deterministically or in a randomized way.

16 Sep 2005
TL;DR: This work shows how a dialogue model can be represented as a factored Partially Observable Markov Decision Process (POMDP), and shows how a dialogue manager produced with a POMDP optimisation technique may be directly compared to a handcrafted dialogue manager.
Abstract: This work shows how a dialogue model can be represented as a factored Partially Observable Markov Decision Process (POMDP). The factored representation has several benefits, such as enabling more nuanced reward functions to be specified. Although our dialogue model is significantly larger than past work using POMDPs, experiments on a small testbed problem demonstrate that recent optimisation techniques scale well and produce policies which outperform a traditional fully-observable Markov Decision Process. This work then shows how a dialogue manager produced with a POMDP optimisation technique may be directly compared to a handcrafted dialogue manager. Experiments on the testbed problem show that automatically generated dialogue managers outperform several handcrafted dialogue managers, and that automatically generated dialogue managers for the testbed problem successfully adapt to changes in speech recognition accuracy.

Journal ArticleDOI
TL;DR: In this article, a two-factor real options model of the harvesting decision over infinite rotations assuming a known stochastic price process and using a rigorous Hamilton-Jacobi-Bellman methodology was developed.
Abstract: This article develops a two-factor real options model of the harvesting decision over infinite rotations assuming a known stochastic price process and using a rigorous Hamilton-Jacobi-Bellman methodology. The harvesting problem is formulated as a linear complementarity problem that is solved numerically using a fully implicit finite difference method. This approach is contrasted with the Markov decision process models commonly used in the literature. The model is used to estimate the value of a representative stand in Ontario's boreal forest, both when there is complete flexibility regarding harvesting time and when regulations dictate the harvesting date.

Book ChapterDOI
11 Jul 2005
TL;DR: Recursive Markov Decision Processes (RMDPs) and Recursive Simple Stochastic Games (RSSGs) are introduced, and the decidability and complexity of algorithms for their analysis and verification are studied.
Abstract: We introduce Recursive Markov Decision Processes (RMDPs) and Recursive Simple Stochastic Games (RSSGs), and study the decidability and complexity of algorithms for their analysis and verification. These models extend Recursive Markov Chains (RMCs), introduced in [EY05a, EY05b] as a natural model for verification of probabilistic procedural programs and related systems involving both recursion and probabilistic behavior. RMCs define a class of denumerable Markov chains with a rich theory generalizing that of stochastic context-free grammars and multi-type branching processes, and they are also intimately related to probabilistic pushdown systems. RMDPs & RSSGs extend RMCs with one controller or two adversarial players, respectively. Such extensions are useful for modeling nondeterministic and concurrent behavior, as well as modeling a system’s interactions with an environment. We provide upper and lower bounds for deciding, given an RMDP (or RSSG) A and probability p, whether player 1 has a strategy to force termination at a desired exit with probability at least p. We also address “qualitative” termination, where p=1, and model checking questions.

Journal ArticleDOI
TL;DR: This work proposes using Markov decision processes (MDPs) to model workflow composition and produces workflows that are robust to non-deterministic behaviors of Web services and that adapt to a changing environment.
Abstract: The advent of Web services has made automated workflow composition relevant to Web-based applications. One technique that has received some attention for automatically composing workflows is AI-based classical planning. However, workflows generated by classical planning algorithms suffer from the paradoxical assumption of deterministic behavior of Web services, thus requiring the additional overhead of execution monitoring to recover from the unexpected behavior of services due to service failures and the dynamic nature of real-world environments. To address these concerns, we propose using Markov decision processes (MDPs) to model workflow composition. To account for the uncertainty over the true environmental model, and for dynamic environments, we interleave MDP-based workflow generation and Bayesian model learning. Consequently, our method models both the inherent stochastic nature of Web services and the dynamic nature of the environment. Our algorithm produces workflows that are robust to non-deterministic behaviors of Web services and that adapt to a changing environment. We use a supply chain scenario to demonstrate our method and provide empirical results.

Proceedings Article
30 Jul 2005
TL;DR: It is demonstrated how to find this partition while computing a policy, and how the resulting discretisation of the observation space reveals the relevant features of the application domain.
Abstract: We describe methods to solve partially observable Markov decision processes (POMDPs) with continuous or large discrete observation spaces. Realistic problems often have rich observation spaces, posing significant problems for standard POMDP algorithms that require explicit enumeration of the observations. This problem is usually approached by imposing an a priori discretisation on the observation space, which can be sub-optimal for the decision making task. However, since only those observations that would change the policy need to be distinguished, the decision problem itself induces a lossless partitioning of the observation space. This paper demonstrates how to find this partition while computing a policy, and how the resulting discretisation of the observation space reveals the relevant features of the application domain. The algorithms are demonstrated on a toy example and on a realistic assisted living task.

Journal ArticleDOI
TL;DR: The version of the GPT planner used in the probabilistic track of the 4th International Planning Competition (IPC-4), called mGPT, solves Markov Decision Processes specified in the PPDDL language by extracting and using different classes of lower bounds along with various heuristic-search algorithms.
Abstract: We describe the version of the GPT planner used in the probabilistic track of the 4th International Planning Competition (IPC-4). This version, called mGPT, solves Markov Decision Processes specified in the PPDDL language by extracting and using different classes of lower bounds along with various heuristic-search algorithms. The lower bounds are extracted from deterministic relaxations where the alternative probabilistic effects of an action are mapped into different, independent, deterministic actions. The heuristic-search algorithms use these lower bounds for focusing the updates and delivering a consistent value function over all states reachable from the initial state and the greedy policy.

Journal ArticleDOI
TL;DR: The extension of XCS to gradient-based update methods results in a classifier system that is more robust and more parameter independent, solving large and difficult maze problems reliably.
Abstract: The accuracy-based XCS classifier system has been shown to solve typical data mining problems in a machine-learning competitive way. However, successful applications in multistep problems, modeled by a Markov decision process, were restricted to very small problems. Until now, the temporal difference learning technique in XCS was based on deterministic updates. However, since a prediction is actually generated by a set of rules in XCS and Learning Classifier Systems in general, gradient-based update methods are applicable. The extension of XCS to gradient-based update methods results in a classifier system that is more robust and more parameter independent, solving large and difficult maze problems reliably. Additionally, the extension to gradient methods highlights the relation of XCS to other function approximation methods in reinforcement learning.

Journal ArticleDOI
TL;DR: In this article, the utilization of Markov decision processes as a sequential decision algorithm in the management actions of infrastructure (inspection, maintenance and repair) is discussed, and the use of this approach to determine optimal inspection strategies is described, as well as the role of deterioration and maintenance for steel structures.
Abstract: The utilization of Markov decision processes as a sequential decision algorithm in the management actions of infrastructure (inspection, maintenance and repair) is discussed. The realistic issue of partial information from inspection is described, and the classic approach of partially observable Markov decision processes is then introduced. The use of this approach to determine optimal inspection strategies is described, as well as the role of deterioration and maintenance for steel structures. Discrete structural shapes and maintenance actions provide a tractable approach. In-service inspection incorporates Bayesian updating and leads to optimal operation and initial design. Finally, the concept of management policy is described with strategy vectors.

Journal ArticleDOI
TL;DR: It is shown that Markov decision processes and the policy-gradient approach, or perturbation analysis, can be derived easily from two fundamental sensitivity formulas, and such formulas can be flexibly constructed, by first principles, with performance potentials as building blocks.
Abstract: The goal of this paper is two-fold: First, we present a sensitivity point of view on the optimization of Markov systems. We show that Markov decision processes (MDPs) and the policy-gradient approach, or perturbation analysis (PA), can be derived easily from two fundamental sensitivity formulas, and such formulas can be flexibly constructed, by first principles, with performance potentials as building blocks. Second, with this sensitivity view we propose an event-based optimization approach, including event-based sensitivity analysis and event-based policy iteration. This approach utilizes the special feature of a system characterized by events and illustrates how the potentials can be aggregated using the special feature and how the aggregated potential can be used in policy iteration. Compared with the traditional MDP approach, the event-based approach has its advantages: the number of aggregated potentials may scale with the system size even though the number of states grows exponentially in the system size, which reduces the policy space and saves computation; the approach does not require actions at different states to be independent; and it utilizes the special feature of a system and does not need to know the exact transition probability matrix. The main ideas of the approach are illustrated by an admission control problem.

Proceedings Article
09 Jul 2005
TL;DR: Empirical study shows that lazy approximation performs much better than discretization, and this new technique is successfully applied to a more realistic planetary rover planning problem.
Abstract: Solving Markov decision processes (MDPs) with continuous state spaces is a challenge due to, among other problems, the well-known curse of dimensionality. Nevertheless, numerous real-world applications such as transportation planning and telescope observation scheduling exhibit a critical dependence on continuous states. Current approaches to continuous-state MDPs include discretizing their transition models. In this paper, we propose and study an alternative, discretization-free approach we call lazy approximation. Empirical study shows that lazy approximation performs much better than discretization, and we successfully applied this new technique to a more realistic planetary rover planning problem.

Journal ArticleDOI
TL;DR: Simulation results show that the novel hierarchical scheme for adaptive dynamic power management under nonstationary service requests can lead to significant power savings compared to previously proposed heuristic approaches.
Abstract: Dynamic power management aims at extending battery life by switching devices to lower-power modes when there is a reduced demand for service. Static power management strategies can lead to poor performance or unnecessary power consumption when there are wide variations in the rate of requests for service. This paper presents a hierarchical scheme for adaptive dynamic power management (DPM) under nonstationary service requests. As the main theoretical contribution, we model the nonstationary request process as a Markov-modulated process with a collection of modes, each corresponding to a particular stationary request process. Optimal DPM policies are precalculated offline for selected modes using standard algorithms available for stationary Markov decision processes (MDPs). The power manager then switches online among these policies to accommodate the stochastic mode-switching request dynamics using an adaptive algorithm to determine the optimal switching rule based on the observed sample path. As a target application, we present simulations of hierarchical DPM for hard disk drives where the read/write request arrivals are modeled as a Markov-modulated Poisson process. Simulation results show that the power consumption of our approach under highly nonstationary request arrivals is less than that of a previously proposed heuristic approach and is even comparable to that of the optimal policy under stationary Poisson request process with the same arrival rate as the average arrival rate of the nonstationary request process.
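
A toy sketch of the hierarchical idea, under strong simplifying assumptions: each stationary mode's precomputed MDP policy is reduced to a single idle-timeout value, and the online layer estimates the current mode from recent inter-arrival times by maximum likelihood and switches to that mode's policy. The paper's adaptive switching rule based on the observed sample path is more sophisticated; all rates, timeouts, and names here are hypothetical.

```python
import numpy as np

# Hypothetical set-up: three stationary request "modes" (low / medium / high
# arrival rate).  For each mode an optimal DPM policy is assumed to have been
# precomputed offline with a standard MDP solver; here each policy is reduced
# to a single idle-timeout before the device is put to sleep.
MODE_RATES = np.array([0.5, 2.0, 8.0])             # requests per second per mode
PRECOMPUTED_TIMEOUTS = {0: 5.0, 1: 1.0, 2: 0.2}    # seconds of idling before sleep

def estimate_mode(interarrival_times):
    """Pick the mode whose rate best explains the recent inter-arrival times
    (maximum likelihood under an exponential model)."""
    rate_hat = 1.0 / np.mean(interarrival_times)
    return int(np.argmin(np.abs(MODE_RATES - rate_hat)))

def hierarchical_dpm(interarrivals, window=20):
    """Online layer: slide a window over request inter-arrival times, estimate
    the current mode, and switch to that mode's precomputed timeout policy."""
    schedule = []
    for i in range(window, len(interarrivals)):
        mode = estimate_mode(interarrivals[i - window:i])
        schedule.append((i, mode, PRECOMPUTED_TIMEOUTS[mode]))
    return schedule

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A Markov-modulated-style arrival stream: a high-rate burst inside a low-rate background.
    interarrivals = np.concatenate([rng.exponential(1 / 0.5, 100),
                                    rng.exponential(1 / 8.0, 100),
                                    rng.exponential(1 / 0.5, 100)])
    sched = hierarchical_dpm(interarrivals)
    modes = [m for _, m, _ in sched]
    print("fraction of time spent under each mode's policy:",
          np.bincount(modes, minlength=3) / len(modes))
```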

Journal ArticleDOI
TL;DR: Analyzes the routing problem in a system where customers call back when their problems are not completely resolved by the customer service representatives (CSRs), and argues that the concept of call resolution probability constitutes a good proxy for call quality.
Abstract: Traditional research on routing in queueing systems usually ignores service quality related factors. In this paper, we analyze the routing problem in a system where customers call back when their problems are not completely resolved by the customer service representatives (CSRs). We introduce the concept of call resolution probability, and we argue that it constitutes a good proxy for call quality. For each call, both the call resolution probability (p) and the average service time (1/µ) are CSR dependent. We use a Markov decision process formulation to obtain analytical results and insights about the optimal routing policy that minimizes the average total time of call resolution, including callbacks. In particular, we provide sufficient conditions under which it is optimal to route to the CSR with the highest call resolution rate (pµ) among those available. We also develop efficient heuristics that can be easily implemented in practice.
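
Finally, the routing index suggested by the paper's sufficient conditions is easy to state in code: among the available CSRs, send the call to the one with the largest call resolution rate pµ. The numbers in the demo below are made up.

```python
import numpy as np

def route_call(available_csrs, p, mu):
    """Route an incoming call to the available CSR with the highest call
    resolution rate p*mu (the index shown to be optimal under the paper's
    sufficient conditions).

    available_csrs : indices of currently idle CSRs
    p[i]           : call resolution probability of CSR i
    mu[i]          : service rate of CSR i (1/mu[i] is the mean service time)
    """
    rates = p[available_csrs] * mu[available_csrs]
    return int(available_csrs[int(np.argmax(rates))])

if __name__ == "__main__":
    p = np.array([0.95, 0.80, 0.70])     # resolution probabilities (hypothetical)
    mu = np.array([0.8, 1.2, 1.5])       # service rates (calls per minute)
    idle = np.array([1, 2])              # CSR 0 is busy
    print("route the next call to CSR", route_call(idle, p, mu))
```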