
Showing papers on "Markov decision process published in 2004"


Proceedings ArticleDOI
04 Jul 2004
TL;DR: This work thinks of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and gives an algorithm for learning the task demonstrated by the expert, based on using "inverse reinforcement learning" to try to recover the unknown reward function.
Abstract: We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert's unknown reward function.
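
A minimal sketch of the projection-style approach described above, on an assumed small tabular MDP (the transition tensor P, feature matrix phi, expert feature expectations mu_expert, and start distribution d0 are hypothetical placeholders, not artifacts of the paper): each iteration guesses reward weights from the gap to the expert's feature expectations, solves the MDP for that reward, and projects toward the expert.

import numpy as np

def value_iteration(P, r, gamma=0.9, iters=500):
    # P: (A, S, S) transition matrices, r: (S,) state rewards.
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r[None, :] + gamma * (P @ V)      # (A, S) action values
        V = Q.max(axis=0)
    return Q.argmax(axis=0)                   # greedy deterministic policy

def feature_expectations(P, phi, policy, d0, gamma=0.9):
    # Discounted expected feature counts mu(pi) of a stationary deterministic policy.
    S = P.shape[1]
    P_pi = P[policy, np.arange(S), :]                            # (S, S) transitions under the policy
    occupancy = np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)  # discounted state occupancy
    return phi.T @ occupancy                                     # (k,) feature expectations

def apprenticeship_learning(P, phi, mu_expert, d0, gamma=0.9, eps=1e-3, max_iter=50):
    # Iterate: pick reward weights w from the gap to the expert, solve for the
    # optimal policy under reward w.phi(s), then project toward the expert's
    # feature expectations.
    policy = np.zeros(P.shape[1], dtype=int)                     # arbitrary initial policy
    mu_bar = feature_expectations(P, phi, policy, d0, gamma)
    for _ in range(max_iter):
        w = mu_expert - mu_bar                                   # candidate reward weights
        if np.linalg.norm(w) <= eps:
            break
        policy = value_iteration(P, phi @ w, gamma)
        mu = feature_expectations(P, phi, policy, d0, gamma)
        # Orthogonal projection of mu_expert onto the segment between mu_bar and mu.
        step = (mu - mu_bar) @ (mu_expert - mu_bar) / ((mu - mu_bar) @ (mu - mu_bar) + 1e-12)
        mu_bar = mu_bar + np.clip(step, 0.0, 1.0) * (mu - mu_bar)
    return policy, w

When the loop exits, the returned policy's feature expectations are within eps of the expert's, which is what bounds the performance gap under any reward expressible as a bounded linear combination of the features.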

3,110 citations


BookDOI
01 Jan 2004
TL;DR: This chapter discusses reinforcement learning in large, high-dimensional state spaces, model-based adaptive critic designs, and applications of approximate dynamic programming in power systems control.
Abstract: Foreword. 1. ADP: goals, opportunities and principles. Part I: Overview. 2. Reinforcement learning and its relationship to supervised learning. 3. Model-based adaptive critic designs. 4. Guidance in the use of adaptive critics for control. 5. Direct neural dynamic programming. 6. The linear programming approach to approximate dynamic programming. 7. Reinforcement learning in large, high-dimensional state spaces. 8. Hierarchical decision making. Part II: Technical advances. 9. Improved temporal difference methods with linear function approximation. 10. Approximate dynamic programming for high-dimensional resource allocation problems. 11. Hierarchical approaches to concurrency, multiagency, and partial observability. 12. Learning and optimization - from a system theoretic perspective. 13. Robust reinforcement learning using integral-quadratic constraints. 14. Supervised actor-critic reinforcement learning. 15. BPTT and DAC - a common framework for comparison. Part III: Applications. 16. Near-optimal control via reinforcement learning. 17. Multiobjective control problems by reinforcement learning. 18. Adaptive critic based neural network for control-constrained agile missile. 19. Applications of approximate dynamic programming in power systems control. 20. Robust reinforcement learning for heating, ventilation, and air conditioning control of buildings. 21. Helicopter flight control using direct neural dynamic programming. 22. Toward dynamic stochastic optimal power flow. 23. Control, optimization, security, and self-healing of benchmark power systems.

780 citations


Proceedings Article
25 Jul 2004
TL;DR: An exact dynamic programming algorithm for partially observable stochastic games (POSGs) is developed and it is proved that when applied to finite-horizon POSGs, the algorithm iteratively eliminates very weakly dominated strategies without first forming a normal form representation of the game.
Abstract: We develop an exact dynamic programming algorithm for partially observable stochastic games (POSGs). The algorithm is a synthesis of dynamic programming for partially observable Markov decision processes (POMDPs) and iterated elimination of dominated strategies in normal form games. We prove that when applied to finite-horizon POSGs, the algorithm iteratively eliminates very weakly dominated strategies without first forming a normal form representation of the game. For the special case in which agents share the same payoffs, the algorithm can be used to find an optimal solution. We present preliminary empirical results and discuss ways to further exploit POMDP theory in solving POSGs.

528 citations


Journal ArticleDOI
01 Aug 2004
TL;DR: A novel hybrid technique which combines aspects of symbolic and explicit approaches to overcome performance problems in probabilistic model checking, and achieves a dramatic improvement over the purely symbolic approach.
Abstract: In this paper we present efficient symbolic techniques for probabilistic model checking. These have been implemented in PRISM, a tool for the analysis of probabilistic models such as discrete-time Markov chains, continuous-time Markov chains and Markov decision processes using specifications in the probabilistic temporal logics PCTL and CSL. Motivated by the success of model checkers such as SMV which use BDDs (binary decision diagrams), we have developed an implementation of PCTL and CSL model checking based on MTBDDs (multi-terminal BDDs) and BDDs. Existing work in this direction has been hindered by the generally poor performance of MTBDD-based numerical computation, which is often substantially slower than explicit methods using sparse matrices. The focus of this paper is a novel hybrid technique which combines aspects of symbolic and explicit approaches to overcome these performance problems. For typical examples, we achieve a dramatic improvement over the purely symbolic approach. In addition, thanks to the compact model representation using MTBDDs, we can verify systems an order of magnitude larger than with sparse matrices, while almost matching or even beating them for speed.
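
For context, PCTL properties such as "the probability of eventually reaching a goal state is at least p" reduce to numerical reachability computations on the underlying Markov chain. The sketch below, with a hypothetical 4-state chain, shows the explicit matrix-vector value iteration that such queries boil down to; it illustrates the baseline computation the hybrid engine accelerates, not PRISM's MTBDD data structures themselves.

import numpy as np
from scipy import sparse

def reachability_probabilities(P, target, tol=1e-8, max_iter=100000):
    # P: (S, S) row-stochastic matrix of a DTMC; target: boolean array marking goal states.
    # Returns x with x[s] = Pr[eventually reach the target set | start in s].
    P = sparse.csr_matrix(P)
    x = target.astype(float)
    for _ in range(max_iter):
        x_new = P @ x
        x_new[target] = 1.0           # target states reach the target with probability 1
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Hypothetical 4-state chain: state 3 is the target, state 2 is a failure sink.
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.0, 0.2, 0.4, 0.4],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(reachability_probabilities(P, np.array([False, False, False, True])))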

474 citations


Journal ArticleDOI
TL;DR: A scheme that samples and imposes a subset of constraints on a linear program that has a relatively small number of variables but an intractable number of constraints is studied.
Abstract: In the linear programming approach to approximate dynamic programming, one tries to solve a certain linear program--the ALP--that has a relatively small number K of variables but an intractable number M of constraints. In this paper, we study a scheme that samples and imposes a subset of m < M constraints.
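
A minimal sketch of the constraint-sampling idea on an assumed small tabular MDP (the inputs P, g, Phi, and c are hypothetical, as is the box bound on the weights, included here only to keep the sampled LP bounded): only m randomly sampled state-action constraints of the exact ALP are imposed.

import numpy as np
from scipy.optimize import linprog

def sampled_alp(P, g, Phi, c, gamma=0.95, m=200, r_bound=100.0, seed=0):
    # P: (A, S, S) transitions, g: (S, A) one-step rewards, Phi: (S, K) basis
    # functions, c: (S,) state-relevance weights.  Returns weights r such that
    # Phi @ r approximates the optimal value function.
    rng = np.random.default_rng(seed)
    A, S, _ = P.shape
    K = Phi.shape[1]
    # Sample m of the M = S*A constraints; the exact ALP would impose all of them.
    idx = rng.integers(0, S * A, size=m)
    s_idx, a_idx = idx // A, idx % A
    # Constraint for (s, a):  (Phi r)(s) >= g(s, a) + gamma * P[a, s, :] @ (Phi r),
    # rewritten for linprog as  (gamma * P[a, s, :] @ Phi - Phi[s, :]) r <= -g(s, a).
    A_ub = gamma * (P[a_idx, s_idx, :] @ Phi) - Phi[s_idx, :]
    b_ub = -g[s_idx, a_idx]
    res = linprog(c=Phi.T @ c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-r_bound, r_bound)] * K, method="highs")
    return res.x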

415 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider variance reduction methods that were developed for Monte Carlo estimates of integrals and study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective.
Abstract: Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function.
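
A minimal sketch of the control-variate view on a hypothetical one-step, two-action problem (not an example from the paper): subtracting a baseline from the reward in a score-function gradient estimator leaves the estimate unbiased but changes its variance.

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                        # policy parameter: logit of picking action 1
rewards = np.array([1.0, 5.0])     # mean reward of actions 0 and 1 (plus unit-variance noise)
p1 = 1.0 / (1.0 + np.exp(-theta))  # probability of action 1 under the sigmoid policy

def gradient_samples(baseline, n=200_000):
    a = (rng.random(n) < p1).astype(int)
    r = rewards[a] + rng.normal(0.0, 1.0, size=n)
    score = a - p1                       # d/dtheta log pi_theta(a)
    return (r - baseline) * score        # per-sample score-function gradient estimate

# Subtracting a constant baseline leaves the mean unchanged (E[score] = 0) but
# changes the variance; the variance-minimizing baseline is E[R * score^2] / E[score^2].
avg_reward = (1 - p1) * rewards[0] + p1 * rewards[1]
probe_a = (rng.random(200_000) < p1).astype(int)
probe_r = rewards[probe_a] + rng.normal(0.0, 1.0, size=200_000)
probe_score = probe_a - p1
b_opt = np.mean(probe_r * probe_score**2) / np.mean(probe_score**2)

for name, b in [("no baseline", 0.0), ("average reward", avg_reward), ("near-optimal", b_opt)]:
    g = gradient_samples(b)
    print(f"{name:15s} mean {g.mean():+.4f}   variance {g.var():.4f}")

All three estimators have the same mean, but their variances differ, and the average-reward baseline is generally not the variance-minimizing one, which is the paper's point.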

399 citations


Journal ArticleDOI
TL;DR: This work presents a novel algorithm for solving a specific class of decentralized MDPs in which the agents' transitions are independent, and lays the foundation for further work in this area on both exact and approximate algorithms.
Abstract: Formal treatment of collaborative multi-agent systems has been lagging behind the rapid progress in sequential decision making by individual agents. Recent work in the area of decentralized Markov Decision Processes (MDPs) has contributed to closing this gap, but the computational complexity of these models remains a serious obstacle. To overcome this complexity barrier, we identify a specific class of decentralized MDPs in which the agents' transitions are independent. The class consists of independent collaborating agents that are tied together through a structured global reward function that depends on all of their histories of states and actions. We present a novel algorithm for solving this class of problems and examine its properties, both as an optimal algorithm and as an anytime algorithm. To the best of our knowledge, this is the first algorithm to optimally solve a non-trivial subclass of decentralized MDPs. It lays the foundation for further work in this area on both exact and approximate algorithms.

270 citations


MonographDOI
09 Mar 2004
TL;DR: This volume covers streams and coinduction (stream calculus, analytical differential equations, coinductive counting, component connectors) and the modelling and verification of probabilistic systems (discrete-time and continuous-time Markov chains, Markov decision processes, and probabilistic timed automata).
Abstract: On streams and coinduction: Preface. Acknowledgments. Streams and coinduction. Stream calculus. Analytical differential equations. Coinductive counting. Component connectors. Key differential equations. Bibliography. Modelling and verification of probabilistic systems: Preface. Introduction. Discrete-time Markov chains. Markov decision processes. Continuous-time Markov chains. Probabilistic timed automata. Implementation. Measure theory and probability. Iterative solution methods. Bibliography.

255 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a joint optimization approach for segmenting customers into homogeneous groups and determining the optimal policy (i.e. what action to take from a set of available actions) towards each segment.
Abstract: With the advent of one-to-one marketing media, e.g. targeted direct mail or internet marketing, the opportunities to develop targeted marketing (customer relationship management) campaigns are enhanced in such a way that it is now both organizationally and economically feasible to profitably support a substantially larger number of marketing segments. However, the problem of what segments to distinguish, and what actions to take towards the different segments increases substantially in such an environment. A systematic analytic procedure optimizing both steps would be very welcome. In this study, we present a joint optimization approach addressing two issues: (1) the segmentation of customers into homogeneous groups of customers, (2) determining the optimal policy (i.e. what action to take from a set of available actions) towards each segment. We implement this joint optimization framework in a direct-mail setting for a charitable organization. Many previous studies in this area highlighted the importance of the following variables: R(ecency), F(requency), and M(onetary value). We use these variables to segment customers. In a second step, we determine which marketing policy is optimal using Markov decision processes, following similar previous applications. The attractiveness of this stochastic dynamic programming procedure is based on the long-run maximization of expected average profit. Our contribution lies in the combination of both steps into one optimization framework to obtain an optimal allocation of marketing expenditures. Moreover, we control segment stability and policy performance by a bootstrap procedure. Our framework is illustrated by a real-life application. The results show that the proposed model outperforms a CHAID segmentation.
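
A minimal sketch of the second step (the MDP over segments), with hypothetical RFM segments, transition probabilities, and profits: relative value iteration picks a mailing action per segment so as to maximize long-run expected average profit.

import numpy as np

# States: 3 RFM segments (e.g. lapsed, occasional, loyal); actions: 0 = no mail, 1 = mail.
P = np.array([  # P[a, s, s']: segment transition probabilities under each action
    [[0.9, 0.1, 0.0], [0.3, 0.6, 0.1], [0.1, 0.4, 0.5]],   # no mailing
    [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],   # mailing
])
profit = np.array([  # expected profit[a, s] = expected donation minus mailing cost
    [0.0, 1.0, 3.0],
    [-0.5, 2.0, 5.0],
])

def relative_value_iteration(P, profit, iters=2000):
    # Average-reward dynamic programming: iterate the Bellman operator and
    # renormalize at a reference segment; the subtracted amount converges to
    # the optimal long-run average profit per period.
    S = P.shape[1]
    h = np.zeros(S)
    for _ in range(iters):
        Q = profit + P @ h          # (A, S) action values
        h_new = Q.max(axis=0)
        gain = h_new[0]             # reference segment
        h = h_new - gain
    return Q.argmax(axis=0), gain

policy, gain = relative_value_iteration(P, profit)
print("mail segment?", policy.astype(bool), " long-run average profit per period:", round(gain, 3))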

215 citations


Journal ArticleDOI
TL;DR: This work considers a new approach to stochastic inventory/routing that approximates the future costs of current actions using optimal dual prices of a linear program, together with an efficient algorithm that both generates and eliminates itineraries during solution of the linear programs and the control policy.
Abstract: We consider a new approach to stochastic inventory/routing that approximates the future costs of current actions using optimal dual prices of a linear program. We obtain two such linear programs by formulating the control problem as a Markov decision process and then replacing the optimal value function with the sum of single-customer inventory value functions. The resulting approximation yields statewise lower bounds on optimal infinite-horizon discounted costs. We present a linear program that takes into account inventory dynamics and economics in allocating transportation costs for stochastic inventory routing. On test instances we find that these allocations do not introduce any error in the value function approximations relative to the best approximations that can be achieved without them. Also, unlike other approaches, we do not restrict the set of allowable vehicle itineraries in any way. Instead, we develop an efficient algorithm to both generate and eliminate itineraries during solution of the linear programs and control policy. In simulation experiments, the price-directed policy outperforms other policies from the literature.

204 citations


Proceedings ArticleDOI
04 Jul 2004
TL;DR: This paper exploits a linear programming relaxation for the task of finding the best joint assignment in associative Markov networks, which provides an approximate quadratic program (QP) for the problem of learning a margin-maximizing Markov network.
Abstract: Markov networks are extensively used to model complex sequential, spatial, and relational interactions in fields as diverse as image processing, natural language analysis, and bioinformatics. However, inference and learning in general Markov networks is intractable. In this paper, we focus on learning a large subclass of such models (called associative Markov networks) that are tractable or closely approximable. This subclass contains networks of discrete variables with K labels each and clique potentials that favor the same labels for all variables in the clique. Such networks capture the "guilt by association" pattern of reasoning present in many domains, in which connected ("associated") variables tend to have the same label. Our approach exploits a linear programming relaxation for the task of finding the best joint assignment in such networks, which provides an approximate quadratic program (QP) for the problem of learning a margin-maximizing Markov network. We show that for associative Markov networks over binary-valued variables, this approximate QP is guaranteed to return an optimal parameterization for Markov networks of arbitrary topology. For the nonbinary case, optimality is not guaranteed, but the relaxation produces good solutions in practice. Experimental results with hypertext and newswire classification show significant advantages over standard approaches.

Journal ArticleDOI
TL;DR: This work formulates a Markov decision process model of the stochastic inventory routing problem, proposes approximation methods to find good solutions with reasonable computational effort, and indicates how the proposed approach can be used for other Markov decision processes involving the control of multiple resources.
Abstract: This work is motivated by the need to solve the inventory routing problem when implementing a business practice called vendor managed inventory replenishment (VMI). With VMI, vendors monitor their customers' inventories and decide when and how much inventory should be replenished at each customer. The inventory routing problem attempts to coordinate inventory replenishment and transportation in such a way that the cost is minimized over the long run. We formulate a Markov decision process model of the stochastic inventory routing problem and propose approximation methods to find good solutions with reasonable computational effort. We indicate how the proposed approach can be used for other Markov decision processes involving the control of multiple resources.

Proceedings ArticleDOI
11 Jan 2004
TL;DR: The existence of optimal pure memoryless strategies together with the polynomial-time solution for the one-player case implies that the quantitative two-player stochastic parity game problem is in NP ∩ co-NP, which generalizes a result of Condon for Stochastic games with reachability objectives.
Abstract: We study perfect-information stochastic parity games. These are two-player nonterminating games which are played on a graph with turn-based probabilistic transitions. A play results in an infinite path and the conflicting goals of the two players are ω-regular path properties, formalized as parity winning conditions. The qualitative solution of such a game amounts to computing the set of vertices from which a player has a strategy to win with probability 1 (or with positive probability). The quantitative solution amounts to computing the value of the game in every vertex, i.e., the highest probability with which a player can guarantee satisfaction of his own objective in a play that starts from the vertex. For the important special case of one-player stochastic parity games (parity Markov decision processes) we give polynomial-time algorithms both for the qualitative and the quantitative solution. The running time of the qualitative solution is O(d · m^(3/2)) for graphs with m edges and d priorities. The quantitative solution is based on a linear-programming formulation. For the two-player case, we establish the existence of optimal pure memoryless strategies. This has several important ramifications. First, it implies that the values of the games are rational. This is in contrast to the concurrent stochastic parity games of de Alfaro et al.; there, values are in general algebraic numbers, optimal strategies do not exist, and ε-optimal strategies have to be mixed and with infinite memory. Second, the existence of optimal pure memoryless strategies together with the polynomial-time solution for the one-player case implies that the quantitative two-player stochastic parity game problem is in NP ∩ co-NP. This generalizes a result of Condon for stochastic games with reachability objectives. It also constitutes an exponential improvement over the best previous algorithm, which is based on a doubly exponential procedure of de Alfaro and Majumdar for concurrent stochastic parity games and provides only ε-approximations of the values.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This work proposes a game theoretic framework for defending nodes in a sensor network, formulating the attack-defense problem as a two-player, nonzero-sum, non-cooperative game between an attacker and the sensor network, and applies three different schemes for defense.
Abstract: The limited memory and battery power of sensor nodes makes securing sensor networks difficult, and it also makes existing methods for securing other types of networks unsuitable for sensor networks. We propose a game theoretic framework for defending nodes in a sensor network. We apply three different schemes for defense. Our main concern in all three schemes is finding the most vulnerable node in a sensor network and protecting it. In the first scheme we formulate the attack-defense problem as a two-player, nonzero-sum, non-cooperative game between an attacker and a sensor network. We show that this game achieves a Nash equilibrium, leading to a defense strategy for the network. In the second scheme we use a Markov decision process to predict the most vulnerable sensor node. In the third scheme we use an intuitive metric (a node's traffic) and protect the node with the highest value of this metric. We evaluate the performance of each of these three schemes, and show that the proposed game framework significantly increases the chance of success of the defense strategy for the sensor network.

Journal ArticleDOI
TL;DR: A novel approximation method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces, and simulation results show that the product of experts approximation can be used to solve large problems.
Abstract: A novel approximation method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces. The method approximates state-action values as negative free energies in an undirected graphical model called a product of experts. The model parameters can be learned efficiently because values and derivatives can be efficiently computed for a product of experts. Actions can be found even in large factored action spaces by the use of Markov chain Monte Carlo sampling. Simulation results show that the product of experts approximation can be used to solve large problems. In one simulation it is used to find actions in action spaces of size 2^40.

Book
25 Oct 2004
TL;DR: This book presents asymptotic methods for Markov models together with their applications, including the stability of dynamic systems, filtering, Markov decision processes, LQ and mean-variance controls, production planning, and stochastic approximation.
Abstract: Prologue and Preliminaries.- Introduction, Overview, and Examples.- Mathematical Preliminaries.- Asymptotic Properties.- Asymptotic Expansions.- Occupation Measures.- Exponential Bounds.- Interim Summary and Extensions.- Applications.- Stability of Dynamic Systems.- Filtering.- Markov Decision Processes.- LQ Controls.- Mean-Variance Controls.- Production Planning.- Stochastic Approximation.

Proceedings ArticleDOI
07 Jul 2004
TL;DR: The formulation of metrics for measuring the similarity of states in a finite Markov decision process is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks.
Abstract: We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
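
A minimal sketch, assuming a small tabular MDP and hypothetical constants c_r = c_t = 0.5, of computing a bisimulation-style metric of this kind by fixed-point iteration, with the Kantorovich distance between next-state distributions solved as a transportation LP.

import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    # 1-Wasserstein distance between discrete distributions p and q
    # under ground metric d, solved as a transportation LP.
    S = len(p)
    A_eq = np.zeros((2 * S, S * S))
    for i in range(S):
        A_eq[i, i * S:(i + 1) * S] = 1.0     # mass leaving state i equals p[i]
        A_eq[S + i, i::S] = 1.0              # mass arriving at state i equals q[i]
    res = linprog(c=d.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

def bisimulation_metric(P, R, c_r=0.5, c_t=0.5, iters=30):
    # P: (A, S, S) transitions, R: (A, S) rewards.  Iterates
    # d(s, t) <- max_a [ c_r * |R(a,s) - R(a,t)| + c_t * Kantorovich(P(a,s), P(a,t); d) ]
    # from d = 0, which converges when c_t < 1.
    A, S, _ = P.shape
    d = np.zeros((S, S))
    for _ in range(iters):
        d_new = np.zeros((S, S))
        for s in range(S):
            for t in range(s + 1, S):
                vals = [c_r * abs(R[a, s] - R[a, t]) +
                        c_t * kantorovich(P[a, s], P[a, t], d) for a in range(A)]
                d_new[s, t] = d_new[t, s] = max(vals)
        d = d_new
    return d

States at distance zero can be aggregated without loss, and small distances bound the difference in their optimal values, which is what makes such a metric useful for structuring value function approximators.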

Journal ArticleDOI
TL;DR: This work presents a decision support system to investigate and improve the combined inventory and transportation system in a representative world-wide crude supply problem, and proposes an approximation architecture as a potential strategy for solving the underlying optimal control problem.

Proceedings ArticleDOI
04 Jul 2004
TL;DR: In this article, a relational Bellman update operator called REBEL is introduced to compactly represent Markov decision processes over relational domains, and a value iteration algorithm is developed in which abstraction over states and actions plays a major role.
Abstract: Motivated by the interest in relational reinforcement learning, we introduce a novel relational Bellman update operator called REBEL. It employs a constraint logic programming language to compactly represent Markov decision processes over relational domains. Using REBEL, a novel value iteration algorithm is developed in which abstraction (over states and actions) plays a major role. This framework provides new insights into relational reinforcement learning. Convergence results as well as experiments are presented.

Journal Article
TL;DR: This work captures the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type using a stochastic game model, where the learning agent is facing an adversary whose policy is arbitrary and unknown, and where the reward function is vector-valued.
Abstract: We consider the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type. The environment is initially unknown, and furthermore may be affected by the actions of other agents, actions that are observed but cannot be predicted beforehand. We capture this situation using a stochastic game model, where the learning agent is facing an adversary whose policy is arbitrary and unknown, and where the reward function is vector-valued. State recurrence conditions are imposed throughout. In our basic problem formulation, a desired target set is specified in the vector reward space, and the objective of the learning agent is to approach the target set, in the sense that the long-term average reward vector will belong to this set. We devise appropriate learning algorithms, that essentially use multiple reinforcement learning algorithms for the standard scalar reward problem, which are combined using the geometric insight from the theory of approachability for vector-valued stochastic games. We then address the more general and optimization-related problem, where a nested class of possible target sets is prescribed, and the goal of the learning agent is to approach the smallest possible target set (which will generally depend on the unknown system parameters). A particular case which falls into this framework is that of stochastic games with average reward constraints, and further specialization provides a reinforcement learning algorithm for constrained Markov decision processes. Some basic examples are provided to illustrate these results.

Proceedings ArticleDOI
19 Jul 2004
TL;DR: A class of DEC-MDPs that restricts the interactions between the agents to a structured, event-driven dependency, which can model locking a shared resource or temporal enabling constraints, both of which arise frequently in practice.
Abstract: Decentralized MDPs provide a powerful formal framework for planning in multi-agent systems, but the complexity of the model limits its usefulness. We study in this paper a class of DEC-MDPs that restricts the interactions between the agents to a structured, event-driven dependency. These dependencies can model locking a shared resource or temporal enabling constraints, both of which arise frequently in practice. The complexity of this class of problems is shown to be no harder than exponential in the number of states and doubly exponential in the number of dependencies. Since the number of dependencies is much smaller than the number of states for many problems, this is significantly better than the doubly exponential (in the state space) complexity of DEC-MDPs. We also demonstrate how an algorithm we previously developed can be used to solve problems in this class both optimally and approximately. Experimental work indicates that this solution technique is significantly faster than a naive policy search approach.

Journal ArticleDOI
TL;DR: This work models and analyzes the problem of constructing a minimum expected total cost route from an origin to a destination that anticipates and then responds to service requests, if they occur, while the vehicle is en route, and presents several structured results associated with the optimal expected cost-to-go function and an optimal policy for route construction.
Abstract: Mobile communication technologies enable communication between dispatchers and drivers and hence can enable fleet management based on real-time information. We assume that such communication capability exists for a single pickup and delivery vehicle and that we know the likelihood, as a function of time, that each of the vehicle's potential customers will make a pickup request. We then model and analyze the problem of constructing a minimum expected total cost route from an origin to a destination that anticipates and then responds to service requests, if they occur, while the vehicle is en route. We model this problem as a Markov decision process and present several structured results associated with the optimal expected cost-to-go function and an optimal policy for route construction. We illustrate the behavior of an optimal policy with several numerical examples and demonstrate the superiority of an optimal anticipatory policy, relative to a route design approach that reflects the reactive nature of current routing procedures for less-than-truckload pickup and delivery.

Proceedings Article
25 Jul 2004
TL;DR: It is shown how an arbitrary GSMDP can be approximated by a discrete-time MDP, which can then be solved using existing MDP techniques, and the introduction of phases allows us to generate higher quality policies than those obtained by standard SMDP solution techniques.
Abstract: We introduce the generalized semi-Markov decision process (GSMDP) as an extension of continuous-time MDPs and semi-Markov decision processes (SMDPs) for modeling stochastic decision processes with asynchronous events and actions. Using phase-type distributions and uniformization, we show how an arbitrary GSMDP can be approximated by a discrete-time MDP, which can then be solved using existing MDP techniques. The techniques we present can also be seen as an alternative approach for solving SMDPs, and we demonstrate that the introduction of phases allows us to generate higher quality policies than those obtained by standard SMDP solution techniques.

Proceedings ArticleDOI
07 Jul 2004
TL;DR: This work describes an approach for exploiting structure in Markov Decision Processes with continuous state variables, first with piecewise constant representations and then extended to piecewise linear representations, using techniques from POMDPs to represent and reason about linear surfaces efficiently.
Abstract: We describe an approach for exploiting structure in Markov Decision Processes with continuous state variables. At each step of the dynamic programming, the state space is dynamically partitioned into regions where the value function is the same throughout the region. We first describe the algorithm for piecewise constant representations. We then extend it to piecewise linear representations, using techniques from POMDPs to represent and reason about linear surfaces efficiently. We show that for complex, structured problems, our approach exploits the natural structure so that optimal solutions can be computed efficiently.

01 Jan 2004
TL;DR: This work introduces relativized options, a generalization of Markov sub-goal options that allows options to be defined without an absolute frame of reference, together with an extension of the options framework that allows learning simultaneously at multiple levels of the hierarchy, and provides performance guarantees for hierarchical systems that employ approximate homomorphisms, demonstrated in several test-beds.
Abstract: To operate effectively in complex environments, learning agents must ignore irrelevant details. Stated in general terms this is a very difficult problem, and much of the work in this field is specialized to specific modeling frameworks. We propose a framework for Markov decision processes (MDPs) based on homomorphisms relating MDPs. We build on the classical finite-state automata literature and develop a minimization framework for MDPs that can exploit structure and symmetries to derive smaller equivalent models of the problem. Since exact homomorphisms can be restrictive, we also consider approximate and partial homomorphisms and develop bounds for the resulting loss. Our MDP minimization results can be readily employed in reinforcement learning, in particular in a hierarchical RL approach based on the options framework. We introduce relativized options, a generalization of Markov sub-goal options, that allow us to define options without an absolute frame of reference. We introduce an extension to the options framework, based on relativized options, that allows us to learn simultaneously at multiple levels of the hierarchy, and we provide guarantees regarding the performance of hierarchical systems that employ approximate homomorphisms, validated in several test-beds. Relativized options can also be interpreted as behavioral schemas. We demonstrate that such schemas can be profitably employed in a hierarchical RL setting. We also develop algorithms that learn the appropriate parameter binding to a given schema. We empirically demonstrate the validity and utility of these algorithms. Relativized options allow us to model certain aspects of deictic or indexical representations. We develop a modification of our parameter binding algorithm suited to hierarchical RL architectures that employ deictic representations.

Proceedings Article
01 Dec 2004
TL;DR: This work considers an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed, and provides efficient algorithms whose regret bounds have no dependence on the size of the state space.
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard.

Proceedings ArticleDOI
06 Jun 2004
TL;DR: This work proposes using Markov decision processes (MDPs) to model workflow composition and demonstrates that the resulting workflows are robust to the nondeterministic behavior of Web services and adaptive to a changing environment.
Abstract: The advent of Web services has made automated workflow composition relevant to Web based applications. One technique for automatically composing workflows that has received some attention is AI-based classical planning. However, classical planning suffers from the paradox of first assuming deterministic behavior of Web services, then requiring the additional overhead of execution monitoring to recover from unexpected behavior of services. To address these concerns, we propose using Markov decision processes (MDPs) to model workflow composition. Our method models both the inherent stochastic nature of Web services and the dynamic nature of the environment. The resulting workflows are robust to nondeterministic behaviors of Web services and adaptive to a changing environment. Using an example scenario, we demonstrate our method and provide empirical results in its support.
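
A minimal sketch of this kind of formulation, with hypothetical services, success probabilities, and invocation costs: workflow stages are states, candidate Web service invocations are stochastic actions, and value iteration yields a policy that weighs cheaper but less reliable services against costlier, more reliable ones.

import numpy as np

states = ["need_quote", "need_booking", "done", "failed"]
# actions[state] = list of (service_name, invocation_cost, {next_state: probability})
actions = {
    "need_quote":   [("cheap_quote_svc",    1.0, {"need_booking": 0.7,  "failed": 0.3}),
                     ("reliable_quote_svc", 3.0, {"need_booking": 0.95, "failed": 0.05})],
    "need_booking": [("booking_svc",        2.0, {"done": 0.9, "need_booking": 0.1})],
}
reward = {"done": 20.0, "failed": -10.0}   # terminal payoffs of the workflow
gamma = 0.95

def value_iteration(iters=200):
    V = {s: 0.0 for s in states}
    policy = {}
    for _ in range(iters):
        for s in states:
            if s not in actions:                       # terminal state
                V[s] = reward.get(s, 0.0)
                continue
            best = None
            for name, cost, outcomes in actions[s]:
                q = -cost + gamma * sum(p * V[s2] for s2, p in outcomes.items())
                if best is None or q > best[0]:
                    best = (q, name)
            V[s], policy[s] = best
    return V, policy

V, policy = value_iteration()
print(policy)   # e.g. which quote service to invoke first, given failure risks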

Proceedings Article
01 Dec 2004
TL;DR: A new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique with Bounded Policy Iteration (BPI) is described.
Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

Journal ArticleDOI
TL;DR: This paper introduces the theory for multiple-objective problems with expected total discounted rewards and constraints and develops a new approach to the theory of continuous time jump Markov decision processes, based on the equivalence of strategies that change actions between jumps and the randomized strategies that change actions only at jump epochs.
Abstract: This paper introduces and develops a new approach to the theory of continuous time jump Markov decision processes (CTJMDP). This approach reduces discounted CTJMDPs to discounted semi-Markov decision processes (SMDPs) and eventually to discrete-time Markov decision processes (MDPs). The reduction is based on the equivalence of strategies that change actions between jumps and the randomized strategies that change actions only at jump epochs. This holds both for one-criterion problems and for multiple-objective problems with constraints. In particular, this paper introduces the theory for multiple-objective problems with expected total discounted rewards and constraints. If a problem is feasible, there exist three types of optimal policies: (i) nonrandomized switching stationary policies, (ii) randomized stationary policies for the CTJMDP, and (iii) randomized stationary policies for the corresponding SMDP with exponentially distributed sojourn times, and these policies can be implemented as randomized strategies in the CTJMDP.
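
A minimal sketch of the standard reduction for exponentially distributed sojourn times that this line of work builds on (the paper's construction is more general, also covering strategies that change actions between jumps and constrained multi-objective problems; the rates, rewards, and transitions below are hypothetical): with discount rate alpha and jump rate lam(s, a), a sojourn contributes expected discounted reward r(s, a) / (alpha + lam(s, a)), and the next state is discounted by lam(s, a) / (alpha + lam(s, a)), which gives an equivalent discrete-time MDP.

import numpy as np

def ctmdp_value_iteration(lam, r, P, alpha=0.1, iters=2000):
    # lam, r: (S, A) exponential jump rates and reward rates; P: (A, S, S)
    # embedded jump probabilities; alpha: continuous-time discount rate.
    beta = lam / (alpha + lam)      # expected discount accumulated by the next jump
    rho = r / (alpha + lam)         # expected discounted reward earned during one sojourn
    S, A = lam.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = rho + beta * (P @ V).T  # (S, A) discrete-time Bellman operator
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

# Hypothetical 2-state, 2-action example.
lam = np.array([[1.0, 2.0], [0.5, 3.0]])
r = np.array([[1.0, 0.5], [2.0, 0.0]])
P = np.array([[[0.2, 0.8], [0.6, 0.4]],     # P[a, s, s']
              [[0.5, 0.5], [0.9, 0.1]]])
V, policy = ctmdp_value_iteration(lam, r, P)
print("values:", np.round(V, 3), "policy:", policy)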

Journal ArticleDOI
TL;DR: This work develops a new way to combine heuristic solutions through dynamic programming in the state space that the heuristics generate by using a discrete time Markov chain, which enables us to model probabilistic correlation of the uncertain parameters.