
Showing papers on "Markov decision process published in 2002"


Journal ArticleDOI
TL;DR: This work considers decentralized control of Markov decision processes and gives complexity bounds on the worst-case running time for algorithms that find optimal solutions and describes generalizations that allow for decentralized control.
Abstract: We consider decentralized control of Markov decision processes and give complexity bounds on the worst-case running time for algorithms that find optimal solutions. Generalizations of both the fully observable case and the partially observable case that allow for decentralized control are described. For even two agents, the finite-horizon problems corresponding to both of these models are hard for nondeterministic exponential time. These complexity results illustrate a fundamental difference between centralized and decentralized control of Markov decision processes. In contrast to the problems involving centralized control, the problems we consider provably do not admit polynomial-time algorithms. Furthermore, assuming EXP ≠ NEXP, the problems require superexponential time to solve in the worst case.

930 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the number of actions required to approach the optimal return is lower bounded by the mixing time of the optimal policy (in the undiscounted case) or by the horizon time T in the discounted case.
Abstract: We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

802 citations
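The explicit exploration/exploitation handling mentioned in the abstract above can be illustrated with a minimal sketch of "known state" bookkeeping in the E^3 style. Everything here (the visit threshold, the `plan_exploit`/`plan_explore` planner hooks, and the escape-probability rule) is an illustrative assumption, not the paper's exact algorithm.

```python
from collections import defaultdict

class ExploreExploitAgent:
    def __init__(self, actions, known_threshold=50):
        self.actions = actions
        self.known_threshold = known_threshold                  # visits needed before a state counts as "known"
        self.counts = defaultdict(lambda: defaultdict(int))     # counts[s][a] = times action a was tried in s

    def is_known(self, s):
        # A state is "known" once every action has been tried often enough
        # for the empirical model at s to be reasonably accurate.
        return all(self.counts[s][a] >= self.known_threshold for a in self.actions)

    def act(self, s, plan_exploit, plan_explore, escape_threshold=0.1):
        if not self.is_known(s):
            # Balanced wandering: try the least-attempted action to gather statistics.
            return min(self.actions, key=lambda a: self.counts[s][a])
        # In the known part of the MDP, either exploit the empirical model or deliberately
        # head toward unknown states; plan_exploit/plan_explore are caller-supplied planners.
        exploit_action, _ = plan_exploit(s)              # best action on the empirical known-state MDP
        explore_action, escape_prob = plan_explore(s)    # action maximizing probability of reaching unknown states
        # If unknown states are still easy to reach, keep exploring; otherwise exploit.
        return explore_action if escape_prob > escape_threshold else exploit_action

    def record(self, s, a):
        self.counts[s][a] += 1
```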


Book ChapterDOI
TL;DR: PRISM has been successfully used to analyse probabilistic termination, performance, and quality of service properties for a range of systems, including randomized distributed algorithms, manufacturing systems and workstation clusters.
Abstract: In this paper we describe PRISM, a tool being developed at the University of Birmingham for the analysis of probabilistic systems. PRISM supports three probabilistic models: discrete-time Markov chains, Markov decision processes and continuous-time Markov chains. Analysis is performed through model checking such systems against specifications written in the probabilistic temporal logics PCTL and CSL. The tool features three model checking engines: one symbolic, using BDDs (binary decision diagrams) and MTBDDs (multi-terminal BDDs); one based on sparse matrices; and one which combines both symbolic and sparse matrix methods. PRISM has been successfully used to analyse probabilistic termination, performance, and quality of service properties for a range of systems, including randomized distributed algorithms, manufacturing systems and workstation clusters.

717 citations


Journal ArticleDOI
TL;DR: A unified framework for multiagent teamwork, the COMmunicative Multiagent Team Decision Problem (COM-MTDP), which combines and extends existing multiagent theories, and provides a basis for the development of novel team coordination algorithms.
Abstract: Despite the significant progress in multiagent teamwork, existing research does not address the optimality of its prescriptions nor the complexity of the teamwork problem. Without a characterization of the optimality-complexity tradeoffs, it is impossible to determine whether the assumptions and approximations made by a particular theory gain enough efficiency to justify the losses in overall performance. To provide a tool for use by multiagent researchers in evaluating this tradeoff, we present a unified framework, the COMmunicative Multiagent Team Decision Problem (COM-MTDP). The COM-MTDP model combines and extends existing multiagent theories, such as decentralized partially observable Markov decision processes and economic team theory. In addition to their generality of representation, COM-MTDPs also support the analysis of both the optimality of team performance and the computational complexity of the agents' decision problem. In analyzing complexity, we present a breakdown of the computational complexity of constructing optimal teams under various classes of problem domains, along the dimensions of observability and communication cost. In analyzing optimality, we exploit the COM-MTDP's ability to encode existing teamwork theories and models to encode two instantiations of joint intentions theory taken from the literature. Furthermore, the COM-MTDP model provides a basis for the development of novel team coordination algorithms. We derive a domain-independent criterion for optimal communication and provide a comparative analysis of the two joint intentions instantiations with respect to this optimal policy. We have implemented a reusable, domain-independent software package based on COM-MTDPs to analyze teamwork coordination strategies, and we demonstrate its use by encoding and evaluating the two joint intentions strategies within an example domain.

428 citations


Journal ArticleDOI
TL;DR: This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.
Abstract: A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor γ and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration—rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).

416 citations
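A minimal sketch of the sparse-sampling planner described above. The `generative_model(s, a) -> (reward, next_state)` interface and the constants C and gamma are assumptions made for illustration; the key property is that the cost per planned action depends on the sampling width and horizon, not on the number of states.

```python
def sparse_sampling_value(s, h, generative_model, actions, C=20, gamma=0.95):
    """Estimate V_h(s) by building a sampled look-ahead tree of depth h.

    generative_model(s, a) -> (reward, next_state) is an assumed simulator interface;
    C (samples per action) and gamma are illustrative constants.
    """
    if h == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(C):
            r, s_next = generative_model(s, a)
            total += r + gamma * sparse_sampling_value(s_next, h - 1, generative_model, actions, C, gamma)
        best = max(best, total / C)
    return best

def sparse_sampling_action(s, h, generative_model, actions, C=20, gamma=0.95):
    # Choose the action whose sampled Q-estimate at the root is largest.
    def q(a):
        samples = [generative_model(s, a) for _ in range(C)]
        return sum(r + gamma * sparse_sampling_value(s2, h - 1, generative_model, actions, C, gamma)
                   for r, s2 in samples) / C
    return max(actions, key=q)
```

Note that the recursion makes the cost exponential in the depth h, which matches the horizon-time/state-space trade-off described in the abstract.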


Book ChapterDOI
08 Jul 2002
TL;DR: The bandit problem is revisited and considered under the PAC model, and it is shown that given n arms, it suffices to pull the arms O(n/ε² log 1/δ) times to find an ε-optimal arm with probability of at least 1 - δ.
Abstract: The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that given n arms, it suffices to pull the arms O(n/ε² log 1/δ) times to find an ε-optimal arm with probability of at least 1 - δ. This is in contrast to the naive bound of O(n/ε² log n/δ). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound. We show how, given an algorithm for the PAC model multi-armed bandit problem, one can derive a batch learning algorithm for Markov decision processes. This is done essentially by simulating Value Iteration, and in each iteration invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem we improve the dependence on the number of actions.

392 citations
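The improvement over the naive bound comes from a median-elimination-style procedure; the sketch below conveys the idea under an assumed `pull(arm)` reward interface and assumed constants, and should not be read as the paper's exact sampling schedule.

```python
import math

def median_elimination(arms, pull, epsilon, delta):
    """Return an (approximately) epsilon-optimal arm with probability at least 1 - delta.

    `pull(arm)` is an assumed callable returning a reward in [0, 1]; the sampling
    schedule and constants below are illustrative, not the paper's.
    """
    surviving = list(arms)
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Sample every surviving arm enough times for roughly eps_l/2 accuracy w.h.p.
        n_samples = int(4.0 / (eps_l / 2.0) ** 2 * math.log(3.0 / delta_l)) + 1
        means = {a: sum(pull(a) for _ in range(n_samples)) / n_samples for a in surviving}
        # Discard the empirically worse half of the arms, then tighten the schedule.
        ranked = sorted(surviving, key=lambda a: means[a], reverse=True)
        surviving = ranked[: (len(ranked) + 1) // 2]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return surviving[0]
```

Because the per-round sample sizes shrink geometrically while the number of arms halves, the total number of pulls stays O(n/ε² log 1/δ) rather than picking up the log n factor of the naive bound.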


Journal ArticleDOI
TL;DR: This paper evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
Abstract: The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently-calculable measure of the extent to which changes in some state affect the value function of some other states. Variance is an efficiently-calculable measure of how risky some state is in a Markov chain: a low variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum phase, and two-dimensional problem of the “Car on the hill”. It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.

360 citations


BookDOI
TL;DR: In this book, the authors combine interactive processes and Markov chains into Interactive Markov Chains (IMC), develop an algebra for them, and illustrate their use in practice.
Abstract: Interactive Processes.- Markov Chains.- Interactive Markov Chains.- Algebra of Interactive Markov Chains.- Interactive Markov Chains in Practice.- Conclusion.- Proofs for Chapter 3 and Chapter 4.- Proofs for Chapter 5.

342 citations



Book
01 Jan 2002
TL;DR: In this article, the authors present an overview of Markov Decision Processes in the context of communication networks and their applications in finance and dynamic options, including a discussion of the role of the Poisson Equation for Countable Markov Chains.
Abstract: 1. Introduction E.A. Feinberg, A. Shwartz. Part I: Finite State and Action Models. 2. Finite State and Action MDPs L. Kallenberg. 3. Bias Optimality M.E. Lewis, M.L. Puterman. 4. Singular Perturbations of Markov Chains and Decision Processes K.E. Avrachenkov, et al. Part II: Infinite State Models. 5. Average Reward Optimization Theory for Denumerable State Spaces L.I. Sennott. 6. Total Reward Criteria E.A. Feinberg. 7. Mixed Criteria E.A. Feinberg, A. Shwartz. 8. Blackwell Optimality A. Hordijk, A.A. Yushkevich. 9. The Poisson Equation for Countable Markov Chains: Probabilistic Methods and Interpretations A.M. Makowski, A. Shwartz. 10. Stability, Performance Evaluation, and Optimization S.P. Meyn. 11. Convex Analytic Methods in Markov Decision Processes V.S. Borkar. 12. The Linear Programming Approach O. Hernandez-Lerma, J.B. Lasserre. 13. Invariant Gambling Problems and Markov Decision Processes L.E. Dubins, et al. Part III: Applications. 14. Neuro-Dynamic Programming: Overview and Recent Trends B. Van Roy. 15. Markov Decision Processes in Finance and Dynamic Options M. Schal. 16. Applications of Markov Decision Processes in Communication Networks E. Altman. 17. Water Reservoir Applications of Markov Decision Processes B.F. Lamond, A. Boukhtouta. Index.

281 citations


Proceedings ArticleDOI
28 Jul 2002
TL;DR: Methods that exploit the special structure of preference elicitation to deal with parameterized belief states over the continuous state space, and gradient techniques for optimizing parameterized actions are described.
Abstract: Preference elicitation is a key problem facing the deployment of intelligent systems that make or recommend decisions on behalf of users. Since not all aspects of a utility function have the same impact on object-level decision quality, determining which information to extract from a user is itself a sequential decision problem, balancing the amount of elicitation effort and time with decision quality. We formulate this problem as a partially-observable Markov decision process (POMDP). Because of the continuous nature of the state and action spaces of this POMDP, standard techniques cannot be used to solve it. We describe methods that exploit the special structure of preference elicitation to deal with parameterized belief states over the continuous state space, and gradient techniques for optimizing parameterized actions. These methods can be used with a number of different belief state representations, including mixture models.

Book ChapterDOI
19 Aug 2002
TL;DR: The Q-Cut algorithm is presented, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm, and extended to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments.

Abstract: We present the Q-Cut algorithm, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm. The learning agent creates an on-line map of the process history, and uses an efficient Max-Flow/Min-Cut algorithm for identifying bottlenecks. The policies for reaching bottlenecks are separately learned and added to the model in the form of options (macro-actions). We then extend the basic Q-Cut algorithm to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements, particularly in the initial learning phase.
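A rough sketch of the bottleneck-identification step: build a graph from the observed transition history and run a max-flow/min-cut computation (here via networkx). The graph construction, the use of raw visit counts as capacities, and the source/target choice are illustrative assumptions rather than the paper's exact procedure.

```python
import networkx as nx

def find_bottlenecks(transition_history, source, target):
    """Identify candidate bottleneck states from an observed state-transition history.

    `transition_history` is assumed to be a list of (s, s_next) pairs collected on-line;
    `source` and `target` delimit the segment being cut (e.g., the start state and a
    frontier state). Edge capacities are raw visit counts, an illustrative choice.
    """
    g = nx.DiGraph()
    for s, s_next in transition_history:
        if g.has_edge(s, s_next):
            g[s][s_next]["capacity"] += 1
        else:
            g.add_edge(s, s_next, capacity=1)
    cut_value, (reachable, non_reachable) = nx.minimum_cut(g, source, target)
    # States on the source side of cut edges are candidate bottlenecks around which
    # new options (macro-actions) can be learned.
    bottlenecks = {u for u in reachable for v in g[u] if v in non_reachable}
    return cut_value, bottlenecks
```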

Proceedings Article
01 Aug 2002
TL;DR: The use of an n-gram predictive model is suggested for generating the initial MDP, which induces a Markov chain model of user behavior whose predictive accuracy is greater than that of existing predictive models.

Abstract: Typical Recommender systems adopt a static view of the recommendation process and treat it as a prediction problem. We argue that it is more appropriate to view the problem of generating recommendations as a sequential decision problem and, consequently, that Markov decision processes (MDP) provide a more appropriate model for Recommender systems. MDPs introduce two benefits: they take into account the long-term effects of each recommendation, and they take into account the expected value of each recommendation. To succeed in practice, an MDP-based Recommender system must employ a strong initial model; and the bulk of this paper is concerned with the generation of such a model. In particular, we suggest the use of an n-gram predictive model for generating the initial MDP. Our n-gram model induces a Markov chain model of user behavior whose predictive accuracy is greater than that of existing predictive models. We describe our predictive model in detail and evaluate its performance on real data. In addition, we show how the model can be used in an MDP-based Recommender system.
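A minimal sketch of how k-gram counts over user sessions might seed the initial MDP's transition model, as described above; the state definition (the last k items) and the add-one smoothing are illustrative choices, not necessarily the paper's.

```python
from collections import defaultdict

def build_ngram_transitions(sessions, k=2):
    """Build an initial transition model from user sessions via k-gram counts.

    States are the last k items a user interacted with, and the next item plays the
    role of the recommendation outcome. Maximum-likelihood estimates with add-one
    smoothing over observed successors are used purely for illustration.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(len(session) - k):
            state = tuple(session[i:i + k])          # the k most recent items
            next_item = session[i + k]
            counts[state][next_item] += 1
    transitions = {}
    for state, successors in counts.items():
        total = sum(successors.values()) + len(successors)   # add-one smoothing
        transitions[state] = {item: (c + 1) / total for item, c in successors.items()}
    return transitions

# Usage sketch: sessions = [["a", "b", "c"], ["a", "b", "d"]]
# build_ngram_transitions(sessions, k=2)[("a", "b")] -> {"c": 0.5, "d": 0.5}
```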


Journal ArticleDOI
TL;DR: A novel approach to adjustable autonomy is presented, based on the notion of a transfer-of-control strategy, which guides and informs the operationalization of the strategies using Markov Decision Processes, which select an optimal strategy, given an uncertain environment and costs to the individuals and teams.
Abstract: Adjustable autonomy refers to entities dynamically varying their own autonomy, transferring decision-making control to other entities (typically agents transferring control to human users) in key situations. Determining whether and when such transfers-of-control should occur is arguably the fundamental research problem in adjustable autonomy. Previous work has investigated various approaches to addressing this problem but has often focused on individual agent-human interactions. Unfortunately, domains requiring collaboration between teams of agents and humans reveal two key shortcomings of these previous approaches. First, these approaches use rigid one-shot transfers of control that can result in unacceptable coordination failures in multiagent settings. Second, they ignore costs (e.g., in terms of time delays or effects on actions) to an agent's team due to such transfers-of-control. To remedy these problems, this article presents a novel approach to adjustable autonomy, based on the notion of a transfer-of-control strategy. A transfer-of-control strategy consists of a conditional sequence of two types of actions: (i) actions to transfer decision-making control (e.g., from an agent to a user or vice versa) and (ii) actions to change an agent's pre-specified coordination constraints with team members, aimed at minimizing miscoordination costs. The goal is for high-quality individual decisions to be made with minimal disruption to the coordination of the team. We present a mathematical model of transfer-of-control strategies. The model guides and informs the operationalization of the strategies using Markov Decision Processes, which select an optimal strategy, given an uncertain environment and costs to the individuals and teams. The approach has been carefully evaluated, including via its use in a real-world, deployed multi-agent system that assists a research group in its daily activities.

Journal ArticleDOI
TL;DR: This paper investigates the effect of demand censoring on the optimal policy in newsvendor inventory models with general parametric demand distributions and unknown parameter values and shows that the optimal inventory level in the presence of censored demand is higher than would be determined using a Bayesian myopic policy.

Abstract: This paper investigates the effect of demand censoring on the optimal policy in newsvendor inventory models with general parametric demand distributions and unknown parameter values. We show that the newsvendor problem with observable lost sales reduces to a sequence of single-period problems, while the newsvendor problem with unobservable lost sales requires a dynamic analysis. Using a Bayesian Markov decision process approach we show that the optimal inventory level in the presence of censored demand is higher than would be determined using a Bayesian myopic policy. We explore the economic rationality for this observation and illustrate it with numerical examples.
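A small grid-based sketch of the Bayesian update under censored demand. Poisson demand, the discrete grid of candidate rates, and the stock/sales interface are all assumptions made for illustration, not the paper's general parametric setting.

```python
import math

def update_posterior(prior, stock, sales):
    """One Bayesian update of a discretized posterior over a Poisson demand rate.

    `prior` maps candidate rates lam -> probability. If sales < stock, demand was
    observed exactly; if sales == stock, we only learn that demand >= stock (censoring).
    """
    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    posterior = {}
    for lam, p in prior.items():
        if sales < stock:                               # exact demand observation
            likelihood = poisson_pmf(sales, lam)
        else:                                           # censored: demand >= stock
            likelihood = 1.0 - sum(poisson_pmf(k, lam) for k in range(stock))
        posterior[lam] = p * likelihood
    norm = sum(posterior.values())
    return {lam: p / norm for lam, p in posterior.items()}

# Usage sketch: a censored observation (sales == stock) shifts posterior mass toward
# larger demand rates, one way to see why stocking above the myopic level pays off.
prior = {lam: 1 / 5 for lam in (2, 4, 6, 8, 10)}
posterior = update_posterior(prior, stock=5, sales=5)
```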

Journal ArticleDOI
TL;DR: This paper presents an approach to manage inventory decisions at all stages of the supply chain in an integrated manner that allows an inventory order policy to be determined, aimed at optimizing the performance of the whole supply chain.

Book ChapterDOI
TL;DR: This work model the two-way handshake mechanism of the IEEE 802.11 standard with a fixed network topology using Probabilistic timed automata, a formal description mechanism in which both nondeterministic choice and probabilistic choice can be represented.
Abstract: The international standard IEEE 802.11 was developed recently in recognition of the increased demand for wireless local area networks. Its medium access control mechanism is described according to a variant of the Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) scheme. Although collisions cannot always be prevented, randomised exponential backoff rules are used in the retransmission scheme to minimise the likelihood of repeated collisions. More precisely, the backoff procedure involves a uniform probabilistic choice of an integer-valued delay from an interval, where the size of the interval grows exponentially with regard to the number of retransmissions of the current data packet. We model the two-way handshake mechanism of the IEEE 802.11 standard with a fixed network topology using probabilistic timed automata, a formal description mechanism in which both nondeterministic choice and probabilistic choice can be represented. From our probabilistic timed automaton model, we obtain a finite-state Markov decision process via a property-preserving discrete-time semantics. The Markov decision process is then verified using Prism, a probabilistic model checking tool, against probabilistic, timed properties such as "at most 5,000 microseconds pass before a station sends its packet correctly."

Journal ArticleDOI
TL;DR: For risk-sensitive control of finite Markov chains a counterpart of the popular Q-learning algorithm for classical Markov decision processes is proposed, and the algorithm is shown to converge with probability one to the desired solution.
Abstract: We propose for risk-sensitive control of finite Markov chains a counterpart of the popular Q-learning algorithm for classical Markov decision processes. The algorithm is shown to converge with probability one to the desired solution. The proof technique is an adaptation of the o.d.e. approach for the analysis of stochastic approximation algorithms, with most of the work involved used for the analysis of the specific o.d.e.s that arise.

Journal ArticleDOI
TL;DR: The existence of an optimal feedback law is established for the risk-sensitive optimal control problem with denumerable state space and a solution can be found constructively using either value iteration or policy iteration under suitable conditions on initial feedback law.
Abstract: The existence of an optimal feedback law is established for the risk-sensitive optimal control problem with denumerable state space. The main assumptions imposed are irreducibility and a near-monotonicity condition on the one-step cost function. A solution can be found constructively using either value iteration or policy iteration under suitable conditions on the initial feedback law.

Journal ArticleDOI
TL;DR: The optimal dynamic policy is shown to have a rather complex structure, which motivates the consideration of more implementable policies; a double-threshold policy is presented, and exact and approximate methods for evaluating the performance of this policy and computing its optimal parameters are derived.
Abstract: We consider a make-to-stock production/inventory system consisting of a single deteriorating machine which produces a single item. We formulate the integrated decisions of maintenance and production using a Markov Decision Process. The optimal dynamic policy is shown to have a rather complex structure which leads us to consider more implementable policies. We present a double-threshold policy and derive exact and approximate methods for evaluating the performance of this policy and computing its optimal parameters. A detailed numerical study demonstrates that the proposed policy and our approximate method for computing its parameters perform extremely well. Finally, we show that policies which do not address maintenance and production control decisions in an integrated manner can perform rather badly.
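To make the double-threshold idea concrete, here is a toy simulation of an (S, M) policy: produce while inventory is below a base-stock level S, and perform maintenance once machine wear reaches M. The dynamics and all parameters are invented for illustration and are not the paper's model or evaluation method.

```python
import random

def simulate_double_threshold(horizon=10_000, S=5, M=3,
                              wear_prob=0.1, demand_prob=0.4, worn_yield=0.5, seed=0):
    """Simulate an (S, M) double-threshold policy on a toy deteriorating machine.

    S: produce whenever inventory < S; M: maintain (reset wear) whenever wear >= M.
    """
    rng = random.Random(seed)
    inventory, wear, served, maintained = 0, 0, 0, 0
    for _ in range(horizon):
        if wear >= M:                          # maintenance threshold reached: restore machine
            wear, maintained = 0, maintained + 1
        elif inventory < S:                    # base-stock threshold: keep producing
            good = rng.random() < (1.0 if wear == 0 else worn_yield)
            inventory += 1 if good else 0
            wear += 1 if rng.random() < wear_prob else 0
        if rng.random() < demand_prob and inventory > 0:   # serve demand if stock is available
            inventory -= 1
            served += 1
    return {"served": served, "maintenance_actions": maintained}
```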

Book ChapterDOI
01 Jan 2002
TL;DR: The convex analytic approach to classical Markov decision processes wherein they are cast as a static convex programming problem in the space of measures is described.
Abstract: This article describes the convex analytic approach to classical Markov decision processes wherein they are cast as a static convex programming problem in the space of measures. Applications to multiobjective problems are described.
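In the discounted case, the static convex program mentioned above is a linear program over occupation measures: maximize Σ_{s,a} x(s,a) r(s,a) subject to Σ_a x(s',a) − γ Σ_{s,a} P(s'|s,a) x(s,a) = μ(s') and x ≥ 0. The sketch below solves a made-up two-state MDP with scipy; the problem data are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, mu, gamma=0.9):
    """Solve a discounted MDP via the occupation-measure linear program.

    P[s, a, s'] are transition probabilities, R[s, a] one-step rewards,
    mu the initial-state distribution.
    """
    S, A, _ = P.shape
    c = -R.reshape(S * A)                                   # linprog minimizes, so negate rewards
    A_eq = np.zeros((S, S * A))
    for s2 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s2, s * A + a] = (1.0 if s == s2 else 0.0) - gamma * P[s, a, s2]
    res = linprog(c, A_eq=A_eq, b_eq=mu, bounds=[(0, None)] * (S * A), method="highs")
    x = res.x.reshape(S, A)
    policy = x.argmax(axis=1)                               # the optimal deterministic action carries the mass
    return x, policy

# Toy example (made up): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
x, policy = solve_mdp_lp(P, R, mu=np.array([0.5, 0.5]))
```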

Proceedings ArticleDOI
28 Jul 2002
TL;DR: An algorithm is presented for coordinated decision making in cooperative multiagent settings, where the agents' value function can be represented as a sum of context-specific value rules, and the joint value function of the associated Markov Decision Process is approximated as a set of value rules using an efficient linear programming algorithm.
Abstract: We present an algorithm for coordinated decision making in cooperative multiagent settings, where the agents' value function can be represented as a sum of context-specific value rules. The task of finding an optimal joint action in this setting leads to an algorithm where the coordination structure between agents depends on the current state of the system and even on the actual numerical values assigned to the value rules. We apply this framework to the task of multiagent planning in dynamic systems, showing how a joint value function of the associated Markov Decision Process can be approximated as a set of value rules using an efficient linear programming algorithm. The agents then apply the coordination graph algorithm at each iteration of the process to decide on the highest-value joint action, potentially leading to a different coordination pattern at each step of the plan.
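The joint-action selection step described above can be illustrated on a tiny chain-structured coordination graph: eliminate one agent at a time, then back-substitute its best response. The pairwise value tables below are made-up numbers, and binary actions are assumed for each agent.

```python
from itertools import product

# Pairwise value rules on a chain-structured coordination graph A1 - A2 - A3.
# Tables are made-up numbers; each agent chooses action 0 or 1.
f12 = {(a1, a2): [[2, 0], [1, 3]][a1][a2] for a1, a2 in product((0, 1), repeat=2)}
f23 = {(a2, a3): [[1, 4], [0, 2]][a2][a3] for a2, a3 in product((0, 1), repeat=2)}

# Eliminate agent A3: for each a2, record A3's best-response value and choice.
g2 = {a2: max(f23[(a2, a3)] for a3 in (0, 1)) for a2 in (0, 1)}
best_a3 = {a2: max((0, 1), key=lambda a3: f23[(a2, a3)]) for a2 in (0, 1)}

# Maximize over the remaining agents A1 and A2, then back-substitute A3's choice.
best_a1, best_a2 = max(product((0, 1), repeat=2), key=lambda p: f12[p] + g2[p[1]])
joint_action = (best_a1, best_a2, best_a3[best_a2])
print(joint_action)   # coordination emerges without enumerating all 8 joint actions
```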

Journal ArticleDOI
TL;DR: In this article, the authors study the employee staffing problem in a service organization that uses employee service capacity to meet random, nonstationary service requirements, and develop a Markov Decision Process (MDP) model which explicitly represents the stochastic nature of these effects.
Abstract: We study the employee staffing problem in a service organization that uses employee service capacity to meet random, nonstationary service requirements. The employees experience learning and turnover on the job, and we develop a Markov Decision Process (MDP) model which explicitly represents the stochastic nature of these effects. Theoretical results show that the optimal hiring policy is of a state-dependent "hire-up-to" type, similar to an inventory "order-up-to" policy. For two important special cases, a myopic policy is optimal. We also test a linear programming (LP) based heuristic, which uses average learning and turnover behavior, in stationary environments. In most cases, the LP-based policy performs quite well, within 1% of optimality. When flexible capacity--in the form of overtime or outsourcing--is expensive or not available, however, explicit modeling of stochastic learning and turnover effects may improve performance significantly.

Book ChapterDOI
01 Jan 2002
TL;DR: This chapter presents a survey of applications of MDPs to communication networks, together with the theoretical tools that have been developed to model and to solve the resulting control problems.
Abstract: We present in this chapter a survey on applications of MDPs to communication networks. We survey both the different application areas in communication networks as well as the theoretical tools that have been developed to model and to solve the resulting control problems.

Journal ArticleDOI
TL;DR: This paper presents results that demonstrate that there are other practical control policies that almost always provide much better solutions for this problem than the CEC policies commonly used in practice.
Abstract: This paper examines several different policies for an inventory control problem in which the demand process is nonstationary and partially observed. The probability distribution for the demand in each period is determined by the state of a Markov chain, the core process. However, the state of this core process is not directly observed; only the actual demand is observed by the decision maker. Given this demand process, the inventory control problem is a composite-state, partially observed Markov decision process (POMDP), which is an appropriate model for a number of dynamic demand problems. In practice, managers often use certainty equivalent control (CEC) policies to solve such a problem. However, this paper presents results that demonstrate that there are other practical control policies that almost always provide much better solutions for this problem than the CEC policies commonly used in practice. The computational results also indicate how specific problem characteristics influence the performance of each of the alternative policies.

Journal ArticleDOI
TL;DR: It is shown that although the start-up company should be more conservative in its component purchasing strategy than if it were a well-established company, it should not be too conservative, nor is its strategy monotone in the amount of capital it has available.
Abstract: New start-up companies, which are considered to be a vital ingredient in a successful economy, have a different objective than established companies: They want to maximise their chance of long-term survival. We examine the implications for their operating decisions of this different criterion by considering an abstraction of the inventory problem faced by a start-up manufacturing company. The problem is modelled under two criteria as a Markov decision process; the characteristics of the optimal policies under the two criteria are compared. It is shown that although the start-up company should be more conservative in its component purchasing strategy than if it were a well-established company, it should not be too conservative. Nor is its strategy monotone in the amount of capital it has available. The models are extended to allow for interest on investment and inflation.

Proceedings ArticleDOI
28 Jul 2002
TL;DR: A planning algorithm is described that integrates two approaches to solving Markov decision processes with large state spaces in a novel way that exploits symbolic model-checking techniques and demonstrates their usefulness for decision-theoretic planning.
Abstract: We describe a planning algorithm that integrates two approaches to solving Markov decision processes with large state spaces. State abstraction is used to avoid evaluating states individually. Forward search from a start state, guided by an admissible heuristic, is used to avoid evaluating all states. We combine these two approaches in a novel way that exploits symbolic model-checking techniques and demonstrates their usefulness for decision-theoretic planning.

Journal ArticleDOI
TL;DR: The polyhedral foundation of the PCL framework is developed, based on the structural and algorithmic properties of a new polytope associated with an accessible set system (an extended polymatroid), and PCL-indexability is interpreted as a form of the classic economics law of diminishing marginal returns.
Abstract: This paper develops a polyhedral approach to the design, analysis, and computation of dynamic allocation indices for scheduling binary-action (engage/rest) Markovian stochastic projects which can change state when rested (restless bandits (RBs)), based on partial conservation laws (PCLs). This extends previous work by the author [J. Nino-Mora (2001): Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab. 33, 76–98], where PCLs were shown to imply the optimality of index policies with a postulated structure in stochastic scheduling problems, under admissible linear objectives, and they were deployed to obtain simple sufficient conditions for the existence of Whittle's (1988) RB index (indexability), along with an adaptive-greedy index algorithm. The new contributions include: (i) we develop the polyhedral foundation of the PCL framework, based on the structural and algorithmic properties of a new polytope associated with an accessible set system (an extended polymatroid); (ii) we present new dynamic allocation indices for RBs, motivated by an admission control model, which extend Whittle's and have a significantly increased scope; (iii) we deploy PCLs to obtain both sufficient conditions for the existence of the new indices (PCL-indexability), and a new adaptive-greedy index algorithm; (iv) we interpret PCL-indexability as a form of the classic economics law of diminishing marginal returns, and characterize the index as an optimal marginal cost rate; we further solve a related optimal constrained control problem; (v) we carry out a PCL-indexability analysis of the motivating admission control model, under time-discounted and long-run average criteria; this gives, under mild conditions, a new index characterization of optimal threshold policies; and (vi) we apply the latter to present new heuristic index policies for two hard queueing control problems: admission control and routing to parallel queues; and scheduling a multiclass make-to-stock queue with lost sales, both under state-dependent holding cost rates and birth-death dynamics.

Journal ArticleDOI
01 Jun 2002
TL;DR: It is suggested that genetic algorithms are probably the most general approach for adding generalization, although they might not be the only solution.

Abstract: We analyze learning classifier systems in the light of tabular reinforcement learning. We note that although genetic algorithms are the most distinctive feature of learning classifier systems, it is not clear whether genetic algorithms are important to learning classifier systems. In fact, there are models which are strongly based on evolutionary computation (e.g., Wilson's XCS) and others which do not exploit evolutionary computation at all (e.g., Stolzmann's ACS). To find some clarifications, we try to develop learning classifier systems “from scratch”, i.e., starting from one of the best-known reinforcement learning techniques, Q-learning. We first consider the basics of reinforcement learning: a problem modeled as a Markov decision process and tabular Q-learning. We introduce a formal framework to define a general purpose rule-based representation which we use to implement tabular Q-learning. We formally define generalization within rules and discuss the possible approaches to extend our rule-based Q-learning with generalization capabilities. We suggest that genetic algorithms are probably the most general approach for adding generalization, although they might not be the only solution.
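For reference, a plain tabular Q-learning loop of the kind the analysis above takes as its starting point; the environment interface (`reset`, `step`, `actions`) and the hyper-parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Plain tabular Q-learning.

    `env` is assumed to expose reset() -> state and step(a) -> (next_state, reward, done),
    with a list of discrete actions in env.actions.
    """
    q = defaultdict(float)                       # q[(state, action)] defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: q[(s, act)])
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(q[(s_next, b)] for b in env.actions))
            q[(s, a)] += alpha * (target - q[(s, a)])    # standard Q-learning update
            s = s_next
    return q
```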