
Showing papers on "Markov decision process published in 2002"


Journal ArticleDOI
TL;DR: This work considers decentralized control of Markov decision processes and gives complexity bounds on the worst-case running time for algorithms that find optimal solutions and describes generalizations that allow for decentralized control.
Abstract: We consider decentralized control of Markov decision processes and give complexity bounds on the worst-case running time for algorithms that find optimal solutions. Generalizations of both the fully observable case and the partially observable case that allow for decentralized control are described. For even two agents, the finite-horizon problems corresponding to both of these models are hard for nondeterministic exponential time. These complexity results illustrate a fundamental difference between centralized and decentralized control of Markov decision processes. In contrast to the problems involving centralized control, the problems we consider provably do not admit polynomial-time algorithms. Furthermore, assuming EXP ≠ NEXP, the problems require superexponential time to solve in the worst case.

930 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the number of actions required to approach the optimal return is lower bounded by the mixing time of the optimal policy (in the undiscounted case) or by the horizon time T in the discounted case.
Abstract: We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

802 citations
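The explicit exploration/exploitation handling mentioned in the abstract above can be illustrated with a minimal sketch of "known state" bookkeeping in the E^3 style. Everything here (the visit threshold, the `plan_exploit`/`plan_explore` planner hooks, and the escape-probability rule) is an illustrative assumption, not the paper's exact algorithm.

```python
from collections import defaultdict

class ExploreExploitAgent:
    def __init__(self, actions, known_threshold=50):
        self.actions = actions
        self.known_threshold = known_threshold                  # visits needed before a state counts as "known"
        self.counts = defaultdict(lambda: defaultdict(int))     # counts[s][a] = times action a was tried in s

    def is_known(self, s):
        # A state is "known" once every action has been tried often enough
        # for the empirical model at s to be reasonably accurate.
        return all(self.counts[s][a] >= self.known_threshold for a in self.actions)

    def act(self, s, plan_exploit, plan_explore, escape_threshold=0.1):
        if not self.is_known(s):
            # Balanced wandering: try the least-attempted action to gather statistics.
            return min(self.actions, key=lambda a: self.counts[s][a])
        # In the known part of the MDP, either exploit the empirical model or deliberately
        # head toward unknown states; plan_exploit/plan_explore are caller-supplied planners.
        exploit_action, _ = plan_exploit(s)              # best action on the empirical known-state MDP
        explore_action, escape_prob = plan_explore(s)    # action maximizing probability of reaching unknown states
        # If unknown states are still easy to reach, keep exploring; otherwise exploit.
        return explore_action if escape_prob > escape_threshold else exploit_action

    def record(self, s, a):
        self.counts[s][a] += 1
```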


Book ChapterDOI
TL;DR: PRISM has been successfully used to analyse probabilistic termination, performance, and quality of service properties for a range of systems, including randomized distributed algorithms, manufacturing systems and workstation clusters.
Abstract: In this paper we describe PRISM, a tool being developed at the University of Birmingham for the analysis of probabilistic systems. PRISM supports three probabilistic models: discrete-time Markov chains, Markov decision processes and continuous-time Markov chains. Analysis is performed through model checking such systems against specifications written in the probabilistic temporal logics PCTL and CSL. The tool features three model checking engines: one symbolic, using BDDs (binary decision diagrams) and MTBDDs (multi-terminal BDDs); one based on sparse matrices; and one which combines both symbolic and sparse matrix methods. PRISM has been successfully used to analyse probabilistic termination, performance, and quality of service properties for a range of systems, including randomized distributed algorithms, manufacturing systems and workstation clusters.

717 citations


Journal ArticleDOI
TL;DR: A unified framework for multiagent teamwork, the COMmunicative Multiagent Team Decision Problem (COM-MTDP), which combines and extends existing multiagent theories, and provides a basis for the development of novel team coordination algorithms.
Abstract: Despite the significant progress in multiagent teamwork, existing research does not address the optimality of its prescriptions nor the complexity of the teamwork problem. Without a characterization of the optimality-complexity tradeoffs, it is impossible to determine whether the assumptions and approximations made by a particular theory gain enough efficiency to justify the losses in overall performance. To provide a tool for use by multiagent researchers in evaluating this tradeoff, we present a unified framework, the COMmunicative Multiagent Team Decision Problem (COM-MTDP). The COM-MTDP model combines and extends existing multiagent theories, such as decentralized partially observable Markov decision processes and economic team theory. In addition to their generality of representation, COM-MTDPs also support the analysis of both the optimality of team performance and the computational complexity of the agents' decision problem. In analyzing complexity, we present a breakdown of the computational complexity of constructing optimal teams under various classes of problem domains, along the dimensions of observability and communication cost. In analyzing optimality, we exploit the COM-MTDP's ability to encode existing teamwork theories and models to encode two instantiations of joint intentions theory taken from the literature. Furthermore, the COM-MTDP model provides a basis for the development of novel team coordination algorithms. We derive a domain-independent criterion for optimal communication and provide a comparative analysis of the two joint intentions instantiations with respect to this optimal policy. We have implemented a reusable, domain-independent software package based on COM-MTDPs to analyze teamwork coordination strategies, and we demonstrate its use by encoding and evaluating the two joint intentions strategies within an example domain.

428 citations


Journal ArticleDOI
TL;DR: This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.
Abstract: A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor γ and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration—rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).

416 citations
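A minimal sketch of the sparse-sampling planner described above. The `generative_model(s, a) -> (reward, next_state)` interface and the constants C and gamma are assumptions made for illustration; the key property is that the cost per planned action depends on the sampling width and horizon, not on the number of states.

```python
def sparse_sampling_value(s, h, generative_model, actions, C=20, gamma=0.95):
    """Estimate V_h(s) by building a sampled look-ahead tree of depth h.

    generative_model(s, a) -> (reward, next_state) is an assumed simulator interface;
    C (samples per action) and gamma are illustrative constants.
    """
    if h == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(C):
            r, s_next = generative_model(s, a)
            total += r + gamma * sparse_sampling_value(s_next, h - 1, generative_model, actions, C, gamma)
        best = max(best, total / C)
    return best

def sparse_sampling_action(s, h, generative_model, actions, C=20, gamma=0.95):
    # Choose the action whose sampled Q-estimate at the root is largest.
    def q(a):
        samples = [generative_model(s, a) for _ in range(C)]
        return sum(r + gamma * sparse_sampling_value(s2, h - 1, generative_model, actions, C, gamma)
                   for r, s2 in samples) / C
    return max(actions, key=q)
```

Note that the recursion makes the cost exponential in the depth h, which matches the horizon-time/state-space trade-off described in the abstract.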


Book ChapterDOI
08 Jul 2002
TL;DR: The bandit problem is revisited and considered under the PAC model, and it is shown that given n arms, it suffices to pull the arms O(n/ε² log 1/δ) times to find an ε-optimal arm with probability of at least 1 - δ.
Abstract: The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that given n arms, it suffices to pull the arms O(n/ε² log 1/δ) times to find an ε-optimal arm with probability of at least 1 - δ. This is in contrast to the naive bound of O(n/ε² log n/δ). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound. We show how, given an algorithm for the PAC model multi-armed bandit problem, one can derive a batch learning algorithm for Markov decision processes. This is done essentially by simulating Value Iteration, and in each iteration invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem we improve the dependence on the number of actions.

392 citations
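The improvement over the naive bound comes from a median-elimination-style procedure; the sketch below conveys the idea under an assumed `pull(arm)` reward interface and assumed constants, and should not be read as the paper's exact sampling schedule.

```python
import math

def median_elimination(arms, pull, epsilon, delta):
    """Return an (approximately) epsilon-optimal arm with probability at least 1 - delta.

    `pull(arm)` is an assumed callable returning a reward in [0, 1]; the sampling
    schedule and constants below are illustrative, not the paper's.
    """
    surviving = list(arms)
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Sample every surviving arm enough times for roughly eps_l/2 accuracy w.h.p.
        n_samples = int(4.0 / (eps_l / 2.0) ** 2 * math.log(3.0 / delta_l)) + 1
        means = {a: sum(pull(a) for _ in range(n_samples)) / n_samples for a in surviving}
        # Discard the empirically worse half of the arms, then tighten the schedule.
        ranked = sorted(surviving, key=lambda a: means[a], reverse=True)
        surviving = ranked[: (len(ranked) + 1) // 2]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return surviving[0]
```

Because the per-round sample sizes shrink geometrically while the number of arms halves, the total number of pulls stays O(n/ε² log 1/δ) rather than picking up the log n factor of the naive bound.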


Journal ArticleDOI
TL;DR: This paper evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
Abstract: The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently-calculable measure of the extent to which changes in some state affect the value function of some other states. Variance is an efficiently-calculable measure of how risky some state is in a Markov chain: a low variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum phase, and two-dimensional problem of the “Car on the hill”. It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.

360 citations


BookDOI
TL;DR: In this book, the authors combine interactive processes and Markov chains into Interactive Markov Chains (IMC), develop an algebra for them, and illustrate their use in practice.
Abstract: Interactive Processes.- Markov Chains.- Interactive Markov Chains.- Algebra of Interactive Markov Chains.- Interactive Markov Chains in Practice.- Conclusion.- Proofs for Chapter 3 and Chapter 4.- Proofs for Chapter 5.

342 citations



Book
01 Jan 2002
TL;DR: In this article, the authors present an overview of Markov Decision Processes in the context of communication networks and their applications in finance and dynamic options, including a discussion of the role of the Poisson Equation for Countable Markov Chains.
Abstract: 1. Introduction E.A. Feinberg, A. Shwartz. Part I: Finite State and Action Models. 2. Finite State and Action MDPs L. Kallenberg. 3. Bias Optimality M.E. Lewis, M.L. Puterman. 4. Singular Perturbations of Markov Chains and Decision Processes K.E. Avrachenkov, et al. Part II: Infinite State Models. 5. Average Reward Optimization Theory for Denumerable State Spaces L.I. Sennott. 6. Total Reward Criteria E.A. Feinberg. 7. Mixed Criteria E.A. Feinberg, A. Shwartz. 8. Blackwell Optimality A. Hordijk, A.A. Yushkevich. 9. The Poisson Equation for Countable Markov Chains: Probabilistic Methods and Interpretations A.M. Makowski, A. Shwartz. 10. Stability, Performance Evaluation, and Optimization S.P. Meyn. 11. Convex Analytic Methods in Markov Decision Processes V.S. Borkar. 12. The Linear Programming Approach O. Hernandez-Lerma, J.B. Lasserre. 13. Invariant Gambling Problems and Markov Decision Processes L.E. Dubins, et al. Part III: Applications. 14. Neuro-Dynamic Programming: Overview and Recent Trends B. Van Roy. 15. Markov Decision Processes in Finance and Dynamic Options M. Schal. 16. Applications of Markov Decision Processes in Communication Networks E. Altman. 17. Water Reservoir Applications of Markov Decision Processes B.F. Lamond, A. Boukhtouta. Index.

281 citations


Proceedings ArticleDOI
28 Jul 2002
TL;DR: Methods that exploit the special structure of preference elicitation to deal with parameterized belief states over the continuous state space, and gradient techniques for optimizing parameterized actions are described.
Abstract: Preference elicitation is a key problem facing the deployment of intelligent systems that make or recommend decisions on behalf of users. Since not all aspects of a utility function have the same impact on object-level decision quality, determining which information to extract from a user is itself a sequential decision problem, balancing the amount of elicitation effort and time with decision quality. We formulate this problem as a partially-observable Markov decision process (POMDP). Because of the continuous nature of the state and action spaces of this POMDP, standard techniques cannot be used to solve it. We describe methods that exploit the special structure of preference elicitation to deal with parameterized belief states over the continuous state space, and gradient techniques for optimizing parameterized actions. These methods can be used with a number of different belief state representations, including mixture models.

Book ChapterDOI
19 Aug 2002
TL;DR: The Q-Cut algorithm is presented, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm, and extended to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments.

Abstract: We present the Q-Cut algorithm, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm. The learning agent creates an on-line map of the process history, and uses an efficient Max-Flow/Min-Cut algorithm for identifying bottlenecks. The policies for reaching bottlenecks are separately learned and added to the model in the form of options (macro-actions). We then extend the basic Q-Cut algorithm to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements, particularly in the initial learning phase.
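A rough sketch of the bottleneck-identification step: build a graph from the observed transition history and run a max-flow/min-cut computation (here via networkx). The graph construction, the use of raw visit counts as capacities, and the source/target choice are illustrative assumptions rather than the paper's exact procedure.

```python
import networkx as nx

def find_bottlenecks(transition_history, source, target):
    """Identify candidate bottleneck states from an observed state-transition history.

    `transition_history` is assumed to be a list of (s, s_next) pairs collected on-line;
    `source` and `target` delimit the segment being cut (e.g., the start state and a
    frontier state). Edge capacities are raw visit counts, an illustrative choice.
    """
    g = nx.DiGraph()
    for s, s_next in transition_history:
        if g.has_edge(s, s_next):
            g[s][s_next]["capacity"] += 1
        else:
            g.add_edge(s, s_next, capacity=1)
    cut_value, (reachable, non_reachable) = nx.minimum_cut(g, source, target)
    # States on the source side of cut edges are candidate bottlenecks around which
    # new options (macro-actions) can be learned.
    bottlenecks = {u for u in reachable for v in g[u] if v in non_reachable}
    return cut_value, bottlenecks
```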

Proceedings Article
01 Aug 2002
TL;DR: The use of an n-gram predictive model is suggested for generating the initial MDP, which induces a Markov chain model of user behavior whose predictive accuracy is greater than that of existing predictive models.

Abstract: Typical Recommender systems adopt a static view of the recommendation process and treat it as a prediction problem. We argue that it is more appropriate to view the problem of generating recommendations as a sequential decision problem and, consequently, that Markov decision processes (MDP) provide a more appropriate model for Recommender systems. MDPs introduce two benefits: they take into account the long-term effects of each recommendation, and they take into account the expected value of each recommendation. To succeed in practice, an MDP-based Recommender system must employ a strong initial model; and the bulk of this paper is concerned with the generation of such a model. In particular, we suggest the use of an n-gram predictive model for generating the initial MDP. Our n-gram model induces a Markov chain model of user behavior whose predictive accuracy is greater than that of existing predictive models. We describe our predictive model in detail and evaluate its performance on real data. In addition, we show how the model can be used in an MDP-based Recommender system.
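A minimal sketch of how k-gram counts over user sessions might seed the initial MDP's transition model, as described above; the state definition (the last k items) and the add-one smoothing are illustrative choices, not necessarily the paper's.

```python
from collections import defaultdict

def build_ngram_transitions(sessions, k=2):
    """Build an initial transition model from user sessions via k-gram counts.

    States are the last k items a user interacted with, and the next item plays the
    role of the recommendation outcome. Maximum-likelihood estimates with add-one
    smoothing over observed successors are used purely for illustration.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(len(session) - k):
            state = tuple(session[i:i + k])          # the k most recent items
            next_item = session[i + k]
            counts[state][next_item] += 1
    transitions = {}
    for state, successors in counts.items():
        total = sum(successors.values()) + len(successors)   # add-one smoothing
        transitions[state] = {item: (c + 1) / total for item, c in successors.items()}
    return transitions

# Usage sketch: sessions = [["a", "b", "c"], ["a", "b", "d"]]
# build_ngram_transitions(sessions, k=2)[("a", "b")] -> {"c": 0.5, "d": 0.5}
```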


Journal ArticleDOI
TL;DR: A novel approach to adjustable autonomy is presented, based on the notion of a transfer-of-control strategy, which guides and informs the operationalization of the strategies using Markov Decision Processes, which select an optimal strategy, given an uncertain environment and costs to the individuals and teams.
Abstract: Adjustable autonomy refers to entities dynamically varying their own autonomy, transferring decision-making control to other entities (typically agents transferring control to human users) in key situations. Determining whether and when such transfers-of-control should occur is arguably the fundamental research problem in adjustable autonomy. Previous work has investigated various approaches to addressing this problem but has often focused on individual agent-human interactions. Unfortunately, domains requiring collaboration between teams of agents and humans reveal two key shortcomings of these previous approaches. First, these approaches use rigid one-shot transfers of control that can result in unacceptable coordination failures in multiagent settings. Second, they ignore costs (e.g., in terms of time delays or effects on actions) to an agent's team due to such transfers-of-control. To remedy these problems, this article presents a novel approach to adjustable autonomy, based on the notion of a transfer-of-control strategy. A transfer-of-control strategy consists of a conditional sequence of two types of actions: (i) actions to transfer decision-making control (e.g., from an agent to a user or vice versa) and (ii) actions to change an agent's pre-specified coordination constraints with team members, aimed at minimizing miscoordination costs. The goal is for high-quality individual decisions to be made with minimal disruption to the coordination of the team. We present a mathematical model of transfer-of-control strategies. The model guides and informs the operationalization of the strategies using Markov Decision Processes, which select an optimal strategy, given an uncertain environment and costs to the individuals and teams. The approach has been carefully evaluated, including via its use in a real-world, deployed multi-agent system that assists a research group in its daily activities.

Journal ArticleDOI
TL;DR: This paper investigates the effect of demand censoring on the optimal policy in newsvendor inventory models with general parametric demand distributions and unknown parameter values and shows that the optimal inventory level in the presence of censored demand is higher than would be determined using a Bayesian myopic policy.

Abstract: This paper investigates the effect of demand censoring on the optimal policy in newsvendor inventory models with general parametric demand distributions and unknown parameter values. We show that the newsvendor problem with observable lost sales reduces to a sequence of single-period problems, while the newsvendor problem with unobservable lost sales requires a dynamic analysis. Using a Bayesian Markov decision process approach we show that the optimal inventory level in the presence of censored demand is higher than would be determined using a Bayesian myopic policy. We explore the economic rationality for this observation and illustrate it with numerical examples.
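A small grid-based sketch of the Bayesian update under censored demand. Poisson demand, the discrete grid of candidate rates, and the stock/sales interface are all assumptions made for illustration, not the paper's general parametric setting.

```python
import math

def update_posterior(prior, stock, sales):
    """One Bayesian update of a discretized posterior over a Poisson demand rate.

    `prior` maps candidate rates lam -> probability. If sales < stock, demand was
    observed exactly; if sales == stock, we only learn that demand >= stock (censoring).
    """
    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    posterior = {}
    for lam, p in prior.items():
        if sales < stock:                               # exact demand observation
            likelihood = poisson_pmf(sales, lam)
        else:                                           # censored: demand >= stock
            likelihood = 1.0 - sum(poisson_pmf(k, lam) for k in range(stock))
        posterior[lam] = p * likelihood
    norm = sum(posterior.values())
    return {lam: p / norm for lam, p in posterior.items()}

# Usage sketch: a censored observation (sales == stock) shifts posterior mass toward
# larger demand rates, one way to see why stocking above the myopic level pays off.
prior = {lam: 1 / 5 for lam in (2, 4, 6, 8, 10)}
posterior = update_posterior(prior, stock=5, sales=5)
```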

Journal ArticleDOI
TL;DR: This paper presents an approach to manage inventory decisions at all stages of the supply chain in an integrated manner that allows an inventory order policy to be determined, aimed at optimizing the performance of the whole supply chain.

Book ChapterDOI
TL;DR: This work model the two-way handshake mechanism of the IEEE 802.11 standard with a fixed network topology using Probabilistic timed automata, a formal description mechanism in which both nondeterministic choice and probabilistic choice can be represented.
Abstract: The international standard IEEE 802.11 was developed recently in recognition of the increased demand for wireless local area networks. Its medium access control mechanism is described according to a variant of the Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) scheme. Although collisions cannot always be prevented, randomised exponential backoff rules are used in the retransmission scheme to minimise the likelihood of repeated collisions. More precisely, the backoff procedure involves a uniform probabilistic choice of an integer-valued delay from an interval, where the size of the interval grows exponentially with regard to the number of retransmissions of the current data packet. We model the two-way handshake mechanism of the IEEE 802.11 standard with a fixed network topology using probabilistic timed automata, a formal description mechanism in which both nondeterministic choice and probabilistic choice can be represented. From our probabilistic timed automaton model, we obtain a finite-state Markov decision process via a property-preserving discrete-time semantics. The Markov decision process is then verified using Prism, a probabilistic model checking tool, against probabilistic, timed properties such as "at most 5,000 microseconds pass before a station sends its packet correctly."

Journal ArticleDOI
TL;DR: For risk-sensitive control of finite Markov chains a counterpart of the popular Q-learning algorithm for classical Markov decision processes is proposed, and the algorithm is shown to converge with probability one to the desired solution.
Abstract: We propose for risk-sensitive control of finite Markov chains a counterpart of the popular Q-learning algorithm for classical Markov decision processes. The algorithm is shown to converge with probability one to the desired solution. The proof technique is an adaptation of the o.d.e. approach for the analysis of stochastic approximation algorithms, with most of the work involved used for the analysis of the specific o.d.e.s that arise.

Journal ArticleDOI
TL;DR: The existence of an optimal feedback law is established for the risk-sensitive optimal control problem with denumerable state space and a solution can be found constructively using either value iteration or policy iteration under suitable conditions on initial feedback law.
Abstract: The existence of an optimal feedback law is established for the risk-sensitive optimal control problem with denumerable state space. The main assumptions imposed are irreducibility and a near-monotonicity condition on the one-step cost function. A solution can be found constructively using either value iteration or policy iteration under suitable conditions on the initial feedback law.

Journal ArticleDOI
TL;DR: The optimal dynamic policy is shown to have a rather complex structure, which motivates the consideration of more implementable policies; a double-threshold policy is presented, and exact and approximate methods for evaluating the performance of this policy and computing its optimal parameters are derived.
Abstract: We consider a make-to-stock production/inventory system consisting of a single deteriorating machine which produces a single item. We formulate the integrated decisions of maintenance and production using a Markov Decision Process. The optimal dynamic policy is shown to have a rather complex structure which leads us to consider more implementable policies. We present a double-threshold policy and derive exact and approximate methods for evaluating the performance of this policy and computing its optimal parameters. A detailed numerical study demonstrates that the proposed policy and our approximate method for computing its parameters perform extremely well. Finally, we show that policies which do not address maintenance and production control decisions in an integrated manner can perform rather badly.
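To make the double-threshold idea concrete, here is a toy simulation of an (S, M) policy: produce while inventory is below a base-stock level S, and perform maintenance once machine wear reaches M. The dynamics and all parameters are invented for illustration and are not the paper's model or evaluation method.

```python
import random

def simulate_double_threshold(horizon=10_000, S=5, M=3,
                              wear_prob=0.1, demand_prob=0.4, worn_yield=0.5, seed=0):
    """Simulate an (S, M) double-threshold policy on a toy deteriorating machine.

    S: produce whenever inventory < S; M: maintain (reset wear) whenever wear >= M.
    """
    rng = random.Random(seed)
    inventory, wear, served, maintained = 0, 0, 0, 0
    for _ in range(horizon):
        if wear >= M:                          # maintenance threshold reached: restore machine
            wear, maintained = 0, maintained + 1
        elif inventory < S:                    # base-stock threshold: keep producing
            good = rng.random() < (1.0 if wear == 0 else worn_yield)
            inventory += 1 if good else 0
            wear += 1 if rng.random() < wear_prob else 0
        if rng.random() < demand_prob and inventory > 0:   # serve demand if stock is available
            inventory -= 1
            served += 1
    return {"served": served, "maintenance_actions": maintained}
```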

Book ChapterDOI
01 Jan 2002
TL;DR: The convex analytic approach to classical Markov decision processes wherein they are cast as a static convex programming problem in the space of measures is described.
Abstract: This article describes the convex analytic approach to classical Markov decision processes wherein they are cast as a static convex programming problem in the space of measures. Applications to multiobjective problems are described.
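In the discounted case, the static convex program mentioned above is a linear program over occupation measures: maximize Σ_{s,a} x(s,a) r(s,a) subject to Σ_a x(s',a) − γ Σ_{s,a} P(s'|s,a) x(s,a) = μ(s') and x ≥ 0. The sketch below solves a made-up two-state MDP with scipy; the problem data are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, mu, gamma=0.9):
    """Solve a discounted MDP via the occupation-measure linear program.

    P[s, a, s'] are transition probabilities, R[s, a] one-step rewards,
    mu the initial-state distribution.
    """
    S, A, _ = P.shape
    c = -R.reshape(S * A)                                   # linprog minimizes, so negate rewards
    A_eq = np.zeros((S, S * A))
    for s2 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s2, s * A + a] = (1.0 if s == s2 else 0.0) - gamma * P[s, a, s2]
    res = linprog(c, A_eq=A_eq, b_eq=mu, bounds=[(0, None)] * (S * A), method="highs")
    x = res.x.reshape(S, A)
    policy = x.argmax(axis=1)                               # the optimal deterministic action carries the mass
    return x, policy

# Toy example (made up): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
x, policy = solve_mdp_lp(P, R, mu=np.array([0.5, 0.5]))
```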

Proceedings ArticleDOI
28 Jul 2002
TL;DR: An algorithm is presented for coordinated decision making in cooperative multiagent settings, where the agents' value function can be represented as a sum of context-specific value rules, and the joint value function of the associated Markov Decision Process is approximated as a set of value rules using an efficient linear programming algorithm.
Abstract: We present an algorithm for coordinated decision making in cooperative multiagent settings, where the agents' value function can be represented as a sum of context-specific value rules. The task of finding an optimal joint action in this setting leads to an algorithm where the coordination structure between agents depends on the current state of the system and even on the actual numerical values assigned to the value rules. We apply this framework to the task of multiagent planning in dynamic systems, showing how a joint value function of the associated Markov Decision Process can be approximated as a set of value rules using an efficient linear programming algorithm. The agents then apply the coordination graph algorithm at each iteration of the process to decide on the highest-value joint action, potentially leading to a different coordination pattern at each step of the plan.
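The joint-action selection step described above can be illustrated on a tiny chain-structured coordination graph: eliminate one agent at a time, then back-substitute its best response. The pairwise value tables below are made-up numbers, and binary actions are assumed for each agent.

```python
from itertools import product

# Pairwise value rules on a chain-structured coordination graph A1 - A2 - A3.
# Tables are made-up numbers; each agent chooses action 0 or 1.
f12 = {(a1, a2): [[2, 0], [1, 3]][a1][a2] for a1, a2 in product((0, 1), repeat=2)}
f23 = {(a2, a3): [[1, 4], [0, 2]][a2][a3] for a2, a3 in product((0, 1), repeat=2)}

# Eliminate agent A3: for each a2, record A3's best-response value and choice.
g2 = {a2: max(f23[(a2, a3)] for a3 in (0, 1)) for a2 in (0, 1)}
best_a3 = {a2: max((0, 1), key=lambda a3: f23[(a2, a3)]) for a2 in (0, 1)}

# Maximize over the remaining agents A1 and A2, then back-substitute A3's choice.
best_a1, best_a2 = max(product((0, 1), repeat=2), key=lambda p: f12[p] + g2[p[1]])
joint_action = (best_a1, best_a2, best_a3[best_a2])
print(joint_action)   # coordination emerges without enumerating all 8 joint actions
```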

Journal ArticleDOI
TL;DR: In this article, the authors study the employee staffing problem in a service organization that uses employee service capacity to meet random, nonstationary service requirements, and develop a Markov Decision Process (MDP) model which explicitly represents the stochastic nature of these effects.
Abstract: We study the employee staffing problem in a service organization that uses employee service capacity to meet random, nonstationary service requirements. The employees experience learning and turnover on the job, and we develop a Markov Decision Process (MDP) model which explicitly represents the stochastic nature of these effects. Theoretical results show that the optimal hiring policy is of a state-dependent "hire-up-to" type, similar to an inventory "order-up-to" policy. For two important special cases, a myopic policy is optimal. We also test a linear programming (LP) based heuristic, which uses average learning and turnover behavior, in stationary environments. In most cases, the LP-based policy performs quite well, within 1% of optimality. When flexible capacity--in the form of overtime or outsourcing--is expensive or not available, however, explicit modeling of stochastic learning and turnover effects may improve performance significantly.

Book ChapterDOI
01 Jan 2002
TL;DR: This chapter presents a survey of applications of MDPs to communication networks, together with the theoretical tools that have been developed to model and to solve the resulting control problems.
Abstract: We present in this chapter a survey on applications of MDPs to communication networks. We survey both the different application areas in communication networks as well as the theoretical tools that have been developed to model and to solve the resulting control problems.

Journal ArticleDOI
TL;DR: This paper presents results that demonstrate that there are other practical control policies that almost always provide much better solutions for this problem than the CEC policies commonly used in practice.
Abstract: This paper examines several different policies for an inventory control problem in which the demand process is nonstationary and partially observed. The probability distribution for the demand in each period is determined by the state of a Markov chain, the core process. However, the state of this core process is not directly observed; only the actual demand is observed by the decision maker. Given this demand process, the inventory control problem is a composite-state, partially observed Markov decision process (POMDP), which is an appropriate model for a number of dynamic demand problems. In practice, managers often use certainty equivalent control (CEC) policies to solve such a problem. However, this paper presents results that demonstrate that there are other practical control policies that almost always provide much better solutions for this problem than the CEC policies commonly used in practice. The computational results also indicate how specific problem characteristics influence the performance of each of the alternative policies.

Journal ArticleDOI
TL;DR: It is shown that although the start-up company should be more conservative in its component purchasing strategy than if it were a well-established company, it should not be too conservative, nor is its strategy monotone in the amount of capital it has available.
Abstract: New start-up companies, which are considered to be a vital ingredient in a successful economy, have a different objective than established companies: They want to maximise their chance of long-term survival. We examine the implications for their operating decisions of this different criterion by considering an abstraction of the inventory problem faced by a start-up manufacturing company. The problem is modelled under two criteria as a Markov decision process; the characteristics of the optimal policies under the two criteria are compared. It is shown that although the start-up company should be more conservative in its component purchasing strategy than if it were a well-established company, it should not be too conservative. Nor is its strategy monotone in the amount of capital it has available. The models are extended to allow for interest on investment and inflation.

Proceedings ArticleDOI
28 Jul 2002
TL;DR: A planning algorithm is described that integrates two approaches to solving Markov decision processes with large state spaces in a novel way that exploits symbolic model-checking techniques and demonstrates their usefulness for decision-theoretic planning.
Abstract: We describe a planning algorithm that integrates two approaches to solving Markov decision processes with large state spaces. State abstraction is used to avoid evaluating states individually. Forward search from a start state, guided by an admissible heuristic, is used to avoid evaluating all states. We combine these two approaches in a novel way that exploits symbolic model-checking techniques and demonstrates their usefulness for decision-theoretic planning.

Journal ArticleDOI
TL;DR: The polyhedral foundation of the PCL framework is developed, based on the structural and algorithmic properties of a new polytope associated with an accessible set system (an extended polymatroid), and PCL-indexability is interpreted as a form of the classic economics law of diminishing marginal returns.
Abstract: This paper develops a polyhedral approach to the design, analysis, and computation of dynamic allocation indices for scheduling binary-action (engage/rest) Markovian stochastic projects which can change state when rested (restless bandits (RBs)), based on partial conservation laws (PCLs). This extends previous work by the author [J. Nino-Mora (2001): Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab. 33, 76–98], where PCLs were shown to imply the optimality of index policies with a postulated structure in stochastic scheduling problems, under admissible linear objectives, and they were deployed to obtain simple sufficient conditions for the existence of Whittle's (1988) RB index (indexability), along with an adaptive-greedy index algorithm. The new contributions include: (i) we develop the polyhedral foundation of the PCL framework, based on the structural and algorithmic properties of a new polytope associated with an accessible set system (an extended polymatroid); (ii) we present new dynamic allocation indices for RBs, motivated by an admission control model, which extend Whittle's and have a significantly increased scope; (iii) we deploy PCLs to obtain both sufficient conditions for the existence of the new indices (PCL-indexability), and a new adaptive-greedy index algorithm; (iv) we interpret PCL-indexability as a form of the classic economics law of diminishing marginal returns, and characterize the index as an optimal marginal cost rate; we further solve a related optimal constrained control problem; (v) we carry out a PCL-indexability analysis of the motivating admission control model, under time-discounted and long-run average criteria; this gives, under mild conditions, a new index characterization of optimal threshold policies; and (vi) we apply the latter to present new heuristic index policies for two hard queueing control problems: admission control and routing to parallel queues; and scheduling a multiclass make-to-stock queue with lost sales, both under state-dependent holding cost rates and birth-death dynamics.

Journal ArticleDOI
01 Jun 2002
TL;DR: It is suggested that genetic algorithms are probably the most general approach for adding generalization, although they might not be the only solution.

Abstract: We analyze learning classifier systems in the light of tabular reinforcement learning. We note that although genetic algorithms are the most distinctive feature of learning classifier systems, it is not clear whether genetic algorithms are important to learning classifier systems. In fact, there are models which are strongly based on evolutionary computation (e.g., Wilson's XCS) and others which do not exploit evolutionary computation at all (e.g., Stolzmann's ACS). To find some clarifications, we try to develop learning classifier systems “from scratch”, i.e., starting from one of the best-known reinforcement learning techniques, Q-learning. We first consider the basics of reinforcement learning: a problem modeled as a Markov decision process and tabular Q-learning. We introduce a formal framework to define a general purpose rule-based representation which we use to implement tabular Q-learning. We formally define generalization within rules and discuss the possible approaches to extend our rule-based Q-learning with generalization capabilities. We suggest that genetic algorithms are probably the most general approach for adding generalization, although they might not be the only solution.
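For reference, a plain tabular Q-learning loop of the kind the analysis above takes as its starting point; the environment interface (`reset`, `step`, `actions`) and the hyper-parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Plain tabular Q-learning.

    `env` is assumed to expose reset() -> state and step(a) -> (next_state, reward, done),
    with a list of discrete actions in env.actions.
    """
    q = defaultdict(float)                       # q[(state, action)] defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: q[(s, act)])
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(q[(s_next, b)] for b in env.actions))
            q[(s, a)] += alpha * (target - q[(s, a)])    # standard Q-learning update
            s = s_next
    return q
```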