
Showing papers on "Markov decision process published in 2009"


Journal ArticleDOI
TL;DR: It is shown that a myopic policy that maximizes the immediate one-step reward is optimal when the state transitions are positively correlated over time; when the transitions are negatively correlated, the same policy remains optimal for two or three channels, and a counterexample is given for the case of four channels.
Abstract: This paper considers opportunistic communication over multiple channels where the state ("good" or "bad") of each channel evolves as independent and identically distributed (i.i.d.) Markov processes. A user, with limited channel sensing capability, chooses one channel to sense and decides whether to use the channel (based on the sensing result) in each time slot. A reward is obtained whenever the user senses and accesses a "good" channel. The objective is to design a channel selection policy that maximizes the expected total (discounted or average) reward accrued over a finite or infinite horizon. This problem can be cast as a partially observed Markov decision process (POMDP) or a restless multiarmed bandit process, to which optimal solutions are often intractable. This paper shows that a myopic policy that maximizes the immediate one-step reward is optimal when the state transitions are positively correlated over time. When the state transitions are negatively correlated, we show that the same policy is optimal when the number of channels is limited to two or three, while presenting a counterexample for the case of four channels. This result finds applications in opportunistic transmission scheduling in a fading environment, cognitive radio networks for spectrum overlay, and resource-constrained jamming and antijamming.
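
For illustration, here is a minimal simulation sketch of the myopic policy described above for two-state (Gilbert-Elliott) channels; the parameter values and the simulation loop are illustrative assumptions, not details taken from the paper.

    import numpy as np

    # Myopic sensing over N independent two-state channels: keep a belief
    # omega_i = P(channel i is "good"), sense the channel with the largest
    # belief, and update beliefs with the one-step transition probabilities
    # p11 = P(good -> good) and p01 = P(bad -> good).
    rng = np.random.default_rng(0)
    N, T = 4, 10_000
    p11, p01 = 0.8, 0.3                          # positively correlated case (p11 > p01)
    state = rng.random(N) < 0.5                  # true channel states, hidden from the user
    belief = np.full(N, p01 / (1 - p11 + p01))   # stationary prior

    total_reward = 0
    for t in range(T):
        k = int(np.argmax(belief))               # myopic choice: most likely "good" channel
        good = bool(state[k])
        total_reward += int(good)                # unit reward for accessing a good channel
        belief = p11 * belief + p01 * (1 - belief)   # one-step prediction for all channels
        belief[k] = p11 if good else p01             # exact update for the sensed channel
        stay, turn = rng.random(N) < p11, rng.random(N) < p01
        state = np.where(state, stay, turn)          # channels evolve independently

    print(f"average reward per slot: {total_reward / T:.3f}")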

416 citations


Journal ArticleDOI
TL;DR: The current state of the art for learning near-optimal behavior in finite Markov Decision Processes with a polynomial number of samples is summarized by presenting bounds for the problem in a unified theoretical framework.
Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
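
As a concrete reference point for the model-based algorithms surveyed here, the following is a rough sketch of R-MAX on a generic finite MDP; the environment interface (env.reset/env.step returning a next state and reward) and the "known-ness" threshold m are assumptions made for the example, not details from the paper.

    import numpy as np

    def value_iteration(P, R, gamma=0.95, iters=500):
        """P: (S, A, S) transition tensor, R: (S, A) reward matrix; returns Q-values."""
        V = np.zeros(R.shape[0])
        for _ in range(iters):
            Q = R + gamma * (P @ V)       # shape (S, A)
            V = Q.max(axis=1)
        return Q

    def rmax(env, S, A, m=10, r_max=1.0, gamma=0.95, steps=5000):
        counts = np.zeros((S, A))
        trans = np.zeros((S, A, S))
        rew_sum = np.zeros((S, A))
        s = env.reset()
        for _ in range(steps):
            known = counts >= m
            # Optimism: unknown (s, a) pairs get reward r_max and a self-loop.
            P = np.where(known[..., None], trans / np.maximum(counts[..., None], 1), 0.0)
            R = np.where(known, rew_sum / np.maximum(counts, 1), r_max)
            for i in range(S):
                for a in range(A):
                    if not known[i, a]:
                        P[i, a, i] = 1.0
            a = int(np.argmax(value_iteration(P, R, gamma)[s]))
            s2, r = env.step(a)
            if counts[s, a] < m:          # freeze statistics once a pair is "known"
                counts[s, a] += 1
                trans[s, a, s2] += 1
                rew_sum[s, a] += r
            s = s2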

289 citations


Journal ArticleDOI
TL;DR: A generic mathematical framework is proposed to characterize the policy for single hop transmission over a replenishable sensor network, and a Markov chain model is introduced to describe different modes of energy renewal.
Abstract: Energy harvesting from the working environment has received increasing attention in the research of wireless sensor networks. Recent developments in this area can be used to replenish the power supply of sensors. However, power management is still a crucial issue for such networks due to the uncertainty of stochastic replenishment. In this paper, we propose a generic mathematical framework to characterize the policy for single hop transmission over a replenishable sensor network. Firstly, we introduce a Markov chain model to describe different modes of energy renewal. Then, we derive the optimal transmission policy for sensors with different energy budgets. Depending on the energy status of a sensor and the reward for successfully transmitting a message, we prove the existence of optimal thresholds that maximize the average reward rate. Our results are quite general since the reward values can be made application-specific for different design objectives. Compared with the unconditional transmit-all policy, which transmits every message as long as the energy storage is positive, the proposed optimal transmission policy is shown to achieve significant gains in the average reward rate.
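
The threshold structure can be reproduced on a toy version of the problem. In the sketch below, a sensor with battery level e sees a message with a discretized random reward each slot and either transmits it (spending one energy unit) or drops it, while one unit of energy is harvested with probability q per slot; this Bernoulli harvesting is a crude stand-in for the paper's Markov-modulated renewal model, and all numbers are illustrative.

    import numpy as np

    E = 10                                   # battery capacity
    rewards = np.array([0.2, 1.0, 3.0])      # possible message rewards
    probs = np.array([0.5, 0.3, 0.2])        # probabilities of those rewards
    q, gamma = 0.4, 0.98                     # harvest probability, discount factor

    def backup(V):
        """One Bellman backup over states (energy level, current message reward)."""
        EV = V @ probs                       # expected next-slot value, per energy level
        newV = np.empty_like(V)
        transmit = np.zeros(V.shape, dtype=bool)
        for e in range(E + 1):
            drop = gamma * ((1 - q) * EV[e] + q * EV[min(e + 1, E)])
            for k, r in enumerate(rewards):
                tx = r + gamma * ((1 - q) * EV[e - 1] + q * EV[e]) if e > 0 else -np.inf
                transmit[e, k] = tx > drop
                newV[e, k] = max(tx, drop)
        return newV, transmit

    V = np.zeros((E + 1, len(rewards)))
    for _ in range(3000):
        V, policy = backup(V)

    # The optimal rule is a reward threshold that depends on the energy level.
    for e in range(E + 1):
        worth = rewards[policy[e]]
        print(f"energy {e:2d}: transmit if reward >= "
              f"{worth.min() if worth.size else float('inf')}")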

266 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: A recommendation model based on reinforcement learning allows conversational systems to autonomously improve a fixed strategy and eventually learn a better one; a user study shows that the learned optimal strategy differs from the fixed one and supports more effective and efficient interaction sessions.
Abstract: Conversational recommender systems (CRSs) assist online users in their information-seeking and decision making tasks by supporting an interactive process. Although these processes could be rather diverse, CRSs typically follow a fixed strategy, e.g., based on critiquing or on iterative query reformulation. In a previous paper, we proposed a novel recommendation model that allows conversational systems to autonomously improve a fixed strategy and eventually learn a better one using reinforcement learning techniques. This strategy is optimal for the given model of the interaction and it is adapted to the users' behaviors. In this paper we validate our approach in an online CRS by means of a user study involving several hundreds of testers. We show that the optimal strategy is different from the fixed one, and supports more effective and efficient interaction sessions.

250 citations


Journal ArticleDOI
TL;DR: This work develops a column generation algorithm to solve the problem for a multinomial logit choice model with disjoint consideration sets (MNLD), and derives a bound as a by-product of a decomposition heuristic.
Abstract: We consider a network revenue management problem where customers choose among open fare products according to some prespecified choice model. Starting with a Markov decision process (MDP) formulation, we approximate the value function with an affine function of the state vector. We show that the resulting problem provides a tighter bound for the MDP value than the choice-based linear program. We develop a column generation algorithm to solve the problem for a multinomial logit choice model with disjoint consideration sets (MNLD). We also derive a bound as a by-product of a decomposition heuristic. Our numerical study shows the policies from our solution approach can significantly outperform heuristics from the choice-based linear program.
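
To make the affine approximation concrete, the sketch below sets up the generic approximate linear program for a discounted MDP with an affine value-function approximation and solves it with scipy; the paper's finite-horizon network revenue management formulation and its column generation procedure are not reproduced, so P, R, phi and c are placeholders.

    import numpy as np
    from scipy.optimize import linprog

    def affine_alp(P, R, phi, c, gamma=0.95):
        """Approximate LP with V(s) ~ theta_0 + theta . phi(s).
        P: (S, A, S) transitions, R: (S, A) rewards, phi: (S, d) features,
        c: (S,) nonnegative state-relevance weights."""
        S, A = R.shape
        basis = np.hstack([np.ones((S, 1)), phi])            # affine basis, (S, d+1)
        obj = c @ basis                                       # minimize sum_s c(s) V(s)
        # One constraint per (s, a): V(s) >= R(s,a) + gamma * E[V(s') | s, a],
        # rewritten as A_ub @ theta <= b_ub for linprog.
        A_ub = np.array([-(basis[s] - gamma * P[s, a] @ basis)
                         for s in range(S) for a in range(A)])
        b_ub = np.array([-R[s, a] for s in range(S) for a in range(A)])
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * basis.shape[1], method="highs")
        return res.x                                          # theta_0, theta_1, ..., theta_d

Any feasible theta in this program gives a pointwise upper bound on the true value function, which is the general sense in which such approximations yield bounds on the MDP value.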

223 citations


Journal ArticleDOI
TL;DR: This work considers a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step, yet the dynamics remain fixed, and provides efficient algorithms, which have regret bounds with no dependence on the size of the state space.
Abstract: We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions.
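
A core ingredient in algorithms of this kind is to run a no-regret "experts" learner over the actions in every state. The sketch below keeps a multiplicative-weights (Hedge) learner per state; feeding it raw one-step rewards is a simplification of the Q-value-like quantities the actual algorithms construct from the fixed dynamics, and the class and parameter names are invented for the example.

    import numpy as np

    class PerStateHedge:
        """One multiplicative-weights (Hedge) learner over the actions of each state."""

        def __init__(self, n_states, n_actions, eta=0.1, seed=0):
            self.w = np.ones((n_states, n_actions))
            self.eta = eta
            self.rng = np.random.default_rng(seed)

        def act(self, s):
            p = self.w[s] / self.w[s].sum()
            return int(self.rng.choice(len(p), p=p))

        def update(self, s, payoff):
            # payoff[a]: estimated reward of action a in state s for this round
            # (the full algorithms feed in Q-value-like estimates built from the
            # known, fixed transition dynamics rather than raw one-step rewards).
            self.w[s] *= np.exp(self.eta * np.asarray(payoff))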

190 citations


Book ChapterDOI
27 Aug 2009
TL;DR: An algorithm is proposed that allows the agent to query the demonstrator for samples at specific states, instead of relying only on samples provided at "arbitrary" states, to estimate the reward function with similar accuracy as other methods from the literature while reducing the amount of policy samples required from the expert.
Abstract: Inverse reinforcement learning addresses the general problem of recovering a reward function from samples of a policy provided by an expert/demonstrator. In this paper, we introduce active learning for inverse reinforcement learning. We propose an algorithm that allows the agent to query the demonstrator for samples at specific states, instead of relying only on samples provided at "arbitrary" states. The purpose of our algorithm is to estimate the reward function with similar accuracy as other methods from the literature while reducing the amount of policy samples required from the expert. We also discuss the use of our algorithm in higher dimensional problems, using both Monte Carlo and gradient methods. We present illustrative results of our algorithm in several simulated examples of different complexities.

189 citations


Journal ArticleDOI
TL;DR: Comparisons with an existing heuristic from the literature and a lower bound computed with complete knowledge of customer demands show that the best partial reoptimization heuristics outperform this heuristic and are on average no more than 10%--13% away from this lower bound, depending on the type of instances.
Abstract: We consider the vehicle-routing problem with stochastic demands (VRPSD) under reoptimization. We develop and analyze a finite-horizon Markov decision process (MDP) formulation for the single-vehicle case and establish a partial characterization of the optimal policy. We also propose a heuristic solution methodology for our MDP, named partial reoptimization, based on the idea of restricting attention to a subset of all the possible states and computing an optimal policy on this restricted set of states. We discuss two families of computationally efficient partial reoptimization heuristics and illustrate their performance on a set of instances with up to and including 100 customers. Comparisons with an existing heuristic from the literature and a lower bound computed with complete knowledge of customer demands show that our best partial reoptimization heuristics outperform this heuristic and are on average no more than 10%--13% away from this lower bound, depending on the type of instances.

176 citations


Journal ArticleDOI
TL;DR: This work designs distributed spectrum sensing and access strategies for opportunistic spectrum access (OSA) under an energy constraint on secondary users that maximize the throughput of the secondary user during its battery lifetime and establishes threshold structures of the optimal policies.
Abstract: We design distributed spectrum sensing and access strategies for opportunistic spectrum access (OSA) under an energy constraint on secondary users. Both the continuous and the bursty traffic models are considered for different applications of the secondary network. In each slot, a secondary user sequentially decides whether to sense, where in the spectrum to sense, and whether to access. By casting this sequential decision-making problem in the framework of partially observable Markov decision processes, we obtain stationary optimal spectrum sensing and access policies that maximize the throughput of the secondary user during its battery lifetime. We also establish threshold structures of the optimal policies and study the fundamental tradeoffs involved in the energy-constrained OSA design. Numerical results are provided to investigate the impact of the secondary user's residual energy on the optimal spectrum sensing and access decisions.

175 citations


Journal ArticleDOI
TL;DR: There are at least two general theories for building probabilistic-dynamical systems: Markov theory and quantum theory. The decision about whether to use a Markov or a quantum model depends on which of these theories' laws are empirically obeyed in a given application.

170 citations


Proceedings Article
18 Jun 2009
TL;DR: An algorithm is provided that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP) where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector.
Abstract: We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of O(HS√(AT)). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.

Journal ArticleDOI
TL;DR: Algorithms for counterexample generation for probabilistic CTL formulae in discrete-time Markov chains are considered, together with a simple algorithm to generate (minimal) regular expressions that can act as counterexamples.
Abstract: Providing evidence for the refutation of a property is an essential, if not the most important, feature of model checking. This paper considers algorithms for counterexample generation for probabilistic CTL formulae in discrete-time Markov chains. Finding the strongest evidence (i.e., the most probable path) violating a (bounded) until-formula is shown to be reducible to a single-source (hop-constrained) shortest path problem. Counterexamples of smallest size that deviate most from the required probability bound can be obtained by applying (small amendments to) k-shortest (hop-constrained) paths algorithms. These results can be extended to Markov chains with rewards, to LTL model checking, and are useful for Markov decision processes. Experimental results show that typically the size of a counterexample is excessive. To obtain much more compact representations, we present a simple algorithm to generate (minimal) regular expressions that can act as counterexamples. The feasibility of our approach is illustrated by means of two communication protocols: leader election in an anonymous ring network and the Crowds protocol.
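
The reduction for the strongest evidence is straightforward to sketch: a path's probability is the product of its transition probabilities, so maximizing it is the same as minimizing the sum of -log(p) edge weights, i.e. a single-source shortest-path problem. The toy chain and function below are made up for illustration; hop bounds and the k-shortest-paths machinery for full counterexamples are omitted.

    import heapq, math

    def most_probable_path(P, source, targets):
        """P: dict state -> list of (successor, probability); targets: set of states.
        Dijkstra on edge weights -log(p) returns the most probable path."""
        dist, prev = {source: 0.0}, {}
        heap = [(0.0, source)]
        while heap:
            d, s = heapq.heappop(heap)
            if d > dist.get(s, math.inf):
                continue
            if s in targets:                    # first target popped = strongest evidence
                path = [s]
                while s != source:
                    s = prev[s]
                    path.append(s)
                return list(reversed(path)), math.exp(-d)
            for t, p in P.get(s, []):
                nd = d - math.log(p)
                if nd < dist.get(t, math.inf):
                    dist[t], prev[t] = nd, s
                    heapq.heappush(heap, (nd, t))
        return None, 0.0

    chain = {"s0": [("s1", 0.6), ("s2", 0.4)],
             "s1": [("bad", 0.1), ("s0", 0.9)],
             "s2": [("bad", 0.5), ("s0", 0.5)]}
    print(most_probable_path(chain, "s0", {"bad"}))   # (['s0', 's2', 'bad'], ~0.2)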

Book ChapterDOI
01 Jan 2009
TL;DR: The basic model of continuous-time MDPs and the concept of a Markov policy are stated in precise terms, and the basic optimality criteria of interest are introduced.
Abstract: In Chap. 2, we formally introduce the concepts associated to a continuous time MDP. Namely, the basic model of continuous-time MDPs and the concept of a Markov policy are stated in precise terms in Sect. 2.2. We also give, in Sect. 2.3, a precise definition of state and action processes in continuous-time MDPs, together with some fundamental properties of these two processes. Then, in Sect. 2.4, we introduce the basic optimality criteria that we are interested in.

Proceedings Article
Jaedeug Choi, Kee-Eung Kim
11 Jul 2009
TL;DR: This paper presents IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP) and deals with two cases according to the representation of the given expert's behavior.
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behaviour of an expert. Most of the existing algorithms for IRL assume that the expert's environment is modeled as a Markov decision process (MDP), although they should be able to handle partially observable settings in order to widen the applicability to more realistic scenarios. In this paper, we present an extension of the classical IRL algorithm by Ng and Russell to partially observable environments. We discuss technical issues and challenges, and present the experimental results on some of the benchmark partially observable domains.
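
For context, the finite-MDP characterization behind Ng and Russell's classical algorithm, which this paper extends to partially observable settings, can be stated in one line. With a state-dependent reward vector R, transition matrix P_a for each action a, discount factor \gamma, and an expert that takes action a_1 in every state, the expert's policy is optimal exactly when

    (P_{a_1} - P_a)\,(I - \gamma P_{a_1})^{-1} R \succeq 0 \qquad \text{for every action } a .

The classical algorithm then searches, among reward vectors satisfying this condition, for one that maximizes the margin by which the expert's actions beat the alternatives, minus a sparsity penalty; in the partially observable case the role of states is played by beliefs or by nodes of a finite-state controller, depending on how the expert's behaviour is represented, which is what the extension has to handle.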

Journal ArticleDOI
TL;DR: An average cost method is introduced, patterned after the known discounted cost method; its convergence is proved for a range of constant stepsize choices, and the convergence rate is shown to be optimal within the class of temporal difference methods.
Abstract: We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic programming. We introduce an average cost method, patterned after the known discounted cost method, and we prove its convergence for a range of constant stepsize choices. We also show that the convergence rate of both the discounted and the average cost methods is optimal within the class of temporal difference methods. Analysis and experiment indicate that our methods are substantially and often dramatically faster than TD(λ), as well as more reliable.

Journal ArticleDOI
TL;DR: An efficient online algorithm is presented that ensures that the agent's average performance loss vanishes over time, provided that the environment is oblivious to the agent's actions.
Abstract: We consider a learning problem where the decision maker interacts with a standard Markov decision process, with the exception that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform as well---in hindsight---as every stationary policy. This generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm---in the spirit of reinforcement learning---that ensures that the agent's average performance loss vanishes over time, provided that the environment is oblivious to the agent's actions. Moreover, it is possible to modify the basic algorithm to cope with instances where reward observations are limited to the agent's trajectory. We present further modifications that reduce the computational cost by using function approximation and that track the optimal policy through infrequent changes.

Journal ArticleDOI
TL;DR: In this article, the authors present an optimal policy iteration algorithm for solving DEC-POMDPs, which alternates between expanding the controller and performing value-preserving transformations.
Abstract: Coordination of distributed agents is required for problems arising in many areas, including multi-robot systems, networking and e-commerce. As a formal framework for such problems, we use the decentralized partially observable Markov decision process (DEC-POMDP). Though much work has been done on optimal dynamic programming algorithms for the single-agent version of the problem, optimal algorithms for the multiagent case have been elusive. The main contribution of this paper is an optimal policy iteration algorithm for solving DEC-POMDPs. The algorithm uses stochastic finite-state controllers to represent policies. The solution can include a correlation device, which allows agents to correlate their actions without communicating. This approach alternates between expanding the controller and performing value-preserving transformations, which modify the controller without sacrificing value. We present two efficient value-preserving transformations: one can reduce the size of the controller and the other can improve its value while keeping the size fixed. Empirical results demonstrate the usefulness of value-preserving transformations in increasing value while keeping controller size to a minimum. To broaden the applicability of the approach, we also present a heuristic version of the policy iteration algorithm, which sacrifices convergence to optimality. This algorithm further reduces the size of the controllers at each step by assuming that probability distributions over the other agents' actions are known. While this assumption may not hold in general, it helps produce higher quality solutions in our test problems.

Journal ArticleDOI
TL;DR: A new approximation method called nominal belief-state optimization (NBO), combined with other application-specific approximations and techniques within the POMDP framework, produces a practical design that coordinates the UAVs to achieve good long-term mean-squared-error tracking performance in the presence of occlusions and dynamic constraints.
Abstract: This paper discusses the application of the theory of partially observable Markov decision processes (POMDPs) to the design of guidance algorithms for controlling the motion of unmanned aerial vehicles (UAVs) with onboard sensors to improve tracking of multiple ground targets. While POMDP problems are intractable to solve exactly, principled approximation methods can be devised based on the theory that characterizes optimal solutions. A new approximation method called nominal belief-state optimization (NBO), combined with other application-specific approximations and techniques within the POMDP framework, produces a practical design that coordinates the UAVs to achieve good long-term mean-squared-error tracking performance in the presence of occlusions and dynamic constraints. The flexibility of the design is demonstrated by extending the objective to reduce the probability of a track swap in ambiguous situations.

Journal ArticleDOI
TL;DR: This work revisits Akamatsu's model by recasting it into a sum-over-paths statistical physics formalism allowing easy derivation of all the quantities of interest in an elegant, unified way and shows that the unique optimal policy can be obtained by solving a simple linear system of equations.
Abstract: This letter addresses the problem of designing the transition probabilities of a finite Markov chain (the policy) in order to minimize the expected cost for reaching a destination node from a source node while maintaining a fixed level of entropy spread throughout the network (the exploration). It is motivated by the following scenario. Suppose you have to route agents through a network in some optimal way, for instance, by minimizing the total travel cost---nothing particular up to now---you could use a standard shortest-path algorithm. Suppose, however, that you want to avoid pure deterministic routing policies in order, for instance, to allow some continual exploration of the network, avoid congestion, or avoid complete predictability of your routing strategy. In other words, you want to introduce some randomness or unpredictability in the routing policy (i.e., the routing policy is randomized). This problem, which will be called the randomized shortest-path problem (RSP), is investigated in this work. The global level of randomness of the routing policy is quantified by the expected Shannon entropy spread throughout the network and is provided a priori by the designer. Then, necessary conditions to compute the optimal randomized policy---minimizing the expected routing cost---are derived. Iterating these necessary conditions, reminiscent of Bellman's value iteration equations, allows computing an optimal policy, that is, a set of transition probabilities in each node. Interestingly and surprisingly enough, this first model, while formulated in a totally different framework, is equivalent to Akamatsu's model (1996), appearing in transportation science, for a special choice of the entropy constraint. We therefore revisit Akamatsu's model by recasting it into a sum-over-paths statistical physics formalism allowing easy derivation of all the quantities of interest in an elegant, unified way. For instance, it is shown that the unique optimal policy can be obtained by solving a simple linear system of equations. This second model is therefore more convincing because of its computational efficiency and soundness. Finally, simulation results obtained on simple, illustrative examples show that the models behave as expected.
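
A closely related computation can be sketched with a "soft" (log-sum-exp) variant of value iteration, in which a temperature theta plays the role of the Lagrange multiplier of the entropy constraint: large theta recovers the deterministic shortest path, while small theta spreads probability over many paths. This is only meant to illustrate the cost/entropy trade-off; the paper fixes the entropy level directly and obtains the optimum from a linear system via its sum-over-paths formalism. The graph and numbers below are made up.

    import numpy as np

    def soft_routing_policy(costs, dest, theta=1.0, sweeps=200):
        """costs: dict (u, v) -> nonnegative cost of directed edge u -> v."""
        nodes = {u for u, _ in costs} | {v for _, v in costs}
        succ = {u: [v for (a, v) in costs if a == u] for u in nodes}
        val = {n: 0.0 for n in nodes}                 # free-energy values, val[dest] = 0
        for _ in range(sweeps):                       # log-sum-exp ("soft") Bellman sweeps
            for u in nodes:
                if u == dest or not succ[u]:
                    continue
                z = sum(np.exp(-theta * (costs[(u, w)] + val[w])) for w in succ[u])
                val[u] = -np.log(z) / theta
        policy = {}
        for u in nodes:
            if u == dest or not succ[u]:
                continue
            z = np.array([np.exp(-theta * (costs[(u, w)] + val[w])) for w in succ[u]])
            policy[u] = dict(zip(succ[u], z / z.sum()))
        return val, policy

    edges = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "t"): 2.0, ("b", "t"): 1.0}
    values, policy = soft_routing_policy(edges, "t", theta=1.0)
    print(policy["s"])    # the two equal-cost routes get equal probability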

Journal ArticleDOI
TL;DR: New graphical representations are developed for the problem of sequential decision making in partially observable multiagent environments, as formalized by interactive partially observable Markov decision processes (I-POMDPs), and the error bound of an approximation technique for solving them is discussed and demonstrated.
Abstract: We develop new graphical representations for the problem of sequential decision making in partially observable multiagent environments, as formalized by interactive partially observable Markov decision processes (I-POMDPs). The graphical models called interactive influence diagrams (I-IDs) and their dynamic counterparts, interactive dynamic influence diagrams (I-DIDs), seek to explicitly model the structure that is often present in real-world problems by decomposing the situation into chance and decision variables, and the dependencies between the variables. I-DIDs generalize DIDs, which may be viewed as graphical representations of POMDPs, to multiagent settings in the same way that I-POMDPs generalize POMDPs. I-DIDs may be used to compute the policy of an agent given its belief as the agent acts and observes in a setting that is populated by other interacting agents. Using several examples, we show how I-IDs and I-DIDs may be applied and demonstrate their usefulness. We also show how the models may be solved using the standard algorithms that are applicable to DIDs. Solving I-DIDs exactly involves knowing the solutions of possible models of the other agents. The space of models grows exponentially with the number of time steps. We present a method of solving I-DIDs approximately by limiting the number of other agents' candidate models at each time step to a constant. We do this by clustering models that are likely to be behaviorally equivalent and selecting a representative set from the clusters. We discuss the error bound of the approximation technique and demonstrate its empirical performance.

Journal ArticleDOI
TL;DR: The purpose of this article is to introduce the POMDP model to behavioral scientists who may wish to apply the framework to the problem of understanding normative behavior in experimental settings.

Journal ArticleDOI
TL;DR: This work describes an approach to adaptive sensing based on approximately solving a partially observable Markov decision process (POMDP) formulation of the problem, and describes a variety of approximation methods.
Abstract: Adaptive sensing involves actively managing sensor resources to achieve a sensing task, such as object detection, classification, and tracking, and represents a promising direction for new applications of discrete event system methods. We describe an approach to adaptive sensing based on approximately solving a partially observable Markov decision process (POMDP) formulation of the problem. Such approximations are necessary because of the very large state space involved in practical adaptive sensing problems, precluding exact computation of optimal solutions. We review the theory of POMDPs and show how the theory applies to adaptive sensing problems. We then describe a variety of approximation methods, with examples to illustrate their application in adaptive sensing. The examples also demonstrate the gains that are possible from nonmyopic methods relative to myopic methods, and highlight some insights into the dependence of such gains on the sensing resources and environment.

Journal ArticleDOI
TL;DR: A Markov decision process model is developed that simultaneously determines maintenance and production schedules for a multiple-product, single-machine production system, accounting for the fact that equipment condition can affect the yield of different product types differently.
Abstract: Traditionally, the problems of equipment maintenance scheduling and production scheduling in a multi-product environment have been treated independently. In this paper, we develop a Markov decision process model that simultaneously determines maintenance and production schedules for a multiple-product, single-machine production system, accounting for the fact that equipment condition can affect the yield of different product types differently. The problem was motivated by an application in semiconductor manufacturing. After examining structural properties of the optimal policy, we compare the combined method to an approach often used in practice. In the nearly 6,000 test problems studied, the reward from the combined method was on average more than 25 percent greater than the reward from the traditional method.
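
The flavour of the combined model can be conveyed with a small toy MDP: the machine condition degrades stochastically, each product's yield falls off with condition at a different rate, and maintenance resets the condition at a cost. The parameters below are invented for illustration and are not from the paper's semiconductor application.

    import numpy as np

    C = 4                                      # machine condition 0 (new) .. C (worst)
    degrade = 0.3                              # P(condition worsens by one step per period)
    price = np.array([4.0, 6.0])               # revenue per good unit, per product
    yields = np.array([[0.95, 0.90, 0.85, 0.80, 0.70],    # product 0: robust to wear
                       [0.98, 0.90, 0.75, 0.55, 0.30]])   # product 1: sensitive to wear
    maint_cost, gamma = 5.0, 0.95

    def q_values(V):
        """Q-values of: produce product 0, produce product 1, do maintenance."""
        Q = np.empty((C + 1, 3))
        for c in range(C + 1):
            nxt = (1 - degrade) * V[c] + degrade * V[min(c + 1, C)]
            Q[c, 0] = price[0] * yields[0, c] + gamma * nxt
            Q[c, 1] = price[1] * yields[1, c] + gamma * nxt
            Q[c, 2] = -maint_cost + gamma * V[0]
        return Q

    V = np.zeros(C + 1)
    for _ in range(1000):                      # value iteration
        V = q_values(V).max(axis=1)

    actions = ["produce product 0", "produce product 1", "maintain"]
    for c, a in enumerate(q_values(V).argmax(axis=1)):
        print(f"condition {c}: {actions[a]}")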

Journal ArticleDOI
TL;DR: Extensive numerical experiments show the value- and policy-approximation approaches to work well across a wide range of problem parameters, and to outperform the pooling-based heuristics in most cases.

Journal ArticleDOI
TL;DR: This article proposes an approach that translates an expressive subset of the PPDDL representation to a first-order MDP (FOMDP) specification and then derives a domain-independent policy without grounding at any intermediate step, and it presents proof-of-concept results of this approach.

Journal ArticleDOI
TL;DR: This paper proposes a framework of combining intrusion detection and continuous authentication in MANETs, where multimodal biometrics are used for continuous authentication, and intrusion detection is modeled as sensors to detect system security state.
Abstract: Two complementary classes of approaches exist to protect high security mobile ad hoc networks (MANETs), prevention-based approaches, such as authentication, and detection-based approaches, such as intrusion detection. Most previous work studies these two classes of issues separately. In this paper, we propose a framework of combining intrusion detection and continuous authentication in MANETs. In this framework, multimodal biometrics are used for continuous authentication, and intrusion detection is modeled as sensors to detect system security state. We formulate the whole system as a partially observed Markov decision process considering both system security requirements and resource constraints. We then use dynamic programming-based hidden Markov model scheduling algorithms to derive the optimal schemes for both intrusion detection and continuous authentication. Extensive simulations show the effectiveness of the proposed scheme.

Proceedings Article
19 Sep 2009
TL;DR: A technique is presented to explain policies for factored MDPs by populating a set of domain-independent templates, together with a mechanism to determine a minimal set of templates that, viewed together, completely justify the policy.
Abstract: Explaining policies of Markov Decision Processes (MDPs) is complicated due to their probabilistic and sequential nature. We present a technique to explain policies for factored MDPs by populating a set of domain-independent templates. We also present a mechanism to determine a minimal set of templates that, viewed together, completely justify the policy. Our explanations can be generated automatically at run-time with no additional effort required from the MDP designer. We demonstrate our technique using the problems of advising undergraduate students in their course selection and assisting people with dementia in completing the task of handwashing. We also evaluate our explanations for course-advising through a user study involving students.

Proceedings ArticleDOI
06 Jul 2009
TL;DR: A model combining a Markov decision process model and HTN planning to address Web services composition is proposed, and it is shown that the proposed approach works effectively.
Abstract: Automatic Web services composition can be achieved by using AI planning techniques. HTN planning has been adopted to handle the OWL-S Web service composition problem. However, existing composition methods based on HTN planning have not considered the choice of decompositions available to a problem, which can lead to a variety of valid solutions. In this paper, we propose a model that combines a Markov decision process model and HTN planning to address Web services composition. In the model, HTN planning is enhanced to decompose a task in multiple ways and hence be able to find more than one plan, taking both functional and non-functional properties into account. Furthermore, an evaluation method is given to choose the optimal plan, and experimental results illustrate that the proposed approach works effectively.

Proceedings ArticleDOI
18 May 2009
TL;DR: An optimal power management (OPM) used by a batch scheduler in a server farm is proposed and the result shows that with OPM the job waiting time can be maintained below the maximum threshold while the power consumption is much smaller than that without OPM.
Abstract: Green computing is a new paradigm of designing the computer system which considers not only the processing performance but also the energy efficiency. Power management is one of the approaches in green computing to reduce the power consumption in distributed computing system. In this paper, we first propose an optimal power management (OPM) used by a batch scheduler in a server farm. This OPM observes the state of a server farm and makes the decision to switch the operation mode (i.e., active or sleep) of the server to minimize the power consumption while the performance requirements are met. An optimization problem based on constrained Markov decision process (CMDP) is formulated and solved to obtain an optimal decision of OPM. Given that OPM is used in the server farm, then an assignment of users to the server farms by a job broker is considered. This assignment is to ensure that the cost due to power consumption and network transportation is minimized. The performance of the system is extensively evaluated. The result shows that with OPM the job waiting time can be maintained below the maximum threshold while the power consumption is much smaller than that without OPM.
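
For readers unfamiliar with CMDPs, the standard way such a formulation is solved is as a linear program over (discounted) state-action occupancy measures. The sketch below shows that generic LP; the server-farm specifics (active/sleep modes, queue dynamics, delay threshold) are left as placeholder arrays rather than the paper's actual model.

    import numpy as np
    from scipy.optimize import linprog

    def solve_cmdp(P, power, delay, mu, delay_budget, gamma=0.95):
        """Discounted CMDP as an LP over occupancy measures rho(s, a):
          minimize   sum_{s,a} rho(s,a) * power(s,a)
          subject to sum_a rho(s',a) - gamma * sum_{s,a} P(s'|s,a) rho(s,a) = mu(s'),
                     sum_{s,a} rho(s,a) * delay(s,a) <= delay_budget,  rho >= 0.
        P: (S, A, S), power/delay: (S, A), mu: (S,) initial distribution."""
        S, A = power.shape
        n = S * A
        A_eq = np.zeros((S, n))                # Bellman flow-conservation constraints
        for s in range(S):
            for a in range(A):
                j = s * A + a
                A_eq[s, j] += 1.0
                A_eq[:, j] -= gamma * P[s, a]
        res = linprog(power.reshape(n),
                      A_ub=delay.reshape(1, n), b_ub=[delay_budget],
                      A_eq=A_eq, b_eq=mu,
                      bounds=[(0, None)] * n, method="highs")
        rho = res.x.reshape(S, A)
        # The (possibly randomized) optimal policy is rho normalized per state.
        return rho / np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)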

Journal ArticleDOI
TL;DR: The results indicate that these policies rarely differ, and that when they do, the difference in long-run mission reliability is minimal, which suggests that future work should concentrate on extending results for the single-mission problem.
Abstract: "Selective maintenance" models determine the optimal subset of desirable maintenance actions to perform when maintenance resources are constrained. We analyse a corrective selective maintenance model that identifies which components to replace in the finitely long periods of time between missions performed by a series-parallel system. We formulate this multi-mission problem as a stochastic dynamic program, and compare the resulting optimal infinite-horizon policy to both the optimal single-mission and two-mission policies by executing a large numerical experiment. Our results indicate that these policies rarely differ, and that when they do, the difference in long-run mission reliability is minimal, which suggests that future work should concentrate on extending results for the single-mission problem.