
Showing papers on "Markov decision process" published in 2016


Posted Content
TL;DR: This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data; the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.
Abstract: Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.

668 citations
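
A minimal sketch of the core RL$^2$ idea, under illustrative assumptions (a toy numpy RNN cell and a hypothetical `env.reset()/env.step()` interface; this is not the authors' implementation, which trains the weights with a general-purpose "slow" RL algorithm): the recurrent policy receives the observation together with the previous action, reward, and termination flag, and its hidden state is reset only between trials on different MDPs, never between episodes of the same MDP, so the hidden state can serve as the memory of the "fast" learner.

```python
import numpy as np

class RNNPolicy:
    """Toy recurrent policy; weights would normally be trained by a slow RL algorithm."""
    def __init__(self, obs_dim, n_actions, hidden=32, seed=0):
        self.rng = np.random.default_rng(seed)
        in_dim = obs_dim + n_actions + 2   # obs, one-hot prev action, prev reward, prev done flag
        self.W_in = self.rng.normal(0, 0.1, (hidden, in_dim))
        self.W_h = self.rng.normal(0, 0.1, (hidden, hidden))
        self.W_out = self.rng.normal(0, 0.1, (n_actions, hidden))
        self.n_actions = n_actions

    def initial_state(self):
        return np.zeros(self.W_h.shape[0])

    def step(self, h, obs, prev_action, prev_reward, prev_done):
        a_onehot = np.eye(self.n_actions)[prev_action]
        x = np.concatenate([obs, a_onehot, [prev_reward], [float(prev_done)]])
        h = np.tanh(self.W_in @ x + self.W_h @ h)          # recurrent update
        logits = self.W_out @ h
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        action = int(self.rng.choice(self.n_actions, p=probs))
        return h, action

def run_trial(policy, env, episodes=5):
    """One trial: several episodes on the SAME (previously unseen) MDP.
    The hidden state h is reset only here, not between episodes, so it can
    accumulate information about the current MDP (the 'fast' learner state)."""
    h = policy.initial_state()
    prev_action, prev_reward, prev_done = 0, 0.0, True
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            h, action = policy.step(h, obs, prev_action, prev_reward, prev_done)
            obs, reward, done, _ = env.step(action)
            prev_action, prev_reward, prev_done = action, reward, done
            total += reward
    return total
```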


Book
03 Jun 2016
TL;DR: This book introduces multiagent planning under uncertainty as formalized by decentralized partially observable Markov decision processes (Dec-POMDPs).
Abstract: This book introduces multiagent planning under uncertainty as formalized by decentralized partially observable Markov decision processes (Dec-POMDPs). The intended audience is researchers and graduate students working in the fields of artificial intelligence related to sequential decision making: reinforcement learning, decision-theoretic planning for single agents, classical multiagent planning, decentralized control, and operations research.

658 citations


Posted Content
TL;DR: This paper applies deep reinforcement learning to the problem of forming long-term driving strategies, showing how policy gradient iterations can be used without Markovian assumptions and decomposing the problem into a learned Policy for Desires composed with trajectory planning under hard constraints.
Abstract: Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns, and pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield an overly simplistic policy. Moreover, one must balance robustness to the unexpected behavior of other drivers and pedestrians against being too defensive, so that normal traffic flow is maintained. In this paper we apply deep reinforcement learning to the problem of forming long-term driving strategies. We note two major challenges that make autonomous driving different from other robotic tasks. First is the necessity of ensuring functional safety - something that machine learning has difficulty with, given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of the unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is learned) and trajectory planning with hard constraints (which is not learned). The goal of the Desires policy is to enable driving comfort, while the hard constraints guarantee driving safety. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon, thereby reducing the variance of the gradient estimation even further.

575 citations


Proceedings ArticleDOI
10 Jul 2016
TL;DR: By analyzing the average delay of each task and the average power consumption at the mobile device, a power-constrained delay minimization problem is formulated, and an efficient one-dimensional search algorithm is proposed to find the optimal task scheduling policy.
Abstract: Mobile-edge computing (MEC) emerges as a promising paradigm to improve the quality of computation experience for mobile devices. Nevertheless, the design of computation task scheduling policies for MEC systems inevitably encounters a challenging two-timescale stochastic optimization problem. Specifically, in the larger timescale, whether to execute a task locally at the mobile device or to offload a task to the MEC server for cloud computing should be decided, while in the smaller timescale, the transmission policy for the task input data should adapt to the channel side information. In this paper, we adopt a Markov decision process approach to handle this problem, where the computation tasks are scheduled based on the queueing state of the task buffer, the execution state of the local processing unit, as well as the state of the transmission unit. By analyzing the average delay of each task and the average power consumption at the mobile device, we formulate a power-constrained delay minimization problem, and propose an efficient one-dimensional search algorithm to find the optimal task scheduling policy. Simulation results are provided to demonstrate the capability of the proposed optimal stochastic task scheduling policy in achieving a shorter average execution delay compared to the baseline policies.

483 citations
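
One natural reading of this formulation, stated here as an illustrative assumption rather than the paper's exact notation or algorithm: the power-constrained delay minimization is a constrained average-cost MDP, and the one-dimensional search can be viewed as a search over a single scalar (e.g., a Lagrange multiplier) trading delay against power.

```latex
% Illustrative notation, not the paper's exact formulation.
\[
\begin{aligned}
\min_{\pi}\ \ & \bar{D}^{\pi} \;=\; \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[D_t^{\pi}\big]
  && \text{(average delay)}\\
\text{s.t.}\ \ & \bar{P}^{\pi} \;=\; \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[P_t^{\pi}\big] \;\le\; P_{\max}
  && \text{(average power budget)}
\end{aligned}
\]
% Lagrangian relaxation: for a fixed multiplier \lambda \ge 0, solve the
% unconstrained MDP with stage cost D_t + \lambda P_t, then search over
% \lambda in one dimension until the power budget is met.
\[
L(\pi,\lambda) \;=\; \bar{D}^{\pi} + \lambda\big(\bar{P}^{\pi} - P_{\max}\big).
\]
```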


Proceedings ArticleDOI
TL;DR: In this paper, the authors adopt a Markov decision process approach to handle the problem of computation task scheduling in MEC systems, where the computation tasks are scheduled based on the queueing state of the task buffer, the execution state of the local processing unit, and the state of the transmission unit.
Abstract: Mobile-edge computing (MEC) emerges as a promising paradigm to improve the quality of computation experience for mobile devices. Nevertheless, the design of computation task scheduling policies for MEC systems inevitably encounters a challenging two-timescale stochastic optimization problem. Specifically, in the larger timescale, whether to execute a task locally at the mobile device or to offload a task to the MEC server for cloud computing should be decided, while in the smaller timescale, the transmission policy for the task input data should adapt to the channel side information. In this paper, we adopt a Markov decision process approach to handle this problem, where the computation tasks are scheduled based on the queueing state of the task buffer, the execution state of the local processing unit, as well as the state of the transmission unit. By analyzing the average delay of each task and the average power consumption at the mobile device, we formulate a power-constrained delay minimization problem, and propose an efficient one-dimensional search algorithm to find the optimal task scheduling policy. Simulation results are provided to demonstrate the capability of the proposed optimal stochastic task scheduling policy in achieving a shorter average execution delay compared to the baseline policies.

272 citations


Journal ArticleDOI
TL;DR: A novel MARL approach is proposed in which agents are allowed to rehearse with information that will not be available during policy execution, and it is shown experimentally that incorporating rehearsal features can enhance the learning rate compared to non-rehearsal-based learners.

259 citations


Posted Content
TL;DR: The Value Iteration Network (VIN) as discussed by the authors is a differentiable approximation of the value iteration algorithm, which can be represented as a convolutional neural network and trained end-to-end using standard backpropagation.
Abstract: We introduce the value iteration network (VIN): a fully differentiable neural network with a `planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.

253 citations
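
A minimal numpy sketch of the computation the VIN module differentiably approximates: on a 2-D gridworld, one value-iteration sweep is a convolution-like local operation (each action's Q-map combines the reward map with a shifted value map), followed by a max over the action channel. In an actual VIN the per-action "shift" is a learned convolution kernel and the stacked sweeps are trained end-to-end by backpropagation; the code below is illustrative, not the authors' implementation.

```python
import numpy as np

def shift(V, d):
    """Value of the neighbor reached by moving in direction d;
    boundary cells keep their own value (moving into the wall)."""
    out = np.copy(V)
    if d == "up":    out[1:, :] = V[:-1, :]
    if d == "down":  out[:-1, :] = V[1:, :]
    if d == "left":  out[:, 1:] = V[:, :-1]
    if d == "right": out[:, :-1] = V[:, 1:]
    return out

def vi_module(R, K=50, gamma=0.95):
    """K sweeps of value iteration: Q_a = R + gamma * shift(V, a); V = max_a Q_a."""
    V = np.zeros_like(R)
    for _ in range(K):
        Q = np.stack([R + gamma * shift(V, d) for d in ("up", "down", "left", "right")])
        V = Q.max(axis=0)
    return V

# Tiny example: reward of +1 at the goal cell, -0.01 step cost elsewhere.
R = -0.01 * np.ones((8, 8))
R[7, 7] = 1.0
V = vi_module(R)
print(np.round(V[0, 0], 3))   # value at the start cell after planning
```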


Proceedings Article
05 Dec 2016
TL;DR: The value iteration network (VIN) as mentioned in this paper is a differentiable approximation of the value iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation.
Abstract: We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.

215 citations


Posted Content
TL;DR: In this article, a semi-aggregated Markov decision process (SAMDP) is proposed to identify spatio-temporal abstractions directly from features and may be used as a sub-goal detector in future work.
Abstract: In recent years there has been growing interest in using deep representations for reinforcement learning. In this paper, we present a methodology and tools to analyze Deep Q-networks (DQNs) in a non-blind manner. Moreover, we propose a new model, the Semi Aggregated Markov Decision Process (SAMDP), and an algorithm that learns it automatically. The SAMDP model allows us to identify spatio-temporal abstractions directly from features and may be used as a sub-goal detector in future work. Using our tools we reveal that the features learned by DQNs aggregate the state space in a hierarchical fashion, explaining their success. Moreover, we are able to understand and describe the policies learned by DQNs for three different Atari2600 games and suggest ways to interpret, debug and optimize deep neural networks in reinforcement learning.

191 citations


Journal ArticleDOI
TL;DR: The obtained adaptive and optimal output-feedback controllers differ from the existing literature on the ADP in that they are derived from sampled-data systems theory and are guaranteed to be robust to dynamic uncertainties.

183 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: A method to predict long-term motion of pedestrians, modeling their behavior as jump-Markov processes with the goal as a hidden variable and intent as a policy in a Markov decision process framework.
Abstract: We present a method to predict long-term motion of pedestrians, modeling their behavior as jump-Markov processes with their goal a hidden variable. Assuming approximately rational behavior, and incorporating environmental constraints and biases, including time-varying ones imposed by traffic lights, we model intent as a policy in a Markov decision process framework. We infer pedestrian state using a Rao-Blackwellized filter, and intent by planning according to a stochastic policy, reflecting individual preferences in aiming at the same goal.

Journal ArticleDOI
TL;DR: A continuous-time version of the traditional value iteration (VI) algorithm is presented with rigorous convergence analysis, which is crucial for developing new adaptive dynamic programming methods to solve the adaptive optimal control problem and the stochastic robust optimal control problem for linear continuous-time systems.

Journal ArticleDOI
TL;DR: A delay-optimal virtualized radio resource scheduling scheme is proposed via stochastic learning, and simulation results show that it outperforms traditional schemes.
Abstract: Due to the high density of vehicles and various types of vehicular services, it is challenging to guarantee the quality of vehicular services in current Long-Term Evolution (LTE) networks in a cost-efficient manner. Fortunately, with the development of fifth-generation (5G) technology, the installation of a large number of small cells is foreseen as one of the practical ways to achieve the low-delay requirement in vehicular environments. However, it may cause a huge operating expense and capital expenditure to mobile network operators due to the limited backhaul capacity and the explosion of signaling. In this paper, we integrate software-defined networking and radio resource virtualization into an LTE system for vehicular networks, i.e., software-defined heterogeneous vehicular network (SERVICE) . Based on this proposed system framework, a delay-optimal virtualized radio resource scheduling scheme is proposed via stochastic learning. The delay optimal problem is formulated as an infinite-horizon average-cost partially observed Markov decision process (POMDP). Then, an equivalent Bellman equation is derived to solve it. The proposed scheme can be divided into two stages, i.e., macro virtualization resource allocation (MaVRA) and micro virtualization resource allocation (MiVRA). The former is executed based on large timescale variables (traffic density), whereas the latter is operated according to short timescale variables (channel state and queue state). Simulation results show that the proposed scheme outperforms traditional schemes.
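
For reference, the "equivalent Bellman equation" mentioned above has, in generic infinite-horizon average-cost form, the following shape (illustrative textbook notation; the paper works with the belief/observed state specific to its POMDP):

```latex
% Generic average-cost Bellman optimality equation; illustrative notation.
\[
\theta + V(s) \;=\; \min_{a\in\mathcal{A}}\Big\{ c(s,a) + \sum_{s'} \Pr(s'\mid s,a)\, V(s') \Big\},
\]
% where \theta is the optimal average cost per stage and V is the relative value function.
```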

Journal ArticleDOI
TL;DR: A new method to optimize traffic flow, based on reinforcement learning, is proposed; it uses Q-learning to learn policies dictating the maximum driving speed allowed on a highway, such that traffic congestion is reduced.

Journal ArticleDOI
TL;DR: The proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks.
Abstract: Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.
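
A minimal sketch of option-style execution under hypothetical interfaces (the referenced method additionally infers all of these components from data rather than assuming them): an option is a sub-policy with an initiation probability and a per-step termination probability; the agent picks an option, follows its sub-policy, and re-selects when the option terminates.

```python
import numpy as np

class Option:
    def __init__(self, sub_policy, initiation_prob, termination_prob):
        self.sub_policy = sub_policy              # state -> action
        self.initiation_prob = initiation_prob    # state -> prob of being selectable
        self.termination_prob = termination_prob  # state -> prob of terminating

def select_option(options, state, rng):
    """Pick an option with probability proportional to its initiation probability."""
    w = np.array([o.initiation_prob(state) for o in options])
    w = w / w.sum()
    return options[rng.choice(len(options), p=w)]

def run_episode(env, options, horizon=200, seed=0):
    rng = np.random.default_rng(seed)
    state = env.reset()
    option = select_option(options, state, rng)
    total = 0.0
    for _ in range(horizon):
        action = option.sub_policy(state)
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
        if rng.random() < option.termination_prob(state):
            option = select_option(options, state, rng)   # re-select on termination
    return total
```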

Proceedings ArticleDOI
27 Dec 2016
TL;DR: In this paper, the authors formulate two synthesis problems where the desired STL specification is enforced by maximizing the probability of satisfaction, and the expected robustness degree, a measure quantifying the quality of satisfaction.
Abstract: This paper addresses the problem of learning optimal policies for satisfying signal temporal logic (STL) specifications by agents with unknown stochastic dynamics. The system is modeled as a Markov decision process, in which the states represent partitions of a continuous space and the transition probabilities are unknown. We formulate two synthesis problems where the desired STL specification is enforced by maximizing the probability of satisfaction, and the expected robustness degree, that is, a measure quantifying the quality of satisfaction. We discuss that Q-learning is not directly applicable to these problems because, based on the quantitative semantics of STL, the probability of satisfaction and expected robustness degree are not in the standard objective form of Q-learning. To resolve this issue, we propose an approximation of STL synthesis problems that can be solved via Q-learning, and we derive some performance bounds for the policies obtained by the approximate approach. The performance of the proposed method is demonstrated via simulations.
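
A minimal sketch of the outer loop used by such an approximate approach, written as plain tabular Q-learning; the STL-specific part (turning the robustness degree of a trajectory fragment into a per-step reward on a suitably augmented state space) is abstracted behind the placeholder `reward_fn`, and the environment interface below is an assumption, not the paper's construction.

```python
import numpy as np

def q_learning(env, n_states, n_actions, reward_fn,
               episodes=5000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Plain tabular Q-learning skeleton with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, done, info = env.step(a)
            r = reward_fn(s, a, s_next, info)    # placeholder: e.g. approximated robustness gain
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q.argmax(axis=1)   # greedy policy
```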

Proceedings Article
01 Jun 2016
TL;DR: In this paper, the authors define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior.
Abstract: In classical reinforcement learning agents accept arbitrary short term loss for long term gain when exploring their environment. This is infeasible for safety critical applications such as robotics, where even a single unsafe action may cause system failure or harm the environment. In this paper, we address the problem of safely exploring finite Markov decision processes (MDP). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm, SAFEMDP, for this task and prove that it completely explores the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.

Book
21 Mar 2016
TL;DR: A book-length treatment of partially observed Markov decision processes (POMDPs), covering formulation, algorithms, and structural results in stochastic dynamic programming, with applications to controlled sensing.
Abstract: Covering formulation, algorithms, and structural results, and linking theory to real-world applications in controlled sensing (including social learning, adaptive radars and sequential detection), this book focuses on the conceptual foundations of partially observed Markov decision processes (POMDPs). It emphasizes structural results in stochastic dynamic programming, enabling graduate students and researchers in engineering, operations research, and economics to understand the underlying unifying themes without getting weighed down by mathematical technicalities. Bringing together research from across the literature, the book provides an introduction to nonlinear filtering followed by a systematic development of stochastic dynamic programming, lattice programming and reinforcement learning for POMDPs. Questions addressed in the book include: when does a POMDP have a threshold optimal policy? When are myopic policies optimal? How do local and global decision makers interact in adaptive decision making in multi-agent social learning where there is herding and data incest? And how can sophisticated radars and sensors adapt their sensing in real time?

Journal ArticleDOI
TL;DR: This two-part tutorial proposes a simple, straightforward canonical model (that is most familiar to people with a control theory background), and introduces four fundamental classes of policies which integrate the competing strategies that have been proposed under names such as control theory, dynamic programming, stochastic programming and robust optimization.
Abstract: There is a wide range of problems in energy systems that require making decisions in the presence of different forms of uncertainty. The fields that address sequential, stochastic decision problems lack a standard canonical modeling framework, with fragmented, competing solution strategies. Recognizing that we will never agree on a single notational system, this two-part tutorial proposes a simple, straightforward canonical model (that is most familiar to people with a control theory background), and introduces four fundamental classes of policies which integrate the competing strategies that have been proposed under names such as control theory, dynamic programming, stochastic programming and robust optimization. Part II of the tutorial illustrates the modeling framework using a simple energy storage problem, where we show that, depending on the problem characteristics, each of the four classes of policies may be best.
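
For orientation, the canonical model advocated in this tutorial line of work has roughly the following shape (a condensed paraphrase in generic notation, not a quotation): a state S_t, a decision x_t produced by a policy X^π, exogenous information W_{t+1}, a transition function, and an objective optimized over policies; the four policy classes are different ways of constructing X^π.

```latex
% Condensed paraphrase; control-theory-flavored canonical sequential decision model.
\[
x_t = X^{\pi}(S_t), \qquad
S_{t+1} = S^{M}\!\big(S_t,\; x_t,\; W_{t+1}\big), \qquad
\max_{\pi}\; \mathbb{E}\Big\{ \sum_{t=0}^{T} C\big(S_t,\, X^{\pi}(S_t)\big) \Big\}.
\]
```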

Posted Content
TL;DR: This paper shows that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Process methods, and that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently.
Abstract: In this paper, we propose to use deep policy networks which are trained with an advantage actor-critic method for statistically optimised dialogue systems. First, we show that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Processes methods. Summary state and action spaces lead to good performance but require pre-engineering effort, RL knowledge, and domain expertise. In order to remove the need to define such summary spaces, we show that deep RL can also be trained efficiently on the original state and action spaces. Dialogue systems based on partially observable Markov decision processes are known to require many dialogues to train, which makes them unappealing for practical deployment. We show that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently. Indeed, with only a few hundred dialogues collected with a handcrafted policy, the actor-critic deep learner is considerably bootstrapped from a combination of supervised and batch RL. In addition, convergence to an optimal policy is significantly sped up compared to other deep RL methods initialized on the data with batch RL. All experiments are performed on a restaurant domain derived from the Dialogue State Tracking Challenge 2 (DSTC2) dataset.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This work proposes new algorithms for multi-agent distributed iterative value function approximation where the agents are allowed to have different behavior policies while evaluating the response to a single target policy based on consensus-based distributed stochastic approximation.
Abstract: Reinforcement learning deals with the problem of how to map situations (states) to actions so as to maximize a numerical reward while interacting with dynamical and uncertain environment. Within the framework of Markov Decision Processes (MDPs) these methods are typically based on approximate dynamic programming using appropriate calculation/approximation of the value function. In this work we propose new algorithms for multi-agent distributed iterative value function approximation where the agents are allowed to have different behavior policies while evaluating the response to a single target policy. The algorithms assume linear parametrization of the value function and are based on consensus-based distributed stochastic approximation. Under appropriate assumptions on the time-varying network topology and the overall state-visiting distributions of the agents we prove weak convergence of the parameter estimates to the globally optimal point. It is demonstrated that the agents are able to together reach this solution even when the individual agents cannot.
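
A generic form of the consensus-based update the abstract describes, in illustrative notation (the actual algorithms additionally handle off-policy corrections for the agents' differing behavior policies and impose specific step-size and network conditions): each agent i mixes its neighbors' parameter estimates through the time-varying consensus weights and then applies a local temporal-difference correction.

```latex
% Illustrative consensus + local TD-style update with linear features \phi.
\[
\theta^{i}_{t+1} \;=\; \sum_{j} a_{ij}(t)\,\theta^{j}_{t} \;+\; \alpha_t\,\delta^{i}_{t}\,\phi\big(s^{i}_{t}\big),
\qquad
\delta^{i}_{t} \;=\; r^{i}_{t+1} + \gamma\,\phi\big(s^{i}_{t+1}\big)^{\!\top}\theta^{i}_{t} - \phi\big(s^{i}_{t}\big)^{\!\top}\theta^{i}_{t}.
\]
```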

Journal Article
TL;DR: This work analyzes the statistical properties of REG-LSPI and provides an upper bound on the policy evaluation error and the performance loss of the policy returned by this method; it is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm.
Abstract: We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms are the regularized extensions of the Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm.

Book ChapterDOI
17 Oct 2016
TL;DR: In this article, the authors propose a technique for verifying probabilistic models whose transition probabilities are parametric, replacing parametric transitions by non-deterministic choices of extremal values.
Abstract: We propose a conceptually simple technique for verifying probabilistic models whose transition probabilities are parametric. The key is to replace parametric transitions by nondeterministic choices of extremal values. Analysing the resulting parameter-free model using off-the-shelf means yields (refinable) lower and upper bounds on probabilities of regions in the parameter space. The technique outperforms the existing analysis of parametric Markov chains by several orders of magnitude regarding both run-time and scalability. Its beauty is its applicability to various probabilistic models. It in particular provides the first sound and feasible method for performing parameter synthesis of Markov decision processes.

Journal ArticleDOI
TL;DR: In this article, the authors present a cell transmission model formulation for dynamic lane reversal and derive an effective heuristic based on theoretical results from the integer program for a single bottleneck link with varying demands.
Abstract: Autonomous vehicles admit consideration of novel traffic behaviors such as reservation-based intersection controls and dynamic lane reversal. The authors present a cell transmission model formulation for dynamic lane reversal. For deterministic demand, the authors formulate the dynamic lane reversal control problem for a single link as an integer program and derive theoretical results. In reality, demand is not known perfectly at arbitrary times in the future. To address stochastic demand, the authors present a Markov decision process formulation. Due to the large state size, the Markov decision process is intractable. However, based on theoretical results from the integer program, the authors derive an effective heuristic. They demonstrate significant improvements over a fixed lane configuration both on a single bottleneck link with varying demands, and on the downtown Austin network.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: In this article, the authors study the optimal content delivery strategy in cache-enabled HetNets by taking into account the inherent multicast capability of wireless medium, and formulate a stochastic multicast scheduling problem to jointly minimize the average network delay and power costs.
Abstract: Caching at small base stations (SBSs) has demonstrated significant benefits in alleviating the backhaul requirement in heterogeneous cellular networks (HetNets). While many existing works focus on what contents to cache at each SBS, an equally important but much less investigated problem is what contents to deliver given the cache status and user requests. In this paper, we study the optimal content delivery strategy in cache-enabled HetNets by taking into account the inherent multicast capability of wireless medium. We establish a content-centric request queue model and then formulate a stochastic multicast scheduling problem to jointly minimize the average network delay and power costs. This stochastic optimization problem is an infinite horizon average cost Markov decision process (MDP), which is well known to be challenging. By using relative value iteration algorithm and the special properties of the request queue dynamics, we characterize some properties of the value function of the MDP. Based on these properties, we show that the optimal multicast scheduling policy, which is adaptive to the request queue state, is of the threshold type. Finally, we propose a low complexity optimal algorithm by exploiting the structural properties of the optimal policy.

Posted Content
TL;DR: This paper proposes an approximation of STL synthesis problems that can be solved via Q-learning, and derives some performance bounds for the policies obtained by the approximate approach.
Abstract: This paper addresses the problem of learning optimal policies for satisfying signal temporal logic (STL) specifications by agents with unknown stochastic dynamics. The system is modeled as a Markov decision process, in which the states represent partitions of a continuous space and the transition probabilities are unknown. We formulate two synthesis problems where the desired STL specification is enforced by maximizing the probability of satisfaction, and the expected robustness degree, that is, a measure quantifying the quality of satisfaction. We discuss that Q-learning is not directly applicable to these problems because, based on the quantitative semantics of STL, the probability of satisfaction and expected robustness degree are not in the standard objective form of Q-learning. To resolve this issue, we propose an approximation of STL synthesis problems that can be solved via Q-learning, and we derive some performance bounds for the policies obtained by the approximate approach. The performance of the proposed method is demonstrated via simulations.

Journal ArticleDOI
TL;DR: A Markov decision process model is proposed to solve the dynamic vehicle routing problem (DVRP) with nonstationary stochastic travel times under traffic congestion; a rollout-based approach using approximate dynamic programming is adopted to avoid the curse of dimensionality.
Abstract: This paper proposes a dynamic vehicle routing problem (DVRP) model with nonstationary stochastic travel times under traffic congestion. Depending on the traffic conditions, the travel time between two nodes, particularly in a city, may not be proportional to distance and changes both dynamically and stochastically over time. Considering this environment, we propose a Markov decision process model to solve this problem and adopt a rollout-based approach to the solution, using approximate dynamic programming to avoid the curse of dimensionality. We also investigate how to estimate the probability distribution of travel times of arcs which, reflecting reality, are considered to consist of multiple road segments. Experiments are conducted using a real-world problem faced by a Singapore logistics/delivery company and authentic road traffic information.

Journal ArticleDOI
TL;DR: This work proposes an approach to synthesize control protocols for autonomous systems that account for uncertainties and imperfections in interactions with human operators, using abstractions based on Markov decision processes and augmenting these models to stochastic two-player games.
Abstract: We propose an approach to synthesize control protocols for autonomous systems that account for uncertainties and imperfections in interactions with human operators. As an illustrative example, we consider a scenario involving road network surveillance by an unmanned aerial vehicle (UAV) that is controlled remotely by a human operator but also has a certain degree of autonomy. Depending on the type (i.e., probabilistic and/or nondeterministic) of knowledge about the uncertainties and imperfections in the human–automation interactions, we use abstractions based on Markov decision processes and augment these models to stochastic two-player games. Our approach enables the synthesis of operator-dependent optimal mission plans for the UAV, highlighting the effects of operator characteristics (e.g., workload, proficiency, and fatigue) on UAV mission performance. It can also provide informative feedback (e.g., Pareto curves showing the trade-offs between multiple mission objectives), potentially assisting the operator in decision-making. We demonstrate the applicability of our approach via a detailed UAV mission planning case study.

Journal ArticleDOI
TL;DR: There is actually a common theme to these strategies, and underpinning the entire field remain the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950s and 1960s.
Abstract: Approximate dynamic programming has evolved, initially independently, within operations research, computer science and the engineering controls community, all searching for practical tools for solving sequential stochastic optimization problems. More so than other communities, operations research continued to develop the theory behind the basic model introduced by Bellman with discrete states and actions, even while authors as early as Bellman himself recognized its limits due to the "curse of dimensionality" inherent in discrete state spaces. In response to these limitations, subcommunities in computer science, control theory and operations research have developed a variety of methods for solving different classes of stochastic, dynamic optimization problems, creating the appearance of a jungle of competing approaches. In this article, we show that there is actually a common theme to these strategies, and underpinning the entire field remain the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950s and 1960s. Dynamic programming involves making decisions over time, under uncertainty. These problems arise in a wide range of applications, spanning business, science, engineering, economics, medicine and health, and operations. While tremendous successes have been achieved in specific problem settings, we lack general purpose tools with the broad applicability enjoyed by algorithmic strategies such as linear, nonlinear and integer programming. This paper provides an introduction to the challenges of dynamic programming, and describes the contributions made by different subcommunities, with special emphasis on computer science, which pioneered a field known as reinforcement learning, and the operations research community, which has made contributions through several subcommunities, including stochastic programming, simulation optimization and approximate dynamic programming. Our presentation recognizes, but does not do justice to, the important contributions made in the engineering controls communities.
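
The two foundational updates the article refers to, in standard discounted-MDP notation (a textbook statement, not a quotation from the article):

```latex
% Value iteration (contribution C, discount factor gamma):
\[
V_{k+1}(s) \;=\; \max_{a}\Big\{ C(s,a) + \gamma \sum_{s'} \Pr(s'\mid s,a)\, V_k(s') \Big\}.
\]
% Policy iteration: evaluate the current policy, then improve it greedily.
\[
V^{\pi_k}(s) = C\big(s,\pi_k(s)\big) + \gamma \sum_{s'} \Pr\big(s'\mid s,\pi_k(s)\big)\, V^{\pi_k}(s'),
\qquad
\pi_{k+1}(s) \in \arg\max_{a}\Big\{ C(s,a) + \gamma \sum_{s'} \Pr(s'\mid s,a)\, V^{\pi_k}(s') \Big\}.
\]
```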

Proceedings ArticleDOI
22 May 2016
TL;DR: Numerical results show that the performance of the proposed approach, which requires only causal knowledge of the energy harvesting process and channel coefficients, has only a small degradation compared to the optimum case which requires perfect non-causal knowledge.
Abstract: Energy harvesting point-to-point communications are considered. The transmitter harvests energy from the environment and stores it in a finite battery. It is assumed that the transmitter always has data to transmit and that the harvested energy is used exclusively for data transmission. Since prior knowledge about the energy harvesting process might not be available in practical scenarios, we assume that at each time instant only information about the current state of the transmitter is available, i.e., harvested energy, battery level and channel coefficient. We model the scenario as a Markov decision process and we implement reinforcement learning at the transmitter to find a power allocation policy that aims at maximizing the throughput. To overcome the limitations of traditional reinforcement learning algorithms, we apply the concept of function approximation and we propose a set of binary functions to approximate the expected throughput given the state of the transmitter. Numerical results show that the performance of the proposed approach, which requires only causal knowledge of the energy harvesting process and channel coefficients, has only a small degradation compared to the optimum case which requires perfect non-causal knowledge. Additionally, the proposed approach outperforms naive policies that assume only causal knowledge at the transmitter.
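
A minimal sketch of the general technique the abstract describes, Q-learning with linear function approximation over binary features; the feature map and environment interface below are placeholders, not the paper's design (which builds binary features over battery level, harvested energy and channel state).

```python
import numpy as np

def q_approx(theta, phi, s, a):
    """Q(s, a) ~ theta . phi(s, a), with phi a vector of (binary) features."""
    return theta @ phi(s, a)

def fa_q_learning(env, phi, n_features, actions,
                  steps=100_000, alpha=0.01, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)
    s = env.reset()
    for _ in range(steps):
        if rng.random() < eps:
            a = actions[rng.integers(len(actions))]
        else:
            a = max(actions, key=lambda a_: q_approx(theta, phi, s, a_))
        s_next, r = env.step(a)                      # placeholder: r = achieved throughput
        q_next = max(q_approx(theta, phi, s_next, a_) for a_ in actions)
        td_error = r + gamma * q_next - q_approx(theta, phi, s, a)
        theta += alpha * td_error * phi(s, a)        # semi-gradient update
        s = s_next
    return theta
```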