
Showing papers on "Markov decision process" published in 2016


Posted Content
TL;DR: This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data; the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.
Abstract: Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.

668 citations
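
A minimal sketch of the core RL$^2$ idea, under illustrative assumptions (a toy numpy RNN cell and a hypothetical `env.reset()/env.step()` interface; this is not the authors' implementation, which trains the weights with a general-purpose "slow" RL algorithm): the recurrent policy receives the observation together with the previous action, reward, and termination flag, and its hidden state is reset only between trials on different MDPs, never between episodes of the same MDP, so the hidden state can serve as the memory of the "fast" learner.

```python
import numpy as np

class RNNPolicy:
    """Toy recurrent policy; weights would normally be trained by a slow RL algorithm."""
    def __init__(self, obs_dim, n_actions, hidden=32, seed=0):
        self.rng = np.random.default_rng(seed)
        in_dim = obs_dim + n_actions + 2   # obs, one-hot prev action, prev reward, prev done flag
        self.W_in = self.rng.normal(0, 0.1, (hidden, in_dim))
        self.W_h = self.rng.normal(0, 0.1, (hidden, hidden))
        self.W_out = self.rng.normal(0, 0.1, (n_actions, hidden))
        self.n_actions = n_actions

    def initial_state(self):
        return np.zeros(self.W_h.shape[0])

    def step(self, h, obs, prev_action, prev_reward, prev_done):
        a_onehot = np.eye(self.n_actions)[prev_action]
        x = np.concatenate([obs, a_onehot, [prev_reward], [float(prev_done)]])
        h = np.tanh(self.W_in @ x + self.W_h @ h)          # recurrent update
        logits = self.W_out @ h
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        action = int(self.rng.choice(self.n_actions, p=probs))
        return h, action

def run_trial(policy, env, episodes=5):
    """One trial: several episodes on the SAME (previously unseen) MDP.
    The hidden state h is reset only here, not between episodes, so it can
    accumulate information about the current MDP (the 'fast' learner state)."""
    h = policy.initial_state()
    prev_action, prev_reward, prev_done = 0, 0.0, True
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            h, action = policy.step(h, obs, prev_action, prev_reward, prev_done)
            obs, reward, done, _ = env.step(action)
            prev_action, prev_reward, prev_done = action, reward, done
            total += reward
    return total
```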


Book
03 Jun 2016
TL;DR: This book introduces multiagent planning under uncertainty as formalized by decentralized partially observable Markov decision processes (Dec-POMDPs).
Abstract: This book introduces multiagent planning under uncertainty as formalized by decentralized partially observable Markov decision processes (Dec-POMDPs). The intended audience is researchers and graduate students working in the fields of artificial intelligence related to sequential decision making: reinforcement learning, decision-theoretic planning for single agents, classical multiagent planning, decentralized control, and operations research.

658 citations


Posted Content
TL;DR: This paper applies deep reinforcement learning to the problem of forming long-term driving strategies, showing how policy gradient iterations can be used without Markovian assumptions and decomposing the problem into a learned Policy for Desires composed with trajectory planning under hard constraints.
Abstract: Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns, and pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield an overly simplistic policy. Moreover, one must balance robustness to the unexpected behavior of other drivers and pedestrians against being too defensive, so that normal traffic flow is maintained. In this paper we apply deep reinforcement learning to the problem of forming long-term driving strategies. We note two major challenges that make autonomous driving different from other robotic tasks. First is the necessity of ensuring functional safety - something that machine learning has difficulty with, given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of the unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is learned) and trajectory planning with hard constraints (which is not learned). The goal of the Desires policy is to enable driving comfort, while the hard constraints guarantee driving safety. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon, thereby reducing the variance of the gradient estimation even further.

575 citations


Proceedings ArticleDOI
10 Jul 2016
TL;DR: By analyzing the average delay of each task and the average power consumption at the mobile device, a power-constrained delay minimization problem is formulated, and an efficient one-dimensional search algorithm is proposed to find the optimal task scheduling policy.
Abstract: Mobile-edge computing (MEC) emerges as a promising paradigm to improve the quality of computation experience for mobile devices. Nevertheless, the design of computation task scheduling policies for MEC systems inevitably encounters a challenging two-timescale stochastic optimization problem. Specifically, in the larger timescale, whether to execute a task locally at the mobile device or to offload a task to the MEC server for cloud computing should be decided, while in the smaller timescale, the transmission policy for the task input data should adapt to the channel side information. In this paper, we adopt a Markov decision process approach to handle this problem, where the computation tasks are scheduled based on the queueing state of the task buffer, the execution state of the local processing unit, as well as the state of the transmission unit. By analyzing the average delay of each task and the average power consumption at the mobile device, we formulate a power-constrained delay minimization problem, and propose an efficient one-dimensional search algorithm to find the optimal task scheduling policy. Simulation results are provided to demonstrate the capability of the proposed optimal stochastic task scheduling policy in achieving a shorter average execution delay compared to the baseline policies.

483 citations
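
One natural reading of this formulation, stated here as an illustrative assumption rather than the paper's exact notation or algorithm: the power-constrained delay minimization is a constrained average-cost MDP, and the one-dimensional search can be viewed as a search over a single scalar (e.g., a Lagrange multiplier) trading delay against power.

```latex
% Illustrative notation, not the paper's exact formulation.
\[
\begin{aligned}
\min_{\pi}\ \ & \bar{D}^{\pi} \;=\; \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[D_t^{\pi}\big]
  && \text{(average delay)}\\
\text{s.t.}\ \ & \bar{P}^{\pi} \;=\; \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[P_t^{\pi}\big] \;\le\; P_{\max}
  && \text{(average power budget)}
\end{aligned}
\]
% Lagrangian relaxation: for a fixed multiplier \lambda \ge 0, solve the
% unconstrained MDP with stage cost D_t + \lambda P_t, then search over
% \lambda in one dimension until the power budget is met.
\[
L(\pi,\lambda) \;=\; \bar{D}^{\pi} + \lambda\big(\bar{P}^{\pi} - P_{\max}\big).
\]
```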


Proceedings ArticleDOI
TL;DR: In this paper, the authors adopt a Markov decision process approach to handle the problem of computation task scheduling in MEC systems, where the computation tasks are scheduled based on the queueing state of the task buffer, the execution state of the local processing unit, and the state of the transmission unit.
Abstract: Mobile-edge computing (MEC) emerges as a promising paradigm to improve the quality of computation experience for mobile devices. Nevertheless, the design of computation task scheduling policies for MEC systems inevitably encounters a challenging two-timescale stochastic optimization problem. Specifically, in the larger timescale, whether to execute a task locally at the mobile device or to offload a task to the MEC server for cloud computing should be decided, while in the smaller timescale, the transmission policy for the task input data should adapt to the channel side information. In this paper, we adopt a Markov decision process approach to handle this problem, where the computation tasks are scheduled based on the queueing state of the task buffer, the execution state of the local processing unit, as well as the state of the transmission unit. By analyzing the average delay of each task and the average power consumption at the mobile device, we formulate a power-constrained delay minimization problem, and propose an efficient one-dimensional search algorithm to find the optimal task scheduling policy. Simulation results are provided to demonstrate the capability of the proposed optimal stochastic task scheduling policy in achieving a shorter average execution delay compared to the baseline policies.

272 citations


Journal ArticleDOI
TL;DR: A novel MARL approach is proposed in which agents are allowed to rehearse with information that will not be available during policy execution, and it is shown experimentally that incorporating rehearsal features can enhance the learning rate compared to non-rehearsal-based learners.

259 citations


Posted Content
TL;DR: The Value Iteration Network (VIN) as discussed by the authors is a differentiable approximation of the value iteration algorithm, which can be represented as a convolutional neural network and trained end-to-end using standard backpropagation.
Abstract: We introduce the value iteration network (VIN): a fully differentiable neural network with a `planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.

253 citations
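
A minimal numpy sketch of the computation the VIN module differentiably approximates: on a 2-D gridworld, one value-iteration sweep is a convolution-like local operation (each action's Q-map combines the reward map with a shifted value map), followed by a max over the action channel. In an actual VIN the per-action "shift" is a learned convolution kernel and the stacked sweeps are trained end-to-end by backpropagation; the code below is illustrative, not the authors' implementation.

```python
import numpy as np

def shift(V, d):
    """Value of the neighbor reached by moving in direction d;
    boundary cells keep their own value (moving into the wall)."""
    out = np.copy(V)
    if d == "up":    out[1:, :] = V[:-1, :]
    if d == "down":  out[:-1, :] = V[1:, :]
    if d == "left":  out[:, 1:] = V[:, :-1]
    if d == "right": out[:, :-1] = V[:, 1:]
    return out

def vi_module(R, K=50, gamma=0.95):
    """K sweeps of value iteration: Q_a = R + gamma * shift(V, a); V = max_a Q_a."""
    V = np.zeros_like(R)
    for _ in range(K):
        Q = np.stack([R + gamma * shift(V, d) for d in ("up", "down", "left", "right")])
        V = Q.max(axis=0)
    return V

# Tiny example: reward of +1 at the goal cell, -0.01 step cost elsewhere.
R = -0.01 * np.ones((8, 8))
R[7, 7] = 1.0
V = vi_module(R)
print(np.round(V[0, 0], 3))   # value at the start cell after planning
```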


Proceedings Article
05 Dec 2016
TL;DR: The value iteration network (VIN) as mentioned in this paper is a differentiable approximation of the value iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation.
Abstract: We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN based policies on discrete and continuous path-planning domains, and on a natural-language based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.

215 citations


Posted Content
TL;DR: In this article, a semi-aggregated Markov decision process (SAMDP) is proposed to identify spatio-temporal abstractions directly from features and may be used as a sub-goal detector in future work.
Abstract: In recent years there has been growing interest in using deep representations for reinforcement learning. In this paper, we present a methodology and tools to analyze Deep Q-networks (DQNs) in a non-blind manner. Moreover, we propose a new model, the Semi Aggregated Markov Decision Process (SAMDP), and an algorithm that learns it automatically. The SAMDP model allows us to identify spatio-temporal abstractions directly from features and may be used as a sub-goal detector in future work. Using our tools we reveal that the features learned by DQNs aggregate the state space in a hierarchical fashion, explaining their success. Moreover, we are able to understand and describe the policies learned by DQNs for three different Atari2600 games and suggest ways to interpret, debug and optimize deep neural networks in reinforcement learning.

191 citations


Journal ArticleDOI
TL;DR: The obtained adaptive and optimal output-feedback controllers differ from the existing literature on the ADP in that they are derived from sampled-data systems theory and are guaranteed to be robust to dynamic uncertainties.

183 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: A method to predict long-term motion of pedestrians, modeling their behavior as jump-Markov processes with the goal as a hidden variable and intent as a policy in a Markov decision process framework.
Abstract: We present a method to predict long-term motion of pedestrians, modeling their behavior as jump-Markov processes with their goal a hidden variable. Assuming approximately rational behavior, and incorporating environmental constraints and biases, including time-varying ones imposed by traffic lights, we model intent as a policy in a Markov decision process framework. We infer pedestrian state using a Rao-Blackwellized filter, and intent by planning according to a stochastic policy, reflecting individual preferences in aiming at the same goal.

Journal ArticleDOI
TL;DR: A continuous-time version of the traditional value iteration (VI) algorithm is presented with rigorous convergence analysis, which is crucial for developing new adaptive dynamic programming methods to solve the adaptive optimal control problem and the stochastic robust optimal control problem for linear continuous-time systems.

Journal ArticleDOI
TL;DR: A delay-optimal virtualized radio resource scheduling scheme is proposed via stochastic learning, and simulation results show that it outperforms traditional schemes.
Abstract: Due to the high density of vehicles and various types of vehicular services, it is challenging to guarantee the quality of vehicular services in current Long-Term Evolution (LTE) networks in a cost-efficient manner. Fortunately, with the development of fifth-generation (5G) technology, the installation of a large number of small cells is foreseen as one of the practical ways to achieve the low-delay requirement in vehicular environments. However, it may cause a huge operating expense and capital expenditure to mobile network operators due to the limited backhaul capacity and the explosion of signaling. In this paper, we integrate software-defined networking and radio resource virtualization into an LTE system for vehicular networks, i.e., software-defined heterogeneous vehicular network (SERVICE) . Based on this proposed system framework, a delay-optimal virtualized radio resource scheduling scheme is proposed via stochastic learning. The delay optimal problem is formulated as an infinite-horizon average-cost partially observed Markov decision process (POMDP). Then, an equivalent Bellman equation is derived to solve it. The proposed scheme can be divided into two stages, i.e., macro virtualization resource allocation (MaVRA) and micro virtualization resource allocation (MiVRA). The former is executed based on large timescale variables (traffic density), whereas the latter is operated according to short timescale variables (channel state and queue state). Simulation results show that the proposed scheme outperforms traditional schemes.
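
For reference, the "equivalent Bellman equation" mentioned above has, in generic infinite-horizon average-cost form, the following shape (illustrative textbook notation; the paper works with the belief/observed state specific to its POMDP):

```latex
% Generic average-cost Bellman optimality equation; illustrative notation.
\[
\theta + V(s) \;=\; \min_{a\in\mathcal{A}}\Big\{ c(s,a) + \sum_{s'} \Pr(s'\mid s,a)\, V(s') \Big\},
\]
% where \theta is the optimal average cost per stage and V is the relative value function.
```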

Journal ArticleDOI
TL;DR: A new method to optimize traffic flow, based on reinforcement learning, is proposed; it uses Q-learning to learn policies dictating the maximum driving speed allowed on a highway, such that traffic congestion is reduced.

Journal ArticleDOI
TL;DR: The proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks.
Abstract: Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.
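
A minimal sketch of option-style execution under hypothetical interfaces (the referenced method additionally infers all of these components from data rather than assuming them): an option is a sub-policy with an initiation probability and a per-step termination probability; the agent picks an option, follows its sub-policy, and re-selects when the option terminates.

```python
import numpy as np

class Option:
    def __init__(self, sub_policy, initiation_prob, termination_prob):
        self.sub_policy = sub_policy              # state -> action
        self.initiation_prob = initiation_prob    # state -> prob of being selectable
        self.termination_prob = termination_prob  # state -> prob of terminating

def select_option(options, state, rng):
    """Pick an option with probability proportional to its initiation probability."""
    w = np.array([o.initiation_prob(state) for o in options])
    w = w / w.sum()
    return options[rng.choice(len(options), p=w)]

def run_episode(env, options, horizon=200, seed=0):
    rng = np.random.default_rng(seed)
    state = env.reset()
    option = select_option(options, state, rng)
    total = 0.0
    for _ in range(horizon):
        action = option.sub_policy(state)
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
        if rng.random() < option.termination_prob(state):
            option = select_option(options, state, rng)   # re-select on termination
    return total
```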

Proceedings ArticleDOI
27 Dec 2016
TL;DR: In this paper, the authors formulate two synthesis problems where the desired STL specification is enforced by maximizing the probability of satisfaction, and the expected robustness degree, a measure quantifying the quality of satisfaction.
Abstract: This paper addresses the problem of learning optimal policies for satisfying signal temporal logic (STL) specifications by agents with unknown stochastic dynamics. The system is modeled as a Markov decision process, in which the states represent partitions of a continuous space and the transition probabilities are unknown. We formulate two synthesis problems where the desired STL specification is enforced by maximizing the probability of satisfaction, and the expected robustness degree, that is, a measure quantifying the quality of satisfaction. We discuss that Q-learning is not directly applicable to these problems because, based on the quantitative semantics of STL, the probability of satisfaction and expected robustness degree are not in the standard objective form of Q-learning. To resolve this issue, we propose an approximation of STL synthesis problems that can be solved via Q-learning, and we derive some performance bounds for the policies obtained by the approximate approach. The performance of the proposed method is demonstrated via simulations.
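
A minimal sketch of the outer loop used by such an approximate approach, written as plain tabular Q-learning; the STL-specific part (turning the robustness degree of a trajectory fragment into a per-step reward on a suitably augmented state space) is abstracted behind the placeholder `reward_fn`, and the environment interface below is an assumption, not the paper's construction.

```python
import numpy as np

def q_learning(env, n_states, n_actions, reward_fn,
               episodes=5000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Plain tabular Q-learning skeleton with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, done, info = env.step(a)
            r = reward_fn(s, a, s_next, info)    # placeholder: e.g. approximated robustness gain
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q.argmax(axis=1)   # greedy policy
```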

Proceedings Article
01 Jun 2016
TL;DR: In this paper, the authors define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior.
Abstract: In classical reinforcement learning agents accept arbitrary short term loss for long term gain when exploring their environment. This is infeasible for safety critical applications such as robotics, where even a single unsafe action may cause system failure or harm the environment. In this paper, we address the problem of safely exploring finite Markov decision processes (MDP). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm, SAFEMDP, for this task and prove that it completely explores the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.

Book
21 Mar 2016
TL;DR: A book-length treatment of partially observed Markov decision processes (POMDPs), covering formulation, algorithms, and structural results in stochastic dynamic programming, with applications to controlled sensing.
Abstract: Covering formulation, algorithms, and structural results, and linking theory to real-world applications in controlled sensing (including social learning, adaptive radars and sequential detection), this book focuses on the conceptual foundations of partially observed Markov decision processes (POMDPs). It emphasizes structural results in stochastic dynamic programming, enabling graduate students and researchers in engineering, operations research, and economics to understand the underlying unifying themes without getting weighed down by mathematical technicalities. Bringing together research from across the literature, the book provides an introduction to nonlinear filtering followed by a systematic development of stochastic dynamic programming, lattice programming and reinforcement learning for POMDPs. Questions addressed in the book include: when does a POMDP have a threshold optimal policy? When are myopic policies optimal? How do local and global decision makers interact in adaptive decision making in multi-agent social learning where there is herding and data incest? And how can sophisticated radars and sensors adapt their sensing in real time?

Journal ArticleDOI
TL;DR: This two-part tutorial proposes a simple, straightforward canonical model (that is most familiar to people with a control theory background), and introduces four fundamental classes of policies which integrate the competing strategies that have been proposed under names such as control theory, dynamic programming, stochastic programming and robust optimization.
Abstract: There is a wide range of problems in energy systems that require making decisions in the presence of different forms of uncertainty. The fields that address sequential, stochastic decision problems lack a standard canonical modeling framework, with fragmented, competing solution strategies. Recognizing that we will never agree on a single notational system, this two-part tutorial proposes a simple, straightforward canonical model (that is most familiar to people with a control theory background), and introduces four fundamental classes of policies which integrate the competing strategies that have been proposed under names such as control theory, dynamic programming, stochastic programming and robust optimization. Part II of the tutorial illustrates the modeling framework using a simple energy storage problem, where we show that, depending on the problem characteristics, each of the four classes of policies may be best.
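
For orientation, the canonical model advocated in this tutorial line of work has roughly the following shape (a condensed paraphrase in generic notation, not a quotation): a state S_t, a decision x_t produced by a policy X^π, exogenous information W_{t+1}, a transition function, and an objective optimized over policies; the four policy classes are different ways of constructing X^π.

```latex
% Condensed paraphrase; control-theory-flavored canonical sequential decision model.
\[
x_t = X^{\pi}(S_t), \qquad
S_{t+1} = S^{M}\!\big(S_t,\; x_t,\; W_{t+1}\big), \qquad
\max_{\pi}\; \mathbb{E}\Big\{ \sum_{t=0}^{T} C\big(S_t,\, X^{\pi}(S_t)\big) \Big\}.
\]
```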

Posted Content
TL;DR: This paper shows that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Process methods, and that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently.
Abstract: In this paper, we propose to use deep policy networks which are trained with an advantage actor-critic method for statistically optimised dialogue systems. First, we show that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Processes methods. Summary state and action spaces lead to good performance but require pre-engineering effort, RL knowledge, and domain expertise. In order to remove the need to define such summary spaces, we show that deep RL can also be trained efficiently on the original state and action spaces. Dialogue systems based on partially observable Markov decision processes are known to require many dialogues to train, which makes them unappealing for practical deployment. We show that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently. Indeed, with only a few hundred dialogues collected with a handcrafted policy, the actor-critic deep learner is considerably bootstrapped from a combination of supervised and batch RL. In addition, convergence to an optimal policy is significantly sped up compared to other deep RL methods initialized on the data with batch RL. All experiments are performed on a restaurant domain derived from the Dialogue State Tracking Challenge 2 (DSTC2) dataset.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This work proposes new algorithms for multi-agent distributed iterative value function approximation where the agents are allowed to have different behavior policies while evaluating the response to a single target policy based on consensus-based distributed stochastic approximation.
Abstract: Reinforcement learning deals with the problem of how to map situations (states) to actions so as to maximize a numerical reward while interacting with dynamical and uncertain environment. Within the framework of Markov Decision Processes (MDPs) these methods are typically based on approximate dynamic programming using appropriate calculation/approximation of the value function. In this work we propose new algorithms for multi-agent distributed iterative value function approximation where the agents are allowed to have different behavior policies while evaluating the response to a single target policy. The algorithms assume linear parametrization of the value function and are based on consensus-based distributed stochastic approximation. Under appropriate assumptions on the time-varying network topology and the overall state-visiting distributions of the agents we prove weak convergence of the parameter estimates to the globally optimal point. It is demonstrated that the agents are able to together reach this solution even when the individual agents cannot.
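
A generic form of the consensus-based update the abstract describes, in illustrative notation (the actual algorithms additionally handle off-policy corrections for the agents' differing behavior policies and impose specific step-size and network conditions): each agent i mixes its neighbors' parameter estimates through the time-varying consensus weights and then applies a local temporal-difference correction.

```latex
% Illustrative consensus + local TD-style update with linear features \phi.
\[
\theta^{i}_{t+1} \;=\; \sum_{j} a_{ij}(t)\,\theta^{j}_{t} \;+\; \alpha_t\,\delta^{i}_{t}\,\phi\big(s^{i}_{t}\big),
\qquad
\delta^{i}_{t} \;=\; r^{i}_{t+1} + \gamma\,\phi\big(s^{i}_{t+1}\big)^{\!\top}\theta^{i}_{t} - \phi\big(s^{i}_{t}\big)^{\!\top}\theta^{i}_{t}.
\]
```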

Journal Article
TL;DR: This work analyzes the statistical properties of REG-LSPI and provides an upper bound on the policy evaluation error and the performance loss of the policy returned by this method; it is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm.
Abstract: We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms are the regularized extensions of the Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm.

Book ChapterDOI
17 Oct 2016
TL;DR: In this article, the authors propose a technique for verifying probabilistic models whose transition probabilities are parametric, replacing parametric transitions by non-deterministic choices of extremal values.
Abstract: We propose a conceptually simple technique for verifying probabilistic models whose transition probabilities are parametric. The key is to replace parametric transitions by nondeterministic choices of extremal values. Analysing the resulting parameter-free model using off-the-shelf means yields (refinable) lower and upper bounds on probabilities of regions in the parameter space. The technique outperforms the existing analysis of parametric Markov chains by several orders of magnitude regarding both run-time and scalability. Its beauty is its applicability to various probabilistic models. It in particular provides the first sound and feasible method for performing parameter synthesis of Markov decision processes.

Journal ArticleDOI
TL;DR: In this article, the authors present a cell transmission model formulation for dynamic lane reversal and derive an effective heuristic based on theoretical results from the integer program for a single bottleneck link with varying demands.
Abstract: Autonomous vehicles admit consideration of novel traffic behaviors such as reservation-based intersection controls and dynamic lane reversal. The authors present a cell transmission model formulation for dynamic lane reversal. For deterministic demand, the authors formulate the dynamic lane reversal control problem for a single link as an integer program and derive theoretical results. In reality, demand is not known perfectly at arbitrary times in the future. To address stochastic demand, the authors present a Markov decision process formulation. Due to the large state size, the Markov decision process is intractable. However, based on theoretical results from the integer program, the authors derive an effective heuristic. They demonstrate significant improvements over a fixed lane configuration both on a single bottleneck link with varying demands, and on the downtown Austin network.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: In this article, the authors study the optimal content delivery strategy in cache-enabled HetNets by taking into account the inherent multicast capability of wireless medium, and formulate a stochastic multicast scheduling problem to jointly minimize the average network delay and power costs.
Abstract: Caching at small base stations (SBSs) has demonstrated significant benefits in alleviating the backhaul requirement in heterogeneous cellular networks (HetNets). While many existing works focus on what contents to cache at each SBS, an equally important but much less investigated problem is what contents to deliver given the cache status and user requests. In this paper, we study the optimal content delivery strategy in cache-enabled HetNets by taking into account the inherent multicast capability of wireless medium. We establish a content-centric request queue model and then formulate a stochastic multicast scheduling problem to jointly minimize the average network delay and power costs. This stochastic optimization problem is an infinite horizon average cost Markov decision process (MDP), which is well known to be challenging. By using relative value iteration algorithm and the special properties of the request queue dynamics, we characterize some properties of the value function of the MDP. Based on these properties, we show that the optimal multicast scheduling policy, which is adaptive to the request queue state, is of the threshold type. Finally, we propose a low complexity optimal algorithm by exploiting the structural properties of the optimal policy.

Posted Content
TL;DR: This paper proposes an approximation of STL synthesis problems that can be solved via Q-learning, and derives some performance bounds for the policies obtained by the approximate approach.
Abstract: This paper addresses the problem of learning optimal policies for satisfying signal temporal logic (STL) specifications by agents with unknown stochastic dynamics. The system is modeled as a Markov decision process, in which the states represent partitions of a continuous space and the transition probabilities are unknown. We formulate two synthesis problems where the desired STL specification is enforced by maximizing the probability of satisfaction, and the expected robustness degree, that is, a measure quantifying the quality of satisfaction. We discuss that Q-learning is not directly applicable to these problems because, based on the quantitative semantics of STL, the probability of satisfaction and expected robustness degree are not in the standard objective form of Q-learning. To resolve this issue, we propose an approximation of STL synthesis problems that can be solved via Q-learning, and we derive some performance bounds for the policies obtained by the approximate approach. The performance of the proposed method is demonstrated via simulations.

Journal ArticleDOI
TL;DR: A Markov decision process model is proposed to solve the dynamic vehicle routing problem (DVRP) with nonstationary stochastic travel times under traffic congestion; a rollout-based approach using approximate dynamic programming is adopted to avoid the curse of dimensionality.
Abstract: This paper proposes a dynamic vehicle routing problem (DVRP) model with nonstationary stochastic travel times under traffic congestion. Depending on the traffic conditions, the travel time between two nodes, particularly in a city, may not be proportional to distance and changes both dynamically and stochastically over time. Considering this environment, we propose a Markov decision process model to solve this problem and adopt a rollout-based approach to the solution, using approximate dynamic programming to avoid the curse of dimensionality. We also investigate how to estimate the probability distribution of travel times of arcs which, reflecting reality, are considered to consist of multiple road segments. Experiments are conducted using a real-world problem faced by a Singapore logistics/delivery company and authentic road traffic information.

Journal ArticleDOI
TL;DR: This work proposes an approach to synthesize control protocols for autonomous systems that account for uncertainties and imperfections in interactions with human operators, using abstractions based on Markov decision processes and augmenting these models to stochastic two-player games.
Abstract: We propose an approach to synthesize control protocols for autonomous systems that account for uncertainties and imperfections in interactions with human operators. As an illustrative example, we consider a scenario involving road network surveillance by an unmanned aerial vehicle (UAV) that is controlled remotely by a human operator but also has a certain degree of autonomy. Depending on the type (i.e., probabilistic and/or nondeterministic) of knowledge about the uncertainties and imperfections in the human–automation interactions, we use abstractions based on Markov decision processes and augment these models to stochastic two-player games. Our approach enables the synthesis of operator-dependent optimal mission plans for the UAV, highlighting the effects of operator characteristics (e.g., workload, proficiency, and fatigue) on UAV mission performance. It can also provide informative feedback (e.g., Pareto curves showing the trade-offs between multiple mission objectives), potentially assisting the operator in decision-making. We demonstrate the applicability of our approach via a detailed UAV mission planning case study.

Journal ArticleDOI
TL;DR: There is actually a common theme to these strategies, and underpinning the entire field remain the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950s and 1960s.
Abstract: Approximate dynamic programming has evolved, initially independently, within operations research, computer science and the engineering controls community, all searching for practical tools for solving sequential stochastic optimization problems. More so than other communities, operations research continued to develop the theory behind the basic model introduced by Bellman with discrete states and actions, even while authors as early as Bellman himself recognized its limits due to the "curse of dimensionality" inherent in discrete state spaces. In response to these limitations, subcommunities in computer science, control theory and operations research have developed a variety of methods for solving different classes of stochastic, dynamic optimization problems, creating the appearance of a jungle of competing approaches. In this article, we show that there is actually a common theme to these strategies, and underpinning the entire field remain the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950s and 1960s. Dynamic programming involves making decisions over time, under uncertainty. These problems arise in a wide range of applications, spanning business, science, engineering, economics, medicine and health, and operations. While tremendous successes have been achieved in specific problem settings, we lack general purpose tools with the broad applicability enjoyed by algorithmic strategies such as linear, nonlinear and integer programming. This paper provides an introduction to the challenges of dynamic programming, and describes the contributions made by different subcommunities, with special emphasis on computer science, which pioneered a field known as reinforcement learning, and the operations research community, which has made contributions through several subcommunities, including stochastic programming, simulation optimization and approximate dynamic programming. Our presentation recognizes, but does not do justice to, the important contributions made in the engineering controls communities.
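
The two foundational updates the article refers to, in standard discounted-MDP notation (a textbook statement, not a quotation from the article):

```latex
% Value iteration (contribution C, discount factor gamma):
\[
V_{k+1}(s) \;=\; \max_{a}\Big\{ C(s,a) + \gamma \sum_{s'} \Pr(s'\mid s,a)\, V_k(s') \Big\}.
\]
% Policy iteration: evaluate the current policy, then improve it greedily.
\[
V^{\pi_k}(s) = C\big(s,\pi_k(s)\big) + \gamma \sum_{s'} \Pr\big(s'\mid s,\pi_k(s)\big)\, V^{\pi_k}(s'),
\qquad
\pi_{k+1}(s) \in \arg\max_{a}\Big\{ C(s,a) + \gamma \sum_{s'} \Pr(s'\mid s,a)\, V^{\pi_k}(s') \Big\}.
\]
```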

Proceedings ArticleDOI
22 May 2016
TL;DR: Numerical results show that the performance of the proposed approach, which requires only causal knowledge of the energy harvesting process and channel coefficients, has only a small degradation compared to the optimum case which requires perfect non-causal knowledge.
Abstract: Energy harvesting point-to-point communications are considered. The transmitter harvests energy from the environment and stores it in a finite battery. It is assumed that the transmitter always has data to transmit and that the harvested energy is used exclusively for data transmission. Since prior knowledge about the energy harvesting process might not be available in practical scenarios, we assume that at each time instant only information about the current state of the transmitter is available, i.e., harvested energy, battery level and channel coefficient. We model the scenario as a Markov decision process and we implement reinforcement learning at the transmitter to find a power allocation policy that aims at maximizing the throughput. To overcome the limitations of traditional reinforcement learning algorithms, we apply the concept of function approximation and we propose a set of binary functions to approximate the expected throughput given the state of the transmitter. Numerical results show that the performance of the proposed approach, which requires only causal knowledge of the energy harvesting process and channel coefficients, has only a small degradation compared to the optimum case which requires perfect non-causal knowledge. Additionally, the proposed approach outperforms naive policies that assume only causal knowledge at the transmitter.
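
A minimal sketch of the general technique the abstract describes, Q-learning with linear function approximation over binary features; the feature map and environment interface below are placeholders, not the paper's design (which builds binary features over battery level, harvested energy and channel state).

```python
import numpy as np

def q_approx(theta, phi, s, a):
    """Q(s, a) ~ theta . phi(s, a), with phi a vector of (binary) features."""
    return theta @ phi(s, a)

def fa_q_learning(env, phi, n_features, actions,
                  steps=100_000, alpha=0.01, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)
    s = env.reset()
    for _ in range(steps):
        if rng.random() < eps:
            a = actions[rng.integers(len(actions))]
        else:
            a = max(actions, key=lambda a_: q_approx(theta, phi, s, a_))
        s_next, r = env.step(a)                      # placeholder: r = achieved throughput
        q_next = max(q_approx(theta, phi, s_next, a_) for a_ in actions)
        td_error = r + gamma * q_next - q_approx(theta, phi, s, a)
        theta += alpha * td_error * phi(s, a)        # semi-gradient update
        s = s_next
    return theta
```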