Showing papers on "Markov decision process" published in 2019
••
TL;DR: Simulation results show that the proposed edge VM allocation and task scheduling approach achieves near-optimal performance with very low complexity, and that the proposed learning-based computing offloading algorithm not only converges quickly but also achieves a lower total cost than other offloading approaches.
Abstract: Internet of Things (IoT) computing offloading is a challenging issue, especially in remote areas where common edge/cloud infrastructure is unavailable. In this paper, we present a space-air-ground integrated network (SAGIN) edge/cloud computing architecture for offloading computation-intensive applications under the energy and computation constraints of remote areas, where flying unmanned aerial vehicles (UAVs) provide near-user edge computing and satellites provide access to cloud computing. First, for UAV edge servers, we propose a joint resource allocation and task scheduling approach to efficiently allocate the computing resources to virtual machines (VMs) and schedule the offloaded tasks. Second, we investigate the computing offloading problem in SAGIN and propose a learning-based approach to learn the optimal offloading policy from the dynamic SAGIN environments. Specifically, we formulate the offloading decision making as a Markov decision process where the system state considers the network dynamics. To cope with the system dynamics and complexity, we propose a deep reinforcement learning-based computing offloading approach to learn the optimal offloading policy on-the-fly, where we adopt the policy gradient method to handle the large action space and the actor-critic method to accelerate the learning process. Simulation results show that the proposed edge VM allocation and task scheduling approach achieves near-optimal performance with very low complexity, and that the proposed learning-based computing offloading algorithm not only converges quickly but also achieves a lower total cost than other offloading approaches.
537 citations
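At its core, the offloading learner described above couples a softmax policy (the actor, updated by policy gradient) with a learned value estimate (the critic, whose TD error speeds up learning). Below is a minimal tabular sketch of that actor-critic loop; the state space, action space, and costs are hypothetical placeholders, not the SAGIN model from the paper:

```python
import numpy as np

# Toy stand-in for the offloading MDP: n_s states, n_a offloading actions.
n_s, n_a, gamma = 5, 3, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] -> next-state dist
R = rng.uniform(-1, 0, size=(n_s, n_a))            # negative cost as reward

theta = np.zeros((n_s, n_a))   # actor: softmax policy parameters
V = np.zeros(n_s)              # critic: state-value estimates
alpha_pi, alpha_v = 0.1, 0.2

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for step in range(20000):
    probs = policy(s)
    a = rng.choice(n_a, p=probs)
    s2 = rng.choice(n_s, p=P[s, a])
    r = R[s, a]
    td_error = r + gamma * V[s2] - V[s]         # advantage estimate
    V[s] += alpha_v * td_error                  # critic update
    grad_log = -probs                           # d log pi(a|s) / d theta[s]
    grad_log[a] += 1.0
    theta[s] += alpha_pi * td_error * grad_log  # actor update
    s = s2

print("policy at state 0:", policy(0))
```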
••
TL;DR: A model-free approach based on deep reinforcement learning is proposed to determine the optimal charging strategy under the randomness in traffic conditions, the user’s commuting behavior, and the utility’s pricing process.
Abstract: Driven by the recent advances in electric vehicle (EV) technologies, EVs have become important to the smart grid economy. When EVs participate in a demand response program with real-time pricing signals, the charging cost can be greatly reduced by taking full advantage of these signals. However, it is challenging to determine an optimal charging strategy because of the randomness in traffic conditions, the user’s commuting behavior, and the utility’s pricing process. Conventional model-based approaches require a forecast model of the uncertainty and an optimization of the scheduling process. In this paper, we formulate this scheduling problem as a Markov Decision Process (MDP) with unknown transition probabilities. A model-free approach based on deep reinforcement learning is proposed to determine the optimal strategy for this problem. The proposed approach adaptively learns the transition probabilities and does not require any system model information. The architecture of the proposed approach contains two networks: a representation network to extract discriminative features from the electricity prices and a Q network to approximate the optimal action-value function. Extensive experimental results demonstrate the effectiveness of the proposed approach.
277 citations
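The paper’s agent splits feature extraction and value estimation across two networks; underneath sit the generic DQN mechanics of experience replay plus a periodically synced target estimator. A tabular sketch of just those mechanics, assuming hypothetical price-level states and charge/idle actions:

```python
import numpy as np
from collections import deque

# Hypothetical discretized charging MDP: price-level states, charge/idle actions.
n_s, n_a, gamma = 10, 2, 0.99
rng = np.random.default_rng(1)
Q = np.zeros((n_s, n_a))
Q_target = Q.copy()
replay = deque(maxlen=5000)

def env_step(s, a):
    # Stand-in dynamics: random price walk; reward = -price if charging.
    s2 = min(max(s + rng.integers(-1, 2), 0), n_s - 1)
    r = -float(s) if a == 1 else -0.1    # charging cost vs. small delay penalty
    return s2, r

s = 5
for step in range(30000):
    a = rng.integers(n_a) if rng.random() < 0.1 else int(Q[s].argmax())
    s2, r = env_step(s, a)
    replay.append((s, a, r, s2))
    s = s2
    if len(replay) >= 32:
        for i in rng.integers(len(replay), size=32):   # sampled minibatch
            ss, aa, rr, ss2 = replay[i]
            target = rr + gamma * Q_target[ss2].max()
            Q[ss, aa] += 0.05 * (target - Q[ss, aa])
    if step % 500 == 0:
        Q_target = Q.copy()     # periodic target sync, as in DQN

print("greedy action per price level:", Q.argmax(axis=1))
```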
••
TL;DR: A deep reinforcement learning model is proposed to control the traffic light cycle; the model incorporates multiple optimization elements to improve performance, such as a dueling network, a target network, double Q-learning, and prioritized experience replay.
Abstract: Existing inefficient traffic light cycle control causes numerous problems, such as long delay and waste of energy. To improve efficiency, the traffic light duration must be adjusted dynamically based on real-time traffic information. Existing works either split the traffic signal into equal durations or only leverage limited traffic information. In this paper, we study how to decide the traffic signal duration based on the collected data from different sensors. We propose a deep reinforcement learning model to control the traffic light cycle. In the model, we quantify the complex traffic scenario as states by collecting traffic data and dividing the whole intersection into small grids. The duration changes of a traffic light are the actions, which are modeled as a high-dimensional Markov decision process. The reward is the cumulative waiting time difference between two cycles. To solve the model, a convolutional neural network is employed to map states to rewards. The proposed model incorporates multiple optimization elements to improve the performance, such as dueling network, target network, double Q-learning network, and prioritized experience replay. We evaluate our model via simulation on the Simulation of Urban MObility (SUMO) simulator. Simulation results show the efficiency of our model in controlling traffic lights.
271 citations
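Among the optimization elements listed above, double Q-learning is the easiest to isolate: action selection and action evaluation use two decoupled estimators to curb the overestimation bias of the max operator. A tabular sketch on a random toy MDP (not the grid-based traffic state from the paper):

```python
import numpy as np

# Tabular stand-in illustrating only the double-Q idea: select an action
# with one estimator, evaluate it with the other.
n_s, n_a, gamma, alpha = 8, 4, 0.9, 0.1
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.normal(0.0, 1.0, size=(n_s, n_a))
QA = np.zeros((n_s, n_a))
QB = np.zeros((n_s, n_a))

s = 0
for step in range(50000):
    Qsum = QA[s] + QB[s]
    a = rng.integers(n_a) if rng.random() < 0.1 else int(Qsum.argmax())
    s2 = rng.choice(n_s, p=P[s, a])
    r = R[s, a]
    if rng.random() < 0.5:
        a_star = int(QA[s2].argmax())   # select with A ...
        QA[s, a] += alpha * (r + gamma * QB[s2, a_star] - QA[s, a])  # ... evaluate with B
    else:
        b_star = int(QB[s2].argmax())   # and symmetrically the other way
        QB[s, a] += alpha * (r + gamma * QA[s2, b_star] - QB[s, a])
    s = s2

print("double-Q greedy policy:", (QA + QB).argmax(axis=1))
```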
•
TL;DR: This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.
Abstract: Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
248 citations
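For orientation, the object these guarantees concern is the gradient of the discounted value under a parametric policy $\pi_\theta$; in standard notation (assumed here, not quoted from the paper) it reads

$$\nabla_\theta V^{\pi_\theta}(\mu) \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta}_{\mu},\; a \sim \pi_\theta(\cdot\mid s)}\Big[\nabla_\theta \log \pi_\theta(a\mid s)\, A^{\pi_\theta}(s,a)\Big],$$

where $d^{\pi_\theta}_{\mu}$ is the discounted state-visitation distribution from start distribution $\mu$ and $A^{\pi_\theta}$ is the advantage function. The paper’s distribution-shift analysis asks, in essence, how well ascent on this objective transfers across such visitation distributions.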
••
TL;DR: This paper forms the online offloading problem as a Markov decision process by considering both the blockchain mining tasks and data processing tasks and introduces an adaptive genetic algorithm into the exploration of deep reinforcement learning to effectively avoid useless exploration and speed up the convergence without reducing performance.
Abstract: Offloading computation-intensive tasks (e.g., blockchain consensus processes and data processing tasks) to the edge/cloud is a promising solution for blockchain-empowered mobile edge computing. However, the traditional offloading approaches (e.g., auction-based and game-theory approaches) fail to adjust the policy according to the changing environment and cannot achieve long-term performance. Moreover, the existing deep reinforcement learning-based offloading approaches suffer from slow convergence caused by the high-dimensional action space. In this paper, we propose a new model-free deep reinforcement learning-based online computation offloading approach for blockchain-empowered mobile edge computing in which both mining tasks and data processing tasks are considered. First, we formulate the online offloading problem as a Markov decision process by considering both the blockchain mining tasks and data processing tasks. Then, to maximize long-term offloading performance, we leverage deep reinforcement learning to accommodate highly dynamic environments and address the computational complexity. Furthermore, we introduce an adaptive genetic algorithm into the exploration of deep reinforcement learning to effectively avoid useless exploration and speed up the convergence without reducing performance. Finally, our experimental results demonstrate that our algorithm can converge quickly and outperform three benchmark policies.
223 citations
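The genetic-algorithm-assisted exploration idea can be sketched independently of the offloading model: instead of sampling the exploratory action uniformly, a small candidate population is evolved for a few generations using the current Q-estimates as fitness. A minimal sketch; the population size, mutation rate, and Q-values below are illustrative, not the paper’s configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

def ga_explore(q_values, pop_size=8, generations=3, mut_prob=0.2):
    """Pick an exploratory action by evolving a small candidate population
    scored with the current Q-estimates, rather than sampling uniformly.
    q_values: 1-D array of Q(s, a) over a large discrete action space."""
    n_a = len(q_values)
    pop = rng.integers(n_a, size=pop_size)              # random initial actions
    for _ in range(generations):
        fitness = q_values[pop]
        order = np.argsort(fitness)[::-1]
        parents = pop[order[: pop_size // 2]]           # keep the fitter half
        children = parents.copy()
        mutate = rng.random(len(children)) < mut_prob
        children[mutate] = rng.integers(n_a, size=mutate.sum())
        pop = np.concatenate([parents, children])
    return int(pop[np.argmax(q_values[pop])])

# Usage: inside an epsilon-greedy loop, replace the uniform random branch.
q_s = rng.normal(size=1024)   # Q-estimates for one state, large action set
print("GA-guided exploratory action:", ga_explore(q_s))
```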
••
TL;DR: Simulated results demonstrate the desired driving behaviors of an autonomous vehicle using both the reinforcement learning and inverse reinforcement learning techniques.
172 citations
••
TL;DR: In this article, a real-time IoT monitoring system is considered, in which the IoT devices sample a physical process with a sampling cost and send the status packet to a given destination with an updating cost.
Abstract: The effective operation of time-critical Internet of things (IoT) applications requires real-time reporting of fresh status information of underlying physical processes. In this paper, a real-time IoT monitoring system is considered, in which the IoT devices sample a physical process with a sampling cost and send the status packet to a given destination with an updating cost. This joint status sampling and updating process is designed to minimize the average age of information (AoI) at the destination node under an average energy cost constraint at each device. This stochastic problem is formulated as an infinite horizon average cost constrained Markov decision process (CMDP) and transformed into an unconstrained Markov decision process (MDP) using a Lagrangian method. For the single IoT device case, the optimal policy for the CMDP is shown to be a randomized mixture of two deterministic policies for the unconstrained MDP, which is of threshold type. This reveals a fundamental tradeoff between the average AoI at the destination and the sampling and updating costs. Then, a structure-aware optimal algorithm to obtain the optimal policy of the CMDP is proposed and the impact of the wireless channel dynamics is studied while demonstrating that channels having a larger mean channel gain and less scattering can achieve better AoI performance. For the case of multiple IoT devices, a low-complexity semi-distributed suboptimal policy is proposed with the updating control at the destination and the sampling control at each IoT device. Then, an online learning algorithm is developed to obtain this policy, which can be implemented at each IoT device and requires only the local knowledge and small signaling from the destination. The proposed learning algorithm is shown to converge almost surely to the suboptimal policy. Simulation results show the structural properties of the optimal policy for the single IoT device case; and show that the proposed policy for multiple IoT devices outperforms a zero-wait baseline policy, with average AoI reductions reaching up to 33%.
168 citations
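The CMDP-to-MDP reduction above hinges on the Lagrangian: solve an unconstrained MDP whose reward is penalized by λ times the energy cost, then tune λ until the constraint binds. A minimal sketch on a generic finite MDP, using a discounted criterion and a deterministic policy for simplicity (the paper works with the average-cost criterion and shows the optimum randomizes between two threshold policies):

```python
import numpy as np

# Generic finite MDP with an AoI-like objective and an energy cost per action.
n_s, n_a, gamma = 6, 2, 0.95
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
age = np.tile(np.arange(n_s, dtype=float)[:, None], (1, n_a))  # AoI penalty
energy = np.array([0.0, 1.0])                # cost of the "update" action
budget = 8.0                                 # discounted energy budget

def solve_unconstrained(lam):
    """Value iteration on reward = -AoI - lam * energy; returns the greedy
    policy and its discounted energy usage from state 0."""
    r = -age - lam * energy[None, :]
    V = np.zeros(n_s)
    for _ in range(500):
        V = (r + gamma * P @ V).max(axis=1)
    pi = (r + gamma * P @ V).argmax(axis=1)
    Ppi = P[np.arange(n_s), pi]              # policy-induced transition matrix
    c = np.linalg.solve(np.eye(n_s) - gamma * Ppi, energy[pi])
    return pi, c[0]

lo, hi = 0.0, 50.0
for _ in range(40):                          # bisection on the multiplier
    lam = 0.5 * (lo + hi)
    pi, used = solve_unconstrained(lam)
    lo, hi = (lo, lam) if used <= budget else (lam, hi)

print("multiplier:", round(lam, 4), "policy:", pi, "energy used:", round(used, 3))
```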
••
TL;DR: This paper proposes novel container migration algorithms and architecture to support mobility tasks with various application requirements and demonstrates that the strategy outperforms the existing baseline approaches in terms of delay, power consumption, and migration cost.
Abstract: Fog Computing (FC) is a flexible architecture to support distributed domain-specific applications with cloud-like quality of service. However, current FC still lacks a mobility support mechanism for scenarios with many mobile users that have diversified application quality requirements. Such a mobility support mechanism can be critical, for example in the industrial internet, where humans, products, and devices are movable. To fill this gap, in this paper we propose novel container migration algorithms and an architecture to support mobility tasks with various application requirements. Our algorithms are realized from three aspects: 1) We consider that mobile application tasks can be hosted in a container of a corresponding fog node and migrated, taking the communication delay and computational power consumption into consideration; 2) We further model such a container migration strategy as a multi-dimensional Markov Decision Process (MDP) space. To effectively reduce the large MDP space, efficient deep reinforcement learning algorithms are devised to achieve fast decision-making; and 3) We implement the model and algorithms as a container migration prototype system and test its feasibility and performance. Extensive experiments show that our strategy outperforms the existing baseline approaches by 2.9, 48.5, and 58.4 percent on average in terms of delay, power consumption, and migration cost, respectively.
161 citations
•
TL;DR: A general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: a larger class of regularizers, and the general modified policy iteration approach, encompassing both policy iteration and value iteration.
Abstract: Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.
160 citations
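For the entropy regularizer, the regularized Bellman operator in this framework has a closed form through the Legendre-Fenchel transform: the hard max becomes a temperature-scaled log-sum-exp, and the maximizing policy is the Boltzmann distribution. A minimal soft value iteration sketch on a random finite MDP:

```python
import numpy as np

# Soft (entropy-regularized) value iteration on a random finite MDP.
n_s, n_a, gamma, tau = 6, 3, 0.9, 0.5    # tau: regularization temperature
rng = np.random.default_rng(5)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.uniform(0, 1, size=(n_s, n_a))

V = np.zeros(n_s)
for _ in range(300):
    Q = R + gamma * P @ V
    # Legendre-Fenchel conjugate of the entropy: max becomes log-sum-exp.
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))

# The maximizing (Boltzmann) policy is the gradient of the conjugate.
pi = np.exp(Q / tau)
pi /= pi.sum(axis=1, keepdims=True)
print("soft values:", np.round(V, 3))
print("soft-greedy policy (rows = states):\n", np.round(pi, 3))
```

As tau goes to 0 the log-sum-exp collapses to the hard max and the sketch recovers standard value iteration, which is exactly the sense in which the paper's theory generalizes the unregularized case.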
••
TL;DR: A novel dynamic energy management system is developed to incorporate efficient management of the energy storage system into MG real-time dispatch while considering power flow constraints and uncertainties in load, renewable generation and real-time electricity price.
Abstract: This paper focuses on economical operation of a microgrid (MG) in real-time. A novel dynamic energy management system is developed to incorporate efficient management of the energy storage system into MG real-time dispatch while considering power flow constraints and uncertainties in load, renewable generation, and real-time electricity price. The developed dynamic energy management mechanism does not require long-term forecasts, optimization, or distribution knowledge of the uncertainty, but can still optimize the long-term operational costs of MGs. First, the real-time scheduling problem is modeled as a finite-horizon Markov decision process over a day. Then, approximate dynamic programming and deep recurrent neural network learning are employed to derive a near-optimal real-time scheduling policy. Last, using real power grid data from the California Independent System Operator, a detailed simulation study is carried out to validate the effectiveness of the proposed method.
155 citations
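The finite-horizon formulation over a day is easy to make concrete: with discretized storage states and dispatch actions, exact backward induction solves it (the paper replaces this exact DP with approximate dynamic programming plus a recurrent network because the real state space is far larger). A toy sketch with assumed dynamics and costs:

```python
import numpy as np

# Finite-horizon MDP over T dispatch intervals, solved by backward induction.
T, n_s, n_a = 24, 5, 3     # e.g. hours in a day, storage levels, dispatch actions
rng = np.random.default_rng(6)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
cost = rng.uniform(0, 1, size=(T, n_s, n_a))    # time-varying operating cost

V = np.zeros((T + 1, n_s))                      # terminal value V[T] = 0
pi = np.zeros((T, n_s), dtype=int)
for t in range(T - 1, -1, -1):
    Q = cost[t] + P @ V[t + 1]                  # expected cost-to-go
    pi[t] = Q.argmin(axis=1)                    # minimize operating cost
    V[t] = Q.min(axis=1)

print("optimal expected daily cost from each initial state:", np.round(V[0], 3))
```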
••
TL;DR: In this paper, the authors formulate the service migration problem as a Markov decision process (MDP) and provide a mathematical framework to design optimal service migration policies in mobile edge computing.
Abstract: In mobile edge computing, local edge servers can host cloud-based services, which reduces network overhead and latency but requires service migrations as users move to new locations. It is challenging to make migration decisions optimally because of the uncertainty in such a dynamic cloud environment. In this paper, we formulate the service migration problem as a Markov decision process (MDP). Our formulation captures general cost models and provides a mathematical framework to design optimal service migration policies. In order to overcome the complexity associated with computing the optimal policy, we approximate the underlying state space by the distance between the user and service locations. We show that the resulting MDP is exact for the uniform 1-D user mobility, while it provides a close approximation for uniform 2-D mobility with a constant additive error. We also propose a new algorithm and a numerical technique for computing the optimal solution, which is significantly faster than traditional methods based on the standard value or policy iteration. We illustrate the application of our solution in practical scenarios where many theoretical assumptions are relaxed. Our evaluations based on real-world mobility traces of San Francisco taxis show the superior performance of the proposed solution compared to baseline solutions.
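The distance-based approximation above reduces the state to the user-service distance and the decision to keep-or-migrate; value iteration on that 1-D chain then typically yields a distance threshold. A minimal sketch with illustrative costs and uniform 1-D mobility (not the paper’s exact cost model):

```python
import numpy as np

# 1-D distance-state MDP: state = distance between user and service;
# action "keep" leaves the service in place, "migrate" resets distance to 0.
N, gamma = 10, 0.9               # max distance, discount factor
c_run = np.arange(N + 1) * 0.5   # running cost grows with distance
c_mig = 2.0                      # one-off migration cost

V = np.zeros(N + 1)
for _ in range(500):
    V_new = np.empty_like(V)
    for d in range(N + 1):
        # Uniform 1-D mobility: user steps +1 or -1 with prob 1/2 (reflecting).
        up, down = min(d + 1, N), max(d - 1, 0)
        keep = c_run[d] + gamma * 0.5 * (V[up] + V[down])
        migrate = c_mig + c_run[0] + gamma * 0.5 * (V[1] + V[0])
        V_new[d] = min(keep, migrate)
    V = V_new

policy = ["migrate" if (c_mig + c_run[0] + gamma * 0.5 * (V[1] + V[0]))
          < (c_run[d] + gamma * 0.5 * (V[min(d + 1, N)] + V[max(d - 1, 0)]))
          else "keep" for d in range(N + 1)]
print(dict(enumerate(policy)))   # typically "keep" below a distance threshold
```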
••
TL;DR: A novel strategy of FDI attacks is proposed that aims to distort the normal operation of a power system regulated by automatic voltage controls (AVCs); a bad data detection and correction method based on kernel density estimation is also presented that can help maintain the security of the AVC system, even under heavy system loading.
Abstract: False data injection (FDI) attacks intend to threaten the security of power systems. In this paper, a novel strategy of FDI attacks is proposed, which aims to distort the normal operation of a power system regulated by automatic voltage controls (AVCs). Such attacks can be launched from a single substation by an attacker who has little knowledge of the whole power grid. The optimal attack strategy is modeled as a partially observable Markov decision process (POMDP). Then, a $\mathcal{Q}$-learning algorithm with nearest sequence memory is adopted to enable on-line learning and attacking. Stealthy attack strategies are also developed and incorporated into the POMDP model. Various tests are performed on the IEEE 39-bus system. Corresponding results verify the efficacy of the proposed attack strategies. The feasibility of independent and data-driven FDI attacks is investigated. Moreover, a bad data detection and correction method based on kernel density estimation is presented to mitigate the disruptive impacts of the proposed FDI attacks. Test results show that this defensive method can help maintain the security of the AVC system, even under heavy system loading.
•
TL;DR: Safe policy optimization algorithms based on a Lyapunov approach are presented for continuous-action reinforcement learning problems in which it is crucial that the agent interact with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations.
Abstract: We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: this https URL.
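When the state-dependent linearized Lyapunov constraint reduces to a single half-space in action space, the action projection described above has a closed form: keep the proposed action if it satisfies the constraint, otherwise project it onto the constraint boundary. A minimal sketch; the constraint data below are hypothetical stand-ins, not outputs of a learned Lyapunov function:

```python
import numpy as np

def project_action(a, g, h, eps):
    """Project a proposed action onto {a : h + g.(a - a0) <= eps}, with the
    linearization point a0 = 0 for simplicity.
    a: action from the unconstrained policy; g: constraint gradient;
    h: constraint value at a0; eps: allowed constraint budget."""
    violation = h + g @ a - eps
    if violation <= 0:
        return a                            # already safe, leave unchanged
    return a - (violation / (g @ g)) * g    # closest point on the boundary

rng = np.random.default_rng(7)
a = rng.normal(size=4)                      # e.g. a raw DDPG/PPO action
g = rng.normal(size=4)
safe_a = project_action(a, g, h=0.3, eps=0.1)
print("raw action:      ", np.round(a, 3))
print("projected action:", np.round(safe_a, 3))
```

The projection is a Euclidean closest-point map onto a half-space, which is why it composes cleanly with end-to-end gradient training, as the abstract notes.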
••
TL;DR: In this article, the authors proposed a learning-based approach for real-time scheduling of an MG considering the uncertainty of the load demand, renewable energy, and electricity price, which is modeled as a Markov Decision Process (MDP) with an objective of minimizing the daily operating cost.
Abstract: Driven by the recent advances and applications of smart-grid technologies, our electric power grid is undergoing radical modernization. Microgrid (MG) plays an important role in the course of modernization by providing a flexible way to integrate distributed renewable energy resources (RES) into the power grid. However, distributed RES, such as solar and wind, can be highly intermittent and stochastic. These uncertain resources combined with load demand result in random variations in both the supply and the demand sides, which make it difficult to effectively operate an MG. Focusing on this problem, this paper proposes a novel energy management approach for real-time scheduling of an MG considering the uncertainty of the load demand, renewable energy, and electricity price. Unlike the conventional model-based approaches requiring a predictor to estimate the uncertainty, the proposed solution is learning-based and does not require an explicit model of the uncertainty. Specifically, the MG energy management is modeled as a Markov Decision Process (MDP) with an objective of minimizing the daily operating cost. A deep reinforcement learning (DRL) approach is developed to solve the MDP. In the DRL approach, a deep feedforward neural network is designed to approximate the optimal action-value function, and the deep Q-network (DQN) algorithm is used to train the neural network. The proposed approach takes the state of the MG as inputs and outputs directly the real-time generation schedules. Finally, using real power-grid data from the California Independent System Operator (CAISO), case studies are carried out to demonstrate the effectiveness of the proposed approach.
••
TL;DR: This paper studies average AoI minimization in cognitive radio energy harvesting communications, where a primary user has access rights to the spectrum and a secondary user can utilize the spectrum only when the primary user leaves it idle.
Abstract: Age of information (AoI) is a performance metric that measures the timeliness and freshness of information, and is particularly relevant in applications with time-sensitive data. This paper studies average AoI minimization in cognitive radio energy harvesting communications. More specifically, the system studied has a primary user with access rights to the spectrum, and a secondary user who can utilize the spectrum only when it is left idle by the primary user. The secondary user is an energy harvesting sensor that harvests ambient energy, which it uses to perform spectrum sensing and to send status updates of its sensing data to a destination. The status updates are sent by opportunistically accessing the primary user’s spectrum. The secondary user aims to minimize the average AoI by adaptively making sensing and update decisions based on its energy availability and the availability of the primary spectrum, with either perfect or imperfect spectrum sensing. The sequential decision problems are formulated as partially observable Markov decision processes and solved by dynamic programming for finite and infinite horizons. The properties of the optimal sensing and updating policies are investigated and shown to have a threshold structure. Numerical results are presented to confirm the analytical findings.
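The POMDP machinery above rests on the belief update: after acting and observing (here, the imperfect spectrum-sensing outcome), the belief over hidden states is pushed through the dynamics and reweighted by the observation likelihood. A minimal Bayes-filter sketch with illustrative matrices:

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes filter for a finite POMDP.
    b: current belief over states; a: action taken; o: observation index;
    P[a]: state-transition matrix under action a;
    O[a][s', o] = Pr(observe o | next state s', action a)."""
    predicted = b @ P[a]                    # predict: push belief through dynamics
    unnormalized = predicted * O[a][:, o]   # correct: weight by obs likelihood
    return unnormalized / unnormalized.sum()

# Toy example: 2 states (spectrum idle/busy), 2 actions, 2 observations.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.9, 0.1], [0.2, 0.8]]])   # action 1 (same dynamics here)
O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # imperfect sensing likelihoods
              [[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=1, P=P, O=O)
print("posterior belief over {idle, busy}:", np.round(b, 3))
```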
••
24 Jun 2019
TL;DR: This paper proposes NFVdeep, an adaptive, online, deep reinforcement learning approach to automatically deploy SFCs for requests with different QoS requirements, which surpasses the state-of-the-art methods by 32.59% higher accepted throughput and 33.29% lower operation cost on average.
Abstract: With the evolution of network function virtualization (NFV), diverse network services can be flexibly offered as service function chains (SFCs) consisting of different virtual network functions (VNFs). However, network state and traffic typically exhibit unpredictable variations due to stochastically arriving requests with different quality of service (QoS) requirements. Thus, an adaptive online SFC deployment approach is needed to handle the real-time network variations and various service requests. In this paper, we first introduce a Markov decision process (MDP) model to capture the dynamic network state transitions. In order to jointly minimize the operation cost of NFV providers and maximize the total throughput of requests, we propose NFVdeep, an adaptive, online, deep reinforcement learning approach to automatically deploy SFCs for requests with different QoS requirements. Specifically, we use a serialization-and-backtracking method to effectively deal with the large discrete action space. We also adopt a policy gradient based method to improve the training efficiency and convergence to optimality. Extensive experimental results demonstrate that NFVdeep converges fast in the training process and responds rapidly to arriving requests especially in large, frequently transferred network state space. Consequently, NFVdeep surpasses the state-of-the-art methods by 32.59% higher accepted throughput and 33.29% lower operation cost on average.
•
TL;DR: This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
Abstract: While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T} )$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
••
25 Jul 2019
TL;DR: This work proposes a deep reinforcement learning based solution for order dispatching and conducts large scale online A/B tests on DiDi's ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user experience related metrics.
Abstract: Recent works on ride-sharing order dispatching have highlighted the importance of taking into account both the spatial and temporal dynamics in the dispatching process for improving the transportation system efficiency. At the same time, deep reinforcement learning has advanced to the point where it achieves superhuman performance in a number of fields. In this work, we propose a deep reinforcement learning based solution for order dispatching and we conduct large scale online A/B tests on DiDi's ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user experience related metrics. In particular, we model the ride dispatching problem as a Semi-Markov Decision Process to account for the temporal aspect of the dispatching actions. To improve the stability of the value iteration with nonlinear function approximators like neural networks, we propose Cerebellar Value Networks (CVNet) with a novel distributed state representation layer. We further derive a regularized policy evaluation scheme for CVNet that penalizes a large Lipschitz constant of the value network for additional robustness against adversarial perturbation and noise. Finally, we adapt various transfer learning methods to CVNet for increased learning adaptability and efficiency across multiple cities. We conduct extensive offline simulations based on real dispatching data as well as online A/B tests through DiDi's platform. Results show that CVNet consistently outperforms other recently proposed dispatching methods. We finally show that the performance can be further improved through the efficient use of transfer learning.
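The semi-MDP modeling matters because a dispatch action occupies the driver for a variable duration $\tau$; the value backup then discounts by $\gamma^{\tau}$ rather than by one fixed step. In the standard SMDP form (notation assumed here, not quoted from the paper):

$$V(s) \;\leftarrow\; V(s) + \alpha\big[\, r_{0:\tau} + \gamma^{\tau}\, V(s') - V(s) \,\big],$$

where $r_{0:\tau}$ is the reward accumulated over the trip of duration $\tau$ and $s'$ is the state in which the driver next becomes available.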
••
TL;DR: A deep deterministic policy gradient (DDPG) algorithm, which is an actor-critic-based reinforcement learning algorithm, was adapted to capture the USV’s experience during the path-following trials.
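Two generic DDPG mechanics are worth making concrete here: Polyak-averaged target networks that track the learned networks slowly, and Gaussian exploration noise added to the deterministic action. A minimal sketch with flat parameter vectors standing in for network weights; the noise scale and actuator range are illustrative, not the USV controller’s values:

```python
import numpy as np

rng = np.random.default_rng(8)

def soft_update(target, source, tau=0.005):
    """Polyak-averaged target update used by DDPG:
    target <- tau * source + (1 - tau) * target."""
    return tau * source + (1.0 - tau) * target

def explore(mu_action, sigma=0.1, low=-1.0, high=1.0):
    """Deterministic policy output plus Gaussian exploration noise, clipped
    to the actuator range (e.g. a rudder command)."""
    return np.clip(mu_action + rng.normal(0.0, sigma, mu_action.shape), low, high)

actor_w = rng.normal(size=16)       # stand-ins for actor / target-actor weights
target_w = actor_w.copy()
for step in range(1000):
    actor_w += 0.01 * rng.normal(size=16)   # pretend gradient step
    target_w = soft_update(target_w, actor_w)

print("weight gap after training:", np.linalg.norm(actor_w - target_w).round(4))
print("noisy action:", explore(np.array([0.2, -0.5])))
```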
••
TL;DR: This paper proposes DQ-RSS, a deep-reinforcement-learning-based relay selection scheme for WSNs, which uses a DQN to process high-dimensional state spaces and accelerate the learning rate, and compares network performance in terms of three aspects: outage probability, system capacity, and energy consumption.
Abstract: Cooperative communication technology has become a research hotspot in wireless sensor networks (WSNs) in recent years, and will become one of the key technologies for improving spectrum utilization in future wireless communication systems. It leverages cooperation among multiple relay nodes in the wireless network to realize path transmission sharing, thereby improving the system throughput. In this paper, we model the process of cooperative communications with relay selection in WSNs as a Markov decision process and propose DQ-RSS, a deep-reinforcement-learning-based relay selection scheme for WSNs. In DQ-RSS, a deep Q-network (DQN) is trained according to the outage probability and mutual information, and the optimal relay is selected from a plurality of relay nodes without the need for a network model or prior data. More specifically, we use the DQN to process high-dimensional state spaces and accelerate the learning rate. We compare DQ-RSS with a Q-learning-based relay selection scheme and evaluate the network performance in terms of three aspects: outage probability, system capacity, and energy consumption. Simulation results indicate that DQ-RSS achieves better performance on these aspects and reduces the convergence time compared with existing schemes.
••
TL;DR: This paper linearly decomposes the per-SP Markov decision process to simplify the decision making at an SP and derives an online scheme based on deep reinforcement learning to approach the optimal abstract control policies.
Abstract: With cellular networks becoming increasingly agile, a major challenge lies in how to support diverse services for mobile users (MUs) over a common physical network infrastructure. Network slicing is a promising solution to tailor the network to match such service requests. This paper considers a system with radio access network (RAN)-only slicing, where the physical infrastructure is split into slices providing computation and communication functionalities. A limited number of channels are auctioned across scheduling slots to MUs of multiple service providers (SPs) (i.e., the tenants). Each SP behaves selfishly to maximize the expected long-term payoff from the competition with other SPs for the orchestration of channels, which provides its MUs with the opportunities to access the computation and communication slices. This problem is modelled as a stochastic game, in which the decision making of an SP depends on the global network dynamics as well as the joint control policy of all SPs. To approximate the Nash equilibrium solutions, we first construct an abstract stochastic game with local conjectures of the channel auction among the SPs. We then linearly decompose the per-SP Markov decision process to simplify the decision making at an SP and derive an online scheme based on deep reinforcement learning to approach the optimal abstract control policies. Numerical experiments show significant performance gains from our scheme.
••
TL;DR: A deep neural network is used to approximate the table, reducing the required storage space by a factor of 1000 and enabling the collision avoidance system to operate using current avionics systems.
Abstract: One approach to designing decision-making logic for an aircraft collision avoidance system frames the problem as a Markov decision process and optimizes the system using dynamic programming. The re...
••
TL;DR: Deep reinforcement learning (DRL) is utilized to develop EMSs for a series HEV due to DRL's advantages of requiring no future driving information in derivation and good generalization in solving the energy management problem formulated as a Markov decision process.
Abstract: It is essential to develop proper energy management strategies (EMSs) with broad adaptability for hybrid electric vehicles (HEVs). This paper utilizes deep reinforcement learning (DRL) to develop EMSs for a series HEV due to DRL's advantages of requiring no future driving information in derivation and good generalization in solving the energy management problem formulated as a Markov decision process. Historical cumulative trip information is also integrated for effective state-of-charge guidance in DRL-based EMSs. The proposed method is systematically introduced from offline training to online applications; its learning ability, optimality, and generalization are validated by comparisons with a fuel economy benchmark optimized by dynamic programming, and real-time EMSs based on model predictive control (MPC). Simulation results indicate that without a priori knowledge of the future trip, the original DRL-based EMS achieves an average 3.5% gap from the benchmark, superior to the MPC-based EMS with accurate prediction; after further applying output frequency adjustment, a mean gap of 8.7%, which is comparable with the MPC-based EMS with a mean prediction error of 1 m/s, is maintained with concurrently noteworthy improvement in reducing engine start times. Besides, its computation speed of about 0.001 s per simulation step demonstrates its practical application potential, and the method is independent of powertrain topology, such that it is applicable to any type of HEV even when future driving information is unavailable.
•
01 Jan 2019
TL;DR: In this paper, a reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs) is proposed, which aims to make the expected future rewards zero, simplifying Q-value estimation to computing the mean of the immediate reward.
Abstract: We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards.
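The return-equivalence requirement is concrete: redistributed per-step rewards must sum to the original episode return, so optimal policies are preserved while the delay disappears. A minimal sketch in which the per-step contribution scores are arbitrary positive stand-ins; in RUDDER they come from contribution analysis of a learned return predictor:

```python
import numpy as np

def redistribute(rewards, contributions):
    """Return-equivalent reward redistribution: spread the episode return G
    over steps in proportion to (assumed positive) per-step contribution
    scores, so the total return, and hence the optimal policy, is unchanged
    while the delayed reward is moved to the steps that caused it."""
    G = float(np.sum(rewards))
    w = np.asarray(contributions, dtype=float)
    w = w / w.sum()
    return G * w

# Delayed-reward episode: all reward arrives at the final step.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
contributions = np.array([0.5, 3.0, 0.5, 0.5, 0.5])  # stand-in credit scores
print("redistributed:", redistribute(rewards, contributions))
print("sum preserved:", redistribute(rewards, contributions).sum())
```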
••
TL;DR: A deep Q-network (DQN) based technique for task migration in MEC systems that can learn the optimal task migration policy from previous experiences without necessarily acquiring information about users’ mobility patterns in advance.
•
14 May 2019
TL;DR: An off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned policy is likely to have produced a substantially different outcome than the observed policy, and a class of structural causal models for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs).
Abstract: We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
••
TL;DR: The Deep Centralized Multi-agent Actor Critic (DCMAC) as discussed by the authors is an off-policy actor-critic DRL algorithm that directly probes the state/belief space of the underlying MDP/POMDP, providing efficient life-cycle policies for large multi-component systems operating in high-dimensional spaces.
••
TL;DR: The role of emerging energy brokers (middlemen) in a localized event-driven market (LEM) at the distribution level for facilitating indirect customer-to-customer energy trading is explored, with reinforcement learning and data-driven methods applied to the trading process.
Abstract: In this paper, we explore the role of emerging energy brokers (middlemen) in a localized event-driven market (LEM) at the distribution level for facilitating indirect customer-to-customer energy trading. This proposed LEM does not aim to replace any existing energy service or to become the best market model, but instead to diversify the energy ecosystem at the edge of distribution networks. In light of this philosophy, the market mechanism will provide additional options for customers and prosumers who are willing to directly participate in the retail electricity market occasionally, on top of using existing utility services. It also helps in improving market efficiency and encouraging local-level power balance, while taking into account the characteristics of customers’ behavior. The energy trading process is modeled as a Markov decision process, with reinforcement learning and data-driven methods applied. Some economic concepts related to this kind of search-cost-driven market model, such as search friction, are also discussed.
••
TL;DR: In this article, the joint access control and battery prediction problems in a small-cell IoT system including multiple EH user equipments (UEs) and one base station (BS) with limited uplink access channels were investigated.
Abstract: Energy harvesting (EH) is a promising technique to fulfill the long-term and self-sustainable operations for Internet of Things (IoT) systems. In this paper, we study the joint access control and battery prediction problems in a small-cell IoT system including multiple EH user equipments (UEs) and one base station (BS) with limited uplink access channels. Each UE has a rechargeable battery with finite capacity. The system control is modeled as a Markov decision process without complete prior knowledge assumed at the BS, which also deals with large sizes in both state and action spaces. First, to handle the access control problem assuming causal battery and channel state information, we propose a scheduling algorithm that maximizes the uplink transmission sum rate based on reinforcement learning (RL) with deep $Q$-network enhancement. Second, for the battery prediction problem, with a fixed round-robin access control policy adopted, we develop an RL-based algorithm to minimize the prediction loss (error) without any model knowledge about the energy source and energy arrival process. Finally, the joint access control and battery prediction problem is investigated, where we propose a two-layer RL network to simultaneously deal with maximizing the sum rate and minimizing the prediction loss: the first layer handles battery prediction, while the second generates the access policy based on the output of the first. Experiment results show that the three proposed RL algorithms can achieve better performance compared with existing benchmarks.
••
TL;DR: A model-driven deep deterministic policy gradient algorithm is proposed to accomplish the assembly task through the learned policy without analyzing the contact states, and a fuzzy reward system is utilized for the complex assembly process to improve the learning efficiency.
Abstract: The automatic completion of multiple peg-in-hole assembly tasks by robots remains a formidable challenge because the traditional control strategies require a complex analysis of the contact model. In this paper, the assembly task is formulated as a Markov decision process, and a model-driven deep deterministic policy gradient algorithm is proposed to accomplish the assembly task through the learned policy without analyzing the contact states. In our algorithm, the learning process is driven by a simple traditional force controller. In addition, a feedback exploration strategy is proposed to ensure that our algorithm can efficiently explore the optimal assembly policy and avoid risky actions, which can address the data efficiency and guarantee stability in realistic assembly scenarios. To improve the learning efficiency, we utilize a fuzzy reward system for the complex assembly process. Then, simulations and realistic experiments of a dual peg-in-hole assembly demonstrate the effectiveness of the proposed algorithm. The advantages of the fuzzy reward system and feedback exploration strategy are validated by comparing the performances of different cases in simulations and experiments.