Showing papers on "Markov decision process" published in 2020


Journal ArticleDOI
TL;DR: This article develops an asynchronous advantage actor–critic-based cooperation computation offloading and resource allocation algorithm to solve the MDP problem and designs a multiobjective function to maximize the computation rate of MEC systems and the transaction throughput of blockchain systems.
Abstract: Mobile-edge computing (MEC) is a promising paradigm to improve the quality of computation experience of mobile devices because it allows mobile devices to offload computing tasks to MEC servers, benefiting from the powerful computing resources of MEC servers. However, the existing computation-offloading works also have some open issues: 1) security and privacy issues; 2) cooperative computation offloading; and 3) dynamic optimization. To address the security and privacy issues, we employ the blockchain technology that ensures the reliability and irreversibility of data in MEC systems. Meanwhile, we jointly design and optimize the performance of blockchain and MEC. In this article, we develop a cooperative computation offloading and resource allocation framework for blockchain-enabled MEC systems. In the framework, we design a multiobjective function to maximize the computation rate of MEC systems and the transaction throughput of blockchain systems by jointly optimizing offloading decision, power allocation, block size, and block interval. Due to the dynamic characteristics of the wireless fading channel and the processing queues at MEC servers, the joint optimization is formulated as a Markov decision process (MDP). To tackle the dynamics and complexity of the blockchain-enabled MEC system, we develop an asynchronous advantage actor–critic-based cooperation computation offloading and resource allocation algorithm to solve the MDP problem. In the algorithm, deep neural networks are optimized by utilizing asynchronous gradient descent and eliminating the correlation of data. The simulation results show that the proposed algorithm converges fast and achieves significant performance improvements over existing schemes in terms of total reward.

241 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated an energy cost minimization problem for a smart home in the absence of a building thermal dynamics model with the consideration of a comfortable temperature range, and proposed an energy management algorithm based on deep deterministic policy gradients.
Abstract: In this article, we investigate an energy cost minimization problem for a smart home in the absence of a building thermal dynamics model with the consideration of a comfortable temperature range. Due to the existence of model uncertainty, parameter uncertainty (e.g., renewable generation output, nonshiftable power demand, outdoor temperature, and electricity price), and temporally coupled operational constraints, it is very challenging to design an optimal energy management algorithm for scheduling heating, ventilation, and air conditioning systems and energy storage systems in the smart home. To address the challenge, we first formulate the above problem as a Markov decision process, and then propose an energy management algorithm based on deep deterministic policy gradients. It is worth mentioning that the proposed algorithm requires neither prior knowledge of the uncertain parameters nor a building thermal dynamics model. The simulation results based on real-world traces demonstrate the effectiveness and robustness of the proposed algorithm.

213 citations


Journal ArticleDOI
TL;DR: Simulation results demonstrate that the proposed cooperative caching system can reduce the system cost, as well as the content delivery latency, and improve content hit ratio, as compared to the noncooperative and random edge caching schemes.
Abstract: In this article, we propose a cooperative edge caching scheme, a new paradigm to jointly optimize the content placement and content delivery in the vehicular edge computing and networks, with the aid of the flexible trilateral cooperations among a macro-cell station, roadside units, and smart vehicles. We formulate the joint optimization problem as a double time-scale Markov decision process (DTS-MDP), based on the fact that the time-scale of content timeliness changes less frequently as compared to the vehicle mobility and network states during the content delivery process. At the beginning of the large time-scale, the content placement/updating decision can be obtained according to the content popularity, vehicle driving paths, and resource availability. On the small time-scale, the joint vehicle scheduling and bandwidth allocation scheme is designed to minimize the content access cost while satisfying the constraint on content delivery latency. To solve the long-term mixed integer linear programming (LT-MILP) problem, we propose a nature-inspired method based on the deep deterministic policy gradient (DDPG) framework to obtain a suboptimal solution with a low computation complexity. The simulation results demonstrate that the proposed cooperative caching system can reduce the system cost, as well as the content delivery latency, and improve content hit ratio, as compared to the noncooperative and random edge caching schemes.

212 citations


Proceedings Article
15 Jul 2020
TL;DR: This work formalizes how a favorable initial state distribution provides a means to circumvent worst-case exploration issues, and places PG methods on a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.
Abstract: Policy gradient (PG) methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case), but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regards to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. In the tabular setting, our main results are: 1) convergence rate to global optimum for direct parameterization and projected gradient ascent; 2) an asymptotic convergence to global optimum for softmax policy parameterization and PG, and a convergence rate with additional entropy regularization; and 3) dimension-free convergence to global optimum for softmax policy parameterization and Natural Policy Gradient (NPG) method with exact gradients. In function approximation, we further analyze NPG with exact as well as inexact gradients under certain smoothness assumptions on the policy parameterization and establish rates of convergence in terms of the quality of the initial state distribution. One insight of this work is in formalizing how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place PG methods under a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.
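
To make the tabular softmax setting in the abstract above concrete, here is a minimal sketch of exact-gradient policy gradient ascent on a toy discounted MDP. It is an illustration, not the authors' code: the two-state MDP, the step size, and every function name (`toy_mdp`, `policy_gradient`, and so on) are assumptions introduced here. The gradient uses the standard identity dV(rho)/dtheta[s, a] = d(s) * pi(a|s) * A(s, a) / (1 - gamma), where d is the discounted state-visitation distribution under the start distribution rho.

```python
import numpy as np


def toy_mdp():
    """Hypothetical 2-state, 2-action MDP: action 0 stays, action 1 switches state."""
    P = np.zeros((2, 2, 2))          # P[s, a, s'] transition probabilities
    for s in range(2):
        P[s, 0, s] = 1.0
        P[s, 1, 1 - s] = 1.0
    R = np.array([[0.0, 0.0],        # R[s, a]: only "stay in state 1" pays off
                  [1.0, 0.0]])
    return P, R


def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] proportional to exp(theta[s, a])."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def policy_gradient(P, R, gamma, rho, theta):
    """Return the exact gradient of V(rho) w.r.t. theta, and V(rho) itself."""
    S = R.shape[0]
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)            # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)                      # expected one-step reward
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                            # Q[s, a]
    # Discounted visitation: d^T = (1 - gamma) * rho^T (I - gamma * P_pi)^{-1}
    d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
    advantage = Q - V[:, None]
    grad = d[:, None] * pi * advantage / (1 - gamma)
    return grad, float(rho @ V)


def run_policy_gradient(P, R, gamma, rho, steps, lr):
    """Plain gradient ascent on theta, starting from the uniform policy."""
    theta = np.zeros_like(R, dtype=float)
    values = []
    for _ in range(steps):
        grad, value = policy_gradient(P, R, gamma, rho, theta)
        values.append(value)
        theta = theta + lr * grad
    return theta, values


P, R = toy_mdp()
theta, values = run_policy_gradient(P, R, gamma=0.9, rho=np.array([0.5, 0.5]),
                                    steps=2000, lr=0.1)
print("start value:", values[0], "final value:", values[-1])
print("learned policy:\n", softmax_policy(theta))
```

In this toy problem the optimal policy moves to state 1 and stays there, so the start-averaged value can approach but not exceed 9.5. The start distribution `rho` puts mass on both states, which is the kind of favorable initial distribution the abstract refers to.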

198 citations


Journal ArticleDOI
TL;DR: A data-driven method based on neural network (NN) and Q-learning algorithm is developed, which achieves superior performance on cost-effective schedules for HEM system, and demonstrates the effectiveness of the newly developed framework.
Abstract: This paper proposes a novel framework for home energy management (HEM) based on reinforcement learning in achieving efficient home-based demand response (DR). The concerned hour-ahead energy consumption scheduling problem is duly formulated as a finite Markov decision process (FMDP) with discrete time steps. To tackle this problem, a data-driven method based on neural network (NN) and Q-learning algorithm is developed, which achieves superior performance on cost-effective schedules for HEM system. Specifically, real data of electricity price and solar photovoltaic (PV) generation are timely processed for uncertainty prediction by extreme learning machine (ELM) in the rolling time windows. The scheduling decisions of the household appliances and electric vehicles (EVs) can be subsequently obtained through the newly developed framework, of which the objective is dual, i.e., to minimize the electricity bill as well as the DR induced dissatisfaction. Simulations are performed on a residential house level with multiple home appliances, an EV and several PV panels. The test results demonstrate the effectiveness of the proposed data-driven based HEM framework.
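
As a rough illustration of Q-learning on a finite MDP with discrete time steps, the sketch below schedules a single deferrable appliance. It is a tabular stand-in, not the NN-based method of the paper: the hourly prices, the dissatisfaction penalty, and all names (`PRICES`, `PENALTY`, `step`, `q_learning`, `greedy_run_hour`) are hypothetical values chosen for the example.

```python
import random

PRICES = [3.0, 1.0, 4.0, 2.0]   # hypothetical hourly electricity prices
PENALTY = 10.0                  # hypothetical dissatisfaction cost if the load never runs
WAIT, RUN = 0, 1


def step(hour, action):
    """Return (reward, next_hour, done) for an appliance that has not run yet."""
    if action == RUN:
        return -PRICES[hour], None, True      # pay the current price, episode ends
    if hour == len(PRICES) - 1:
        return -PENALTY, None, True           # waited past the last hour
    return 0.0, hour + 1, False


def q_learning(episodes=3000, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration; state = current hour."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in PRICES]          # q[hour][action]
    for _ in range(episodes):
        hour, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice((WAIT, RUN))
            else:
                action = max((WAIT, RUN), key=lambda a: q[hour][a])
            reward, nxt, done = step(hour, action)
            # Bootstrapped target: reward plus best estimated value of the next state
            target = reward if done else reward + gamma * max(q[nxt])
            q[hour][action] += alpha * (target - q[hour][action])
            hour = nxt
    return q


def greedy_run_hour(q):
    """Follow the greedy policy from hour 0 and report when the appliance runs."""
    for hour in range(len(PRICES)):
        if q[hour][RUN] > q[hour][WAIT]:
            return hour
    return None


q_table = q_learning()
print("Q-table:", q_table)
print("greedy policy runs the appliance at hour", greedy_run_hour(q_table))
```

The reward here folds the two objectives named in the abstract (electricity bill and dissatisfaction) into one scalar. A real HEM problem has a much larger state space, which is presumably why the paper pairs Q-learning with a neural network rather than a table.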

194 citations


Journal ArticleDOI
Shu Luo
TL;DR: This paper addresses the dynamic flexible job shop scheduling problem (DFJSP) under new job insertions aiming at minimizing the total tardiness and confirms both the superiority and generality of DQN compared to each composite rule, other well-known dispatching rules as well as the standard Q-learning-based agent.

170 citations


Journal ArticleDOI
TL;DR: A model-free approach based on safe deep reinforcement learning (SDRL) is proposed to solve the EV charging/discharging scheduling problem as a constrained Markov Decision Process (CMDP) to minimize the charging cost as well as guarantee the EV can be fully charged.
Abstract: Electric vehicles (EVs) have been popularly adopted and deployed over the past few years because they are environment-friendly. When integrated into smart grids, EVs can operate as flexible loads or energy storage devices to participate in demand response (DR). By taking advantage of time-varying electricity prices in DR, the charging cost can be reduced by optimizing the charging/discharging schedules. However, since there exists randomness in the arrival and departure time of an EV and the electricity price, it is difficult to determine the optimal charging/discharging schedules to guarantee that the EV is fully charged upon departure. To address this issue, we formulate the EV charging/discharging scheduling problem as a constrained Markov Decision Process (CMDP). The aim is to find a constrained charging/discharging scheduling strategy to minimize the charging cost as well as guarantee the EV can be fully charged. To solve the CMDP, a model-free approach based on safe deep reinforcement learning (SDRL) is proposed. The proposed approach does not require any domain knowledge about the randomness. It directly learns to generate the constrained optimal charging/discharging schedules with a deep neural network (DNN). Unlike existing reinforcement learning (RL) or deep RL (DRL) paradigms, the proposed approach does not need to manually design a penalty term or tune a penalty coefficient. Numerical experiments with real-world electricity prices demonstrate the effectiveness of the proposed approach.

166 citations


Journal ArticleDOI
TL;DR: This article investigates an important computation offloading scheduling problem in a typical VEC scenario, where a VT traveling along an expressway intends to schedule its tasks waiting in the queue to minimize the long-term cost in terms of a tradeoff between task latency and energy consumption.
Abstract: Vehicular edge computing (VEC) is a new computing paradigm that has great potential to enhance the capability of vehicle terminals (VTs) to support resource-hungry in-vehicle applications with low latency and high energy efficiency. In this article, we investigate an important computation offloading scheduling problem in a typical VEC scenario, where a VT traveling along an expressway intends to schedule its tasks waiting in the queue to minimize the long-term cost in terms of a tradeoff between task latency and energy consumption. Due to diverse task characteristics, dynamic wireless environment, and frequent handover events caused by vehicle movements, an optimal solution should take into account both where to schedule (i.e., local computation or offloading) and when to schedule (i.e., the order and time for execution) each task. To solve such a complicated stochastic optimization problem, we model it by a carefully designed Markov decision process (MDP) and resort to deep reinforcement learning (DRL) to deal with the enormous state space. Our DRL implementation is designed based on the state-of-the-art proximal policy optimization (PPO) algorithm. A parameter-shared network architecture combined with a convolutional neural network (CNN) is utilized to approximate both policy and value function, which can effectively extract representative features. A series of adjustments to the state and reward representations are taken to further improve the training efficiency. Extensive simulation experiments and comprehensive comparisons with six known baseline algorithms and their heuristic combinations clearly demonstrate the advantages of the proposed DRL-based offloading scheduling method.

163 citations


Journal ArticleDOI
TL;DR: An improved deep Q-network (DQN) algorithm is proposed to learn the resource allocation policy for the IoT edge computing system to improve the efficiency of resource utilization and has a better convergence performance than the original DQN algorithm.
Abstract: By leveraging mobile edge computing (MEC), a huge amount of data generated by Internet of Things (IoT) devices can be processed and analyzed at the network edge. However, the MEC system usually has only limited virtual resources, which are shared and competed for by IoT edge applications. Thus, we propose a resource allocation policy for the IoT edge computing system to improve the efficiency of resource utilization. The objective of the proposed policy is to minimize the long-term weighted sum of average completion time of jobs and average number of requested resources. The resource allocation problem in the MEC system is formulated as a Markov decision process (MDP). A deep reinforcement learning approach is applied to solve the problem. We also propose an improved deep Q-network (DQN) algorithm to learn the policy, where multiple replay memories are applied to separately store the experiences with small mutual influence. Simulation results show that the proposed algorithm has a better convergence performance than the original DQN algorithm, and the corresponding policy outperforms the other reference policies by lower completion time with fewer requested resources.

151 citations


Journal ArticleDOI
TL;DR: This work proposes a safe off-policy deep reinforcement learning algorithm to solve Volt-VAR control problems in a model-free manner, and outperforms the existing reinforcement learning algorithms and conventional optimization-based approaches on a large feeder.
Abstract: Volt-VAR control is critical to keeping distribution network voltages within allowable range, minimizing losses, and reducing wear and tear of voltage regulating devices. To deal with incomplete and inaccurate distribution network models, we propose a safe off-policy deep reinforcement learning algorithm to solve Volt-VAR control problems in a model-free manner. The Volt-VAR control problem is formulated as a constrained Markov decision process with discrete action space, and solved by our proposed constrained soft actor-critic algorithm. Our proposed reinforcement learning algorithm achieves scalability, sample efficiency, and constraint satisfaction by synergistically combining the merits of the maximum-entropy framework, the method of multiplier, a device-decoupled neural network structure, and an ordinal encoding scheme. Comprehensive numerical studies with the IEEE distribution test feeders show that our proposed algorithm outperforms the existing reinforcement learning algorithms and conventional optimization-based approaches on a large feeder.

150 citations


Journal ArticleDOI
TL;DR: This paper considers a wireless broadcast network where a base-station is updating many users on random information arrivals under a transmission capacity constraint, and develops a structural MDP scheduling algorithm and an index scheduling algorithm, leveraging Markov decision process (MDP) techniques and the Whittle's methodology for restless bandits.
Abstract: Age of information is a new network performance metric that captures the freshness of information at end-users. This paper studies the age of information from a scheduling perspective. To that end, we consider a wireless broadcast network where a base-station (BS) is updating many users on random information arrivals under a transmission capacity constraint. For the offline case when the arrival statistics are known to the BS, we develop a structural MDP scheduling algorithm and an index scheduling algorithm, leveraging Markov decision process (MDP) techniques and the Whittle's methodology for restless bandits. By exploring optimal structural results, we not only reduce the computational complexity of the MDP-based algorithm, but also simplify deriving a closed form of the Whittle index. Moreover, for the online case, we develop an MDP-based online scheduling algorithm and an index-based online scheduling algorithm. Both the structural MDP scheduling algorithm and the MDP-based online scheduling algorithm asymptotically minimize the average age, while the index scheduling algorithm minimizes the average age when the information arrival rates for all users are the same. Finally, the algorithms are validated via extensive numerical studies.
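
The age dynamics behind the scheduling problem above can be sketched in a few lines. The model below is a simplifying assumption, not taken from the paper: at most one user is served per slot, an update can only be delivered in the slot in which it arrives, and a delivered update resets that user's age to 1 while every other age grows by 1. The greedy largest-age rule is a hypothetical baseline, not the paper's structural MDP or Whittle index algorithm.

```python
def step_ages(ages, arrivals, scheduled):
    """Advance all ages by one slot.

    ages:      list of current ages, one per user
    arrivals:  list of bools, True if a fresh update arrived for that user
    scheduled: index of the user served this slot, or None
    """
    return [
        1 if (i == scheduled and arrivals[i]) else age + 1
        for i, age in enumerate(ages)
    ]


def max_age_scheduler(ages, arrivals):
    """Hypothetical baseline: serve the stalest user among those with an arrival."""
    candidates = [i for i, arrived in enumerate(arrivals) if arrived]
    if not candidates:
        return None
    return max(candidates, key=lambda i: ages[i])


ages = [3, 5, 2]
arrivals = [True, False, True]
chosen = max_age_scheduler(ages, arrivals)
print("serve user", chosen, "-> new ages", step_ages(ages, arrivals, chosen))
```

The state of the corresponding MDP would be the age vector together with the arrival indicators, and the per-slot cost would be the average age across users.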

Journal ArticleDOI
TL;DR: Simulation results show that the proposed AI-based collaborative computing approach can adapt to a highly dynamic environment with outstanding performance and the service cost can be minimized via the optimal workload assignment and server selection in collaborative computing.
Abstract: Mobile edge computing (MEC) is a promising technology to support mission-critical vehicular applications, such as intelligent path planning and safety applications. In this paper, a collaborative edge computing framework is developed to reduce the computing service latency and improve service reliability for vehicular networks. First, a task partition and scheduling algorithm (TPSA) is proposed to decide the workload allocation and schedule the execution order of the tasks offloaded to the edge servers given a computation offloading strategy. Second, an artificial intelligence (AI) based collaborative computing approach is developed to determine the task offloading, computing, and result delivery policy for vehicles. Specifically, the offloading and computing problem is formulated as a Markov decision process. A deep reinforcement learning technique, i.e., deep deterministic policy gradient, is adopted to find the optimal solution in a complex urban transportation network. By our approach, the service cost, which includes computing service latency and service failure penalty, can be minimized via the optimal workload assignment and server selection in collaborative computing. Simulation results show that the proposed AI-based collaborative computing approach can adapt to a highly dynamic environment with outstanding performance.

Journal ArticleDOI
TL;DR: This paper addresses this problem by using a model-free deep reinforcement learning (DRL) method to optimize the battery energy arbitrage considering an accurate battery degradation model and a hybrid Convolutional Neural Network and Long Short Term Memory model is adopted to predict the price for the next day.
Abstract: Accurate estimation of battery degradation cost is one of the main barriers for battery participating on the energy arbitrage market. This paper addresses this problem by using a model-free deep reinforcement learning (DRL) method to optimize the battery energy arbitrage considering an accurate battery degradation model. Firstly, the control problem is formulated as a Markov Decision Process (MDP). Then a noisy network based deep reinforcement learning approach is proposed to learn an optimized control policy for storage charging/discharging strategy. To address the uncertainty of electricity price, a hybrid Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) model is adopted to predict the price for the next day. Finally, the proposed approach is tested on the historical U.K. wholesale electricity market prices. The results compared with model based Mixed Integer Linear Programming (MILP) have demonstrated the effectiveness and performance of the proposed framework.

Journal ArticleDOI
TL;DR: A proactive algorithm based on long short-term memory and deep reinforcement learning techniques to address the partial observability and the curse of high dimensionality in local network state space faced by each VUE-pair is proposed.
Abstract: In this paper, we investigate the problem of age of information (AoI)-aware radio resource management for expected long-term performance optimization in a Manhattan grid vehicle-to-vehicle network. With the observation of global network state at each scheduling slot, the roadside unit (RSU) allocates the frequency bands and schedules packet transmissions for all vehicle user equipment-pairs (VUE-pairs). We model the stochastic decision-making procedure as a discrete-time single-agent Markov decision process (MDP). The technical challenges in solving the optimal control policy originate from high spatial mobility and temporally varying traffic information arrivals of the VUE-pairs. To make the problem solving tractable, we first decompose the original MDP into a series of per-VUE-pair MDPs. Then we propose a proactive algorithm based on long short-term memory and deep reinforcement learning techniques to address the partial observability and the curse of high dimensionality in local network state space faced by each VUE-pair. With the proposed algorithm, the RSU makes the optimal frequency band allocation and packet scheduling decision at each scheduling slot in a decentralized way in accordance with the partial observations of the global network state at the VUE-pairs. Numerical experiments validate the theoretical analysis and demonstrate the significant performance improvements from the proposed algorithm.

Journal ArticleDOI
TL;DR: In this paper, the authors studied an unmanned aerial vehicle (UAV)-mounted mobile edge computing network, where the UAV executes computational tasks offloaded from mobile terminal users (TUs) and the motion of each TU follows a Gauss-Markov random model.
Abstract: In this letter, we study an unmanned aerial vehicle (UAV)-mounted mobile edge computing network, where the UAV executes computational tasks offloaded from mobile terminal users (TUs) and the motion of each TU follows a Gauss-Markov random model. To ensure the quality-of-service (QoS) of each TU, the UAV with limited energy dynamically plans its trajectory according to the locations of mobile TUs. Towards this end, we formulate the problem as a Markov decision process, wherein the UAV trajectory and UAV-TU association are modeled as the parameters to be optimized. To maximize the system reward and meet the QoS constraint, we develop a QoS-based action selection policy in the proposed algorithm based on double deep Q-network. Simulations show that the proposed algorithm converges more quickly and achieves a higher sum throughput than conventional algorithms.

Journal ArticleDOI
TL;DR: A reinforcement learning approach with value function approximation and feature learning is proposed for autonomous decision making of intelligent vehicles on highways and uses data-driven feature representation for value and policy approximation so that better learning efficiency can be achieved.
Abstract: Autonomous decision making is a critical and difficult task for intelligent vehicles in dynamic transportation environments. In this paper, a reinforcement learning approach with value function approximation and feature learning is proposed for autonomous decision making of intelligent vehicles on highways. In the proposed approach, the sequential decision making problem for lane changing and overtaking is modeled as a Markov decision process with multiple goals, including safety, speediness, smoothness, etc. In order to learn optimized policies for autonomous decision-making, a multiobjective approximate policy iteration (MO-API) algorithm is presented. The features for value function approximation are learned in a data-driven way, where sparse kernel-based features or manifold-based features can be constructed based on data samples. Compared with previous RL algorithms such as multiobjective Q-learning, the MO-API approach uses data-driven feature representation for value and policy approximation so that better learning efficiency can be achieved. A highway simulation environment using a 14 degree-of-freedom vehicle dynamics model was established to generate training data and test the performance of different decision-making methods for intelligent vehicles on highways. The results illustrate the advantages of the proposed MO-API method under different traffic conditions. Furthermore, we also tested the learned decision policy on a real autonomous vehicle to implement overtaking decision and control under normal traffic on highways. The experimental results also demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: In this article, a real-time monitoring system is considered where multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination.
Abstract: In this paper, we study a real-time monitoring system in which multiple source nodes are responsible for sending update packets to a common destination node in order to maintain the freshness of information at the destination. Since it may not always be feasible to replace or recharge batteries in all source nodes, we consider that the nodes are powered through wireless energy transfer (WET) by the destination. For this system setup, we investigate the optimal online sampling policy (referred to as the age-optimal policy) that jointly optimizes WET and scheduling of update packet transmissions with the objective of minimizing the long-term average weighted sum of Age of Information (AoI) values for different physical processes (observed by the source nodes) at the destination node, referred to as the sum-AoI. To solve this optimization problem, we first model this setup as an average cost Markov decision process (MDP) with finite state and action spaces. Due to the extreme curse of dimensionality in the state space of the formulated MDP, classical reinforcement learning algorithms are no longer applicable to our problem even for reasonable-scale settings. Motivated by this, we propose a deep reinforcement learning (DRL) algorithm that can learn the age-optimal policy in a computationally-efficient manner. We further characterize the structural properties of the age-optimal policy analytically, and demonstrate that it has a threshold-based structure with respect to the AoI values for different processes. We extend our analysis to characterize the structural properties of the policy that maximizes average throughput for our system setup, referred to as the throughput-optimal policy. Afterwards, we analytically demonstrate that the structures of the age-optimal and throughput-optimal policies are different. We also numerically demonstrate these structures as well as the impact of system design parameters on the optimal achievable average weighted sum-AoI.

Journal ArticleDOI
TL;DR: This article first encodes the state of the service provisioning system and the resource allocation scheme, models the adjustment of allocated resources for services as a Markov decision process (MDP), and then obtains a trained resource allocation policy with the help of the reinforcement learning (RL) method.
Abstract: Edge computing (EC) is now emerging as a key paradigm to handle the increasing Internet-of-Things (IoT) devices connected to the edge of the network. By using the services deployed on the service provisioning system which is made up of edge servers nearby, these IoT devices are enabled to fulfill complex tasks effectively. Nevertheless, it also brings challenges in trustworthiness management. The volatile environment will make it difficult to comply with the service-level agreement (SLA), which is an important index of trustworthiness declared by these IoT services. In this article, by denoting the trustworthiness gain with how well the SLA can comply, we first encode the state of the service provisioning system and the resource allocation scheme and model the adjustment of allocated resources for services as a Markov decision process (MDP). Based on these, we get a trained resource allocating policy with the help of the reinforcement learning (RL) method. The trained policy can always maximize the services’ trustworthiness gain by generating appropriate resource allocation schemes dynamically according to the system states. By conducting a series of experiments on the YouTube request dataset, we show that the edge service provisioning system using our approach has 21.72% better performance at least compared to baselines.

Posted Content
TL;DR: This work develops nonasymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes and demonstrates that the algorithm converges linearly at an astonishing rate that is independent of the dimension of the state-action space.
Abstract: Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization -- an algorithmic scheme that encourages exploration -- and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain limited even for the tabular setting. This paper develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly -- or even quadratically once it enters a local region around the optimal policy -- when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-a-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
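
The entropy-regularized NPG update with exact policy evaluation can be sketched on a toy tabular MDP as follows. This is a hedged illustration, not the authors' implementation: the two-state MDP, the regularization weight `tau`, and all function names are assumptions. The update is written in its commonly stated policy-space form, pi_new(a|s) proportional to pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta * Q_tau(s, a) / (1-gamma)), which reduces to soft policy iteration when eta = (1 - gamma) / tau.

```python
import numpy as np


def toy_mdp():
    """Hypothetical 2-state, 2-action MDP: action 0 stays, action 1 switches state."""
    P = np.zeros((2, 2, 2))          # P[s, a, s']
    for s in range(2):
        P[s, 0, s] = 1.0
        P[s, 1, 1 - s] = 1.0
    R = np.array([[0.0, 0.0],
                  [1.0, 0.0]])       # R[s, a]
    return P, R


def soft_q(P, R, gamma, tau, pi):
    """Exact evaluation of the entropy-regularized Q-function of policy pi."""
    S = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Regularized one-step reward: r(s, a) - tau * log pi(a|s), averaged over pi
    r_pi = (pi * (R - tau * np.log(pi))).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ V


def npg_step(pi, Q, gamma, tau, eta):
    """One entropy-regularized NPG update, computed in log space for stability."""
    log_new = (1 - eta * tau / (1 - gamma)) * np.log(pi) + eta * Q / (1 - gamma)
    log_new -= log_new.max(axis=1, keepdims=True)
    new = np.exp(log_new)
    return new / new.sum(axis=1, keepdims=True)


def run_entropy_npg(P, R, gamma, tau, eta, iters):
    """Iterate NPG updates from the uniform policy."""
    pi = np.full(R.shape, 1.0 / R.shape[1])
    for _ in range(iters):
        pi = npg_step(pi, soft_q(P, R, gamma, tau, pi), gamma, tau, eta)
    return pi


P, R = toy_mdp()
gamma, tau = 0.9, 0.5
pi = run_entropy_npg(P, R, gamma, tau, eta=(1 - gamma) / tau, iters=300)
print("regularized-optimal policy:\n", pi)
```

At convergence the policy is a fixed point of the update, so it should be proportional to exp(Q_tau / tau) in every state. Smaller values of `eta` keep more of the previous policy in each step and converge more slowly.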

Proceedings ArticleDOI
14 Jun 2020
TL;DR: In this paper, the authors propose a reinforcement learning based race balance network (RL-RBN) to learn balanced performance for different races based on large margin losses. They formulate the process of finding the optimal margins for non-Caucasians as a Markov decision process and employ deep Q-learning to learn policies for an agent to select an appropriate margin by approximating the Q-value function.
Abstract: Racial equality is an important theme of international human rights law, but it has been largely obscured when the overall face recognition accuracy is pursued blindly. Growing evidence indicates that racial bias degrades the fairness of recognition systems and that the error rates on non-Caucasians are usually much higher than on Caucasians. To encourage fairness, we introduce the idea of adaptive margin to learn balanced performance for different races based on large margin losses. A reinforcement learning based race balance network (RL-RBN) is proposed. We formulate the process of finding the optimal margins for non-Caucasians as a Markov decision process and employ deep Q-learning to learn policies for an agent to select an appropriate margin by approximating the Q-value function. Guided by the agent, the skewness of feature scatter between races can be reduced. In addition, we provide two ethnicity-aware training datasets, called BUPT-Globalface and BUPT-Balancedface, which can be utilized to study racial bias from both data and algorithm aspects. Extensive experiments on the RFW database show that RL-RBN successfully mitigates racial bias and learns more balanced performance.
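For readers unfamiliar with the Q-learning step that underlies such an agent, here is a minimal tabular sketch. The paper uses a deep network to approximate the Q-value function; a lookup table is swapped in here purely for illustration. The state and action indices are hypothetical (actions could stand for candidate margin choices), and `q_learning_update` and `epsilon_greedy` are names invented for this example.

```python
import random

def q_learning_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
    return Q[state][action]

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[state]))
    return max(range(len(Q[state])), key=lambda a: Q[state][a])
```

A deep Q-learning agent replaces the table `Q` with a network and the in-place update with a gradient step on the squared difference between `Q[state][action]` and `target`.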

Journal ArticleDOI
Vassilios Tsounis1, Mitja Alge1, Joonho Lee1, Farbod Farshidian1, Marco Hutter1 
09 Mar 2020
TL;DR: A novel technique for training neural-network policies for terrain-aware locomotion, which combines state-of-the-art methods for model-based motion planning and reinforcement learning is proposed, centered on formulating Markov decision processes using the evaluation of dynamic feasibility criteria in place of physical simulation.
Abstract: This letter addresses the problem of legged locomotion in non-flat terrain. As legged robots such as quadrupeds are to be deployed in terrains with geometries which are difficult to model and predict, the need arises to equip them with the capability to generalize well to unforeseen situations. In this work, we propose a novel technique for training neural-network policies for terrain-aware locomotion, which combines state-of-the-art methods for model-based motion planning and reinforcement learning. Our approach is centered on formulating Markov decision processes using the evaluation of dynamic feasibility criteria in place of physical simulation. We thus employ policy-gradient methods to independently train policies which respectively plan and execute foothold and base motions in 3D environments using both proprioceptive and exteroceptive measurements. We apply our method within a challenging suite of simulated terrain scenarios which contain features such as narrow bridges, gaps and stepping-stones, and train policies which succeed in locomoting effectively in all cases.

Journal ArticleDOI
TL;DR: Experimental results verify that the proposed deep Q-network with a PNC network can provide better solutions for dynamic scheduling problems in terms of manufacturing performance, computational efficiency, and adaptability compared with heuristic methods and a DQN with basic multilayer perceptrons.

Journal ArticleDOI
TL;DR: A new selective maintenance optimization for multi-state systems that can execute multiple consecutive missions over a finite horizon is developed and a customized deep reinforcement learning method is put forth to overcome the “curse of dimensionality” and mitigate the uncountable state space.

Journal ArticleDOI
TL;DR: In this paper, a joint optimization problem of transmission mode selection and resource allocation for cellular V2X communications is investigated, and a deep reinforcement learning (DRL)-based decentralized algorithm is proposed to maximize the sum capacity of vehicle-to-infrastructure users while meeting the latency and reliability requirements of V2V pairs.
Abstract: Cellular vehicle-to-everything (V2X) communication is crucial to support future diverse vehicular applications. However, for safety-critical applications, unstable vehicle-to-vehicle (V2V) links and the high signaling overhead of centralized resource allocation approaches become bottlenecks. In this article, we investigate a joint optimization problem of transmission mode selection and resource allocation for cellular V2X communications. In particular, the problem is formulated as a Markov decision process, and a deep reinforcement learning (DRL)-based decentralized algorithm is proposed to maximize the sum capacity of vehicle-to-infrastructure users while meeting the latency and reliability requirements of V2V pairs. Moreover, considering the training limitations of local DRL models, a two-timescale federated DRL algorithm is developed to help obtain robust models. In this algorithm, the graph theory-based vehicle clustering algorithm is executed on a large timescale and, in turn, the federated learning algorithm is conducted on a small timescale. The simulation results show that the proposed DRL-based algorithm outperforms other decentralized baselines, and validate the superiority of the two-timescale federated DRL algorithm for newly activated V2V pairs.

Journal ArticleDOI
TL;DR: A new reinforcement learning method is proposed for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting and accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications.
Abstract: The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best possible healthcare for each patient. Mobile technologi...

Posted Content
TL;DR: This work analyzes two approaches for learning in Constrained Markov Decision Processes and highlights a crucial difference between the two approaches; the linear programming approach results in stronger guarantees than in the dual formulation based approach.
Abstract: In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade off exploration to discover new information about the MDP and exploitation of the current knowledge to maximize the reward while satisfying the constraints. While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of CMDP to perform optimistic planning at each episode. The second approach leverages the dual formulation (or saddle-point formulation) of CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both achieve sublinear regret w.r.t. the main utility while having a sublinear regret on the constraint violations. That being said, we highlight a crucial difference between the two approaches; the linear programming approach results in stronger guarantees than the dual formulation based approach.

Journal ArticleDOI
TL;DR: A sequential learning algorithm learns an action-value function for each LTC, based on which the optimal tap positions can be directly determined, and a linearized power flow model is used to estimate voltage magnitudes under different tap settings, which allows the RL algorithm to explore the state and action spaces freely offline without impacting the system operation.
Abstract: In this paper, we address the problem of setting the tap positions of load tap changers (LTCs) for voltage regulation in power distribution systems. The objective is to find a policy that maps measurements of voltage magnitudes and topology information to LTC tap ratio changes so as to minimize the voltage deviation across the system. We formulate this problem as a Markov decision process (MDP), and propose a data and computationally efficient batch reinforcement learning (RL) algorithm to solve it. To circumvent the “curse of dimensionality” resulting from the large state and action spaces, we propose a sequential learning algorithm to learn an action-value function for each LTC, based on which the optimal tap positions can be directly determined. By taking advantage of a linearized power flow model, we propose an algorithm to estimate the voltage magnitudes under different tap settings, which allows the RL algorithm to explore the state and action spaces freely offline without impacting the system operation. The effectiveness of the proposed algorithm is validated via numerical simulations on the IEEE 13-bus and 123-bus distribution test feeders.

Posted Content
TL;DR: A survey of the integration of reinforcement learning and planning, better known as model-based reinforcement learning, which presents a broad conceptual overview of planning-learning combinations for MDP optimization.
Abstract: Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is a key challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two key sections, we also discuss the potential benefits of model-based RL, like enhanced data efficiency, targeted exploration, and improved stability. Along the survey, we also draw connections to several related RL fields, like hierarchical RL and transfer, and other research disciplines, like behavioural psychology. Altogether, the survey presents a broad conceptual overview of planning-learning combinations for MDP optimization.

Journal ArticleDOI
TL;DR: A model-free deep reinforcement learning (DRL) method with dueling deep Q network (DDQN) structure is designed to optimize the DR management of IL under the time of use (TOU) tariff and variable electricity consumption patterns.
Abstract: As an important part of incentive demand response (DR), interruptible load (IL) can achieve a rapid response and improve demand side resilience. Yet, model-based optimization algorithms concerned with IL require an explicit physical or mathematical model of the system, which makes it difficult to adapt to realistic operation conditions. In this paper, a model-free deep reinforcement learning (DRL) method with dueling deep Q network (DDQN) structure is designed to optimize the DR management of IL under the time of use (TOU) tariff and variable electricity consumption patterns. The DDQN-based automatic demand response (ADR) architecture is first constructed, which provides a possibility for real-time application of DR. To obtain the maximum long-term profit, the DR management problem of IL is formulated as a Markov decision process (MDP), in which the state, action, and reward function are defined, respectively. The DDQN-based DRL algorithm is applied to solve this MDP for the DR strategy with maximum cumulative reward. The simulation results validate that the proposed algorithm with DDQN overcomes the noise and instability in traditional DQN, and realizes the goal of reducing both the peak load demand and the operation costs on the premise of regulating voltage to the safe limit.
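A dueling Q-network splits its output into a scalar state value and a per-action advantage, then recombines them into Q-values. The sketch below shows the commonly used mean-subtracted aggregation in plain Python. That the paper uses exactly this form is an assumption, and `dueling_q_values` is a name chosen for the example; in a real network the two inputs would be the outputs of the value and advantage heads.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Subtracting the mean advantage makes the split identifiable: adding a
    constant to every advantage leaves the resulting Q-values unchanged.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]

# Usage: a state value of 1.0 with three action advantages.
q_values = dueling_q_values(1.0, [1.0, 2.0, 3.0])  # [0.0, 1.0, 2.0]
```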

Proceedings Article
01 Jan 2020
TL;DR: This work is the first to establish non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs, and it is shown that two sample-based NPG-PD algorithms inherit such non-asymptotic convergence properties and provide finite-sample complexity guarantees.
Abstract: We study sequential decision-making problems in which each agent aims to maximize the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon Constrained Markov Decision Processes (CMDPs) problem. Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method for CMDPs which updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Even though the underlying maximization involves a nonconcave objective function and a nonconvex constraint set under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such a convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for the general smooth policy class, we establish sublinear rates of convergence regarding both the optimality gap and the constraint violation, up to a function approximation error caused by restricted policy parametrization. Finally, we show that two sample-based NPG-PD algorithms inherit such non-asymptotic convergence properties and provide finite-sample complexity guarantees. To the best of our knowledge, our work is the first to establish non-asymptotic convergence guarantees of policy-based primal-dual methods for solving infinite-horizon discounted CMDPs. We also provide computational results to demonstrate the merits of our approach.
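The primal-dual structure described above can be sketched on a drastically simplified problem: a single-state CMDP (a constrained bandit) with a softmax policy. The primal step adds the Lagrangian payoff r + lambda * g to the logits, which is what a natural policy gradient step reduces to in this one-state softmax case, and the dual step is projected sub-gradient descent on lambda for a constraint of the form "expected utility >= b". This is an illustrative simplification, not the paper's algorithm or its step sizes; the function name, the toy rewards and utilities, the projection bound `lam_max`, and all default values are assumptions.

```python
import math

def npg_pd_bandit(rewards, utilities, b, eta_primal=0.05, eta_dual=0.05,
                  lam_max=5.0, steps=2000):
    """Primal-dual sketch for a one-state CMDP with constraint E[g] >= b.

    Returns (pi, lam, avg_utility): the last evaluated policy, the final dual
    variable, and the running average of the expected utility over all steps.
    """
    theta = [0.0] * len(rewards)   # softmax logits (primal variable)
    lam = 0.0                      # Lagrange multiplier (dual variable)
    avg_utility = 0.0
    pi = [1.0 / len(rewards)] * len(rewards)
    for t in range(1, steps + 1):
        # Softmax policy from the current logits (shifted for stability).
        m = max(theta)
        weights = [math.exp(x - m) for x in theta]
        z = sum(weights)
        pi = [w / z for w in weights]
        v_g = sum(p * g for p, g in zip(pi, utilities))
        avg_utility += (v_g - avg_utility) / t
        # Primal ascent on the Lagrangian payoff r + lam * g, using the
        # current lam (both variables are updated from time-t quantities).
        theta = [x + eta_primal * (r + lam * g)
                 for x, r, g in zip(theta, rewards, utilities)]
        # Dual projected sub-gradient descent: lam grows while E[g] < b.
        lam = min(lam_max, max(0.0, lam - eta_dual * (v_g - b)))
    return pi, lam, avg_utility

# Hypothetical toy problem: action 0 pays reward, action 1 pays utility.
policy, multiplier, mean_utility = npg_pd_bandit([1.0, 0.0], [0.0, 1.0], b=0.5)
```

In this kind of scheme the individual iterates can oscillate around the constraint boundary, which is why the sketch also tracks the running average of the utility rather than only the final policy.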