Showing papers on "Markov decision process" published in 2019
••
TL;DR: Simulation results show that the proposed edge VM allocation and task scheduling approach achieves near-optimal performance with very low complexity, and that the proposed learning-based computing offloading algorithm not only converges quickly but also achieves a lower total cost than other offloading approaches.
Abstract: Internet of Things (IoT) computing offloading is a challenging issue, especially in remote areas where common edge/cloud infrastructure is unavailable. In this paper, we present a space-air-ground integrated network (SAGIN) edge/cloud computing architecture for offloading computation-intensive applications under the energy and computation constraints of remote areas, where flying unmanned aerial vehicles (UAVs) provide near-user edge computing and satellites provide access to cloud computing. First, for UAV edge servers, we propose a joint resource allocation and task scheduling approach to efficiently allocate the computing resources to virtual machines (VMs) and schedule the offloaded tasks. Second, we investigate the computing offloading problem in SAGIN and propose a learning-based approach to learn the optimal offloading policy from the dynamic SAGIN environments. Specifically, we formulate the offloading decision making as a Markov decision process where the system state considers the network dynamics. To cope with the system dynamics and complexity, we propose a deep reinforcement learning-based computing offloading approach to learn the optimal offloading policy on-the-fly, where we adopt the policy gradient method to handle the large action space and the actor-critic method to accelerate the learning process. Simulation results show that the proposed edge VM allocation and task scheduling approach achieves near-optimal performance with very low complexity, and that the proposed learning-based computing offloading algorithm not only converges quickly but also achieves a lower total cost than other offloading approaches.
537 citations
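At its core, the offloading learner described above couples a softmax policy (the actor, updated by policy gradient) with a learned value estimate (the critic, whose TD error speeds up learning). Below is a minimal tabular sketch of that actor-critic loop; the state space, action space, and costs are hypothetical placeholders, not the SAGIN model from the paper:

```python
import numpy as np

# Toy stand-in for the offloading MDP: n_s states, n_a offloading actions.
n_s, n_a, gamma = 5, 3, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] -> next-state dist
R = rng.uniform(-1, 0, size=(n_s, n_a))            # negative cost as reward

theta = np.zeros((n_s, n_a))   # actor: softmax policy parameters
V = np.zeros(n_s)              # critic: state-value estimates
alpha_pi, alpha_v = 0.1, 0.2

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for step in range(20000):
    probs = policy(s)
    a = rng.choice(n_a, p=probs)
    s2 = rng.choice(n_s, p=P[s, a])
    r = R[s, a]
    td_error = r + gamma * V[s2] - V[s]         # advantage estimate
    V[s] += alpha_v * td_error                  # critic update
    grad_log = -probs                           # d log pi(a|s) / d theta[s]
    grad_log[a] += 1.0
    theta[s] += alpha_pi * td_error * grad_log  # actor update
    s = s2

print("policy at state 0:", policy(0))
```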
••
TL;DR: A model-free approach based on deep reinforcement learning is proposed to determine the optimal charging strategy under the randomness in traffic conditions, the user’s commuting behavior, and the utility’s pricing process.
Abstract: Driven by the recent advances in electric vehicle (EV) technologies, EVs have become important to the smart grid economy. When EVs participate in a demand response program with real-time pricing signals, the charging cost can be greatly reduced by taking full advantage of these signals. However, it is challenging to determine an optimal charging strategy because of the randomness in traffic conditions, the user’s commuting behavior, and the utility’s pricing process. Conventional model-based approaches require a forecast model of the uncertainty and an optimization of the scheduling process. In this paper, we formulate this scheduling problem as a Markov Decision Process (MDP) with unknown transition probabilities. A model-free approach based on deep reinforcement learning is proposed to determine the optimal strategy for this problem. The proposed approach adaptively learns the transition probabilities and does not require any system model information. The architecture of the proposed approach contains two networks: a representation network to extract discriminative features from the electricity prices and a Q network to approximate the optimal action-value function. Extensive experimental results demonstrate the effectiveness of the proposed approach.
277 citations
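The paper’s agent splits feature extraction and value estimation across two networks; underneath sit the generic DQN mechanics of experience replay plus a periodically synced target estimator. A tabular sketch of just those mechanics, assuming hypothetical price-level states and charge/idle actions:

```python
import numpy as np
from collections import deque

# Hypothetical discretized charging MDP: price-level states, charge/idle actions.
n_s, n_a, gamma = 10, 2, 0.99
rng = np.random.default_rng(1)
Q = np.zeros((n_s, n_a))
Q_target = Q.copy()
replay = deque(maxlen=5000)

def env_step(s, a):
    # Stand-in dynamics: random price walk; reward = -price if charging.
    s2 = min(max(s + rng.integers(-1, 2), 0), n_s - 1)
    r = -float(s) if a == 1 else -0.1    # charging cost vs. small delay penalty
    return s2, r

s = 5
for step in range(30000):
    a = rng.integers(n_a) if rng.random() < 0.1 else int(Q[s].argmax())
    s2, r = env_step(s, a)
    replay.append((s, a, r, s2))
    s = s2
    if len(replay) >= 32:
        for i in rng.integers(len(replay), size=32):   # sampled minibatch
            ss, aa, rr, ss2 = replay[i]
            target = rr + gamma * Q_target[ss2].max()
            Q[ss, aa] += 0.05 * (target - Q[ss, aa])
    if step % 500 == 0:
        Q_target = Q.copy()     # periodic target sync, as in DQN

print("greedy action per price level:", Q.argmax(axis=1))
```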
••
TL;DR: A deep reinforcement learning model is proposed to control the traffic light cycle; the model incorporates multiple optimization elements to improve performance, such as a dueling network, a target network, double Q-learning, and prioritized experience replay.
Abstract: Existing inefficient traffic light cycle control causes numerous problems, such as long delay and waste of energy. To improve efficiency, the traffic light duration must be adjusted dynamically based on real-time traffic information. Existing works either split the traffic signal into equal durations or only leverage limited traffic information. In this paper, we study how to decide the traffic signal duration based on the collected data from different sensors. We propose a deep reinforcement learning model to control the traffic light cycle. In the model, we quantify the complex traffic scenario as states by collecting traffic data and dividing the whole intersection into small grids. The duration changes of a traffic light are the actions, which are modeled as a high-dimensional Markov decision process. The reward is the cumulative waiting time difference between two cycles. To solve the model, a convolutional neural network is employed to map states to rewards. The proposed model incorporates multiple optimization elements to improve the performance, such as dueling network, target network, double Q-learning network, and prioritized experience replay. We evaluate our model via simulation on the Simulation of Urban MObility (SUMO) simulator. Simulation results show the efficiency of our model in controlling traffic lights.
271 citations
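Among the optimization elements listed above, double Q-learning is the easiest to isolate: action selection and action evaluation use two decoupled estimators to curb the overestimation bias of the max operator. A tabular sketch on a random toy MDP (not the grid-based traffic state from the paper):

```python
import numpy as np

# Tabular stand-in illustrating only the double-Q idea: select an action
# with one estimator, evaluate it with the other.
n_s, n_a, gamma, alpha = 8, 4, 0.9, 0.1
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.normal(0.0, 1.0, size=(n_s, n_a))
QA = np.zeros((n_s, n_a))
QB = np.zeros((n_s, n_a))

s = 0
for step in range(50000):
    Qsum = QA[s] + QB[s]
    a = rng.integers(n_a) if rng.random() < 0.1 else int(Qsum.argmax())
    s2 = rng.choice(n_s, p=P[s, a])
    r = R[s, a]
    if rng.random() < 0.5:
        a_star = int(QA[s2].argmax())   # select with A ...
        QA[s, a] += alpha * (r + gamma * QB[s2, a_star] - QA[s, a])  # ... evaluate with B
    else:
        b_star = int(QB[s2].argmax())   # and symmetrically the other way
        QB[s, a] += alpha * (r + gamma * QA[s2, b_star] - QB[s, a])
    s = s2

print("double-Q greedy policy:", (QA + QB).argmax(axis=1))
```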
•
TL;DR: This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.
Abstract: Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
248 citations
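For orientation, the object these guarantees concern is the gradient of the discounted value under a parametric policy $\pi_\theta$; in standard notation (assumed here, not quoted from the paper) it reads

$$\nabla_\theta V^{\pi_\theta}(\mu) \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta}_{\mu},\; a \sim \pi_\theta(\cdot\mid s)}\Big[\nabla_\theta \log \pi_\theta(a\mid s)\, A^{\pi_\theta}(s,a)\Big],$$

where $d^{\pi_\theta}_{\mu}$ is the discounted state-visitation distribution from start distribution $\mu$ and $A^{\pi_\theta}$ is the advantage function. The paper’s distribution-shift analysis asks, in essence, how well ascent on this objective transfers across such visitation distributions.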
••
TL;DR: This paper forms the online offloading problem as a Markov decision process by considering both the blockchain mining tasks and data processing tasks and introduces an adaptive genetic algorithm into the exploration of deep reinforcement learning to effectively avoid useless exploration and speed up the convergence without reducing performance.
Abstract: Offloading computation-intensive tasks (e.g., blockchain consensus processes and data processing tasks) to the edge/cloud is a promising solution for blockchain-empowered mobile edge computing. However, the traditional offloading approaches (e.g., auction-based and game-theory approaches) fail to adjust the policy according to the changing environment and cannot achieve long-term performance. Moreover, the existing deep reinforcement learning-based offloading approaches suffer from slow convergence caused by the high-dimensional action space. In this paper, we propose a new model-free deep reinforcement learning-based online computation offloading approach for blockchain-empowered mobile edge computing in which both mining tasks and data processing tasks are considered. First, we formulate the online offloading problem as a Markov decision process by considering both the blockchain mining tasks and data processing tasks. Then, to maximize long-term offloading performance, we leverage deep reinforcement learning to accommodate highly dynamic environments and address the computational complexity. Furthermore, we introduce an adaptive genetic algorithm into the exploration of deep reinforcement learning to effectively avoid useless exploration and speed up the convergence without reducing performance. Finally, our experimental results demonstrate that our algorithm can converge quickly and outperform three benchmark policies.
223 citations
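The genetic-algorithm-assisted exploration idea can be sketched independently of the offloading model: instead of sampling the exploratory action uniformly, a small candidate population is evolved for a few generations using the current Q-estimates as fitness. A minimal sketch; the population size, mutation rate, and Q-values below are illustrative, not the paper’s configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

def ga_explore(q_values, pop_size=8, generations=3, mut_prob=0.2):
    """Pick an exploratory action by evolving a small candidate population
    scored with the current Q-estimates, rather than sampling uniformly.
    q_values: 1-D array of Q(s, a) over a large discrete action space."""
    n_a = len(q_values)
    pop = rng.integers(n_a, size=pop_size)              # random initial actions
    for _ in range(generations):
        fitness = q_values[pop]
        order = np.argsort(fitness)[::-1]
        parents = pop[order[: pop_size // 2]]           # keep the fitter half
        children = parents.copy()
        mutate = rng.random(len(children)) < mut_prob
        children[mutate] = rng.integers(n_a, size=mutate.sum())
        pop = np.concatenate([parents, children])
    return int(pop[np.argmax(q_values[pop])])

# Usage: inside an epsilon-greedy loop, replace the uniform random branch.
q_s = rng.normal(size=1024)   # Q-estimates for one state, large action set
print("GA-guided exploratory action:", ga_explore(q_s))
```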
••
TL;DR: Simulated results demonstrate the desired driving behaviors of an autonomous vehicle using both the reinforcement learning and inverse reinforcement learning techniques.
172 citations
••
TL;DR: In this article, a real-time IoT monitoring system is considered, in which the IoT devices sample a physical process with a sampling cost and send the status packet to a given destination with an updating cost.
Abstract: The effective operation of time-critical Internet of things (IoT) applications requires real-time reporting of fresh status information of underlying physical processes. In this paper, a real-time IoT monitoring system is considered, in which the IoT devices sample a physical process with a sampling cost and send the status packet to a given destination with an updating cost. This joint status sampling and updating process is designed to minimize the average age of information (AoI) at the destination node under an average energy cost constraint at each device. This stochastic problem is formulated as an infinite horizon average cost constrained Markov decision process (CMDP) and transformed into an unconstrained Markov decision process (MDP) using a Lagrangian method. For the single IoT device case, the optimal policy for the CMDP is shown to be a randomized mixture of two deterministic policies for the unconstrained MDP, which is of threshold type. This reveals a fundamental tradeoff between the average AoI at the destination and the sampling and updating costs. Then, a structure-aware optimal algorithm to obtain the optimal policy of the CMDP is proposed and the impact of the wireless channel dynamics is studied while demonstrating that channels having a larger mean channel gain and less scattering can achieve better AoI performance. For the case of multiple IoT devices, a low-complexity semi-distributed suboptimal policy is proposed with the updating control at the destination and the sampling control at each IoT device. Then, an online learning algorithm is developed to obtain this policy, which can be implemented at each IoT device and requires only the local knowledge and small signaling from the destination. The proposed learning algorithm is shown to converge almost surely to the suboptimal policy. Simulation results show the structural properties of the optimal policy for the single IoT device case; and show that the proposed policy for multiple IoT devices outperforms a zero-wait baseline policy, with average AoI reductions reaching up to 33%.
168 citations
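The CMDP-to-MDP reduction above hinges on the Lagrangian: solve an unconstrained MDP whose reward is penalized by λ times the energy cost, then tune λ until the constraint binds. A minimal sketch on a generic finite MDP, using a discounted criterion and a deterministic policy for simplicity (the paper works with the average-cost criterion and shows the optimum randomizes between two threshold policies):

```python
import numpy as np

# Generic finite MDP with an AoI-like objective and an energy cost per action.
n_s, n_a, gamma = 6, 2, 0.95
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
age = np.tile(np.arange(n_s, dtype=float)[:, None], (1, n_a))  # AoI penalty
energy = np.array([0.0, 1.0])                # cost of the "update" action
budget = 8.0                                 # discounted energy budget

def solve_unconstrained(lam):
    """Value iteration on reward = -AoI - lam * energy; returns the greedy
    policy and its discounted energy usage from state 0."""
    r = -age - lam * energy[None, :]
    V = np.zeros(n_s)
    for _ in range(500):
        V = (r + gamma * P @ V).max(axis=1)
    pi = (r + gamma * P @ V).argmax(axis=1)
    Ppi = P[np.arange(n_s), pi]              # policy-induced transition matrix
    c = np.linalg.solve(np.eye(n_s) - gamma * Ppi, energy[pi])
    return pi, c[0]

lo, hi = 0.0, 50.0
for _ in range(40):                          # bisection on the multiplier
    lam = 0.5 * (lo + hi)
    pi, used = solve_unconstrained(lam)
    lo, hi = (lo, lam) if used <= budget else (lam, hi)

print("multiplier:", round(lam, 4), "policy:", pi, "energy used:", round(used, 3))
```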
••
TL;DR: This paper proposes novel container migration algorithms and architecture to support mobility tasks with various application requirements and demonstrates that the strategy outperforms the existing baseline approaches in terms of delay, power consumption, and migration cost.
Abstract: Fog Computing (FC) is a flexible architecture to support distributed domain-specific applications with cloud-like quality of service. However, current FC still lacks a mobility support mechanism for scenarios with many mobile users that have diversified application quality requirements. Such a mobility support mechanism can be critical, for example in the industrial internet, where humans, products, and devices are movable. To fill this gap, in this paper we propose novel container migration algorithms and an architecture to support mobility tasks with various application requirements. Our algorithms are realized from three aspects: 1) We consider that mobile application tasks can be hosted in a container of a corresponding fog node and migrated, taking the communication delay and computational power consumption into consideration; 2) We further model such a container migration strategy as a multi-dimensional Markov Decision Process (MDP) space. To effectively reduce the large MDP space, efficient deep reinforcement learning algorithms are devised to achieve fast decision-making; and 3) We implement the model and algorithms as a container migration prototype system and test its feasibility and performance. Extensive experiments show that our strategy outperforms the existing baseline approaches by 2.9, 48.5, and 58.4 percent on average in terms of delay, power consumption, and migration cost, respectively.
161 citations
•
TL;DR: A general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: a larger class of regularizers, and the general modified policy iteration approach, encompassing both policy iteration and value iteration.
Abstract: Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.
160 citations
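For the entropy regularizer, the regularized Bellman operator in this framework has a closed form through the Legendre-Fenchel transform: the hard max becomes a temperature-scaled log-sum-exp, and the maximizing policy is the Boltzmann distribution. A minimal soft value iteration sketch on a random finite MDP:

```python
import numpy as np

# Soft (entropy-regularized) value iteration on a random finite MDP.
n_s, n_a, gamma, tau = 6, 3, 0.9, 0.5    # tau: regularization temperature
rng = np.random.default_rng(5)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
R = rng.uniform(0, 1, size=(n_s, n_a))

V = np.zeros(n_s)
for _ in range(300):
    Q = R + gamma * P @ V
    # Legendre-Fenchel conjugate of the entropy: max becomes log-sum-exp.
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))

# The maximizing (Boltzmann) policy is the gradient of the conjugate.
pi = np.exp(Q / tau)
pi /= pi.sum(axis=1, keepdims=True)
print("soft values:", np.round(V, 3))
print("soft-greedy policy (rows = states):\n", np.round(pi, 3))
```

As tau goes to 0 the log-sum-exp collapses to the hard max and the sketch recovers standard value iteration, which is exactly the sense in which the paper's theory generalizes the unregularized case.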
••
TL;DR: A novel dynamic energy management system is developed to incorporate efficient management of the energy storage system into MG real-time dispatch while considering power flow constraints and uncertainties in load, renewable generation and real-time electricity price.
Abstract: This paper focuses on economical operation of a microgrid (MG) in real-time. A novel dynamic energy management system is developed to incorporate efficient management of the energy storage system into MG real-time dispatch while considering power flow constraints and uncertainties in load, renewable generation, and real-time electricity price. The developed dynamic energy management mechanism does not require long-term forecasts, optimization, or distribution knowledge of the uncertainty, but can still optimize the long-term operational costs of MGs. First, the real-time scheduling problem is modeled as a finite-horizon Markov decision process over a day. Then, approximate dynamic programming and deep recurrent neural network learning are employed to derive a near-optimal real-time scheduling policy. Last, using real power grid data from the California Independent System Operator, a detailed simulation study is carried out to validate the effectiveness of the proposed method.
155 citations
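The finite-horizon formulation over a day is easy to make concrete: with discretized storage states and dispatch actions, exact backward induction solves it (the paper replaces this exact DP with approximate dynamic programming plus a recurrent network because the real state space is far larger). A toy sketch with assumed dynamics and costs:

```python
import numpy as np

# Finite-horizon MDP over T dispatch intervals, solved by backward induction.
T, n_s, n_a = 24, 5, 3     # e.g. hours in a day, storage levels, dispatch actions
rng = np.random.default_rng(6)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
cost = rng.uniform(0, 1, size=(T, n_s, n_a))    # time-varying operating cost

V = np.zeros((T + 1, n_s))                      # terminal value V[T] = 0
pi = np.zeros((T, n_s), dtype=int)
for t in range(T - 1, -1, -1):
    Q = cost[t] + P @ V[t + 1]                  # expected cost-to-go
    pi[t] = Q.argmin(axis=1)                    # minimize operating cost
    V[t] = Q.min(axis=1)

print("optimal expected daily cost from each initial state:", np.round(V[0], 3))
```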
••
TL;DR: In this paper, the authors formulate the service migration problem as a Markov decision process (MDP) and provide a mathematical framework to design optimal service migration policies in mobile edge computing.
Abstract: In mobile edge computing, local edge servers can host cloud-based services, which reduces network overhead and latency but requires service migrations as users move to new locations. It is challenging to make migration decisions optimally because of the uncertainty in such a dynamic cloud environment. In this paper, we formulate the service migration problem as a Markov decision process (MDP). Our formulation captures general cost models and provides a mathematical framework to design optimal service migration policies. In order to overcome the complexity associated with computing the optimal policy, we approximate the underlying state space by the distance between the user and service locations. We show that the resulting MDP is exact for the uniform 1-D user mobility, while it provides a close approximation for uniform 2-D mobility with a constant additive error. We also propose a new algorithm and a numerical technique for computing the optimal solution, which is significantly faster than traditional methods based on the standard value or policy iteration. We illustrate the application of our solution in practical scenarios where many theoretical assumptions are relaxed. Our evaluations based on real-world mobility traces of San Francisco taxis show the superior performance of the proposed solution compared to baseline solutions.
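The distance-based approximation above reduces the state to the user-service distance and the decision to keep-or-migrate; value iteration on that 1-D chain then typically yields a distance threshold. A minimal sketch with illustrative costs and uniform 1-D mobility (not the paper’s exact cost model):

```python
import numpy as np

# 1-D distance-state MDP: state = distance between user and service;
# action "keep" leaves the service in place, "migrate" resets distance to 0.
N, gamma = 10, 0.9               # max distance, discount factor
c_run = np.arange(N + 1) * 0.5   # running cost grows with distance
c_mig = 2.0                      # one-off migration cost

V = np.zeros(N + 1)
for _ in range(500):
    V_new = np.empty_like(V)
    for d in range(N + 1):
        # Uniform 1-D mobility: user steps +1 or -1 with prob 1/2 (reflecting).
        up, down = min(d + 1, N), max(d - 1, 0)
        keep = c_run[d] + gamma * 0.5 * (V[up] + V[down])
        migrate = c_mig + c_run[0] + gamma * 0.5 * (V[1] + V[0])
        V_new[d] = min(keep, migrate)
    V = V_new

policy = ["migrate" if (c_mig + c_run[0] + gamma * 0.5 * (V[1] + V[0]))
          < (c_run[d] + gamma * 0.5 * (V[min(d + 1, N)] + V[max(d - 1, 0)]))
          else "keep" for d in range(N + 1)]
print(dict(enumerate(policy)))   # typically "keep" below a distance threshold
```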
••
TL;DR: A novel strategy of FDI attacks is proposed that aims to distort the normal operation of a power system regulated by automatic voltage controls (AVCs); a bad data detection and correction method based on kernel density estimation is also presented that can help maintain the security of the AVC system, even under heavy system loading.
Abstract: False data injection (FDI) attacks intend to threaten the security of power systems. In this paper, a novel strategy of FDI attacks is proposed, which aims to distort the normal operation of a power system regulated by automatic voltage controls (AVCs). Such attacks can be launched from a single substation by an attacker who has little knowledge of the whole power grid. The optimal attack strategy is modeled as a partially observable Markov decision process (POMDP). Then, a $\mathcal{Q}$-learning algorithm with nearest sequence memory is adopted to enable on-line learning and attacking. Stealthy attack strategies are also developed and incorporated into the POMDP model. Various tests are performed on the IEEE 39-bus system. Corresponding results verify the efficacy of the proposed attack strategies. The feasibility of independent and data-driven FDI attacks is investigated. Moreover, a bad data detection and correction method based on kernel density estimation is presented to mitigate the disruptive impacts of the proposed FDI attacks. Test results show that this defensive method can help maintain the security of the AVC system, even under heavy system loading.
•
TL;DR: Safe policy optimization algorithms based on a Lyapunov approach are presented for continuous-action reinforcement learning problems in which it is crucial that the agent interact with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations.
Abstract: We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: this https URL.
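When the state-dependent linearized Lyapunov constraint reduces to a single half-space in action space, the action projection described above has a closed form: keep the proposed action if it satisfies the constraint, otherwise project it onto the constraint boundary. A minimal sketch; the constraint data below are hypothetical stand-ins, not outputs of a learned Lyapunov function:

```python
import numpy as np

def project_action(a, g, h, eps):
    """Project a proposed action onto {a : h + g.(a - a0) <= eps}, with the
    linearization point a0 = 0 for simplicity.
    a: action from the unconstrained policy; g: constraint gradient;
    h: constraint value at a0; eps: allowed constraint budget."""
    violation = h + g @ a - eps
    if violation <= 0:
        return a                            # already safe, leave unchanged
    return a - (violation / (g @ g)) * g    # closest point on the boundary

rng = np.random.default_rng(7)
a = rng.normal(size=4)                      # e.g. a raw DDPG/PPO action
g = rng.normal(size=4)
safe_a = project_action(a, g, h=0.3, eps=0.1)
print("raw action:      ", np.round(a, 3))
print("projected action:", np.round(safe_a, 3))
```

The projection is a Euclidean closest-point map onto a half-space, which is why it composes cleanly with end-to-end gradient training, as the abstract notes.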
••
TL;DR: In this article, the authors proposed a learning-based approach for real-time scheduling of an MG considering the uncertainty of the load demand, renewable energy, and electricity price, which is modeled as a Markov Decision Process (MDP) with an objective of minimizing the daily operating cost.
Abstract: Driven by the recent advances and applications of smart-grid technologies, our electric power grid is undergoing radical modernization. Microgrid (MG) plays an important role in the course of modernization by providing a flexible way to integrate distributed renewable energy resources (RES) into the power grid. However, distributed RES, such as solar and wind, can be highly intermittent and stochastic. These uncertain resources combined with load demand result in random variations in both the supply and the demand sides, which make it difficult to effectively operate an MG. Focusing on this problem, this paper proposes a novel energy management approach for real-time scheduling of an MG considering the uncertainty of the load demand, renewable energy, and electricity price. Unlike the conventional model-based approaches requiring a predictor to estimate the uncertainty, the proposed solution is learning-based and does not require an explicit model of the uncertainty. Specifically, the MG energy management is modeled as a Markov Decision Process (MDP) with an objective of minimizing the daily operating cost. A deep reinforcement learning (DRL) approach is developed to solve the MDP. In the DRL approach, a deep feedforward neural network is designed to approximate the optimal action-value function, and the deep Q-network (DQN) algorithm is used to train the neural network. The proposed approach takes the state of the MG as inputs and outputs directly the real-time generation schedules. Finally, using real power-grid data from the California Independent System Operator (CAISO), case studies are carried out to demonstrate the effectiveness of the proposed approach.
••
TL;DR: This paper studies average AoI minimization in cognitive radio energy harvesting communications, where a primary user has access rights to the spectrum and a secondary user can utilize the spectrum only when the primary user leaves it idle.
Abstract: Age of information (AoI) is a performance metric that measures the timeliness and freshness of information, and is particularly relevant in applications with time-sensitive data. This paper studies average AoI minimization in cognitive radio energy harvesting communications. More specifically, the system studied has a primary user with access rights to the spectrum, and a secondary user who can utilize the spectrum only when it is left idle by the primary user. The secondary user is an energy harvesting sensor that harvests ambient energy, which it uses to perform spectrum sensing and to send status updates of its sensing data to a destination. The status updates are sent by opportunistically accessing the primary user’s spectrum. The secondary user aims to minimize the average AoI by adaptively making sensing and update decisions based on its energy availability and the availability of the primary spectrum, with either perfect or imperfect spectrum sensing. The sequential decision problems are formulated as partially observable Markov decision processes and solved by dynamic programming for finite and infinite horizons. The properties of the optimal sensing and updating policies are investigated and shown to have a threshold structure. Numerical results are presented to confirm the analytical findings.
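The POMDP machinery above rests on the belief update: after acting and observing (here, the imperfect spectrum-sensing outcome), the belief over hidden states is pushed through the dynamics and reweighted by the observation likelihood. A minimal Bayes-filter sketch with illustrative matrices:

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes filter for a finite POMDP.
    b: current belief over states; a: action taken; o: observation index;
    P[a]: state-transition matrix under action a;
    O[a][s', o] = Pr(observe o | next state s', action a)."""
    predicted = b @ P[a]                    # predict: push belief through dynamics
    unnormalized = predicted * O[a][:, o]   # correct: weight by obs likelihood
    return unnormalized / unnormalized.sum()

# Toy example: 2 states (spectrum idle/busy), 2 actions, 2 observations.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.9, 0.1], [0.2, 0.8]]])   # action 1 (same dynamics here)
O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # imperfect sensing likelihoods
              [[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=1, P=P, O=O)
print("posterior belief over {idle, busy}:", np.round(b, 3))
```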
••
24 Jun 2019
TL;DR: This paper proposes NFVdeep, an adaptive, online, deep reinforcement learning approach to automatically deploy SFCs for requests with different QoS requirements, which surpasses the state-of-the-art methods by 32.59% higher accepted throughput and 33.29% lower operation cost on average.
Abstract: With the evolution of network function virtualization (NFV), diverse network services can be flexibly offered as service function chains (SFCs) consisting of different virtual network functions (VNFs). However, network state and traffic typically exhibit unpredictable variations due to stochastically arriving requests with different quality of service (QoS) requirements. Thus, an adaptive online SFC deployment approach is needed to handle the real-time network variations and various service requests. In this paper, we first introduce a Markov decision process (MDP) model to capture the dynamic network state transitions. In order to jointly minimize the operation cost of NFV providers and maximize the total throughput of requests, we propose NFVdeep, an adaptive, online, deep reinforcement learning approach to automatically deploy SFCs for requests with different QoS requirements. Specifically, we use a serialization-and-backtracking method to effectively deal with the large discrete action space. We also adopt a policy gradient based method to improve the training efficiency and convergence to optimality. Extensive experimental results demonstrate that NFVdeep converges fast in the training process and responds rapidly to arriving requests especially in large, frequently transferred network state space. Consequently, NFVdeep surpasses the state-of-the-art methods by 32.59% higher accepted throughput and 33.29% lower operation cost on average.
•
TL;DR: This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
Abstract: While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T} )$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
••
25 Jul 2019
TL;DR: This work proposes a deep reinforcement learning based solution for order dispatching and conducts large scale online A/B tests on DiDi's ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user experience related metrics.
Abstract: Recent works on ride-sharing order dispatching have highlighted the importance of taking into account both the spatial and temporal dynamics in the dispatching process for improving the transportation system efficiency. At the same time, deep reinforcement learning has advanced to the point where it achieves superhuman performance in a number of fields. In this work, we propose a deep reinforcement learning based solution for order dispatching and we conduct large scale online A/B tests on DiDi's ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user experience related metrics. In particular, we model the ride dispatching problem as a Semi-Markov Decision Process to account for the temporal aspect of the dispatching actions. To improve the stability of the value iteration with nonlinear function approximators like neural networks, we propose Cerebellar Value Networks (CVNet) with a novel distributed state representation layer. We further derive a regularized policy evaluation scheme for CVNet that penalizes a large Lipschitz constant of the value network for additional robustness against adversarial perturbation and noise. Finally, we adapt various transfer learning methods to CVNet for increased learning adaptability and efficiency across multiple cities. We conduct extensive offline simulations based on real dispatching data as well as online A/B tests through DiDi's platform. Results show that CVNet consistently outperforms other recently proposed dispatching methods. We finally show that the performance can be further improved through the efficient use of transfer learning.
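The semi-MDP modeling matters because a dispatch action occupies the driver for a variable duration $\tau$; the value backup then discounts by $\gamma^{\tau}$ rather than by one fixed step. In the standard SMDP form (notation assumed here, not quoted from the paper):

$$V(s) \;\leftarrow\; V(s) + \alpha\big[\, r_{0:\tau} + \gamma^{\tau}\, V(s') - V(s) \,\big],$$

where $r_{0:\tau}$ is the reward accumulated over the trip of duration $\tau$ and $s'$ is the state in which the driver next becomes available.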
••
TL;DR: A deep deterministic policy gradient (DDPG) algorithm, which is an actor-critic-based reinforcement learning algorithm, was adapted to capture the USV’s experience during the path-following trials.
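Two generic DDPG mechanics are worth making concrete here: Polyak-averaged target networks that track the learned networks slowly, and Gaussian exploration noise added to the deterministic action. A minimal sketch with flat parameter vectors standing in for network weights; the noise scale and actuator range are illustrative, not the USV controller’s values:

```python
import numpy as np

rng = np.random.default_rng(8)

def soft_update(target, source, tau=0.005):
    """Polyak-averaged target update used by DDPG:
    target <- tau * source + (1 - tau) * target."""
    return tau * source + (1.0 - tau) * target

def explore(mu_action, sigma=0.1, low=-1.0, high=1.0):
    """Deterministic policy output plus Gaussian exploration noise, clipped
    to the actuator range (e.g. a rudder command)."""
    return np.clip(mu_action + rng.normal(0.0, sigma, mu_action.shape), low, high)

actor_w = rng.normal(size=16)       # stand-ins for actor / target-actor weights
target_w = actor_w.copy()
for step in range(1000):
    actor_w += 0.01 * rng.normal(size=16)   # pretend gradient step
    target_w = soft_update(target_w, actor_w)

print("weight gap after training:", np.linalg.norm(actor_w - target_w).round(4))
print("noisy action:", explore(np.array([0.2, -0.5])))
```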
••
TL;DR: This paper proposes DQ-RSS, a deep-reinforcement-learning-based relay selection scheme for WSNs, which uses a DQN to process high-dimensional state spaces and accelerate the learning rate, and compares network performance in terms of three aspects: outage probability, system capacity, and energy consumption.
Abstract: Cooperative communication technology has become a research hotspot in wireless sensor networks (WSNs) in recent years, and will become one of the key technologies for improving spectrum utilization in future wireless communication systems. It leverages cooperation among multiple relay nodes in the wireless network to realize path transmission sharing, thereby improving the system throughput. In this paper, we model the process of cooperative communications with relay selection in WSNs as a Markov decision process and propose DQ-RSS, a deep-reinforcement-learning-based relay selection scheme for WSNs. In DQ-RSS, a deep Q-network (DQN) is trained according to the outage probability and mutual information, and the optimal relay is selected from a plurality of relay nodes without the need for a network model or prior data. More specifically, we use the DQN to process high-dimensional state spaces and accelerate the learning rate. We compare DQ-RSS with a Q-learning-based relay selection scheme and evaluate the network performance in terms of three aspects: outage probability, system capacity, and energy consumption. Simulation results indicate that DQ-RSS achieves better performance on these aspects and reduces the convergence time compared with existing schemes.
••
TL;DR: This paper linearly decomposes the per-SP Markov decision process to simplify the decision making at an SP and derives an online scheme based on deep reinforcement learning to approach the optimal abstract control policies.
Abstract: With cellular networks becoming increasingly agile, a major challenge lies in how to support diverse services for mobile users (MUs) over a common physical network infrastructure. Network slicing is a promising solution to tailor the network to match such service requests. This paper considers a system with radio access network (RAN)-only slicing, where the physical infrastructure is split into slices providing computation and communication functionalities. A limited number of channels are auctioned across scheduling slots to MUs of multiple service providers (SPs) (i.e., the tenants). Each SP behaves selfishly to maximize the expected long-term payoff from the competition with other SPs for the orchestration of channels, which provides its MUs with the opportunities to access the computation and communication slices. This problem is modelled as a stochastic game, in which the decision making of an SP depends on the global network dynamics as well as the joint control policy of all SPs. To approximate the Nash equilibrium solutions, we first construct an abstract stochastic game with local conjectures of the channel auction among the SPs. We then linearly decompose the per-SP Markov decision process to simplify the decision making at an SP and derive an online scheme based on deep reinforcement learning to approach the optimal abstract control policies. Numerical experiments show significant performance gains from our scheme.
••
TL;DR: A deep neural network is used to approximate the table, reducing the required storage space by a factor of 1000 and enabling the collision avoidance system to operate using current avionics systems.
Abstract: One approach to designing decision-making logic for an aircraft collision avoidance system frames the problem as a Markov decision process and optimizes the system using dynamic programming. The re...
••
TL;DR: Deep reinforcement learning (DRL) is utilized to develop EMSs for a series HEV due to DRL's advantages of requiring no future driving information in derivation and good generalization in solving the energy management problem formulated as a Markov decision process.
Abstract: It is essential to develop proper energy management strategies (EMSs) with broad adaptability for hybrid electric vehicles (HEVs). This paper utilizes deep reinforcement learning (DRL) to develop EMSs for a series HEV due to DRL's advantages of requiring no future driving information in derivation and good generalization in solving the energy management problem formulated as a Markov decision process. Historical cumulative trip information is also integrated for effective state-of-charge guidance in DRL-based EMSs. The proposed method is systematically introduced from offline training to online applications; its learning ability, optimality, and generalization are validated by comparisons with a fuel economy benchmark optimized by dynamic programming, and real-time EMSs based on model predictive control (MPC). Simulation results indicate that without a priori knowledge of the future trip, the original DRL-based EMS achieves an average 3.5% gap from the benchmark, superior to the MPC-based EMS with accurate prediction; after further applying output frequency adjustment, a mean gap of 8.7%, which is comparable with the MPC-based EMS with a mean prediction error of 1 m/s, is maintained with concurrently noteworthy improvement in reducing engine start times. Besides, its computation speed of about 0.001 s per simulation step demonstrates its practical application potential, and the method is independent of powertrain topology, such that it is applicable to any type of HEV even when future driving information is unavailable.
•
01 Jan 2019
TL;DR: In this paper, a reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs) is proposed, which aims to make the expected future rewards zero, simplifying Q-value estimation to computing the mean of the immediate reward.
Abstract: We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards.
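The return-equivalence requirement is concrete: redistributed per-step rewards must sum to the original episode return, so optimal policies are preserved while the delay disappears. A minimal sketch in which the per-step contribution scores are arbitrary positive stand-ins; in RUDDER they come from contribution analysis of a learned return predictor:

```python
import numpy as np

def redistribute(rewards, contributions):
    """Return-equivalent reward redistribution: spread the episode return G
    over steps in proportion to (assumed positive) per-step contribution
    scores, so the total return, and hence the optimal policy, is unchanged
    while the delayed reward is moved to the steps that caused it."""
    G = float(np.sum(rewards))
    w = np.asarray(contributions, dtype=float)
    w = w / w.sum()
    return G * w

# Delayed-reward episode: all reward arrives at the final step.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
contributions = np.array([0.5, 3.0, 0.5, 0.5, 0.5])  # stand-in credit scores
print("redistributed:", redistribute(rewards, contributions))
print("sum preserved:", redistribute(rewards, contributions).sum())
```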
••
TL;DR: A deep Q-network (DQN) based technique for task migration in MEC systems that can learn the optimal task migration policy from previous experiences without necessarily acquiring information about users’ mobility patterns in advance.
•
14 May 2019
TL;DR: An off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned policy is likely to have produced a substantially different outcome than the observed policy, and a class of structural causal models for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs).
Abstract: We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
••
TL;DR: The Deep Centralized Multi-agent Actor Critic (DCMAC) as discussed by the authors is an off-policy actor-critic DRL algorithm that directly probes the state/belief space of the underlying MDP/POMDP, providing efficient life-cycle policies for large multi-component systems operating in high-dimensional spaces.
••
TL;DR: The role of emerging energy brokers (middlemen) in a localized event-driven market (LEM) at the distribution level for facilitating indirect customer-to-customer energy trading is explored, with reinforcement learning and data-driven methods applied to the trading process.
Abstract: In this paper, we explore the role of emerging energy brokers (middlemen) in a localized event-driven market (LEM) at the distribution level for facilitating indirect customer-to-customer energy trading. This proposed LEM does not aim to replace any existing energy service or to become the best market model, but instead to diversify the energy ecosystem at the edge of distribution networks. In light of this philosophy, the market mechanism will provide additional options for customers and prosumers who are willing to directly participate in the retail electricity market occasionally, on top of using existing utility services. It also helps in improving market efficiency and encouraging local-level power balance, while taking into account the characteristics of customers’ behavior. The energy trading process is modeled as a Markov decision process, with reinforcement learning and data-driven methods applied. Some economic concepts related to this kind of search-cost-driven market model, such as search friction, are also discussed.
••
TL;DR: In this article, the joint access control and battery prediction problems in a small-cell IoT system including multiple EH user equipments (UEs) and one base station (BS) with limited uplink access channels were investigated.
Abstract: Energy harvesting (EH) is a promising technique to fulfill the long-term and self-sustainable operations for Internet of Things (IoT) systems. In this paper, we study the joint access control and battery prediction problems in a small-cell IoT system including multiple EH user equipments (UEs) and one base station (BS) with limited uplink access channels. Each UE has a rechargeable battery with finite capacity. The system control is modeled as a Markov decision process without complete prior knowledge assumed at the BS, which also deals with large sizes in both state and action spaces. First, to handle the access control problem assuming causal battery and channel state information, we propose a scheduling algorithm that maximizes the uplink transmission sum rate based on reinforcement learning (RL) with deep $Q$-network enhancement. Second, for the battery prediction problem, with a fixed round-robin access control policy adopted, we develop an RL-based algorithm to minimize the prediction loss (error) without any model knowledge about the energy source and energy arrival process. Finally, the joint access control and battery prediction problem is investigated, where we propose a two-layer RL network to simultaneously deal with maximizing the sum rate and minimizing the prediction loss: the first layer handles battery prediction, while the second generates the access policy based on the output of the first. Experiment results show that the three proposed RL algorithms can achieve better performance compared with existing benchmarks.
••
TL;DR: A model-driven deep deterministic policy gradient algorithm is proposed to accomplish the assembly task through the learned policy without analyzing the contact states, and a fuzzy reward system is utilized for the complex assembly process to improve the learning efficiency.
Abstract: The automatic completion of multiple peg-in-hole assembly tasks by robots remains a formidable challenge because the traditional control strategies require a complex analysis of the contact model. In this paper, the assembly task is formulated as a Markov decision process, and a model-driven deep deterministic policy gradient algorithm is proposed to accomplish the assembly task through the learned policy without analyzing the contact states. In our algorithm, the learning process is driven by a simple traditional force controller. In addition, a feedback exploration strategy is proposed to ensure that our algorithm can efficiently explore the optimal assembly policy and avoid risky actions, which can address the data efficiency and guarantee stability in realistic assembly scenarios. To improve the learning efficiency, we utilize a fuzzy reward system for the complex assembly process. Then, simulations and realistic experiments of a dual peg-in-hole assembly demonstrate the effectiveness of the proposed algorithm. The advantages of the fuzzy reward system and feedback exploration strategy are validated by comparing the performances of different cases in simulations and experiments.