
Showing papers on "Markov decision process published in 2017"


Journal ArticleDOI
TL;DR: The design of an asynchronous controller, which covers the well-known mode-independent and synchronous controllers as special cases, is addressed, and a DC motor device is used to demonstrate the practicability of the derived asynchronous synthesis scheme.
Abstract: The issue of asynchronous passive control is addressed for Markov jump systems in this technical note. An asynchronization phenomenon arises between the system modes and the controller modes, and it is described by a hidden Markov model; accordingly, the resultant closed-loop system is referred to as a hidden Markov jump system. By utilizing the matrix inequality technique, three equivalent sufficient conditions are obtained, which guarantee that the hidden Markov jump system is stochastically passive. Based on the established conditions, the design of an asynchronous controller, which covers the well-known mode-independent controller and synchronous controller as special cases, is addressed. A DC motor device is used to demonstrate the practicability of the derived asynchronous synthesis scheme.

413 citations


Proceedings Article
04 Dec 2017
TL;DR: In this article, a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks.
Abstract: Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows their occurrences to be counted with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.
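
The bonus computation described above is compact enough to sketch. Below is a minimal illustration (not the authors' implementation), assuming a SimHash-style random projection as the static hash function; the projection size k and bonus coefficient beta are illustrative hyperparameters.

```python
import numpy as np
from collections import defaultdict

class HashingCountBonus:
    """Count-based exploration bonus over hashed states (illustrative sketch)."""

    def __init__(self, state_dim, k=32, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed random projection
        self.beta = beta
        self.counts = defaultdict(int)                # hash code -> visit count

    def _hash(self, state):
        # SimHash-style code: sign pattern of the projected state.
        return tuple(self.A @ np.asarray(state, dtype=float) > 0)

    def bonus(self, state):
        code = self._hash(state)
        self.counts[code] += 1
        # Classic count-based bonus: beta / sqrt(n(phi(s))).
        return self.beta / np.sqrt(self.counts[code])

# Usage: add the bonus to the environment reward before the RL update,
# e.g. shaped_reward = env_reward + bonus_model.bonus(next_state).
```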

314 citations


Proceedings ArticleDOI
25 Jun 2017
TL;DR: This paper formulates a Markov decision process (MDP) to find dynamic transmission scheduling schemes, with the purpose of minimizing the long-run average age, and proposes both optimal off-line and online scheduling algorithms for the finite-approximate MDPs, depending on knowledge of time-varying arrivals.
Abstract: Age of information is a newly proposed metric that captures delay from an application layer perspective. The age measures the amount of time that has elapsed from the moment the most recently received update was generated until the present time. In this paper, we study an age minimization problem over a wireless broadcast network with many users, where only one user can be served at a time. We formulate a Markov decision process (MDP) to find dynamic transmission scheduling schemes, with the purpose of minimizing the long-run average age. While showing that an optimal scheduling algorithm for the MDP is a simple stationary switch-type policy, we propose a sequence of finite-state approximations for our infinite-state MDP and prove its convergence. We then propose both optimal off-line and online scheduling algorithms for the finite-approximate MDPs, depending on knowledge of the time-varying arrivals.
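
The age dynamics behind this MDP are easy to write down. The sketch below is only an illustration of the metric (not the paper's scheduling algorithm): per-user ages grow by one each slot, and serving a user resets its age to the age of the freshest buffered update; a greedy "largest age drop" rule stands in for the switch-type policy.

```python
import numpy as np

def average_age(num_users=4, horizon=10_000, arrival_prob=0.5, seed=0):
    """Simulate age of information when only one user can be served per slot."""
    rng = np.random.default_rng(seed)
    age = np.zeros(num_users)                   # age at each receiver
    buffered = np.full(num_users, np.inf)       # age of freshest undelivered update
    total_age = 0.0

    for _ in range(horizon):
        # New updates are generated at the source with some probability.
        buffered = np.where(rng.random(num_users) < arrival_prob, 0.0, buffered)

        # Serve the user whose age would drop the most (greedy stand-in policy).
        drop = np.where(np.isfinite(buffered), age - buffered, 0.0)
        user = int(np.argmax(drop))
        if np.isfinite(buffered[user]):
            age[user] = buffered[user]          # delivery: age resets to packet age
            buffered[user] = np.inf

        age += 1.0                              # one slot elapses for everyone
        buffered += 1.0
        total_age += age.sum()

    return total_age / (horizon * num_users)

print(average_age())
```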

241 citations


Posted Content
TL;DR: This paper proposes an efficient reinforcement learning-based resource management algorithm, which learns on-the-fly the optimal policy of dynamic workload offloading and edge server provisioning to minimize the long-term system cost (including both service delay and operational cost).
Abstract: Mobile edge computing (a.k.a. fog computing) has recently emerged to enable in-situ processing of delay-sensitive applications at the edge of mobile networks. Providing grid power supply in support of mobile edge computing, however, is costly and even infeasible (in certain rugged or under-developed areas), thus mandating on-site renewable energy as a major or even sole power supply in increasingly many scenarios. Nonetheless, the high intermittency and unpredictability of renewable energy make it very challenging to deliver a high quality of service to users in energy harvesting mobile edge computing systems. In this paper, we address the challenge of incorporating renewables into mobile edge computing and propose an efficient reinforcement learning-based resource management algorithm, which learns on-the-fly the optimal policy of dynamic workload offloading (to the centralized cloud) and edge server provisioning to minimize the long-term system cost (including both service delay and operational cost). Our online learning algorithm uses a decomposition of the (offline) value iteration and (online) reinforcement learning, thus achieving a significant improvement of learning rate and run-time performance when compared to standard reinforcement learning algorithms such as Q-learning. We prove the convergence of the proposed algorithm and analytically show that the learned policy has a simple monotone structure amenable to practical implementation. Our simulation results validate the efficacy of our algorithm, which significantly improves the edge computing performance compared to fixed or myopic optimization schemes and conventional reinforcement learning algorithms.

216 citations


Posted Content
TL;DR: A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.
Abstract: We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup.
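
For context, the Mirror Descent family that the paper relates these algorithms to uses a multiplicative, entropy-regularized policy update. The sketch below shows that generic update on a toy tabular problem; it is a schematic illustration under assumed Q-values, not the dual formulation developed in the paper.

```python
import numpy as np

def mirror_descent_step(policy, Q, eta=0.5):
    """One KL-regularized policy improvement step:
    new_policy(a|s) is proportional to policy(a|s) * exp(eta * Q(s, a)),
    the exponentiated-gradient update that TRPO-style methods approximate."""
    logits = np.log(policy + 1e-12) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# Toy usage: 3 states, 2 actions, fixed (hypothetical) Q-values.
policy = np.full((3, 2), 0.5)
Q = np.array([[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]])
for _ in range(20):
    policy = mirror_descent_step(policy, Q)
print(policy.round(3))   # mass concentrates on the higher-Q action in each state
```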

199 citations


Journal ArticleDOI
TL;DR: A unified framework for PbRL is provided that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity.
Abstract: Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that not only influence the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning (PbRL) algorithms have been proposed that can learn directly from an expert's preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non-numeric rewards, and the possibility of reducing the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved, as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions, and briefly survey practical tasks that have been solved using PbRL.

181 citations


Posted Content
TL;DR: Thompson Sampling as mentioned in this paper is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.
Abstract: Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
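
The Bernoulli bandit case mentioned above is the smallest complete example of the algorithm. A minimal sketch with Beta(1, 1) priors follows; the arm means and horizon are arbitrary illustration values.

```python
import numpy as np

def thompson_bernoulli(true_means, horizon=10_000, seed=0):
    """Thompson sampling for a Bernoulli bandit with independent Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)   # successes + 1
    beta = np.ones(k)    # failures + 1
    total_reward = 0

    for _ in range(horizon):
        theta = rng.beta(alpha, beta)        # sample a mean estimate for each arm
        arm = int(np.argmax(theta))          # act greedily w.r.t. the sample
        reward = int(rng.random() < true_means[arm])
        alpha[arm] += reward                 # posterior update
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

print(thompson_bernoulli([0.45, 0.55, 0.60]))
```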

180 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: An attention-aware deep reinforcement learning (ADRL) method is proposed for video face recognition, which aims to discard misleading and confounding frames and find the focuses of attention in face videos for person recognition.
Abstract: In this paper, we propose an attention-aware deep reinforcement learning (ADRL) method for video face recognition, which aims to discard the misleading and confounding frames and find the focuses of attentions in face videos for person recognition. We formulate the process of finding the attentions of videos as a Markov decision process and train the attention model through a deep reinforcement learning framework without using extra labels. Unlike existing attention models, our method takes information from both the image space and the feature space as the input to make better use of face information that is discarded in the feature learning process. Besides, our approach is attention-aware, which seeks different attentions of videos for the recognition of different pairs of videos. Our approach achieves very competitive video face recognition performance on three widely used video face datasets.

178 citations


Proceedings Article
06 Aug 2017
TL;DR: A Bayesian expected regret bound of $\tilde{O}(H\sqrt{SAT})$ for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(HS\sqrt{AT})$ for any reinforcement learning algorithm.
Abstract: Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such as UCRL2. We provide insight into the extent of this performance boost and the phenomenon that drives it. We leverage this insight to establish an $\tilde{O}(H\sqrt{SAT})$ Bayesian regret bound for PSRL in finite-horizon episodic Markov decision processes. This improves upon the best previous Bayesian regret bound of $\tilde{O}(HS\sqrt{AT})$ for any reinforcement learning algorithm. Our theoretical results are supported by extensive empirical evaluation.
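
PSRL itself is short to state: at the start of each episode, draw an MDP from the posterior, solve it, and act greedily for H steps. The sketch below shows the per-episode planning step for a tabular model with a Dirichlet posterior over transitions; known mean rewards and the interface names are simplifying assumptions, not details from the paper.

```python
import numpy as np

def psrl_plan_episode(dirichlet_counts, rewards, H, rng):
    """Sample a transition model from the posterior and plan by backward induction.

    dirichlet_counts: (S, A, S) Dirichlet parameters over next-state distributions.
    rewards: (S, A) known mean rewards (a simplifying assumption).
    Returns a nonstationary greedy policy of shape (H, S).
    """
    S, A, _ = dirichlet_counts.shape
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(dirichlet_counts[s, a])   # posterior sample

    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = rewards + P @ V            # Q[s, a] = r(s, a) + E[V(s') | s, a]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

# After each episode, the observed transitions update the posterior:
# dirichlet_counts[s, a, s_next] += 1 for every visited (s, a, s_next).
```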

158 citations


Proceedings ArticleDOI
02 Feb 2017
TL;DR: In this article, the authors formulate the bid decision process as a reinforcement learning problem, where the state space is represented by the auction information and the campaign's real-time parameters, while an action is the bid price to set.
Abstract: The majority of online display ads are served through real-time bidding (RTB) --- each ad display impression is auctioned off in real-time when it is just being generated from a user visit. To place an ad automatically and optimally, it is critical for advertisers to devise a learning algorithm to cleverly bid an ad impression in real-time. Most previous works consider the bid decision as a static optimization problem of either treating the value of each impression independently or setting a bid price to each segment of ad volume. However, the bidding for a given ad campaign would repeatedly happen during its life span before the budget runs out. As such, each bid is strategically correlated by the constrained budget and the overall effectiveness of the campaign (e.g., the rewards from generated clicks), which is only observed after the campaign has completed. Thus, it is of great interest to devise an optimal bidding strategy sequentially so that the campaign budget can be dynamically allocated across all the available impressions on the basis of both the immediate and future rewards. In this paper, we formulate the bid decision process as a reinforcement learning problem, where the state space is represented by the auction information and the campaign's real-time parameters, while an action is the bid price to set. By modeling the state transition via auction competition, we build a Markov Decision Process framework for learning the optimal bidding policy to optimize the advertising performance in the dynamic real-time bidding environment. Furthermore, the scalability problem from the large real-world auction volume and campaign budget is well handled by state value approximation using neural networks. The empirical study on two large-scale real-world datasets and the live A/B testing on a commercial platform have demonstrated the superior performance and high efficiency compared to state-of-the-art methods.
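
The backbone of this formulation is a value function over remaining time and budget, with the bid price as the action. The toy dynamic program below illustrates that idea under strong simplifying assumptions (a discrete market with a known price distribution and a fixed impression value); the paper itself handles real auction data and uses neural networks to approximate the state value at scale.

```python
import numpy as np

def bidding_value_table(T, B, win_value, price_probs):
    """Value iteration over (remaining impressions T, remaining budget B).

    price_probs[p]: probability that the market price equals p.
    A bid b wins every auction with price <= b and pays that price.
    win_value: expected reward (e.g. click value) of a won impression.
    """
    max_price = len(price_probs) - 1
    V = np.zeros((T + 1, B + 1))
    for t in range(1, T + 1):
        for b in range(B + 1):
            best = -np.inf
            for bid in range(min(b, max_price) + 1):       # cannot bid above budget
                value = 0.0
                for price, prob in enumerate(price_probs):
                    if price <= bid:                        # win and pay the price
                        value += prob * (win_value + V[t - 1, b - price])
                    else:                                   # lose, keep the budget
                        value += prob * V[t - 1, b]
                best = max(best, value)
            V[t, b] = best
    return V

# Example: 100 impressions left, budget 50, market price uniform on {0,...,5}.
V = bidding_value_table(T=100, B=50, win_value=1.0, price_probs=[1 / 6] * 6)
print(V[100, 50])
```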

157 citations


Journal ArticleDOI
TL;DR: Simulation results demonstrate the feasibility of the proposed learning approach at enabling agents to learn how to flock in a leader-follower topology, while operating in a nonstationary stochastic environment.
Abstract: In the past two decades, unmanned aerial vehicles (UAVs) have demonstrated their efficacy in supporting both military and civilian applications, where tasks can be dull, dirty, dangerous, or simply too costly with conventional methods. Many of the applications contain tasks that can be executed in parallel, hence the natural progression is to deploy multiple UAVs working together as a force multiplier. However, to do so requires autonomous coordination among the UAVs, similar to swarming behaviors seen in animals and insects. This paper looks at flocking with small fixed-wing UAVs in the context of a model-free reinforcement learning problem. In particular, Peng’s $Q(\lambda )$ with a variable learning rate is employed by the followers to learn a control policy that facilitates flocking in a leader-follower topology. The problem is structured as a Markov decision process, where the agents are modeled as small fixed-wing UAVs that experience stochasticity due to disturbances such as winds and control noises, as well as weight and balance issues. Learned policies are compared to ones solved using stochastic optimal control (i.e., dynamic programming) by evaluating the average cost incurred during flight according to a cost function. Simulation results demonstrate the feasibility of the proposed learning approach at enabling agents to learn how to flock in a leader-follower topology, while operating in a nonstationary stochastic environment.

Posted Content
TL;DR: This paper proposes a novel recommender system with the capability of continuously improving its strategies during the interactions with users and introduces an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online.
Abstract: Recommender systems play a crucial role in mitigating the problem of information overload by suggesting personalized items or services to users. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedback. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wise recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: The authors propose a hierarchical deep reinforcement learning approach to learn a dialogue manager that operates at different temporal scales, including a top-level dialogue policy that selects among subtasks or options, a low-level dialogue policy that chooses primitive actions to complete the subtask given by the top-level policy, and a global state tracker that helps ensure all cross-subtask constraints are satisfied.
Abstract: Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks. For example, the agent needs to reserve a hotel and book a flight so that there is enough time to commute between arrival and hotel check-in. This paper addresses this challenge by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales. The dialogue manager consists of: (1) a top-level dialogue policy that selects among subtasks or options, (2) a low-level dialogue policy that selects primitive actions to complete the subtask given by the top-level policy, and (3) a global state tracker that helps ensure all cross-subtask constraints are satisfied. Experiments on a travel planning task with simulated and real users show that our approach leads to significant improvements over three baselines, two based on handcrafted rules and the other based on flat deep reinforcement learning.

Journal ArticleDOI
TL;DR: A key insight from the numerical results is that the (s, S) inventory policy, popular in theory as well as practice, can be far from optimal for systems consisting of few components.

Proceedings ArticleDOI
08 Feb 2017
TL;DR: In this paper, a new autonomous braking system based on deep reinforcement learning is proposed, which automatically decides whether to apply the brake at each time step when confronting the risk of collision using the information on the obstacle obtained by the sensors.
Abstract: In this paper, we propose a new autonomous braking system based on deep reinforcement learning. The proposed autonomous braking system automatically decides whether to apply the brake at each time step when confronting the risk of collision, using the information on the obstacle obtained by the sensors. The problem of designing brake control is formulated as searching for the optimal policy in a Markov decision process (MDP) model, where the state is given by the relative position of the obstacle and the vehicle's speed, and the action space is defined as the set of brake actions, including 1) no braking, 2) weak, 3) mid, and 4) strong braking actions. The policy used for brake control is learned through computer simulations using the deep reinforcement learning method called deep Q-network (DQN). In order to derive a desirable braking policy, we propose a reward function that balances the damage imposed on the obstacle in the case of an accident against the reward achieved when the vehicle moves out of risk as soon as possible. The DQN is trained for a scenario where the vehicle encounters a pedestrian crossing an urban road. Experiments show that the control agent exhibits desirable control behavior and avoids collision without any mistake in various uncertain environments.
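
The action set and the reward trade-off described above can be made concrete with a short sketch. All numbers, thresholds, and the simple longitudinal dynamics here are illustrative assumptions rather than values from the paper.

```python
# Discrete brake actions as described: no braking, weak, mid, strong.
BRAKE_DECELERATIONS = [0.0, 2.0, 4.0, 8.0]   # m/s^2, illustrative values

def braking_reward(collided, impact_speed, cleared_risk, step_penalty=0.01):
    """Balance collision damage against clearing the risk zone quickly."""
    if collided:
        return -impact_speed ** 2            # damage grows with impact speed
    if cleared_risk:
        return 1.0                           # bonus for leaving the risk region
    return -step_penalty                     # mild pressure not to brake forever

def step_vehicle(position, speed, action_idx, dt=0.1):
    """Simple longitudinal dynamics used to roll the braking MDP forward."""
    speed = max(0.0, speed - BRAKE_DECELERATIONS[action_idx] * dt)
    return position + speed * dt, speed
```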

Journal ArticleDOI
TL;DR: In this article, decentralized Q-learning algorithms for stochastic games are presented, and their convergence is studied for the weakly acyclic case, which includes team problems as an important special case; each decision maker has access only to its own decisions and cost realizations as well as the state transitions.
Abstract: There are only a few learning algorithms applicable to stochastic dynamic teams and games which generalize Markov decision processes to decentralized stochastic control problems involving possibly self-interested decision makers. Learning in games is generally difficult because of the non-stationary environment in which each decision maker aims to learn its optimal decisions with minimal information in the presence of the other decision makers who are also learning. In stochastic dynamic games, learning is more challenging because, while learning, the decision makers alter the state of the system and hence the future cost. In this paper, we present decentralized Q-learning algorithms for stochastic games, and study their convergence for the weakly acyclic case which includes team problems as an important special case. The algorithms are decentralized in that each decision maker has access only to its own decisions and cost realizations as well as the state transitions; in particular, each decision maker is completely oblivious to the presence of the other decision makers. We show that these algorithms converge to equilibrium policies almost surely in large classes of stochastic games.
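
The informational restriction in the abstract maps directly onto the update each agent runs: a Q-table indexed only by the state and the agent's own action, updated from its own realized cost. The sketch below is plain independent Q-learning for a cost-minimizing agent; the paper's algorithms additionally use exploration phases and policy inertia to obtain their convergence guarantees, which this sketch omits.

```python
import numpy as np

class DecentralizedQLearner:
    """One decision maker's learner: it observes the state, its own action, and
    its own cost; the other (also learning) agents are folded into the environment."""

    def __init__(self, num_states, num_own_actions, gamma=0.95,
                 alpha=0.1, epsilon=0.1, seed=0):
        self.Q = np.zeros((num_states, num_own_actions))
        self.gamma, self.alpha, self.epsilon = gamma, alpha, epsilon
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmin(self.Q[state]))     # costs, so pick the smallest

    def update(self, state, own_action, cost, next_state):
        target = cost + self.gamma * self.Q[next_state].min()
        self.Q[state, own_action] += self.alpha * (target - self.Q[state, own_action])
```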

Journal ArticleDOI
TL;DR: The design of an optimal collision-free sensor schedule is considered for a number of sensors, each of which monitors a different linear dynamical system, and a lower bound on the optimal cost is found, which makes it possible to quantify the performance gap between any suboptimal schedule and an optimal one.

Proceedings Article
01 Sep 2017
TL;DR: A Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE) that generates a sample from the posterior distribution over the unknown model parameters at the beginning of each episode and follows the optimal stationary policy for the sampled model for the rest of the episode.
Abstract: We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion happens when the number of visits to any state-action pair is doubled. We establish $\tilde O(HS\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, and $H$ is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.
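
The two stopping criteria are the distinctive part of TSDE and are simple to express in the control loop. The sketch below abstracts the posterior sampling, average-reward planning, and environment interaction into placeholder callables (their names and signatures are assumptions for illustration):

```python
import numpy as np

def tsde(env, sample_model, solve_average_reward, update_posterior,
         num_states, num_actions, total_steps):
    """Thompson Sampling with Dynamically-sized Episodes: control-loop sketch."""
    visits = np.zeros((num_states, num_actions), dtype=int)
    state = env.reset()
    t, prev_len = 0, 0

    while t < total_steps:
        # New episode: sample parameters from the posterior and plan once.
        policy = solve_average_reward(sample_model())
        start, visits_at_start = t, visits.copy()

        while t < total_steps:
            action = policy[state]
            next_state, reward = env.step(action)
            update_posterior(state, action, next_state, reward)
            visits[state, action] += 1
            state, t = next_state, t + 1

            # Criterion 1: the episode may be at most one step longer than the last.
            if t - start > prev_len:
                break
            # Criterion 2: some state-action visit count doubled within the episode.
            if np.any(visits > 2 * visits_at_start):
                break
        prev_len = t - start
    return visits
```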

Posted Content
TL;DR: A new task-specification language for Markov decision processes that is designed to be an improvement over reward functions by being environment independent and extended to probabilistic specifications in a way that permits approximations to be learned in finite time.
Abstract: We propose a new task-specification language for Markov decision processes that is designed to be an improvement over reward functions by being environment independent. The language is a variant of Linear Temporal Logic (LTL) that is extended to probabilistic specifications in a way that permits approximations to be learned in finite time. We provide several small environments that demonstrate the advantages of our geometric LTL (GLTL) language and illustrate how it can be used to specify standard reinforcement-learning tasks straightforwardly.

Proceedings ArticleDOI
Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, Xueqi Cheng
07 Aug 2017
TL;DR: This paper proposes a novel learning-to-rank model based on the Markov decision process (MDP), referred to as MDPRank; experimental results on LETOR benchmark datasets show that MDPRank can outperform state-of-the-art baselines.
Abstract: One of the central issues in learning to rank for information retrieval is to develop algorithms that construct ranking models by directly optimizing evaluation measures such as normalized discounted cumulative gain (NDCG). Existing methods usually focus on optimizing a specific evaluation measure calculated at a fixed position, e.g., NDCG calculated at a fixed position K. In information retrieval the evaluation measures, including the widely used NDCG and P@K, are usually designed to evaluate the document ranking at all of the ranking positions, which provides much richer information than measuring the document ranking at a single position only. Thus, it is interesting to ask whether we can devise an algorithm that has the ability to leverage the measures calculated at all of the ranking positions for learning a better ranking model. In this paper, we propose a novel learning-to-rank model on the basis of the Markov decision process (MDP), referred to as MDPRank. In the learning phase of MDPRank, the construction of a document ranking is considered as sequential decision making, where each step corresponds to an action of selecting a document for the corresponding position. The policy gradient algorithm REINFORCE is adopted to train the model parameters. The evaluation measures calculated at every ranking position are utilized as the immediate rewards for the corresponding actions, which guide the learning algorithm to adjust the model parameters so that the measure is optimized. Experimental results on LETOR benchmark datasets show that MDPRank can outperform state-of-the-art baselines.
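
The sequential-selection view translates into a small REINFORCE loop: at each rank position the policy picks one of the remaining documents, the position-wise metric (here a DCG gain) is the immediate reward, and the return weights the log-probability gradient. The linear softmax scoring model and all dimensions below are illustrative assumptions, not the exact MDPRank configuration.

```python
import numpy as np

def rank_episode(w, features, labels, rng, lr=0.01):
    """One REINFORCE episode of MDPRank-style ranking (illustrative sketch).

    features: (n_docs, d) query-document features; labels: graded relevance.
    Positions are filled one by one; the reward at position k is the DCG gain
    of the chosen document, and rewards-to-go weight the policy gradient.
    """
    remaining = list(range(len(labels)))
    grads, rewards = [], []
    for position in range(len(labels)):
        cand = features[remaining]
        scores = cand @ w
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        idx = rng.choice(len(remaining), p=probs)
        doc = remaining[idx]

        grads.append(cand[idx] - probs @ cand)        # grad of log softmax prob
        rewards.append((2 ** labels[doc] - 1) / np.log2(position + 2))  # DCG gain
        remaining.pop(idx)

    returns = np.cumsum(rewards[::-1])[::-1]          # reward-to-go per position
    for g, G in zip(grads, returns):
        w = w + lr * G * g                            # REINFORCE update
    return w, float(sum(rewards))

# Toy usage on one query with random features and graded labels.
rng = np.random.default_rng(0)
features, labels = rng.standard_normal((10, 5)), rng.integers(0, 3, size=10)
w = np.zeros(5)
for _ in range(200):
    w, dcg = rank_episode(w, features, labels, rng)
print(dcg)
```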

Journal Article
TL;DR: POMDPs.jl allows users to specify sequential decision making problems with minimal effort without sacrificing the expressive nature of POMDPs, making this framework viable for both educational and research purposes.
Abstract: POMDPs.jl is an open-source framework for solving Markov decision processes (MDPs) and partially observable MDPs (POMDPs). POMDPs.jl allows users to specify sequential decision making problems with minimal effort without sacrificing the expressive nature of POMDPs, making this framework viable for both educational and research purposes. It is written in the Julia language to allow flexible prototyping and large-scale computation that leverages the high-performance nature of the language. The associated JuliaPOMDP community also provides a number of state-of-the-art MDP and POMDP solvers and a rich library of support tools to help with implementing new solvers and evaluating the solution results. The most recent version of POMDPs.jl, the related packages, and documentation can be found at https://github.com/JuliaPOMDP/POMDPs.jl.

Journal ArticleDOI
TL;DR: It is shown that an unnormalized joint state and attack distribution conditioned on the sensor measurement information evolves in a linear recursive form, based on which the optimal estimates can be further calculated by evaluating the normalized marginal conditional distributions.
Abstract: The problem of secure state estimation and attack detection in cyber-physical systems is considered in this paper. A stochastic modeling framework is first introduced, based on which the attacked system is modeled as a finite-state hidden Markov model with switching transition probability matrices controlled by a Markov decision process. Based on this framework, a joint state and attack estimation problem is formulated and solved. Utilizing the change of probability measure approach, we show that an unnormalized joint state and attack distribution conditioned on the sensor measurement information evolves in a linear recursive form, based on which the optimal estimates can be further calculated by evaluating the normalized marginal conditional distributions. The estimation results are further applied to secure estimation of stable linear Gaussian systems, and extensions to more general systems are also discussed. The effectiveness of the results is illustrated by numerical examples and comparative simulations.

Journal ArticleDOI
TL;DR: This paper considers the optimal dynamic multicast scheduling to jointly minimize the average delay, power, and fetching costs for cache-enabled content-centric wireless networks, and forms this stochastic optimization problem as an infinite horizon average cost Markov decision process (MDP).
Abstract: Caching and multicasting at base stations are two promising approaches to support massive content delivery over wireless networks. However, existing scheduling designs do not fully exploit the advantages of the two approaches. In this paper, we consider the optimal dynamic multicast scheduling to jointly minimize the average delay, power, and fetching costs for cache-enabled content-centric wireless networks. We formulate this stochastic optimization problem as an infinite horizon average cost Markov decision process (MDP). By using relative value iteration and special structures of the request queue dynamics, we analyze the properties of the value function and the state-action cost function of the MDP for both the uniform and nonuniform channel cases. Based on these properties, we show that the optimal policy, which is adaptive to the request queue state, has a switch structure in the uniform case and a partial switch structure in the nonuniform case. Moreover, in the uniform case with two contents, we show that the switch curve is monotonically non-decreasing. Motivated by the switch structures of the optimal policy, we propose a low-complexity suboptimal policy, which exhibits similar switch structures to the optimal policy, and design a low-complexity algorithm to compute this policy.

Proceedings Article
01 Jan 2017
TL;DR: In this article, the authors present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying MDP is communicating with a finite, though unknown, diameter.
Abstract: We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T\ge S^5A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.

Journal ArticleDOI
05 Jun 2017
TL;DR: The existence and optimality of Markov policies are proved, convex optimization-based tools are developed to compute and analyze the policies, and a sensitivity analysis tool is developed to quantify the effect of ambiguity set parameters on the performance of distributionally robust policies.
Abstract: We consider the problem of constructing control policies that are robust against distribution errors in the model parameters of Markov decision processes. The Wasserstein metric is used to model the ambiguity set of admissible distributions. We prove the existence and optimality of Markov policies and develop convex optimization-based tools to compute and analyze the policies. Our methods, which are based on the Kantorovich convex relaxation and duality principle, have the following advantages. First, the proposed dual formulation of an associated Bellman equation resolves the infinite dimensionality issue that is inherent in its original formulation when the nominal distribution has a finite support. Second, our duality analysis identifies the structure of a worst-case distribution and provides a simple decentralized method for its construction. Third, a sensitivity analysis tool is developed to quantify the effect of ambiguity set parameters on the performance of distributionally robust policies. The effectiveness of our proposed tools is demonstrated through a human-centered air conditioning problem.

Journal ArticleDOI
TL;DR: The architectural design and conceptual framework for a Smart Maintenance Decision Support System (SMDSS) based on corporate data from a Fortune 500 company are outlined, and it is shown that existing solution algorithms and optimization models can be applied to large data sets to lay out executable decisions for managers.
Abstract: A framework for a smart maintenance decision support system is provided. Applications using big data analytics and a specific case study for the electrical utility industry are detailed. An integrated expert system making use of Markov Decision Process and Analytic Hierarchy Process models is developed. The purpose of this article is to outline the architectural design and the conceptual framework for a Smart Maintenance Decision Support System (SMDSS) based on corporate data from a Fortune 500 company. Motivated by the rapidly transforming landscape for big data analytics and predictive maintenance decision making, we have created a system capable of providing end users with recommendations to improve asset lifecycles. Methodologically, a cost minimization algorithm is used to analyze large industry service and warranty data sets, and two analytical decision models were developed and applied to a case study for an electrical circuit breaker maintenance problem. Some of these techniques can be applied to other industries, such as jet engine maintenance, and can be expanded to others with implications for robust decision analysis. The SMDSS provides a predictive analytical model that can be applied in manufacturing and service-based industries. Our findings and results show that existing solution algorithms and optimization models can be applied to large data sets to lay out executable decisions for managers.

Journal ArticleDOI
01 Mar 2017
TL;DR: In this paper, an agent-based communication architecture is adopted to ensure peer-to-peer correspondence capability of the EV, customer, charging station, and dispatcher entities, and the results indicate that optimal route for EVs can be achieved while satisfying all constraints and providing V2G ancillary grid service.
Abstract: In the near future, gasoline-fueled vehicles are expected to be replaced by electrical vehicles (EVs) to save energy and reduce carbon emissions. A large penetration of EVs threatens the stability of the electric grid but also provides a potential for grid ancillary services, which strengthens the grid, if well managed. This paper incorporates grid-to-vehicle (G2V) and vehicle-to-grid (V2G) options in the travel path of logistics sector EVs. The paper offers a complete solution methodology to the multivariant EV routing problem rather than considering only one or two variants of the problem like in previous research. The variants considered include a stochastic environment, multiple dispatchers, time window constraints, simultaneous and nonsimultaneous pickup and delivery, and G2V and V2G service options. Stochastic demand forecasts of the G2V and V2G services at charging stations are modeled using hidden Markov model. The developed solver is based on a modified custom genetic algorithm incorporated with embedded Markov decision process and trust region optimization methods. An agent-based communication architecture is adopted to ensure peer-to-peer correspondence capability of the EV, customer, charging station, and dispatcher entities. The results indicate that optimal route for EVs can be achieved while satisfying all constraints and providing V2G ancillary grid service.

Journal ArticleDOI
TL;DR: In this article, a two-stage stochastic programming (2SSP) for day-ahead UC and dispatch decisions is combined with a Markov decision process (MDP) evolving at a daily timescale.

Journal ArticleDOI
TL;DR: In this paper, a queue-aware power and rate allocation with constraints of average fronthaul consumption for delay-sensitive traffic is formulated as an infinite horizon constrained partially observed Markov decision process, which takes both the urgent queue state information and the imperfect channel state information at transmitters (CSIT) into account.
Abstract: The cloud radio access network (C-RAN) provides high spectral and energy efficiency performances, low expenditures, and intelligent centralized system structures to operators, which have attracted intense interests in both academia and industry. In this paper, a hybrid coordinated multipoint transmission (H-CoMP) scheme is designed for the downlink transmission in C-RANs and fulfills the flexible tradeoff between cooperation gain and fronthaul consumption. The queue-aware power and rate allocation with constraints of average fronthaul consumption for the delay-sensitive traffic are formulated as an infinite horizon constrained partially observed Markov decision process, which takes both the urgent queue state information and the imperfect channel state information at transmitters (CSIT) into account. To deal with the curse of dimensionality involved with the equivalent Bellman equation, the linear approximation of postdecision value functions is utilized. A stochastic gradient algorithm is presented to allocate the queue-aware power and transmission rate with H-CoMP, which is robust against unpredicted traffic arrivals and uncertainties caused by the imperfect CSIT. Furthermore, to substantially reduce the computing complexity, an online learning algorithm is proposed to estimate the per-queue postdecision value functions and update the Lagrange multipliers. The simulation results demonstrate performance gains of the proposed stochastic gradient algorithms and confirm the asymptotical convergence of the proposed online learning algorithm.

Journal ArticleDOI
TL;DR: In this article, a Markov Decision Process (MDP) model for the maximum power point tracking (MPPT) photovoltaic process is defined, and an RL algorithm is proposed and evaluated on a number of photovoltaic sources.