
Showing papers on "Markov decision process" published in 2015


Proceedings ArticleDOI
07 Dec 2015
TL;DR: This work formulates the online MOT problem as decision making in Markov Decision Processes (MDPs), where the lifetime of an object is modeled with an MDP and learning a similarity function for data association is equivalent to learning a policy for the MDP.
Abstract: Online Multi-Object Tracking (MOT) has wide applications in time-critical video analysis scenarios, such as robot navigation and autonomous driving. In tracking-by-detection, a major challenge of online MOT is how to robustly associate noisy object detections on a new video frame with previously tracked objects. In this work, we formulate the online MOT problem as decision making in Markov Decision Processes (MDPs), where the lifetime of an object is modeled with an MDP. Learning a similarity function for data association is equivalent to learning a policy for the MDP, and the policy learning is approached in a reinforcement learning fashion that benefits from the advantages of both offline and online learning for data association. Moreover, our framework can naturally handle the birth/death and appearance/disappearance of targets by treating them as state transitions in the MDP while leveraging existing online single object tracking methods. We conduct experiments on the MOT Benchmark [24] to verify the effectiveness of our method.
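
To make the lifetime-MDP idea concrete, here is a minimal Python sketch of a per-target state machine with illustrative states (active, tracked, lost, inactive) and hand-picked association actions; it is not the paper's implementation, and the transition logic below is hypothetical.

# Minimal sketch (not the paper's implementation): a per-target lifetime MDP
# with illustrative states and actions. Transition logic is hypothetical.
class TargetLifetimeMDP:
    STATES = ("active", "tracked", "lost", "inactive")

    def __init__(self):
        self.state = "active"

    def step(self, action):
        """Apply a data-association decision and return the next state."""
        if self.state == "active":
            # A new detection is either confirmed as a target or rejected.
            self.state = "tracked" if action == "confirm" else "inactive"
        elif self.state == "tracked":
            # A tracked target stays tracked or becomes lost (e.g. occlusion).
            self.state = "tracked" if action == "keep" else "lost"
        elif self.state == "lost":
            # A lost target is re-associated with a detection or terminated.
            self.state = {"reassociate": "tracked",
                          "terminate": "inactive"}.get(action, "lost")
        return self.state

mdp = TargetLifetimeMDP()
print(mdp.step("confirm"), mdp.step("drift"), mdp.step("reassociate"))

In the full method, a learned policy, rather than hard-coded rules like these, would decide the transitions from appearance and motion features.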

695 citations


Proceedings ArticleDOI
20 May 2015
TL;DR: The formulation of a sequential decision-making problem for service migration using the framework of Markov Decision Processes (MDPs) captures general cost models and provides a mathematical framework to design optimal service migration policies.
Abstract: We study the dynamic service migration problem in mobile edge-clouds that host cloud-based services at the network edge. This offers the benefits of reduction in network overhead and latency but requires service migrations as user locations change over time. It is challenging to make these decisions in an optimal manner because of the uncertainty in node mobility as well as possible non-linearity of the migration and transmission costs. In this paper, we formulate a sequential decision making problem for service migration using the framework of Markov Decision Process (MDP). Our formulation captures general cost models and provides a mathematical framework to design optimal service migration policies. In order to overcome the complexity associated with computing the optimal policy, we approximate the underlying state space by the distance between the user and service locations. We show that the resulting MDP is exact for uniform one-dimensional mobility while it provides a close approximation for uniform two-dimensional mobility with a constant additive error term. We also propose a new algorithm and a numerical technique for computing the optimal solution which is significantly faster in computation than traditional methods based on value or policy iteration. We illustrate the effectiveness of our approach by simulation using real-world mobility traces of taxis in San Francisco.
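
As an illustration of the distance-based approximation described above, the following Python sketch runs value iteration on a toy migration MDP whose state is the user-service distance; the costs, mobility model, and discount factor are invented for illustration and are not those of the paper.

# Toy value iteration on a distance-based migration MDP (illustrative only).
import numpy as np

D, gamma = 10, 0.9
trans_cost = lambda d: 1.0 * d          # cost of serving a user d hops away
migr_cost = lambda d: 2.0 + 0.5 * d     # cost of migrating the service

def next_dist_probs(d):
    """Uniform 1-D mobility: the user moves one hop left/right or stays."""
    probs = np.zeros(D + 1)
    for nd, p in ((max(d - 1, 0), 1/3), (d, 1/3), (min(d + 1, D), 1/3)):
        probs[nd] += p
    return probs

V = np.zeros(D + 1)
for _ in range(500):                     # value iteration to convergence
    V_new = np.empty_like(V)
    for d in range(D + 1):
        stay = trans_cost(d) + gamma * next_dist_probs(d) @ V
        move = migr_cost(d) + gamma * next_dist_probs(0) @ V  # migrate to user
        V_new[d] = min(stay, move)
    V = V_new

policy = ["migrate" if migr_cost(d) + gamma * next_dist_probs(0) @ V
          < trans_cost(d) + gamma * next_dist_probs(d) @ V else "stay"
          for d in range(D + 1)]
print(policy)  # typically a threshold policy in the distance d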

242 citations


Journal ArticleDOI
Xianfu Chen, Jinsong Wu, Yueming Cai, Honggang Zhang, Tao Chen
TL;DR: An online reinforcement learning framework is put forward for the problem of traffic offloading in a stochastic heterogeneous cellular network, where the time-varying traffic in the network can be offloaded to nearby small cells, and a centralized Q-learning with compact state representation algorithm, named QC-learning, is designed to solve it.
Abstract: This paper first provides a brief survey of existing traffic offloading techniques in wireless networks. In particular, as a case study, we put forward an online reinforcement learning framework for the problem of traffic offloading in a stochastic heterogeneous cellular network (HCN), where the time-varying traffic in the network can be offloaded to nearby small cells. Our aim is to minimize the total discounted energy consumption of the HCN while maintaining the quality-of-service (QoS) experienced by mobile users. For each cell (i.e., a macro cell or a small cell), the energy consumption is determined by its system load, which is coupled with system loads in other cells due to the sharing of a common frequency band. We model the energy-aware traffic offloading problem in such HCNs as a discrete-time Markov decision process (DTMDP). Based on the traffic observations and the traffic offloading operations, the network controller gradually optimizes the traffic offloading strategy with no prior knowledge of the DTMDP statistics. Such a model-free learning framework is important, particularly when the state space is huge. To address the curse of dimensionality, we design a centralized Q-learning with compact state representation algorithm, named QC-learning. Moreover, a decentralized version of QC-learning is developed based on the fact that macro base stations (BSs) can independently manage the operations of local small-cell BSs by making use of the global network state information obtained from the network controller. Simulations are conducted to show the effectiveness of the derived centralized and decentralized QC-learning algorithms in balancing the tradeoff between energy saving and QoS satisfaction.
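
For readers unfamiliar with the underlying learning rule, here is a minimal tabular Q-learning sketch in Python; it illustrates the model-free update that QC-learning builds on, but the compact state representation and the offloading model itself are not reproduced, and all state/action names below are placeholders.

# Minimal tabular Q-learning sketch (illustrative, not the paper's QC-learning).
import random
from collections import defaultdict

alpha, gamma, eps = 0.1, 0.95, 0.1
Q = defaultdict(float)                 # Q[(state, action)] = estimated cost

def choose_action(state, actions):
    """Epsilon-greedy: explore occasionally, otherwise pick the lowest-cost action."""
    if random.random() < eps:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, cost, next_state, actions):
    """One model-free update; cost could be energy plus a QoS penalty."""
    best_next = min(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (cost + gamma * best_next - Q[(state, action)])

# Example transition in placeholder network states.
q_update("high_load", "offload_to_small_cell", cost=2.0,
         next_state="low_load", actions=["offload_to_small_cell", "serve_by_macro"])
print(choose_action("high_load", ["offload_to_small_cell", "serve_by_macro"]))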

231 citations


Posted Content
TL;DR: In this paper, the authors present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost.
Abstract: In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account \emph{risk}, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
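
The following Python sketch shows the shape of the primal-dual loop described above: the policy parameters descend an estimated Lagrangian gradient while the multiplier ascends on constraint violation. The gradient and risk estimators are hypothetical stand-ins, not the paper's estimators.

# Schematic primal-dual update loop (illustrative, not the paper's algorithm).
import numpy as np

theta = np.zeros(4)            # policy parameters
lam = 0.0                      # Lagrange multiplier for the risk constraint
beta = 1.0                     # risk budget: require CVaR(cost) <= beta
a_theta, a_lam = 0.01, 0.005   # step sizes (two-timescale in the paper)

def estimate_cost_grad(theta, lam):
    return np.random.randn(4)  # placeholder sampled gradient of the Lagrangian

def estimate_risk(theta):
    return np.random.rand()    # placeholder sampled CVaR estimate

for k in range(1000):
    theta -= a_theta * estimate_cost_grad(theta, lam)                # primal descent
    lam = max(0.0, lam + a_lam * (estimate_risk(theta) - beta))      # dual ascent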

229 citations


Journal ArticleDOI
TL;DR: This paper establishes an interesting decoupling property of the MDP that reduces it to two independent MDPs on disjoint state spaces and designs an online control algorithm for the decoupled problem that is provably cost-optimal.

228 citations


Journal ArticleDOI
TL;DR: A stochastic model predictive control-based energy management strategy using the vehicle location, traveling direction, and terrain information of the area is proposed for HEVs running in hilly regions with light traffic; it is shown that the developed method helps maintain the battery SoC within its boundaries and achieves good energy consumption performance.
Abstract: The energy efficiency of parallel hybrid electric vehicles (HEVs) can degrade significantly when the battery state-of-charge (SoC) reaches its boundaries. The road grade has a great influence on the HEV battery charging and discharging processes, and HEV energy management can therefore benefit from a road grade preview. In real-world driving, the road grade ahead can be considered a random variable because the future route is not always available to the HEV controller. This brief proposes a stochastic model predictive control-based energy management strategy using the vehicle location, traveling direction, and terrain information of the area for HEVs running in hilly regions with light traffic. The strategy does not require a determined route to be known in advance. The road grade is modeled as a Markov chain, and stochastic HEV fuel consumption and battery SoC models are developed. The HEV energy management problem is formulated as a finite-horizon Markov decision process and solved using stochastic dynamic programming. The proposed method is evaluated in simulation and compared with an equivalent consumption minimization strategy and the dynamic programming results. It is shown that the developed method helps maintain the battery SoC within its boundaries and achieves good energy consumption performance.
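
A toy Python sketch of the solution machinery named above, i.e., backward stochastic dynamic programming over a Markov-chain road grade; the two-state grade model, SoC dynamics, and fuel cost below are invented for illustration and bear no relation to the paper's vehicle model.

# Toy finite-horizon stochastic DP sketch (not the paper's HEV model).
import numpy as np

H, n_soc = 20, 11                     # horizon, SoC grid 0.0 .. 1.0
grades = [0.0, 0.05]                  # flat / uphill
P_grade = np.array([[0.8, 0.2],       # grade transition probabilities
                    [0.3, 0.7]])
actions = [0.0, 0.5, 1.0]             # fraction of demand met by the battery

def fuel_cost(demand, batt_frac):
    return demand * (1.0 - batt_frac)              # engine supplies the rest

V = np.zeros((H + 1, 2, n_soc))                    # terminal cost = 0
for t in reversed(range(H)):
    for g in range(2):
        demand = 1.0 + 10.0 * grades[g]            # higher demand uphill
        for s in range(n_soc):
            best = np.inf
            for a in actions:
                s_next = s - int(round(a * demand))        # SoC drops with battery use
                if not 0 <= s_next < n_soc:
                    continue                               # infeasible action
                exp_next = P_grade[g] @ V[t + 1, :, s_next]
                best = min(best, fuel_cost(demand, a) + exp_next)
            V[t, g, s] = best
print(V[0, 0, n_soc - 1])   # expected fuel cost from full battery on a flat road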

204 citations


Journal ArticleDOI
TL;DR: For a discrete-time partially observed stochastic system with an exponential running cost, a solution is provided in terms of the finite-dimensional dynamics of the system through a chain of measure transformation techniques.
Abstract: We consider the problem of risk-sensitive stochastic control under a Markov modulated denial-of-service (DoS) attack strategy in which the attacker, using a hidden Markov model, stochastically jams the control packets in the system. For a discrete-time partially observed stochastic system with an exponential running cost, we provide a solution in terms of the finite-dimensional dynamics of the system through a chain of measure transformation techniques. We also prove a separation principle under which a recursive optimal control policy together with a newly defined information-state constitutes an equivalent completely observable stochastic control problem. Remarkably, on the transformed measure space, the solution to the optimal control problem appears as if it depends only on the sample-path (or path-estimation) of the DoS attack sequences in the system.

182 citations


Journal ArticleDOI
TL;DR: This survey reviews numerous applications of the Markov decision process (MDP) framework, a powerful decision-making tool to develop adaptive algorithms and protocols for WSNs, and various solution methods are discussed and compared to serve as a guide for using MDPs in WSNs.
Abstract: Wireless sensor networks (WSNs) consist of autonomous and resource-limited devices. The devices cooperate to monitor one or more physical phenomena within an area of interest. WSNs operate as stochastic systems because of randomness in the monitored environments. For long service time and low maintenance cost, WSNs require adaptive and robust methods to address data exchange, topology formulation, resource and power optimization, sensing coverage and object detection, and security challenges. In these problems, sensor nodes are used to make optimized decisions from a set of accessible strategies to achieve design goals. This survey reviews numerous applications of the Markov decision process (MDP) framework, a powerful decision-making tool to develop adaptive algorithms and protocols for WSNs. Furthermore, various solution methods are discussed and compared to serve as a guide for using MDPs in WSNs.

166 citations


Proceedings ArticleDOI
02 Mar 2015
TL;DR: A framework for automatically learning human user models from joint-action demonstrations is presented that enables a robot to compute a robust policy for a collaborative task with a human; human subject experiments indicate that the learned models, encoded in a MOMDP formulation, can support effective teaming in human-robot collaborative tasks.
Abstract: We present a framework for automatically learning human user models from joint-action demonstrations that enables a robot to compute a robust policy for a collaborative task with a human. First, the demonstrated action sequences are clustered into different human types using an unsupervised learning algorithm. A reward function is then learned for each type through the employment of an inverse reinforcement learning algorithm. The learned model is then incorporated into a mixed-observability Markov decision process (MOMDP) formulation, wherein the human type is a partially observable variable. With this framework, we can infer online the human type of a new user that was not included in the training set, and can compute a policy for the robot that will be aligned to the preference of this user. In a human subject experiment (n=30), participants agreed more strongly that the robot anticipated their actions when working with a robot incorporating the proposed framework (p<0.01), compared to manually annotating robot actions. In trials where participants faced difficulty annotating the robot actions to complete the task, the proposed framework significantly improved team efficiency (p <0.01). The robot incorporating the framework was also found to be more responsive to human actions compared to policies computed using a hand-coded reward function by a domain expert (p<0.01). These results indicate that learning human user models from joint-action demonstrations and encoding them in a MOMDP formalism can support effective teaming in human-robot collaborative tasks.
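
A minimal Python sketch of the online type-inference step described above: a Bayes filter over latent human types, where the per-type action likelihoods (here invented) would in practice come from the clustered demonstrations and learned reward functions.

# Toy Bayes filter over latent human types (illustrative numbers only).
import numpy as np

type_action_probs = np.array([        # P(human action | type): 2 types, 3 actions
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

belief = np.array([0.5, 0.5])         # prior over human types
for observed_action in [0, 0, 2, 0]:  # actions observed during execution
    belief = belief * type_action_probs[:, observed_action]
    belief = belief / belief.sum()
    print(belief)                     # the robot policy conditions on this belief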

161 citations


Proceedings Article
07 Dec 2015
TL;DR: In this paper, a risk-sensitive conditional value-at-risk (CVaR) objective is minimized in place of the standard risk-neutral expectation; the CVaR objective is shown to admit an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget.
Abstract: In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. Our approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to such problem as CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present an approximate value-iteration algorithm for CVaR MDPs and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.
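
For concreteness, the risk measure being optimized can be estimated from sampled costs as the average of the worst alpha-fraction of outcomes. The Python sketch below uses arbitrary cost samples and a simple empirical estimator; it is only meant to show what CVaR measures, not the paper's value-iteration algorithm.

# Simple empirical CVaR estimate from sampled costs (illustrative only).
import numpy as np

def cvar(costs, alpha=0.1):
    """Average of the worst alpha-fraction of costs (CVaR at level alpha)."""
    costs = np.sort(np.asarray(costs))[::-1]        # descending: worst costs first
    k = max(1, int(np.ceil(alpha * len(costs))))
    return costs[:k].mean()

samples = np.random.default_rng(0).normal(10.0, 3.0, size=10_000)
print(cvar(samples, alpha=0.05))   # noticeably larger than the mean of ~10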

140 citations


Proceedings Article
07 Dec 2015
TL;DR: In this paper, an upper sample complexity bound of O(|S|²|A|H²/ε² ln(1/δ)) was derived for episodic fixed-horizon MDPs.
Abstract: Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound O(|S|²|A|H²/ε² ln(1/δ)) and a lower PAC bound Ω(|S||A|H²/ε² ln(1/(δ+c))) that match up to log-terms and an additional linear dependency on the number of states |S|. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least H³.

Journal ArticleDOI
TL;DR: This paper studies the electric vehicle (EV) charging scheduling problem to match the stochastic wind power, and innovatively incorporates the matching degree between wind power and EV charging load into the objective function.
Abstract: This paper studies the electric vehicle (EV) charging scheduling problem to match the stochastic wind power. Besides considering the optimality of the expected charging cost, the proposed model innovatively incorporates the matching degree between wind power and EV charging load into the objective function. Fully taking into account the uncertainty and dynamics in wind energy supply and EV charging demand, this stochastic and multistage matching is formulated as a Markov decision process. To enhance the computational efficiency, effort is made in two respects. First, the problem size is reduced by aggregating EVs according to their remaining parking time; the charging scheduling is carried out on the level of aggregators, and the optimality of the original problem is proved to be preserved. Second, a simulation-based policy improvement method is developed to obtain an improved charging policy from the base policy. The validity of the proposed model and the scalability and computational efficiency of the proposed methods are systematically investigated via numerical experiments.

Posted Content
TL;DR: A family of algorithms with provable guarantees is suggested that learns the underlying models and the latent contexts and optimizes the CMDPs.
Abstract: We consider a planning problem where the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called Contextual Markov Decision Process (CMDP), can model a customer's behavior when interacting with a website (the learner). The customer's behavior depends on gender, age, location, device, etc. Based on that behavior, the website objective is to determine customer characteristics, and to optimize the interaction between them. Our work focuses on one basic scenario--finite horizon with a small known number of possible contexts. We suggest a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the CMDPs. Bounds are obtained for specific naive implementations, and extensions of the framework are discussed, laying the ground for future research.

Journal ArticleDOI
TL;DR: This paper presents a novel algorithmic approach to reformulate a joint chance constraint as a constraint on the expectation of a summation of indicator random variables, which can be incorporated into the cost function by considering a dual formulation of the optimization problem.
Abstract: Existing approaches to constrained dynamic programming are limited to formulations where the constraints share the same additive structure of the objective function (that is, they can be represented as an expectation of the summation of one-stage costs). As such, these formulations cannot handle joint probabilistic (chance) constraints, whose structure is not additive. To bridge this gap, this paper presents a novel algorithmic approach for joint chance-constrained dynamic programming problems, where the probability of failure to satisfy given state constraints is explicitly bounded. Our approach is to (conservatively) reformulate a joint chance constraint as a constraint on the expectation of a summation of indicator random variables, which can be incorporated into the cost function by considering a dual formulation of the optimization problem. As a result, the primal variables can be optimized by standard dynamic programming, while the dual variable is optimized by a root-finding algorithm that converges exponentially. Error bounds on the primal and dual objective values are rigorously derived. We demonstrate algorithm effectiveness on three optimal control problems, namely a path planning problem, a Mars entry, descent and landing problem, and a Lunar landing problem. All Mars simulations are conducted using real terrain data of Mars, with four million discrete states at each time step. The numerical experiments are used to validate our theoretical and heuristic arguments that the proposed algorithm is both (i) computationally efficient, i.e., capable of handling real-world problems, and (ii) near-optimal, i.e., its degree of conservatism is very low.
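
The dual step described above reduces to a one-dimensional search over the Lagrange multiplier. The Python sketch below illustrates this with bisection; solve_primal is a hypothetical stand-in for the dynamic-programming solve at a fixed multiplier, not the paper's algorithm.

# Illustrative dual-variable search by bisection (toy model, not the paper's code).
def solve_primal(lam):
    """Return the expected number of constraint violations of the policy that
    minimizes cost + lam * (violation indicators); placeholder toy model."""
    return 1.0 / (1.0 + lam)          # violations shrink as the penalty grows

def find_multiplier(risk_bound, lo=0.0, hi=1e6, iters=60):
    """Bisection: find lam such that expected violations are within risk_bound."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if solve_primal(mid) > risk_bound:
            lo = mid                   # still too risky: increase the penalty
        else:
            hi = mid
    return hi

print(find_multiplier(risk_bound=0.05))   # lam is about 19 for this toy model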

Book ChapterDOI
11 Apr 2015
TL;DR: FAUST² allows refining the outcomes of the verification procedures over the concrete dtMP in view of the quantified and tunable error, which depends on the dtMP dynamics and on the given formula.
Abstract: FAUST² is a software tool that generates formal abstractions of possibly non-deterministic discrete-time Markov processes (dtMP) defined over uncountable continuous state spaces. A dtMP model is specified in MATLAB and abstracted as a finite-state Markov chain or a Markov decision process. The abstraction procedure runs in MATLAB and employs parallel computations and fast manipulations based on vector calculus, which allows scaling beyond state-of-the-art alternatives. The abstract model is formally put in relationship with the concrete dtMP via a user-defined maximum threshold on the approximation error introduced by the abstraction procedure. FAUST² allows exporting the abstract model to well-known probabilistic model checkers, such as PRISM or MRMC. Alternatively, it can handle internally the computation of PCTL properties, e.g., safety or reach-avoid, over the abstract model. FAUST² allows refining the outcomes of the verification procedures over the concrete dtMP in view of the quantified and tunable error, which depends on the dtMP dynamics and on the given formula. The toolbox is available at http://sourceforge.net/projects/faust2/
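
The following Python sketch illustrates the kind of gridding such a tool performs, abstracting a one-dimensional linear stochastic system into a finite Markov chain; the dynamics and grid are invented, and the formal error quantification that FAUST² provides is omitted.

# Toy grid abstraction of x' = a*x + w, w ~ N(0, sigma^2), into a Markov chain.
import numpy as np
from scipy.stats import norm

a, sigma = 0.8, 0.1
edges = np.linspace(-1, 1, 21)                 # 20 grid cells on [-1, 1]
centers = 0.5 * (edges[:-1] + edges[1:])

# P[i, j]: probability of moving from cell i (represented by its center)
# into cell j under the Gaussian transition kernel.
P = np.array([[norm.cdf(edges[j + 1], loc=a * c, scale=sigma)
               - norm.cdf(edges[j], loc=a * c, scale=sigma)
               for j in range(len(centers))]
              for c in centers])
P = P / P.sum(axis=1, keepdims=True)           # renormalize mass leaving [-1, 1]

print(P.shape, P.sum(axis=1)[:3])              # a 20-state abstract Markov chain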

Journal ArticleDOI
TL;DR: An abstraction procedure is developed that maps a discrete-time stochastic system to an Interval-valued Markov Chain (IMC) and a switched discrete-time stochastic system to a Bounded-parameter Markov Decision Process (BMDP), and an efficient refinement algorithm is developed that reduces the uncertainty in the abstraction.
Abstract: Formal methods are increasingly being used for control and verification of dynamic systems against complex specifications. In general, these methods rely on a relatively simple system model, such as a transition graph, Markov chain, or Markov decision process, and require abstraction of the original continuous-state dynamics. It can be difficult or impossible, however, to find a perfectly equivalent abstraction, particularly when the original system is stochastic. Here we develop an abstraction procedure that maps a discrete-time stochastic system to an Interval-valued Markov Chain (IMC) and a switched discrete-time stochastic system to a Bounded-parameter Markov Decision Process (BMDP). We construct model checking algorithms for these models against Probabilistic Computation Tree Logic (PCTL) formulas and a synthesis procedure for BMDPs. Finally, we develop an efficient refinement algorithm that reduces the uncertainty in the abstraction. The technique is illustrated through simulation.

Proceedings Article
21 Feb 2015
TL;DR: It is shown that while the so-called regression estimator is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions.
Abstract: This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the single-state, or multi-armed bandit case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. For the so-called regression estimator, we show that while it is asymptotically optimal, for small sample sizes it may perform suboptimally compared to an ideal oracle up to a multiplicative factor that depends on the number of actions. We also show that the other two popular estimators can be arbitrarily worse than the optimal, even in the limit of infinitely many data points. The performance of the estimators is studied in synthetic and real problems, illustrating the methods' strengths and weaknesses. We also discuss the implications of these results for off-policy evaluation problems in contextual bandits and fixed-horizon Markov decision processes.
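
A toy Python sketch contrasting two of the estimators discussed above on a two-armed bandit, namely importance sampling and the regression (model-based) estimator; the data-generating distributions are invented for illustration.

# Toy off-policy evaluation on a 2-armed bandit (illustrative data only).
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0])          # true mean reward of each arm
behavior = np.array([0.8, 0.2])    # logging (behavior) policy
target = np.array([0.2, 0.8])      # policy we want to evaluate

n = 500
arms = rng.choice(2, size=n, p=behavior)
rewards = rng.normal(mu[arms], 1.0)

# Importance-sampling estimator: reweight the logged rewards.
is_est = np.mean(rewards * target[arms] / behavior[arms])

# Regression estimator: fit per-arm means, then average under the target policy.
arm_means = np.array([rewards[arms == a].mean() for a in (0, 1)])
reg_est = target @ arm_means

print(is_est, reg_est, target @ mu)   # both should be near the true value 1.8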

Journal ArticleDOI
TL;DR: Simulation results show that the subjective view of an SU tends to exaggerate the jamming probabilities and decrease its transmission probability, thus reducing the average SINR, while the subjectivity of a jammer tends to reduce its jamming probability and thus increase the SU throughput.
Abstract: Jamming games between a cognitive radio enabled secondary user (SU) and a cognitive radio enabled jammer are considered, in which end-user decision making is modeled using prospect theory (PT). More specifically, the interactions between a user and a smart jammer regarding their respective choices of transmit power are formulated as a game under the assumption that the end-user decision making under uncertainty does not follow the traditional objective assumptions stipulated by expected utility theory, but rather follows the subjective deviations specified by PT. Two PT-based static jamming games are formulated to describe how subjective SU and jammer choose their transmit power to maximize their individual signal-to-interference-plus-noise ratio (SINR)-based utilities under uncertainties regarding the opponent’s actions and channel states, respectively. The Nash equilibria of the games are presented under various channel models and transmission costs. Moreover, a PT-based dynamic jamming game is presented to investigate the long-term interactions between a subjective and a smart jammer according to a Markov decision process with uncertainty on the SU’s future actions and the channel variations. Simulation results show that the subjective view of an SU tends to exaggerate the jamming probabilities and decreases its transmission probability, thus reducing the average SINR. On the other hand, the subjectivity of a jammer tends to reduce its jamming probability, and thus increases the SU throughput.

Posted Content
TL;DR: This paper shows that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget, and presents an approximate value-iteration algorithm forCVaR MDPs and analyzes its convergence rate.
Abstract: In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. Our approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to such problem as CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present an approximate value-iteration algorithm for CVaR MDPs and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.

Proceedings ArticleDOI
04 May 2015
TL;DR: In this paper, the authors show that the planning horizon is a complexity control parameter for the class of policies to be learned: it has a monotonic relationship with a simple counting measure of complexity, and a similar relationship can be observed empirically with a more general, data-dependent Rademacher complexity measure.
Abstract: For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in most real-world problems, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies to be learned. In particular, it has an intuitive, monotonic relationship with a simple counting measure of complexity, and a similar relationship can be observed empirically with a more general and data-dependent Rademacher complexity measure. Each complexity measure gives rise to a bound on the planning loss predicting that a planning horizon shorter than the true horizon can reduce overfitting and improve test performance, and we confirm these predictions empirically.

Posted Content
TL;DR: This paper forms the service migration problem as a Markov decision process (MDP), which captures general cost models and provides a mathematical framework to design optimal service migration policies and approximate the underlying state space by the distance between the user and service locations.
Abstract: In mobile edge computing, local edge servers can host cloud-based services, which reduces network overhead and latency but requires service migrations as users move to new locations. It is challenging to make migration decisions optimally because of the uncertainty in such a dynamic cloud environment. In this paper, we formulate the service migration problem as a Markov Decision Process (MDP). Our formulation captures general cost models and provides a mathematical framework to design optimal service migration policies. In order to overcome the complexity associated with computing the optimal policy, we approximate the underlying state space by the distance between the user and service locations. We show that the resulting MDP is exact for uniform one-dimensional user mobility while it provides a close approximation for uniform two-dimensional mobility with a constant additive error. We also propose a new algorithm and a numerical technique for computing the optimal solution which is significantly faster than traditional methods based on standard value or policy iteration. We illustrate the application of our solution in practical scenarios where many theoretical assumptions are relaxed. Our evaluations based on real-world mobility traces of San Francisco taxis show superior performance of the proposed solution compared to baseline solutions.

Posted Content
TL;DR: This paper formulates the underlying sequential decision-making problem as a Markov decision process and applies an auto-encoder network to find a compact feature representation of the sensor measurements, which helps to mitigate the curse of dimensionality.
Abstract: Electric water heaters have the ability to store energy in their water buffer without impacting the comfort of the end user. This feature makes them a prime candidate for residential demand response. However, the stochastic and nonlinear dynamics of electric water heaters make it challenging to harness their flexibility. Driven by this challenge, this paper formulates the underlying sequential decision-making problem as a Markov decision process and uses techniques from reinforcement learning. Specifically, we apply an auto-encoder network to find a compact feature representation of the sensor measurements, which helps to mitigate the curse of dimensionality. A well-known batch reinforcement learning technique, fitted Q-iteration, is used to find a control policy, given this feature representation. In a simulation-based experiment using an electric water heater with 50 temperature sensors, the proposed method was able to achieve good policies much faster than when using the full state information. In a lab experiment, we apply fitted Q-iteration to an electric water heater with eight temperature sensors. Further reducing the state vector did not improve the results of fitted Q-iteration. The results of the lab experiment, spanning 40 days, indicate that compared to a thermostat controller, the presented approach was able to reduce the total cost of energy consumption of the electric water heater by 15%.
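
A minimal Python sketch of fitted Q-iteration, the batch technique named above, using a generic regressor on a synthetic batch of transitions; the one-dimensional state, cost signal, and dynamics are invented and do not model a water heater.

# Minimal fitted Q-iteration sketch on synthetic transitions (illustrative only).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n, gamma, actions = 2000, 0.95, np.array([0.0, 1.0])     # actions: off / heat

# Batch of transitions (s, a, cost, s') with a 1-D toy state.
S = rng.uniform(0, 1, size=(n, 1))
A = rng.choice(actions, size=(n, 1))
C = A[:, 0] * 0.3 + np.abs(S[:, 0] - 0.5)                # toy cost signal
S_next = np.clip(S + 0.1 * (A - 0.5) + 0.05 * rng.standard_normal((n, 1)), 0, 1)

Q = None
for _ in range(20):                                      # fitted Q-iteration sweeps
    if Q is None:
        target = C
    else:
        q_next = np.column_stack(
            [Q.predict(np.hstack([S_next, np.full((n, 1), a)])) for a in actions])
        target = C + gamma * q_next.min(axis=1)          # minimizing cost
    Q = ExtraTreesRegressor(n_estimators=30, random_state=0)
    Q.fit(np.hstack([S, A]), target)

# Greedy (cost-minimizing) action at an example state:
s = np.array([[0.7]])
print(actions[np.argmin([Q.predict(np.hstack([s, [[a]]]))[0] for a in actions])])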

Journal ArticleDOI
TL;DR: In this paper, a data-driven approach of finding optimal transmission policies for a solar-powered sensor node that attempts to maximize net bit rates by adapting its transmission parameters, power levels and modulation types, to the changes of channel fading and battery recharge is presented.
Abstract: Energy harvesting from the surroundings is a promising solution to perpetually power-up wireless sensor communications. This paper presents a data-driven approach of finding optimal transmission policies for a solar-powered sensor node that attempts to maximize net bit rates by adapting its transmission parameters, power levels and modulation types, to the changes of channel fading and battery recharge. We formulate this problem as a discounted Markov decision process (MDP) framework, whereby the energy harvesting process is stochastically quantized into several representative solar states with distinct energy arrivals and is totally driven by historical data records at a sensor node. With the observed solar irradiance at each time epoch, a mixed strategy is developed to compute the belief information of the underlying solar states for the choice of transmission parameters. In addition, a theoretical analysis is conducted for a simple on-off policy, in which a predetermined transmission parameter is utilized whenever a sensor node is active. We prove that such an optimal policy has a threshold structure with respect to battery states and evaluate the performance of an energy harvesting node by analyzing the expected net bit rate. The design framework is exemplified with real solar data records, and the results are useful in characterizing the interplay that occurs between energy harvesting and expenditure under various system configurations. Computer simulations show that the proposed policies significantly outperform other schemes with or without the knowledge of short-term energy harvesting and channel fading patterns.

Proceedings Article
06 Jul 2015
TL;DR: This paper provides a novel and unified error propagation analysis in Lp-norm of three well-known algorithms adapted to Stochastic Games and shows that a stationary policy can be achieved which is (2γe+e′)/(1-γ)²-optimal.
Abstract: This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in Lp-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy which is (2γe+e′)/(1-γ)²-optimal, where e is the value function approximation error and e′ is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite-horizon γ-discounted two-player zero-sum Stochastic Games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes from data) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.

Book ChapterDOI
01 Aug 2015
TL;DR: This paper combines game theory and formal models to tackle the new challenges posed by the validation of decentralised smart contracts.
Abstract: Decentralised smart contracts represent the next step in the development of protocols that support the interaction of independent players without the presence of a coercing authority. Based on protocols à la BitCoin for digital currencies, smart contracts are believed to be a potentially enabling technology for a wealth of future applications. The validation of such an early developing technology is as necessary as it is complex. In this paper we combine game theory and formal models to tackle the new challenges posed by the validation of such systems.

Proceedings ArticleDOI
26 May 2015
TL;DR: In this paper, a probabilistic framework for synthesizing control policies for general multi-robot systems that is based on decentralized partially observable Markov decision processes (Dec-POMDPs) is presented.
Abstract: This paper presents a probabilistic framework for synthesizing control policies for general multi-robot systems that is based on decentralized partially observable Markov decision processes (Dec-POMDPs). Dec-POMDPs are a general model of decision-making where a team of agents must cooperate to optimize a shared objective in the presence of uncertainty. Dec-POMDPs also consider communication limitations, so execution is decentralized. While Dec-POMDPs are typically intractable to solve for real-world problems, recent research on the use of macro-actions in Dec-POMDPs has significantly increased the size of problem that can be practically solved. We show that, in contrast to most existing methods that are specialized to a particular problem class, our approach can synthesize control policies that exploit any opportunities for coordination that are present in the problem, while balancing uncertainty, sensor information, and information about other agents. We use three variants of a warehouse task to show that a single planner of this type can generate cooperative behavior using task allocation, direct communication, and signaling, as appropriate. This demonstrates that our algorithmic framework can automatically optimize control and communication policies for complex multi-robot systems.

Journal ArticleDOI
TL;DR: In this paper, the use and value of time and temperature information to manage perishables in the context of a retailer that sells a random lifetime product subject to stochastic demand and lost sales is addressed.
Abstract: We address the use and value of time and temperature information to manage perishables in the context of a retailer that sells a random lifetime product subject to stochastic demand and lost sales. The product's lifetime is largely determined by the temperature history and the flow time through the supply chain. We compare the case in which information on flow time and temperature history is available and used for inventory management to a base case in which such information is not available. We formulate the two cases as Markov Decision Processes and evaluate the value of information through an extensive simulation using representative, real world supply chain parameters.

Journal ArticleDOI
TL;DR: This work proposes a simple inventory replenishment and allocation heuristic to minimize the expected total cost over an infinite time horizon and shows that this policy leads to superior performance compared to existing heuristics in the literature, particularly when supplies are limited.

Journal ArticleDOI
TL;DR: This work presents the POMDP with Information Rewards (POMDP-IR) modeling framework, which rewards an agent for reaching a certain level of belief regarding a state feature, and demonstrates their use for active cooperative perception scenarios.
Abstract: Partially observable Markov decision processes (POMDPs) provide a principled framework for modeling an agent's decision-making problem when the agent needs to consider noisy state estimates. POMDP policies take into account an action's influence on the environment as well as the potential information gain. This is a crucial feature for robotic agents which generally have to consider the effect of actions on sensing. However, building POMDP models which reward information gain directly is not straightforward, but is important in domains such as robot-assisted surveillance in which the value of information is hard to quantify. Common techniques for uncertainty reduction such as expected entropy minimization lead to non-standard POMDPs that are hard to solve. We present the POMDP with Information Rewards (POMDP-IR) modeling framework, which rewards an agent for reaching a certain level of belief regarding a state feature. By remaining in the standard POMDP setting we can exploit many known results as well as successful approximate algorithms. We demonstrate our ideas in a toy problem as well as in real robot-assisted surveillance, showcasing their use for active cooperative perception scenarios. Finally, our experiments show that the POMDP-IR framework compares favorably with a related approach on benchmark domains.
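
A toy Python sketch of the information-reward idea: a discrete Bayes belief update plus a reward that fires once the belief about a state feature passes a threshold. The numbers are illustrative and this is not the paper's POMDP-IR machinery.

# Toy belief update with an information reward (illustrative only).
import numpy as np

def belief_update(belief, obs_likelihood):
    """Bayes update of a discrete belief given P(observation | state)."""
    post = belief * obs_likelihood
    return post / post.sum()

def information_reward(belief, threshold=0.9, bonus=1.0):
    """Reward for reaching a confident belief about the state feature."""
    return bonus if belief.max() >= threshold else 0.0

b = np.array([0.5, 0.5])                         # uniform prior over a binary feature
for _ in range(5):
    b = belief_update(b, np.array([0.8, 0.3]))   # repeated noisy observations
    print(b, information_reward(b))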

Journal ArticleDOI
TL;DR: In classical Markov decision process (MDP) theory, one searches for a policy that minimizes the expected infinite-horizon discounted cost; here the expectation is replaced with a general risk functional, and minimization of such risk functionals is considered in two cases: the expected utility framework and conditional value-at-risk, a popular coherent risk measure.
Abstract: In classical Markov decision process (MDP) theory, we search for a policy that, say, minimizes the expected infinite horizon discounted cost. Expectation is, of course, a risk neutral measure, which does not suffice in many applications, particularly in finance. We replace the expectation with a general risk functional, and call such models risk-aware MDP models. We consider minimization of such risk functionals in two cases, the expected utility framework, and conditional value-at-risk, a popular coherent risk measure. Later, we consider risk-aware MDPs wherein the risk is expressed in the constraints. This includes stochastic dominance constraints, and the classical chance-constrained optimization problems. In each case, we develop a convex analytic approach to solve such risk-aware MDPs. In most cases, we show that the problem can be formulated as an infinite-dimensional linear program (LP) in occupation measures when we augment the state space. We provide a discretization method and finite approximati...